Finally managed to knuckle down and start reading some of the materials I have on Hadoop, MapReduce and Dimensional Databases.
Firstly, a little on Hadoop:
The Hadoop Distributed File System (HDFS) stores vast quantities of data across a cluster of servers. The data is typically unstructured or semi-structured and un-normalised: things like text files or images. Processing happens in batches, with all machines reading their share of the data at the same time. This is similar to how hard drives work when set up in a RAID array, but spread over a much larger number of machines, multiplying the throughput. It is also more reliable, as redundancy is built in automatically by replicating each block of data across multiple machines. The data read from the file system is fed into MapReduce, which uses map and reduce functions operating on key/value pairs to process the data and merge it into a usable form.
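To make the key/value idea concrete, here is a toy single-machine sketch of the three phases (my own illustration in plain Java, not the Hadoop API): map every input record to a (key, value) pair, shuffle the pairs into groups by key, then reduce each group to a single merged result.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> records = List.of("apple", "banana", "apple", "cherry", "banana", "apple");

        // Map phase: emit a (key, value) pair for every input record.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(Map.entry(record, 1));
        }

        // Shuffle phase: group together all values that share a key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: merge each group into a single (key, result) pair.
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) sum += v;
            System.out.println(group.getKey() + "\t" + sum); // e.g. "apple   3"
        }
    }
}

In the real Hadoop framework the shuffle happens across the network between the machines that ran the map tasks and the machines that will run the reduce tasks, which is what lets the whole cluster work on one dataset.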
After Thursday's meeting with Andy (@AndyCobley), we went over my readings on dimensional databases and the physical side of the project. I am attempting to set up a few PCs in pseudo-distributed mode to create a cluster to hold a large number of web server log files, on which we will attempt to run some MapReduce jobs.
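For reference, pseudo-distributed mode runs all of the Hadoop daemons on a single machine against a local HDFS instance. A minimal sketch, assuming a NameNode on localhost:9000 and an uploaded /logs directory (both values are my own placeholders), of connecting to such a node from Java and listing the log files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Connects to a pseudo-distributed HDFS on this machine and lists a
// directory of uploaded log files. The NameNode address and the /logs
// path are assumptions for illustration.
public class ListLogs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // In pseudo-distributed mode every daemon runs on localhost; this
        // property would normally be set in conf/core-site.xml rather than
        // in code, but setting it here keeps the example self-contained.
        conf.set("fs.default.name", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/logs"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}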
A suitable choice of data, since according to the book I have been reading:
"A web server log is a good example of a set of records that is not normalized (for example,
the client hostnames are specified in full each time, even though the same client
may appear many times), and this is one reason that logfiles of all kinds are particularly
well-suited to analysis with MapReduce."
Hadoop: The Definitive Guide, Tom White.
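Out of curiosity, here is a minimal sketch of what such a log-analysis job might look like using the Hadoop Java MapReduce API: counting how many requests each client made. The class names are my own, and I am assuming the client hostname is the first whitespace-delimited field of each log line, as in the common log format.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts requests per client hostname in web server logs.
public class HostCount {

    // Mapper: for each log line, emit (hostname, 1).
    // Assumes the client host is the first whitespace-delimited field.
    public static class HostMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text host = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) return;
            host.set(line.split("\\s+")[0]);
            context.write(host, ONE);
        }
    }

    // Reducer: sum the 1s emitted for each hostname.
    public static class HostReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "host count");
        job.setJarByClass(HostCount.class);
        job.setMapperClass(HostMapper.class);
        job.setReducerClass(HostReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it could then be run over the cluster with something like: hadoop jar hostcount.jar HostCount /logs /output (the jar name and paths here are placeholders).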