Monday, October 31, 2011

Update

Slowly working my way through the Hadoop book in an attempt to take in as much information as possible and get a sample server set up before the Christmas holidays. I could probably jump the gun, google what I want, and get something running within a day, but that would bypass actually learning the background, so let's see how this goes!

Friday, October 21, 2011

"Research" update.


Finally managed to knuckle down and start reading some of the materials I have on Hadoop, MapReduce and dimensional databases.

Firstly, a little on Hadoop:

The Hadoop Distributed File System (HDFS) is a filesystem that spreads vast quantities of data across a cluster of servers. The data it holds is typically unstructured or semi-structured, and un-normalised - text files or images are the kinds of data usually found there. Work is batch processed, with every machine in the cluster reading its portion of the data at the same time. This is similar to how hard drives work when set up in a RAID array - but over a far larger number of drives, multiplying the throughput. It is also more reliable, as data redundancy is automatically incorporated by replicating data across multiple machines. The data read from the filesystem is fed into a MapReduce job, which processes it as key-value pairs and merges the results into a usable form.
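To make the key-value idea concrete, here is a minimal sketch of the two functions a MapReduce job supplies, based on the classic word-count example from the book (the class names are my own, so treat it as an illustration rather than project code):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each machine turns its share of the input into (key, value) pairs.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word on this line of input.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: all values sharing the same key are merged into one result.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}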

After Thursday's meeting with Andy (@AndyCobley), we went over my reading on dimensional databases and the physical side of the project. I am attempting to set up a few PCs in pseudo-distributed mode to create a cluster that will hold a large number of web server log files, on which we will attempt to run some MapReduce jobs.

This is a suitable dataset since, according to the book I have been reading:

"A web server log is a good example of a set of records that is not normalized (for example,
the client hostnames are specified in full each time, even though the same client
may appear many times), and this is one reason that logfiles of all kinds are particularly
well-suited to analysis with MapReduce."

Hadoop: The Definitive Guide, Tom White.
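As a first experiment in that direction, a mapper could pull the client hostname out of each log line and emit it with a count of one; a summing reducer like the word-count one above would then give the number of requests per client. A rough, untested sketch, assuming Apache-style access logs where the hostname is the first whitespace-delimited field:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (client hostname, 1) for each request in an access log. With the
// default text input format the key is the byte offset of the line.
public class HostnameMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text hostname = new Text();

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            hostname.set(fields[0]);   // un-normalised hostname, repeated per request
            context.write(hostname, ONE);
        }
    }
}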


Wednesday, October 5, 2011

Research research research.

I have been looking up the information available on Apache's site. I believe some of it will prove very useful when setting up the project, and it also helps me with the required tech document.

Firstly, a server needs to be set up.
Since Hadoop has not been properly tested on Windows, Windows is not supported as a production platform - so the server shall run Linux, as this doubles as a development and production platform.

Secondly, the server needs to have the appropriate software installed:
Java 1.6+
ssh
the latest stable Hadoop release.

I need to find out, or decide, whether I am going to set up a single server running in 'pseudo-distributed' mode (one machine configured so that its Hadoop daemons behave as if they were a number of separate servers in a cluster), or whether I want to assemble, or find a way to operate, a network cluster.

There are installation instructions on the Apache website. Once the server is set up, MapReduce needs to be installed; the jobs it runs are written in Java.
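For reference, the book's examples show that a job is wired together and submitted from a small Java driver program along these lines (HostnameMapper and HostnameCountReducer are placeholder names for mapper and reducer classes that would have to be written separately, and the input/output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver: configures a job, points it at input and output paths
// in HDFS, and submits it to the cluster.
public class LogAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "log analysis");
        job.setJarByClass(LogAnalysisDriver.class);

        // Placeholder mapper/reducer classes for the log analysis.
        job.setMapperClass(HostnameMapper.class);
        job.setReducerClass(HostnameCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}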

Introductory Post.

Welcome to my honours project blog!

I am in 4th year at Dundee University studying Applied Computing. As part of the honours year I am required to undertake an honours project. This blog shall be the documentation/journal of the development of said project.

So, what is the project?

To summarise my understanding of it at this point in time: it is to use Hadoop to perform MapReduce functions on a dimensional database. We do not know how well this is going to work, but my task is to set up the database and use Hadoop's functionality to try to analyse and extract data from it.

My first tasks are:


  1. Start this blog.
  2. Write an overview of the project (Presumably this will gauge my understanding of the project so I can be steered on course.)
  3. Perform a risk analysis of the project.
  4. Slap together a Gantt chart. (This will involve a proper breakdown and understanding of what's involved, but initially it will be a rough overview of the primary tasks and estimated timescale.)
More later.