Tuesday, January 24, 2012

Starting Afresh

Having not had much time, or much luck, trying to get the sample work up and running, I have decided to start anew and see where the various tutorials out there take me, so here is what I am doing:

Sources:
http://bigdatablog.co.uk/install-hadoop
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://hadoop.apache.org/common/docs/current/single_node_setup.html

1. Install Ubuntu (11.10 at time of writing):
Firstly, Ubuntu needs to be installed. I am doing this using VirtualBox. I already have a few instances installed, but this one shall be specialised just for the project, to keep it focused. I have Ubuntu running on a second monitor, whilst the primary monitor is used for generic work such as reading tutorials and filling in this blog.


2. Install Sun Java 1.6 JDK
This seems easiest done from the command line.

Note: the method that was posted here no longer works; for a working tutorial, go to: http://hadoopproject.blogspot.com/2012/03/installing-sun-java6-jdk-on-ubuntu-1110.html

Note: When installing it, you must accept the license. It took me a while to figure out how to accept - you have to hit Tab to select the 'OK' in the middle of the screen, then hit Enter.
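Once the install has finished, a quick sanity check (assuming the package registered itself with update-alternatives) confirms that the Sun JDK is the Java actually in use:

:~$ java -version
# should report a 1.6.0_xx Sun/Oracle HotSpot VM
:~$ sudo update-alternatives --config java
# lets you switch to the Sun JDK if another Java (e.g. OpenJDK) is currently the default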

3. Install SSH
In order to access the clusters we need to have SSH installed.

:~$ sudo apt-get install ssh

4. Setup a hadoop user account
Create a group called hadoop, then create and add the hduser user account to it.


:~$ sudo addgroup hadoop
:~$ sudo adduser --ingroup hadoop hduser


Note: In order to be able to perform sudo commands, the account needs to be added as an admin from an account that already has admin privileges. To do this simply type:


:~$ sudo adduser hduser admin
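To double-check everything is set up as intended, the group memberships can be listed:

:~$ groups hduser
# should list both 'hadoop' and 'admin' for the hduser account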






Configure SSH
Switch to the hadoop user account, then create an ssh key.

:~$ su - hduser
:~$ ssh-keygen -t rsa -P ""


Enable SSH access to the local machine:


:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


Finally - test the connection, which will add localhost to the list of known hosts:

:~$ ssh localhost
confirm 'yes'
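Note: If ssh still prompts for a password at this point, it is usually down to permissions on the key files - ssh ignores authorized_keys if it (or the .ssh directory) is too open. Tightening them should fix it:

:~$ chmod 700 $HOME/.ssh
:~$ chmod 600 $HOME/.ssh/authorized_keys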

Note: Some tutorials claim you should disable IPv6 at this point... I shall not, at least for now.
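For reference, the approach those tutorials (e.g. Michael Noll's) describe is to append the following to /etc/sysctl.conf and reboot - noting it here in case I need it later:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1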

5. Install Hadoop.
Note: Hadoop v1.0.0 is now released, but as the tutorials were all written using 0.20.2 or 0.20.203, I will actually use 0.20.2. Using v1.0.0 may have contributed to some of my problems, and I am not familiar enough with the system to be adapting the code to work with 1.0.0.

On the Apache site, the Hadoop 0.20.2 tar.gz was downloaded, unpacked, renamed for easier access and had its ownership changed to the hduser account:

:~$ cd /home/stu/Downloads
:~$ sudo tar xzf hadoop-0.20.2.tar.gz
:~$ sudo mv hadoop-0.20.2 hadoop
:~$ sudo chown -R hduser:hadoop hadoop
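For reference, the 0.20.2 tarball can be fetched straight from the Apache archive (the exact mirror/path may vary):

:~$ wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz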

Note: For some reason... gedit won't work from the hduser account... I'm sure I had this problem last time around but have not been able to fix it. Any advice welcome!
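One possible workaround (untested by me so far): use a terminal editor such as nano from the hduser account, which avoids the X display issues that can stop GUI apps launching under su - for example, for the .bashrc edit in the next step:

:~$ nano $HOME/.bashrc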

6. Update .bashrc
This one is slightly tricky. Files starting with a '.' are hidden, but by passing the filename to the gedit command you can edit the file - even though it is not visible in the file browser.

:~$ gedit $HOME/.bashrc

Alternatively you can open gedit, use its Open dialog and, when in the home folder, right click on the page and select 'Show Hidden Files' - although using this method will open a read-only copy... so in essence it is relatively pointless.

Add the following to the bottom of the .bashrc file:


# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
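After saving, the new variables only apply to freshly opened shells; to pick them up in the current session and check they took, something like this should do:

:~$ source $HOME/.bashrc
:~$ echo $HADOOP_HOME
# should print /usr/local/hadoop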



7. CONFIGURING:
7.1 hadoop-env.sh
The JAVA_HOME variable needs to be configured. In my configuration this is:

:~$ sudo gedit /home/stu/Downloads/hadoop/conf/hadoop-env.sh



# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun



Note: Sudo is being used as gedit is not working from hduser and these modifications are being made via another account. Unless sudo is used, the file is opened as read-only.


Note - at this point I copied the hadoop folder to /usr/local.


:~$ cd /home/stu/Downloads
:~$ sudo cp -r hadoop /usr/local

and repeated the export commands from the command line to set up the environment variables (as per BigDataBlog):

:~$ export HADOOP_HOME=/usr/local/hadoop
:~$ export PATH=$PATH:$HADOOP_HOME/bin
:~$ export JAVA_HOME=/usr/lib/jvm/java-6-sun

This may have already been done above, but I am just covering the bases to align my project with the tutorials for ease of use/referencing...
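One thing to watch (my own assumption, not something the tutorials call out): because the copy was done with sudo, the files under /usr/local/hadoop end up owned by root again, so it may be worth repeating the ownership change on the new location:

:~$ sudo chown -R hduser:hadoop /usr/local/hadoop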

7.2 Standalone mode test:
From BIGDATABLOG

:~$ hadoop

Typing this should display a help message - if it does, it is correctly configured. I did it, and it does. #winning.
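Another quick check is asking Hadoop for its version, which should confirm that 0.20.2 is the build sitting on the PATH:

:~$ hadoop version
# should report Hadoop 0.20.2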

8. First MapReduce job:
This is where my old version stopped working... so let's set it up right.

:~$ cd /usr/local/hadoop
:~$ sudo chmod 777 hadoop-0.20.2-examples.jar

This should remove any permission issues. Now we enter the following to perform a search for 'the' in the LICENSE.txt file (which has also been universified - chmod 777 - universified sounds quicker...):

:~$ hadoop jar /usr/local/hadoop/hadoop-0.20.2-examples.jar grep /usr/local/hadoop/LICENSE.txt outdir the
# searches LICENSE.txt for 'the' and copies the output to the outdir folder
:~$ cat outdir/part-00000

Output: 144 the
In other words, 144 'the's' were counted in the LICENSE.txt file. It works! Huzzah!

Now - for some notes after some playing:

It only seems to make a single file, which it puts into the outdir folder. If I try to run another check against a different word, say 'and', then it does nothing, as the output directory already exists. If I change from outdir to outdir2, it creates a new folder called 'outdir2' with the same part-00000 file - except when you 'cat part-00000' on this one, it says '52 and', indicating 52 'and's were found.
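From what I can gather, Hadoop deliberately refuses to write into an output directory that already exists (to avoid clobbering results), so rather than renaming to outdir2 each time, deleting the old directory first should let the same name be reused:

:~$ rm -r outdir
:~$ hadoop jar /usr/local/hadoop/hadoop-0.20.2-examples.jar grep /usr/local/hadoop/LICENSE.txt outdir and
:~$ cat outdir/part-00000
# should show the count for 'and' (52 in my run above)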
