Friday, July 17, 2015

Taking a break and playing with Hadoop

Background - Why I'm taking a break
I've been working on SAP projects for over twelve years. This year, I decided to take a slight break to catch up on some of my skills that may be lacking.

At heart I'm a Unix/Linux fan. The cheap price point and the ability to install on nearly any platform mean I can work and learn at home at my leisure. My latest project is to get Hadoop running locally on a Debian system. The steps below are those I recorded as I installed a simple system (these are really just my crib sheets for the next time I want to install Hadoop).

As a side note, this is how I create most of my documentation. Whenever I install, upgrade or migrate a SAP system, I capture everything! I use Snag-it to grab every screen shot that appears as I go from step to step. Even if I don't capture all of the screenshots in a document, I still have a running record of what was completed which Snag-it happily sorts into daily folders. There have been a few times when I could not recall what was completed (I keep paper notes as well) but I was able to pull up a screenshot that showed the exact steps.

I. System Preparation
First I installed a new copy of Debian (barebones style with no KDE/Gnome). Then I started prepping and installing the system by following the steps on Wei Wang's website (a big 'thank you' to Dr. Wei).

For Hadoop, we need to prepare the system with the following tasks (the Hadoop site has a great list in its Single Cluster Install guide):
  1. Install Java (Hadoop requires at least Java 1.6)
  2. Create users and groups
  3. Set up SSH connectivity

The first thing I did was to install Java as shown here:
sudo apt-get install openjdk-7-jre

Then I ran a quick check to make sure the right version of Java is the default with this command:
java -version

Still following the steps on Dr. Wei's site, create users and groups as shown below:
addgroup hadoop
adduser --ingroup hadoop hadoopuser
adduser hadoopuser sudo

The system needs to be able to connect to itself over SSH. Generate the RSA keys as shown below so local SSH connections complete without prompting for a username and password.
ssh-keygen -t rsa -P ''

Update the authorized_keys file for user hadoopuser as shown here:
cd ~
cd .ssh
cat id_rsa.pub >> authorized_keys
To test --> ssh localhost (you should not be asked for a password)

II. Download, install, and configure Hadoop
In this phase, we just complete the following:
  1. Download and extract the Hadoop binaries
  2. Rename Hadoop directory and set the ownership of the extracted files
  3. Update the environment variables for the hadoopuser account
For the download (I'm still following the steps on Dr. Wei's site), I changed to the /usr/local directory and then completed the download with wget (the 2.7.1 release can still be pulled from the Apache archive):
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

Then I extracted the archive with the command:
tar -xvf hadoop-2.7.1.tar.gz

Next, rename the directory to just 'hadoop' with the command:
mv hadoop-2.7.1 hadoop

As detailed in Dr. Wei's steps, change the ownership of the newly renamed directory with this command:
chown -R hadoopuser:hadoop hadoop
This sets the ownership to the hadoopuser user account and the hadoop group for all files under the hadoop subdirectory.

Next, we need to set up the environment variables for the hadoopuser account so the system knows where the Hadoop resources are located (such as the binary and configuration file locations). These variables are applied when the hadoopuser account logs into the system and executes the .bashrc file (as part of the login routine).

Update the environment variables with these steps:
cd ~
vi .bashrc

Add these lines to the bottom of the .bashrc file:
# For Hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre/
export HADOOP_INSTALL=/usr/local/hadoop
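Many walkthroughs (including the Hadoop docs) also extend PATH so the hadoop commands resolve without typing full paths. This is an optional convenience, not part of the steps above — the commands later in this post all use full paths, so skip it if you want to match them exactly:

```shell
# Optional .bashrc additions: put the Hadoop binaries and admin scripts on the PATH.
# HADOOP_INSTALL matches the value exported above.
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
```

With this in place, hadoop version works from any directory.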

Next come some basic updates to the Hadoop configuration files. Here, the JAVA_HOME variable is set in the hadoop-env.sh script (this is where Hadoop reads JAVA_HOME from at startup).
cd /usr/local/hadoop/etc/hadoop
vi hadoop-env.sh

Update JAVA_HOME to the same path exported in .bashrc:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre/
To confirm that everything is working okay, a quick test is run to see if Hadoop starts up.
cd /usr/local/hadoop/bin
./hadoop version

Following Dr. Wei's site, here are more configuration files that need to be updated:
cd /usr/local/hadoop/etc/hadoop
vi core-site.xml

Add these lines between the <configuration> and </configuration> entries:
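The snippet itself didn't survive in this post; for a single-node 2.7.x install, the standard core-site.xml entry (per the Hadoop Single Cluster guide) sets the default filesystem URI. Treat the port as an assumption if Dr. Wei's post used a different one:

```xml
<!-- Default filesystem: the local HDFS namenode on port 9000 (standard single-node value) -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```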

Update yarn-site.xml with these lines:
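The yarn-site.xml lines were also lost; the standard entry for a single-node setup (again between the configuration tags, per the Hadoop docs) enables the MapReduce shuffle service on the node manager:

```xml
<!-- Let the node manager run the MapReduce shuffle handler -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```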



Update the mapred-site.xml file with these lines:
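Note that Hadoop 2.7.1 ships only a template for this file, so copy it first (cp mapred-site.xml.template mapred-site.xml). The original lines didn't survive here; the standard entry from the Hadoop Single Cluster guide tells MapReduce to run on YARN:

```xml
<!-- Run MapReduce jobs on the YARN framework -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```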

The Hadoop file system itself needs to be set up and initialized. First (following Dr. Wei's steps) create directories to hold the HDFS data using these steps:
(as user hadoopuser)
cd ~
mkdir -p mydata/hdfs/namenode
mkdir -p mydata/hdfs/datanode

Then update the hdfs configuration file:
cd /usr/local/hadoop/etc/hadoop
vi hdfs-site.xml

Add these lines to the file:
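The lines were lost from this post, but based on the directories created above, the usual hdfs-site.xml entries set the replication factor to 1 (single node) and point the namenode and datanode at those directories. The /home/hadoopuser prefix is an assumption based on the account created earlier:

```xml
<!-- Single node: keep only one copy of each block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<!-- Point HDFS at the mydata directories created above -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hadoopuser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hadoopuser/mydata/hdfs/datanode</value>
</property>
```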


Format the namenode using the hdfs command:
cd /usr/local/hadoop/bin
./hdfs namenode -format

III. Start Hadoop and run some tests!

At this point, everything has been configured so we just need to start Hadoop and run some tests to validate the system.

Start Hadoop with these steps (start-dfs.sh brings up HDFS and start-yarn.sh brings up YARN):
cd /usr/local/hadoop/sbin
./start-dfs.sh
./start-yarn.sh

Check the processes with the command: ps -ax
The namenode, the datanode, and the secondarynamenode should all be running as needed.

A quick visit to port 50070 (the namenode's web UI) shows the datanode is running and HDFS is available as needed.

Finally, run a test using the mapreduce example:

cd /usr/local/hadoop/bin
./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 5

Check out the progress on port 8088 (the YARN ResourceManager UI).

Confirm that everything completed without any issues.

So there it is. A full running Hadoop system on an old, half-abandoned box with half a gig of memory. Not bad!

Next I just need to setup a distributed environment, use a faster box, and finally learn the power of this sophisticated software.
