Friday, July 17, 2015

Taking a break and playing with Hadoop

Background - Why I'm taking a break
I've been working on SAP projects for over twelve years. This year, I decided to take a slight break to catch up on some skills that have been lacking.

At heart I'm a Unix/Linux fan. The low cost and the ability to install on nearly any hardware mean I can work and learn at home at my leisure. My latest project is to get Hadoop running locally on a Debian system. The steps below are those I recorded as I installed a simple system (these are really just my crib sheets for the next time I want to install Hadoop).

As a side note, this is how I create most of my documentation. Whenever I install, upgrade or migrate a SAP system, I capture everything! I use Snag-it to grab every screen shot that appears as I go from step to step. Even if I don't capture all of the screenshots in a document, I still have a running record of what was completed which Snag-it happily sorts into daily folders. There have been a few times when I could not recall what was completed (I keep paper notes as well) but I was able to pull up a screenshot that showed the exact steps.

I. System Preparation
First I installed a new copy of Debian (barebones style with no KDE/Gnome). Then I started prepping and installing the system by following the steps on this website (a big 'thank you' to Wei Wang):
http://www.drweiwang.com/install-hadoop-2-2-0-debian/

For Hadoop, we need to prepare the system with the following tasks (the Hadoop site has a great list on its Single Cluster Install page):
  1. Install Java (Hadoop requires at least Java 1.6)
  2. Create users and groups
  3. Setup SSH connectivity 

The first thing I did was to install Java as shown here:
sudo apt-get install openjdk-7-jre


Then I ran a quick check to make sure the right version of Java is the default with this command:
java -version
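(A small check of my own, not from Dr. Wei's guide.) If more than one Java runtime is installed, these two Debian commands show which binary the java command actually resolves to:
readlink -f $(which java)
update-alternatives --list java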


Still following the steps on Dr. Wei's site, create users and groups as shown below:
addgroup hadoop
adduser --ingroup hadoop hadoopuser
adduser hadoopuser sudo
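As a quick sanity check (my addition, not part of Dr. Wei's steps), confirm the new account and group exist and then switch to the hadoopuser account, since the remaining steps are run as that user:
id hadoopuser
su - hadoopuser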





The system needs to be able to connect to itself over SSH. Generate an RSA key pair as shown below so that local SSH connections can be made without being prompted for a username and password.
ssh-keygen -t rsa -P ''


Update the authorized_keys file for user hadoopuser as shown here:
cd ~
cd .ssh
cat id_rsa.pub > authorized_keys
To test --> ssh localhost (you should not be asked for a password)
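If ssh localhost still prompts for a password, the usual culprit (in my experience, not something specific to this guide) is the permissions on the .ssh directory and the authorized_keys file. Tighten them to what sshd expects with:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys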



II. Download, install, and configure Hadoop
In this phase, we just complete the following:
  1. Download and extract the Hadoop binaries
  2. Rename Hadoop directory and set the ownership of the extracted files
  3. Update the environment variables for the hadoopuser account
For the download (I'm still following the steps on Dr. Wei's site), I changed to the /usr/local directory and then completed the download with the command:
wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
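One note from my side: /usr/local is normally only writable by root, so run the download and the extract/rename steps below either as root or with sudo, for example:
sudo wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz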



Then I extracted the source files with the command:
tar -xvf hadoop-2.7.1.tar.gz

Next, rename the directory to just 'hadoop' with the command:
mv hadoop-2.7.1 hadoop


As detailed in Dr. Wei's steps, change the ownership of the newly renamed directory with this command:
chown -R hadoopuser:hadoop hadoop
This sets the ownership to the hadoopuser user account and the hadoop group for all files under the hadoop subdirectory.
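To double-check the ownership change (my own habit), list the directory and confirm the owner and group columns read hadoopuser and hadoop:
ls -ld /usr/local/hadoop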


Next, we need to set up the environment variables for the hadoopuser account so the system knows where the Hadoop resources are located (such as the binary and configuration file locations). These variables are applied when the hadoopuser account logs into the system and executes the .bashrc file (as part of the login routine).

Update the environment variables with these steps:
cd ~
vi .bashrc



Add these lines to the bottom of the .bashrc file:
# For Hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
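These settings only take effect on the next login, so either log out and back in as hadoopuser or reload .bashrc in the current shell, then confirm the variables are in place:
source ~/.bashrc
echo $HADOOP_HOME
which hadoop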





Next come some basic updates to the Hadoop configuration files. Here, the JAVA_HOME variable is updated in the hadoop-env.sh script.
cd /usr/local/hadoop/etc/hadoop
vi hadoop-env.sh

"

Update JAVA_HOME to:
JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre/
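That JAVA_HOME path matches my 32-bit Debian install; on a 64-bit system the directory is typically java-7-openjdk-amd64 instead. A quick way to confirm the path is valid before moving on:
ls /usr/lib/jvm/
ls $JAVA_HOME/bin/java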


To confirm that everything is working okay, a quick test is run to see if Hadoop starts up.
cd /usr/local/hadoop/bin
./hadoop version



Following Dr. Wei's site, here are more configuration files that need to be updated:
cd /usr/local/hadoop/etc/hadoop
vi core-site.xml


Add these lines in between the <configuration> and </configuration> entries:

<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>



Update yarn-site.xml with these lines:

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>

<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>





Update the mapred-site.xml file with these lines:

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
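If mapred-site.xml does not exist yet (in the 2.7.1 tarball it ships as mapred-site.xml.template), copy the template first and then make the edit above:
cd /usr/local/hadoop/etc/hadoop
cp mapred-site.xml.template mapred-site.xml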






The Hadoop file system itself needs to be set up and initialized. First (following Dr. Wei's steps) create the directories that will hold the HDFS data using these steps:
(as user hadoopuser)
cd ~
mkdir -p mydata/hdfs/namenode
mkdir -p mydata/hdfs/datanode


Then update the hdfs configuration file:
cd /usr/local/hadoop/etc/hadoop
vi hdfs-site.xml



Add these lines to the file:

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>

<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/home/hadoopuser/mydata/hdfs/namenode</value>
</property>

<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/home/hadoopuser/mydata/hdfs/datanode</value>
</property>



Format the namenode using the hdfs command:
cd /usr/local/hadoop/bin
./hdfs namenode -format
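If the format succeeds, the namenode directory created earlier is populated with a 'current' subdirectory holding the new filesystem metadata. A quick check (my own, not from the guide):
ls ~/mydata/hdfs/namenode/current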




III. Start Hadoop and run some tests!

At this point, everything has been configured so we just need to start Hadoop and run some tests to validate the system.

Start Hadoop with these steps:
cd /usr/local/hadoop/sbin
./start-all.sh



Check the processes with command: ps -ax
The process list should show that the namenode, the datanode, and the secondarynamenode are running as needed.
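Another option is the jps command, which lists the running Java processes by name (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager). Keep in mind that jps ships with the JDK rather than the JRE installed earlier, so it may require installing the openjdk-7-jdk package first:
jps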


A quick visit to port 50070 on the system shows the datanode is running and HDFS is available as needed.
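For reference, the two web interfaces I used (assuming the default ports have not been changed) are:
http://localhost:50070   (NameNode / HDFS status)
http://localhost:8088    (YARN ResourceManager)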


Finally, run a test using the mapreduce example:

cd /usr/local/hadoop/bin
./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 5
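For the pi example, the two numbers are the number of map tasks and the number of samples per map, so larger values give a longer-running (and more accurate) job if you want to watch it work through the ResourceManager interface, for example:
./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 10 100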



Check out the progress on port 8088.


Confirm that everything completed without any issues.


So there it is. A fully running Hadoop system on an old, half-abandoned box with half a gig of memory. Not bad!

Next I just need to set up a distributed environment, use a faster box, and finally learn the power of this sophisticated software.



