I've been working on SAP projects for over twelve years. This year, I decided to take a slight break to catch up on some of my skills that may be lacking.
At heart I'm a Unix/Linux fan. The low cost and the ability to install on nearly any platform mean I can work and learn at home at my leisure. My latest project is to get Hadoop running locally on a Debian system. The steps below are those I recorded as I installed a simple system (these are really just my crib sheets for the next time I want to install Hadoop).
As a side note, this is how I create most of my documentation. Whenever I install, upgrade or migrate a SAP system, I capture everything! I use Snag-it to grab every screen shot that appears as I go from step to step. Even if I don't capture all of the screenshots in a document, I still have a running record of what was completed which Snag-it happily sorts into daily folders. There have been a few times when I could not recall what was completed (I keep paper notes as well) but I was able to pull up a screenshot that showed the exact steps.
I. System Preparation
First I installed a new copy of Debian (barebones style with no KDE/Gnome). Then I started prepping and installing the system by following the steps on this website (a big 'thank you' to Wei Wang):
For Hadoop, we need to prepare the system with the following tasks (the Hadoop site has a great list on its Single Cluster Install page):
- Install Java (Hadoop 2.7 requires Java 7 or later)
- Create users and groups
- Setup SSH connectivity
The first thing I did was to install Java as shown here:
sudo apt-get install openjdk-7-jre
Then I ran a quick check to make sure the right version of Java is set as the default.
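The check itself didn't survive into the text; the standard command is:

```shell
# Show which Java the system resolves by default
# (should report a 1.7.x version after the install above)
java -version
```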
Still following the steps on Dr. Wei's site, create users and groups as shown below:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoopuser
sudo adduser hadoopuser sudo
The system needs to be able to connect to itself over SSH. Generate an RSA key pair as shown below so local SSH connections complete without prompting for a password.
ssh-keygen -t rsa -P ''
Update the authorized_keys file for user hadoopuser as shown here:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
To test, run: ssh localhost (you should not be asked for a password)
II. Download, install, and configure Hadoop
In this phase, we just complete the following:
- Download and extract the Hadoop binaries
- Rename Hadoop directory and set the ownership of the extracted files
- Update the environment variables for the hadoopuser account
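The download command itself isn't shown above; assuming the tarball came from the Apache archive (the URL layout below is the archive's standard one for old releases), it would look like this:

```shell
# Fetch the Hadoop 2.7.1 release tarball (URL is an assumption;
# past releases live under dist/hadoop/common on the Apache archive)
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
```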
Then I extracted the source files with the command:
tar -xvf hadoop-2.7.1.tar.gz
Next, rename the directory to just 'hadoop' with the command:
mv hadoop-2.7.1 hadoop
As detailed in Dr. Wei's steps, change the ownership of the newly renamed directory with this command:
chown -R hadoopuser:hadoop hadoop
This sets the ownership to the hadoopuser user account and the hadoop group for all files under the hadoop subdirectory.
Next, we need to set up the environment variables for the hadoopuser account so the system knows where the hadoop resources are located (such as the binary and configuration file locations). These variables are applied when the hadoopuser account logs into the system and sources the .bashrc file (as part of the login routine).
Update the environment variables with these steps:
Add these lines to the bottom of the .bashrc file:
# For Hadoop
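The exact export lines depend on where Java and the hadoop directory ended up; assuming the locations used in the steps above (treat the paths as assumptions to adjust for your system), they would look something like this:

```shell
# For Hadoop (paths assume the JDK and hadoop locations from this install)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/home/hadoopuser/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```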
Next comes some basic updates to the Hadoop configuration files. Here, the JAVA_HOME variable is updated in the hadoop-env.sh script.
Update JAVA_HOME to point at the installed JDK.
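On Debian the OpenJDK 7 packages land under /usr/lib/jvm; the exact directory name depends on the architecture (amd64 shown here as an assumption), so the line in hadoop-env.sh would be along these lines:

```shell
# In hadoop-env.sh -- point JAVA_HOME at the installed Java
# (directory name is an assumption; check /usr/lib/jvm on your system;
# with only the JRE package installed, the .../jre subdirectory may be needed)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
```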
To confirm that everything is working okay, a quick test is run to see if Hadoop starts up.
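The test command isn't recorded here; a simple smoke test (my assumption for what a quick check would be) is to ask the hadoop binary for its version from the bin directory:

```shell
# From the hadoop/bin directory -- prints the Hadoop version and build info
./hadoop version
```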
Following Dr. Wei's site, here are more configuration files that need to be updated:
Add these lines between the <configuration> tags in core-site.xml:
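For a single-node setup, the usual core-site.xml entry between the <configuration> tags (the value below is the stock single-cluster default, so treat it as an assumption) is:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```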
Update yarn-site.xml with these lines:
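The lines themselves aren't reproduced above; the standard single-node entry from the Hadoop single-cluster guide (an assumption on my part) goes between the <configuration> tags:

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```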
Update the mapred-site.xml file with these lines:
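In Hadoop 2.7.1 this file usually has to be created first by copying mapred-site.xml.template; the standard single-node entry (again an assumption, matching the single-cluster guide) goes between the <configuration> tags:

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```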
The Hadoop file system itself needs to be set up and initialized. First (following Wei's steps) create directories to hold HDFS data using these steps:
(as user hadoopuser)
mkdir -p mydata/hdfs/namenode
mkdir -p mydata/hdfs/datanode
Then update the hdfs-site.xml configuration file:
Add these lines to the file:
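Matching the namenode and datanode directories created above, the usual entries between the <configuration> tags (replication of 1 for a single node; the full paths assume hadoopuser's home directory) would be:

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hadoopuser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hadoopuser/mydata/hdfs/datanode</value>
</property>
```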
Format the namenode using the hdfs command:
./hdfs namenode -format
III. Start Hadoop and run some tests!
At this point, everything has been configured so we just need to start Hadoop and run some tests to validate the system.
Start hadoop with these steps:
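The start commands themselves aren't listed here; in Hadoop 2.7.1 the scripts live in hadoop/sbin, so it would be along these lines:

```shell
# From the hadoop/sbin directory -- start HDFS first, then YARN
./start-dfs.sh
./start-yarn.sh
```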
Check the processes with the command: ps -ax
In the ps output, the namenode, the datanode, and the secondarynamenode are all running as needed.
A quick visit to port 50070 (the NameNode's web UI) shows the datanode is running and HDFS is available as needed.
Finally, run a test using the mapreduce example:
./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 5
Check out the job's progress on port 8088 (the YARN ResourceManager's web UI).
Confirm that everything completed without any issues.
So there it is: a full running Hadoop system on an old, half-abandoned box with half a gig of memory. Not bad!
Next I just need to set up a distributed environment, use a faster box, and finally learn the power of this sophisticated software.