2. What Is This Document?
A high-level tutorial for setting up a small three-node Hadoop
cluster on Amazon's EC2 cloud;
It aims to be a complete, simple, and fast guide for people
new to Hadoop. Documentation on the Internet is spotty and
text-dense, so getting started can be slower than it needs to
be;
It aims to cover the little stumbling blocks around SSH keys
and a few small bug workarounds that can trip you up on the
first pass through;
It does not aim to cover MapReduce and writing jobs. A
small job will be provided at the end to validate the cluster.
3. Recap - What Is Hadoop?
An open source framework for ‘reliable, scalable, distributed
computing’;
It gives you the ability to process and work with large
datasets that are distributed across clusters of commodity
hardware;
It allows you to parallelize computation and ‘move processing
to the data’ using the MapReduce framework.
4. Recap - What Is Amazon EC2?
A ‘cloud’ web host that allows you to dynamically add and
remove compute server resources as you need them, so that
you pay for only the capacity that you use;
It is well suited to Hadoop computation – we can bring up
enormous clusters within minutes and then spin them down
when we’ve finished to reduce costs;
EC2 is quick and cost-effective for experimental and learning
purposes, and it is also proven as a production Hadoop
host.
5. Assumptions & Notes
The document assumes basic familiarity with Linux, Java,
and SSH;
The cluster will be set up manually to demonstrate concepts
of Hadoop. In real life, we would typically use a configuration
management tool such as Puppet or Chef to manage and
automate larger clusters;
The configuration shown is not production ready. Real
Hadoop clusters need much more bootstrap configuration
and security;
It assumes that you are running your cluster on Ubuntu Linux,
but are accessing the cluster from a Windows host. This is
possibly not a sensible assumption, but it’s what I had at the
time of writing!
7. 1. Start EC2 Servers
Sign up for Amazon Web Services @ http://aws.amazon.com/;
Log in and navigate to Amazon EC2. Using the ‘classic
wizard’, create three micro instances running the latest 64-bit
Ubuntu Server;
If you do not already have a key pair .pem file, you will need
to create one during the process. We will later use this to
connect to the servers and to navigate around within the
cluster, so keep it in a safe place.
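As an aside, if you prefer the command line to the wizard,
something like the following would launch the same three
instances. This is a sketch only – the AMI ID and key pair
name are placeholders to substitute with your own:
aws ec2 run-instances --image-id ami-xxxxxxxx --count 3 --instance-type t1.micro --key-name my-hadoop-key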
8. 2. Name EC2 Servers
For reference, name the instances Master, Slave 1, and
Slave 2 within the EC2 console once they are running;
Note down the hostnames of each of the three instances, shown
in the bottom part of the management console. We will use
these to access the servers.
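The instance names are just EC2 tags, so they can also be set
from the command line if you launched that way; the instance
ID here is a placeholder:
aws ec2 create-tags --resources i-xxxxxxxx --tags Key=Name,Value=Master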
9. 3. Prepare Keys from PEM file
We need to break the Amazon-supplied .pem file down into
private and public keys in order to access the servers from
our local machine;
To do this, download PuttyGen @
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Using PuttyGen, import your PEM file (Conversions > Import
Key) and export public and private keys into a safe place.
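As an aside, if you are connecting from Linux or Mac rather
than Windows, no conversion is needed – the .pem file is
already an OpenSSH private key and can be used directly. The
key name and hostname below are illustrative:
chmod 400 my-key.pem
ssh -i my-key.pem ubuntu@ec2-23-22-133-70.compute-1.amazonaws.com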
10. 4. Configure Putty
We now need to SSH into our EC2 servers from our local
machine. To do this, we will use the Putty SSH client for
Windows;
Begin by configuring Putty sessions for each of the three
servers and saving them for future convenience;
Under Connection > SSH > Auth in the Putty tree, point
towards the private key that you generated in the previous
step.
11. 5. Optional - mRemote
I use a tool called mRemote which allows you to embed Putty
instances into a tabbed browser. Try it @
http://www.mremote.org. I recommend this as navigating
around all of the hosts in your Hadoop cluster can be fiddly
for larger manually managed clusters;
If you do this, be sure to select the corresponding Putty
session for each mRemote connection so that the private key
is carried through and you can connect.
12. 6. Install Java & Hadoop
We need to install Java on the cluster machines in order to
run Hadoop. OpenJDK 7 will suffice for this tutorial.
Connect to all three machines using Putty or mRemote, and
on each of the three machines run the following:
sudo apt-get install openjdk-7-jdk
When that’s complete, configure the JAVA_HOME variable by
adding the following line at the top of ~/.bashrc:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
We can now download and unpack Hadoop. On each of the
three machines run the following:
cd ~
wget http://apache.mirror.rbftpnetworks.com/hadoop/common/hadoop-1.0.3/hadoop-1.0.3-bin.tar.gz
gzip -d hadoop-1.0.3-bin.tar.gz
tar -xf hadoop-1.0.3-bin.tar
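Before moving on, it is worth a quick sanity check on each
machine; the following should report Java 1.7 and Hadoop
1.0.3 respectively:
source ~/.bashrc
java -version
~/hadoop-1.0.3/bin/hadoop version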
13. 7. Configure SSH Keypairs
Hadoop needs to SSH from the master to the slave servers to
start and stop processes;
All of the Amazon servers will have our generated public key
installed in their ~ubuntu/.ssh/authorized_keys file
automatically by Amazon. However, we need to put the
corresponding private key in the ~ubuntu/.ssh/id_rsa file on
the master server so that it can reach the other machines.
To upload the file, use the file transfer software WinSCP @
http://winscp.net to push the file into your .ssh folder.
Be sure to upload the OpenSSH private key exported from
PuttyGen (Conversions > Export OpenSSH Key)!
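SSH is fussy about key file permissions, so after uploading,
lock the file down and test a hop from the master to one of
the slaves (substituting your own slave hostname):
chmod 600 ~/.ssh/id_rsa
ssh ubuntu@ec2-23-20-53-36.compute-1.amazonaws.com hostname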
14. 8. Passwordless SSH
It is better if Hadoop can move between boxes without
requiring the passphrase to your key file;
To do this, we can load the key into the SSH agent by
running the following commands on the master server. This
avoids having to enter the passphrase repeatedly when
stopping and starting the cluster:
◦ ssh-agent bash
◦ ssh-add
15. 9. Open Firewall Ports
We need to open a number of ports to allow the Hadoop
cluster to communicate and to expose its various web
interfaces to us. Do this by adding inbound rules to the
default security group in the AWS EC2 management console,
opening ports 9000, 9001, and 50000-50100.
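These rules can also be added from the command line; a sketch
against the default security group, with a wide-open source
CIDR that is fine for a throwaway learning cluster but should
be tightened for anything real:
aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 9000 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 9001 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 50000-50100 --cidr 0.0.0.0/0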
17. 1. Decide On Cluster Layout
There are four components of Hadoop which we would like to
spread out across the cluster:
◦ Data nodes – actually store and manage data;
◦ Name node – acts as a catalogue service, showing what data is stored
where;
◦ Job tracker – tracks and manages submitted MapReduce jobs;
◦ Task tracker – low level worker that is issued tasks by the job tracker.
Let’s go with the following setup. This is a fairly typical
layout, with data nodes and task trackers spread across the
cluster and a single instance each of the name node and job
tracker:
Node      Hostname             Components
Master    ec2-23-22-133-70     Name Node, Job Tracker
Slave 1   ec2-23-20-53-36      Data Node, Task Tracker
Slave 2   ec2-184-73-42-163    Data Node, Task Tracker
18. 2a. Configure Server Names
Log out of all of the machines and log back into the master
server;
The Hadoop configuration is located here on the server:
cd /home/ubuntu/hadoop-1.0.3/conf
Open the file ‘masters’ and replace the word ‘localhost’ with
the hostname of the server that you have allocated as the
master:
cd /home/ubuntu/hadoop-1.0.3/conf
vi masters
Open the file ‘slaves’ and replace the word ‘localhost’ with
the hostnames of your two slave servers, on two separate
lines:
cd /home/ubuntu/hadoop-1.0.3/conf
vi slaves
19. 2b. Configure Server Names
It should look like this, though of course using your own
allocated hostnames:
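For illustration, using the fully qualified versions of the
hostnames from the cluster layout table earlier (yours will
differ):
masters:
ec2-23-22-133-70.compute-1.amazonaws.com
slaves:
ec2-23-20-53-36.compute-1.amazonaws.com
ec2-184-73-42-163.compute-1.amazonaws.com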
Do not use ‘localhost’ in the masters/slaves files, as this can
lead to non-descriptive errors!
20. 3a. Configure HDFS
HDFS is the distributed file system that sits behind Hadoop
instances, syncing data so that it’s close to the processing
and providing redundancy. We should therefore set this up
first;
We need to specify some mandatory parameters to get HDFS
up and running in various XML configuration files;
Still on the master server, the first thing we need to do is to
set the name of the default file system so that it always points
back at master, again using your own fully qualified
hostname:
/home/ubuntu/hadoop-1.0.3/conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-107-20-118-109.compute-1.amazonaws.com:9000</value>
  </property>
</configuration>
21. 3b. Configure HDFS
Still on the master server, we also need to set the
dfs.replication parameter, which controls how many data nodes
each block is replicated to for failover and redundancy
purposes:
/home/ubuntu/hadoop-1.0.3/conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
22. 4. Configure MapReduce
As well as the underlying HDFS file system, we have to set
one mandatory parameter that will be used by the Hadoop
MapReduce framework;
Still on the master server, we need to set the job tracker
location that Hadoop will use. As discussed earlier, we will
put the job tracker on the master in this instance, again being
careful to substitute in your own master server hostname:
/home/ubuntu/hadoop-1.0.3/conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-107-22-78-136.compute-1.amazonaws.com:54311</value>
  </property>
</configuration>
23. 5a. Push Configuration To Slaves
We now need to push the configuration that we have just
written out to all of the slaves. Typically, this would live on
a shared mount, but this time we will do it manually using
SCP:
cd /home/ubuntu/hadoop-1.0.3/conf
scp * ubuntu@ec2-23-20-53-36.compute-1.amazonaws.com:/home/ubuntu/hadoop-1.0.3/conf
scp * ubuntu@ec2-184-73-42-163.compute-1.amazonaws.com:/home/ubuntu/hadoop-1.0.3/conf
By virtue of pushing out the masters and slaves files, all of
the nodes in this cluster should now be correctly
configured and referencing each other.
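A quick way to satisfy yourself that the push worked is to cat
one of the files back from a slave (your hostname will
differ):
ssh ubuntu@ec2-23-20-53-36.compute-1.amazonaws.com cat /home/ubuntu/hadoop-1.0.3/conf/masters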
24. 6a. Format HDFS
Before we can start Hadoop, we need to format and initialise
the underlying distributed file system;
To do this, on the master server, execute the following
command:
cd /home/ubuntu/hadoop-1.0.3/bin
./hadoop namenode -format
It should only take a minute;
The file system will be built and formatted. It lives in
/tmp/hadoop-ubuntu if you would like to browse around. Hadoop
will manage this file system, distributing data across the
nodes and brokering access to it.
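As a rough check, the name node’s freshly formatted storage
directory should now exist locally; the path below reflects
Hadoop 1.x defaults rather than anything guaranteed. Expect
files such as VERSION, fsimage, and edits:
ls /tmp/hadoop-ubuntu/dfs/name/current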
27. 1a. Start HDFS
We will begin by starting the HDFS file system from the
master server.
There is a script for this which will run the name node on the
master and the data nodes on the slaves:
cd /home/ubuntu/hadoop-1.0.3
./bin/start-dfs.sh
28. 1b. Start HDFS
At this point, monitor the log files on the master and the
slaves. You should see that HDFS cleanly starts on the
slaves when we start it on the master:
cd /home/ubuntu/hadoop-1.0.3
tail -f logs/hadoop-ubuntu-datanode-ip-10-245-114-186.log
If anything appears to have gone wrong, double-check that the
configuration files above are correct, that the firewall ports
are open, and that everything has been accurately pushed to
all slaves.
29. 2. Start MapReduce
Once we’ve confirmed that the HDFS is up, it is time to
start the MapReduce component of Hadoop;
There is a script for this which will run the job tracker on
the master and the task trackers on the slaves. Run the
following on the master server:
cd /home/ubuntu/hadoop-1.0.3
./bin/start-mapred.sh
Again, double-check the log files on all servers to confirm
that everything is communicating cleanly before moving
further; double-check the configuration and firewall ports in
the case of issues.
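With both scripts run, a quick check is the jps tool that
ships with the JDK, which lists running Java processes. Run
it on each machine:
jps
On the master you should see NameNode, SecondaryNameNode, and
JobTracker; on each slave, DataNode and TaskTracker.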
31. 2a. Web Interfaces
Now we’re up and running, Hadoop has started a number of
web interfaces that give information about the cluster and
HDFS. Take a look at these to familiarise yourself with them:
NameNode     master:50070                   Information about the name node and the health of the distributed file system
DataNode     slave1:50075, slave2:50075     Status information for each data node
JobTracker   master:50030                   Information about submitted and queued jobs
TaskTracker  slave1:50060, slave2:50060     Information about tasks that are submitted and queued
36. 1. Source Dataset
It’s now time to submit a processing job to the Hadoop
cluster.
Though we won’t go into much detail here, for this exercise I
used a dataset of UK government spending, which you can
download onto your master server like so:
wget http://www.dwp.gov.uk/docs/dwp-payments-april10.csv
37. 2. Push Dataset Into HDFS
We need to push the dataset into the HDFS so that it’s
available to be shared across the nodes for subsequent
processing:
/home/ubuntu/hadoop-1.0.3/bin/hadoop dfs -put dwp-payments-april10.csv dwp-payments-april10.csv
After pushing the file, note how the NameNode health page
shows the file count and space used increasing.
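You can also confirm from the command line that the file has
landed in your HDFS home directory:
/home/ubuntu/hadoop-1.0.3/bin/hadoop dfs -ls /user/ubuntu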
38. 3a. Write A MapReduce Job
It’s now time to open our IDE and write the Java code that will
represent our MapReduce job. For the purposes of this
presentation, we are just looking to validate the Hadoop
cluster, so we will not go into detail with regards to
MapReduce;
My Java project is available @
http://www.benjaminwootton.co.uk/wordcount.zip;
As per the sample project, I recommend that you use Maven
in order to easily bring in the Hadoop dependencies and build
the required JAR. You can manage this part differently if you
prefer.
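For orientation, here is a minimal sketch of the classic
Hadoop 1.x WordCount in Java. The package and class names are
chosen to line up with the job invocation on a later slide,
but the actual sample project linked above may differ in its
details:
// Hypothetical package name, matching the invocation benjaminwootton.WordCount
package benjaminwootton;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (token, 1) for every whitespace-separated token in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each distinct token
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // safe because the reduce is associative
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}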
40. 4. Package The MapReduce Job
Hadoop jobs are typically packaged as Java JAR files. By
virtue of using Maven, we can get the packaged JAR simply
by running mvn clean package against the sample
project;
Upload the JAR onto your master server using WinSCP.
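WinSCP works well; alternatively, PuTTY’s bundled pscp
command-line tool can push the JAR using the same .ppk
private key – the key and JAR paths here are illustrative:
pscp -i my-key.ppk target/firsthadoop.jar ubuntu@ec2-23-22-133-70.compute-1.amazonaws.com:/home/ubuntu/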
41. 5a. Execute The MapReduce Job
We are finally ready to run the job! To do that, we’ll run the
Hadoop script and pass in a reference to the JAR file, the
name of the class containing the main method, and the input
and output paths that will be used by our Java code:
cd /home/ubuntu/hadoop-1.0.3
./bin/hadoop jar ~/firsthadoop.jar benjaminwootton.WordCount
/user/ubuntu/dwp-payments-april10.csv /user/ubuntu/RESULTS
If all goes well, we’ll see the job run with no errors:
ubuntu@ip-10-212-121-253:~/hadoop-1.0.3$ ./bin/hadoop jar ~/firsthadoop.jar
benjaminwootton.WordCount /user/ubuntu/dwp-payments-april10.csv
/user/ubuntu/RESULTS
12/06/03 16:06:12 INFO mapred.JobClient: Running job: job_201206031421_0004
12/06/03 16:06:13 INFO mapred.JobClient: map 0% reduce 0%
12/06/03 16:06:29 INFO mapred.JobClient: map 50% reduce 0%
12/06/03 16:06:31 INFO mapred.JobClient: map 100% reduce 0%
12/06/03 16:06:41 INFO mapred.JobClient: map 100% reduce 100%
12/06/03 16:06:50 INFO mapred.JobClient: Job complete: job_201206031421_0004
42. 5b. Execute The MapReduce Job
You will also be able to monitor the progress of the job on
the job tracker web application that is running on the
master server.
43. 6. Confirm Results!
And the final step is to confirm our results by interrogating
the HDFS:
ubuntu@ip-10-212-121-253:~/hadoop-1.0.3$ ./bin/hadoop dfs -cat /user/ubuntu/RESULTS/part-00000
Pension 2150
Corporate 5681
Employment Programmes 491
Entity 2
Housing Benefits 388
Jobcentre Plus 14774
45. 1. What We Have Done!
Set up EC2, requested machines, configured firewalls and
passwordless SSH;
Downloaded Java and Hadoop;
Configured HDFS and MapReduce and pushed configuration
around the cluster;
Started HDFS and MapReduce;
Compiled a MapReduce job using Maven;
Submitted the job, ran it successfully, and viewed the output.
46. 2. In Summary…
Hopefully you can see how this model of computation would
be useful for very large datasets that we wish to process;
Hopefully you are also sold on EC2 as a distributed, fast,
cost-effective platform for using Hadoop for big-data work.
Please get in touch with any questions, comments, or
corrections!
@benjaminwootton
www.benjaminwootton.co.uk