This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
Hadoop Overview & Architecture
1. Hadoop Overview & Architecture
Milind Bhandarkar
Chief Scientist, Machine Learning Platforms,
Greenplum, A Division of EMC
(Twitter: @techmilind)
2. About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center
for Supercomputing Applications (NCSA), Center for Simulation of Advanced
Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn,
and EMC-Greenplum
14. Search Assist
• Insight: Related concepts appear close
together in text corpus
• Input: Web pages
• 1 Billion Pages, 10K bytes each
• 10 TB of input data
• Output: List(word, List(related words))
15. Search Assist
// Input: List(URL, Text)
foreach URL in Input :
    Tokens = Tokenize(Text(URL));
    foreach word in Tokens :
        Insert (word, Next(word, Tokens)) in Pairs;
        Insert (word, Previous(word, Tokens)) in Pairs;
// Result: Pairs = List (word, RelatedWord)

Group Pairs by word;
// Result: List (word, List(RelatedWords))

foreach word in GroupedPairs :
    Count RelatedWords in GroupedPairs;
// Result: List (word, List(RelatedWords, count))

foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    Choose Top 5 Pairs;
// Result: List (word, Top5(RelatedWords))
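To make the pseudocode concrete, here is a minimal Java sketch of the same job in the old-API style used for WordCount later in this deck. The class and method names are illustrative, not from the slides: the mapper emits (word, neighboring word) pairs, and the reducer counts related words per key and keeps the top 5.

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Illustrative sketch (not from the slides) of the Search Assist job.
public class SearchAssist {

  // Map: emit (word, next word) and (word, previous word) pairs.
  public static class PairMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      String[] tokens = value.toString().split("\\s+");
      for (int i = 0; i < tokens.length; i++) {
        if (i + 1 < tokens.length)
          output.collect(new Text(tokens[i]), new Text(tokens[i + 1]));
        if (i > 0)
          output.collect(new Text(tokens[i]), new Text(tokens[i - 1]));
      }
    }
  }

  // Reduce: count related words for each word and keep the top 5 by count.
  public static class TopFiveReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterator<Text> related,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
      final Map<String, Integer> counts = new HashMap<String, Integer>();
      while (related.hasNext()) {
        String w = related.next().toString();
        Integer c = counts.get(w);
        counts.put(w, c == null ? 1 : c + 1);
      }
      List<String> words = new ArrayList<String>(counts.keySet());
      Collections.sort(words, new Comparator<String>() {
        public int compare(String a, String b) {
          return counts.get(b) - counts.get(a);   // descending by count
        }
      });
      StringBuilder top5 = new StringBuilder();
      for (int i = 0; i < Math.min(5, words.size()); i++) {
        if (i > 0) top5.append(',');
        top5.append(words.get(i));
      }
      output.collect(word, new Text(top5.toString()));
    }
  }
}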
17. People You May Know
• Insight: You might also know Joe Smith if a
lot of folks you know, know Joe Smith
• if you don't know Joe Smith already
• Numbers:
• 100 MM users
• Average connections per user is 100
18. People You May Know
Input: List(UserName, List(Connections))

foreach u in UserList :                   // 100 MM
    foreach x in Connections(u) :         // 100
        foreach y in Connections(x) :     // 100
            if (y not in Connections(u)) :
                Count(u, y)++;            // 1 Trillion Iterations
    Sort (u, y) in descending order of Count(u, y);
    Choose Top 3 y;
    Store (u, {y0, y1, y2}) for serving;
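A direct, single-machine Java translation of the pseudocode, assuming a simple user-to-connections map as input (the names and data structures here are illustrative, not from the slides). The next slide shows why running this on one machine is hopeless.

import java.util.*;

// Illustrative, single-machine sketch of the "People You May Know" count.
public class PeopleYouMayKnow {

  // connections: user -> set of direct connections (assumed input shape).
  static Map<String, List<String>> suggest(Map<String, Set<String>> connections, int topK) {
    Map<String, List<String>> result = new HashMap<>();
    for (String u : connections.keySet()) {                        // 100 MM users
      final Map<String, Integer> count = new HashMap<>();
      for (String x : connections.get(u)) {                        // ~100 connections
        for (String y : connections.getOrDefault(x, Collections.<String>emptySet())) {
          if (!y.equals(u) && !connections.get(u).contains(y)) {   // skip u and existing connections
            count.merge(y, 1, Integer::sum);                       // ~1 trillion iterations overall
          }
        }
      }
      List<String> candidates = new ArrayList<>(count.keySet());
      candidates.sort((a, b) -> count.get(b) - count.get(a));       // descending by Count(u, y)
      result.put(u, candidates.subList(0, Math.min(topK, candidates.size())));
    }
    return result;
  }
}

In MapReduce terms, the same counting is typically expressed as a friends-of-friends join followed by a per-user aggregation.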
19. Performance
• 101 Random accesses for each user
• Assume 1 ms per random access
• 100 ms per user
• 100 MM users
• 100 days on a single machine (100 MM users × 100 ms ≈ 10^7 seconds)
24. Parallelism
• Map is inherently parallel
• Each list element processed independently
• Reduce is inherently sequential
• Unless processing multiple lists
• Grouping to produce multiple lists
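A toy Java illustration of this point (not from the slides): the per-element map work runs in parallel, the reduce for any single key is sequential, and grouping by key restores parallelism across keys.

import java.util.*;
import java.util.stream.*;

// Toy word count: parallel map, grouped per-key reduce.
public class MapThenReduce {
  public static void main(String[] args) {
    List<String> lines = Arrays.asList("a b a", "b c", "a c c");
    Map<String, Long> counts = lines.parallelStream()               // map: each element independent
        .flatMap(line -> Arrays.stream(line.split(" ")))
        .collect(Collectors.groupingBy(w -> w,                      // group by key
                                       Collectors.counting()));     // reduce per key
    System.out.println(counts);  // a=3, b=2, c=3 (iteration order may vary)
  }
}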
25. Search Assist Map
// Input: http://hadoop.apache.org
Pairs = Tokenize_And_Pair ( Text ( Input ) )
27. Issues with Large Data
• Map Parallelism: Chunking input data
• Reduce Parallelism: Grouping related data
• Dealing with failures and load imbalance
29. Apache Hadoop
• January 2006: Subproject of Lucene
• January 2008: Top-level Apache project
• Stable Version: 1.0.3
• Latest Version: 2.0.0 (Alpha)
31. Problem: Bandwidth to Data
• Scan 100 TB datasets on a 1000-node cluster (~100 GB per node)
• Remote storage @ 10 MB/s = ~165 mins
• Local storage @ 50-200 MB/s = 33 to 8 mins
• Moving computation is more efficient than
moving data
• Need visibility into data placement
32. Problem: Scaling Reliably
• Failure is not an option, it's a rule!
• 1000 nodes, MTBF 1 day
• 4000 disks, 8000 cores, 25 switches, 1000
NICs, 2000 DIMMS (16TB RAM)
• Need fault tolerant store with reasonable
availability guarantees
• Handle hardware faults transparently
33. Hadoop Goals
• Scalable: Petabytes (10^15 bytes) of data on thousands of nodes
• Economical: Commodity components only
• Reliable
• Engineering reliability into every application
is expensive
40. Hadoop Streaming
• Hadoop is written in Java
• Java MapReduce code is native
• What about Non-Java Programmers ?
• Perl, Python, Shell, R
• grep, sed, awk, uniq as Mappers/Reducers
• Text Input and Output
41. Hadoop Streaming
• Thin Java wrapper for Map and Reduce Tasks
• Forks the actual Mapper / Reducer
• IPC via stdin, stdout, stderr
• Key.toString() \t Value.toString() \n
• Slower than Java programs
• Allows for quick prototyping / debugging
44. HDFS
• Data is organized into files and directories
• Files are divided into uniform sized blocks
(default 128MB) and distributed across
cluster nodes
• HDFS exposes block placement so that
computation can be migrated to data
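As a small illustration of that visibility (a sketch, not from the slides), a client can ask for the location of every block of a file through the public FileSystem API; the framework uses the same information to schedule map tasks near their data.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Print the hosts that store each block of an HDFS file.
public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);              // an HDFS path passed on the command line
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
  }
}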
45. HDFS
• Blocks are replicated (default 3) to handle
hardware failure
• Replication for performance and fault
tolerance (Rack-Aware placement)
• HDFS keeps checksums of data for
corruption detection and recovery
48. Namenode
• Mapping of datanode to list of blocks
• Monitor datanode health
• Replicate missing blocks
• Keeps ALL namespace in memory
• 60M objects (File/Block) in 16GB
49. Datanodes
• Handle block storage on multiple volumes and block integrity
• Clients access the blocks directly from data
nodes
• Periodically send heartbeats and block
reports to Namenode
• Blocks are stored as underlying OS's files
51. Example: Unigrams
• Input: Huge text corpus
• Wikipedia Articles (40GB uncompressed)
• Output: List of words sorted in descending
order of frequency
53. MR for Unigrams
mapper (filename, file-contents):
    for each word in file-contents:
        emit (word, 1)

reducer (word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)
54. MR for Unigrams
mapper (word, frequency):
    emit (frequency, word)

reducer (frequency, words):
    for each word in words:
        emit (word, frequency)
55. Unigrams: Java Mapper
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      Text word = new Text(itr.nextToken());
      output.collect(word, new IntWritable(1));
    }
  }
}
56. Unigrams: Java Reducer
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
57. Unigrams: Driver
public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  // Output key/value types for the (Text, IntWritable) pairs collected above.
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(Reduce.class);
  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));
  JobClient.runJob(conf);
}
86. Why ? (contd)
• MapReduce is not suitable for every
application
• Fine-Grained Iterative applications
• HaLoop: Hadoop in a Loop
• Message passing applications
• Graph Processing
87. Requirements
• Need scalable cluster resources manager
• Separate scheduling from resource
management
• Multi-Lingual Communication Protocols
88. Bottom Line
• @techmilind #mrng (MapReduce, Next Gen) is, in reality, #rmng (Resource Manager, Next Gen)
• Expect different programming paradigms to
be implemented
• Including MPI (soon)
90. The New World
• Resource Manager
• Allocates resources (containers) to applications
• Node Manager
• Manages containers on nodes
• Application Master
• Specific to the paradigm, e.g. MapReduce application master, MPI application master, etc.
91. Container
• In current terminology: A Task Slot
• Slice of the node's hardware resources
• # of cores, virtual memory, disk size, disk and network bandwidth, etc.
• Currently, only memory usage is sliced