SlideShare a Scribd company logo
1 of 91
Download to read offline
Hadoop Overview &
   Architecture	


            Milind Bhandarkar	

  Chief Scientist, Machine Learning Platforms,	

        Greenplum, A Division of EMC	

            (Twitter: @techmilind)
About Me	

•    http://www.linkedin.com/in/milindb	


•    Founding member of Hadoop team at Yahoo! [2005-2010]	


•    Contributor to Apache Hadoop since v0.1	


•    Built and led Grid Solutions Team at Yahoo! [2007-2010]	


•    Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)	


•    Center for Development of Advanced Computing (C-DAC), National Center
     for Supercomputing Applications (NCSA), Center for Simulation of Advanced
     Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn,
     and EMC-Greenplum
Agenda	

• Motivation	

• Hadoop	

 • Map-Reduce	

 • Distributed File System	

 • Hadoop Architecture	

 • Next Generation MapReduce	

• Q & A	

                    2
Hadoop At Scale
      (Some Statistics)	

• 40,000 + machines in 20+ clusters	

• Largest cluster is 4,000 machines	

• 170 Petabytes of storage	

• 1000+ users	

• 1,000,000+ jobs/month	

                      3
BEHIND
EVERY CLICK
Hadoop Workflow
Who Uses Hadoop ?
Why Hadoop ?	



       7
Big Datasets
(Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
Cost Per Gigabyte                             
(http://www.mkomo.com/cost-per-gigabyte)
Storage Trends
(Graph by Adam Leventhal, ACM Queue, Dec 2009)
Motivating Examples	



          11
Yahoo! Search Assist
Search Assist	

• Insight: Related concepts appear close
  together in text corpus	

• Input: Web pages	

 • 1 Billion Pages, 10K bytes each	

 • 10 TB of input data	

• Output: List(word, List(related words))	

                       13
Search Assist	

// Input: List(URL, Text)	
	
foreach URL in Input :	
	
    Words = Tokenize(Text(URL));	
	
    foreach word in Tokens :	
	
        Insert (word, Next(word, Tokens)) in Pairs;	
	
        Insert (word, Previous(word, Tokens)) in Pairs;	
	
// Result: Pairs = List (word, RelatedWord)	
	
Group Pairs by word;	
// Code Samples	
// Result: List (word, List(RelatedWords)	
	
foreach word in Pairs :	
	
    Count RelatedWords in GroupedPairs;	
	
// Result: List (word, List(RelatedWords, count))	
	
foreach word in CountedPairs :	
	
    Sort Pairs(word, *) descending by count;	
	
    choose Top 5 Pairs;	
	
// Result: List (word, Top5(RelatedWords))	

                            14
People You May Know
People You May Know	

• Insight: You might also know Joe Smith if a
  lot of folks you know, know Joe Smith	

 • if you don t know Joe Smith already	

• Numbers:	

 • 100 MM users	

 • Average connections per user is 100	

                      16
People You May Know	

	
	
	 Input: List(UserName, List(Connections))	
//
	
	

	
foreach u in UserList : // 100 MM	
	
	   foreach x in Connections(u) : // 100	
// Code foreach y in Connections(x) : // 100	
        Samples	
	           if (y not in Connections(u)) :	
	                Count(u, y)++; // 1 Trillion Iterations	
	   Sort (u,y) in descending order of Count(u,y);	
	   Choose Top 3 y;	
	   Store (u, {y0, y1, y2}) for serving;	
	
	

                            17
Performance	

• 101 Random accesses for each user	

 • Assume 1 ms per random access	

 • 100 ms per user	

• 100 MM users	

 • 100 days on a single machine	

                     18
MapReduce Paradigm	



         19
Map  Reduce	


• Primitives in Lisp ( Other functional
  languages) 1970s	

• Google Paper 2004	

 • http://labs.google.com/papers/
    mapreduce.html	



                        20
Map	


Output_List = Map (Input_List)	




Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) =	
	
(1, 4, 9, 16, 25, 36,49, 64, 81, 100)	




                            21
Reduce	


Output_Element = Reduce (Input_List)	




Sum (1, 4, 9, 16, 25, 36,49, 64, 81, 100) = 385	




                            22
Parallelism	

• Map is inherently parallel	

 • Each list element processed independently	

• Reduce is inherently sequential	

 • Unless processing multiple lists	

• Grouping to produce multiple lists	

                      23
Search Assist Map	

// Input: http://hadoop.apache.org  	

	
Pairs = Tokenize_And_Pair ( Text ( Input ) )	




	
	
// Example	
	
	



                            24
Search Assist Reduce	

// Input: GroupedList (word, GroupedList(words))	
	
CountedPairs = CountOccurrences (word, RelatedWords)	




Output = {	
(hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming,
4) (hadoop, mapreduce, 9) ...	
}	



                            25
Issues with Large Data	


• Map Parallelism: Chunking input data	

• Reduce Parallelism: Grouping related data	

• Dealing with failures  load imbalance	



                       26
Apache Hadoop	


• January 2006: Subproject of Lucene	

• January 2008: Top-level Apache project	

• Stable Version: 1.0.3	

• Latest Version: 2.0.0 (Alpha)	


                      28
Apache Hadoop	

• Reliable, Performant Distributed file system	

• MapReduce Programming framework	

• Ecosystem: HBase, Hive, Pig, Howl, Oozie,
  Zookeeper, Chukwa, Mahout, Cascading,
  Scribe, Cassandra, Hypertable, Voldemort,
  Azkaban, Sqoop, Flume, Avro ...	



                       29
Problem: Bandwidth to
        Data	

• Scan 100TB Datasets on 1000 node cluster	

 • Remote storage @ 10MB/s = 165 mins	

 • Local storage @ 50-200MB/s = 33-8 mins	

• Moving computation is more efficient than
  moving data	

  • Need visibility into data placement	

                       30
Problem: Scaling Reliably	

• Failure is not an option, it s a rule !	

 • 1000 nodes, MTBF  1 day	

 • 4000 disks, 8000 cores, 25 switches, 1000
    NICs, 2000 DIMMS (16TB RAM)	

• Need fault tolerant store with reasonable
  availability guarantees	

  • Handle hardware faults transparently	

                         31
Hadoop Goals	

• Scalable: Petabytes (10      15   Bytes) of data on
  thousands on nodes	

• Economical: Commodity components only	

• Reliable	

 • Engineering reliability into every application
    is expensive	



                       32
Hadoop MapReduce	



        33
Think MapReduce	


• Record = (Key, Value)	

• Key : Comparable, Serializable	

• Value: Serializable	

• Input, Map, Shuffle, Reduce, Output	


                      34
Seems Familiar ?	

	
	
	
	
	
cat /var/log/auth.log* |  	
	
grep “session opened” | cut -d’ ‘ -f10 | 	
	
sort | Samples	
// Code	
uniq -c  	
	
~/userlist 	
	
	
	
	
	
	

                            35
Map	


• Input: (Key , Value )	

              1         1

• Output: List(Key , Value )	

                    2               2

• Projections, Filtering, Transformation	



                            36
Shuffle	


• Input: List(Key , Value )	

                  2             2

• Output	

 • Sort(Partition(List(Key , List(Value ))))	

                                    2    2

• Provided by Hadoop	


                        37
Reduce	


• Input: List(Key , List(Value ))	

                   2                   2

• Output: List(Key , Value )	

                       3           3

• Aggregation	



                           38
Hadoop Streaming	

• Hadoop is written in Java	

 • Java MapReduce code is native 	

• What about Non-Java Programmers ?	

 • Perl, Python, Shell, R	

 • grep, sed, awk, uniq as Mappers/Reducers	

• Text Input and Output	

                      39
Hadoop Streaming	

• Thin Java wrapper for Map  Reduce Tasks	

• Forks actual Mapper  Reducer	

• IPC via stdin, stdout, stderr	

• Key.toString() t Value.toString() n	

• Slower than Java programs	

 • Allows for quick prototyping / debugging	

                      40
Hadoop Streaming	

	
	
	 bin/hadoop jar hadoop-streaming.jar 	
$
	     -input in-files -output out-dir 	
	     -mapper mapper.sh -reducer reducer.sh	
	
	 mapper.sh	
#
// Code Samples	
	
sed -e 's/ /n/g' | grep .	
	
	
#
	 reducer.sh	
	
	
uniq -c | awk '{print $2 t $1}'	
	
	

                            41
Hadoop Distributed File
   System (HDFS)	



           42
HDFS	

• Data is organized into files and directories	

• Files are divided into uniform sized blocks
  (default 128MB) and distributed across
  cluster nodes	

• HDFS exposes block placement so that
  computation can be migrated to data	



                        43
HDFS	

• Blocks are replicated (default 3) to handle
  hardware failure	

• Replication for performance and fault
  tolerance (Rack-Aware placement)	

• HDFS keeps checksums of data for
  corruption detection and recovery	



                        44
HDFS	


• Master-Worker Architecture	

• Single NameNode	

• Many (Thousands) DataNodes	



                    45
HDFS Master
         (NameNode)	

• Manages filesystem namespace	

• File metadata (i.e. inode )	

• Mapping inode to list of blocks + locations	

• Authorization  Authentication	

• Checkpoint  journal namespace changes	

                       46
Namenode	

• Mapping of datanode to list of blocks	

• Monitor datanode health	

• Replicate missing blocks	

• Keeps ALL namespace in memory	

• 60M objects (File/Block) in 16GB	

                       47
Datanodes	

• Handle block storage on multiple volumes 
  block integrity	

• Clients access the blocks directly from data
  nodes	

• Periodically send heartbeats and block
  reports to Namenode	

• Blocks are stored as underlying OS s files	

                       48
HDFS Architecture
Example: Unigrams	


• Input: Huge text corpus	

 • Wikipedia Articles (40GB uncompressed)	

• Output: List of words sorted in descending
  order of frequency
Unigrams	


$ cat ~/wikipedia.txt | 	
sed -e 's/ /n/g' | grep . | 	
sort | 	
uniq -c  	
~/frequencies.txt	
	
$ cat ~/frequencies.txt | 	
# cat | 	
sort -n -k1,1 -r |	
# cat  	
~/unigrams.txt
MR for Unigrams	


mapper (filename, file-contents):	
    	for each word in file-contents:	
    	    	emit (word, 1)	
	
reducer (word, values):	
    	sum = 0	
    	for each value in values:	
    	    	sum = sum + value	
    	emit (word, sum)
MR for Unigrams	



mapper (word, frequency):	
    	emit (frequency, word)	
	
reducer (frequency, words):	
    	for each word in words:	
    	    	emit (word, frequency)
Unigrams: Java Mapper	

public static class MapClass extends MapReduceBase	
     	implements MapperLongWritable, Text, Text, IntWritable {	
	
         public void map(LongWritable key, Text value,      	    	
     	      	OutputCollectorText, IntWritable output,	
     	      	Reporter reporter) throws IOException {	
	
     	      	String line = value.toString();	
     	      	StringTokenizer itr = new StringTokenizer(line);	
     	      	while (itr.hasMoreTokens()) {	
     	      	     	Text word = new Text(itr.nextToken());	
     	      	     	output.collect(word, new IntWritable(1));	
     	      	}	
     	}	
}
Unigrams: Java Reducer	


public static class Reduce extends MapReduceBase	
     	implements ReducerText, IntWritable, Text, IntWritable {	
	
     	public void reduce(Text key,IteratorIntWritable values,	
     	     	OutputCollectorText,IntWritable output,	
     	     	Reporter reporter) throws IOException {	
     	     		
     	     	int sum = 0;	
     	     	while (values.hasNext()) {	
     	     	     	sum += values.next().get();	
     	     	}	
     	     	output.collect(key, new IntWritable(sum));	
     	}	
}
Unigrams: Driver	


public void run(String inputPath, String outputPath) throws
Exception	
{	
     	JobConf conf = new JobConf(WordCount.class);	
     	conf.setJobName(wordcount);	
     	conf.setMapperClass(MapClass.class);	
     	conf.setReducerClass(Reduce.class);	
     	FileInputFormat.addInputPath(conf, new Path(inputPath)); 	
     	FileOutputFormat.setOutputPath(conf, new Path(outputPath));	
     	JobClient.runJob(conf);	
}
Configuration	

•  Unified Mechanism for	

  •  Configuring Daemons	

  •  Runtime environment for Jobs/Tasks	

•  Defaults: *-default.xml	

•  Site-Specific: *-site.xml	

•  final parameters
Example	

configuration	
    	property	
    	    	namemapred.job.tracker/name	
    	    	valuehead.server.node.com:9001/value	
    	/property	
    	property	
    	    	namefs.default.name/name	
    	    	valuehdfs://head.server.node.com:9000/value	
    	/property	
    	property	
    	namemapred.child.java.opts/name	
    	value-Xmx512m/value	
    	finaltrue/final	
    	/property	
....	
/configuration
Running Hadoop Jobs
Running a Job	

[milindb@gateway ~]$ hadoop jar 	
$HADOOP_HOME/hadoop-examples.jar wordcount 	
/data/newsarchive/20080923 /tmp/
newsoutinput.FileInputFormat: Total input paths to
process : 4	
mapred.JobClient: Running job: job_200904270516_5709	
mapred.JobClient: map 0% reduce 0%	
mapred.JobClient: map 3% reduce 0%	
mapred.JobClient: map 7% reduce 0%	
....	
mapred.JobClient: map 100% reduce 21%	
mapred.JobClient: map 100% reduce 31%	
mapred.JobClient: map 100% reduce 33%	
mapred.JobClient: map 100% reduce 66%	
mapred.JobClient: map 100% reduce 100%	
mapred.JobClient: Job complete: job_200904270516_5709
Running a Job	


mapred.JobClient: Counters: 18	
mapred.JobClient:   Job Counters	
mapred.JobClient:     Launched reduce tasks=1	
mapred.JobClient:     Rack-local map tasks=10	
mapred.JobClient:     Launched map tasks=25	
mapred.JobClient:     Data-local map tasks=1	
mapred.JobClient:   FileSystemCounters	
mapred.JobClient:     FILE_BYTES_READ=491145085	
mapred.JobClient:     HDFS_BYTES_READ=3068106537	
mapred.JobClient:     FILE_BYTES_WRITTEN=724733409	
mapred.JobClient:     HDFS_BYTES_WRITTEN=377464307
Running a Job	


mapred.JobClient:   Map-Reduce Framework	
mapred.JobClient:     Combine output records=73828180	
mapred.JobClient:     Map input records=36079096	
mapred.JobClient:     Reduce shuffle bytes=233587524	
mapred.JobClient:     Spilled Records=78177976	
mapred.JobClient:     Map output bytes=4278663275	
mapred.JobClient:     Combine input records=371084796	
mapred.JobClient:     Map output records=313041519	
mapred.JobClient:     Reduce input records=15784903
JobTracker WebUI	

	
	
	
	
	
	
	
// Code Samples
JobTracker Status
Jobs Status
Job Details
Job Counters
Job Progress
All Tasks
Task Details
Task Counters
Task Logs
MapReduce Dataflow
MapReduce
Job Submission
Initialization
Scheduling
Execution
Map Task
Sort Buffer
Reduce Task
Next Generation
  MapReduce	



       82
MapReduce Today	

    (Courtesy: Arun Murthy, Hortonworks)
Why ?	

• Scalability Limitations today	

 • Maximum cluster size: 4000 nodes	

 • Maximum Concurrent tasks: 40,000	

• Job Tracker SPOF	

• Fixed map and reduce containers (slots)	

 • Punishes pleasantly parallel apps	

                      84
Why ? (contd)	

• MapReduce is not suitable for every
  application	

• Fine-Grained Iterative applications	

 • HaLoop: Hadoop in a Loop	

• Message passing applications	

 • Graph Processing	

                        85
Requirements	


• Need scalable cluster resources manager	

• Separate scheduling from resource
  management	

• Multi-Lingual Communication Protocols	


                      86
Bottom Line	

• @techmilind #mrng (MapReduce, Next
  Gen) is in reality, #rmng (Resource Manager,
  Next Gen)	

• Expect different programming paradigms to
  be implemented	

 • Including MPI (soon)	

                      87
Architecture	

  (Courtesy: Arun Murthy, Hortonworks)
The New World	

•  Resource Manager	

  •  Allocates resources (containers) to applications	

•  Node Manager	

  •  Manages containers on nodes	

•  Application Master	

  •  Specific to paradigm e.g. MapReduce application master,
      MPI application master etc	



                               89
Container	


• In current terminology: A Task Slot	

• Slice of the node s hardware resources	

• #of cores, virtual memory, disk size, disk and
  network bandwidth etc	

  • Currently, only memory usage is sliced	


                       90

More Related Content

What's hot

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 

What's hot (20)

Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Sqoop
SqoopSqoop
Sqoop
 
Session 14 - Hive
Session 14 - HiveSession 14 - Hive
Session 14 - Hive
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop
HadoopHadoop
Hadoop
 

Viewers also liked

HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...DataWorks Summit
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview Hortonworks
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data miningharithavijay94
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityDataWorks Summit
 
Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Peter Wood
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersDataWorks Summit
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXAbhishek Mallick
 

Viewers also liked (20)

HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data mining
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOX
 

Similar to Hadoop Overview & Architecture

Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindEMC
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataDhanashri Yadav
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveSharjeel Imtiaz
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at myliferesponseteam
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalMichael Rainey
 

Similar to Hadoop Overview & Architecture (20)

Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilind
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Hadoop
HadoopHadoop
Hadoop
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 

More from EMC

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDEMC
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote EMC
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOEMC
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremioEMC
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lakeEMC
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereEMC
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History EMC
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewEMC
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeEMC
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic EMC
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityEMC
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeEMC
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsEMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookEMC
 

More from EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Recently uploaded

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Recently uploaded (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Hadoop Overview & Architecture

  • 1. Hadoop Overview & Architecture Milind Bhandarkar Chief Scientist, Machine Learning Platforms, Greenplum, A Division of EMC (Twitter: @techmilind)
  • 2. About Me •  http://www.linkedin.com/in/milindb •  Founding member of Hadoop team at Yahoo! [2005-2010] •  Contributor to Apache Hadoop since v0.1 •  Built and led Grid Solutions Team at Yahoo! [2007-2010] •  Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) •  Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and EMC-Greenplum
  • 3. Agenda • Motivation • Hadoop • Map-Reduce • Distributed File System • Hadoop Architecture • Next Generation MapReduce • Q & A 2
  • 4. Hadoop At Scale (Some Statistics) • 40,000 + machines in 20+ clusters • Largest cluster is 4,000 machines • 170 Petabytes of storage • 1000+ users • 1,000,000+ jobs/month 3
  • 9. Big Datasets (Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
  • 10. Cost Per Gigabyte (http://www.mkomo.com/cost-per-gigabyte)
  • 11. Storage Trends (Graph by Adam Leventhal, ACM Queue, Dec 2009)
  • 14. Search Assist • Insight: Related concepts appear close together in text corpus • Input: Web pages • 1 Billion Pages, 10K bytes each • 10 TB of input data • Output: List(word, List(related words)) 13
  • 15. Search Assist // Input: List(URL, Text) foreach URL in Input : Words = Tokenize(Text(URL)); foreach word in Tokens : Insert (word, Next(word, Tokens)) in Pairs; Insert (word, Previous(word, Tokens)) in Pairs; // Result: Pairs = List (word, RelatedWord) Group Pairs by word; // Code Samples // Result: List (word, List(RelatedWords) foreach word in Pairs : Count RelatedWords in GroupedPairs; // Result: List (word, List(RelatedWords, count)) foreach word in CountedPairs : Sort Pairs(word, *) descending by count; choose Top 5 Pairs; // Result: List (word, Top5(RelatedWords)) 14
  • 17. People You May Know • Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith • if you don t know Joe Smith already • Numbers: • 100 MM users • Average connections per user is 100 16
  • 18. People You May Know Input: List(UserName, List(Connections)) // foreach u in UserList : // 100 MM foreach x in Connections(u) : // 100 // Code foreach y in Connections(x) : // 100 Samples if (y not in Connections(u)) : Count(u, y)++; // 1 Trillion Iterations Sort (u,y) in descending order of Count(u,y); Choose Top 3 y; Store (u, {y0, y1, y2}) for serving; 17
  • 19. Performance • 101 Random accesses for each user • Assume 1 ms per random access • 100 ms per user • 100 MM users • 100 days on a single machine 18
  • 21. Map Reduce • Primitives in Lisp ( Other functional languages) 1970s • Google Paper 2004 • http://labs.google.com/papers/ mapreduce.html 20
  • 22. Map Output_List = Map (Input_List) Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36,49, 64, 81, 100) 21
  • 23. Reduce Output_Element = Reduce (Input_List) Sum (1, 4, 9, 16, 25, 36,49, 64, 81, 100) = 385 22
  • 24. Parallelism • Map is inherently parallel • Each list element processed independently • Reduce is inherently sequential • Unless processing multiple lists • Grouping to produce multiple lists 23
  • 25. Search Assist Map // Input: http://hadoop.apache.org Pairs = Tokenize_And_Pair ( Text ( Input ) ) // Example 24
  • 26. Search Assist Reduce // Input: GroupedList (word, GroupedList(words)) CountedPairs = CountOccurrences (word, RelatedWords) Output = { (hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4) (hadoop, mapreduce, 9) ... } 25
  • 27. Issues with Large Data • Map Parallelism: Chunking input data • Reduce Parallelism: Grouping related data • Dealing with failures load imbalance 26
  • 28.
  • 29. Apache Hadoop • January 2006: Subproject of Lucene • January 2008: Top-level Apache project • Stable Version: 1.0.3 • Latest Version: 2.0.0 (Alpha) 28
  • 30. Apache Hadoop • Reliable, Performant Distributed file system • MapReduce Programming framework • Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ... 29
  • 31. Problem: Bandwidth to Data • Scan 100TB Datasets on 1000 node cluster • Remote storage @ 10MB/s = 165 mins • Local storage @ 50-200MB/s = 33-8 mins • Moving computation is more efficient than moving data • Need visibility into data placement 30
  • 32. Problem: Scaling Reliably • Failure is not an option, it s a rule ! • 1000 nodes, MTBF 1 day • 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMS (16TB RAM) • Need fault tolerant store with reasonable availability guarantees • Handle hardware faults transparently 31
  • 33. Hadoop Goals • Scalable: Petabytes (10 15 Bytes) of data on thousands on nodes • Economical: Commodity components only • Reliable • Engineering reliability into every application is expensive 32
  • 35. Think MapReduce • Record = (Key, Value) • Key : Comparable, Serializable • Value: Serializable • Input, Map, Shuffle, Reduce, Output 34
  • 36. Seems Familiar ? cat /var/log/auth.log* | grep “session opened” | cut -d’ ‘ -f10 | sort | Samples // Code uniq -c ~/userlist 35
  • 37. Map • Input: (Key , Value ) 1 1 • Output: List(Key , Value ) 2 2 • Projections, Filtering, Transformation 36
  • 38. Shuffle • Input: List(Key , Value ) 2 2 • Output • Sort(Partition(List(Key , List(Value )))) 2 2 • Provided by Hadoop 37
  • 39. Reduce • Input: List(Key , List(Value )) 2 2 • Output: List(Key , Value ) 3 3 • Aggregation 38
  • 40. Hadoop Streaming • Hadoop is written in Java • Java MapReduce code is native • What about Non-Java Programmers ? • Perl, Python, Shell, R • grep, sed, awk, uniq as Mappers/Reducers • Text Input and Output 39
  • 41. Hadoop Streaming • Thin Java wrapper for Map Reduce Tasks • Forks actual Mapper Reducer • IPC via stdin, stdout, stderr • Key.toString() t Value.toString() n • Slower than Java programs • Allows for quick prototyping / debugging 40
  • 42. Hadoop Streaming bin/hadoop jar hadoop-streaming.jar $ -input in-files -output out-dir -mapper mapper.sh -reducer reducer.sh mapper.sh # // Code Samples sed -e 's/ /n/g' | grep . # reducer.sh uniq -c | awk '{print $2 t $1}' 41
  • 43. Hadoop Distributed File System (HDFS) 42
  • 44. HDFS • Data is organized into files and directories • Files are divided into uniform sized blocks (default 128MB) and distributed across cluster nodes • HDFS exposes block placement so that computation can be migrated to data 43
  • 45. HDFS • Blocks are replicated (default 3) to handle hardware failure • Replication for performance and fault tolerance (Rack-Aware placement) • HDFS keeps checksums of data for corruption detection and recovery 44
  • 47. HDFS Master (NameNode) • Manages filesystem namespace • File metadata (i.e. inode ) • Mapping inode to list of blocks + locations • Authorization Authentication • Checkpoint journal namespace changes 46
  • 48. Namenode • Mapping of datanode to list of blocks • Monitor datanode health • Replicate missing blocks • Keeps ALL namespace in memory • 60M objects (File/Block) in 16GB 47
  • 49. Datanodes • Handle block storage on multiple volumes block integrity • Clients access the blocks directly from data nodes • Periodically send heartbeats and block reports to Namenode • Blocks are stored as underlying OS s files 48
  • 51. Example: Unigrams • Input: Huge text corpus • Wikipedia Articles (40GB uncompressed) • Output: List of words sorted in descending order of frequency
  • 52. Unigrams $ cat ~/wikipedia.txt | sed -e 's/ /n/g' | grep . | sort | uniq -c ~/frequencies.txt $ cat ~/frequencies.txt | # cat | sort -n -k1,1 -r | # cat ~/unigrams.txt
  • 53. MR for Unigrams mapper (filename, file-contents): for each word in file-contents: emit (word, 1) reducer (word, values): sum = 0 for each value in values: sum = sum + value emit (word, sum)
  • 54. MR for Unigrams mapper (word, frequency): emit (frequency, word) reducer (frequency, words): for each word in words: emit (word, frequency)
  • 55. Unigrams: Java Mapper public static class MapClass extends MapReduceBase implements MapperLongWritable, Text, Text, IntWritable { public void map(LongWritable key, Text value, OutputCollectorText, IntWritable output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { Text word = new Text(itr.nextToken()); output.collect(word, new IntWritable(1)); } } }
  • 56. Unigrams: Java Reducer public static class Reduce extends MapReduceBase implements ReducerText, IntWritable, Text, IntWritable { public void reduce(Text key,IteratorIntWritable values, OutputCollectorText,IntWritable output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
  • 57. Unigrams: Driver public void run(String inputPath, String outputPath) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName(wordcount); conf.setMapperClass(MapClass.class); conf.setReducerClass(Reduce.class); FileInputFormat.addInputPath(conf, new Path(inputPath)); FileOutputFormat.setOutputPath(conf, new Path(outputPath)); JobClient.runJob(conf); }
  • 58. Configuration •  Unified Mechanism for •  Configuring Daemons •  Runtime environment for Jobs/Tasks •  Defaults: *-default.xml •  Site-Specific: *-site.xml •  final parameters
  • 59. Example configuration property namemapred.job.tracker/name valuehead.server.node.com:9001/value /property property namefs.default.name/name valuehdfs://head.server.node.com:9000/value /property property namemapred.child.java.opts/name value-Xmx512m/value finaltrue/final /property .... /configuration
  • 61. Running a Job [milindb@gateway ~]$ hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /data/newsarchive/20080923 /tmp/ newsoutinput.FileInputFormat: Total input paths to process : 4 mapred.JobClient: Running job: job_200904270516_5709 mapred.JobClient: map 0% reduce 0% mapred.JobClient: map 3% reduce 0% mapred.JobClient: map 7% reduce 0% .... mapred.JobClient: map 100% reduce 21% mapred.JobClient: map 100% reduce 31% mapred.JobClient: map 100% reduce 33% mapred.JobClient: map 100% reduce 66% mapred.JobClient: map 100% reduce 100% mapred.JobClient: Job complete: job_200904270516_5709
  • 62. Running a Job mapred.JobClient: Counters: 18 mapred.JobClient: Job Counters mapred.JobClient: Launched reduce tasks=1 mapred.JobClient: Rack-local map tasks=10 mapred.JobClient: Launched map tasks=25 mapred.JobClient: Data-local map tasks=1 mapred.JobClient: FileSystemCounters mapred.JobClient: FILE_BYTES_READ=491145085 mapred.JobClient: HDFS_BYTES_READ=3068106537 mapred.JobClient: FILE_BYTES_WRITTEN=724733409 mapred.JobClient: HDFS_BYTES_WRITTEN=377464307
  • 63. Running a Job mapred.JobClient: Map-Reduce Framework mapred.JobClient: Combine output records=73828180 mapred.JobClient: Map input records=36079096 mapred.JobClient: Reduce shuffle bytes=233587524 mapred.JobClient: Spilled Records=78177976 mapred.JobClient: Map output bytes=4278663275 mapred.JobClient: Combine input records=371084796 mapred.JobClient: Map output records=313041519 mapred.JobClient: Reduce input records=15784903
  • 83. Next Generation MapReduce 82
  • 84. MapReduce Today (Courtesy: Arun Murthy, Hortonworks)
  • 85. Why ? • Scalability Limitations today • Maximum cluster size: 4000 nodes • Maximum Concurrent tasks: 40,000 • Job Tracker SPOF • Fixed map and reduce containers (slots) • Punishes pleasantly parallel apps 84
  • 86. Why ? (contd) • MapReduce is not suitable for every application • Fine-Grained Iterative applications • HaLoop: Hadoop in a Loop • Message passing applications • Graph Processing 85
  • 87. Requirements • Need scalable cluster resources manager • Separate scheduling from resource management • Multi-Lingual Communication Protocols 86
  • 88. Bottom Line • @techmilind #mrng (MapReduce, Next Gen) is in reality, #rmng (Resource Manager, Next Gen) • Expect different programming paradigms to be implemented • Including MPI (soon) 87
  • 89. Architecture (Courtesy: Arun Murthy, Hortonworks)
  • 90. The New World •  Resource Manager •  Allocates resources (containers) to applications •  Node Manager •  Manages containers on nodes •  Application Master •  Specific to paradigm e.g. MapReduce application master, MPI application master etc 89
  • 91. Container • In current terminology: A Task Slot • Slice of the node s hardware resources • #of cores, virtual memory, disk size, disk and network bandwidth etc • Currently, only memory usage is sliced 90