Hadoop & MapReduce
          Dr. Ioannis Konstantinou
      http://www.cslab.ntua.gr/~ikons


           AWS Usergroup Greece
               18/07/2012


        Computing Systems Laboratory
 School of Electrical and Computer Engineering
    National Technical University of Athens
Big Data
90% of today's data was created in the last 2 years
Moore's-law-like growth: data volume doubles about every 18 months
YouTube: 13 million hours and 700 billion views in 2010
Facebook: 20TB/day (compressed)
CERN/LHC: 40TB/day (15PB/year)

Many more examples
Web logs, presentation files, medical files etc
Problem: Data explosion

   1 EB (Exabyte = 10^18 bytes) = 1000 PB (Petabyte = 10^15 bytes)
   Data traffic of mobile telephony in the USA in 2010

   1.2 ZB (Zettabyte) = 1200 EB
   Total of digital data in 2010

   35 ZB (Zettabyte = 10^21 bytes)
   Estimate for volume of total digital data in 2020
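The unit arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the `convert` helper is a hypothetical name introduced here, and exact fractions are used so that 1.2 ZB does not pick up floating-point noise.

```python
from fractions import Fraction

# Decimal (SI) storage units from the slide, expressed in bytes.
PB = 10**15
EB = 10**18
ZB = 10**21

def convert(amount, from_unit, to_unit):
    """Convert an amount (int or decimal string) between two byte units."""
    return Fraction(amount) * from_unit / to_unit

one_eb_in_pb = convert(1, EB, PB)            # 1 EB expressed in PB
total_2010_in_eb = convert("1.2", ZB, EB)    # 1.2 ZB expressed in EB
growth_2010_to_2020 = Fraction(35) / Fraction("1.2")  # 35 ZB vs 1.2 ZB

print(one_eb_in_pb, total_2010_in_eb, float(growth_2010_to_2020))
```

Under these figures the 2020 estimate is roughly 29 times the 2010 total.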
Solution: scalability


           How?
Source: Wikipedia (IBM Roadrunner)
Divide and Conquer
           “Problem”
                                  Partition


  w1          w2         w3

“worker”    “worker”   “worker”


   r1         r2         r3




           “Result”               Combine
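The partition / worker / combine picture above can be sketched on a single machine. The example below is an illustration only: it assumes the "problem" is summing a list of numbers, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(problem, n_parts):
    """Split a list into at most n_parts non-overlapping chunks."""
    size = max(1, -(-len(problem) // n_parts))  # ceiling division, at least 1
    return [problem[i:i + size] for i in range(0, len(problem), size)]

def worker(chunk):
    """Each worker solves its own piece: here, a partial sum."""
    return sum(chunk)

def combine(partial_results):
    """Merge the partial results r1, r2, r3 into the final result."""
    return sum(partial_results)

def divide_and_conquer(problem, n_workers=3):
    chunks = partition(problem, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker, chunks))
    return combine(partial_results)

print(divide_and_conquer(list(range(1, 101))))
```

The questions on the next slide (assignment, failures, completion tracking) are exactly what this toy version leaves to the thread pool.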
Parallelization challenges
• How to assign units of work to the workers?
• What if there are more units of work than workers?
• What if the workers need to share intermediate incomplete data?
• How do we aggregate such intermediate data?
• How do we know when all workers have completed their assignments?
• What if some workers fail?
What is MapReduce?
A programming model
A programming framework
Used to develop solutions that will
    Process large amounts of data in a parallelized fashion

    In clusters of computing nodes

Originally a closed-source implementation at Google
    Scientific papers of ’03 & ’04 describe the framework

Hadoop: open-source implementation of the algorithms described in
  the scientific papers
    http://hadoop.apache.org/
What is Hadoop?
• 2 large subsystems, 1 for data management & 1 for computation:
     HDFS (Hadoop Distributed File System)
     MapReduce computation framework runs above HDFS
     HDFS is essentially the I/O of Hadoop
• Written in Java: a set of Java processes running on multiple nodes
• Who uses it:
     Yahoo!
     Amazon
     Facebook
     Twitter
     Plus many more...
HDFS – distributed file system

• A scalable distributed file system for applications dealing with
  large data sets.
    Distributed: runs in a cluster
    Scalable: 10K nodes, 100K files, 10PB storage
• Storage space is seamless for the whole cluster
• Files are broken into blocks
• Typical block size: 128 MB.
• Replication: each block is copied to multiple DataNodes.
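To make the block and replication bullets concrete, here is a small sketch that splits a file into 128 MB blocks and assigns replicas to DataNodes. The replication factor of 3 and the round-robin placement are assumptions made for illustration; the slide states neither, and real HDFS placement is more involved.

```python
BLOCK_SIZE = 128 * 1024**2   # 128 MB, the typical block size from the slide
REPLICATION = 3              # assumed replication factor, not stated on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is broken into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])

def place_replicas(n_blocks, data_nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct DataNodes.

    Round-robin is a hypothetical policy used only to keep the sketch short.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }

blocks = split_into_blocks(300 * 1024**2)               # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
raw_bytes_stored = sum(blocks) * REPLICATION            # cluster-wide footprint
```

With these assumptions a 300 MB file becomes three blocks (128 MB, 128 MB, 44 MB) and occupies 900 MB of raw cluster storage.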
Architecture of HDFS/MapReduce
• Master/Slave scheme
    HDFS: A central NameNode administers multiple DataNodes
        NameNode: holds information about which DataNode holds which file blocks
        DataNodes: “dumb” servers that hold raw file chunks
    MapReduce: A central JobTracker administers multiple TaskTrackers
• NameNode and JobTracker run on the master
• DataNode and TaskTracker run on the slaves
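The split between metadata (NameNode) and raw chunks (DataNodes) can be illustrated with a toy in-memory model. Everything below is a sketch under stated assumptions: the class and function names are hypothetical, blocks are placed round-robin, and replication is left out for brevity.

```python
class DataNode:
    """Storage server that only keeps raw chunks, indexed by block id."""
    def __init__(self, name):
        self.name = name
        self.chunks = {}  # block_id -> bytes

class NameNode:
    """Keeps metadata only: which DataNode holds which block of which file."""
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.block_map = {}  # filename -> [(block_id, DataNode), ...]

    def allocate(self, filename, n_blocks):
        entries = []
        for i in range(n_blocks):
            node = self.data_nodes[i % len(self.data_nodes)]  # assumed round-robin
            entries.append((f"{filename}#{i}", node))
        self.block_map[filename] = entries
        return entries

    def locate(self, filename):
        return self.block_map[filename]

def put(name_node, filename, data, block_size):
    """Client side: ask the NameNode where to write, then send chunks to DataNodes."""
    n_blocks = -(-len(data) // block_size)  # ceiling division
    for i, (block_id, node) in enumerate(name_node.allocate(filename, n_blocks)):
        node.chunks[block_id] = data[i * block_size:(i + 1) * block_size]

def get(name_node, filename):
    """Client side: look up block locations, then read the chunks in order."""
    return b"".join(node.chunks[bid] for bid, node in name_node.locate(filename))

nodes = [DataNode("dn1"), DataNode("dn2")]
master = NameNode(nodes)
put(master, "f.txt", b"abcdefghij", block_size=4)
```

Note that the file contents never pass through the `NameNode` object; it only answers "where" questions.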
MapReduce
The problem is broken down into 2 phases.
   ● Map: Non-overlapping sets of input data
       (<key,value> records) are assigned to different
       processes (mappers) that produce a set of
       intermediate <key,value> results
   ● Reduce: The output of the Map phase is fed to a typically
       smaller number of processes (reducers) that
       aggregate the intermediate results into a smaller number of
       <key,value> records.
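The two phases can be sketched as a single-process function that takes a user-supplied map function and reduce function. This is a minimal illustration, not Hadoop's API: `map_reduce` and the maximum-temperature example are assumptions introduced here.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map over every input record, group by key, then reduce each group."""
    intermediate = defaultdict(list)
    for key, value in records:                       # Map phase
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    return {                                         # Reduce phase
        key: reduce_fn(key, values)
        for key, values in sorted(intermediate.items())
    }

def max_temp_map(line_no, line):
    """Input record: (line number, 'city temperature')."""
    city, temp = line.split()
    yield city, int(temp)

def max_temp_reduce(city, temps):
    return max(temps)

readings = [(1, "athens 31"), (2, "patras 30"), (3, "athens 35"), (4, "patras 28")]
hottest = map_reduce(readings, max_temp_map, max_temp_reduce)
print(hottest)
```

Four input records are reduced to two output records, one per city.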
How does it work?
Initialization phase
Input is uploaded to HDFS and is split into pieces of
 fixed size
Each TaskTracker node that participates in the
 computation executes a copy of the MapReduce
 program
One of the nodes plays the JobTracker master role.
 This node will assign tasks to the rest (workers). Tasks
 can either be of type map or reduce.
JobTracker (Master)
The JobTracker holds data about:
  Status of tasks
  Location of input, output and intermediate data (runs
    together with the NameNode, the HDFS master)
The master is responsible for scheduling the execution of
 work tasks.
TaskTracker (Slave)
The TaskTracker runs tasks assigned by the master.
It runs on the same node as the DataNode (HDFS slave)
A task can be either of type Map or of type Reduce
Typically the maximum number of concurrent tasks
 that a node can run is equal to the number of
 CPU cores it has (aiming at full CPU utilization)
Map task
• A worker (TaskTracker) that has been assigned a map task
    ● Reads the relevant input data (input split) from HDFS, parses it into <key, value>
        pairs and passes them as input to the map function.
    ● The map function processes the pairs and produces intermediate pairs that are
        buffered in memory.
    ● Periodically a partition function is executed which stores the intermediate key-value
        pairs in the local node storage, grouping them into R sets (one per reduce task).
        This function is user defined.
    ● When the partition function has finished storing the key-value pairs, the worker
        informs the master that the task is complete and where the data are stored.
    ● The master forwards this information to the workers that run the reduce tasks
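The grouping of intermediate pairs into R sets can be sketched as follows. This is an illustration only: hashing the key modulo R is a common choice but is an assumption here, the function names are hypothetical, and local disk storage is replaced by in-memory dictionaries.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Map a key to one of R sets; crc32 gives the same answer on every run."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_map_task(input_split, map_fn, num_reducers):
    """Apply map_fn to one input split and group its output into R buckets."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in input_split:
        for out_key, out_value in map_fn(key, value):
            buckets[partition(out_key, num_reducers)][out_key].append(out_value)
    # In a real cluster each bucket would be written to local storage and its
    # location reported to the master; here the buckets are simply returned.
    return [dict(bucket) for bucket in buckets]

def emit_words(doc_name, text):
    for word in text.split():
        yield word, 1

buckets = run_map_task([("d1", "a b a"), ("d2", "c b")], emit_words, num_reducers=2)
```

Because the bucket depends only on the key, every mapper sends a given key to the same reducer.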
Reduce task
• A worker that has been assigned a reduce task
    Reads, from every map task that has been executed, the pairs that
      belong to it, using the locations provided by the master.
    When all intermediate pairs have been retrieved they are sorted by
      key. Entries with the same key are grouped together.
    The reduce function is executed on the <key, group_of_values> pairs
      that resulted from the previous step.
    The reduce task processes the input data and produces the final pairs.
    The output pairs are appended to a file in the local file system. When the
      reduce task is completed the file becomes available in the distributed file
      system.
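The fetch, sort, group and reduce steps above can be sketched in a few lines. The sketch assumes the map outputs are already available as in-memory lists of pairs; `run_reduce_task` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(map_outputs, reduce_fn):
    """Merge this reducer's pairs from every map task, sort, group, reduce."""
    # 1. Collect the pairs destined for this reducer from every map output.
    pairs = [pair for output in map_outputs for pair in output]
    # 2. Sort by key so that equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # 3. Group equal keys and call reduce on <key, group_of_values>.
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append((key, reduce_fn(key, values)))
    return results

map_outputs = [[("b", 1), ("a", 2)], [("a", 3)], [("b", 4), ("c", 5)]]
totals = run_reduce_task(map_outputs, lambda key, values: sum(values))
print(totals)
```

Sorting first is what lets `groupby` see all values of a key in one run.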
Task Completion
When a worker has completed its task it informs
 the master.
When all workers have informed the master,
 the master returns control to the user's original
 program.
Example
Figure: the Master coordinates three Map workers, one per input part (Part 1, Part 2, Part 3), whose results feed three Reduce workers that write the Output.
MapReduce
Example: Word count 1/3

• Objective: measure the frequency of appearance of words in a large set
  of documents
• Potential use case: discovery of popular URLs in a set of webserver
  log files
• Implementation plan:
    “Upload” documents on MapReduce
    Author a map function
    Author a reduce function
    Run a MapReduce task
    Retrieve results
Example: Word count 2/3
map(key, value):
// key: document name; value: text of document
     for each word w in value:
         emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
   result = 0
   for each count v in values:
      result += v
   emit(key, result)
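A runnable Python version of the pseudocode above might look like the sketch below. The small `word_count` driver stands in for the framework and is an assumption of this example, as is the choice to lowercase words before counting.

```python
import re
from collections import defaultdict

def map_fn(doc_name, text):
    """key: document name; value: text of the document."""
    for word in re.findall(r"\w+", text.lower()):
        yield word, 1

def reduce_fn(word, counts):
    """key: a word; counts: an iterable of partial counts."""
    return word, sum(counts)

def word_count(documents):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            groups[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in sorted(groups.items()))

print(word_count({"d1": "the cat", "d2": "The dog the end"}))
```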
Example: Word count 3/3
M=3 mappers, R=2 reducers

Mapper 1 input:  (d1, ‘w1 w2 w4’)  (d2, ‘w1 w2 w3 w4’)  (d3, ‘w2 w3 w4’)
Mapper 1 output: (w1,2) (w2,3) (w3,2) (w4,3)

Mapper 2 input:  (d4, ‘w1 w2 w3’)  (d5, ‘w1 w3 w4’)  (d6, ‘w1 w4 w2 w2’)  (d7, ‘w4 w2 w1’)
Mapper 2 output: (w1,4) (w2,4) (w3,2) (w4,3)

Mapper 3 input:  (d8, ‘w2 w2 w3’)  (d9, ‘w1 w1 w3 w3’)  (d10, ‘w2 w1 w4 w3’)
Mapper 3 output: (w1,3) (w2,3) (w3,4) (w4,1)

Reducer 1 input:  (w1,2) (w2,3) (w1,4) (w2,4) (w1,3) (w2,3)
Reducer 1 output: (w1,9) (w2,10)

Reducer 2 input:  (w3,2) (w4,3) (w3,2) (w4,3) (w3,4) (w4,1)
Reducer 2 output: (w3,8) (w4,7)
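The counts for this example can be recomputed directly from the ten documents. The sketch below assumes each mapper sums its counts locally before handing them over, and that reducer 1 owns w1 and w2 while reducer 2 owns w3 and w4; the names `SPLITS` and `REDUCER_KEYS` are introduced here for illustration.

```python
from collections import Counter

# One input split per mapper (M=3), using the documents from the slide.
SPLITS = [
    {"d1": "w1 w2 w4", "d2": "w1 w2 w3 w4", "d3": "w2 w3 w4"},
    {"d4": "w1 w2 w3", "d5": "w1 w3 w4", "d6": "w1 w4 w2 w2", "d7": "w4 w2 w1"},
    {"d8": "w2 w2 w3", "d9": "w1 w1 w3 w3", "d10": "w2 w1 w4 w3"},
]
# Keys owned by each reducer (R=2); an assumed, fixed partitioning.
REDUCER_KEYS = [{"w1", "w2"}, {"w3", "w4"}]

def mapper(split):
    """Count the words of one split, summing locally on the mapper."""
    counts = Counter()
    for text in split.values():
        counts.update(text.split())
    return counts

def reducer(keys, mapper_outputs):
    """Sum the partial counts of the keys this reducer is responsible for."""
    totals = Counter()
    for output in mapper_outputs:
        for word, count in output.items():
            if word in keys:
                totals[word] += count
    return dict(totals)

mapper_outputs = [mapper(split) for split in SPLITS]
results = [reducer(keys, mapper_outputs) for keys in REDUCER_KEYS]
for reducer_id, result in enumerate(results, start=1):
    print(reducer_id, sorted(result.items()))
```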
Extra functions
Locality

Move computation near the data: The master tries to
 have a task executed on a worker that is as “near” as
 possible to the input data, thus reducing the
 bandwidth usage
  How does the master know?
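One way to picture locality-aware assignment is the sketch below. It assumes the master can look up which nodes hold a replica of the input block and which rack each node sits in; the rack level and all names here are assumptions added for illustration.

```python
def pick_worker(block_locations, idle_workers, rack_of):
    """Choose the idle worker "nearest" to the input block.

    block_locations: set of nodes holding a replica of the input block
    idle_workers:    list of nodes with a free task slot
    rack_of:         dict mapping node name -> rack name (assumed topology)
    """
    if not idle_workers:
        raise ValueError("no idle worker available")
    # Best case: the data is already on the worker's own disk.
    for worker in idle_workers:
        if worker in block_locations:
            return worker, "node-local"
    # Next best: a worker in the same rack as some replica.
    replica_racks = {rack_of[node] for node in block_locations}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker, "rack-local"
    # Otherwise the block has to cross the network.
    return idle_workers[0], "remote"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_worker({"n1", "n3"}, ["n2", "n3"], rack_of))
```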
Task distribution
The number of tasks is usually higher than the
 number of the available workers
One worker can execute more than one task
The balance of work load is improved. In the case
 of a single worker failure there is faster recovery
 and redistribution of tasks to other nodes.
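The load-balancing effect of having more tasks than workers can be illustrated with a small simulation. It assumes a greedy scheduler that hands the next task to whichever worker becomes free first; the task durations are made-up numbers.

```python
import heapq

def makespan(task_durations, n_workers):
    """Total running time when each task goes to the first worker to become free."""
    free_at = [0] * n_workers          # time at which each worker is next idle
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)

# The same 120 units of work on 3 workers, cut coarsely and finely.
coarse = makespan([60, 30, 30], n_workers=3)   # one uneven task per worker
fine = makespan([10] * 12, n_workers=3)        # many small tasks
print(coarse, fine)
```

With three uneven tasks the job takes as long as the largest one; with twelve small tasks the work spreads evenly, and a failed worker would only force its current small task to be rerun.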
Redundant task executions
Some tasks can be delayed, resulting in a delay in the
 overall work execution
The solution is to create copies of a slow task
 that are executed in parallel by 2 or more
 different workers (speculative execution)
A task is considered complete when the master is
 informed about its completion by at least one node.
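Speculative execution can be imitated locally with threads: launch several copies of the same task and take whichever finishes first. This is a sketch only; the `delays` argument simulates slow workers and is an assumption of the example, and unlike a real cluster the slower copy is not killed, it is simply ignored.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_speculatively(task, arg, delays):
    """Run one copy of `task` per entry in `delays`; return the first result."""
    def attempt(delay):
        time.sleep(delay)              # simulated slowness of this worker
        return task(arg)

    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(attempt, delay) for delay in delays]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)          # do not wait for the straggler
    return next(iter(done)).result()

# One straggling copy (0.3 s) and one fast copy (0.01 s) of the same task.
print(run_speculatively(lambda x: x * x, 5, delays=[0.3, 0.01]))
```

This only works because the task is deterministic: every copy produces the same output, so it does not matter which one wins.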
Partitioning
A user can specify a custom function that
 partitions the intermediate keys among the reducers during shuffling.
The type of input and output data can be defined by
 the user and has no limitation on what form it should
 have.
The input of a reducer is always sorted
There is the possibility to execute tasks locally in a
  serial manner
The master provides web interfaces for
  Monitoring task progress
  Browsing HDFS
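As an illustration of the custom partition function mentioned at the top of this slide, the sketch below sends all URLs of the same host to the same reducer. It assumes the intermediate keys are URLs; both function names are hypothetical.

```python
import zlib
from urllib.parse import urlparse

def default_partition(key, num_reducers):
    """Generic choice: hash the whole key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def host_partition(url, num_reducers):
    """Custom choice: hash only the hostname, so one host maps to one reducer."""
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_reducers

urls = ["http://example.org/a", "http://example.org/b?x=1", "http://example.org/c"]
print({host_partition(url, 4) for url in urls})   # a single reducer index
```

With the default partitioner the three URLs could land on different reducers; with the host-based one, each host's entries end up in the same output file.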
When should I use it?
• Good choice for jobs that can be broken into independent parallel tasks:
    – Indexing/Analysis of log files
    – Sorting of large data sets
    – Image processing

• Bad choice for serial or low-latency jobs:
    – Computation of the number π to a precision of 1,000,000 digits
    – Computation of the Fibonacci sequence
    – Replacing MySQL
Use cases 1/3
        • Large-scale image conversions
        • 100 Amazon EC2 instances, 4TB raw TIFF data
        • 11 million PDFs in 24 hours for $240

        • Internal log processing
        • Reporting, analytics and machine learning
        • Cluster of 1110 machines, 8800 cores and 12PB
          raw storage
        • Open source contributors (Hive)

        • Store and process tweets, logs, etc.
        • Open source contributors (hadoop-lzo)
        • Large-scale machine learning
Use cases 2/3
        • 100,000 CPUs in 25,000 computers
        • Content/Ads optimization, search index
        • Machine learning (e.g. spam filtering)
        • Open source contributors (Pig)

        • Natural language search (through
          Powerset)
        • 400 nodes in EC2, storage in S3
        • Open source contributors (!) to HBase

        • ElasticMapReduce service
        • On-demand elastic Hadoop clusters for the
          Cloud
Use cases 3/3
        • ETL processing, statistics generation
        • Advanced algorithms for behavioral
          analysis and targeting

        • Used for discovering People You May Know,
          and for other apps
        • 3x30 node cluster, 16GB RAM and 8TB
          storage

        • Leading Chinese language search engine
        • Search log analysis, data mining
        • 300TB per week
        • 10 to 500 node clusters
Amazon ElasticMapReduce (EMR)
• A hosted Hadoop-as-a-service solution provided by AWS
• No need for management or tuning of Hadoop clusters
     ● upload your input data, store your output data on S3
     ● procure as many EC2 instances as you need and only pay for the
         time you use them
• Hive and Pig support makes it easy to write data analysis scripts
• Java, Perl, Python, PHP, C++ for more sophisticated algorithms
• Integrates with DynamoDB (process combined datasets in S3 &
  DynamoDB)
• Support for HBase (NoSQL)
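For the scripting languages listed above, a job is typically written as two small programs that read and write text lines. The sketch below assumes the Hadoop Streaming convention of one tab-separated key/value pair per line, with the framework sorting the mapper output by key before the reducer sees it; the `mapper` and `reducer` function names are illustrative.

```python
from itertools import groupby

def mapper(lines):
    """Turn raw input lines into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive records that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run: sorted() stands in for the framework's shuffle and sort.
output = list(reducer(sorted(mapper(["a b a", "b c"]))))
print(output)
```

In an actual streaming job each function would be wrapped in its own script that reads `sys.stdin` and prints its records, and the two scripts would be uploaded alongside the input data.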
Questions

Hadoop & MapReduce

  • 1. Hadoop & MapReduce Dr. Ioannis Konstantinou http://www.cslab.ntua.gr/~ikons AWS Usergroup Greece 18/07/2012 Computing Systems Laboratory School of Electrical and Computer Engineering National Technical University of Athens
  • 2. Big Data
      ● 90% of today's data was created in the last 2 years
      ● Moore's law: data volume doubles every 18 months
      ● YouTube: 13 million hours and 700 billion views in 2010
      ● Facebook: 20 TB/day (compressed)
      ● CERN/LHC: 40 TB/day (15 PB/year)
      ● Many more examples: web logs, presentation files, medical files, etc.
  • 3. Problem: Data explosion
      ● 1 EB (exabyte = 10^18 bytes) = 1000 PB (petabyte = 10^15 bytes): data traffic of mobile telephony in the USA in 2010
      ● 1.2 ZB (zettabyte) = 1200 EB: total of digital data in 2010
      ● 35 ZB (zettabyte = 10^21 bytes): estimated volume of total digital data in 2020
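  • 6. Divide and Conquer (diagram)
      ● Partition: the “Problem” is split into work units w1, w2, w3
      ● Each work unit is handled by a “worker”, producing partial results r1, r2, r3
      ● Combine: the partial results are merged into the “Result”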
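The partition / worker / combine pattern in this diagram can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `partition`, `worker`, `combine` and `divide_and_conquer` are hypothetical, the "problem" is summing a list, and threads on one machine stand in for workers on separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(problem, n_workers):
    """Split a list into at most n_workers non-overlapping chunks."""
    size = max(1, -(-len(problem) // n_workers))  # ceiling division
    return [problem[i:i + size] for i in range(0, len(problem), size)]

def worker(chunk):
    """Each worker solves its sub-problem independently (here: a partial sum)."""
    return sum(chunk)

def combine(partial_results):
    """Merge the partial results r1, r2, r3 into the final result."""
    return sum(partial_results)

def divide_and_conquer(problem, n_workers=3):
    chunks = partition(problem, n_workers)
    # Threads stand in for the "workers" of the diagram (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return combine(partials)

total = divide_and_conquer(list(range(1, 101)))
print(total)
```

The sketch deliberately ignores the hard parts (failed workers, shared intermediate data, uneven chunks), which is what the rest of the deck addresses.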
  • 7. Parallelization challenges
      ● How do we assign units of work to the workers?
      ● What if there are more units of work than workers?
      ● What if the workers need to share intermediate, incomplete data?
      ● How do we aggregate such intermediate data?
      ● How do we know when all workers have completed their assignments?
      ● What if some workers fail?
  • 8. What is MapReduce?
      ● A programming model
      ● A programming framework
      ● Used to develop solutions that process large amounts of data in a parallelized fashion, in clusters of computing nodes
      ● Originally a closed-source implementation at Google; scientific papers of ’03 & ’04 describe the framework
      ● Hadoop: open-source implementation of the algorithms described in the scientific papers (http://hadoop.apache.org/)
  • 9. What is Hadoop?
      ● 2 large subsystems, 1 for data management & 1 for computation:
          ● HDFS (Hadoop Distributed File System)
          ● The MapReduce computation framework, which runs on top of HDFS
      ● HDFS is essentially the I/O of Hadoop
      ● Written in Java: a set of Java processes running on multiple nodes
      ● Who uses it: Yahoo!, Amazon, Facebook, Twitter, plus many more...
  • 10. HDFS – distributed file system
      ● A scalable distributed file system for applications dealing with large data sets
      ● Distributed: runs in a cluster
      ● Scalable: 10K nodes, 100K files, 10 PB storage
      ● Storage space is seamless for the whole cluster
      ● Files are broken into blocks; typical block size: 128 MB
      ● Replication: each block is copied to multiple DataNodes
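The block and replication arithmetic can be sketched as follows. The 128 MB block size comes from the slide; the replication factor of 3 and the function name `block_layout` are assumptions made for illustration only.

```python
BLOCK_SIZE = 128 * 1024**2   # 128 MB, the typical block size quoted on the slide
REPLICATION = 3              # assumed replication factor (not stated on the slide)

def block_layout(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return how many blocks a file occupies and what its replicas cost."""
    n_blocks = -(-file_size_bytes // block_size)  # ceiling division
    return {
        "blocks": n_blocks,
        "block_replicas": n_blocks * replication,
        "raw_bytes": file_size_bytes * replication,
    }

# A 1 GB file under these assumptions: 8 blocks, 24 block replicas in the cluster.
print(block_layout(1024**3))
```

The last block of a file is usually shorter than the block size, which is why raw storage is computed from the file size rather than from the block count.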
  • 11. Architecture of HDFS/MapReduce
      ● Master/Slave scheme
      ● HDFS: a central NameNode administers multiple DataNodes
          ● NameNode: holds information about which DataNode holds which files
          ● DataNodes: «dummy» servers that hold raw file chunks
      ● MapReduce: a central JobTracker administers multiple TaskTrackers
      ● The NameNode and the JobTracker run on the master
      ● The DataNodes and the TaskTrackers run on the slaves
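As a rough mental model, the NameNode's metadata can be pictured as a map from block to the DataNodes holding a replica. The sketch below uses a simple round-robin placement, which is an assumption chosen for brevity and not the placement policy HDFS actually applies; `place_blocks` and `blocks_on` are hypothetical helper names.

```python
def place_blocks(n_blocks, datanodes, replication=3):
    """Build a NameNode-style map: block id -> DataNodes holding a replica.

    Round-robin placement is an illustrative assumption, not HDFS's real policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    block_map = {}
    for block_id in range(n_blocks):
        block_map[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map

def blocks_on(block_map, node):
    """Inverse lookup: which blocks does a given DataNode store?"""
    return sorted(b for b, nodes in block_map.items() if node in nodes)

block_map = place_blocks(4, ["dn1", "dn2", "dn3", "dn4"])
print(block_map[0])                 # replicas of block 0
print(blocks_on(block_map, "dn1"))  # blocks stored on dn1
```

The DataNodes only store chunks; the mapping itself lives on the master, which is why the NameNode is the component that can answer "where is this data?".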
  • 12. MapReduce: the problem is broken down into 2 phases
      ● Map: non-overlapping sets of input data (<key,value> records) are assigned to different processes (mappers) that produce a set of intermediate <key,value> results
      ● Reduce: the output of the Map phase is fed to a typically smaller number of processes (reducers) that aggregate it into a smaller number of <key,value> records
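The two phases can be imitated in memory with a tiny driver. This is a single-process sketch, not Hadoop's API: `run_mapreduce` is a hypothetical helper, and the example job (total bytes served per URL from made-up log lines) is invented purely for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process imitation of the Map and Reduce phases."""
    # Map phase: every input <key, value> record yields intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Group intermediate values by key (done by the framework between the phases).
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: each key and its values are aggregated into output records.
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output

# Hypothetical job: total bytes served per URL; the log lines are made up.
def map_fn(line_no, line):
    url, size = line.split()
    yield url, int(size)

def reduce_fn(url, sizes):
    yield url, sum(sizes)

logs = [(0, "/index.html 120"), (1, "/about.html 80"), (2, "/index.html 30")]
print(run_mapreduce(logs, map_fn, reduce_fn))
```

Because the user supplies only `map_fn` and `reduce_fn`, the same driver can run any job that fits the model; distribution, grouping and fault handling stay inside the framework.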
  • 13. How does it work?
  • 14. Initialization phase
      ● Input is uploaded to HDFS and split into pieces of fixed size
      ● Each TaskTracker node that participates in the computation executes a copy of the MapReduce program
      ● One of the nodes plays the JobTracker (master) role. This node assigns tasks to the rest (the workers). Tasks are either of type map or of type reduce.
  • 15. JobTracker (Master)
      ● The JobTracker holds data about:
          ● The status of tasks
          ● The location of input, output and intermediate data (it runs together with the NameNode, the HDFS master)
      ● The master is responsible for scheduling the execution of work tasks
  • 16. TaskTracker (Slave)
      ● The TaskTracker runs tasks assigned by the master
      ● It runs on the same node as the DataNode (the HDFS slave)
      ● A task is either of type Map or of type Reduce
      ● Typically, the maximum number of concurrent tasks a node can run equals the number of CPU cores it has (achieving optimal CPU utilization)
  • 17. Map task: a worker (TaskTracker) that has been assigned a map task
      ● Reads the relevant input data (input split) from HDFS, parses it into <key, value> pairs and passes them as input to the map function.
      ● The map function processes the pairs and produces intermediate pairs that are accumulated in memory.
      ● Periodically a partition function is executed, which stores the intermediate key-value pairs in local node storage while grouping them into R sets. This function is user-defined.
      ● When the partition function has finished storing the key-value pairs, the worker informs the master that the task is complete and where the data are stored.
      ● The master forwards this information to the workers that run the reduce tasks.
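A partition function that groups intermediate pairs into R sets can be sketched like this. Hashing the key modulo R is a common choice; the use of `zlib.crc32` (picked here because it is stable across Python runs) and the helper names `partition` and `spill` are assumptions of this sketch.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Map a key to one of R reducer sets: stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def spill(intermediate_pairs, num_reducers):
    """Group buffered intermediate pairs into R sets, as a map task does before
    writing them to local storage (a dict stands in for the local files)."""
    buckets = defaultdict(list)
    for key, value in intermediate_pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return dict(buckets)

pairs = [("w1", 1), ("w2", 1), ("w1", 1), ("w3", 1)]
print(spill(pairs, num_reducers=2))
```

The property that matters is determinism: every mapper must send a given key to the same reducer, otherwise a key's values would be split across reducers and aggregated incorrectly.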
  • 18. Reduce task: a worker that has been assigned a reduce task
      ● Reads, from every map task that has been executed, the pairs that belong to it, using the locations supplied by the master.
      ● When all intermediate pairs have been retrieved, they are sorted by key. Entries with the same key are grouped together.
      ● The reduce function is executed on the <key, group_of_values> pairs produced by the previous step.
      ● The reduce task processes the input data and produces the final pairs.
      ● The output pairs are appended to a file in the local file system. When the reduce task completes, the file becomes available in the distributed file system.
  • 19. Task Completion
      ● When a worker has completed its task, it informs the master.
      ● When all workers have informed the master, the master returns control to the user's original program.
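The sort-then-group step of the reduce task can be sketched with the standard library. `group_by_key` is a hypothetical helper, and the fetched pairs are assumed to fit in memory, which a real reducer cannot rely on.

```python
from itertools import groupby
from operator import itemgetter

def group_by_key(fetched_pairs):
    """Sort the fetched intermediate pairs by key, then group equal keys,
    producing the <key, group_of_values> input of the reduce function."""
    ordered = sorted(fetched_pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

fetched = [("w2", 3), ("w1", 2), ("w2", 4), ("w1", 3)]
for key, values in group_by_key(fetched):
    print(key, sum(values))  # the reduce function applied to each group
```

Sorting first is what makes the grouping a single linear pass: `groupby` only merges adjacent equal keys.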
  • 20. Example (diagram)
      ● The Master coordinates the workers
      ● The Input is split into Part 1, Part 2 and Part 3, each processed by a Map worker
      ● Reduce workers consume the map results and write the Output
  • 22. Example: Word count 1/3
      ● Objective: measure the frequency of appearance of words in a large set of documents
      ● Potential use case: discovery of popular URLs in a set of web server log files
      ● Implementation plan:
          ● “Upload” the documents to MapReduce
          ● Author a map function
          ● Author a reduce function
          ● Run a MapReduce task
          ● Retrieve the results
  • 23. Example: Word count 2/3
      map(key, value):
          // key: document name; value: text of document
          for each word w in value:
              emit(w, 1)

      reduce(key, values):
          // key: a word; values: an iterator over counts
          result = 0
          for each count v in values:
              result += v
          emit(key, result)
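The pseudocode translates almost line for line into runnable Python. This is a single-process sketch: `word_count` is a hypothetical driver that plays the framework's role, and splitting on whitespace is an assumed definition of "word".

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """key: document name; value: text of the document."""
    for word in text.split():  # assumption: words are whitespace-separated
        yield word, 1

def reduce_fn(word, counts):
    """key: a word; values: an iterable of counts."""
    yield word, sum(counts)

def word_count(documents):
    """Hypothetical driver standing in for the framework: map, group, reduce."""
    groups = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            groups[word].append(count)
    return dict(pair for word in sorted(groups) for pair in reduce_fn(word, groups[word]))

print(word_count({"d1": "w1 w2 w4", "d2": "w1 w2 w3 w4"}))
```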
  • 24. Example: Word count 3/3 (M=3 mappers, R=2 reducers)
      ● Mapper 1 input: (d1, 'w1 w2 w4'), (d2, 'w1 w2 w3 w4'), (d3, 'w2 w3 w4'); output: (w1,2) (w2,3) (w3,2) (w4,3)
      ● Mapper 2 input: (d4, 'w1 w2 w3'), (d5, 'w1 w3 w4'), (d6, 'w1 w4 w2 w2'), (d7, 'w4 w2 w1'); output: (w1,4) (w2,4) (w3,2) (w4,3)
      ● Mapper 3 input: (d8, 'w2 w2 w3'), (d9, 'w1 w1 w3 w3'), (d10, 'w2 w1 w4 w3'); output: (w1,3) (w2,3) (w3,4) (w4,1)
      ● Reducer 1 input: (w1,2) (w2,3) (w1,4) (w2,4) (w1,3) (w2,3); output: (w1,9) (w2,10)
      ● Reducer 2 input: (w3,2) (w4,3) (w3,2) (w4,3) (w3,4) (w4,1); output: (w3,8) (w4,7)
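The ten-document example can be replayed in code. The sketch below assumes the split shown in the figure (d1 to d3, d4 to d7 and d8 to d10 on the three mappers; w1 and w2 routed to the first reducer, w3 and w4 to the second) and pre-aggregates counts inside each mapper, as the figure does. `run_mapper`, `reducer_for` and `run_job` are hypothetical names.

```python
from collections import Counter

DOCS = {
    "d1": "w1 w2 w4", "d2": "w1 w2 w3 w4", "d3": "w2 w3 w4",
    "d4": "w1 w2 w3", "d5": "w1 w3 w4", "d6": "w1 w4 w2 w2", "d7": "w4 w2 w1",
    "d8": "w2 w2 w3", "d9": "w1 w1 w3 w3", "d10": "w2 w1 w4 w3",
}
# Assumed input splits, one per mapper (M=3).
SPLITS = [["d1", "d2", "d3"], ["d4", "d5", "d6", "d7"], ["d8", "d9", "d10"]]

def run_mapper(doc_names):
    """Count words over one input split (map plus local aggregation)."""
    counts = Counter()
    for name in doc_names:
        counts.update(DOCS[name].split())
    return counts

def reducer_for(word):
    """Assumed partitioning (R=2): w1, w2 -> reducer 0; w3, w4 -> reducer 1."""
    return 0 if word in ("w1", "w2") else 1

def run_job():
    mapper_outputs = [run_mapper(split) for split in SPLITS]
    reducer_inputs = [[], []]
    for output in mapper_outputs:
        for word, n in output.items():
            reducer_inputs[reducer_for(word)].append((word, n))
    results = []
    for pairs in reducer_inputs:
        totals = Counter()
        for word, n in pairs:
            totals[word] += n
        results.append(dict(sorted(totals.items())))
    return mapper_outputs, results

mapper_outputs, results = run_job()
print(results)
```

Counting directly from the ten documents, the first reducer ends with w1=9 and w2=10, and the second with w3=8 and w4=7.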
  • 26. Locality
      ● Move computation near the data: the master tries to have a task executed on a worker that is as “near” as possible to the input data, thus reducing bandwidth usage
      ● How does the master know?
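One way to picture locality-aware assignment: the master knows where the replicas of each input split live (that information is held on the master, alongside the NameNode) and prefers an idle worker on one of those nodes, then one in the same rack, then anything else. The three-level preference, the rack table and the name `pick_worker` are assumptions of this sketch, not the JobTracker's actual scheduling code.

```python
def pick_worker(replica_nodes, idle_workers, rack_of):
    """Choose the idle worker 'nearest' to a split's replicas.

    replica_nodes: nodes holding a replica of the input split
    idle_workers:  workers currently free to take a task
    rack_of:       node -> rack id (hypothetical topology table)
    """
    for worker in idle_workers:              # 1. same node: no network transfer
        if worker in replica_nodes:
            return worker, "node-local"
    replica_racks = {rack_of[n] for n in replica_nodes}
    for worker in idle_workers:              # 2. same rack: traffic stays in the rack
        if rack_of[worker] in replica_racks:
            return worker, "rack-local"
    if idle_workers:                         # 3. anywhere else
        return idle_workers[0], "remote"
    return None, "none"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_worker(["n1"], ["n3", "n2"], rack_of))
```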
  • 27. Task distribution
      ● The number of tasks is usually higher than the number of available workers
      ● One worker can execute more than one task
      ● The balance of the work load is improved
      ● If a single worker fails, recovery is faster and its tasks are redistributed to other nodes
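A small round-based simulation shows why many small tasks help: workers keep pulling from a shared queue, and a failed worker's task simply goes back into the queue. The failure model (one named worker dies on its first task) and the function `run_tasks` are assumptions made to keep the sketch deterministic.

```python
from collections import deque

def run_tasks(tasks, workers, failing_worker=None):
    """Hand out tasks from a shared queue; re-queue the task of a failed worker.

    Returns a dict task -> worker that completed it.
    """
    pending = deque(tasks)
    alive = list(workers)
    done = {}
    while pending:
        if not alive:
            raise RuntimeError("no workers left to run the remaining tasks")
        for worker in list(alive):
            if not pending:
                break
            task = pending.popleft()
            if worker == failing_worker:
                alive.remove(worker)   # the worker is lost...
                pending.append(task)   # ...and its task is redistributed
            else:
                done[task] = worker
    return done

tasks = [f"t{i}" for i in range(6)]
print(run_tasks(tasks, ["a", "b", "c"], failing_worker="b"))
```

With six tasks and three workers, losing one worker costs only the single task it was running; the remaining workers absorb the rest.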
  • 28. Redundant task executions
      ● Some tasks can be delayed, delaying the overall job
      ● The solution is to create copies of a task that are executed in parallel by 2 or more different workers (speculative execution)
      ● A task is considered complete when the master is informed of its completion by at least one node
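Speculative execution can be imitated with threads: the same task is submitted to several "workers" and the first result to arrive wins. The artificial delays, the worker names and `run_speculatively` are assumptions of this sketch. Python threads cannot be cancelled once running, so the slower copy here is simply ignored.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def task(worker_name, delay, data):
    """The same computation on every attempt; 'delay' simulates a slow node."""
    time.sleep(delay)
    return worker_name, sum(data)

def run_speculatively(data, attempts):
    """attempts: list of (worker_name, simulated_delay_seconds)."""
    pool = ThreadPoolExecutor(max_workers=len(attempts))
    futures = [pool.submit(task, name, delay, data) for name, delay in attempts]
    # The task counts as complete as soon as one attempt reports back.
    done, _still_running = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the straggler
    return next(iter(done)).result()

winner, result = run_speculatively([1, 2, 3], [("straggler", 0.5), ("backup", 0.01)])
print(winner, result)
```

This is safe only because both attempts compute the same deterministic result, so it does not matter which copy finishes first.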
  • 29. Partitioning
      ● A user can specify a custom function that partitions the intermediate data during shuffling
      ● The input and output data types are defined by the user, with no restriction on their form
  • 30. Other features
      ● The input of a reducer is always sorted
      ● Tasks can be executed locally in a serial manner
      ● The master provides web interfaces for:
          ● Monitoring task progress
          ● Browsing HDFS
  • 31. When should I use it?
      ● Good choice for jobs that can be broken into parallelized jobs:
          ● Indexing/analysis of log files
          ● Sorting of large data sets
          ● Image processing
      ● Bad choice for serial or low-latency jobs:
          ● Computation of the number π with a precision of 1,000,000 digits
          ● Computation of the Fibonacci sequence
          ● Replacing MySQL
  • 32. Use cases 1/3
      ● Large-scale image conversions
      ● 100 Amazon EC2 instances, 4 TB of raw TIFF data
      ● 11 million PDFs in 24 hours for $240
      ● Internal log processing
      ● Reporting, analytics and machine learning
      ● Cluster of 1110 machines, 8800 cores and 12 PB of raw storage
      ● Open source contributors (Hive)
      ● Store and process tweets, logs, etc.
      ● Open source contributors (hadoop-lzo)
      ● Large-scale machine learning
  • 33. Use cases 2/3
      ● 100,000 CPUs in 25,000 computers
      ● Content/ads optimization, search index
      ● Machine learning (e.g. spam filtering)
      ● Open source contributors (Pig)
      ● Natural language search (through Powerset)
      ● 400 nodes in EC2, storage in S3
      ● Open source contributors (!) to HBase
      ● ElasticMapReduce service
      ● On-demand elastic Hadoop clusters for the Cloud
  • 34. Use cases 3/3
      ● ETL processing, statistics generation
      ● Advanced algorithms for behavioral analysis and targeting
      ● Used for discovering People You May Know, and for other apps
      ● 3x30 node cluster, 16 GB RAM and 8 TB storage
      ● Leading Chinese-language search engine
      ● Search log analysis, data mining
      ● 300 TB per week
      ● 10 to 500 node clusters
  • 35. Amazon ElasticMapReduce (EMR): a hosted Hadoop-as-a-service solution provided by AWS
      ● No need for management or tuning of Hadoop clusters
          ● Upload your input data and store your output data on S3
          ● Procure as many EC2 instances as you need and pay only for the time you use them
      ● Hive and Pig support makes it easy to write data-analysis scripts
      ● Java, Perl, Python, PHP, C++ for more sophisticated algorithms
      ● Integrates with DynamoDB (process combined datasets in S3 & DynamoDB)
      ● Support for HBase (NoSQL)