Distributed Computing the Google way
An introduction to Apache Hadoop
3 million images are uploaded to … every day.

…enough images to fill a 375,000-page photo album.
Over 210 billion emails are sent out daily.

…which is more than a year’s worth of letter mail in the US.
Bloggers post 900,000 new articles every day.

Enough posts to fill the New York Times for 19 years!
43,339 TB are sent across all mobile phones globally every day.

That is enough to fill 1.7 million Blu-rays, 9.2 million DVDs, or 63.9 trillion 3.5” diskettes.
700,000 new members are signing up on Facebook every day…

It’s approximately the population of Guyana.
Agenda

1   Introduction.
2   MapReduce.
3   Apache Hadoop.
4   RDBMS & MapReduce.
5   Questions & Discussion.
Eduard Hildebrandt
Consultant, Architect, Coach
Freelancer

+49 160 6307253
mail@eduard-hildebrandt.de
http://www.eduard-hildebrandt.de
Why should I care?
It’s not just Google!

New York Stock Exchange: 1 TB of trade data per day
Internet Archive (www.archive.org): growing by 20 TB per month
Hadron Collider, Switzerland: producing 15 PB per year
It’s a growing job market!
It may be the future of distributed computing!

Think about: GPS trackers, RFID, genome analysis, medical monitors…

The amount of data we produce will rise from year to year!
It’s about performance!

BEFORE
Development: 2-3 weeks
Runtime: 26 days

AFTER
Development: 2-3 days
Runtime: 20 minutes
Grid computing
focus on: distributing the workload

• one SAN drive, many compute nodes
• works well for small data sets and long processing times
• examples: SETI@home, Folding@home
Problem: Sharing data is slow!

Google processed 400 PB per month in 2007 with an average job size of 180 GB. It takes ~45 minutes to read a 180 GB file sequentially.
Modern approach
focus on: distributing the data

• stores data locally
• parallel read / write
    1 HDD → ~75 MB/sec
    1,000 HDDs → ~75,000 MB/sec
The MAP and REDUCE algorithm

Word count on the phrase “Do As I Say Not As I Do”:

Map:     (Do,1) (As,1) (I,1) (Say,1) (Not,1) (As,1) (I,1) (Do,1)
Group:   (As,[1,1]) (Do,[1,1]) (I,[1,1]) (Not,[1]) (Say,[1])
Reduce:  as 2, do 2, i 2, not 1, say 1

It’s really map – group – reduce!
Implementation of the MAP algorithm

// imports needed: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.*

public static class MapClass extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // emit (word, 1) for every token of the input line
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {

        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}

                 Could it be even simpler?
Implementation of the REDUCE algorithm

// imports needed: java.io.IOException, java.util.Iterator,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.*

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    // sum up all the counts emitted for this word
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {

        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

                                                     Just REDUCE it!
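
To run the job, the two classes still have to be wired into a job configuration. A minimal driver for the old mapred API might look like this (a sketch; the class name WordCount and the reuse of the reducer as a combiner are assumptions, not part of the original slides):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);            // final key type
        conf.setOutputValueClass(IntWritable.class);   // final value type

        conf.setMapperClass(MapClass.class);   // from the MAP slide
        conf.setReducerClass(Reduce.class);    // from the REDUCE slide
        conf.setCombinerClass(Reduce.class);   // local pre-aggregation

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);   // submit the job and wait for completion
    }
}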
Apache Hadoop

http://hadoop.apache.org/

Hadoop is an open-source Java framework for parallel processing of large data sets on clusters of commodity hardware.
Hadoop History

02/2003   first MapReduce library @ Google
10/2003   GFS paper
12/2004   MapReduce paper
07/2005   Nutch uses MapReduce
02/2006   Hadoop moves out of Nutch
04/2007   Yahoo runs Hadoop on 10,000 nodes
01/2008   Hadoop becomes an Apache top-level project
07/2008   Hadoop wins the terasort benchmark
07/2010   this presentation
Who is using Apache Hadoop?
“Failure is the defining difference between
   distributed and local programming.”
                 -- Ken Arnold, CORBA designer
Mean time between failures of a HDD: 1,200,000 hours.

If your cluster has 10,000 hard drives, then you have a hard drive crash every 5 days on average (1,200,000 h ÷ 10,000 drives = 120 h ≈ 5 days).
HDFS

[Diagram] A client reads and writes blocks directly on the DataNodes; the NameNode holds only the metadata (sample1.txt → blocks 1,3,5; sample2.txt → blocks 2,4). Each block is replicated across DataNodes 1-6, which are spread over two racks.
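
From the shell, the same division of labor is visible (a sketch; the paths are invented for illustration):

$ hadoop fs -put sample1.txt /user/demo/sample1.txt   # write: block data flows to the DataNodes
$ hadoop fs -ls /user/demo                            # listing: answered from NameNode metadata
$ hadoop fs -cat /user/demo/sample1.txt               # read: streamed back from the DataNodes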
[Diagram] The input file is divided into splits, one per block; each split is processed by a map task emitting (word, 1) pairs, the pairs are grouped by key, and the reduce tasks produce the final counts (as 2, do 2, i 2, not 1, say 1).

How does it fit together?
Hadoop architecture

[Diagram] 1. The client copies the input files into HDFS (NameNode and DataNodes). 2. It submits the job, which is placed in the job queue. 3. The TaskTrackers report in via heartbeat. 4. The job is initialized. 5. The TaskTrackers read their input splits from the DataNodes, 6. run the MapReduce tasks, and 7. save the result back.
Reduce it to the max!

Performance improvement when scaling out your Hadoop system.
The typical evolution of an RDBMS-backed service:

1   Initial public launch
    Move from a local workstation to a server.

2   Service becomes more popular
    Cache common queries. Reads are no longer strongly ACID.

3   Service continues to grow in popularity
    Scale the DB server vertically by buying a costly server.

4   New features increase query complexity
    Denormalize your data to reduce joins.

5   Rising popularity swamps the server
    Stop doing any server-side computation.

6   Some queries are still too slow
    Periodically prematerialize the most complex queries.

7   Reads are OK, but writes are getting slower and slower
    Drop secondary indexes and triggers.
How can we solve this scaling problem?
Join

page_view                          user                        pv_users
pageid userid time                 userid age gender           pageid age
1      111    10:18:21         X   111    22  female       =   1      22
2      111    10:19:53             222    33  male              2      22
1      222    11:05:12                                          1      33

SQL:
  INSERT INTO TABLE pv_users
  SELECT pv.pageid, u.age
  FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Join with MapReduce

map: tag each record with its source table and emit it under its userid
  page_view → 111 <1,1>   111 <1,2>   222 <1,1>      (value = <table 1, pageid>)
  user      → 111 <2,22>  222 <2,33>                 (value = <table 2, age>)

shuffle: group the tagged values by key (userid)
  111 → <1,1> <1,2> <2,22>
  222 → <1,1> <2,33>

reduce: combine each user’s page views with that user’s age
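
A sketch of the reduce side of such a join, in the same old mapred API as the word-count slides (the string encoding of the tagged values and the class name are assumptions for illustration):

// imports needed: java.io.IOException, java.util.*,
// org.apache.hadoop.io.Text, org.apache.hadoop.mapred.*

public static class JoinReduce extends MapReduceBase implements
        Reducer<Text, Text, Text, Text> {

    // key = userid; values = tagged records such as "<1,2>" (a page view)
    // or "<2,22>" (the user's age)
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        List<String> pageids = new ArrayList<String>();
        String age = null;

        while (values.hasNext()) {
            String[] tagged = values.next().toString()
                    .replaceAll("[<>]", "").split(",");
            if ("1".equals(tagged[0])) {
                pageids.add(tagged[1]);   // from page_view
            } else {
                age = tagged[1];          // from user
            }
        }
        // emit one joined (pageid, age) row per page view
        for (String pageid : pageids) {
            output.collect(new Text(pageid), new Text(age));
        }
    }
}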
HBase

HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable.

• no real indexes
• automatic partitioning
• scales linearly and automatically with new nodes
• commodity hardware
• fault tolerant
• batch processing
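
A small client sketch against the HBase 0.20-era API (the table, row, and column names are invented for illustration):

// imports: org.apache.hadoop.hbase.HBaseConfiguration,
// org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes

HBaseConfiguration conf = new HBaseConfiguration();
HTable table = new HTable(conf, "pages");

// write one versioned cell: row "row1", family "content", qualifier "html"
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
        Bytes.toBytes("<html>...</html>"));
table.put(put);

// read it back by row key
Result result = table.get(new Get(Bytes.toBytes("row1")));
byte[] html = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));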
RDBMS vs. MapReduce

            RDBMS                       MapReduce
Data size   gigabytes                   petabytes
Access      interactive and batch       batch
Updates     read and write many times   write once, read many times
Structure   static schema               dynamic schema
Integrity   high                        low
Scaling     nonlinear                   linear
Use the right tool!

MapReduce is a screwdriver, good for:
• unstructured data
• data-intensive computation
• batch operations
• scaling horizontally

Databases are hammers, good for:
• structured data
• transactions
• interactive requests
• scaling vertically
Where is the bridge?

[Diagram] The user profiles live in the RDBMS, the log files in Hadoop. How do we get data from one side to the other?
Sqoop
SQL-to-Hadoop database import tool

[Diagram] Sqoop moves the user profiles from the RDBMS over to Hadoop, next to the log files.

$ sqoop --connect jdbc:mysql://database.example.com/users \
    --username aaron --password 12345 --all-tables \
    --warehouse-dir /common/warehouse
Sqoop
SQL-to-Hadoop database import tool

[Diagram] Along the way, Sqoop generates Java classes for the imported records and captures the DB schema.
What is common across Hadoop-able problems?

• nature of the data
    • complex data
    • multiple data sources
    • lots of it

• nature of the analysis
    • batch processing
    • parallel execution
    • spread data over the nodes in a cluster
    • take the computation to the data
TOP 10 Hadoop-able problems

1   modeling true risk            6   network data analysis
2   customer churn analysis       7   fraud detection
3   recommendation engines        8   trade surveillance
4   ad targeting                  9   search quality
5   point-of-sale analysis        10  data “sandbox”
“Appetite comes with eating.”
                 -- François Rabelais
Case Study 1

Listening data:

user id   track id   scrobble   radio   skip
123       456        0          1       1
451       789        1          0       1
241       234        0          1       1

Hadoop jobs for:
• number of unique listeners (see the sketch below)
• number of times the track was:
    • scrobbled
    • listened to on the radio
    • listened to in total
    • skipped on the radio
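
A sketch of the unique-listener job in the same old mapred API (field order as in the table above; the class names and the whitespace-separated input format are assumptions):

// imports: java.io.IOException, java.util.*,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.*

public static class UniqueListenersMap extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, Text> {

    // emit (track id, user id) for every listening event
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split("\\s+");
        output.collect(new Text(fields[1]), new Text(fields[0]));
    }
}

public static class UniqueListenersReduce extends MapReduceBase implements
        Reducer<Text, Text, Text, IntWritable> {

    // count the distinct user ids seen for each track
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        Set<String> listeners = new HashSet<String>();
        while (values.hasNext()) {
            listeners.add(values.next().toString());
        }
        output.collect(key, new IntWritable(listeners.size()));
    }
}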
Case Study 2

User data:
• 12 TB of compressed data added per day
• 800 TB of compressed data scanned per day
• 25,000 MapReduce jobs per day
• 65 million files in HDFS
• 30,000 simultaneous clients to the HDFS NameNode

Hadoop jobs for:
• friend recommendations
• insights for Facebook advertisers
Enterprise

[Diagram] 1. High-volume data from science, industry, legacy and system sources (XML, CSV, EDI, logs, objects, SQL, TXT, JSON, binary) flows into a Hadoop subsystem running in a server cloud. 2. A MapReduce algorithm (create map → reduce) processes it. 3. The results are imported into an RDBMS and consumed via the Internet, mashups, dashboards, ERP, SOA and BI.
That was just the tip of the iceberg!

Core, HDFS, HBase, Hive, Pig, Thrift, Avro, Nutch, ZooKeeper, Chukwa, Jaql, Cassandra, Dumbo, Solr, KosmosFS, Cascading, Mahout, Scribe, Ganglia, Katta, Hypertable…
Hadoop is a good choice for:

• analyzing log files
• sorting large amounts of data
• search engines
• contextual ads
• image analysis
• protein folding
• classification
Hadoop is a poor choice for:

• calculating π to 1,000,000 digits
• calculating Fibonacci sequences
• a general RDBMS replacement
Final thoughts

1   Data-intensive computation is a fundamentally different challenge
    than CPU-intensive computation over small data sets.

2   New ways of thinking about problems are needed.

3   Failure is acceptable and inevitable.
    Go cheap! Go distributed!

4   RDBMS is not dead!
    It just got new friends and helpers.

5   Give Hadoop a chance!
Time for questions!

Let’s get connected!
http://www.soa-at-work.com
http://www.eduard-hildebrandt.de