IN4392 Cloud Computing
Cloud Programming Models




Cloud Computing (IN4392)

D.H.J. Epema and A. Iosup. With contributions from Claudio Martella and Bogdan Ghit.

2012-2013

                                                                                       1



Parallel and Distributed Systems Group / TU Delft
Terms for Today’s Discussion
 Programming model
   = language + libraries + runtime system that create
     a model of computation (an abstract machine)
   = "an abstraction of a computer system" (Wikipedia)
   Examples: message-passing vs shared memory, data- vs
   task-parallelism, …

 Abstraction level Q: What is the best abstraction level?
   = distance from physical machine
   Examples: Assembly is low-level vs. Java is high-level
      Many design trade-offs: performance, ease-of-use,
      common-task optimization, programming paradigm, …
2012-2013                                            2
Characteristics of a Cloud Programming Model


1. Cost model (Efficiency) = cost/performance, overheads, …
2. Scalability
3. Fault-tolerance
4. Support for specific services
5. Control model, e.g., fine-grained many-task scheduling
6. Data model, including partitioning and placement, out-of-
   memory data access, etc.
7. Synchronization model


2012-2013                                             3
Agenda


1.     Introduction
2.     Cloud Programming in Practice (The Problem)
3.     Programming Models for Compute-Intensive Workloads
4.     Programming Models for Big Data
5.     Summary




2012-2013                                             4
Today’s Challenges


•    eScience
•    The Fourth Paradigm
•    The Data Deluge and Big Data
•    Possibly others




2012-2013                           5
eScience: The Why
      • Science experiments already cost 25–50% of the budget
             • … and perhaps incur 75% of the delays
      • Millions of lines of code with similar functionality
             • Little code reuse across projects and application domains
             • … but the last two decades' science is very similar in structure

      • Most results difficult to share and reuse
             • Case-in-point: Sloan Digital Sky Survey
               digital map of 25% of the sky, plus spectra
               40TB+ sky survey data
               200M+ astro-objects (images)
               1M+ objects with spectrum (spectra)
               How to make it work for this and
               the next generation of scientists?
      2012-2013                                                                   6



Source: Jim Gray and Alex Szalay, "eScience -- A Transformed Scientific Method",
http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
eScience (John Taylor, UK Sci.Tech., 1999)
• A new scientific method
       • Combine science with IT
       • Full scientific process: control scientific instrument or produce data
         from simulations, gather and reduce data, analyze and model
         results, visualize results
       • Mostly compute-intensive, e.g., simulation of complex phenomena


• IT support
       • Infrastructure: LHC Grid, Open Science Grid, DAS, NorduGrid, …
       • From programming models to infrastructure management tools


• Examples                  Q: Why is CompSci an example here?
       • * physics, Bioinformatics, Material science, Engineering, CompSci
2012-2013                                                               7
The Fourth Paradigm: The Why (An Anecdotal Example)
The Overwhelming Growth of Knowledge
"When 12 men founded the
  Royal Society in 1660, it was
  possible for an educated
  person to encompass all of
  scientific knowledge. […] In
  the last 50 years, such has
  been the pace of scientific
  advance that even the best
  scientists cannot keep up
  with discoveries at frontiers
  outside their own field."
  Tony Blair,
  PM Speech, May 2002

  [Chart: number of publications, 1993–1997 vs. 1997–2001]
  Data: King, "The scientific impact of nations", Nature '04.
The Fourth Paradigm: The What
  From Hypothesis to Data
  • A thousand years ago:
     science was empirical, describing natural phenomena

  • Last few hundred years:
     a theoretical branch using models and generalizations,
     e.g., $\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - \frac{kc^2}{a^2}$

  • Last few decades:
     a computational branch simulating complex phenomena
  • Today (the Fourth Paradigm):          Q1: What is the Fourth Paradigm?
     data exploration,
     unifying theory, experiment, and simulation
          • Data captured by instruments or generated by simulators
          • Processed by software
          • Information/Knowledge stored in computer
          • Scientist analyzes results using data management and statistics
                                          Q2: What are the dangers of the Fourth Paradigm?
  2012-2013                                                                     9



Source: Jim Gray and "The Fourth Paradigm",
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
The “Data Deluge”:
  The Why

"Everywhere you look, the quantity
  of information in the world is
  soaring. According to one
  estimate, mankind created 150
  exabytes (billion gigabytes) of
  data in 2005. This year, it will
  create 1,200 exabytes. Merely
  keeping up with this flood, and
  storing the bits that might be
  useful, is difficult enough.
  Analysing it, to spot patterns and
  extract useful information, is
  harder still."
  The Data Deluge, The Economist,
  25 February 2010
                                       10
“Data Deluge”: The Personal Memex Example



• Vannevar Bush in the 1940s: record your life
• MIT Media Laboratory: The Human Speechome
  Project/TotalRecall, data mining/analysis/visualization
   • Deb Roy and Rupal Patel "record practically every
     waking moment of their son's first three years"
     (20% privacy time… Is this even legal?! Should it be?!)
   • 11x1MP/14fps cameras, 14x16b-48KHz mics, 4.4TB
     RAID + tapes, 10 computers; 200k hours audio-video
   • Data size: 200GB/day, 1.5PB total

                                                      11
“Data Deluge”: The Gaming Analytics Example




                   • EQ II: 20TB/year all logs
                   • Halo3: 1.4PB served statistics
                     on player logs

                                                 12
“Data Deluge”: Datasets in Comp.Sci.

   [Chart: dataset size per trace archive, from ~1 GB in '06 to ~1 TB/yr in '11,
    incl. P2PTA and GamTA]

   • The Grid Workloads Archive, http://gwa.ewi.tudelft.nl
   • The Failure Trace Archive, http://fta.inria.fr
   • The Peer-to-Peer Trace Archive
   • … PWA, ITA, CRAWDAD, …

• 1,000s of scientists: From theory to practice
                                                              13
“Data Deluge”:
  The Professional World Gets Connected




                                                   Feb 2012
  2012-2013                                                   14



Source: Vincenzo Cosenza, The State of LinkedIn,
http://vincos.it/the-state-of-linkedin/
What is “Big Data”?

• Very large, distributed aggregations of loosely structured
  data, often incomplete and inaccessible

• Easily exceeds the processing capacity of conventional
  database systems

• Principle of Big Data: "When you can, keep everything!"

• Too big, too fast, and doesn't comply with traditional
  database architectures
  2011-2012                                                15
The Three “V”s of Big Data

• Volume
      •   More data vs. better models
      •   Data grows exponentially
      •   Analysis in near-real time to extract value
      •   Scalable storage and distributed queries
• Velocity
      • Speed of the feedback loop
      • Gain competitive advantage: fast recommendations
      • Identify fraud, predict customer churn faster
• Variety
      • The data can become messy: text, video, audio, etc.
      • Difficult to integrate into applications
     2011-2012                                                               16

Adapted from: Doug Laney, "3D data management", META Group/Gartner report,
Feb 2001. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-
Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Agenda


1. Introduction
2. Cloud Programming in Practice (The Problem)
3. Programming Models for Compute-Intensive
   Workloads
   1. Bags of Tasks
   2. Workflows
   3. Parallel Programming Models
4. Programming Models for Big Data
5. Summary

2012-2013                                        18
What is a Bag of Tasks (BoT)? A System View

 BoT = set of jobs sent by a user…


 …that start at most Δs after
  the first job                      Q: What is the user's view?


                                                [Timeline figure]
• Why Bag of Tasks? From the perspective
  of the user, the jobs in the set are just tasks of a larger job
• A single useful result from the complete BoT
• The result can be a combination of all tasks, or a selection
  of the results of most, or even of a single task
    2012-2013                                                  19
Iosup et al., The Characteristics and
Performance of Groups of Jobs in Grids,
Euro-Par, LNCS, vol.4641, pp. 382-393, 2007.
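The Δ-based definition above translates directly into a simple grouping rule. Below is a minimal illustrative sketch in Java (the Job record, its field names, and the class name are assumptions made for this example, not part of any trace tool): a job joins the current bag if it was submitted by the same user at most Δ seconds after the bag's first job.

  import java.util.ArrayList;
  import java.util.List;

  // Illustrative sketch: group one user's jobs (sorted by submission time) into Bags of Tasks.
  // A job belongs to the current BoT if it arrives at most deltaSec after the BoT's first job.
  class BotGrouperSketch {
    record Job(String id, long submitTimeSec) {}

    static List<List<Job>> groupIntoBots(List<Job> jobsSortedBySubmitTime, long deltaSec) {
      List<List<Job>> bots = new ArrayList<>();
      List<Job> current = null;
      long firstSubmit = Long.MIN_VALUE;
      for (Job j : jobsSortedBySubmitTime) {
        if (current == null || j.submitTimeSec() - firstSubmit > deltaSec) {
          current = new ArrayList<>();          // start a new bag
          firstSubmit = j.submitTimeSec();
          bots.add(current);
        }
        current.add(j);                         // otherwise the job joins the current bag
      }
      return bots;
    }
  }

The choice of Δ decides how aggressively jobs are grouped; the system view on the slide uses exactly this submission-time threshold per user.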
Applications of the BoT Programming Model


• Parameter sweeps
       • Comprehensive, possibly exhaustive investigation of a model
       • Very useful in engineering and simulation-based science

• Monte Carlo simulations
       • Simulations with random elements: fixed run time, yet bounded inaccuracy
       • Very useful in engineering and simulation-based science

• Many other types of batch processing
       • Periodic computation, Cycle scavenging
       • Very useful to automate operations and reduce waste
2012-2013                                                           20
BoTs Became the Dominant Programming
           Model for Grid Computing
[Bar charts: the fraction of jobs [%] and the fraction of consumed CPU time [%]
 originating from BoTs, per system: (US) TeraGrid-2 NCSA, (US) Condor U.Wisc.,
 (EU) EGEE, (CA) SHARCNET, (US) Grid3, (US) GLOW, (UK) RAL, (NO,SE) NorduGrid,
 (FR) Grid'5000, (NL) DAS-2; scale 0–100]

          2012-2013                                           21


        Iosup and Epema: Grid Computing Workloads.
           IEEE Internet Computing 15(2): 19-26 (2011)
Practical Applications of the BoT Programming Model
  Parameter Sweeps in Condor [1/4]

  • Sue the scientist wants to
    "Find the value of F(x,y,z) for
    10 values of x and y, and 6 values of z"

  • Solution: Run a parameter sweep, with
    10 x 10 x 6 = 600 parameter combinations
  • Problem with the solution:
         • Sue runs one job (one combination of x, y, and z) on her low-end
           machine. It takes 6 hours.
         • That's 150 days of uninterrupted computation on Sue's machine!

  2012-2013                                                          22



Source: Condor Team, Condor User's Tutorial.
http://cs.uwisc.edu/condor
Practical Applications of the BoT Programming Model
  Parameter Sweeps in Condor [2/4]
  Universe             =   vanilla
  Executable           =   sim.exe
  Input                =   input.txt
  Output               =   output.txt
  Error                =   error.txt
  Log                  =   sim.log

  Requirements = OpSys == "WINNT61" && Arch == "INTEL" &&
    (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
                                  [Complex SLAs can be specified easily]


  InitialDir = run_$(Process)     [$(Process) is also passed as a parameter to sim.exe]
  Queue 600
  2012-2013                                             23



Source: Condor Team, Condor User's Tutorial.
http://cs.uwisc.edu/condor
Practical Applications of the BoT Programming Model
  Parameter Sweeps in Condor [3/4]
  % condor_submit sim.submit
  Submitting job(s)
    ................................................
    ................................................
    ................................................
    ................................................
    ................................................
    ...............
  Logging submit event(s)
    ................................................
    ................................................
    ................................................
    ................................................
    ................................................
    ...............
  600 job(s) submitted to cluster 3.

  2012-2013                                             24



Source: Condor Team, Condor User's Tutorial.
http://cs.uwisc.edu/condor
Practical Applications of the BoT Programming Model
    Parameter Sweeps in Condor [4/4]
% condor_q
-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> :
   x.cs.wisc.edu
ID     OWNER   SUBMITTED  RUN_TIME ST PRI SIZE CMD
3.0    frieda 4/20 12:08 0+00:00:05 R 0    9.8 sim.exe
3.1    frieda 4/20 12:08 0+00:00:03 I 0    9.8 sim.exe
3.2    frieda 4/20 12:08 0+00:00:01 I 0    9.8 sim.exe
3.3    frieda 4/20 12:08 0+00:00:00 I 0    9.8 sim.exe
...
3.598 frieda 4/20 12:08 0+00:00:00 I 0     9.8 sim.exe
3.599 frieda 4/20 12:08 0+00:00:00 I 0     9.8 sim.exe

600 jobs; 599 idle, 1 running, 0 held


    2012-2013                                             25



  Source: Condor Team, Condor User's Tutorial.
  http://cs.uwisc.edu/condor
Agenda


1. Introduction
2. Cloud Programming in Practice (The Problem)
3. Programming Models for Compute-Intensive
   Workloads
   1. Bags of Tasks
   2. Workflows
   3. Parallel Programming Models
4. Programming Models for Big Data
5. Summary

2012-2013                                        26
What is a Workflow?


  WF = set of jobs with
        precedence
(think Directed Acyclic Graph, DAG)




2012-2013                      27
Applications of the Workflow Programming
        Model
        • Complex applications
               • Complex filtering of data
               • Complex analysis of instrument measurements


        • Applications created by non-CS scientists*
               • Workflows have a natural correspondence in the real-world,
                 as descriptions of a scientific procedure
               • Visual model of a graph sometimes easier to program


        • Precursor of the MapReduce Programming Model
          (next slides)
        2012-2013                                                              28



*Adapted from: Carole Goble and David de Roure, Chapter in "The Fourth
Paradigm", http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Workflows Existed in Grids, but Did Not
    Become a Dominant Programming Model
  • Traces


  • Selected Findings




     •   Loose coupling
     •   Graph with 3-4 levels
     •   Average WF size is 30/44 jobs
     •   75%+ WFs are sized 40 jobs or less, 95% are sized 200 jobs or less
    2012-2013                                                        29
Ostermann et al., On the Characteristics of Grid
Workflows, CoreGRID Integrated Research in Grid
Computing (CGIW), 2008.
Practical Applications of the WF Programming Model
        Bioinformatics in Taverna




        2012-2013                                                            30



Source: Carole Goble and David de Roure, Chapter in "The Fourth Paradigm",
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Agenda


1. Introduction
2. Cloud Programming in Practice (The Problem)
3. Programming Models for Compute-Intensive
   Workloads
   1. Bags of Tasks
   2. Workflows
   3. Parallel Programming Models
4. Programming Models for Big Data
5. Summary

2012-2013                                        31
Parallel Programming Models                      TUD course in4049
                                                       Introduction to HPC


       • Abstract machines
          • (Distributed) shared memory
          • Distributed memory: MPI

       • Conceptual programming models
              •    Master/worker
              •    Divide and conquer
              •    Data / Task parallelism
              •    BSP
       • System-level programming models
              • Threads on GPUs and other multi-cores

       [In-class task (groups of 5, 5 minutes): discuss parallel programming
        in clouds, followed by an inter-group discussion.]
      2012-2013                                                     32


Varbanescu et al.: Towards an Effective Unified
   Programming Model for Many-Cores. IPDPS WS 2012
Agenda


1.     Introduction
2.     Cloud Programming in Practice (The Problem)
3.     Programming Models for Compute-Intensive Workloads
4.     Programming Models for Big Data
5.     Summary




2012-2013                                             33
Ecosystems of Big-Data Programming Models
Q: Where does MR-on-demand fit?
                          Q: Where does Pregel-on-GPUs fit?
                                                                High-Level Language
   Flume, BigQuery, SQL, Meteor, JAQL, Hive, Pig, Sawzall, Scope, DryadLINQ, AQL

                                                                 Programming Model
   PACT, MapReduce Model, Pregel, Dataflow, Algebrix

                                                                   Execution Engine
   Flume Engine, Dremel Service Tree, Tera Data Engine, Azure Engine, Nephele,
   Haloop, Hadoop/YARN, Giraph, MPI/Erlang, Dryad, Hyracks

                                                                     Storage Engine
   S3, GFS, Tera Data Store, Azure Data Store, HDFS, Voldemort, LFS, CosmosFS,
   Asterix B-tree                  * Plus Zookeeper, CDN, etc.
      2012-2013                                                                                34



Adapted from: Dagstuhl Seminar on Information Management in the Cloud,
http://www.dagstuhl.de/program/calendar/partlist/?semnr=11321&SUOG
Agenda


1. Introduction
2. Cloud Programming in Practice (The Problem)
3. Programming Models for Compute-Intensive Workloads
4. Programming Models for Big Data
   1. MapReduce
   2. Graph Processing
   3. Other Big Data Programming Models
5. Summary


2012-2013                                         35
MapReduce
• Model for processing and generating large data sets

• Enables a functional-like programming model

• Splits computations into independent parallel tasks

• Makes efficient use of large commodity clusters

• Hides the details of parallelization, fault-tolerance,
  data distribution, monitoring and load balancing

2011-2012                                                  36
MapReduce: The Programming Model

A programming model, not a programming language!
1. Input/Output:
  • Set of key/value pairs
2. Map Phase:
  • Processes input key/value pair
  • Produces set of intermediate pairs
  map (in_key, in_value) -> list(out_key, interm_value)
3. Reduce Phase:
  • Combines all intermediate values for a given key
  • Produces a set of merged output values
  reduce(out_key, list(interm_value)) -> list(out_value)

 2011-2012                                                 37
MapReduce in the Cloud

• Facebook use case:
   • Social-networking services
   • Analyze connections in the graph of friendships to recommend new
     connections
• Google use case:
   • Web-based email services, Google Docs
   • Analyze messages and user behavior to optimize ad selection and
     placement
• YouTube use case:
   • Video-sharing sites
   • Analyze user preferences to give better stream suggestions


    2011-2012                                                      38
Wordcount Example
• File 1: ―The big data is big.‖
• File 2: ―MapReduce tames big data.‖

• Map Output:
    •   Mapper-1: (The, 1), (big, 1), (data, 1), (is, 1), (big, 1)
    •   Mapper-2: (MapReduce, 1), (tames, 1), (big, 1), (data, 1)


• Reduce Input                                          • Reduce Output
    •   Reducer-1:   (The, 1)                                 •   Reducer-1:   (The, 1)
    •   Reducer-2:   (big, 1), (big, 1), (big, 1)             •   Reducer-2:   (big, 3)
    •   Reducer-3:   (data, 1), (data, 1)                     •   Reducer-3:   (data, 2)
    •   Reducer-4:   (is, 1)                                  •   Reducer-4:   (is, 1)
    •   Reducer-5:   (MapReduce, 1)                           •   Reducer-5:   (MapReduce, 1)
    •   Reducer-6:   (tames, 1)                               •   Reducer-6:   (tames, 1)
  2011-2012                                                                            39
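To make the wordcount example concrete, here is a minimal sketch of the same computation written against the Hadoop Java MapReduce API (Hadoop is one execution engine for this model; the code follows the standard textbook WordCount, and the input/output paths are passed on the command line):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
    // Map phase: (offset, line of text) -> list of (word, 1), as produced by Mapper-1/2 above
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      @Override
      protected void map(Object key, Text value, Context ctx)
          throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) { word.set(it.nextToken()); ctx.write(word, ONE); }
      }
    }
    // Reduce phase: (word, [1, 1, ...]) -> (word, count), as in the Reducer-1..6 columns above
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
      }
    }
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "wordcount");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

A real run would also strip punctuation before counting ("big." vs "big"), which this sketch skips to stay close to the slide's example.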
Colored Square Counter




2011-2012                40
MapReduce @ Google: Example 1


    • Generating language model statistics
        • Count # of times every 5-word sequence occurs in large corpus
          of documents (and keep all those where count >= 4)


    • MapReduce solution:
        • Map: extract 5-word sequences => count from document
        • Reduce: combine counts, and keep if count large enough




       2011-2012                                                      41


•    http://www.slideshare.net/jhammerb/mapreduce-pact06-keynote
MapReduce @ Google: Example 2


    • Joining with other data
         • Generate per-doc summary, but include per-host summary
           (e.g., # of pages on host, important terms on host)


    • MapReduce solution:
         • Map: extract host name from URL, lookup per-host info,
           combine with per-doc data and emit
         • Reduce: identity function (emit key/value directly)




      2011-2012                                                     42


•   http://www.slideshare.net/jhammerb/mapreduce-pact06-keynote
More Examples of MapReduce Applications

•    Distributed Grep
•    Count of URL Access Frequency
•    Reverse Web-Link Graph
•    Term-Vector per Host
•    Distributed Sort
•    Inverted indices
•    Page ranking
•    Machine learning algorithms
•    …


2011-2012                             43
Execution Overview (1/2)
 Q: What is the performance problem raised by this step?




2011-2012                                             44
Execution Overview (2/2)
• One master, many workers example
   • Input data split into M map tasks (64 / 128 MB): 200,000 maps
   • Reduce phase partitioned into R reduce tasks: 4,000 reduces
   • Dynamically assign tasks to workers: 2,000 workers


• Master assigns each map / reduce task to an idle worker
   • Map worker:
      • Data locality awareness
      • Read input and generate R local files with key/value pairs
   • Reduce worker:
      • Read intermediate output from mappers
      • Sort and reduce to produce the output
    2011-2012                                                        45
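One detail worth making explicit: intermediate keys are spread over the R reduce tasks by a partition function, by default a hash of the key modulo R (as described in the MapReduce paper). A minimal sketch:

  // Default-style intermediate-key partitioning: partition = hash(key) mod R.
  // Masking with Integer.MAX_VALUE keeps the result non-negative for negative hash codes.
  static int partitionFor(String key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

All values for the same key therefore land at the same reducer, which is what makes the per-key reduce step possible.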
Failures and Back-up Tasks




2011-2012                    46
What is Hadoop? A MapReduce Exec. Engine
• Inspired by Google, supported by Yahoo!

• Designed to perform fast and reliable analysis of big data

• Large expansion in many domains such as:
   • Finance, technology, telecom, media, entertainment, government,
     research institutions




   2011-2012                                                      47
Hadoop @ Yahoo

                                                                 • When you visit
                                                                    Yahoo!, you are
                                                                   interacting with
                                                                   data processed
                                                                   with Hadoop!




        2011-2012                                                             48


• https://opencirrus.org/system/files/OpenCirrusHadoop2009.ppt
Hadoop Distributed File System

• Assumptions and goals:
   •     Data distributed across hundreds or thousands of machines
   •     Detection of faults and automatic recovery
   •     Designed for batch processing vs. interactive use
   •     High throughput of data access vs. low latency of data access
   •     High aggregate bandwidth and high scalability
   •     Write-once-read-many file access model
   •     Moving computations is cheaper than moving data: minimize network
         congestion and increase throughput of the system




       2011-2012                                                      50
HDFS Architecture
• Master/slave architecture

• NameNode (NN)
   •   Manages the file system namespace
   •   Regulates access to files by clients
   •   Open/close/rename files or directories
   •   Mapping of blocks to DataNodes
• DataNode (DN)
   •   One per node in the cluster
   •   Manages local storage of the node
   •   Block creation/deletion/replication initiated by NN
   •   Serve read/write requests from clients
       2011-2012                                             51
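Not on the original slides, but it helps to see the NameNode/DataNode split from the client side. Below is a minimal sketch against the standard Hadoop FileSystem Java API (the path is hypothetical): metadata operations go to the NameNode, while the file data itself streams to and from DataNodes.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
      FileSystem fs = FileSystem.get(conf);          // namespace operations go to the NameNode
      Path p = new Path("/user/demo/example.txt");   // hypothetical path

      // Write once: the NameNode picks DataNodes, the client streams the blocks to them
      try (FSDataOutputStream out = fs.create(p)) {
        out.writeUTF("hello HDFS");
      }
      // Read many: block locations come from the NameNode, the bytes from the DataNodes
      try (FSDataInputStream in = fs.open(p)) {
        System.out.println(in.readUTF());
      }
    }
  }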
HDFS Internals
• Replica Placement Policy
   • First replica on one node in the local rack
   • Second replica on a different node in the local rack
   • Third replica on a different node in a different rack
       → improved write performance (2/3 of replicas are on the local rack)
       → preserves data reliability and read performance


• Communication Protocols
   •     Layered on top of TCP/IP
   •     Client Protocol: client – NameNode machine
   •     DataNode Protocol: DataNodes – NameNode
   •     NameNode responds to RPC requests issued by DataNodes / clients

       2011-2012                                                      52
Hadoop Scheduler
• Job divided into several independent tasks executed in parallel
   • The input file is split into chunks of 64 / 128 MB
   • Each chunk is assigned to a map task
   • Reduce tasks aggregate the output of the map tasks


• The master assigns tasks to the workers in FIFO order
   • JobTracker: maintains a queue of running jobs, the states of the
     TaskTrackers, and the task assignments
   • TaskTrackers: report their state via a heartbeat mechanism


• Data Locality: execute tasks close to their data
• Speculative execution: re-launch slow tasks
     2011-2012                                                          53
MapReduce Evolution [1/5]
            Hadoop is Maturing: Important Contributors
Number of patches




           2012-2013                                                54



Source: http://www.theregister.co.uk/2012/08/17/community_hadoop/
MapReduce Evolution [2/5]
   Late Scheduler
 • LATE Scheduler
     •   Longest Approximate Time to End
     •   Speculatively execute the task that will finish farthest into the future
     •   time_left = (1-progressScore) / progressRate
     •   Tasks make progress at a roughly constant rate
     •   Robust to node heterogeneity


 • EC2 Sort running times
     • LATE vs. Hadoop vs. No spec.




      2011-2012                                                              55


Zaharia et al.: Improving MapReduce performance in
   heterogeneous environments. OSDI 2008.
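A minimal sketch of the LATE estimate described above (the Task record and class names are illustrative, not Hadoop's actual classes): compute time_left = (1 - progressScore) / progressRate for every running task and speculatively re-execute the one expected to finish farthest in the future.

  import java.util.Comparator;
  import java.util.List;

  // Illustrative sketch of the LATE heuristic; progressScore is in [0, 1].
  record Task(String id, double progressScore, double elapsedSeconds) {
    double progressRate() { return progressScore / elapsedSeconds; }        // progress per second
    double timeLeft()     { return (1 - progressScore) / progressRate(); }  // LATE estimate
  }

  class LateSchedulerSketch {
    // Pick the running task estimated to finish farthest into the future.
    static Task pickSpeculativeCandidate(List<Task> running) {
      return running.stream()
          .max(Comparator.comparingDouble(Task::timeLeft))
          .orElseThrow();
    }
  }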
MapReduce Evolution [3/5]
   FAIR Scheduling
   • Isolation and statistical multiplexing
   • Two-level architecture:
       • First, allocates task slots across pools
       • Second, each pool allocates its slots among multiple jobs
   • Based on a max-min fairness policy




      2011-2012                                                      56

Zaharia et al.: Delay scheduling: a simple technique for
   achieving locality and fairness in cluster
   scheduling. EuroSys 2010. Also TR EECS-2009-55
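For reference, a minimal sketch of the max-min fairness computation that the FAIR scheduler's slot allocation builds on (illustrative only, not the actual Hadoop Fair Scheduler code): repeatedly split the remaining capacity equally among the unsatisfied pools, cap each pool at its demand, and redistribute the leftover.

  // Illustrative max-min fair allocation of 'capacity' slots across pools with given demands.
  class MaxMinFairnessSketch {
    static double[] fairShares(double[] demands, double capacity) {
      int n = demands.length;
      double[] alloc = new double[n];
      boolean[] satisfied = new boolean[n];
      int active = n;
      double remaining = capacity;
      while (active > 0 && remaining > 1e-9) {
        double equalShare = remaining / active;            // equal split of what is left
        boolean anySatisfied = false;
        for (int i = 0; i < n; i++) {
          if (!satisfied[i] && demands[i] - alloc[i] <= equalShare) {
            remaining -= demands[i] - alloc[i];            // pool i needs less than its share
            alloc[i] = demands[i];
            satisfied[i] = true; active--; anySatisfied = true;
          }
        }
        if (!anySatisfied) {                               // all remaining pools absorb the share
          for (int i = 0; i < n; i++) if (!satisfied[i]) alloc[i] += equalShare;
          remaining = 0;
        }
      }
      return alloc;
    }
  }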
MapReduce Evolution [4/5]
        Delay Scheduling
 • Data locality issues:
    •     Head-of-line scheduling: FIFO, FAIR
    •     Low probability for small jobs to achieve data locality
    •     58% of jobs @ FACEBOOK have < 25 maps
    •     Only 5% achieve node locality vs. 59% rack locality




        2011-2012                                                   57

Zaharia et al.: Delay scheduling: a simple technique for
   achieving locality and fairness in cluster
   scheduling. EuroSys 2010. Also TR EECS-2009-55
MapReduce Evolution [5/5]
      Delay Scheduling
 • Solution:
     •   Skip the head-of-line job until you find a job that can launch a local task
     •   Wait T1 seconds before launching tasks on the same rack
     •   Wait T2 seconds before launching tasks off-rack
     •   T1 = T2 = 15 seconds => 80% data locality




      2011-2012                                                             58

Zaharia et al.: Delay scheduling: a simple technique for
   achieving locality and fairness in cluster
   scheduling. EuroSys 2010. Also TR EECS-2009-55
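A minimal sketch of the delay-scheduling rule above (class and constant names are illustrative): a head-of-line job that cannot launch a node-local task on the free slot is skipped, and may wait up to T1 seconds for node locality plus T2 more seconds for rack locality before being launched off-rack.

  // Illustrative delay-scheduling locality relaxation, with T1 = T2 = 15 s as on the slide.
  class DelaySchedulingSketch {
    static final long T1_MS = 15_000, T2_MS = 15_000;

    enum Locality { NODE, RACK, ANY }

    // Given how long a job has been skipped, at which locality level may it launch a task?
    static Locality allowedLocality(long waitedMs) {
      if (waitedMs < T1_MS) return Locality.NODE;           // insist on node-local tasks
      if (waitedMs < T1_MS + T2_MS) return Locality.RACK;   // then accept rack-local tasks
      return Locality.ANY;                                  // finally run anywhere (off-rack)
    }
  }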
KOALA and MapReduce [1/9]
• KOALA                                 • MR-Runner
  •     Placement & allocation            •   Configuration & deployment
  •     Central for all MR clusters       •   MR cluster monitoring
  •     Maintains MR cluster metadata     •   Grow/Shrink mechanism

• On-demand MR clusters
  •     Performance isolation
  •     Data isolation
  •     Failure isolation
  •     Version isolation




      2011-2012                                                            59
Koala and MapReduce [2/9]
       Resizing Mechanism
• Two types of nodes:
   •    Core nodes: fully-functional nodes, with TaskTracker and DataNode (local disk access)
   •    Transient nodes: compute nodes, with TaskTracker



• Parameters
   •    F = Number of running tasks per number of available slots
   •    Predefined Fmin and Fmax thresholds
   •    Predefined constant step growStep / shrinkStep
   •    T = time elapsed between two successive resource offers



• Three policies:
   •    Grow-Shrink Policy (GSP): grow-shrink but maintain F between Fmin and Fmax
   •    Greedy-Grow Policy (GGP): grow, shrink when workload done
   •    Greedy-Grow-with-Data Policy (GGDP): GGP with core nodes (local disk access)
   2011-2012                                                                          60
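A minimal sketch of the Grow-Shrink Policy decision, using the parameters listed on this slide (the MRCluster interface and its method names are assumptions standing in for the MR-Runner component): on every resource offer, compute F and grow or shrink the MR cluster when F leaves the [Fmin, Fmax] band.

  // Illustrative Grow-Shrink Policy (GSP) decision for an on-demand MapReduce cluster.
  class GrowShrinkPolicySketch {
    double fMin = 0.25, fMax = 1.25;      // predefined thresholds (slide values)
    int growStep = 5, shrinkStep = 2;     // nodes added / removed per decision (slide values)

    void onResourceOffer(MRCluster cluster) {
      double f = (double) cluster.runningTasks() / cluster.availableSlots();  // utilization F
      if (f > fMax) cluster.grow(growStep);           // too many tasks per slot: add nodes
      else if (f < fMin) cluster.shrink(shrinkStep);  // cluster underutilized: release nodes
    }

    interface MRCluster {                 // hypothetical interface standing in for MR-Runner
      int runningTasks();
      int availableSlots();
      void grow(int nodes);
      void shrink(int nodes);
    }
  }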
Koala and MapReduce [3/9]
      Workloads




•     98% of jobs @ Facebook process 6.9 MB and take less than a minute [1]

•     Google reported in 2004 computations with TB of data on 1000s of machines [2]

[1]   Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz, "Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis,"
      pp. 43–56, 2012.
[2]   J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. of the ACM, vol. 51, no. 1, pp. 107–113, 2008.




      2011-2012                                                                                                                            61
Koala and MapReduce [4/9]
    Wordcount (Type 0, full system)

                 [Plots: CPU utilization | Disk utilization]




•   100 GB input data
•   10 core nodes with 8 map slots each
•   800 map tasks executed in 7 waves
•   Wordcount is CPU-bound in the map phase

     2011-2012                                       62
Koala and MapReduce [5/9]
      Sort (Type 0, full system)
                    [Plots: CPU utilization | Disk utilization]




•   Performed by the MR framework during the shuffling phase
•   Intermediate key/value pairs are processed in increasing key order
•   Short map phase with 40%-60% CPU utilization
•   Long reduce phase which is highly disk intensive

      2011-2012                                                          63
Koala and MapReduce [6/9]
  Speedup (full system)

                  Optimum




              a) Wordcount   (Type 1)                    b) Sort (Type 2)

• Speedup relative to an MR cluster with 10 core nodes
• Close to linear speedup on core nodes


  2011-2012                                                                 64
Koala and MapReduce [7/9]
    Execution Time vs. Transient Nodes




•    Input data of 40 GB
•    Wordcount output data = 20 KB
•    Sort output data = 40 GB
•    Wordcount scales better than Sort on transient nodes
    2011-2012                                               65
Koala and MapReduce [8/9]
Performance of the Resizing Mechanism

                                                             Fmin=0.25
                                                             Fmax=1.25
                                                             growStep=5
                                                             shrinkStep=2
                                                             TGSP =30 s
                                                             TGG(D)P =120 s




•   Stream of 50 MR jobs
•   MR cluster of 20 core nodes + 20 transient nodes
•   GGP increases the size of the data transferred across the network
•   GSP grows/shrinks based on the resource utilization of the cluster
•   GGDP enables local writes on the disks of provisioned nodes
    2011-2012                                                             66
Koala and MapReduce [9/9]
   Use Case @ TUDelft
• Goal: Process 10s of TB of P2P traces

• DAS-4 constraints:
   •   Time limit per job (15 min)
   •   Non-persistent storage



• A possible (naive) solution:
   •   Reserve the nodes for several days, then import and process all the data?


• Our approach:
   •   Split the data into multiple subsets
   •   Smaller data sets => faster import and processing
   •   Setup multiple MR clusters, one for each subset
   2011-2012                                                              67
Agenda


1. Introduction
2. Cloud Programming in Practice (The Problem)
3. Programming Models for Compute-Intensive Workloads
4. Programming Models for Big Data
   1. MapReduce
   2. Graph Processing
   3. Other Big Data Programming Models
5. Summary


2012-2013                                         68
Graph Processing Example
        Single-Source Shortest Path (SSSP)


                                                   • Dijkstra's algorithm
                                                        • Select node with minimal distance
                                                        • Update neighbors
                                                        • O(|E| + |V|·log|V|) with FiboHeap

                                                   • Initial dataset:
                                                       A: <0, (B, 5), (D, 3)>
                                                       B: <inf, (E, 1)>
                                                       C: <inf, (F, 5)>
                                                       D: <inf, (B, 1), (C, 3), (E, 4), (F, 4)>
                                                       ...
        2012-2013                                                                          69



Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012.
Graph Processing Example
        SSSP in MapReduce
                          Q: What is the performance problem?
                                                   • Mapper output: distances
                                                       <B, 5>, <D, 3>, <C, inf>, ...
                                                   • But also graph structure
                                                       <A, <0, (B, 5), (D, 3)> ...


                                                   • Reducer input: distances
                                                       B: <inf, 5>, D: <inf, 3> ...
                                                   • But also graph structure
    N jobs, where N is the                             B: <inf, (E, 1)> ...
        graph diameter
        2012-2013                                                                      70



Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012.
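A minimal in-memory sketch of one such iteration (illustrative Java, not a runnable Hadoop job): the "map" step emits the node's own distance plus a relaxed distance for every neighbor, and the "reduce" step keeps the minimum candidate per node. In a real deployment this whole job is re-run up to graph-diameter times, and the graph structure has to be shipped through every iteration, which is exactly the overhead the question above points at.

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // One SSSP iteration in MapReduce style: adjacency maps node -> (neighbor -> edge weight),
  // dist maps node -> current shortest distance (infinity if not yet reached).
  class SsspMapReduceSketch {
    static Map<String, Double> iterate(Map<String, Map<String, Double>> adjacency,
                                       Map<String, Double> dist) {
      // "Map": each node emits its own distance and a candidate distance for every neighbor
      Map<String, List<Double>> emitted = new HashMap<>();
      for (var e : adjacency.entrySet()) {
        double d = dist.getOrDefault(e.getKey(), Double.POSITIVE_INFINITY);
        emitted.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(d);
        for (var edge : e.getValue().entrySet())
          emitted.computeIfAbsent(edge.getKey(), k -> new ArrayList<>()).add(d + edge.getValue());
      }
      // "Reduce": keep the minimum candidate distance per node (the graph structure would
      // also have to be emitted and carried along, which this in-memory sketch omits)
      Map<String, Double> next = new HashMap<>();
      for (var e : emitted.entrySet()) next.put(e.getKey(), Collections.min(e.getValue()));
      return next;
    }
  }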
The Pregel
        Programming Model for Graph Processing



                •   Batch-oriented processing
                •   Runs in-memory
                •   Vertex-centric API
                •   Fault-tolerant
                •   Runs on Master-Slave architecture

                • OK, the actual model follows in the next slides
        2012-2013                                                         71



Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012.
The Pregel
          Programming Model for Graph Processing
Based on Valiant's
  Bulk Synchronous Parallel (BSP) model
      •     N processing units with fast local memory
      •     Shared communication medium
      •     Series of Supersteps
      •     Global Synchronization Barrier
      •     Ends when all vertices voteToHalt

• Pregel executes initialization, one
  or several supersteps, and shutdown

[Figure: one superstep = local computation on the processors, then communication,
 then barrier synchronization]



Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012.
Pregel
        The Superstep

        • Each Vertex (execution in parallel)
               •    Receive messages from other vertices
               •    Perform own computation (user-defined function)
               •    Modify own state or the state of outgoing edges
               •    Mutate topology of the graph
               •    Send messages to other vertices
        • Termination condition
               • All vertices inactive
               • All messages have been transmitted


        2012-2013                                                     73



Source: Pregel article.
Pregel: The Vertex-Based API

        [Figure: the user implements the processing algorithm in the vertex's Compute()
         method, which consumes the vertex's input messages and emits output messages]



        2012-2013                            74



Source: Pregel article.
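For the SSSP example from the earlier slide, a vertex program in this style looks roughly as follows (illustrative Java pseudocode in the Pregel spirit; the Vertex and Edge types below are assumptions, not the exact Pregel or Giraph API):

  // Illustrative Pregel-style vertex program for single-source shortest paths.
  abstract class ShortestPathVertex {
    String id;
    double value = Double.POSITIVE_INFINITY;                 // current shortest distance

    abstract Iterable<Edge> outEdges();
    abstract void sendMessage(String target, double distance);
    abstract void voteToHalt();
    abstract long superstep();

    void compute(Iterable<Double> messages) {
      double minDist = (superstep() == 0 && isSource()) ? 0 : Double.POSITIVE_INFINITY;
      for (double m : messages) minDist = Math.min(minDist, m);  // best distance offered so far
      if (minDist < value) {                                     // shorter path found: update, propagate
        value = minDist;
        for (Edge e : outEdges()) sendMessage(e.target(), minDist + e.weight());
      }
      voteToHalt();                    // reactivated only if a new message arrives next superstep
    }

    boolean isSource() { return id.equals("A"); }   // source vertex from the slide's dataset
    record Edge(String target, double weight) {}
  }

Only vertices that receive a shorter distance become active again, so the computation settles after roughly graph-diameter supersteps without re-shipping the graph structure each round.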
The Pregel Architecture
        Master-Worker
  • Master assigns vertices to Workers
        • Graph partitioning
  •    Master coordinates Supersteps
  •    Master coordinates Checkpoints
  •    Workers execute the vertices' compute()
  •    Workers exchange messages directly

  [Figure: one Master coordinating Workers 1..n, each holding a partition of the graph's vertices]

        2012-2013                                           75



Source: Pregel article.
Apache Giraph
        An Open-Source Implementation of Pregel
            [Figure: Giraph workers run in the Map slots of the cluster's TaskTrackers;
             ZooKeeper holds the persistent computation state, alongside the NameNode & JobTracker
             and the Giraph Master]
        •    Loose implementation of Pregel
        •    Strong community (Facebook, Twitter, LinkedIn)
        •    Runs 100% on existing Hadoop clusters
        •    Single Map-only job
        2012-2013                                                                             78



Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012.
http://incubator.apache.org/giraph/
Page rank benchmarks
    • Tiberium Tan
         • Almost 4000 nodes, shared among numerous groups in Yahoo!
         • Hadoop 0.20.204 (secure Hadoop)
         • 2x Quad Core 2.4GHz, 24 GB RAM, 1x 6TB HD


    • org.apache.giraph.benchmark.PageRankBenchmark
         • Generates data, number of edges, number of vertices, # of
           supersteps
         • 1 master/ZooKeeper
         • 20 supersteps
         • No checkpoints
         • 1 random edge per vertex


                                                                       79



Source: Avery Ching presentation at HortonWorks.
http://www.slideshare.net/averyching/20111014hortonworks/
Worker scalability (250M vertices)
                      [Plot: total seconds (0–5000) vs. # of workers (0–350), for 250M vertices]
                                                                              80



Source: Avery Ching presentation at HortonWorks.
http://www.slideshare.net/averyching/20111014hortonworks/
Vertex scalability (300 workers)
                      [Plot: total seconds (0–2500) vs. # of vertices in 100's of millions (0–600), with 300 workers]
                                                                                    81



Source: Avery Ching presentation at HortonWorks.
http://www.slideshare.net/averyching/20111014hortonworks/
Vertex/worker scalability
                      [Plot: total seconds (left axis, 0–2500) and # of vertices in 100's of
                       millions (right axis, 0–600) vs. # of workers (0–350)]
                                                                                      82



Source: Avery Ching presentation at HortonWorks.
http://www.slideshare.net/averyching/20111014hortonworks/
Agenda


1. Introduction
2. Cloud Programming in Practice (The Problem)
3. Programming Models for Compute-Intensive Workloads
4. Programming Models for Big Data
   1. MapReduce
   2. Graph Processing
   3. Other Big Data Programming Models
5. Summary


2012-2013                                         83
Stratosphere

• Meteor query language, Sopremo operator framework

• Programming Contracts (PACTs) programming model
       • Extended set of 2nd order functions (vs MapReduce)
       • Declarative definition of data parallelism

• Nephele execution engine
       • Schedules multiple dataflows simultaneously
       • Supports IaaS environments based on Amazon EC2, Eucalyptus

• HDFS storage engine

2012-2013                                                       84
Stratosphere
      Programming Contracts (PACTs) [1/2]
                                          2nd-order function




   Also in MapReduce
      2012-2013                                                85



Source: PACT overview,
https://stratosphere.eu/wiki/doku.php/wiki:pactpm
Stratosphere
      Programming Contracts (PACTs) [2/2]




                                                    Q: How can PACTs
                                                      optimize data
                                                       processing?




      2012-2013                                                        86



Source: PACT overview,
https://stratosphere.eu/wiki/doku.php/wiki:pactpm
Stratosphere
Nephele




2012-2013      87
Stratosphere vs MapReduce
      • PACT extends MapReduce
            •    Both propose 2nd-order functions (5 PACTs vs Map & Reduce)
            •    Both require user-supplied 1st-order functions (what's inside the Map)
            •    Both can benefit from higher-level languages
            •    PACT ecosystem has IaaS support


      • Key-value
        data model




     2012-2013                                                                  88



Source: Fabian Hueske, Large Scale Data Analysis Beyond MapReduce, Hadoop Get
Together, Feb 2012.
Stratosphere vs MapReduce
      Triangle Enumeration [1/3]
      • Input:
        undirected graph,
        edge-based
      • Output:
        triangles

      [Figure: pipeline: 1. Read input graph; 2. Sort by smaller vertex ID;
       3. Join edges to triads; 4. Triads to triangles]




     2012-2013                                                                       90



Source: Fabian Hueske, Large Scale Data Analysis Beyond MapReduce, Hadoop Get
Together, Feb 2012 and Stratosphere example.
Stratosphere vs MapReduce
      Triangle Enumeration [2/3]

 MapReduce
 Formulation




     2012-2013                                                                  91



Source: Fabian Hueske, Large Scale Data Analysis Beyond MapReduce, Hadoop Get
Together, Feb 2012 and Stratosphere example.
Stratosphere vs MapReduce
      Triangle Enumeration [3/3]

 Stratosphere
 Formulation




     2012-2013                                                                  92



Source: Fabian Hueske, Large Scale Data Analysis Beyond MapReduce, Hadoop Get
Together, Feb 2012 and Stratosphere example.
Agenda


1.     Introduction
2.     Cloud Programming in Practice (The Problem)
3.     Programming Models for Compute-Intensive Workloads
4.     Programming Models for Big Data
5.     Summary




2012-2013                                             93
Conclusion Take-Home Message
• Programming model = computer system abstraction

• Programming Models for Compute-Intensive Workloads
  • Many trade-offs, few dominant programming models
  • Models: bags of tasks, workflows, master/worker, BSP, …
• Programming Models for Big Data
  •     Big data programming models have ecosystems
  •     Many trade-offs, many programming models
  •     Models: MapReduce, Pregel, PACT, Dryad, …
  •     Execution engines: Hadoop, Koala+MR, Giraph, PACT/Nephele, Dryad, …

• Reality check: cloud programming is maturing
      February 20, 2013                      http://www.flickr.com/photos/dimitrisotiropoulos/4204766418/
                                                                                              94
Reading Material (Really Active Field)
•   Workloads
     •      Alexandru Iosup, Dick H. J. Epema: Grid Computing Workloads. IEEE Internet Computing 15(2): 19-26 (2011)

•   The Fourth Paradigm
     •   "The Fourth Paradigm", http://research.microsoft.com/en-us/collaboration/fourthparadigm/

•   Programming Models for Compute-Intensive Workloads
     •      Alexandru Iosup, Mathieu Jan, Omer Ozan Sonmez, Dick H. J. Epema: The Characteristics and Performance of Groups of Jobs in Grids.
            Euro-Par 2007: 382-393
     •      Ostermann et al., On the Characteristics of Grid Workflows, CoreGRID Integrated Research in Grid Computing (CGIW), 2008.
            http://www.pds.ewi.tudelft.nl/~iosup/wftraces07chars_camera.pdf
     •      Corina Stratan, Alexandru Iosup, Dick H. J. Epema: A performance study of grid workflow engines. GRID 2008: 25-32


•   Programming Models for Big Data
     •      Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150
     •      Jeffrey Dean, Sanjay Ghemawat: MapReduce: a flexible data processing tool. Commun. ACM 53(1): 72-77 (2010)
     •      Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce performance in
            heterogeneous environments. In Proceedings of the 8th USENIX conference on Operating systems design and implementation (OSDI'08).
            USENIX Association, Berkeley, CA, USA, 29-42.
     •      Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, Ion Stoica: Delay scheduling: a simple technique
            for achieving locality and fairness in cluster scheduling. EuroSys 2010: 265-278
     •      Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, Russell Sears: MapReduce Online. NSDI 2010: 313-328
     •      Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski: Pregel: a system
            for large-scale graph processing. SIGMOD Conference 2010: 135-146
     •      Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, Daniel Warneke: Nephele/PACTs: a programming model and
            execution framework for web-scale analytical processing. SoCC 2010: 119-130

         2012-2013                                                                                                                 95

Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and Cloud
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) Skills
 
Datos enlazados BNE and MARiMbA
Datos enlazados BNE and MARiMbADatos enlazados BNE and MARiMbA
Datos enlazados BNE and MARiMbA
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Big Data
Big Data Big Data
Big Data
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
IaaS Cloud Benchmarking: Approaches, Challenges, and Experience
IaaS Cloud Benchmarking: Approaches, Challenges, and ExperienceIaaS Cloud Benchmarking: Approaches, Challenges, and Experience
IaaS Cloud Benchmarking: Approaches, Challenges, and Experience
 
Azure Brain: 4th paradigm, scientific discovery & (really) big data
Azure Brain: 4th paradigm, scientific discovery & (really) big dataAzure Brain: 4th paradigm, scientific discovery & (really) big data
Azure Brain: 4th paradigm, scientific discovery & (really) big data
 
Graham Pryor
Graham PryorGraham Pryor
Graham Pryor
 
Processing Open Data using Terradue Cloud Platform
Processing Open Data using Terradue Cloud PlatformProcessing Open Data using Terradue Cloud Platform
Processing Open Data using Terradue Cloud Platform
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Cloud computingjun28
Cloud computingjun28Cloud computingjun28
Cloud computingjun28
 
Cloud computingjun28
Cloud computingjun28Cloud computingjun28
Cloud computingjun28
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 

Cloud Programming Models: eScience, Big Data, etc.

  • 1. IN4392 Cloud Computing Cloud Programming Models Cloud Computing (IN4392) D.H.J. Epema and A. Iosup. With contributions from Claudio Martella and Bogdan Ghit. 2012-2013 1 Parallel and Distributed Group/TU Delft
  • 2. Terms for Today’s Discussion Programming model = language + libraries + runtime system that create a model of computation (an abstract machine) = ―an abstraction of a computer system‖ Wikipedia Examples: message-passing vs shared memory, data- vs task-parallelism, … Abstraction level Q: What is the best abstraction level? = distance from physical machine Examples: Assembly low-level vs Java is high level Many design trade-offs: performance, ease-of-use, common-task optimization, programming paradigm, … 2012-2013 2
  • 3. Characteristics of a Cloud Programming Model 1. Cost model (Efficiency) = cost/performance, overheads, … 2. Scalability 3. Fault-tolerance 4. Support for specific services 5. Control model, e.g., fine-grained many-task scheduling 6. Data model, including partitioning and placement, out-of- memory data access, etc. 7. Synchronization model 2012-2013 3
  • 4. Agenda 1. Introduction 2. Cloud Programming in Practice (The Problem) 3. Programming Models for Compute-Intensive Workloads 4. Programming Models for Big Data 5. Summary 2012-2013 4
  • 5. Today’s Challenges • eScience • The Fourth Paradigm • The Data Deluge and Big Data • Possibly others 2012-2013 5
  • 6. eScience: The Why • Science experiments already cost 25—50% budget • … and perhaps incur 75% of the delays • Millions of lines of code with similar functionality • Little code reuse across projects and application domains • … but last two decades‘ science is very similar in structure • Most results difficult to share and reuse • Case-in-point: Sloan Digital Sky Survey digital map of 25% of the sky x spectra 40TB+ sky survey data 200M+ astro-objects (images) 1M+ objects with spectrum (spectra) How to make it work for this and the next generation of scientists? 2012-2013 6 Source: Jim Gray and Alex Szalay,―eScience -- A Transformed Scientific Method‖, http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
  • 7. eScience (John Taylor, UK Sci.Tech., 1999) • A new scientific method • Combine science with IT • Full scientific process: control scientific instrument or produce data from simulations, gather and reduce data, analyze and model results, visualize results • Mostly compute-intensive, e.g., simulation of complex phenomena • IT support • Infrastructure: LHC Grid, Open Science Grid, DAS, NorduGrid, … • From programming models to infrastructure management tools • Examples Q: Why is CompSci an example here? • * physics, Bioinformatics, Material science, Engineering, CompSci 2012-2013 7
  • 8. The Fourth Paradigm: The Why (An Anecdotal Example) The Overwhelming Growth of Knowledge (chart: number of publications, 1993–1997 vs. 1997–2001) "When 12 men founded the Royal Society in 1660, it was possible for an educated person to encompass all of scientific knowledge. […] In the last 50 years, such has been the pace of scientific advance that even the best scientists cannot keep up with discoveries at frontiers outside their own field." Tony Blair, PM Speech, May 2002. Data: King, The scientific impact of nations, Nature '04.
  • 9. The Fourth Paradigm: The What From Hypothesis to Data • Thousand years ago: science was empirical, describing natural phenomena • Last few hundred years: a theoretical branch, using models and generalizations • Last few decades: a computational branch, simulating complex phenomena • Today (the Fourth Paradigm): data exploration, unifying theory, experiment, and simulation Q1: What is the Fourth Paradigm? Q2: What are the dangers of the Fourth Paradigm? • Data captured by instruments or generated by simulator • Processed by software • Information/Knowledge stored in computer • Scientist analyzes results using data management and statistics 2012-2013 9 Source: Jim Gray and "The Fourth Paradigm", http://research.microsoft.com/en-us/collaboration/fourthparadigm/
  • 10. The “Data Deluge”: The Why "Everywhere you look, the quantity of information in the world is soaring. According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Merely keeping up with this flood, and storing the bits that might be useful, is difficult enough. Analysing it, to spot patterns and extract useful information, is harder still.― The Data Deluge, The Economist, 25 February 2010 10
  • 11. “Data Deluge”: The Personal Memex Example • Vannevar Bush in the 1940s: record your life • MIT Media Laboratory: The Human Speechome Project/TotalRecall, data mining/analysis/visualization • Deb Roy and Rupal Patel "record practically every waking moment of their son's first three years" (20% privacy time… Is this even legal?! Should it be?!) • 11x1MP/14fps cameras, 14x16b-48KHz mics, 4.4TB RAID + tapes, 10 computers; 200k hours audio-video • Data size: 200GB/day, 1.5PB total 11
  • 12. “Data Deluge”: The Gaming Analytics Example • EQ II: 20TB/year all logs • Halo3: 1.4PB served statistics on player logs 12
  • 13. “Data Deluge”: Datasets in Comp.Sci. (chart: dataset size, 1 GB to 1 TB/yr, per year '06–'11) • Archives: the Grid Workloads Archive (http://gwa.ewi.tudelft.nl), the Failure Trace Archive (http://fta.inria.fr), the Peer-to-Peer Trace Archive (P2PTA), GamTA, plus PWA, ITA, CRAWDAD, … • 1,000s of scientists: From theory to practice 13
  • 14. “Data Deluge”: The Professional World Gets Connected Feb 2012 2012-2013 14 Source: Vincenzo Cosenza, The State of LinkedIn, http://vincos.it/the-state-of-linkedin/
  • 15. What is “Big Data”? • Very large, distributed aggregations of loosely structured data, often incomplete and inaccessible • Easily exceeds the processing capacity of conventional database systems • Principle of Big Data: ―When you can, keep everything!‖ • Too big, too fast, and doesn‘t comply with the traditional database architectures 2011-2012 15
  • 16. The Three “V”s of Big Data • Volume • More data vs. better models • Data grows exponentially • Analysis in near-real time to extract value • Scalable storage and distributed queries • Velocity • Speed of the feedback loop • Gain competitive advantage: fast recommendations • Identify fraud, predict customer churn faster • Variety • The data can become messy: text, video, audio, etc. • Difficult to integrate into applications 2011-2012 16 Adapted from: Doug Laney, ―3D data management‖, META Group/Gartner report, Feb 2001. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data- Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
  • 17. Agenda 1. Introduction 2. Cloud Programming in Practice (The Problem) 3. Programming Models for Compute-Intensive Workloads 1. Bags of Tasks 2. Workflows 3. Parallel Programming Models 4. Programming Models for Big Data 5. Summary 2012-2013 18
  • 18. What is a Bag of Tasks (BoT)? A System View BoT = set of jobs sent by a user… …that start at most Δs after the first job Q: What is the user's view? Time [units] • Why Bag of Tasks? From the perspective of the user, the jobs in the set are just tasks of a larger job • A single useful result from the complete BoT • Result can be a combination of all tasks, or a selection of the results of most or even a single task 2012-2013 19 Iosup et al., The Characteristics and Performance of Groups of Jobs in Grids, Euro-Par, LNCS, vol.4641, pp. 382-393, 2007.
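To make the system view concrete, here is a minimal sketch in Python (not from the slides; the task function and its parameters are invented) of a bag of independent tasks whose results are combined into one useful result:

    from concurrent.futures import ProcessPoolExecutor

    def task(params):
        # One independent task of the bag; any self-contained computation fits here.
        x, y, z = params
        return x * y + z  # placeholder computation

    if __name__ == "__main__":
        # The "bag": a set of independent tasks submitted together by one user.
        bag = [(x, y, z) for x in range(10) for y in range(10) for z in range(6)]
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(task, bag))
        # A single useful result from the complete BoT, here a simple aggregate.
        print("tasks:", len(results), "combined result:", sum(results))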
  • 19. Applications of the BoT Programming Model • Parameter sweeps • Comprehensive, possibly exhaustive investigation of a model • Very useful in engineering and simulation-based science • Monte Carlo simulations • Simulation with random elements: fixed time yet limited inaccuracy • Very useful in engineering and simulation-based science • Many other types of batch processing • Periodic computation, Cycle scavenging • Very useful to automate operations and reduce waste 2012-2013 20
  • 20. BoTs Became the Dominant Programming Model for Grid Computing (chart: fraction of jobs [%] and fraction of CPU time [%] contributed by BoTs, 0–100, in (US) TeraGrid-2 NCSA, (US) Condor U.Wisc., (EU) EGEE, (CA) SHARCNET, (US) Grid3, (US) GLOW, (UK) RAL, (NO,SE) NorduGrid, (FR) Grid'5000, (NL) DAS-2) 2012-2013 21 Iosup and Epema: Grid Computing Workloads. IEEE Internet Computing 15(2): 19-26 (2011)
  • 21. Practical Applications of the BoT Programming Model Parameter Sweeps in Condor [1/4] • Sue the scientist wants to ―Find the value of F(x,y,z) for 10 values for x and y, and 6 values for z‖ • Solution: Run a parameter sweep, with 10 x 10 x 6 = 600 parameter values • Problem of the solution: • Sue runs one job (a combination of x, y, and z) on her low-end machine. It takes 6 hours. • That‘s 150 days uninterrupted computation on Sue‘s machine! 2012-2013 22 Source: Condor Team, Condor User‘s Tutorial. http://cs.uwisc.edu/condor
  • 22. Practical Applications of the BoT Programming Model Parameter Sweeps in Condor [2/4]

    Universe     = vanilla
    Executable   = sim.exe
    Input        = input.txt
    Output       = output.txt
    Error        = error.txt
    Log          = sim.log
    # Complex SLAs can be specified easily:
    Requirements = OpSys == "WINNT61" && Arch == "INTEL" && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
    # $(Process) is also passed as a parameter to sim.exe
    InitialDir   = run_$(Process)
    Queue 600

2012-2013 23 Source: Condor Team, Condor User's Tutorial. http://cs.uwisc.edu/condor
  • 23. Practical Applications of the BoT Programming Model Parameter Sweeps in Condor [3/4]

    % condor_submit sim.submit
    Submitting job(s)
    ................................................ (progress dots elided)
    Logging submit event(s)
    ................................................ (progress dots elided)
    600 job(s) submitted to cluster 3.

2012-2013 24 Source: Condor Team, Condor User's Tutorial. http://cs.uwisc.edu/condor
  • 24. Practical Applications of the BoT Programming Model Parameter Sweeps in Condor [4/4]

    % condor_q
    -- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.edu
     ID     OWNER   SUBMITTED     RUN_TIME   ST PRI SIZE CMD
     3.0    frieda  4/20 12:08  0+00:00:05   R  0   9.8  sim.exe
     3.1    frieda  4/20 12:08  0+00:00:03   I  0   9.8  sim.exe
     3.2    frieda  4/20 12:08  0+00:00:01   I  0   9.8  sim.exe
     3.3    frieda  4/20 12:08  0+00:00:00   I  0   9.8  sim.exe
     ...
     3.598  frieda  4/20 12:08  0+00:00:00   I  0   9.8  sim.exe
     3.599  frieda  4/20 12:08  0+00:00:00   I  0   9.8  sim.exe
    600 jobs; 599 idle, 1 running, 0 held

2012-2013 25 Source: Condor Team, Condor User's Tutorial. http://cs.uwisc.edu/condor
  • 25. Agenda 1. Introduction 2. Cloud Programming in Practice (The Problem) 3. Programming Models for Compute-Intensive Workloads 1. Bags of Tasks 2. Workflows 3. Parallel Programming Models 4. Programming Models for Big Data 5. Summary 2012-2013 26
  • 26. What is a Workflow? WF = set of jobs with precedence constraints (think Directed Acyclic Graph) 2012-2013 27
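As a small illustration (not from the slides), the precedence constraints of a workflow can be captured as a DAG and executed in topological order; the job names below are invented and each "job" is just a print statement:

    from graphlib import TopologicalSorter  # standard library, Python 3.9+

    # Toy workflow: job -> set of jobs it depends on (a DAG).
    workflow = {
        "gather":  set(),
        "filter":  {"gather"},
        "analyze": {"filter"},
        "plot":    {"analyze"},
        "report":  {"analyze", "plot"},
    }

    def run(job):
        print("running", job)

    # Execute jobs while respecting precedence; jobs whose predecessors are
    # all done could also be dispatched in parallel.
    for job in TopologicalSorter(workflow).static_order():
        run(job)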
  • 27. Applications of the Workflow Programming Model • Complex applications • Complex filtering of data • Complex analysis of instrument measurements • Applications created by non-CS scientists* • Workflows have a natural correspondence in the real-world, as descriptions of a scientific procedure • Visual model of a graph sometimes easier to program • Precursor of the MapReduce Programming Model (next slides) 2012-2013 28 *Adapted from: Carole Goble and David de Roure, Chapter in ―The Fourth Paradigm‖, http://research.microsoft.com/en-us/collaboration/fourthparadigm/
  • 28. Workflows Existed in Grids, but Did Not Become a Dominant Programming Model • Traces • Selected Findings • Loose coupling • Graph with 3-4 levels • Average WF size is 30/44 jobs • 75%+ WFs are sized 40 jobs or less, 95% are sized 200 jobs or less 2012-2013 29 Ostermann et al., On the Characteristics of Grid Workflows, CoreGRID Integrated Research in Grid Computing (CGIW), 2008.
  • 29. Practical Applications of the WF Programming Model Bioinformatics in Taverna 2012-2013 30 Source: Carole Goble and David de Roure, Chapter in ―The Fourth Paradigm‖, http://research.microsoft.com/en-us/collaboration/fourthparadigm/
  • 30. Agenda 1. Introduction 2. Cloud Programming in Practice (The Problem) 3. Programming Models for Compute-Intensive Workloads 1. Bags of Tasks 2. Workflows 3. Parallel Programming Models 4. Programming Models for Big Data 5. Summary 2012-2013 31
  • 31. Parallel Programming Models TUD course in4049 Introduction to HPC • Abstract machines Task (groups of 5, 5 minutes): • (Distributed) shared memory discuss parallel programming in clouds • Distributed memory: MPI Task (inter-group discussion): discuss. • Conceptual programming models • Master/worker • Divide and conquer • Data / Task parallelism • BSP • System-level programming models • Threads on GPUs and other multi-cores 2012-2013 32 Varbanescu et al.: Towards an Effective Unified Programming Model for Many-Cores. IPDPS WS 2012
  • 32. Agenda 1. Introduction 2. Cloud Programming in Practice (The Problem) 3. Programming Models for Compute-Intensive Workloads 4. Programming Models for Big Data 5. Summary 2012-2013 33
  • 33. Ecosystems of Big-Data Programming Models Q: Where does MR-on-demand fit? Q: Where does Pregel-on-GPUs fit? (layered ecosystem diagram:)
    High-Level Language: Flume, BigQuery, SQL, Meteor, JAQL, Hive, Pig, Sawzall, Scope, DryadLINQ, AQL
    Programming Model: PACT, MapReduce Model, Pregel, Dataflow, Algebrix
    Execution Engine: Flume Engine, Dremel Service Tree, Tera Data Engine, Azure Engine, Nephele, Haloop, Hadoop/YARN, Giraph, MPI/Erlang, Dryad, Hyracks
    Storage Engine: S3, GFS, Tera Data Store, Azure Data Store, HDFS, Voldemort, LFS, CosmosFS, Asterix B-tree Data Store
    * Plus Zookeeper, CDN, etc.
    2012-2013 34 Adapted from: Dagstuhl Seminar on Information Management in the Cloud, http://www.dagstuhl.de/program/calendar/partlist/?semnr=11321&SUOG
  • 34. Agenda 1. Introduction 2. Cloud Programming in Practice (The Problem) 3. Programming Models for Compute-Intensive Workloads 4. Programming Models for Big Data 1. MapReduce 2. Graph Processing 3. Other Big Data Programming Models 5. Summary 2012-2013 35
  • 35. MapReduce • Model for processing and generating large data sets • Enables a functional-like programming model • Splits computations into independent parallel tasks • Makes efficient use of large commodity clusters • Hides the details of parallelization, fault-tolerance, data distribution, monitoring and load balancing 2011-2012 36
  • 36. MapReduce: The Programming Model A programming model, not a programming language! 1. Input/Output: • Set of key/value pairs 2. Map Phase: • Processes input key/value pair • Produces set of intermediate pairs map (in_key, in_value) -> list(out_key, interm_value) 3. Reduce Phase: • Combines all intermediate values for a given key • Produces a set of merged output values reduce(out_key, list(interm_value)) -> list(out_value) 2011-2012 37
  • 37. MapReduce in the Cloud • Facebook use case: • Social-networking services • Analyze connections in the graph of friendships to recommend new connections • Google use case: • Web-based email services, GoogleDocs • Analyze messages and user behavior to optimize ad selection and placement • YouTube use case: • Video-sharing sites • Analyze user preferences to give better stream suggestions 2011-2012 38
  • 38. Wordcount Example • File 1: ―The big data is big.‖ • File 2: ―MapReduce tames big data.‖ • Map Output: • Mapper-1: (The, 1), (big, 1), (data, 1), (is, 1), (big, 1) • Mapper-2: (MapReduce, 1), (tames, 1), (big, 1), (data, 1) • Reduce Input • Reduce Output • Reducer-1: (The, 1) • Reducer-1: (The, 1) • Reducer-2: (big, 1), (big, 1), (big, 1) • Reducer-2: (big, 3) • Reducer-3: (data, 1), (data, 1) • Reducer-3: (data, 2) • Reducer-4: (is, 1) • Reducer-4: (is, 1) • Reducer-5: (MapReduce, 1), (MapReduce, 1) • Reducer-5: (MapReduce, 2) • Reducer-6: (tames, 1) • Reducer-6: (tames, 1) 2011-2012 39
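The word-count example above can be traced with a tiny in-memory sketch of the map, shuffle, and reduce steps (plain Python, no Hadoop; the helper names are illustrative):

    from collections import defaultdict

    def map_fn(doc_id, text):
        # map(in_key, in_value) -> list of (out_key, interm_value)
        for word in text.split():
            yield word.strip("."), 1

    def reduce_fn(word, counts):
        # reduce(out_key, list(interm_value)) -> list of out_value
        yield word, sum(counts)

    files = {"file1": "The big data is big.",
             "file2": "MapReduce tames big data."}

    # Shuffle: group all intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for doc_id, text in files.items():
        for word, count in map_fn(doc_id, text):
            groups[word].append(count)

    for word in sorted(groups):
        print(list(reduce_fn(word, groups[word])))  # e.g. [('big', 3)]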
  • 40. MapReduce @ Google: Example 1 • Generating language model statistics • Count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4) • MapReduce solution: • Map: extract 5-word sequences => count from document • Reduce: combine counts, and keep if count large enough 2011-2012 41 • http://www.slideshare.net/jhammerb/mapreduce-pact06-keynote
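A rough single-machine sketch of that computation (toy corpus and threshold are invented; the real job runs the same map and reduce steps distributed over a large corpus):

    from collections import Counter

    docs = ["the cat sat on the mat because the cat sat on the mat"] * 3
    THRESHOLD = 4

    counts = Counter()
    for doc in docs:
        words = doc.split()
        # Map: emit every 5-word sequence in the document with count 1.
        for i in range(len(words) - 4):
            counts[tuple(words[i:i + 5])] += 1

    # Reduce: combine the counts per sequence and keep only the frequent ones.
    frequent = {seq: c for seq, c in counts.items() if c >= THRESHOLD}
    print(frequent)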
  • 41. MapReduce @ Google: Example 2 • Joining with other data • Generate per-doc summary, but include per-host summary (e.g., # of pages on host, important terms on host) • MapReduce solution: • Map: extract host name from URL, lookup per-host info, combine with per-doc data and emit • Reduce: identity function (emit key/value directly) 2011-2012 42 • http://www.slideshare.net/jhammerb/mapreduce-pact06-keynote
  • 42. More Examples of MapReduce Applications • Distributed Grep • Count of URL Access Frequency • Reverse Web-Link Graph • Term-Vector per Host • Distributed Sort • Inverted indices • Page ranking • Machine learning algorithms • … 2011-2012 43
  • 43. Execution Overview (1/2) Q: What is the performance problem raised by this step? 2011-2012 44
  • 44. Execution Overview (2/2) • One master, many workers example • Input data split into M map tasks (64 / 128 MB): 200,000 maps • Reduce phase partitioned into R reduce tasks: 4,000 reduces • Dynamically assign tasks to workers: 2,000 workers • Master assigns each map / reduce task to an idle worker • Map worker: • Data locality awareness • Read input and generate R local files with key/value pairs • Reduce worker: • Read intermediate output from mappers • Sort and reduce to produce the output 2011-2012 45
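The "R local files" step can be illustrated with a hash partitioner, a common default in MapReduce implementations (sketch only; the keys and R are invented, and Python's built-in hash is used for brevity although real systems use a deterministic hash):

    R = 4  # number of reduce tasks

    def partition(key, r=R):
        # Send all values for the same key to the same reduce task.
        return hash(key) % r

    intermediate = [("big", 1), ("data", 1), ("big", 1), ("tames", 1)]

    buckets = {i: [] for i in range(R)}
    for key, value in intermediate:
        buckets[partition(key)].append((key, value))

    for i, pairs in buckets.items():
        print("local file for reduce task", i, ":", pairs)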
  • 45. Failures and Back-up Tasks 2011-2012 46
  • 46. What is Hadoop? A MapReduce Exec. Engine • Inspired by Google, supported by Yahoo! • Designed to perform fast and reliable analysis of big data • Rapidly expanding into many domains, such as: • Finance, technology, telecom, media, entertainment, government, research institutions 2011-2012 47
  • 47. Hadoop @ Yahoo • When you visit yahoo, you are interacting with data processed with Hadoop! 2011-2012 48 • https://opencirrus.org/system/files/OpenCirrusHadoop2009.ppt
  • 48. Hadoop Distributed File System • Assumptions and goals: • Data distributed across hundreds or thousands of machines • Detection of faults and automatic recovery • Designed for batch processing vs. interactive use • High throughput of data access vs. low latency of data access • High aggregate bandwidth and high scalability • Write-once-read-many file access model • Moving computations is cheaper than moving data: minimize network congestion and increase throughput of the system 2011-2012 50
  • 49. HDFS Architecture • Master/slave architecture • NameNode (NN) • Manages the file system namespace • Regulates access to files by clients • Open/close/rename files or directories • Mapping of blocks to DataNodes • DataNode (DN) • One per node in the cluster • Manages local storage of the node • Block creation/deletion/replication initiated by NN • Serve read/write requests from clients 2011-2012 51
  • 50. HDFS Internals • Replica Placement Policy • First replica on one node in the local rack • Second replica on a different node in the local rack • Third replica on a different node in a different rack => improved write performance (2/3 of replicas are on the local rack) while preserving data reliability and read performance • Communication Protocols • Layered on top of TCP/IP • Client Protocol: client – NameNode machine • DataNode Protocol: DataNodes – NameNode • NameNode responds to RPC requests issued by DataNodes / clients 2011-2012 52
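A toy sketch of the placement policy as stated on this slide (the cluster map and helper are invented; this is not the HDFS code path):

    import random

    # Hypothetical cluster map: rack id -> nodes in that rack.
    cluster = {"rack1": ["n1", "n2", "n3"],
               "rack2": ["n4", "n5", "n6"]}

    def place_replicas(writer_node, writer_rack):
        # 1st replica on the writer's node, 2nd on another node in the same
        # rack, 3rd on a node in a different rack (policy as on the slide).
        first = writer_node
        second = random.choice([n for n in cluster[writer_rack] if n != writer_node])
        other_rack = random.choice([r for r in cluster if r != writer_rack])
        third = random.choice(cluster[other_rack])
        return [first, second, third]

    print(place_replicas("n1", "rack1"))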
  • 51. Hadoop Scheduler • Job divided into several independent tasks executed in parallel • The input file is split into chunks of 64 / 128 MB • Each chunk is assigned to a map task • Reduce tasks aggregate the output of the map tasks • The master assigns tasks to the workers in FIFO order • JobTracker: maintains a queue of running jobs, the states of the TaskTrackers, the task assignments • TaskTrackers: report their state via a heartbeat mechanism • Data Locality: execute tasks close to their data • Speculative execution: re-launch slow tasks 2011-2012 53
  • 52. MapReduce Evolution [1/5] Hadoop is Maturing: Important Contributors Number of patches 2012-2013 54 Source: http://www.theregister.co.uk/2012/08/17/community_hadoop/
  • 53. MapReduce Evolution [2/5] Late Scheduler • LATE Scheduler • Longest Approximate Time to End • Speculatively execute the task that will finish farthest into the future • time_left = (1-progressScore) / progressRate • Tasks make progress at a roughly constant rate • Robust to node heterogeneity • EC2 Sort running times • LATE vs. Hadoop vs. No spec. 2011-2012 55 Zaharia et al.: Improving MapReduce performance in heterogeneous environments. OSDI 2008.
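The LATE heuristic from this slide can be sketched in a few lines (the task records are invented; a real scheduler also caps the number of speculative copies and only considers sufficiently slow tasks):

    def time_left(task):
        # Tasks are assumed to progress at a roughly constant rate:
        # remaining work / observed progress rate.
        progress_rate = task["progress_score"] / task["elapsed"]
        return (1.0 - task["progress_score"]) / progress_rate

    running = [
        {"id": "t1", "progress_score": 0.9, "elapsed": 90.0},
        {"id": "t2", "progress_score": 0.2, "elapsed": 80.0},  # straggler
        {"id": "t3", "progress_score": 0.5, "elapsed": 60.0},
    ]

    # Speculatively re-execute the task expected to finish farthest in the future.
    print("speculate on", max(running, key=time_left)["id"])  # -> t2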
  • 54. MapReduce Evolution [3/5] FAIR Scheduling • Isolation and statistical multiplexing • Two-level architecture: • First, allocates task slots across pools • Second, each pool allocates its slots among multiple jobs • Based on a max-min fairness policy 2011-2012 56 Zaharia et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. EuroSys 2010. Also TR EECS-2009-55
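A compact sketch of max-min fair allocation of task slots across pools, the first level of the two-level scheme above (the capacity and demands are invented):

    def max_min_fair(capacity, demands):
        # Repeatedly split the remaining slots equally among pools whose
        # demand is not yet satisfied, never giving a pool more than it asks.
        alloc = {p: 0.0 for p in demands}
        unsatisfied = set(demands)
        remaining = float(capacity)
        while unsatisfied and remaining > 1e-9:
            share = remaining / len(unsatisfied)
            for p in list(unsatisfied):
                give = min(share, demands[p] - alloc[p])
                alloc[p] += give
                remaining -= give
                if alloc[p] >= demands[p] - 1e-9:
                    unsatisfied.discard(p)
        return alloc

    print(max_min_fair(100, {"ad-hoc": 20, "production": 70, "research": 60}))
    # -> ad-hoc 20, production 40, research 40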
  • 55. MapReduce Evolution [4/5] Delay Scheduling • Data locality issues: • Head-of-line scheduling: FIFO, FAIR • Low probability for small jobs to achieve data locality • 58% of jobs @ FACEBOOK have < 25 maps • Only 5% achieve node locality vs. 59% rack locality 2011-2012 57 Zaharia et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. EuroSys 2010. Also TR EECS-2009-55
  • 56. MapReduce Evolution [5/5] Delay Scheduling • Solution: • Skip the head-of-line job until you find a job that can launch a local task • Wait T1 seconds before launching tasks on the same rack • Wait T2 seconds before launching tasks off-rack • T1 = T2 = 15 seconds => 80% data locality 2011-2012 58 Zaharia et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. EuroSys 2010. Also TR EECS-2009-55
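A simplified sketch of the delay-scheduling rule (data structures are invented; T2 is read here as an additional wait on top of T1, which is one possible reading of the slide):

    T1, T2 = 15.0, 15.0  # seconds, as on the slide

    def allowed_levels(waited):
        # Locality levels a job may use after skipping slots for 'waited' seconds.
        levels = {"node"}
        if waited >= T1:
            levels.add("rack")
        if waited >= T1 + T2:
            levels.add("off-rack")
        return levels

    def pick_job(jobs, slot_node, slot_rack):
        # Scan jobs in FIFO order, skipping the head of the line while it can
        # only launch a non-local task and has not waited long enough.
        for job in jobs:
            levels = allowed_levels(job["waited"])
            if slot_node in job["input_nodes"]:
                return job["id"], "node-local"
            if "rack" in levels and slot_rack in job["input_racks"]:
                return job["id"], "rack-local"
            if "off-rack" in levels:
                return job["id"], "off-rack"
        return None, None

    jobs = [{"id": "A", "waited": 3.0, "input_nodes": {"n7"}, "input_racks": {"r2"}},
            {"id": "B", "waited": 0.0, "input_nodes": {"n1"}, "input_racks": {"r1"}}]
    print(pick_job(jobs, slot_node="n1", slot_rack="r1"))  # -> ('B', 'node-local')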
  • 57. KOALA and MapReduce [1/9] • KOALA • MR-Runner • Placement & allocation • Configuration & deployment • Central for all MR clusters • MR cluster monitoring • Maintains MR cluster metadata • Grow/Shrink mechanism • On-demand MR clusters • Performance isolation • Data isolation • Failure isolation • Version isolation 2011-2012 59
  • 58. Koala and MapReduce [2/9] Resizing Mechanism • Two types of nodes: • Core nodes: fully-functional nodes, with TaskTracker and DataNode (local disk access) • Transient nodes: compute nodes, with TaskTracker • Parameters • F = Number of running tasks per number of available slots • Predefined Fmin and Fmax thresholds • Predefined constant step growStep / shrinkStep • T = time elapsed between two successive resource offers • Three policies: • Grow-Shrink Policy (GSP): grow-shrink but maintain F between Fmin and Fmax • Greedy-Grow Policy (GGP): grow, shrink when workload done • Greedy-Grow-with-Data Policy (GGDP): GGP with core nodes (local disk access) 2011-2012 60
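The Grow-Shrink Policy decision can be sketched as follows, using the parameters named above (the code itself and the example numbers are illustrative, not the KOALA implementation):

    F_MIN, F_MAX = 0.25, 1.25
    GROW_STEP, SHRINK_STEP = 5, 2

    def gsp_target_size(running_tasks, available_slots, current_nodes):
        # Keep F = running tasks / available slots between Fmin and Fmax by
        # provisioning growStep nodes when overloaded and releasing
        # shrinkStep nodes when underloaded.
        f = running_tasks / available_slots
        if f > F_MAX:
            return current_nodes + GROW_STEP
        if f < F_MIN:
            return max(current_nodes - SHRINK_STEP, 1)
        return current_nodes

    print(gsp_target_size(150, 80, 10))  # overloaded  -> grow to 15
    print(gsp_target_size(10, 80, 10))   # underloaded -> shrink to 8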
  • 59. Koala and MapReduce [3/9] Workloads • 98% of jobs @ Facebook process 6.9 MB and take less than a minute [1] • Google reported in 2004 computations with TB of data on 1000s of machines [2] [1] Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz, ―Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis,‖ pp. 43–56, 2012. [2] J. Dean and S. Ghemawat, ―Mapreduce: Simplified Data Processing on Large Clusters,‖ Comm. of the ACM, Vol. 51, no. 1, pp. 107–113, 2008. 2011-2012 61
  • 60. Koala and MapReduce [4/9] Wordcount (Type 0, full system) CPU Disk • 100 GB input data • 10 core nodes with 8 map slots each • 800 map tasks executed in 7 waves • Wordcount is CPU-bound in the map phase 2011-2012 62
  • 61. Koala and MapReduce [5/9] Sort (Type 0, full system) CPU Disk • Performed by the MR framework during the shuffling phase • Intermediate key/value pairs are processed in increasing key order • Short map phase with 40%-60% CPU utilization • Long reduce phase which is highly disk intensive 2011-2012 63
  • 62. Koala and MapReduce [6/9] Speedup (full system) Optimum a) Wordcount (Type 1) b) Sort (Type 2) • Speedup relative to an MR cluster with 10 core nodes • Close to linear speedup on core nodes 2011-2012 64
  • 63. Koala and MapReduce [7/9] Execution Time vs. Transient Nodes • Input data of 40 GB • Wordcount output data = 20 KB • Sort output data = 40 GB • Wordcount scales better than Sort on transient nodes 2011-2012 65
  • 64. Koala and MapReduce [8/9] Performance of the Resizing Mechanism Fmin=0.25 Fmax=1.25 growStep=5 shrinkStep=2 TGSP =30 s TGG(D)P =120 s • Stream of 50 MR jobs • MR cluster of 20 core nodes + 20 transient nodes • GGP increases the size of the data transferred across the network • GSP grows/shrinks based on the resource utilization of the cluster • GGDP enables local writes on the disks of provisioned nodes 2011-2012 66
  • 65. Koala and MapReduce [9/9] Use Case @ TUDelft • Goal: Process 10s TB of P2P traces • DAS-4 constraints: • Time limit per job (15 min) • Non-persistent storage • Solution: • Reserve the nodes for several days, import and process the data? • Our approach: • Split the data into multiple subsets • Smaller data sets => faster import and processing • Setup multiple MR clusters, one for each subset 2011-2012 67
  • 66. Agenda 1. Introduction 2. Cloud Programming in Practice (The Problem) 3. Programming Models for Compute-Intensive Workloads 4. Programming Models for Big Data 1. MapReduce 2. Graph Processing 3. Other Big Data Programming Models 5. Summary 2012-2013 68
  • 67. Graph Processing Example Single-Source Shortest Path (SSSP) • Dijkstra‘s algorithm • Select node with minimal distance • Update neighbors • O(|E| + |V|·log|V|) with FiboHeap • Initial dataset: A: <0, (B, 5), (D, 3)> B: <inf, (E, 1)> C: <inf, (F, 5)> D: <inf, (B, 1), (C, 3), (E, 4), (F, 4)> ... 2012-2013 69 Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012.
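For reference, a standard-library version of Dijkstra on the dataset above (binary heap instead of a Fibonacci heap; E and F are assumed to have no outgoing edges, since the slide truncates the adjacency list):

    import heapq

    graph = {"A": [("B", 5), ("D", 3)],
             "B": [("E", 1)],
             "C": [("F", 5)],
             "D": [("B", 1), ("C", 3), ("E", 4), ("F", 4)],
             "E": [], "F": []}

    def dijkstra(source):
        dist = {v: float("inf") for v in graph}
        dist[source] = 0
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)        # select node with minimal distance
            if d > dist[u]:
                continue                      # stale heap entry
            for v, w in graph[u]:             # update neighbors
                if d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(heap, (dist[v], v))
        return dist

    print(dijkstra("A"))  # {'A': 0, 'B': 4, 'C': 6, 'D': 3, 'E': 5, 'F': 7}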
  • 68. Graph Processing Example SSSP in MapReduce Q: What is the performance problem? • Mapper output: distances <B, 5>, <D, 3>, <C, inf>, ... • But also graph structure <A, <0, (B, 5), (D, 3)> ... • Reducer input: distances B: <inf, 5>, D: <inf, 3> ... • But also graph structure N jobs, where N is the B: <inf, (E, 1)> ... graph diameter 2012-2013 70 Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012.
  • 69. The Pregel Programming Model for Graph Processing • Batch-oriented processing • Runs in-memory • Vertex-centric API • Fault-tolerant • Runs on Master-Slave architecture • OK, the actual model follows in the next slides 2012-2013 71 Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012.
  • 70. The Pregel Programming Model for Graph Processing (BSP figure: processors, local computation, communication, barrier synchronization) Based on Valiant's Bulk Synchronous Parallel (BSP) • N processing units with fast local memory • Shared communication medium • Series of Supersteps • Global Synchronization Barrier • Ends when all voteToHalt • Pregel executes initialization, one or several supersteps, shutdown 2012-2013 72 Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012.
  • 71. Pregel The Superstep • Each Vertex (execution in parallel) • Receive messages from other vertices • Perform own computation (user-defined function) • Modify own state or state of outgoing messages • Mutate topology of the graph • Send messages to other vertices • Termination condition • All vertices inactive • All messages have been transmitted 2012-2013 73 Source: Pregel article.
  • 72. Pregel: The Vertex-Based API (figure: a vertex class whose compute() method receives input messages, implements the processing algorithm, and sends output messages) 2012-2013 74 Source: Pregel article.
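A single-process sketch of the vertex-centric superstep loop for SSSP, in the spirit of the compute() API above (the scaffolding is heavily simplified, the graph is the toy dataset from the SSSP slide, and this is not Pregel or Giraph code):

    INF = float("inf")

    graph = {"A": [("B", 5), ("D", 3)], "B": [("E", 1)], "C": [("F", 5)],
             "D": [("B", 1), ("C", 3), ("E", 4), ("F", 4)], "E": [], "F": []}
    value = {v: (0 if v == "A" else INF) for v in graph}
    superstep = 0

    def compute(vertex, messages, outbox):
        # SSSP vertex program: adopt the smallest distance seen so far and,
        # if it improved (or this is the source in superstep 0), send
        # distance + edge weight to the out-neighbors; otherwise vote to halt.
        best = min([value[vertex]] + messages)
        if best < value[vertex] or (superstep == 0 and vertex == "A"):
            value[vertex] = best
            for target, weight in graph[vertex]:
                outbox.setdefault(target, []).append(best + weight)
            return True       # stay active
        return False          # vote to halt

    inbox = {}
    while True:
        outbox = {}
        active = [compute(v, inbox.get(v, []), outbox) for v in graph]
        inbox = outbox                        # global synchronization barrier
        superstep += 1
        if not any(active) and not inbox:     # all halted, no messages in flight
            break

    print(superstep, value)  # distances match Dijkstra's result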
  • 73. The Pregel Architecture Master-Worker • Master assigns vertices to Workers • Graph partitioning Master • Master coordinates Supersteps • Master coordinates Checkpoints Worker Worker • Workers execute vertices compute() 1 n • Workers exchange messages directly 5 1 2 4 3 7 k 2012-2013 75 Source: Pregel article.
  • 74. Apache Giraph An Open-Source Implementation of Pregel (figure: Giraph workers and master run in map slots of Hadoop TaskTrackers, with ZooKeeper holding persistent computation state, next to the NameNode & JobTracker) • Loose implementation of Pregel • Strong community (Facebook, Twitter, LinkedIn) • Runs 100% on existing Hadoop clusters • Single Map-only job 2012-2013 78 Source: Claudio Martella, Presentation on Giraph at TU Delft, Apr 2012. http://incubator.apache.org/giraph/
  • 75. Page rank benchmarks • Tiberium Tan • Almost 4000 nodes, shared among numerous groups in Yahoo! • Hadoop 0.20.204 (secure Hadoop) • 2x Quad Core 2.4GHz, 24 GB RAM, 1x 6TB HD • org.apache.giraph.benchmark.PageRankBenchmark • Generates data, number of edges, number of vertices, # of supersteps • 1 master/ZooKeeper • 20 supersteps • No checkpoints • 1 random edge per vertex 79 Source: Avery Ching presentation at HortonWorks. http://www.slideshare.net/averyching/20111014hortonworks/
  • 76. Worker scalability (250M vertices): chart of total runtime (seconds, 0–5000) vs. number of workers (0–350). 80 Source: Avery Ching presentation at HortonWorks. http://www.slideshare.net/averyching/20111014hortonworks/
  • 77. Vertex scalability (300 workers): chart of total runtime (seconds, 0–2500) vs. number of vertices (in 100's of millions, 0–600). 81 Source: Avery Ching presentation at HortonWorks. http://www.slideshare.net/averyching/20111014hortonworks/
  • 78. Vertex/worker scalability: chart of total runtime (seconds) and number of vertices (100's of millions) vs. number of workers (0–350). 82 Source: Avery Ching presentation at HortonWorks. http://www.slideshare.net/averyching/20111014hortonworks/
  • 79. Agenda 1. Introduction 2. Cloud Programming in Practice (The Problem) 3. Programming Models for Compute-Intensive Workloads 4. Programming Models for Big Data 1. MapReduce 2. Graph Processing 3. Other Big Data Programming Models 5. Summary 2012-2013 83
  • 80. Stratosphere • Meteor query language, Supremo operator framework • Programming Contracts (PACTs) programming model • Extended set of 2nd order functions (vs MapReduce) • Declarative definition of data parallelism • Nephele execution engine • Schedules multiple dataflows simultaneously • Supports IaaS environments based on Amazon EC2, Eucalyptus • HDFS storage engine 2012-2013 84
  • 81. Stratosphere Programming Contracts (PACTs) [1/2] 2nd-order function Also in MapReduce 2012-2013 85 Source: PACT overview, https://stratosphere.eu/wiki/doku.php/wiki:pactpm
  • 82. Stratosphere Programming Contracts (PACTs) [2/2] Q: How can PACTs optimize data processing? 2012-2013 86 Source: PACT overview, https://stratosphere.eu/wiki/doku.php/wiki:pactpm
  • 84. Stratosphere vs MapReduce • PACT extends MapReduce • Both propose 2nd-order functions (5 PACTs vs Map & Reduce) • Both require from user 1st-order functions (what‘s inside the Map) • Both can benefit from higher-level languages • PACT ecosystem has IaaS support • Key-value data model 2012-2013 88 Source: Fabian Hueske, Large Scale Data Analysis Beyond MapReduce, Hadoop Get Together, Feb 2012.
  • 85. Stratosphere vs MapReduce Triangle Enumeration [1/3] • Input: undirected graph, edge-based • Output: triangles • Steps: 1. Read input graph 2. Sort each edge by smaller vertex ID 3. Join edges to triads 4. Triads to triangles 2012-2013 90 Source: Fabian Hueske, Large Scale Data Analysis Beyond MapReduce, Hadoop Get Together, Feb 2012 and Stratosphere example.
  • 86. Stratosphere vs MapReduce Triangle Enumeration [2/3] MapReduce Formulation 2012-2013 91 Source: Fabian Hueske, Large Scale Data Analysis Beyond MapReduce, Hadoop Get Together, Feb 2012 and Stratosphere example.
  • 87. Stratosphere vs MapReduce Triangle Enumeration [3/3] Stratosphere Formulation 2012-2013 92 Source: Fabian Hueske, Large Scale Data Analysis Beyond MapReduce, Hadoop Get Together, Feb 2012 and Stratosphere example.
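An in-memory sketch of the four steps from these slides, with a toy edge list invented for illustration (the MapReduce and PACT formulations implement the same steps in a distributed fashion):

    from collections import defaultdict
    from itertools import combinations

    # 1. Read the input graph as undirected, edge-based data.
    edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]

    # 2. Orient each edge from its smaller to its larger vertex id,
    #    so every triangle is built exactly once.
    oriented = {tuple(sorted(e)) for e in edges}

    # 3. Join edges sharing their smaller vertex into triads.
    out_neighbors = defaultdict(list)
    for u, v in oriented:
        out_neighbors[u].append(v)
    triads = [(u, v, w)
              for u, vs in out_neighbors.items()
              for v, w in combinations(sorted(vs), 2)]  # missing edge: (v, w)

    # 4. Close triads into triangles by joining them with the edge set.
    triangles = [t for t in triads if (t[1], t[2]) in oriented]
    print(triangles)  # [(1, 2, 3), (2, 3, 4)]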
  • 88. Agenda 1. Introduction 2. Cloud Programming in Practice (The Problem) 3. Programming Models for Compute-Intensive Workloads 4. Programming Models for Big Data 5. Summary 2012-2013 93
  • 89. Conclusion Take-Home Message • Programming model = computer system abstraction • Programming Models for Compute-Intensive Workloads • Many trade-offs, few dominant programming models • Models: bags of tasks, workflows, master/worker, BSP, … • Programming Models for Big Data • Big data programming models have ecosystems • Many trade-offs, many programming models • Models: MapReduce, Pregel, PACT, Dryad, … • Execution engines: Hadoop, Koala+MR, Giraph, PACT/Nephele, Dryad, … • Reality check: cloud programming is maturing February 20, 2013 http://www.flickr.com/photos/dimitrisotiropoulos/4204766418/ 94
  • 90. Reading Material (Really Active Field) • Workloads • Alexandru Iosup, Dick H. J. Epema: Grid Computing Workloads. IEEE Internet Computing 15(2): 19-26 (2011) • The Fourth Paradigm • ―The Fourth Paradigm‖, http://research.microsoft.com/en-us/collaboration/fourthparadigm/ • Programming Models for Compute-Intensive Workloads • Alexandru Iosup, Mathieu Jan, Omer Ozan Sonmez, Dick H. J. Epema: The Characteristics and Performance of Groups of Jobs in Grids. Euro-Par 2007: 382-393 • Ostermann et al., On the Characteristics of Grid Workflows, CoreGRID Integrated Research in Grid Computing (CGIW), 2008. http://www.pds.ewi.tudelft.nl/~iosup/wftraces07chars_camera.pdf • Corina Stratan, Alexandru Iosup, Dick H. J. Epema: A performance study of grid workflow engines. GRID 2008: 25-32 • Programming Models for Big Data • Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150 • Jeffrey Dean, Sanjay Ghemawat: MapReduce: a flexible data processing tool. Commun. ACM 53(1): 72-77 (2010) • Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX conference on Operating systems design and implementation (OSDI'08). USENIX Association, Berkeley, CA, USA, 29-42. • Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, Ion Stoica: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. EuroSys 2010: 265-278 • Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, Russell Sears: MapReduce Online. NSDI 2010: 313-328 • Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski: Pregel: a system for large-scale graph processing. SIGMOD Conference 2010: 135-146 • Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, Daniel Warneke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. SoCC 2010: 119-130 2012-2013 95

Editor's Notes

  1. A1: Fourth Paradigm = from “How to test this hypothesis?” to “What does the data set show? What interesting correlation or data mining insight can I get from this complex (multi-)data collection?” Source: Dennis Gannon and Dan Reed, “Parallelism and the Cloud”, The Fourth Paradigm, p.131--136. http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf A2: Main danger: that we quit using theory altogether. Other dangers: biased data, privacy, reporting power equals truth power (non-falsifiable theories and strong-hand governments can be a real danger, see Communist regimes and their propaganda services).
  2. https://stratosphere.eu/wiki/doku.php/wiki:pactpm CROSS = Cartesian product of the records of two inputs. MATCH = resembles an equality join where the keys of both inputs are the attributes to join on. COGROUP = can be seen as a Reduce on two inputs.
  3. https://stratosphere.eu/wiki/doku.php/wiki:pactpm CROSS = Cartesian product of the records of two inputs. MATCH = resembles an equality join where the keys of both inputs are the attributes to join on. COGROUP = can be seen as a Reduce on two inputs. A: Optimize data flows (do not store everything in HDFS, like MapReduce; stream and compress data). Optimize the 2nd-order function. Reduce code complexity by adding another level of indirection. Etc.