Building a High-Level Dataflow System
    on top of MapReduce: The Pig
              Experience


                   Tilani Gunawardena
Content
•   Introduction
•   Background
•   System Overview
•   The System & Type Inference
•   Compilation To Map-Reduce
•   Plan Execution
•   Streaming
•   Performance
•   Adoption
•   Project Experience
•   Future Work
Introduction
•   Internet companies are swimming in data
      • TBs/day for Yahoo! or Google
      • PBs/day for Facebook

•   Data
     – unstructured elements
         • web page text,images
     – structured elements
         • web page click records , extracted entity-relationship models

•   Processing
     – Filter, join, count


•   Data warehousing??
     – Scale: often not scalable enough
     – Price: prohibitively expensive at web scale
     – SQL:
          • High-level declarative approach
          • Little control over execution method

•   The Map-Reduce appeal??
     – Scale: scalable due to simpler design; explicit programming model
     – Price: runs on cheap commodity hardware
     – SQL: not required; the explicit programming model gives control over execution
MapReduce Disadvantages
• Does not directly support complex N-step
  dataflow
• Lacks explicit support for combined processing of
  multiple data sets
  – joins and other data matching operations
• Frequently needed data manipulation primitives
  must be coded by hand
  – Filtering, aggregation, join, projection, sorting
Pig
• Pig's language, Pig Latin, occupies a sweet
  spot between the MapReduce framework and SQL.
• Defines a new language to allow better
  control over large-scale data processing
• Allows programmers to avoid writing map
  and reduce code, which is at too low
  a level
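
As a first taste, a minimal sketch of a complete Pig Latin program in the spirit of the Pig paper's running example (the input file and field names are illustrative):

    urls = LOAD 'urls.txt' AS (url, category, pagerank);
    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
    result = FOREACH big_groups GENERATE group, AVG(good_urls.pagerank);
    STORE result INTO 'bigCategories';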
Pig Latin: Data Types
• Rich and Simple Data Model

Simple Types:
int, long, double, chararray(string), bytearray

Complex Types:
• Map: an associative array; key: chararray; value: any type
• Tuple: collection of fields, e.g. ('apple', 'mango')
• Bag: collection of tuples
{      ('apple', 'mango')
    ('apple', ('red', 'yellow'))
}
Pig Latin: Input/Output Data
Input:
queries = LOAD 'data.txt'
USING BinStorage
AS (url, category, pagerank);
Output:
STORE result INTO 'myoutput';

BinStorage: binary storage function in Pig
Pig Latin: Expression Table
Pig Latin: General Syntax
• Discarding Unwanted Data: FILTER
• Comparison operators such
  as ==, eq, !=, neq
• Logical connectors AND, OR, NOT
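
An illustrative FILTER sketch combining these operators (it reuses the queries relation loaded earlier; the constants are made up):

    good_queries = FILTER queries BY category eq 'news' AND pagerank > 0.5;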
Pig Latin: Type Declaration
• Pig supports three options for declaring the data types of
  fields
   – No data types are declared: the default is to treat all fields as
     bytearray.
    Ex: a = LOAD 'data' USING BinStorage AS (user);

   – The second option is to declare types explicitly as part of the
     AS clause of the LOAD:
    Ex: a = LOAD 'data' USING BinStorage AS (user:chararray);

   – The third option is for the load function itself to provide the schema
     information, which accommodates self-describing data formats
     such as JSON (see the sketch below)
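
An illustrative sketch of the third option, assuming a hypothetical JsonLoader function that reads the schema from the data itself:

    a = LOAD 'data.json' USING JsonLoader();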
Pig Latin: Lazy Conversion of Types
• When Pig does need to cast a bytearray to another type because
  the program applies a type-specific operator, it delays that cast to
  the point where it is actually necessary, as in the sketch below.
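
The original slide shows the example program as an image; a sketch matching the description below (file and field names are assumptions):

    players = LOAD 'data' USING BinStorage AS (name, status, earnedPoints, possiblePoints);
    active = FILTER players BY status == 'active';
    result = FOREACH active GENERATE name, earnedPoints / possiblePoints;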



• status will need to be cast to a chararray
• earnedPoints and possiblePoints will need to be cast to double
• These casts are not done when the data is loaded
• They are done as part of the comparison and division
  operations
• This avoids casting values that are removed by the filter before the
  result of the cast is used.
Pig Latin-Operators

•   LOAD : LOAD 'data' [USING function] [AS schema];
    where, 'data' : name of file or directory
           USING, AS : keywords
            function : load function.
            schema : loader produces data of the type specified by schema. If data does not
    conform to the schema, an error is generated.
          ex: LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
          LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);



•   STORE : stores results to the file system
     – STORE alias INTO 'directory' [USING function];
         where, alias : name of relation
                   INTO, USING : keywords
                   'directory' : storage directory's name. If the directory already exists,
                                 the operation fails
                   function : store function.
         ex: STORE result INTO 'myOutput';
         STORE query_revenues INTO 'myoutput' USING myStore();
FOREACH
• Generates data transformations based on columns of data.
• Eg: X = FOREACH A GENERATE a1, a2;


expanded_queries = FOREACH queries GENERATE userId,
   expandQuery(queryString);
               -----------------
expanded_queries = FOREACH queries GENERATE userId,
   FLATTEN(expandQuery(queryString));
(FLATTEN unnests the bag returned by expandQuery, producing one output tuple per expansion.)
GROUP / COGROUP
• Groups the data in one or more relations.
• GROUP used for 1 relation
• COGROUP used for 1 to 127 relations
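
Illustrative sketches of both forms (relation and field names are assumptions):

    byuser = GROUP clicks BY userid;
    grouped = COGROUP clicks BY userid, views BY userid;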
JOIN (inner)
• Performs inner join of 2 or more relations based on common field values.

Eg: If A contains – { (1,2,3), (4,2,1) }; If B contains – {(1,3),(4,6),(4,9)}
     X = JOIN A BY a1, B BY b1;
    (1,2,3,1,3)
     (4,2,1,4,6)
     (4,2,1,4,9)

ORDER BY
• Sorts relation based on 1 or more fields

Eg: X = ORDER A BY a3 DESC;
  (1,2,3)
  (4,2,1)
System Overview
• A step-by-step dataflow language where
  computation steps are chained together through
  the use of variables
• The use of high-level transformations, e.g.,
  GROUP, FILTER
• The ability to specify schemas as part of issuing a
  program
• The use of user-defined functions (e.g., top10)
Pig allows three modes of user interaction:
• Interactive mode: the user is presented with an
  interactive shell (called Grunt), which accepts Pig
  commands (see the sketch below).
• Batch mode: a user submits a prewritten script
  containing a series of Pig commands
• Embedded mode: Pig is also provided as a Java
  library allowing Pig Latin commands to be
  submitted via method invocations from a Java
  program
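
An illustrative Grunt session (file and field names are assumptions; DUMP prints a relation to the console):

    grunt> clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
    grunt> goodclicks = FILTER clicks BY viewedat IS NOT NULL;
    grunt> DUMP goodclicks;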
Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
   –Logical to Physical compilation
   –Physical to Map-Reduce Compilation
   –Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
➢ Parser
      •   Verifies that the program is syntactically correct and that all referenced variables are defined.
      •   Type checking
      •   Schema inference
      •   Verifies the ability to instantiate classes corresponding to UDFs
      •   Confirms the existence of streaming executables

    – Output of the parser: a logical plan
      • One-to-one correspondence between Pig Latin statements & logical operators
      • Arranged in a directed acyclic graph (DAG)

➢ Logical Optimizer
  • Logical optimizations
   – e.g., projection pushdown is carried out
Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
   –Logical to Physical compilation
   –Physical to Map-Reduce Compilation
   –Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
Map-Reduce Compiler:Logical to Physical compilation(1)

➢ Map-Reduce Compiler: LOGICAL PLAN => PHYSICAL PLAN => MAP-REDUCE PLAN
Map-Reduce Compiler:Logical to Physical compilation(2)
Map-Reduce Compiler - compiles the logical plan into a series of Map-Reduce jobs

(CO)GROUP operator becomes a series of
 3 physical operators:
➢ Local and global rearrange operators –
   group tuples that are on the same machine and
   adjacent in the data stream;
   rearrange means hashing or sorting by key
• Package operator - places adjacent same-key
   tuples into a single-tuple package

JOIN operator handled in 2 ways:
➢ rewritten into COGROUP followed by a FOREACH
  operator to perform "flattening", giving a
  parallel hash join or sort-merge join;
➢ fragment-replicate join
    – which executes entirely in the map stage or entirely
    in the reduce stage
Map-Reduce Compiler:Logical to Physical compilation(3)
Example for (CO)GROUP Conversion:
         • (1,R),(2,G) in stream A
         • (1,B), (2,Y) in stream B

•     Local Rearrange Operator :
       – Eg: Converts tuple (1,R) to {1,(1,R)}

•    Global Rearrange operator: Sort
      – Eg: Reducer 1 : {1,{(1,R),(1,B)}}
      Reducer 2: {2,{(2,G),(2,Y)}}

•    Package Operator:
       – Places same-key tuples into single-tuple
      package
       – Eg: Reducer 1: {1,{(1,R)},{(1,B)}}
      Reducer 2: {2,{(2,G)},{(2,Y)}}
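
An illustrative Pig Latin statement that would drive this conversion ($0 refers to the first field of each tuple):

    g = COGROUP A BY $0, B BY $0;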
Map-Reduce Compiler:Logical to Physical compilation(4)
3 types of join operators (see the sketch after this list):
➢ Fragment-replicate join
    • Joins a huge table & a very small table
    • Huge table is fragmented and
      distributed to mappers (or reducers)
    • Small table is replicated to each machine
    • Either in the map or the reduce stage
➢ Parallel hash join
    • Map stage - hashes tables by join key
    • Reduce stage - joins fragments of tables
        – Data with the same hash values are assigned to 1 reducer
➢ Sort-merge join
    • Both inputs sorted on the join key
    • Each node gets a fragment of the sorted table; the same keys go to the same node
    • Each node performs the join; a map stage alone is sufficient
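
Pig also lets a program request the fragment-replicate strategy explicitly; an illustrative sketch (relation and field names are assumptions, and the small relation must fit in memory):

    joined = JOIN big BY key, tiny BY key USING 'replicated';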
Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
   –Logical to Physical compilation
   –Physical to Map-Reduce Compilation
   –Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
Map-Reduce Compiler:Physical to Map-Reduce Compilation(1)

➢ Physical to Map-Reduce Compilation:

•   Physical operators are assigned to Hadoop
    stages so as to minimize the number of reduce stages

• Local rearrange operator –
  simply annotates tuples with keys and stream identifiers,
  and lets Hadoop's local sort stage do the work

•   Global rearrange operators are removed;
    implemented by Hadoop's shuffle and
    merge stages

•   Load and store operators are removed;
    the Hadoop framework reads and writes the data
Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
   –Logical to Physical compilation
   –Physical to Map-Reduce Compilation
   –Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
Map-Reduce Compiler:Branching Plans(1)
➢ Branching Plans
• More than 1 STORE command – one for each branch of the split
• Data read once; processed in multiple ways
• Risk of data spilling to disk
• SPLIT operator: feeds a copy of its input to each nested sub-plan
Example 1: Logical SPLIT command – splits the table
• Map-only plan
   clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
   SPLIT clicks INTO
          pages IF pageid IS NOT NULL, // corresponds to 'FILTER' of 1st sub-plan
          links IF linkid IS NOT NULL; // corresponds to 'FILTER' of 2nd sub-plan
  // 1st sub-plan:
    cpages = FOREACH pages GENERATE userid, CanonicalizePage(pageid) AS page, viewedat;
    STORE cpages INTO 'pages';
// 2nd sub-plan:
    clinks = FOREACH links GENERATE userid, CanonicalizeLink(linkid) AS clink, viewedat;
    STORE clinks INTO 'links';
Map-Reduce Compiler:Branching Plans(2)
Example 2:
• Split propagates across the map/reduce boundary
• No logical SPLIT operator
• Compiler inserts a physical SPLIT operator
• MULTIPLEX operator: routes tuples to the correct sub-plan;
                      in the Reduce stage only.

   clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
    goodclicks = FILTER clicks BY viewedat IS NOT NULL;
    // 1st sub-plan: grouped by 'pageid'
       bypage = GROUP goodclicks BY pageid;
       cntbypage = FOREACH bypage GENERATE
       group, COUNT(goodclicks);
       STORE cntbypage INTO 'bypage';
   // 2nd sub-plan: grouped by 'linkid'
       bylink = GROUP goodclicks BY linkid;
       cntbylink = FOREACH bylink GENERATE group, COUNT(goodclicks);
       STORE cntbylink INTO 'bylink';
Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
   –Logical to Physical compilation
   –Physical to Map-Reduce Compilation
   –Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
Map-Reduce Optimizer

Performs early partial aggregation for distributive or algebraic aggregation functions,
which decompose into three steps. E.g., for the function AVERAGE:
    a) initial
       e.g. generate (sum, count) pairs
       assigned to the Map stage
    b) intermediate
       e.g. combine n (sum, count) pairs into a single pair
       assigned to the Combine stage
    c) final
       e.g. combine n (sum, count) pairs and take the quotient
       assigned to the Reduce stage
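
An illustrative query whose aggregate Pig can decompose this way (it reuses the urls relation from the earlier sketch; AVG is one of Pig's algebraic built-ins):

    bycat = GROUP urls BY category;
    avgranks = FOREACH bycat GENERATE group, AVG(urls.pagerank);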
Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
   –Logical to Physical compilation
   –Physical to Map-Reduce Compilation
   –Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
Hadoop Job Manager


• Map-Reduce jobs are topologically sorted and submitted to Hadoop for
  execution
• A Java jar file is generated containing the Map and Reduce
  implementation classes and UDFs
• The Map and Reduce classes contain general-purpose
  dataflow execution engines
• Monitors execution and generates periodic reports
• Warnings and errors are logged and reported
➢ Plan Execution
   • Flow Control
       – Nested Programs
   • Memory Management
➢ Streaming
   • Flow Control
PLAN EXECUTION - FLOW CONTROL
•   Pig itself executes the map and reduce stages of the physical plan
•   Assume that data flows downward in an execution plan

•   To control the movement of tuples through the execution pipeline, 2 models are available
      – the Push model & the Pull (iterator) model
1) Push Model:
     Eg: Operator A pushes data to B, which operates on it and pushes the result to C.
         (A, B and C are physical operators)

     Difficult to implement for:
          •     UDFs with multiple inputs
          •     Binary operators like fragment-replicate join

2) Pull Model:

     Eg: Operator C asks B for its next data item.
         If B has nothing pending to return, it asks A.
         When A returns a data item, B operates on it and returns the result to C.


Advantages:
➢ Single-threaded implementation: avoids context-switching overhead
➢ Simple APIs for UDFs
Drawbacks:
➢ Operations over a bag nested inside a tuple may lead to memory overflow
➢ If the dataflow graph has multiple sinks, operators at branch points may be required to buffer an
   unbounded number of tuples
PLAN EXECUTION - FLOW CONTROL (2)

Solution:
When asked to produce a tuple, an operator may:
   a) return a tuple;
   b) declare itself finished; or
   c) return a pause signal, indicating that it is not finished but cannot currently produce an output tuple.
PLAN EXECUTION - FLOW CONTROL (3)
NESTED PROGRAMS:
•   Pig operators can be invoked over bags nested within tuples
•   For example (to compute the number of distinct pages and links visited by each user):
     clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
         (Alice,Page1,Link1,Site1)
         (John,Page1,Link2,Site2)
         (John,Page2,Link2,Site3)
     byuser = GROUP clicks BY userid;
         (Alice, {(Alice,Page1,Link1,Site1)})
         (John, {(John,Page1,Link2,Site2), (John,Page2,Link2,Site3)})
     result = FOREACH byuser
     {
             uniqPages = DISTINCT clicks.pageid;
             uniqLinks = DISTINCT clicks.linkid;
             GENERATE group, COUNT(uniqPages),COUNT(uniqLinks);
     };
        (Alice, {(Alice,Page1,Link1,Site1)} , 1 , 1)
        (John, {(John,Page1,Link2,Site2), (John,Page2,Link2,Site3)} , 2 , 1)
PLAN EXECUTION - FLOW CONTROL (4)

•   The outer operator graph contains the FOREACH operator
•   It contains a nested operator graph of 2 pipelines
•   Each pipeline contains DISTINCT and COUNT operators
•   FOREACH requests a tuple T from the PACKAGE operator
•   It places a cursor on T's bag of click tuples for the 1st DISTINCT-COUNT pipeline
•   It requests a tuple from the bottom of the pipeline (the COUNT operator)
•   The process is repeated for the second pipeline
•   The FOREACH operator then constructs and returns the output tuple
PLAN EXECUTION - FLOW CONTROL
•   When the nested plan is a single branching pipeline:
    clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
        (Alice,Page1,Link1,Site1)
        (John,Page1,Link2,Site2)
        (John,Page2,Link2,NULL)
    byuser = GROUP clicks BY userid;
        (Alice, {(Alice,Page1,Link1,Site1)})
        (John, {(John,Page1,Link2,Site2), (John,Page2,Link2,NULL)})
    result = FOREACH byuser
    {
        fltrd = FILTER clicks BY viewedat IS NOT NULL;
        uniqPages = DISTINCT fltrd.pageid;
        uniqLinks = DISTINCT fltrd.linkid;
        GENERATE group, COUNT(uniqPages), COUNT(uniqLinks);
    };
        (Alice, {(Alice,Page1,Link1,Site1)} , 1 , 1)
        (John, {(John,Page1,Link2,Site2)} , 1 , 1)

A more complex situation arises when the nested plan is not two independent pipelines but rather a
single branching pipeline, as here, where both DISTINCT branches hang off the same FILTER.
Solution:
•   Pig currently handles this case by duplicating the FILTER operator, producing two independent
    pipelines that are executed as explained above.
➢ Plan Execution
   • Flow Control
       – Nested Programs
   • Memory Management
➢ Streaming
   • Flow Control
PLAN EXECUTION - Memory Management

• Like Hadoop, Pig is implemented in Java.
• Java poses memory management problems during query processing
   – Java does not allow the developer to control memory
     allocation and deallocation directly
• A naive option is to increase the JVM memory size limit
  beyond the physical memory size, and let the virtual
  memory manager take care of staging data between
  memory and disk.
   – Problem: performance degradation.



• It is better to return an "out-of-memory" error
   – the administrator can adjust the memory management
     parameters and re-submit the program
PLAN EXECUTION - Memory Management

•   Memory overflow is mostly due to large bags of tuples

•   Java's MemoryPoolMXBean class notifies Pig of low-memory situations.
    When notified, Pig spills excess bags to disk.

•   Pig estimates bag sizes by sampling a few tuples

•   The memory manager maintains a list of Pig bags created in the same JVM,
    using a linked list of Java WeakReferences

•   WeakReferences allow bags that are no longer in use to be garbage collected
➢ Plan Execution
   • Flow Control
       – Nested Programs
   • Memory Management
➢ Streaming
   • Flow Control
STREAMING – FLOW CONTROL
•   Pig allows user-defined functions (UDFs)
     – UDFs must be written in Java and must conform to Pig's UDF interface
     – UDFs behave synchronously

Streaming:
• Allows data to be pushed through external executables (see the sketch after this list)
     – users are able to intermix relational operations like grouping and filtering with custom or
        legacy executables.
• A streaming executable behaves asynchronously.
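
An illustrative use of the STREAM operator (clean.pl is a hypothetical script; backquotes wrap the command):

    clean = STREAM clicks THROUGH `perl clean.pl`;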



Challenges in implementing streaming in Pig:
➢ Fitting it into the iterator model of Pig's execution pipeline
    • Because of the asynchronous behavior of the user's executable
    • The STREAM operator that wraps the executable cannot simply pull tuples synchronously,
      as it does with other operators, because it does not know what state the executable is in.
    • There may be no output because:
           – the executable is waiting to receive more input: the stream operator needs to push
             new data; or
           – the executable is still busy processing prior inputs: the stream operator should wait.
• Under the single-threaded operator execution model, a deadlock can
  occur
   – the Pig operator is waiting for the external executable to
     consume a new input tuple, while at the same time the
     executable is waiting for its output to be consumed

Solution:
The STREAM operator:
• Creates 2 additional threads - one to feed data to the executable and another to
   consume its output
• Blocks until a tuple is available on the executable's output queue or until the executable
   terminates
• If space is available in the input queue, places a tuple from the parent operator into it
Performance
• In the initial implementation of Pig, functionality and
  proof of concept were considered more
  important than performance
• As Pig was adopted within Yahoo!, better
  performance quickly became a priority.

• Pig Mix - a publicly available benchmark used to
  measure performance on a regular basis, so that
  the effects of individual code changes on
  performance could be understood.
Benchmark Results
Pig Mix benchmark
• September 11, 2008:
     o Initial Apache open-source release
•   November 11, 2008:
     – Enhanced type system
     – Rewrote execution pipeline
     – Combiner enhanced
•   January 20, 2009:
     – Buffering during data parsing
     – Fragment-replicate join algorithm
•   February 23, 2009:
     – Rework of partitioning function used in ORDER BY to ensure more balanced
       distribution of keys to reducers
•   April 20, 2009:
     – Branching execution plans
•   Vertical axis: ratio of total running time for 12 Pig programs
                   to the corresponding Map-Reduce programs
•   The current performance ratio is 1.5 - a reasonable trade-off point between execution time and
    code development/maintenance effort.
Pros & Cons
• The step-by-step method of creating a
  program in Pig is much cleaner and simpler to
  use than the single block method of SQL. It is
  easier to keep track of what your variables
  are, and where you are in the process of
  analyzing your data.
• With the various interleaved clauses in SQL, it
  is difficult to know what is actually happening
  sequentially.
Pros & Cons

Pros:
• Explicit dataflow
• Retains properties of Map-Reduce
• Scalability
• Fault tolerance
• Multi-way processing
• Open source

Cons:
• Column-wise storage structures are missing
• Memory management
• No facilitation for non-Java users
• Limited optimization
• No GUI for flow graphs
Future Work
• Query optimization
   – Currently rule-based optimizer for plan rearrangement and join selection
   – Cost-based in the future
• Non-Java UDFs
• SQL interface
• Grouping and joining of pre-partitioned/sorted data.
   – Avoid data shuffling for grouping and joining
   – Building metadata facilities to keep track of data layout
• Skew handling.
    – For load balancing
Summary
• Big demand for parallel data processing
   – Programmers like dataflow pipes over static files
• Ease of programming.
• UDF -Users can create their own functions to do special-
  purpose processing.
• Optimization opportunities :The way in which tasks are
  encoded permits the system to optimize their execution
  automatically, allowing the user to focus on semantics rather
  than efficiency.
• Open source

Pig Latin : Sweet spot between map-reduce and SQL
Related Work
• Sawzall
  – Data processing language on top of map-reduce
  – Rigid structure of filtering followed by aggregation
• Hive
  – SQL-like language on top of Map-Reduce
• DryadLINQ
  – SQL-like language on top of Dryad
Thank You!

