Pig Experience
1. Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience
Tilani Gunawardena
2. Content
• Introduction
• Background
• System Overview
• The Type System & Type Inference
• Compilation To Map-Reduce
• Plan Execution
• Streaming
• Performance
• Adoption
• Project Experience
• Future Work
3. Introduction
• Internet companies are swimming in data
• TBs/day for Yahoo! or Google
• PBs/day for Facebook
• Data
– Unstructured elements
• Web page text, images
– Structured elements
• Web page click records, extracted entity-relationship models
• Processing
– Filter, join, count
• Data warehousing?
– Scale: often not scalable enough
– Price: prohibitively expensive at web scale
– SQL:
• High-level declarative approach
• Little control over execution method
• The Map-Reduce appeal?
– Scale: scalable due to simpler design; explicit programming model
– Price: runs on cheap commodity hardware
– SQL
4. MapReduce Disadvantages
• Does not directly support complex N-step dataflows
• Lacks explicit support for combined processing of multiple data sets
– Joins and other data-matching operations
• Frequently needed data manipulation primitives must be coded by hand
– Filtering, aggregation, join, projection, sorting
5. Pig
• Pig's language, Pig Latin, occupies a sweet spot between the Map-Reduce framework and SQL
• Defines a new language that gives better control over large-scale data processing
• Spares database programmers from writing map and reduce code, which is too low-level
6. Pig Latin: Data Types
• Rich and simple data model
Simple types:
int, long, double, chararray (string), bytearray
Complex types:
• Map: an associative array; key: chararray; value: any type
• Tuple: collection of fields, e.g. ('apple', 'mango')
• Bag: collection of tuples
{ ('apple', 'mango'),
('apple', ('red', 'yellow'))
}
7. Pig Latin: Input/Output Data
Input:
queries = LOAD 'data.txt'
USING BinStorage
AS (url, category, pagerank);
Output:
STORE result INTO 'myoutput';
BinStorage: binary storage function in Pig
9. Pig Latin: General Syntax
• Discarding unwanted data: FILTER
• Comparison operators such as ==, eq, !=, neq
• Logical connectors AND, OR, NOT
10. Pig Latin: Type Declaration
• Pig supports three options for declaring the data types of fields
– No data types are declared: the default is to treat all fields as bytearray.
Ex: a = LOAD 'data' USING BinStorage AS (user);
– Types are provided explicitly as part of the AS clause during the LOAD:
Ex: a = LOAD 'data' USING BinStorage AS (user:chararray);
– The load function itself provides the schema information, which accommodates self-describing data formats such as JSON
11. Pig Latin: Lazy Conversion of Types
• When Pig needs to cast a bytearray to another type because the program applies a type-specific operator, it delays that cast to the point where it is actually necessary.
• In an example program that filters on status and divides earnedPoints by possiblePoints:
• status will need to be cast to a chararray
• earnedPoints and possiblePoints will need to be cast to double
• These casts will not be done when the data is loaded
• They will be done as part of the comparison and division operations
• This avoids casting values that are removed by the filter before the result of the cast is used.
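The effect of delaying casts can be sketched in a few lines of Python. This is only an illustration, not Pig's implementation: the sample records, the 'active' status value and the helper names to_chararray and to_double are assumptions. Fields stay as raw bytes until the comparison or the division needs a typed value, so rows dropped by the filter never pay for the double casts.

```python
# Illustrative sketch of lazy type conversion (all names and data are assumptions).
cast_count = 0  # how many casts were actually performed

def to_chararray(raw: bytes) -> str:
    global cast_count
    cast_count += 1
    return raw.decode("utf-8")

def to_double(raw: bytes) -> float:
    global cast_count
    cast_count += 1
    return float(raw)

# Records as loaded: every field is still a bytearray (bytes here).
records = [
    (b"active", b"45", b"50"),
    (b"inactive", b"10", b"40"),
    (b"active", b"30", b"60"),
]

def run(rows):
    results = []
    for status, earned, possible in rows:
        # The comparison forces a cast of status only.
        if to_chararray(status) != "active":
            continue  # earned/possible are never cast for filtered-out rows
        # The division forces the two double casts.
        results.append(to_double(earned) / to_double(possible))
    return results

ratios = run(records)
print(ratios, cast_count)
```

Casting everything at load time would cost 9 casts for these three records; the lazy version performs 7, because the filtered-out row only has its status cast.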
12. Pig Latin-Operators
• LOAD: LOAD 'data' [USING function] [AS schema];
where 'data': name of file or directory
USING, AS: keywords
function: load function
schema: the loader produces data of the type specified by the schema. If the data does not conform to the schema, an error is generated.
ex: LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
• STORE: stores results to the file system
– STORE alias INTO 'directory' [USING function];
where alias: name of relation
INTO, USING: keywords
'directory': name of the storage directory. If the directory already exists, the operation fails
function: store function
ex: STORE result INTO 'myOutput';
STORE query_revenues INTO 'myoutput' USING myStore();
13. FOREACH
• Generates data transformations based on columns of data.
• Eg: X = FOREACH A GENERATE a1, a2;
expanded_queries = FOREACH queries GENERATE userId,
expandQuery(queryString);
-----------------
expanded_queries = FOREACH queries GENERATE userId,
FLATTEN(expandQuery(queryString));
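The difference between the two FOREACH statements above can be sketched in Python. The expand_query function below is a hypothetical stand-in for the expandQuery UDF, and the sample queries are made up: without FLATTEN each input tuple yields one output tuple holding a nested bag, while with FLATTEN the bag is un-nested into one output tuple per element.

```python
# Hypothetical stand-in for the expandQuery UDF: returns a bag (list) of expansions.
def expand_query(query_string):
    return [query_string, query_string + " reviews"]

# Assumed sample data: (userId, queryString)
queries = [("alice", "lakers"), ("bob", "ipod")]

# Without FLATTEN: one output tuple per input tuple, containing a nested bag.
nested = [(user, expand_query(q)) for user, q in queries]

# With FLATTEN: the bag is un-nested, one output tuple per bag element.
flattened = [(user, e) for user, q in queries for e in expand_query(q)]

print(nested)
print(flattened)
```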
14. GROUP / COGROUP
• Groups the data in one or more relations.
• GROUP used for 1 relation
• COGROUP used for 1 to 127 relations
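A rough Python sketch of what GROUP and COGROUP produce, under the assumption that each relation is a list of tuples grouped on a field position. The cogroup helper and the sample relations A and B are illustrative only; each output tuple is the key followed by one bag per input relation.

```python
from collections import defaultdict

def cogroup(*relations):
    """Each relation is (rows, key_index). Returns (key, bag_1, ..., bag_n) tuples."""
    grouped = defaultdict(lambda: [[] for _ in relations])
    for i, (rows, key_index) in enumerate(relations):
        for row in rows:
            grouped[row[key_index]][i].append(row)
    return [(key, *bags) for key, bags in sorted(grouped.items())]

# Assumed sample relations
A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (4, 6), (4, 9)]

grouped_a = cogroup((A, 0))          # GROUP A BY first field
cogrouped = cogroup((A, 0), (B, 0))  # COGROUP A BY first field, B BY first field
print(grouped_a)
print(cogrouped)
```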
15. JOIN (inner)
• Performs inner join of 2 or more relations based on common field values.
Eg: If A contains – { (1,2,3), (4,2,1) }; If B contains – {(1,3),(4,6),(4,9)}
X = JOIN A BY a1, B BY b1;
(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)
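The join result above can be reproduced with a small hash join in Python. This is a sketch for checking the example, not a description of how Pig executes JOIN; the inner_join helper and its positional key arguments are assumptions.

```python
from collections import defaultdict

def inner_join(left, left_key, right, right_key):
    """Inner-join two lists of tuples on the fields at the given positions."""
    # Index the right relation by its join key.
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Emit one concatenated tuple per matching (left, right) pair.
    return [l + r for l in left for r in index.get(l[left_key], [])]

A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (4, 6), (4, 9)]
X = inner_join(A, 0, B, 0)  # corresponds to JOIN A BY a1, B BY b1
print(X)
```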
ORDER BY
• Sorts relation based on 1 or more fields
Eg: X = ORDER A BY a3 DESC;
(1,2,3)
(4,2,1)
17. • A step-by-step dataflow language where computation steps are chained together through the use of variables
• The use of high-level transformations, e.g., GROUP, FILTER
• The ability to specify schemas as part of issuing a program
• The use of user-defined functions (e.g., top10)
18. Pig allows three modes of user interaction:
• Interactive mode: the user is presented with an interactive shell (called Grunt), which accepts Pig commands.
• Batch mode: a user submits a prewritten script containing a series of Pig commands.
• Embedded mode: Pig is also provided as a Java library, allowing Pig Latin commands to be submitted via method invocations from a Java program.
19. Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
–Logical to Physical compilation
–Physical to Map-Reduce Compilation
–Branching Plans
• Map-Reduce Optimizer
• Hadoop Job Manager
20. Parser
• Verifies that the program is syntactically correct and that all referenced variables are defined.
• Type checking
• Schema inference
• Verifies the ability to instantiate classes corresponding to UDFs
• Confirms the existence of streaming executables
– Output of parser: logical plan
• One-to-one correspondence between Pig Latin statements and logical operators.
• Arranged in a directed acyclic graph (DAG)
Logical Optimizer
• Logical optimizations, such as projection pushdown, are carried out
22. Map-Reduce Compiler:Logical to Physical compilation(1)
Map-Reduce Compiler: logical plan => physical plan => Map-Reduce plan
23. Map-Reduce Compiler:Logical to Physical compilation(2)
Map-Reduce Compiler: compiles the logical plan into a series of Map-Reduce jobs
(CO)GROUP operator becomes a series of 3 physical operators:
• Local and global rearrange operators: group tuples on the same machine and adjacent in the data stream
– Rearrange: hashing or sorting by key
• Package operator: places adjacent same-key tuples into a single-tuple package
JOIN operator is handled in 2 ways:
• Rewritten into COGROUP followed by a FOREACH operator that performs "flattening", to get a parallel hash-join or sort-merge join
• Fragment-replicate join, which executes entirely in the map stage or entirely in the reduce stage
24. Map-Reduce Compiler:Logical to Physical compilation(3)
Example for (CO)GROUP Conversion:
• (1,R),(2,G) in stream A
• (1,B), (2,Y) in stream B
• Local Rearrange Operator :
– Eg: Converts tuple (1,R) to {1,(1,R)}
• Global Rearrange operator: Sort
– Eg: Reducer 1 : {1,{(1,R),(1,B)}}
Reducer 2: {2,{(2,G),(2,Y)}}
• Package Operator:
– Places same-key tuples into single-tuple
package
– Eg: Reducer 1: {1,{(1,R)},{(1,B)}}
Reducer 2: {2,{(2,G)},{(2,Y)}}
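The three steps of the (CO)GROUP conversion can be traced in a single-process Python sketch using the same two streams. This is a simplified model: the function names are illustrative, a plain sort stands in for the distributed shuffle, and the stream ids 0 and 1 are an assumption about how tuples are tagged.

```python
from itertools import groupby

stream_a = [(1, "R"), (2, "G")]
stream_b = [(1, "B"), (2, "Y")]

def local_rearrange(stream, stream_id):
    # Annotate each tuple with its key and the id of the stream it came from.
    return [(t[0], stream_id, t) for t in stream]

def global_rearrange(annotated):
    # Sort by key (then stream id) so that same-key tuples become adjacent.
    return sorted(annotated, key=lambda x: (x[0], x[1]))

def package(rearranged, num_streams):
    # Collect adjacent same-key tuples into one (key, bag_0, ..., bag_n) tuple.
    out = []
    for key, run in groupby(rearranged, key=lambda x: x[0]):
        bags = [[] for _ in range(num_streams)]
        for _, stream_id, t in run:
            bags[stream_id].append(t)
        out.append((key, *bags))
    return out

annotated = local_rearrange(stream_a, 0) + local_rearrange(stream_b, 1)
packaged = package(global_rearrange(annotated), num_streams=2)
print(packaged)
```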
25. Map-Reduce Compiler:Logical to Physical compilation(4)
3 types of join operators
Fragment-replicate join
• Joins a huge table and a very small table
• Huge table is fragmented and distributed to mappers (or reducers)
• Small table is replicated to each machine
• Either in the map or the reduce stage
Parallel hash join
• Map stage: hashes tables by join key
• Reduce stage: joins fragments of tables
– Data with the same hash value is assigned to 1 reducer
Sort-merge join
• Both inputs sorted on join key
• Each node gets a fragment of the sorted tables; same keys go to the same node
• Each node performs the join; a map step alone is sufficient
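As an illustration of the fragment-replicate idea, the Python sketch below splits the large table into fragments (one per simulated mapper) and probes an in-memory hash table built from the small table. The function name, the round-robin fragmentation and the sample tables are assumptions; in a real cluster each mapper would hold its own copy of the small table.

```python
def fragment_replicate_join(huge, huge_key, small, small_key, num_fragments):
    # Replicate: build an in-memory hash table from the small table
    # (each mapper would hold its own copy of this table).
    lookup = {}
    for row in small:
        lookup.setdefault(row[small_key], []).append(row)

    # Fragment: split the huge table into pieces, one per simulated mapper.
    fragments = [huge[i::num_fragments] for i in range(num_fragments)]

    # Each "mapper" joins its fragment against the replicated table; no shuffle needed.
    joined = []
    for fragment in fragments:
        for row in fragment:
            for match in lookup.get(row[huge_key], []):
                joined.append(row + match)
    return joined

# Assumed sample data: clicks (user, pageid) and a small pages table (pageid, name)
clicks = [("alice", 1), ("bob", 2), ("carol", 1), ("dave", 3)]
pages = [(1, "home"), (2, "search")]
result = sorted(fragment_replicate_join(clicks, 1, pages, 0, num_fragments=2))
print(result)
```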
27. Map-Reduce Compiler:Physical to Map-Reduce Compilation(1)
Physical to Map-Reduce compilation:
• Physical operators are assigned to Hadoop stages so as to minimize the number of reduce stages
• Local rearrange operator: simply annotates tuples with keys and stream identifiers, and lets the Hadoop local sort stage do the work
• Global rearrange operators are removed; they are implemented by the Hadoop shuffle and merge stages
• Load and store operators are removed; the Hadoop framework reads and writes the data
29. Map-Reduce Compiler:Branching Plans(1)
Branching Plans
• More than 1 STORE command – For each branch of split
• Data read once; Processed in multiple ways;
• Risk of data spilling to disk
• SPLIT operator :- Feeds copy of input to each nested sub-plan
Example 1: logical SPLIT command splits the table
• Map-only plan
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
SPLIT clicks INTO
pages IF pageid IS NOT NULL, -- corresponds to the FILTER of the 1st sub-plan
links IF linkid IS NOT NULL; -- corresponds to the FILTER of the 2nd sub-plan
-- 1st sub-plan:
cpages = FOREACH pages GENERATE userid, CanonicalizePage(pageid) AS page, viewedat;
STORE cpages INTO 'pages';
-- 2nd sub-plan:
clinks = FOREACH links GENERATE userid, CanonicalizeLink(linkid) AS clink, viewedat;
STORE clinks INTO 'links';
30. Map-Reduce Compiler:Branching Plans(2)
Example 2:
• Split propagates across the map/reduce boundary
• No logical SPLIT operator
• Compiler inserts a physical SPLIT operator
• MULTIPLEX operator: routes tuples to the correct sub-plan; in the reduce stage only
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
goodclicks = FILTER clicks BY viewedat IS NOT NULL;
-- 1st sub-plan: grouped by pageid
bypage = GROUP goodclicks BY pageid;
cntbypage = FOREACH bypage GENERATE group, COUNT(goodclicks);
STORE cntbypage INTO 'bypage';
-- 2nd sub-plan: grouped by linkid
bylink = GROUP goodclicks BY linkid;
cntbylink = FOREACH bylink GENERATE group, COUNT(goodclicks);
STORE cntbylink INTO 'bylink';
32. Map-Reduce Optimizer
Performs early partial aggregation for distributive or algebraic aggregation functions
e.g. for the function AVERAGE, the steps are:
a) Initial
e.g. generate (sum, count) pairs
Assigned to the map stage.
b) Intermediate
e.g. combine n (sum, count) pairs into a single pair
Assigned to the combine stage.
c) Final
e.g. combine n (sum, count) pairs and take the quotient
Assigned to the reduce stage.
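The three steps for AVERAGE can be written out as a small Python sketch. The function names initial, intermediate and final mirror the steps above but are otherwise illustrative, and the three input splits are made-up data standing in for the values seen by three map tasks.

```python
def initial(value):
    # Map stage: turn each value into a (sum, count) pair.
    return (value, 1)

def intermediate(pairs):
    # Combine stage: merge n (sum, count) pairs into a single pair.
    pairs = list(pairs)
    return (sum(s for s, _ in pairs), sum(c for _, c in pairs))

def final(pairs):
    # Reduce stage: merge the remaining pairs and take the quotient.
    total, count = intermediate(pairs)
    return total / count

# Assumed input: values seen by three separate map tasks.
splits = [[4, 8], [6], [2, 10, 6]]
combined = [intermediate(initial(v) for v in split) for split in splits]
average = final(combined)
print(combined, average)
```

Only one (sum, count) pair per split crosses the network, instead of every raw value.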
34. Hadoop Job Manager
• Map-Reduce jobs are sorted and submitted to Hadoop for execution
• A Java jar file is generated containing the Map and Reduce implementation classes and the UDFs
• The Map and Reduce classes contain general-purpose dataflow execution engines
• Monitors execution and generates periodic reports
• Warnings or errors are logged and reported
35. Plan Execution
• Flow Control
– Nested Programs
• Memory Management
Streaming
• Flow Control
36. PLAN EXECUTION - FLOW CONTROL
• Execution of a Map or Reduce stage of the physical plan by Pig
• Assume that data flows downward in an execution plan
• To control the movement of tuples through the execution pipeline, 2 models are available
– Push and pull (iterator) models
1) Push model:
Eg: Operator A pushes data to B, which operates on it and pushes the result to C.
(A, B and C are physical operators)
Difficult to implement for:
• UDFs with multiple inputs
• Binary operators like fragment-replicate join
2) Pull model:
Eg: Operator C asks B for its next data item.
If B has nothing pending to return, it asks A.
When A returns a data item, B operates on it and returns the result to C.
Advantages:
• Single-threaded implementation: avoids context-switching overhead
• Simple APIs for UDFs
Drawbacks:
• Operations over a bag nested inside a tuple may lead to memory overflow
• If the data flow graph has multiple sinks, operators at branch points may be required to buffer an unbounded number of tuples
37. PLAN EXECUTION - FLOW CONTROL (2)
Solution:
When asked to produce a tuple, an operator can respond in one of three ways:
a) Return a tuple;
b) Declare itself finished; or
c) Return a pause signal, indicating that it is not finished but is not currently able to produce an output tuple
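A minimal Python sketch of a pull-model pipeline with these three responses. The Status enum, the Source and Filter classes and the use of None to mark "input not ready" are all assumptions made for illustration; they are not Pig's actual operator interface.

```python
from enum import Enum

class Status(Enum):
    TUPLE = 1     # a tuple is returned
    FINISHED = 2  # no more tuples will ever be produced
    PAUSE = 3     # not finished, but no tuple can be produced right now

_END = object()

class Source:
    """Leaf operator; a None item stands in for 'input not ready yet'."""
    def __init__(self, items):
        self._items = iter(items)

    def next(self):
        item = next(self._items, _END)
        if item is _END:
            return Status.FINISHED, None
        if item is None:
            return Status.PAUSE, None
        return Status.TUPLE, item

class Filter:
    """Pulls from its parent until a tuple passes, or the parent pauses or finishes."""
    def __init__(self, parent, predicate):
        self.parent = parent
        self.predicate = predicate

    def next(self):
        while True:
            status, item = self.parent.next()
            if status is not Status.TUPLE or self.predicate(item):
                return status, item

def drain(op):
    # Driver: keeps asking the bottom operator; a pause hands control back to it.
    out, pauses = [], 0
    while True:
        status, item = op.next()
        if status is Status.FINISHED:
            return out, pauses
        if status is Status.PAUSE:
            pauses += 1  # a real driver could service another branch here
            continue
        out.append(item)

pipeline = Filter(Source([1, None, 2, 3, None, 4]), lambda x: x % 2 == 0)
tuples, pauses = drain(pipeline)
print(tuples, pauses)
```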
38. PLAN EXECUTION - FLOW CONTROL (3)
NESTED PROGRAMS:
• Pig operators can be invoked over bags nested within tuples
• For example (to compute the number of distinct pages and links visited by each user):
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
(Alice, Page1, Link1, Site1)
(John, Page1, Link2, Site2)
(John, Page2, Link2, Site3)
byuser = GROUP clicks BY userid;
(Alice, {(Alice, Page1, Link1, Site1)})
(John, {(John, Page1, Link2, Site2), (John, Page2, Link2, Site3)})
result = FOREACH byuser
{
uniqPages = DISTINCT clicks.pageid;
uniqLinks = DISTINCT clicks.linkid;
GENERATE group, COUNT(uniqPages), COUNT(uniqLinks);
};
(Alice, 1, 1)
(John, 2, 1)
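The nested program can be checked against the sample data with a short Python equivalent. This is only a sketch of the semantics (group by user, then count distinct pages and links inside each bag), not of how Pig executes it; variable names are illustrative.

```python
from collections import defaultdict

# Sample tuples: (userid, pageid, linkid, viewedat)
clicks = [
    ("Alice", "Page1", "Link1", "Site1"),
    ("John", "Page1", "Link2", "Site2"),
    ("John", "Page2", "Link2", "Site3"),
]

# GROUP clicks BY userid: one bag of click tuples per user.
by_user = defaultdict(list)
for click in clicks:
    by_user[click[0]].append(click)

# Nested FOREACH: DISTINCT then COUNT over each user's bag.
result = [
    (user, len({c[1] for c in bag}), len({c[2] for c in bag}))
    for user, bag in sorted(by_user.items())
]
print(result)
```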
39. PLAN EXECUTION - FLOW CONTROL (4)
• Outer operator graph contains FOREACH operator
• Contains nested operator graph of 2 pipelines
• Each pipeline contains DISTINCT and COUNT operators
• FOREACH requests tuple T from PACKAGE operator
• Places cursor on bag of click tuples for 1st DISTINCT-COUNT operator
• Requests tuple from the bottom of pipeline (COUNT operator)
• Process repeated for second pipeline
• FOREACH operator constructs and returns output tuple
40. PLAN EXECUTION - FLOW CONTROL
• When the nested plan is a single branching pipeline:
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
(Alice, Page1, Link1, Site1)
(John, Page1, Link2, Site2)
(John, Page2, Link2, NULL)
byuser = GROUP clicks BY userid;
(Alice, {(Alice, Page1, Link1, Site1)})
(John, {(John, Page1, Link2, Site2), (John, Page2, Link2, NULL)})
result = FOREACH byuser
{
fltrd = FILTER clicks BY viewedat IS NOT NULL;
uniqPages = DISTINCT fltrd.pageid;
uniqLinks = DISTINCT fltrd.linkid;
GENERATE group, COUNT(uniqPages), COUNT(uniqLinks);
};
(Alice, 1, 1)
(John, 1, 1)
A more complex situation arises when the nested plan is not two independent pipelines but rather a single branching pipeline.
Solution:
• Pig currently handles this case by duplicating the FILTER operator and producing two independent pipelines, to be executed as explained above.
42. PLAN EXECUTION - Memory Management
• Like Hadoop, Pig is implemented in Java.
• Java memory management causes problems during query processing
– Java does not allow the developer to control memory allocation and deallocation directly
• Naive option: increase the JVM memory size limit beyond the physical memory size, and let the virtual memory manager take care of staging data between memory and disk.
– Problem: performance degradation.
• Better to return an "out-of-memory" error
– The administrator can adjust the memory management parameters and re-submit the program
43. PLAN EXECUTION - Memory Management
• Memory overflow is mostly due to large bags of tuples
• Java's MemoryPoolMXBean interface is used to get notified of low-memory situations. When notified, Pig spills excess bags to disk.
• Pig estimates bag sizes by sampling a few tuples
• The memory manager maintains a list of the Pig bags created in the same JVM, using a linked list of Java WeakReferences
• WeakReferences ensure that bags no longer in use can still be garbage collected
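The weak-reference bookkeeping can be illustrated in Python, whose weakref module plays a role similar to Java's WeakReference. Everything below is an assumption made for illustration: the Bag and MemoryManager classes, the on_low_memory hook that stands in for the low-memory notification, and the choice to spill the largest bags first.

```python
import weakref

class Bag:
    def __init__(self, tuples):
        self.tuples = list(tuples)
        self.spilled = False

    def spill(self):
        # Stand-in for writing the tuples to disk and freeing the in-memory copy.
        self.tuples = []
        self.spilled = True

class MemoryManager:
    def __init__(self):
        # Weak references only, so registration never keeps a bag alive.
        self._refs = []

    def register(self, bag):
        self._refs.append(weakref.ref(bag))

    def live_bags(self):
        # Resolve every reference once and drop those whose bag was collected.
        resolved = [(ref, ref()) for ref in self._refs]
        self._refs = [ref for ref, bag in resolved if bag is not None]
        return [bag for _, bag in resolved if bag is not None]

    def on_low_memory(self):
        # Hypothetical callback for a low-memory notification:
        # spill unspilled bags, largest first (an assumed policy).
        spilled = 0
        for bag in sorted(self.live_bags(), key=lambda b: len(b.tuples), reverse=True):
            if not bag.spilled:
                bag.spill()
                spilled += 1
        return spilled

manager = MemoryManager()
big = Bag(range(1000))
small = Bag(range(10))
manager.register(big)
manager.register(small)
del small  # no longer in use; the weak reference does not keep it alive
```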
45. STREAMING – FLOW CONTROL
• Pig allows user-defined functions (UDFs)
– UDFs must be written in Java and must conform to Pig's UDF interface
– They behave synchronously
Streaming:
• Allows data to be pushed through external executables
– Users are able to intermix relational operations like grouping and filtering with custom or legacy executables.
• A streaming executable behaves asynchronously.
Challenge in implementing streaming in Pig:
• Fitting it into the iterator model of Pig's execution pipeline
• Because of the asynchronous behavior of the user's executable, the STREAM operator that wraps it cannot simply pull tuples synchronously as it does with other operators: it does not know what state the executable is in.
• There may be no output because:
– the executable is waiting to receive more input: the STREAM operator needs to push new data
– the executable is still busy processing prior inputs: the STREAM operator should wait
46. • In a single-threaded operator execution model, a deadlock can occur
– The Pig operator is waiting for the external executable to consume a new input tuple, while at the same time the executable is waiting for its output to be consumed
Solution:
STREAM operator:
• Creates 2 additional threads: one to feed data to the executable and the other to consume data from it
• Blocks until a tuple is available on the executable's output queue or until the executable terminates
• If space is available in the input queue, places a tuple from the parent operator into it
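The two-thread arrangement can be sketched in Python with a feeder thread, a consumer thread and two queues around an external process. This is a simplified illustration, not Pig's STREAM operator: the stream_through function and queue size are assumptions, the external executable is a hypothetical child Python process that upper-cases each line, and the main thread enqueues all input before reading output rather than interleaving the two.

```python
import queue
import subprocess
import sys
import threading

_DONE = object()  # sentinel marking the end of a queue

def stream_through(command, lines):
    """Push text lines through an external executable and collect its output lines."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    input_queue = queue.Queue(maxsize=4)  # bounded, so the parent cannot run far ahead
    output_queue = queue.Queue()

    def feed():
        # Feeder thread: move tuples from the input queue to the executable's stdin.
        while True:
            item = input_queue.get()
            if item is _DONE:
                break
            proc.stdin.write(item + "\n")
        proc.stdin.close()

    def consume():
        # Consumer thread: move the executable's output to the output queue.
        for line in proc.stdout:
            output_queue.put(line.rstrip("\n"))
        output_queue.put(_DONE)

    threading.Thread(target=feed, daemon=True).start()
    threading.Thread(target=consume, daemon=True).start()

    for line in lines:  # blocks only while the input queue is full
        input_queue.put(line)
    input_queue.put(_DONE)

    results = []
    while True:  # blocks until output arrives or the executable terminates
        item = output_queue.get()
        if item is _DONE:
            break
        results.append(item)
    proc.wait()
    return results

# Hypothetical external executable: a child Python process that upper-cases each line.
UPPERCASE = [sys.executable, "-c",
             "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())"]
output = stream_through(UPPERCASE, ["apple", "mango", "pig"])
print(output)
```

Because the executable's stdout is drained by its own thread, the executable never blocks on a full output pipe while the feeder is blocked writing input, which is the deadlock described above.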
47. Performance
• In the initial implementation of Pig, functionality and proof of concept were considered more important than performance
• As Pig was adopted within Yahoo!, better performance quickly became a priority.
• Pig Mix: a publicly available benchmark, used to measure performance on a regular basis so that the effects of individual code changes on performance can be understood.
48. Benchmark Results
Pig Mix benchmark
• September 11, 2008:
– Initial Apache open-source release
• November 11, 2008:
– Enhanced type system
– Rewrote execution pipeline
– Enhanced use of combiner
• January 20, 2009:
– Rework of buffering during data parsing
– Fragment-replicate join algorithm
• February 23, 2009:
– Rework of partitioning function used in ORDER BY to ensure more balanced distribution of keys to reducers
• April 20, 2009:
– Branching execution plans
• Vertical axis: ratio of total running time for 12 Pig programs to that of the corresponding Map-Reduce programs
• Current performance ratio is 1.5: a reasonable tradeoff point between execution time and code development/maintenance effort.
49. Pros & Cons
• The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single-block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.
• With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially.
50. Pros & Cons
Pros:
• Explicit dataflow
• Retains properties of Map-Reduce
• Scalability
• Fault tolerance
• Multi-way processing
• Open source
Cons:
• Column-wise storage structures are missing
• Memory management
• No facilitation for non-Java users
• Limited optimization
• No GUI for flow graphs
51. Future Work
• Query optimization
– Currently: rule-based optimizer for plan rearrangement and join selection
– In the future: cost-based
• Non-Java UDFs
• SQL interface
• Grouping and joining of pre-partitioned/sorted data.
– Avoid data shuffling for grouping and joining
– Building metadata facilities to keep track of data layout
• Skew handling.
– For load balancing
52. Summary
• Big demand for parallel data processing
– Programmers like dataflow pipes over static files
• Ease of programming.
• UDFs: users can create their own functions to do special-purpose processing.
• Optimization opportunities: the way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
• Open source
Pig Latin : Sweet spot between map-reduce and SQL
53. Related Work
• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
• Hive
– SQL-like language on top of Map-Reduce
• DryadLINQ
– SQL-like language on top of Dryad
Nested data model: more natural to programmers than flat tuples; avoids expensive joins