3. Overview
MAPREDUCE
Mapper and Reducer phases (map tasks and reduce tasks)
Shuffle between mapper and reducer tasks
JobControl to run a group of jobs with dependencies
TEZ
Directed Acyclic Graph (DAG) with vertices
Shuffle / One-One / Broadcast edges between vertex tasks
Whole plan runs in a single DAG
4. Why Tez?
Mapreduce is not cool anymore
Performance
Utilization
Run faster and also use the same or fewer resources
e.g. 5x the speed while using less memory and CPU
Increasing container memory can slow jobs down if the queue has no spare capacity
Runtime of job = (time taken by tasks) * (number of waves of task launches)
e.g. a queue with 1000 GB can run 1000 1 GB container tasks in parallel
1002 tasks take two waves and the runtime doubles
1000 tasks in 3 GB containers take three waves and the runtime triples (formalized in the sketch below)
Run more and more jobs concurrently
Resource contention degrades job performance
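The wave arithmetic from the example above can be written compactly (a sketch; the symbols are mine, not from the slides, and it assumes equal-sized containers and similar task durations):

$$\text{waves} = \left\lceil \frac{N_{\text{tasks}} \times \text{container size}}{\text{queue capacity}} \right\rceil, \qquad \left\lceil \frac{1002 \times 1\,\text{GB}}{1000\,\text{GB}} \right\rceil = 2, \qquad \left\lceil \frac{1000 \times 3\,\text{GB}}{1000\,\text{GB}} \right\rceil = 3$$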
7. Peak traffic
Periods of peak utilization followed by periods of low utilization are a common pattern:
Peak hours
Catch-up processing during delays
Reprocessing
Developing, testing and ad-hoc queries
Peak hours
Every hour: 5 min, 10 min, 15 min, 30 min and hourly feeds collide
Start of the day (00:00 UTC and 00:00 PST): 5 min, 10 min, 15 min, 30 min, hourly and daily feeds collide
9. Migration Status
Oct 1, 2015
Number of Pig script runs: MAPREDUCE - 198785 | TEZ - 5278
Number of applications: MAPREDUCE - 487699 | TEZ - 5418
Jun 17, 2016
Number of Pig scripts: MAPREDUCE - 58129 | TEZ - 148852 (progress: 72%)
Number of applications: MAPREDUCE - 147431 | TEZ - 150027 (progress: 70%)
*Data is for a single day
10. Unwritten Rule
Users should be able to run their scripts as is:
No settings changes or additional settings required
No modifications to the scripts or UDFs required
No special tuning required
Score: 80%
11. Rollout
Multiple version support
Commandline
pig or pig --useversion current
pig --useversion 0.11
pig --useversion 0.14
Oozie sharelib tags
pig_current
pig_11
pig_14
Staged rollout switching from Pig 0.11 to Pig 0.14 as current
Staging – 2 clusters
Research – 3 clusters
Production – 10 clusters
17 internal patch releases for Pig and Tez and 5 YARN releases.
Yahoo! Pig 0.14 = Apache Pig 0.16 minus 40+ patches
Yahoo! Tez 0.7 = Apache Tez 0.7.1 + ~15 patches
12. Feature parity with MapReduce
Add translation for all important MapReduce settings to equivalent Tez settings (a sketch of how this looks from a script follows at the end of this slide)
Speculative execution, Java opts, container sizes, max attempts, sort buffers, shuffle tuning, etc.
Map settings were mapped to root vertices
Reduce settings were mapped to intermediate and leaf vertices
Add equivalent Tez features for less commonly used MapReduce features:
mapreduce.job.running.{map|reduce}.limit
Number of tasks that can run at a time
mapreduce.{map|reduce}.failures.maxpercent
Percentage of failed tasks to ignore
mapreduce.task.timeout
Timeout based on progress reporting by application
mapreduce.job.classloader
Separate classloader for user code. Not implemented yet
-files, -archives
No plans to add support
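As a quick illustration of the setting translation above, a script written for MapReduce keeps working unchanged on Tez. The script, relation names and values below are made up; the comments only restate the mapping described on this slide:

-- Existing MapReduce-era settings keep their effect when the script runs on Tez:
set mapreduce.map.memory.mb 2048;           -- applied to root vertices
set mapreduce.reduce.memory.mb 3072;        -- applied to intermediate and leaf vertices
set mapreduce.reduce.java.opts '-Xmx2560M'; -- reduce Java opts carry over the same way
set mapreduce.map.speculative true;         -- speculative execution is translated too

A = LOAD 'input' AS (key:chararray, val:long);
B = FOREACH (GROUP A BY key) GENERATE group, SUM(A.val);
STORE B INTO 'output';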
14. Misconfigurations
Bundling older versions of Pig and Hadoop jars with the Maven assembly plugin
NoClassDefFoundError, NoSuchMethodException
Incorrect memory configurations (a corrected sketch follows below)
Containers killed for heap size exceeding container size
mapreduce.reduce.java.opts = -Xmx4096M
mapreduce.reduce.memory.mb = 2048
Bzip2 for shuffle compression instead of LZO
Support was added later
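A minimal sketch of a consistent configuration for the case above (the values are illustrative; keeping the heap comfortably below the container size is a general guideline, not a number from the slides):

-- The container must be large enough to hold the heap plus JVM overhead.
set mapreduce.reduce.memory.mb 4096;
set mapreduce.reduce.java.opts '-Xmx3276M';  -- roughly 80% of the 4 GB container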
15. Bad Programming in UDFs
Static variables instead of instance variables give wrong results with container reuse
groupedData = group A by $0;
aggregate = FOREACH groupedData generate group, MYCOUNT(A.f1);
MAPREDUCE ✔ TEZ ✖
aggregate = FOREACH groupedData generate group, MYCOUNT(A.f1), MYCOUNT(A.f2);
MAPREDUCE ✖ TEZ ✖
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.AccumulatorEvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MYCOUNT extends AccumulatorEvalFunc<Long> {
  // BUG: static state is shared by every MYCOUNT instance in the JVM and
  // survives container reuse. FIX: drop 'static' and use an instance field.
  private static long intermediateCount;

  @Override
  public void accumulate(Tuple b) throws IOException {
    Iterator<Tuple> it = ((DataBag) b.get(0)).iterator();
    while (it.hasNext()) {
      it.next(); intermediateCount++;
    }
  }

  @Override
  public Long getValue() {
    return intermediateCount;
  }
  @Override
  public void cleanup() {  // required by AccumulatorEvalFunc; resets between groups
    intermediateCount = 0;
  }
}
16. Behavior Change
Part file name change
Output file names changed from part-{m|r}-00000 to part-v000-o000-r00000.
Users had hardcoded part-{m|r}-00000 path references in mapreduce.job.cache.files of downstream jobs
Index files
Dictionaries
Union optimization relied on output formats honoring mapreduce.output.basename
Custom OutputFormats that hardcoded filenames as part-{m|r}-00000 had to be fixed.
A database loading tool had problems with the longer filenames
Parsing Pig job counters from Oozie stats
There is only one job for the script instead of multiple jobs
Users had assumptions about the number of jobs when parsing counters
Increased namespace usage with the UnionOptimizer
MapReduce had an extra map stage that processed 128 MB per map and consolidated the output
Specify PARALLEL with UNION or turn the optimization off (a sketch follows after the diagram note below)
[Diagram: with the Tez union optimization, two 20-task vertices write their output directly; under MapReduce, and under Tez with the optimization turned off, the two 20-task stages feed one extra task that writes the output]
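A hedged sketch of the two workarounds above. Whether plain Apache Pig accepts PARALLEL directly on the UNION, and whether pig.tez.opt.union is the exact name of the off switch, are assumptions on my part; either workaround alone should be enough:

-- Option 2 (assumption: pig.tez.opt.union disables the Tez union optimizer)
set pig.tez.opt.union false;

-- Option 1 (per the slide): specify PARALLEL with the UNION to cap the number of output files
A = LOAD 'input1' AS (key:chararray, val:long);
B = LOAD 'input2' AS (key:chararray, val:long);
U = UNION A, B PARALLEL 40;
STORE U INTO 'output';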
17. OutOfMemoryError – Application Master
Processes more tasks, counters and events
Does the job of N MapReduce application masters
Pig auto increases AM heap size based on number of tasks and vertices
Optimizations:
Serialization improvements and reduced buffer copies
Skip storing counters whose value is zero
Task configuration held in memory
MapReduce sent job.xml to tasks via the distributed cache; Tez sends config to tasks via RPC
A lot more configuration and information duplication:
Separate config per Processor, Input, Output and Edge
HCatLoader duplicating configuration and partition split information in UDFContext
Not an issue anymore with all the fixes that have gone in
18. OutOfMemoryError - Tasks
Auto-parallelism shuffle overload
Replicated join happens in a reducer-phase vertex in Tez but in the map phase in MapReduce; user configurations had a higher heap size for maps than for reducers
Increased memory utilization due to multiple inputs and outputs
Fetch inputs, finish sorting them and release their buffers before creating the sort buffers for outputs
In-memory aggregation for GROUP BY turned on by default in Pig (see the sketch below)
pig.exec.mapPartAgg=true aggregates using a HashMap before the combiner
Improvements to memory-based aggregation
Better spilling in Pig's SpillableMemoryManager
95% of the cases were fixed and 5% required a memory increase
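A minimal sketch of the map-side partial aggregation mentioned above; the minReduction property and its value are assumptions about the companion knob, not something stated on the slide:

set pig.exec.mapPartAgg true;             -- aggregate in a hash map before the combiner
set pig.exec.mapPartAgg.minReduction 10;  -- assumption: keep the in-map aggregation only if it shrinks records ~10x

A = LOAD 'input' AS (key:chararray, val:long);
B = FOREACH (GROUP A BY key) GENERATE group, COUNT(A);
STORE B INTO 'output';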
19. Memory Utilization
MapReduce
Default heap size – 1 GB
Default io.sort.mb – 256 MB
Tez
Max memory for sort buffers – 50% of heap (512 MB)
Pig memory-based aggregation – 20% of heap (200 MB)
[Diagram: per-task memory layout for a Load / Group By / Join pipeline. Each MapReduce task uses a single 256 MB sort buffer out of its 1 GB heap (768 MB left over), while a Tez task holds several smaller sort buffers (128-256 MB each) plus a ~200 MB aggregation buffer within the same 1 GB heap]
21. How complex a processing pipeline can Tez handle?
Scale
DAGs with 100-200 vertices
Terabytes of data and billions of records flowing from start to finish of the DAG
98% of jobs can be run with the same memory configurations as MapReduce
Ran 310K tasks in one DAG with just a 3.5 GB heap size
One user set default_parallel to 5000 for a big DAG (totally unnecessary and wasteful); the job still ran
Stability
45-50K Pig on Tez jobs (not including Hive) run in a single cluster each day without issues
Speculative execution
Full fault tolerance
Bad nodes
Bad disks
Shuffle fetch failures
22. Complexity of DAGs – Vertices per DAG
[Histograms of the number of DAGs by vertices per DAG, for a single day: one chart covering 1-11 vertices (up to ~45,000 DAGs per bucket) and one covering 12-155 vertices (up to ~1,400 DAGs per bucket)]
27. Before and After
[Bar charts comparing MapReduce and Tez for one major user in a single cluster, Jan 1 to Jun 22, 2016: CPU utilization (in million vcore-secs), total runtime (in hours), number of jobs, and memory utilization (in PB-secs)]
Savings per day:
400-500 hours of runtime (23-30%); individual jobs' runtime improvements vary between 5% and 5x
100+ PB-secs of memory (15-25%)
30-50 million vcore-secs of CPU (18-28%)
28. Utilization could still be better
Speculative execution happens a lot more than in MapReduce (TEZ-3316)
Task progress jumps straight from 0 to 100%
Progress should be updated based on the percentage of input records processed, similar to MapReduce
80% of our jobs run with speculative execution
Slow start does not happen with MRInput vertices (TEZ-3274): downstream tasks start at the same time
Container reuse
Larger containers are reused for small container requests (TEZ-3315)
< 1% of jobs affected
29. Before and After - IO
[Bar charts comparing MapReduce and Tez for one major user in a single cluster, Jan 1 to Jun 22, 2016: HDFS_BYTES_READ, HDFS_BYTES_WRITTEN, FILE_BYTES_READ and FILE_BYTES_WRITTEN, all in GB]
Lower HDFS utilization, as expected with no intermediate data stored on HDFS
Both HDFS_BYTES_READ and HDFS_BYTES_WRITTEN lower by ~20 PB
More savings if 3x replication is accounted for
Higher local file utilization than expected
30. IO issues to be fixed
More spills with smaller buffers in case of multiple inputs and outputs
Better in-memory merging with inputs
Lots of space is wasted during serialization
DataOutputBuffer, used for serializing keys and values, extends ByteArrayOutputStream
The byte array size is doubled on every expansion, so 33 MB of data occupies 64 MB
Chained byte arrays avoid the array copies (TEZ-3159)
Shuffle fetch failures and reruns of upstream tasks hurt runtime and increase IO
The probability of occurrence is a lot higher than in MapReduce
Increased disk activity with auto-parallelism
Disks are hammered during fetch
More on-disk activity for shuffle and merge due to spills
31. Performance and Utilization
            Runtime        Number of Jobs   Number of Tasks
MapReduce   1 hr 15 min    46               45758
Tez         39 min         1                22879
Mapreduce
AM Container Size – 1.5 GB ( 46*1.5 = 69 GB)
Task Container Size – 2 GB
Tez
AM Container Size – 3 GB
Task Container Size – 2 GB
33. Challenges with DAG Processing
A DAG is not always better than MapReduce
Some Tez optimizations that do well at small scale backfired at large scale:
Multiple inputs and outputs
Data skew in unordered partitioning
Auto Parallelism
34. Multiple inputs and Outputs
The script on the slide (a simplified sketch follows below) groups by 59 dimensions and then joins the results
Grouping by multiple dimensions is a very common use case
The MapReduce implementation used multiplexing and de-multiplexing logic to do it in a single job: tag each record with the index of its group by and run it through the corresponding plan
The Tez implementation, which uses separate reducer vertices, is generally more efficient and faster, but performance degrades as the data volume and the number of inputs and outputs increase, because the sort buffers become too small and there is a lot of spilling
Solution – implement the older multiplexing and de-multiplexing logic in Tez for > 10 inputs/outputs
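Since the slide's actual script is not reproduced here, a much-simplified, hypothetical sketch of the pattern (the real script grouped by 59 dimension combinations; the relation and field names below are invented):

events = LOAD 'events' AS (country:chararray, device:chararray, hour:int, bytes:long);

-- Each GROUP BY becomes its own shuffle edge and reducer vertex in the straightforward Tez plan.
by_country = FOREACH (GROUP events BY country) GENERATE group AS country, SUM(events.bytes) AS bytes;
by_device  = FOREACH (GROUP events BY device) GENERATE group AS device, SUM(events.bytes) AS bytes;
by_both    = FOREACH (GROUP events BY (country, device)) GENERATE FLATTEN(group), SUM(events.bytes) AS bytes;
-- ... repeated for each remaining dimension combination, after which the aggregates are joined and stored.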
35. Skewed Data
Data skew is one of the biggest causes of slowness
One or two reducers end up processing most of the data
Skewed output in one vertex can keep propagating downstream, causing more slowness:
Self joins which use One-One edges
Unordered partitioning (TEZ-3209)
Storing intermediate data in HDFS acted as an automatic correction for skew in MapReduce
Each map combined splits of up to 128 MB by default for processing
36. Auto-Parallelism
[Diagram: a map vertex feeding a reduce vertex, before and after auto-parallelism shrinks the reduce vertex]
tez.shuffle-vertex-manager.desired-task-input-size
128 MB per reducer for intermediate vertices
1 GB per reducer for leaf vertices
Increased shuffle fetches per reducer after parallelism is reduced
Before – 3 fetches per reducer
After – 9 fetches in one reducer
tez.shuffle-vertex-manager.min-task-parallelism – default is 1
Pig does not use it yet
1000 mappers and 999 reducers: auto-parallelism can reduce to 1 reducer
999,000 fetches in a single reducer
Disk merge is also costly
37. Improving Auto-Parallelism
Scaling
Composite messaging (TEZ-3222)
Reduce number of DataMovementEvents
Optimize empty partitions handling (TEZ-3145)
Custom Shuffle Handler for Tez
Remember file location of map output. Avoid listing on all volumes.
Ability to fetch multiple partitions in one request
Turn on MemToMem Merger by default
tez.runtime.shuffle.memory-to-memory.enable (mapreduce.reduce.merge.memtomem.enabled in MapReduce)
Still in an experimental stage, with lots of issues
Threshold is currently based on number of map outputs
Dynamically merge based on the size and available memory
Range partitioning is not efficient
Easy implementation to start with
Skewed data prevents better reduction; two skewed buckets next to each other make it worse
Best-fit algorithm instead of range partitioning (TEZ-2962)
Example: partitions of 67 MB, 64 MB, 64 MB, 5 MB and 5 MB merged by contiguous ranges become reducers of 67 MB, 128 MB (64+64) and 10 MB (5+5)
38. Summary
The transition was a little bumpier and longer than expected (planned for Q4 2015)
Major initial rollout delay because the Application Timeline Server did not scale beyond 5K applications
Tons of stability and scalability issues fixed and contributed back to Apache
Full migration by July 2016
Tez scales and is now stable; it worked at Yahoo scale
Smooth and easy migration from MapReduce
Really Good UI
Some cool features currently being worked on in the community:
DFS-based shuffle (TEZ-2442)
Custom edge for CROSS (TEZ-2104)
Parallelized LIMIT that skips running tasks once the desired record count is reached
Lots more features, optimizations, better algorithms and crazy ideas yet to be done
Tez is an Assembly Language at its core
39. Acknowledgements
Jason Lowe - our very own Super Computer for debugging crazy issues
Koji Noguchi - another of our Pig internals experts and a Committer
Yahoo! Pig users who put up with most of the pain
Daniel Dai from Apache Pig PMC
Apache Tez PMC
Bikas Saha
Siddharth Seth
Hitesh Shah