3. Overview
MAPREDUCE
Mapper and Reducer phases (map tasks and reduce tasks)
Shuffle between mapper and reducer tasks
JobControl to run a group of jobs with dependencies
TEZ
Directed Acyclic Graph (DAG) with vertices
Shuffle / One-One / Broadcast edges between vertex tasks
Whole plan runs in a single DAG
4. Why Tez?
Mapreduce is not cool anymore
Performance
Utilization
Run faster and also use the same or fewer resources
e.g. 5x the speed while using less memory and CPU
Increasing container memory can slow jobs down if the queue has no spare capacity
Runtime of job = (time taken by tasks) * (number of waves of task launches)
e.g. a queue with 1000 GB can run 1000 1 GB container tasks in parallel
1002 tasks take two waves and the runtime doubles
1000 tasks in 3 GB containers take three waves and the runtime triples (formalized in the sketch below)
Run more and more jobs concurrently
Resource contention degrades job performance
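The wave arithmetic from the example above can be written compactly (a sketch; the symbols are mine, not from the slides, and it assumes equal-sized containers and similar task durations):

$$\text{waves} = \left\lceil \frac{N_{\text{tasks}} \times \text{container size}}{\text{queue capacity}} \right\rceil, \qquad \left\lceil \frac{1002 \times 1\,\text{GB}}{1000\,\text{GB}} \right\rceil = 2, \qquad \left\lceil \frac{1000 \times 3\,\text{GB}}{1000\,\text{GB}} \right\rceil = 3$$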
7. Peak traffic
Periods of peak utilization followed by periods of low utilization are a common pattern:
Peak hours
Catch-up processing during delays
Reprocessing
Developing, testing and ad-hoc queries
Peak hours
Every hour: 5 min, 10 min, 15 min, 30 min and hourly feeds collide
Start of the day (00:00 UTC and 00:00 PST): 5 min, 10 min, 15 min, 30 min, hourly and daily feeds collide
9. Migration Status
Oct 1, 2015
Number of Pig script runs: MAPREDUCE - 198785 | TEZ - 5278
Number of applications: MAPREDUCE - 487699 | TEZ - 5418
Jun 17, 2016
Number of Pig scripts: MAPREDUCE - 58129 | TEZ - 148852 (progress: 72%)
Number of applications: MAPREDUCE - 147431 | TEZ - 150027 (progress: 70%)
*Data is for a single day
10. Unwritten Rule
Users should be able to run their scripts as is:
No settings changes or additional settings required
No modifications to the scripts or UDFs required
No special tuning required
Score: 80%
11. Rollout
Multiple version support
Commandline
pig or pig --useversion current
pig --useversion 0.11
pig --useversion 0.14
Oozie sharelib tags
pig_current
pig_11
pig_14
Staged rollout switching from Pig 0.11 to Pig 0.14 as current
Staging – 2 clusters
Research – 3 clusters
Production – 10 clusters
17 internal patch releases for Pig and Tez and 5 YARN releases.
Yahoo! Pig 0.14 = Apache Pig 0.16 minus 40+ patches
Yahoo! Tez 0.7 = Apache Tez 0.7.1 + ~15 patches
12. Feature parity with MapReduce
Add translation for all important MapReduce settings to equivalent Tez settings (a sketch of how this looks from a script follows at the end of this slide)
Speculative execution, Java opts, container sizes, max attempts, sort buffers, shuffle tuning, etc.
Map settings were mapped to root vertices
Reduce settings were mapped to intermediate and leaf vertices
Add equivalent Tez features for less commonly used MapReduce features:
mapreduce.job.running.{map|reduce}.limit
Number of tasks that can run at a time
mapreduce.{map|reduce}.failures.maxpercent
Percentage of failed tasks to ignore
mapreduce.task.timeout
Timeout based on progress reporting by application
mapreduce.job.classloader
Separate classloader for user code. Not implemented yet
-files, -archives
No plans to add support
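As a quick illustration of the setting translation above, a script written for MapReduce keeps working unchanged on Tez. The script, relation names and values below are made up; the comments only restate the mapping described on this slide:

-- Existing MapReduce-era settings keep their effect when the script runs on Tez:
set mapreduce.map.memory.mb 2048;           -- applied to root vertices
set mapreduce.reduce.memory.mb 3072;        -- applied to intermediate and leaf vertices
set mapreduce.reduce.java.opts '-Xmx2560M'; -- reduce Java opts carry over the same way
set mapreduce.map.speculative true;         -- speculative execution is translated too

A = LOAD 'input' AS (key:chararray, val:long);
B = FOREACH (GROUP A BY key) GENERATE group, SUM(A.val);
STORE B INTO 'output';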
14. Misconfigurations
Bundling older versions of Pig and Hadoop jars with the Maven assembly plugin
NoClassDefFoundError, NoSuchMethodException
Incorrect memory configurations (a corrected sketch follows below)
Containers killed for heap size exceeding container size
mapreduce.reduce.java.opts = -Xmx4096M
mapreduce.reduce.memory.mb = 2048
Bzip2 for shuffle compression instead of LZO
Support was added later
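A minimal sketch of a consistent configuration for the case above (the values are illustrative; keeping the heap comfortably below the container size is a general guideline, not a number from the slides):

-- The container must be large enough to hold the heap plus JVM overhead.
set mapreduce.reduce.memory.mb 4096;
set mapreduce.reduce.java.opts '-Xmx3276M';  -- roughly 80% of the 4 GB container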
15. Bad Programming in UDFs
Static variables instead of instance variables give wrong results with container reuse
groupedData = group A by $0;
aggregate = FOREACH groupedData generate group, MYCOUNT(A.f1);
MAPREDUCE ✔ TEZ ✖
aggregate = FOREACH groupedData generate group, MYCOUNT(A.f1), MYCOUNT(A.f2);
MAPREDUCE ✖ TEZ ✖
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.AccumulatorEvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MYCOUNT extends AccumulatorEvalFunc<Long> {
  // BUG: static state is shared by every MYCOUNT instance in the JVM and
  // survives container reuse. FIX: drop 'static' and use an instance field.
  private static long intermediateCount;

  @Override
  public void accumulate(Tuple b) throws IOException {
    Iterator<Tuple> it = ((DataBag) b.get(0)).iterator();
    while (it.hasNext()) {
      it.next(); intermediateCount++;
    }
  }

  @Override
  public Long getValue() {
    return intermediateCount;
  }
  @Override
  public void cleanup() {  // required by AccumulatorEvalFunc; resets between groups
    intermediateCount = 0;
  }
}
16. Behavior Change
Part file name change
Output file names changed from part-{m|r}-00000 to part-v000-o000-r00000.
Users had hardcoded part-{m|r}-00000 path references in mapreduce.job.cache.files of downstream jobs
Index files
Dictionaries
Union optimization relied on output formats honoring mapreduce.output.basename
Custom OutputFormats that hardcoded filenames as part-{m|r}-00000 had to be fixed.
A database loading tool had problems with the longer filenames
Parsing Pig job counters from Oozie stats
There is only one job for the script instead of multiple jobs
Users had assumptions about the number of jobs when parsing counters
Increased namespace usage with the UnionOptimizer
MapReduce had an extra map stage that processed 128 MB per map and consolidated the output
Specify PARALLEL with UNION or turn the optimization off (a sketch follows after the diagram note below)
[Diagram: with the Tez union optimization, two 20-task vertices write their output directly; under MapReduce, and under Tez with the optimization turned off, the two 20-task stages feed one extra task that writes the output]
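A hedged sketch of the two workarounds above. Whether plain Apache Pig accepts PARALLEL directly on the UNION, and whether pig.tez.opt.union is the exact name of the off switch, are assumptions on my part; either workaround alone should be enough:

-- Option 2 (assumption: pig.tez.opt.union disables the Tez union optimizer)
set pig.tez.opt.union false;

-- Option 1 (per the slide): specify PARALLEL with the UNION to cap the number of output files
A = LOAD 'input1' AS (key:chararray, val:long);
B = LOAD 'input2' AS (key:chararray, val:long);
U = UNION A, B PARALLEL 40;
STORE U INTO 'output';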
17. OutOfMemoryError – Application Master
Processes more tasks, counters and events
Does the job of N MapReduce application masters
Pig auto increases AM heap size based on number of tasks and vertices
Optimizations:
Serialization improvements and reduced buffer copies
Skip storing counters whose value is zero
Task configuration held in memory
MapReduce sent job.xml to tasks via the distributed cache; Tez sends config to tasks via RPC
A lot more configuration and information duplication:
Separate config per Processor, Input, Output and Edge
HCatLoader duplicating configuration and partition split information in UDFContext
Not an issue anymore with all the fixes that have gone in
18. OutOfMemoryError - Tasks
Auto-parallelism shuffle overload
Replicated join happens in a reducer-phase vertex in Tez but in the map phase in MapReduce; user configurations had a higher heap size for maps than for reducers
Increased memory utilization due to multiple inputs and outputs
Fetch inputs, finish sorting them and release their buffers before creating the sort buffers for outputs
In-memory aggregation for GROUP BY turned on by default in Pig (see the sketch below)
pig.exec.mapPartAgg=true aggregates using a HashMap before the combiner
Improvements to memory-based aggregation
Better spilling in Pig's SpillableMemoryManager
95% of the cases were fixed and 5% required a memory increase
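A minimal sketch of the map-side partial aggregation mentioned above; the minReduction property and its value are assumptions about the companion knob, not something stated on the slide:

set pig.exec.mapPartAgg true;             -- aggregate in a hash map before the combiner
set pig.exec.mapPartAgg.minReduction 10;  -- assumption: keep the in-map aggregation only if it shrinks records ~10x

A = LOAD 'input' AS (key:chararray, val:long);
B = FOREACH (GROUP A BY key) GENERATE group, COUNT(A);
STORE B INTO 'output';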
19. Memory Utilization
MapReduce
Default heap size – 1 GB
Default io.sort.mb – 256 MB
Tez
Max memory for sort buffers – 50% of heap (512 MB)
Pig memory-based aggregation – 20% of heap (200 MB)
[Diagram: per-task memory layout for a Load / Group By / Join pipeline. Each MapReduce task uses a single 256 MB sort buffer out of its 1 GB heap (768 MB left over), while a Tez task holds several smaller sort buffers (128-256 MB each) plus a ~200 MB aggregation buffer within the same 1 GB heap]
21. How complex a processing pipeline can Tez handle?
Scale
DAGs with 100-200 vertices
Terabytes of data and billions of records flowing from start to finish of the DAG
98% of jobs can be run with the same memory configurations as MapReduce
Ran 310K tasks in one DAG with just a 3.5 GB heap size
One user set default_parallel to 5000 for a big DAG (totally unnecessary and wasteful); the job still ran
Stability
45-50K Pig on Tez jobs (not including Hive) run in a single cluster each day without issues
Speculative execution
Full fault tolerance
Bad nodes
Bad disks
Shuffle fetch failures
22. Complexity of DAGs – Vertices per DAG
[Histograms of the number of DAGs by vertices per DAG, for a single day: one chart covering 1-11 vertices (up to ~45,000 DAGs per bucket) and one covering 12-155 vertices (up to ~1,400 DAGs per bucket)]
27. Before and After
[Bar charts comparing MapReduce and Tez for one major user in a single cluster, Jan 1 to Jun 22, 2016: CPU utilization (in million vcore-secs), total runtime (in hours), number of jobs, and memory utilization (in PB-secs)]
Savings per day:
400-500 hours of runtime (23-30%); individual jobs' runtime improvements vary between 5% and 5x
100+ PB-secs of memory (15-25%)
30-50 million vcore-secs of CPU (18-28%)
28. Utilization could still be better
Speculative execution happens a lot more than in MapReduce (TEZ-3316)
Task progress jumps straight from 0 to 100%
Progress should be updated based on the percentage of input records processed, similar to MapReduce
80% of our jobs run with speculative execution
Slow start does not happen with MRInput vertices (TEZ-3274): downstream tasks start at the same time
Container reuse
Larger containers are reused for small container requests (TEZ-3315)
< 1% of jobs affected
29. Before and After - IO
[Bar charts comparing MapReduce and Tez for one major user in a single cluster, Jan 1 to Jun 22, 2016: HDFS_BYTES_READ, HDFS_BYTES_WRITTEN, FILE_BYTES_READ and FILE_BYTES_WRITTEN, all in GB]
Lower HDFS utilization, as expected with no intermediate data stored on HDFS
Both HDFS_BYTES_READ and HDFS_BYTES_WRITTEN lower by ~20 PB
More savings if 3x replication is accounted for
Higher local file utilization than expected
30. IO issues to be fixed
More spills with smaller buffers in case of multiple inputs and outputs
Better in-memory merging with inputs
Lots of space is wasted during serialization
DataOutputBuffer, used for serializing keys and values, extends ByteArrayOutputStream
The byte array size is doubled on every expansion, so 33 MB of data occupies 64 MB
Chained byte arrays avoid the array copies (TEZ-3159)
Shuffle fetch failures and reruns of upstream tasks hurt runtime and increase IO
The probability of occurrence is a lot higher than in MapReduce
Increased disk activity with auto-parallelism
Disks are hammered during fetch
More on-disk activity for shuffle and merge due to spills
31. Performance and Utilization
            Runtime        Number of Jobs   Number of Tasks
MapReduce   1 hr 15 min    46               45758
Tez         39 min         1                22879
Mapreduce
AM Container Size – 1.5 GB ( 46*1.5 = 69 GB)
Task Container Size – 2 GB
Tez
AM Container Size – 3 GB
Task Container Size – 2 GB
33. Challenges with DAG Processing
A DAG is not always better than MapReduce
Some Tez optimizations that do well at small scale backfired at large scale:
Multiple inputs and outputs
Data skew in unordered partitioning
Auto Parallelism
34. Multiple inputs and Outputs
The script on the slide (a simplified sketch follows below) groups by 59 dimensions and then joins the results
Grouping by multiple dimensions is a very common use case
The MapReduce implementation used multiplexing and de-multiplexing logic to do it in a single job: tag each record with the index of its group by and run it through the corresponding plan
The Tez implementation, which uses separate reducer vertices, is generally more efficient and faster, but performance degrades as the data volume and the number of inputs and outputs increase, because the sort buffers become too small and there is a lot of spilling
Solution – implement the older multiplexing and de-multiplexing logic in Tez for > 10 inputs/outputs
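Since the slide's actual script is not reproduced here, a much-simplified, hypothetical sketch of the pattern (the real script grouped by 59 dimension combinations; the relation and field names below are invented):

events = LOAD 'events' AS (country:chararray, device:chararray, hour:int, bytes:long);

-- Each GROUP BY becomes its own shuffle edge and reducer vertex in the straightforward Tez plan.
by_country = FOREACH (GROUP events BY country) GENERATE group AS country, SUM(events.bytes) AS bytes;
by_device  = FOREACH (GROUP events BY device) GENERATE group AS device, SUM(events.bytes) AS bytes;
by_both    = FOREACH (GROUP events BY (country, device)) GENERATE FLATTEN(group), SUM(events.bytes) AS bytes;
-- ... repeated for each remaining dimension combination, after which the aggregates are joined and stored.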
35. Skewed Data
Data skew is one of the biggest causes of slowness
One or two reducers end up processing most of the data
Skewed output in one vertex can keep propagating downstream, causing more slowness:
Self joins which use One-One edges
Unordered partitioning (TEZ-3209)
Storing intermediate data in HDFS acted as an automatic correction for skew in MapReduce
Each map combined splits of up to 128 MB by default for processing
36. Auto-Parallelism
[Diagram: a map vertex feeding a reduce vertex, before and after auto-parallelism shrinks the reduce vertex]
tez.shuffle-vertex-manager.desired-task-input-size
128 MB per reducer for intermediate vertices
1 GB per reducer for leaf vertices
Increased shuffle fetches per reducer after parallelism is reduced
Before – 3 fetches per reducer
After – 9 fetches in one reducer
tez.shuffle-vertex-manager.min-task-parallelism – default is 1
Pig does not use it yet
1000 mappers and 999 reducers: auto-parallelism can reduce to 1 reducer
999,000 fetches in a single reducer
Disk merge is also costly
37. Improving Auto-Parallelism
Scaling
Composite messaging (TEZ-3222)
Reduce number of DataMovementEvents
Optimize empty partitions handling (TEZ-3145)
Custom Shuffle Handler for Tez
Remember file location of map output. Avoid listing on all volumes.
Ability to fetch multiple partitions in one request
Turn on MemToMem Merger by default
tez.runtime.shuffle.memory-to-memory.enable (mapreduce.reduce.merge.memtomem.enabled in MapReduce)
Still in an experimental stage, with lots of issues
Threshold is currently based on number of map outputs
Dynamically merge based on the size and available memory
Range partitioning is not efficient
Easy implementation to start with
Skewed data prevents better reduction; two skewed buckets next to each other make it worse
Best-fit algorithm instead of range partitioning (TEZ-2962)
Example: partitions of 67 MB, 64 MB, 64 MB, 5 MB and 5 MB merged by contiguous ranges become reducers of 67 MB, 128 MB (64+64) and 10 MB (5+5)
38. Summary
The transition was a little bumpier and longer than expected (planned for Q4 2015)
Major initial rollout delay because the Application Timeline Server did not scale beyond 5K applications
Tons of stability and scalability issues fixed and contributed back to Apache
Full migration by July 2016
Tez scales and is now stable; it worked at Yahoo scale
Smooth and easy migration from MapReduce
Really Good UI
Some cool features currently being worked on in the community:
DFS-based shuffle (TEZ-2442)
Custom edge for CROSS (TEZ-2104)
Parallelized LIMIT that skips running tasks once the desired record count is reached
Lots more features, optimizations, better algorithms and crazy ideas yet to be done
Tez is an Assembly Language at its core
39. Acknowledgements
Jason Lowe - our very own Super Computer for debugging crazy issues
Koji Noguchi - another of our Pig internals experts and a Committer
Yahoo! Pig users who put up with most of the pain
Daniel Dai from Apache Pig PMC
Apache Tez PMC
Bikas Saha
Siddharth Seth
Hitesh Shah