Apache Tez – Present and Future
Jianfeng Zhang
1.
© Hortonworks Inc. 2015
Apache Tez – Present and Future
Jeff Zhang (@zjffdu), Rajesh Balamohan (@rajeshbalamohan)
2.
Agenda
• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap
3.
Example of MR versus Tez
• Hive - MR: separate jobs with I/O synchronization barriers between them
– Job 1 (Join a & b)
– Job 2 (Group by of a Join b)
– Job 3 (Group by of c)
– Job 4 (Join of S & R)
• Hive - Tez: a single job
– Join a & b
– Group by of a Join b
– Group by of c
– Join of S & R
4.
Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph (DAG).
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN, the resource management framework for Hadoop.
• Open-source Apache project, Apache licensed.
5.
What is DAG & Why DAG
• Directed Acyclic Graph
• Any complicated DAG can be composed of the following three basic paradigms
– Sequential (Projection, Filter, GroupBy, …)
– Merge (Join, Union, Intersect, …)
– Divide (Split, …)
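The three paradigms can be illustrated with a small sketch. This is not Tez code: it is a toy model in Python, and the vertex names are hypothetical. It only shows that a graph built from sequential, merge and divide steps has a valid execution order as long as it contains no cycle.

```python
from graphlib import TopologicalSorter

# Hypothetical vertex names; each entry maps a vertex to the vertices it reads from.
dag = {
    "scan_a": set(),
    "scan_b": set(),
    "filter_a": {"scan_a"},             # sequential: one producer, one consumer
    "join_ab": {"filter_a", "scan_b"},  # merge: several producers feed one vertex
    "group_by": {"join_ab"},            # divide: join_ab feeds two consumers
    "project": {"join_ab"},
}


def execution_order(graph):
    """Return one order in which every vertex runs after all of its producers."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
print(order)
```

A graph with a cycle has no such order, which is why the model is restricted to acyclic graphs.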
6.
Expressing DAG in Tez API
• DAG API (Logic View)
– Allows the user to build a DAG
– Topological structure of the data computation flow
• Runtime API (Runtime View)
– Application logic of each computation unit (vertex)
– How to move/read/write data between vertices
7.
DAG API (Logic View)
• Vertex (Processor, Parallelism, Resource, etc.)
• Edge (EdgeProperty)
– DataMovement
  – Scatter Gather (Join, GroupBy, …)
  – Broadcast (Pig Replicated Join / Hive Broadcast Join)
  – One-to-One (Pig Order by)
  – Custom
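The difference between the data movement types can be sketched as a routing rule: given a record produced by one source task, which destination tasks receive it? The function below is a toy model in Python, not the Tez API; the function name, the string labels and the hash-based partitioning are assumptions made for illustration.

```python
from zlib import crc32


def route(movement, src_task, key, num_dst_tasks):
    """Return the destination task indices for one record (toy model)."""
    if movement == "scatter_gather":
        # Partition by key so that equal keys meet in the same downstream task.
        return [crc32(key.encode()) % num_dst_tasks]
    if movement == "broadcast":
        # Every downstream task receives the full output.
        return list(range(num_dst_tasks))
    if movement == "one_to_one":
        # Task i of the source vertex feeds task i of the destination vertex.
        return [src_task]
    raise ValueError(f"unknown movement type: {movement}")


print(route("broadcast", 0, "x", 3))
print(route("one_to_one", 2, "x", 3))
```

A custom edge would correspond to supplying a different routing rule.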
8.
Runtime API (Runtime View)
Input → Processor → Output
• Input
– Through which the processor receives data on an edge
– A vertex can have multiple inputs
• Processor
– Application logic (one processor per vertex)
– Consumes the inputs and produces the outputs
• Output
– Through which the processor writes data to an edge
– A vertex can have multiple outputs
• Examples of Input/Output/Processor
– MRInput & MROutput (InputFormat/OutputFormat)
– OrderedGroupedKVInput & OrderedPartitionedKVOutput (Scatter Gather)
– UnorderedKVInput & UnorderedKVOutput (Broadcast & One-to-One)
– PigProcessor/HiveProcessor
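The Input/Processor/Output split can be pictured with a minimal sketch. The classes below are hypothetical Python stand-ins, not the Tez runtime classes: the processor holds the application logic and only sees named inputs and outputs, so the same logic could be wired to different edges.

```python
class ListInput:
    """Hypothetical input: hands the processor the records that arrived on one edge."""

    def __init__(self, records):
        self._records = list(records)

    def read(self):
        return iter(self._records)


class ListOutput:
    """Hypothetical output: collects the records the processor writes to one edge."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))


class WordCountProcessor:
    """Application logic of one vertex: consumes all inputs, produces one output."""

    def run(self, inputs, outputs):
        counts = {}
        for edge_input in inputs.values():
            for line in edge_input.read():
                for word in line.split():
                    counts[word] = counts.get(word, 0) + 1
        for word, count in sorted(counts.items()):
            outputs["summary"].write(word, count)


outputs = {"summary": ListOutput()}
WordCountProcessor().run({"lines": ListInput(["a b a", "b c"])}, outputs)
print(outputs["summary"].records)
```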
9.
Benefit of DAG
• Easier to express computation in DAG
• No intermediate data written to HDFS
• Less pressure on NameNode
• No resource queuing effort & less resource contention
• More optimization opportunity with more global context
10.
Agenda
• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap
11.
Container-Reuse
• Reuse the same container across DAGs/Vertices/Tasks
• Benefits of Container-Reuse
– Fewer resources consumed
– Reduces the overhead of launching JVMs
– Reduces the overhead of negotiating with the Resource Manager
– Reduces the overhead of resource localization
– Reduces network IO
– Object Caching (Object Sharing)
12.
Tez Session
• Multiple Jobs/DAGs in one AM
• Container-reuse across Jobs/DAGs
• Data sharing between Jobs/DAGs
13.
Dynamic Parallelism Estimation
• VertexManager
– Listens to the status of the other vertices
– Coordinates and schedules its tasks
– Communication between vertices
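The slide does not say how the parallelism is estimated. One plausible approach, shown here purely as an assumption, is to extrapolate the total output of the upstream vertex from the tasks that have already finished and to size the downstream vertex so that each task gets roughly a target amount of data. The function name and all parameters are hypothetical.

```python
import math


def estimate_parallelism(completed_output_bytes, total_source_tasks,
                         configured_parallelism, desired_bytes_per_task):
    """Shrink the downstream parallelism based on observed upstream output (toy model)."""
    if not completed_output_bytes:
        # Nothing observed yet: keep the statically configured value.
        return configured_parallelism
    average = sum(completed_output_bytes) / len(completed_output_bytes)
    projected_total = average * total_source_tasks
    needed = math.ceil(projected_total / desired_bytes_per_task)
    # Never exceed the configured value and never drop below one task.
    return max(1, min(configured_parallelism, needed))


# Two of ten source tasks finished with 100 bytes each; target is 250 bytes per task.
print(estimate_parallelism([100, 100], 10, 50, 250))
```

The point of the sketch is the feedback loop: a decision about one vertex is made at runtime from the status of other vertices, which is the role the slide assigns to the VertexManager.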
14.
ATS Integration
• Tez is fully integrated with YARN ATS (Application Timeline Service)
– DAG Status, DAG Metrics, Task Status, Task Metrics are captured
• Diagnostics & Performance analysis
– Data source for monitoring & diagnostics
– Data source for performance analysis
15.
Recovery
• The AM can crash in corner cases
– OOM
– Node failure
– …
• Continues from the last checkpoint
• Transparent to end users
16.
Order By of Pig
f = Load 'foo' as (x, y);
o = Order f by x;
(Diagram: two plans for this script. One goes through HDFS: Load, Sample (Calculate Histogram), HDFS, Partition, Sort. The other connects Load, Sample (Calculate Histogram), Partition and Sort with Broadcast, One-to-One and Scatter Gather edges.)
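The stages named in the diagram (sample, partition, sort) can be sketched as a range-partitioned sort. This is a single-process Python illustration under assumed names, not Pig or Tez code: a sample of the keys yields partition boundaries, each key is routed to the partition covering its range, and each partition is sorted independently.

```python
import bisect
import random


def sample_boundaries(keys, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 cut points from a sorted random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    return [sample[len(sample) * i // num_partitions]
            for i in range(1, num_partitions)]


def range_partition(keys, boundaries):
    """Route each key to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    return partitions


def order_by(keys, num_partitions):
    """Sort each partition on its own; concatenating them gives a global order."""
    boundaries = sample_boundaries(keys, num_partitions)
    partitions = range_partition(keys, boundaries)
    return [key for part in partitions for key in sorted(part)]


print(order_by([5, 3, 9, 1, 7, 3, 8], 3))
```

In a distributed setting the boundaries are the small piece of data that every partitioning task needs, which is a natural fit for a broadcast edge.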
17.
Agenda
• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap
18.
• Performance
– Speculation
– Intermediate File Improvements
– Better use of JVM Memory
– Shuffle Improvements
• Debuggability
– Tez UI
– Local mode
– Job Analysis Tools
– Shuffle Performance Analysis Tool
19.
Speculation
• Good for clusters with a mix of fast and slow nodes, or heterogeneous hardware.
• Maintains periodic runtime statistics of tasks
• Triggers a speculative attempt when a task's estimated runtime > mean runtime
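The trigger rule on this slide can be written down directly. The sketch below is a simplified Python model: the way the runtime is estimated (elapsed time divided by progress) and all names are assumptions, and a real speculator would likely apply further safeguards.

```python
from statistics import mean


def estimated_runtime(elapsed_seconds, progress):
    """Extrapolate the total runtime from elapsed time and fractional progress."""
    return elapsed_seconds / progress


def tasks_to_speculate(running, completed_runtimes):
    """running maps task id -> (elapsed_seconds, progress in (0, 1])."""
    if not completed_runtimes:
        return []  # no statistics yet, so there is nothing to compare against
    threshold = mean(completed_runtimes)
    return sorted(
        task for task, (elapsed, progress) in running.items()
        if progress > 0 and estimated_runtime(elapsed, progress) > threshold
    )


running = {"t1": (30, 0.5), "t2": (90, 0.3)}
print(tasks_to_speculate(running, [60, 80, 100]))
```

Here t1 is projected to finish in 60 seconds, below the mean of 80, while t2 is projected at 300 seconds and becomes a candidate for a speculative attempt.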
20.
Intermediate File Format Improvements
• Used for storing intermediate data in Tez
• Drawbacks of the earlier format
– Needs a larger buffer in memory (due to duplicate keys)
– Bigger file size on disk
– Not ideal for all use cases
• New Intermediate File Format
– Works based on (K, List<V>)
– Provides 57% memory efficiency and 23% improvement in disk storage
• Old IFile format: Key Len, Value Len, K1, V1 | Key Len, Value Len, K1, V2 | Key Len, Value Len, K1, V3 | Key Len, Value Len, K2, V1 | … | Key Len, Value Len, K2, V5 | Key Len, Value Len, K2, V6
• New IFile format (field order as extracted from the diagram): Key Len, K1, Value Len, V1, Value Len, V2, V_END, RLE, Value Len, V3, … | Key Len, K2, Value Len, V1, Value Len, V5, V_END, RLE, Value Len, V6, …
(Diagram: a task writes Spill 1, Spill 2 and Spill 3, which are combined into a merged spill.)
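Why a (K, List<V>) layout saves space can be shown with a small size calculation. This is a toy model in Python, not the actual IFile encoding: the 4-byte length fields and the 4-byte end-of-values marker are assumptions chosen only to make the arithmetic concrete.

```python
from itertools import groupby

LEN_FIELD = 4  # assumed size in bytes of every length field and of the end marker


def old_layout_size(records):
    """Each record stores key length, value length, key and value."""
    return sum(LEN_FIELD + LEN_FIELD + len(k) + len(v) for k, v in records)


def new_layout_size(records):
    """Each distinct key is stored once, followed by its values and an end marker.

    records must be sorted by key so that equal keys are adjacent.
    """
    total = 0
    for key, group in groupby(records, key=lambda kv: kv[0]):
        total += LEN_FIELD + len(key)                       # key written once
        total += sum(LEN_FIELD + len(v) for _, v in group)  # every value
        total += LEN_FIELD                                  # end-of-values marker
    return total


records = [(b"k1", b"v1"), (b"k1", b"v2"), (b"k1", b"v3"), (b"k2", b"v4")]
print(old_layout_size(records), new_layout_size(records))
```

In this model the grouped layout only pays off when keys repeat; with all-distinct keys the end marker makes it slightly larger.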
21.
Better use of JVM Memory
• BytesWritable improvements
– Provides FastByteSerialization
– Saves 8 bytes per key-value pair
– Reduces IFile size by 25%
– Reduces SERDE costs
• PipelinedSorter can support > 2 GB sort buffers
– Containers with more RAM are no longer limited by the 2 GB sort buffer limit
– Avoids unnecessary spills in large jobs
• Reduced key comparison costs in PipelinedSorter
(Diagram: a key and value serialized to memory and to disk; field labels shown are Key Size, Bytes, Val Size, Bytes, with an additional Size field per key and value in one of the layouts.)
22.
Better use of JVM Memory (Contd.)
• Enabled RLE in the reducer codepath
– Reduced key comparisons in the merge codepath
– Improved job runtime (observed 10% improvement)
– Reduced CPU cost
• Without fix: 691 seconds
• With fix: 621 seconds
23.
Better use of JVM Memory (Contd.)
• WeightedMemoryDistributor for better memory management in tasks
– Observed 26% runtime improvement in tasks
24.
Broadcast Shuffle Improvements
(Diagram: "Before Fix" and "After Fix" panels. In both, a source task broadcasts to Task 1 … Task N in each of several groups; some of the reads are labeled "From local disk".)
25.
PipelinedShuffle Improvements
• Final merge in the source task is avoided.
– Less IO
• Consumers are informed about spill events in advance
– Better usage of network bandwidth
– Overlaps CPU with network
– For sorted/unsorted outputs, data is sent to consumers in chunks
• Observed 20% runtime improvement in queries involving heavy skews
(Diagram: in the normal shuffle path, each source task combines Spill 1, Spill 2 and Spill 3 into a merged spill that Reduce Task 1 … Reduce Task N fetch; in the pipelined shuffle path, the reduce tasks fetch the individual spills.)
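The benefit of overlapping production and fetching can be seen in a toy timeline. The model below is a Python sketch with made-up durations, not a measurement and not Tez code: one producer writes spills sequentially, and one consumer fetches them either after a final merge or as soon as each spill exists.

```python
def normal_shuffle_finish(spill_times, merge_time, fetch_time_per_spill):
    """The consumer starts fetching only after all spills are written and merged."""
    produce_done = sum(spill_times) + merge_time
    return produce_done + fetch_time_per_spill * len(spill_times)


def pipelined_shuffle_finish(spill_times, fetch_time_per_spill):
    """The consumer fetches each spill as soon as it has been produced."""
    producer_clock = 0.0
    fetch_done = 0.0
    for spill_time in spill_times:
        producer_clock += spill_time
        # A fetch starts when the spill exists and the previous fetch is done.
        fetch_done = max(fetch_done, producer_clock) + fetch_time_per_spill
    return fetch_done


# Hypothetical durations in seconds: three spills of 10 s, a 5 s merge, 4 s per fetch.
print(normal_shuffle_finish([10, 10, 10], 5, 4))
print(pipelined_shuffle_finish([10, 10, 10], 4))
```

In this model the gain comes from two sources named on the slide: the final merge disappears, and network transfer overlaps with the producer's remaining work.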
26.
PipelinedShuffle Improvements
• Job runtime: 925 seconds vs. 680 seconds
– 26% improvement
– Avoids final merge (less IO, CPU cost)
– Downstream can consume data whenever a spill is generated
27.
• Performance
– Speculation
– Better use of JVM Memory
– Intermediate File Improvements
– Shuffle Improvements
• Debuggability
– Tez UI
– Local mode
– Job Analysis Tools
– Shuffle Performance Analysis Tool
28.
Tez UI
29.
Tez UI
30.
Tez UI
• Download data from ATS
31.
Better Debuggability – Local Mode
• Test Tez jobs without a Hadoop cluster
• Enables fast prototyping
• Fast unit testing
• Runs in a single JVM (easy for debugging)
• Scheduling / RPC invocations are skipped
32.
Job Analysis Tools
• DAG Swimlane
– Run "sh yarn-swimlanes.sh <app_id>" from $TEZ_HOME/tez-tools/swimlanes
(Diagram: swimlane view annotated with Prewarm, Container Reuse and Remote Reads.)
33.
Shuffle Performance Analysis Tools
• Analyze Tez logs in Hadoop
• Analyze shuffle performance between source / destination nodes
(Diagram annotation: data transfers from node 7 to the rest of the nodes are slow.)
34.
Shuffle Performance Analysis Tools
• Analyze shuffle performance between source / destination nodes
35.
RoadMap
• Shared output edges
– Same output to multiple vertices
• Local mode stabilization
• Optimizing (include/exclude) vertex at runtime
• Partial completion VertexManager
• Co-Scheduling
• Framework stats for better runtime decisions
36.
Tez – Adoption
• Apache Hive
– Starting from Hive 0.13
– set hive.execution.engine=tez;
• Apache Pig
– Starting from Pig 0.14
– pig -x tez
• Cascading
• Flink
37.
Tez Community
• Useful Links
– http://tez.apache.org/
– JIRA: https://issues.apache.org/jira/browse/TEZ
– Code Repository: https://git-wip-us.apache.org/repos/asf/tez.git
– Mailing Lists
  – Dev List: dev@tez.apache.org
  – User List: user@tez.apache.org
  – Issues List: issues@tez.apache.org
• Tez Meetup
– http://www.meetup.com/Apache-Tez-User-Group
38.
Thank You!
Questions & Answers
Editor's Notes
application_1428021179455_0281 vs application_1428021179455_0282: 691 vs 626 seconds
application_1428021179455_0240: 680 seconds; application_1428021179455_0257: 925 seconds
Hive has written its own processor