Apache Impala (incubating) 2.5 Performance Update
Cloudera, Inc.
Presented at SF Hadoop Users Group meetup on May 3, 2016
1.
© Cloudera, Inc. All rights reserved.
Apache Impala 2.5 (Incubating): Performance improvements overview
2.
Agenda
• What is Impala?
• Impala at Apache
• What is new in Impala 2.5 (CDH 5.7)
• Impala performance update
• Roadmap
• Q&A
3.
SQL-on-Apache Hadoop: choosing the right tool for the right job
[Diagram labels: SQL-on-Hadoop engines; SQL; Impala]
4.
What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• General availability (v1.0) release out since April 2013
• Analytic SQL functionality (v2.0) since October 2014
• Apache incubator project since December 2015
• Previous release 2.3 (CDH 5.5) released November 2015
• Current release 2.5 (CDH 5.7) April 2016 (today’s topic)
5.
Impala overview
• Query speed over Hadoop that meets or exceeds that of a proprietary analytic DBMS
• General-purpose SQL query engine:
  • Targeted for analytical workloads
  • Supports queries that take from milliseconds to hours
• Runs directly within Hadoop:
  • Reads widely used Hadoop file formats
  • Talks to widely used Hadoop storage managers
  • Runs on the same nodes that run Hadoop processes
• Highly available
• High performance:
  • C++ instead of Java
  • Runtime code generation
6.
Impala Use Cases
• Interactive BI/analytics on more data
• Asking new questions: exploration, ML (Ibis)
• Data processing with tight SLAs
• Queryable archive with full fidelity
7.
Impala at Apache
• Incubator project since December 2015
• Development process slowly moving to ASF infrastructure (see IMPALA-3221)
• Help wanted!
Where to find the Impala community:
• dev@impala.incubator.apache.org
• user@impala.incubator.apache.org
• http://impala.io
• @apacheimpala
8.
New in Impala 2.5
Usability Enhancements
• Admission Control Improvements
• Null-safe join/equals
Performance and Scalability
• Runtime filters
• Improved Cardinality Estimation and Join Ordering
• Query start-up improvements
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Fast min/max values on partition columns (with query option)
Integrations
• Support for EMC DSSD
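The "Null-safe join/equals" item can be illustrated with a small sketch. It assumes the slide refers to SQL's null-safe equality (commonly written `IS NOT DISTINCT FROM`, or `<=>` in some dialects); the function names below are hypothetical and model only the semantics, not Impala's implementation. Python's `None` stands in for SQL NULL.

```python
def sql_equals(a, b):
    """Ordinary SQL `a = b`: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equals(a, b):
    """Null-safe equality: always True or False, and NULL matches NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def nested_loop_join(left, right, eq):
    """Keep a pair only when the join condition evaluates to exactly True."""
    return [(l, r) for l in left for r in right if eq(l, r) is True]


left_keys = [1, None]
right_keys = [1, None]

# With ordinary equality the NULL keys never match each other.
plain = nested_loop_join(left_keys, right_keys, sql_equals)
# With null-safe equality the NULL keys pair up.
null_safe = nested_loop_join(left_keys, right_keys, null_safe_equals)
```

The only difference between the two joins is the comparison function, which is why a null-safe operator is enough to make joins on nullable keys behave as many users expect.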
9.
New in Impala 2.5
Performance and Scalability
• Runtime filters
• Improved Cardinality Estimation and Join Ordering
• Query start-up improvements
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Incremental metadata updates (DDL)
• Fast min/max values on partition columns (with query option)
[Callout on slide: Covered today]
10.
Impala 2.5 (CDH 5.7) improvements vs Impala 2.3 (CDH 5.5)
• 2.2x speedup for TPC-H
• 1.7x speedup for TPC-H (Nested)
• 4.3x speedup for TPC-DS
11.
Runtime filtering
• General idea: some predicates can only be computed at runtime
• Example:
    SELECT count(*)
    FROM date_dim dt, store_sales
    WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
      AND dt.d_moy = 12;
• How does Impala execute this query?
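To make the question concrete, here is a toy model of a hash join for the example query without any runtime filter. It is a sketch under stated assumptions, not Impala's executor: the data, the `count_without_runtime_filter` name, and the single-process loop are all hypothetical. The point is that the static predicate `d_moy = 12` can be applied to `date_dim`, but the scan of `store_sales` has no way to skip rows, so every row reaches the join.

```python
# Hypothetical toy data.
# date_dim rows are (d_date_sk, d_moy); store_sales holds ss_sold_date_sk values.
date_dim = [(sk, (sk % 12) + 1) for sk in range(1, 366)]
store_sales = [((i * 7) % 365) + 1 for i in range(10_000)]


def count_without_runtime_filter(date_dim, store_sales):
    """Return (matching rows, rows the scan had to send to the join)."""
    # Build side: apply the static predicate d_moy = 12 and keep the join keys.
    build_keys = {d_date_sk for d_date_sk, d_moy in date_dim if d_moy == 12}

    rows_sent_to_join = 0
    matches = 0
    # Probe side: the scan knows nothing about build_keys, so it forwards every row.
    for ss_sold_date_sk in store_sales:
        rows_sent_to_join += 1
        if ss_sold_date_sk in build_keys:
            matches += 1
    return matches, rows_sent_to_join


matches, rows_sent = count_without_runtime_filter(date_dim, store_sales)
```

In this model `rows_sent` always equals the size of `store_sales`, however selective the predicate on `date_dim` is. That wasted work is what runtime filters target.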
12.
Runtime filters
    SELECT dt.d_year,
           item.i_brand brand,
           sum(ss_ext_sales_price) sum_agg
    FROM date_dim dt, store_sales, item
    WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
      AND store_sales.ss_item_sk = item.i_item_sk
      AND i_category = "Books"
      AND i_class = "fiction"
      AND dt.d_moy = 12
    GROUP BY dt.d_year, item.i_brand
    ORDER BY dt.d_year, sum_agg DESC, i_brand
    LIMIT 100
[Plan diagram labels: store_sales (43 billion rows); item (198 rows); Broadcast Join #1; 290 million rows; date_dim (6,200 rows); Broadcast Join #2; Aggregate; 47 million rows]
13.
Runtime filters: the opportunity (query and plan as on slide 12)
• The planner doesn’t know which ss_sold_date_sk and ss_item_sk values qualify, even with statistics.
• There is an opportunity to save some work: why send all 43 billion store_sales rows to the joins?
• Runtime filters compute this predicate at runtime.
14.
Runtime filters (same query and plan as slide 12)
Step 1: The planner tells Join #1 to produce a bloom filter for qualifying i_item_sk values and Join #2 to produce a bloom filter for qualifying d_date_sk values.
15.
Runtime filters (same query and plan as slide 12)
Step 2: Each join reads all rows from its build side (right input) and computes a filter containing all distinct values of its join key (i_item_sk for Join #1, d_date_sk for Join #2).
16.
Runtime filters (same query and plan as slide 12)
Step 3: Joins #1 and #2 send their filters to the store_sales scan. The scan eliminates rows that don’t have a match in the bloom filters.
17.
Runtime filters (same query as slide 12)
Plan (diagram) with filters applied: store_sales scan (47 million rows) -> Broadcast Join #1 with item (198 rows) -> 47 million rows -> Broadcast Join #2 with date_dim (6,200 rows) -> 47 million rows -> Aggregate
The store_sales scan uses the bloom filter from Join #2 to filter out partitions (ss_sold_date_sk) and the bloom filter from Join #1 to filter out rows that don’t qualify (ss_item_sk).
18.
Runtime filters (same query as slide 12, plan with filters applied)
• 914x reduction in number of rows coming out of the scan: 43 billion -> 47 million
• 6x reduction in number of rows coming out of the join: 290 million -> 47 million
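The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Impala's implementation: the `BloomFilter` class, its size and hashing scheme, and the toy `date_dim` and `store_sales` rows are all hypothetical. It shows the build side producing a filter, the scan applying it, and the join still doing the exact match afterwards (a Bloom filter can let false positives through but never drops a qualifying row).

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from one SHA-256 digest of the key.
        digest = hashlib.sha256(repr(key).encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical toy tables standing in for date_dim and store_sales.
date_dim = [{"d_date_sk": sk, "d_moy": (sk % 12) + 1} for sk in range(1, 121)]
store_sales = [{"ss_sold_date_sk": (i % 120) + 1} for i in range(6000)]

# Step 2: the join reads its build side and records every qualifying key.
december = [row for row in date_dim if row["d_moy"] == 12]
date_filter = BloomFilter()
for row in december:
    date_filter.add(row["d_date_sk"])

# Step 3: the scan drops rows whose key cannot match before they reach the join.
scanned = [r for r in store_sales if date_filter.might_contain(r["ss_sold_date_sk"])]

# The join itself still performs the exact match on the surviving rows.
december_keys = {row["d_date_sk"] for row in december}
joined = [r for r in scanned if r["ss_sold_date_sk"] in december_keys]

print(len(store_sales), len(scanned), len(joined))
```

The filter is only an optimization: removing it changes how many rows reach the join, not the join result.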
19.
Runtime filters variation: Global filters
    SELECT c_email_address, sum(ss_ext_sales_price) sum_agg
    FROM store_sales, customer, customer_demographics
    WHERE ss_customer_sk = c_customer_sk
      AND cd_demo_sk = c_current_cdemo_sk
      AND cd_gender = 'M'
      AND cd_purchase_estimate = 10000
      AND cd_credit_rating = 'Low Risk'
    GROUP BY c_email_address
    ORDER BY sum_agg DESC
Plan (diagram): store_sales scan (43 billion rows) and customer scan (3.8 million rows), both shuffled -> Shuffle Join #1 -> 43 billion rows -> Broadcast Join #2 with customer_demo (2,400 rows) -> 49 million rows -> Aggregate
20.
Runtime filters variation: Global filters (same query and plan as slide 19)
Joins #1 and #2 are expensive because the left side of each join has 43 billion rows.
21.
Runtime filters variation: Global filters (same query and plan as slide 19)
Create a bloom filter from Join #2 on cd_demo_sk and push it down to the customer table scan.
22.
Runtime filters variation: Global filters (same query as slide 19)
Customer rows reduced by 826x: 3.8 million -> 4,600 rows.
Plan (diagram): the customer scan now produces 4,600 rows; store_sales still produces 43 billion rows.
23.
Runtime filters variation: Global filters (same query as slide 19)
Create a bloom filter from Join #1 on c_customer_sk and push it down to the store_sales table scan.
24.
Runtime filters variation: Global filters (same query as slide 19)
Plan (diagram) with filters applied: store_sales scan (49 million rows) and customer scan (4,600 rows), both shuffled -> Shuffle Join #1 -> 49 million rows -> Broadcast Join #2 with customer_demo (2,400 rows) -> 49 million rows -> Aggregate
877x reduction in rows: 43 billion -> 49 million rows
Enabled with: set RUNTIME_FILTER_MODE=GLOBAL;
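The chained push-down described on these slides can be sketched as follows. This is an illustrative toy, not Impala code: plain Python sets stand in for Bloom filters, and the table contents are hypothetical. It shows a filter from Join #2 shrinking the customer scan, and a second filter built from the surviving customers shrinking the store_sales scan. The last two lines recompute the reduction factors quoted on the slides from the slide's own row counts.

```python
# Hypothetical toy tables; Python sets stand in for Bloom filters.
customer_demographics = [
    {"cd_demo_sk": sk, "cd_gender": "M" if sk % 2 else "F"} for sk in range(1, 21)
]
customer = [
    {"c_customer_sk": sk, "c_current_cdemo_sk": (sk % 20) + 1} for sk in range(1, 201)
]
store_sales = [{"ss_customer_sk": (i % 200) + 1} for i in range(4000)]

# Filter from Join #2's build side, keyed on cd_demo_sk.
demo_filter = {
    row["cd_demo_sk"] for row in customer_demographics if row["cd_gender"] == "M"
}

# Pushed down to the customer scan (matches on c_current_cdemo_sk).
customer_scan = [r for r in customer if r["c_current_cdemo_sk"] in demo_filter]

# Filter from Join #1's build side, keyed on c_customer_sk.
customer_filter = {row["c_customer_sk"] for row in customer_scan}

# Pushed down to the store_sales scan (matches on ss_customer_sk).
sales_scan = [r for r in store_sales if r["ss_customer_sk"] in customer_filter]

print(len(customer), "->", len(customer_scan))
print(len(store_sales), "->", len(sales_scan))

# Reduction factors recomputed from the row counts quoted on the slides.
customer_reduction = 3_800_000 / 4_600
sales_reduction = 43e9 / 49e6
```

The second filter is only as selective as it is because the first one already pruned the customer scan, which is why the slides present this as a chain.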
25.
Runtime filters: real-world results
• Runtime filters can be highly effective. Some benchmark queries are more than 30 times faster in Impala 2.5.0.
• As always, results depend on your queries, your schemas and your cluster environment.
• By default, runtime filters are enabled in limited ‘local’ mode in Impala 2.5.0. They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
• Other runtime filter parameters include:
  • RUNTIME_BLOOM_FILTER_SIZE: [1048576]
  • RUNTIME_FILTER_WAIT_TIME_MS: [0]
26.
Improved Cardinality Estimates and Join Order
1. More robust scan cardinality estimation
   • Mitigate correlated predicates (exponential backoff)
2. Improved join cardinality estimation
   • Special treatment of common case of PK/FK joins
   • Detect selective joins by applying the selectivity of build-side predicates to the estimated join cardinality
   • TPC-H Q8 Impact: >8x speedup (91s in Impala 2.3 -> 11s in Impala 2.5)
Example of correlated predicates:
    SELECT * FROM cars WHERE cars.make = 'Toyota' AND cars.model = 'Camry'
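To make the exponential backoff idea concrete, here is a small sketch. The slide does not give Impala's exact formula, so the exponents used here (1, 1/2, 1/4, ... applied to the selectivities sorted from most to least selective) are an assumption, as are the table size and per-predicate selectivities. The point is only that the combined estimate is damped so that correlated predicates such as make and model are not treated as independent.

```python
def naive_selectivity(selectivities):
    """Assume the predicates are independent: multiply the selectivities."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result


def backoff_selectivity(selectivities):
    """Dampen each additional predicate (assumed exponents 1, 1/2, 1/4, ...)."""
    result = 1.0
    for i, s in enumerate(sorted(selectivities)):
        result *= s ** (1.0 / (2 ** i))
    return result


# Hypothetical numbers for the cars example.
table_rows = 1_000_000
make_is_toyota = 0.10   # assumed selectivity of cars.make = 'Toyota'
model_is_camry = 0.02   # assumed selectivity of cars.model = 'Camry'

naive = naive_selectivity([make_is_toyota, model_is_camry])
backoff = backoff_selectivity([make_is_toyota, model_is_camry])

print("independent estimate:", round(table_rows * naive))
print("backoff estimate:    ", round(table_rows * backoff))
```

Because every Camry is a Toyota, the independent estimate undercounts; the backoff estimate stays between it and the selectivity of the most selective predicate alone.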
27.
Query start-up: performance impact
28.
LLVM Codegen Support in Impala
Operations:
• Hash join
• Aggregation
• Scans: Text, Sequence, Avro
• Expressions in all operators
• Sort
• Top-N
Data Types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL
Legend: New in Impala 2.5, Extended in Impala 2.5
29.
Codegen for Order by & Top-N
    void* ExprContext::GetValue(Expr* e, TupleRow* row) {
      switch (e->type_.type) {
        case TYPE_BOOLEAN: { .. .. }
        case TYPE_TINYINT: { .. .. }
        case TYPE_INT: { .. .

    int Compare(TupleRow* lhs, TupleRow* rhs) const {
      for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
        void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
        void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
        if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
        if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
        int result = RawValue::Compare(lhs_value, rhs_value,
                                       sort_cols_lhs_[i]->root()->type());
        if (!is_asc_[i]) result = -result;
        if (result != 0) return result;
        // Otherwise, try the next Expr
      }
      return 0; // fully equivalent key
    }
30.
Codegen for Order by & Top-N
Codegen code:
    int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
      int64_t lhs_value = sort_columns[0]->GetBigIntVal(lhs); // i = 0
      int64_t rhs_value = sort_columns[0]->GetBigIntVal(rhs); // i = 0
      int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
      if (result != 0) return result;
      // Otherwise, try the next Expr
      return 0; // fully equivalent key
    }
• Perfectly unrolls “for each grouping column” loop
• No switching on input type(s)
• Removes branching on ASCENDING/DESCENDING, NULLS FIRST/LAST
Original code: the generic Compare() loop shown on slide 29.
31.
Codegen for Order by & Top-N (same codegen vs original comparison as slide 30)
Result: 10x more efficient code
32.
Decimal arithmetic and aggregation: Float/Double vs Decimal?
Pros for Float/Double
• Uses less memory.
• Faster because floating point math operations are natively supported by processors. (Note: Decimal uses fixed-point hardware types: int64 and __int128)
• Can represent a larger range of numbers.
Cons for Float/Double
• Precision errors compound during aggregations
• Can’t do math with a wide number of significant digits (123456789.1 * .0000987654321)
No go for applications requiring high precision & accuracy
What about the performance penalty?
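The two cons listed for Float/Double can be reproduced with Python's standard `decimal` module. This is only an illustration of the general floating-point behavior, not of Impala's DECIMAL implementation; the repeated value 0.1 is an arbitrary choice, and the product uses the operands from the slide.

```python
from decimal import Decimal

# Precision errors compound during aggregation: add 0.1 ten times.
float_sum = 0.0
decimal_sum = Decimal("0")
for _ in range(10):
    float_sum += 0.1
    decimal_sum += Decimal("0.1")

print("float sum:  ", repr(float_sum))
print("decimal sum:", decimal_sum)

# Many significant digits: the product from the slide.
float_product = 123456789.1 * 0.0000987654321
decimal_product = Decimal("123456789.1") * Decimal("0.0000987654321")

print("float product:  ", repr(float_product))
print("decimal product:", decimal_product)
```

The decimal product keeps every digit because the exact result fits within the module's default 28-digit precision, while the float result is limited to the roughly 15 to 17 significant digits of a 64-bit double.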
33.
Decimal arithmetic and aggregation
    SELECT l_returnflag, l_linestatus,
           Sum(l_quantity) AS SUM_QTY,
           Sum(l_extendedprice) AS SUM_BASE_PRICE,
           Sum(l_extendedprice * (1 - l_discount)) AS SUM_DISC_PRICE
    FROM lineitem
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
3x speedup
● Simplified overflow check for decimal.
● Extended Codegen framework to support aggregations involving decimal.
● Bridged the performance gap between double and decimal.
34.
Distributed Aggregations in Impala
    select cust_id, sum(dollars) from sales group by cust_id;
Diagram: Scan (x3) -> Preagg (x3) -> Network -> Merge (x3)
• Impala aggregations have two phases:
  • Pre-aggregation phase
  • Merge phase
• The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value.
  • E.g. many sales per customer.
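The two phases can be sketched as below for the `sum(dollars) ... group by cust_id` query. The node count, the toy rows and the use of Python's built-in `hash` to pick a merge node are assumptions made for illustration; the sketch only shows why pre-aggregation cuts the number of rows crossing the network when each group has many input rows.

```python
from collections import defaultdict


def pre_aggregate(rows):
    """Phase 1, run on each scan node: partial sum per cust_id."""
    partial = defaultdict(float)
    for cust_id, dollars in rows:
        partial[cust_id] += dollars
    return dict(partial)


def merge(partials_per_node, num_mergers):
    """Phase 2: hash-partition partial sums to merge nodes and combine them."""
    mergers = [defaultdict(float) for _ in range(num_mergers)]
    rows_sent = 0
    for partial in partials_per_node:
        for cust_id, subtotal in partial.items():
            mergers[hash(cust_id) % num_mergers][cust_id] += subtotal
            rows_sent += 1
    final = {}
    for merger in mergers:
        final.update(merger)  # each cust_id lives on exactly one merge node
    return final, rows_sent


# Hypothetical data: 3 scan nodes, 100 sales each, only 4 distinct customers.
nodes = [[(i % 4, 1.0) for i in range(100)] for _ in range(3)]

partials = [pre_aggregate(rows) for rows in nodes]
final, rows_sent = merge(partials, num_mergers=2)

print("rows scanned:", sum(len(rows) for rows in nodes))
print("rows sent over the network:", rows_sent)
print(final)
```

With many sales per customer, each node ships one row per customer instead of one row per sale.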
35.
Downsides of Pre-aggregations
    select distinct * from sales;
Diagram: Scan (x3) -> Preagg (x3) -> Network -> Merge (x3)
• Pre-aggregations consume:
  • Memory
  • CPU cycles
• Pre-aggregations are not always effective at reducing network traffic
  • E.g. select distinct for nearly-distinct rows
• Pre-aggregations can spill to disk under memory pressure
  • Disk I/O is bad: better to send rows to the merge aggregation than to disk
36.
Streaming Pre-aggregations in Impala 2.5
    select distinct * from sales;
Diagram: Scan (x3) -> Network -> Merge (x3)
• Reduction factor is dynamically estimated based on the actual data processed
• Pre-aggregation expands memory usage only if reduction factor is good
• Benefits:
  • Certain aggregations with low reduction factor see speedups of up to 40%
  • Memory consumption can be reduced by 50% or more
  • Streaming pre-aggregations don’t spill to disk
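A rough sketch of the streaming idea follows. The slide does not describe Impala's actual heuristics, so the `max_groups` and `min_reduction` thresholds and the way the reduction factor is computed here are assumptions. The sketch keeps a small hash table, grows it only while the observed reduction factor is good, and otherwise passes rows straight through to the merge phase instead of spilling.

```python
def streaming_pre_aggregate(rows, max_groups=4, min_reduction=2.0):
    """Return pre-aggregated rows plus rows passed through unaggregated."""
    table = {}
    passed_through = []
    rows_seen = 0
    for key, value in rows:
        rows_seen += 1
        if key in table:
            table[key] += value  # existing group: always aggregate in place
            continue
        # Estimated reduction if the table grew by one more group.
        reduction = rows_seen / (len(table) + 1)
        if len(table) < max_groups or reduction >= min_reduction:
            table[key] = value  # reduction is good: spend memory on a new group
        else:
            passed_through.append((key, value))  # stream to the merge phase
    return list(table.items()) + passed_through


def merge(rows):
    """Merge phase: final sum per key, whatever the pre-aggregation did."""
    final = {}
    for key, value in rows:
        final[key] = final.get(key, 0) + value
    return final


# Hypothetical inputs: nearly-distinct rows vs. heavily repeated keys.
nearly_distinct = [(i, 1) for i in range(100)]
repetitive = [(i % 10, 1) for i in range(1000)]

out_distinct = streaming_pre_aggregate(nearly_distinct)
out_repetitive = streaming_pre_aggregate(repetitive)

print(len(nearly_distinct), "->", len(out_distinct))
print(len(repetitive), "->", len(out_repetitive))
```

Either way the merge phase produces the same final result; the pre-aggregation only decides how much work and memory to spend before the network.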
37.
Streaming Pre-aggregations in Impala 2.5

Baseline: finished in 23.13 seconds
Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE  1       366.581ms  366.581ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE   1       149.923us  149.923us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE  15      243.604ms  248.701ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE  15      8s887ms    9s585ms    450.00M  437.91M     1.53 GB    245.01 MB      FINALIZE
03:EXCHANGE   15      827.770ms  932.785ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE  15      9s995ms    11s484ms   450.00M  437.91M     1.64 GB    3.59 GB
00:SCAN HDFS  15      142.192ms  189.179ms  450.00M  450.00M     150.94 MB  88.00 MB       tpch_300_parquet.orders

With streaming pre-aggregation enabled: finished in 14.9 seconds
Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE  1       356.667ms  356.667ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE   1       110.924us  110.924us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE  15      246.188ms  250.408ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE  15      11s174ms   11s753ms   450.00M  437.91M     1.51 GB    245.01 MB      FINALIZE
03:EXCHANGE   15      750.620ms  805.099ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE  15      5s670ms    6s715ms    450.00M  437.91M     153.40 MB  3.59 GB        STREAMING
00:SCAN HDFS  15      151.746ms  201.804ms  450.00M  450.00M     150.95 MB  88.00 MB       tpch_300_parquet.orders
38.
Optimization for partition keys scan
• Use metadata to avoid table accesses for partition key scans:
  • select min(month), max(year) from functional.alltypes;
  • month, year are partition keys of the table
• Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable:
  • min(), max(), ndv() and aggregate functions with distinct keyword
  • partition keys only

Plan without optimization:
    03:AGGREGATE [FINALIZE]
    |  output: min:merge(month), max:merge(year)
    |
    02:EXCHANGE [UNPARTITIONED]
    |
    01:AGGREGATE
    |  output: min(month), max(year)
    |
    00:SCAN HDFS [functional.alltypes]
       partitions=24/24 files=24 size=478.45KB

Plan with optimization:
    01:AGGREGATE [FINALIZE]
    |  output: min(month), max(year)
    |
    00:UNION
       constant-operands=24
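The optimized plan replaces the HDFS scan with a UNION of 24 constant operands, one per partition. A minimal sketch of that idea: the partition list below is a hypothetical stand-in for catalog metadata (the specific year values are assumed), and the aggregate is computed from it without opening any data file.

```python
# Hypothetical catalog metadata: one entry per partition, no file contents needed.
partitions = [
    {"year": year, "month": month}
    for year in (2009, 2010)      # assumed partition values
    for month in range(1, 13)
]


def min_max_from_metadata(partitions, min_col, max_col):
    """Answer min()/max() on partition keys from partition metadata alone."""
    # Each partition contributes one constant row, like the UNION operands.
    constant_rows = [(p[min_col], p[max_col]) for p in partitions]
    return (
        min(row[0] for row in constant_rows),
        max(row[1] for row in constant_rows),
    )


result = min_max_from_metadata(partitions, "month", "year")
print(len(partitions), "constant operands ->", result)
```

This only works because the aggregated columns are partition keys: their values are fixed per partition, so the catalog already knows every value the scan could return.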
39.
Competitive benchmark: TPC-DS
Hardware: 21-node cluster, each node with
● 384GB memory, 2 sockets, 12 total cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
● 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
Comparative Set
● Impala 2.5
  ○ RUNTIME_FILTER_MODE = 2;
● Spark SQL 1.6
  ○ Thrift JDBC server used to avoid startup cost
  ○ --master yarn --deploy-mode client --driver-memory 24G --driver-cores 8 --executor-memory 24G --num-executors 240
Workload
● TPC-DS 15TB stored in Parquet file format (default of 256MB block size)
● Unmodified TPC-DS queries: 3, 7, 8, 19, 25, 27, 34, 42, 43, 46, 47, 52, 53, 55, 59, 61, 63, 68, 73, 79, 88, 89, 96, 98
● Caveats:
  ○ Spark SQL failed running:
    ■ Q25: Bad plan
    ■ Q47: StackOverflowError
    ■ Q89: StackOverflowError
40.
Competitive benchmark
Query complexity varied from Q3:
    SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand,
           Sum(ss_ext_sales_price) sum_agg
    FROM date_dim dt, store_sales, item
    WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
      AND store_sales.ss_item_sk = item.i_item_sk
      AND item.i_manufact_id = 436
      AND dt.d_moy = 12
    GROUP BY dt.d_year, item.i_brand, item.i_brand_id
    ORDER BY dt.d_year, sum_agg DESC, brand_id
    LIMIT 100;
to Q25 (fact-to-fact joins):
    SELECT i_item_id, i_item_desc, s_store_id, s_store_name,
           Stddev_samp(ss_net_profit), Stddev_samp(sr_net_loss),
           Stddev_samp(cs_net_profit) AS catalog_sales_profit
    FROM store_sales, store_returns, catalog_sales,
         date_dim d1, date_dim d2, date_dim d3, store, item
    WHERE d1.d_moy = 4
      AND d1.d_year = 2001
      AND d1.d_date_sk = ss_sold_date_sk
      AND i_item_sk = ss_item_sk
      AND s_store_sk = ss_store_sk
      AND ss_customer_sk = sr_customer_sk
      AND ss_item_sk = sr_item_sk
      AND ss_ticket_number = sr_ticket_number
      AND sr_returned_date_sk = d2.d_date_sk
      AND d2.d_moy BETWEEN 4 AND 10
      AND d2.d_year = 2001
      AND sr_customer_sk = cs_bill_customer_sk
      AND sr_item_sk = cs_item_sk
      AND cs_sold_date_sk = d3.d_date_sk
      AND d3.d_moy BETWEEN 4 AND 10
      AND d3.d_year = 2001
    GROUP BY i_item_id, i_item_desc, s_store_id, s_store_name
    ORDER BY i_item_id, i_item_desc, s_store_id, s_store_name
    LIMIT 100;
41.
Competitive benchmark
42.
Competitive benchmark
Impala 2.5 is 11x faster (based on geomean)
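For readers unfamiliar with the "geomean" summary, the sketch below shows how a geometric mean of per-query speedups is computed and how it differs from an arithmetic mean. The speedup values are hypothetical and are not the benchmark's measurements.

```python
import math


def geomean(values):
    """Geometric mean: the n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-query speedups (time on system B / time on system A).
speedups = [2.0, 8.0]

print("geometric mean: ", geomean(speedups))
print("arithmetic mean:", sum(speedups) / len(speedups))
```

The geometric mean is the usual choice for ratios because one very large speedup cannot dominate the summary the way it does in an arithmetic mean.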
43.
Performance Benchmark Takeaways
• Impala unlocks BI usage directly on Hadoop
  • Meets BI low-latency and multi-user requirements
  • Advantage expands for single-user vs just 10 users
• Spark SQL enables easier Spark application development
  • Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala’s design approach for latency and concurrency
  • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  • CPU efficiency will increase in importance
  • Native code enables easy optimizations for CPU instruction sets
44.
Impala and Cloud
• Available today in Impala 2.5:
  • All the same Impala functionality, performance, and third-party integrations
  • Supported across our cloud partners
  • Deployment via Director
  • Modular architecture enables cloud’s decoupled storage and elasticity future
• Available soon in Impala 2.6:
  • Impala read/write to S3 in addition to local HDFS (IMPALA-1878)
  • Dynamically sized runtime filters
  • Parquet scanner optimization
  • Faster joins, aggregations, sorts and decimal arithmetic
  • Rack aware scheduling
  • Faster code generation
45.
Impala Roadmap
Timeline columns on the slide: 2H 2015, 1H 2016, 2016
• SQL Support & Usability
  • Nested structures
  • Kudu updates (beta)
• Management & Security
  • Record reader service (beta)
  • Finer-grained security (Sentry)
• Integration
  • Isilon support
  • Python interface (Ibis)
• Performance & Scale
  • Improved predictability under concurrency
• Performance & Scale
  • Continued scalability and concurrency
  • Initial perf/scale improvements
• Management & Security
  • Improved admission control
  • Resource utilization and showback
• SQL Support & Usability
  • Dynamic partitioning
• Performance & Scale
  • >20x performance
  • Multi-threaded joins/aggregations
  • Continued scale work
• Cloud
  • S3 read/write support
• Management & Security
  • Improved YARN integration
  • Automated metadata
• SQL Support & Usability
  • Data type improvements
  • Added SQL extensions
46.
Appendix
47.
48.
Query start-up improvements
• Pre Impala 2.5:
  • Coordinator starts receiving fragments before senders
• Problem:
  • Serializes startup
  • Startup gets slower as scale and plan complexity grow
• Impala 2.5:
  • Coordinator starts fragments in any order
  • Added wait logic for senders and receivers
49.
Scheduling Small Queries
The query scheduler assigns scan ranges to workers (running impalad). First it selects an HDFS datanode to read from (replicas A, B, C in the diagram).
Selection will always start with the same replica to make optimal use of OS buffer caches. This can lead to hot-spots for some workloads.
Improvement: Pick impalad at random.
50.
New Query Option: random_replica
Disabled by default.
    set random_replica = 1;
Also has a corresponding query hint:
    SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;
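The effect of the option can be illustrated with a small simulation. This is a sketch, not the Impala scheduler: the replica names A, B and C come from the slide's diagram, while the number of scan ranges, the fixed random seed and the two picker functions are assumptions. It compares always starting with the same replica against picking one at random.

```python
import random
from collections import Counter


def pick_first(replicas):
    """Default behavior on the slide: always start with the same replica."""
    return replicas[0]


def pick_random(replicas, rng):
    """random_replica behavior: choose any replica with equal probability."""
    return rng.choice(replicas)


replicas = ["A", "B", "C"]   # hosts holding a copy of the same block
num_scan_ranges = 3000       # hypothetical number of small scans
rng = random.Random(0)       # fixed seed so the run is repeatable

first_load = Counter(pick_first(replicas) for _ in range(num_scan_ranges))
random_load = Counter(pick_random(replicas, rng) for _ in range(num_scan_ranges))

print("always first replica:", dict(first_load))
print("random replica:      ", dict(random_load))
```

The first policy keeps one host's buffer cache hot but sends it every read; the random policy spreads the reads at the cost of caching the same block on several hosts.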
51.
Where It Can Help
• Large number of small queries, each with few input tables.
• High load on only one of multiple replicas of a table.
• Queries are CPU bound.
• Benefit: Distribute load more evenly over replicas.
• Tradeoff: Distribution of local reads will increase buffer cache usage.
What’s Next
• Add possibility to prefer remote reads.
• Switch remote impalad selection from round-robin to load-based.
• Add rack-awareness.
52.
Catalog Improvements
• Incrementally update table metadata instead of force-reloading all table metadata during DDL/DML operations
• Reload metadata of only ‘dirty’ partitions
• Reuse descriptors of HDFS files to avoid loading file/block metadata for files that haven’t been modified
• Significantly reduce the latency of DDL/DML operations that change a small fraction of table metadata (e.g. alter table foo partition (year = 2010) set location 'blah')
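The difference between a forced full reload and an incremental update can be sketched as follows. The cache layout, the `load_partition` helper and the partition names are hypothetical and do not reflect the catalog's real data structures; the sketch only counts how many partitions have their metadata loaded in each case.

```python
def load_partition(name, load_log):
    """Stand-in for the expensive step: fetching file/block metadata."""
    load_log.append(name)
    return {"partition": name, "files": [f"{name}/part-0.parq"]}


def full_reload(partition_names, load_log):
    """Old behavior: reload metadata for every partition of the table."""
    return {name: load_partition(name, load_log) for name in partition_names}


def incremental_reload(cache, dirty, load_log):
    """New behavior: reload only 'dirty' partitions, reuse the other descriptors."""
    for name in dirty:
        cache[name] = load_partition(name, load_log)
    return cache


partition_names = [f"year={year}" for year in range(2001, 2011)]

full_log = []
cache = full_reload(partition_names, full_log)
untouched = cache["year=2009"]

# A DDL statement changes only the year=2010 partition.
incremental_log = []
incremental_reload(cache, ["year=2010"], incremental_log)

print("full reload loaded:", len(full_log), "partitions")
print("incremental reload loaded:", incremental_log)
```

The cost of the incremental path scales with the number of changed partitions rather than with the size of the table.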
53.
Catalog Improvements - Results