How to use Impala & Kudu to optimize performance for Analytic Workloads
David Alves | david.alves@cloudera.com
Impala: A Modern, Open-Source SQL Engine
• Implementation of an MPP SQL query engine for the Hadoop environment
• Designed for performance: brand-new engine, written in C++
• Maintains Hadoop flexibility by utilizing standard Hadoop components (HDFS, Kudu, HBase, Metastore, YARN)
• Plays well with traditional BI tools: exposes/interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL)
Kudu
Storage for Fast Analytics on Fast Data
• New updatable column store for Hadoop
• Currently incubating as an Apache project
• Beta now available (kudu.apache.org)
[Diagram labels: Columnar Store, Kudu]
Impala - Introduction
Impala Architecture: Distributed System
• Daemon process (impalad) runs on every node with data
• Each node handles user requests
• Load balancer configuration for multi-user environments recommended
• Metadata management: catalog service (single node)
• System state repository and distribution: statestore (single node)
• Catalog service and statestore are stateless
Impala Query Execution
• Query execution phases:
• client requests arrive via ODBC/JDBC
• query planner turns request into collection of plan fragments
• coordinator initiates execution on remote impalad nodes
• During execution:
• intermediate results are streamed between query executors
• query results are streamed back to client
• subject to limitations imposed by blocking operators (top-n, aggregation, sorting)
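The difference between streamed and blocking operators can be sketched with Python generators. This is only an illustration, not Impala code: `scan`, `filter_rows`, and `top_n` are hypothetical operators, and the rows are made up. The log shows that a streaming pipeline can hand back its first row after reading a single input row, while a top-n operator has to drain its whole input first.

```python
import heapq
from typing import Callable, Iterable, Iterator, List

Row = dict


def scan(rows: Iterable[Row], log: List[str]) -> Iterator[Row]:
    """Streaming operator: yields rows one at a time as they are read."""
    for row in rows:
        log.append(f"scan {row['id']}")
        yield row


def filter_rows(child: Iterator[Row], pred: Callable[[Row], bool],
                log: List[str]) -> Iterator[Row]:
    """Streaming operator: passes each matching row downstream immediately."""
    for row in child:
        if pred(row):
            log.append(f"emit {row['id']}")
            yield row


def top_n(child: Iterator[Row], n: int, key: str) -> Iterator[Row]:
    """Blocking operator: must consume all input before emitting anything."""
    yield from heapq.nlargest(n, child, key=lambda r: r[key])


# Made-up input rows for the illustration.
DATA = [{"id": i, "value": v} for i, v in enumerate([5, 1, 9, 3, 7])]

# Streaming pipeline: the first result is available after reading one row.
stream_log: List[str] = []
pipeline = filter_rows(scan(DATA, stream_log), lambda r: r["value"] > 2, stream_log)
first = next(pipeline)

# Blocking pipeline: top-n has to drain the scan before it can answer.
block_log: List[str] = []
best = list(top_n(scan(DATA, block_log), 2, "value"))
```

After this runs, `stream_log` holds only the entries for the first row, whereas `block_log` holds one entry per input row even though only two rows are returned.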
Impala Query Execution
Request arrives via ODBC/JDBC
Planner turns request into collection of plan fragments
Coordinator initiates execution on remote impalad nodes
Intermediate results are streamed between impalad nodes
Query results are streamed back to client
Query Planning: Overview
• Two-phase planning process
• single-node plan: tree of plan operators
• partitioning of operator tree into plan fragments for parallel execution
• Parallelization of operators across nodes
• all query operators are fully distributed
• Cost-based join order optimization
• Cost-based join distribution optimization
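As a rough illustration of cost-based join distribution, the sketch below chooses between a broadcast join and a partitioned (shuffle) join by comparing the bytes each would move over the network. The cost model is a simplified textbook approximation and an assumption here, not Impala's actual cost function; `choose_join_distribution` and the sizes are hypothetical.

```python
def choose_join_distribution(left_bytes: int, right_bytes: int, num_nodes: int) -> str:
    """Pick the join strategy that moves fewer bytes over the network.

    Simplified model (an assumption, not Impala's real cost function):
      - broadcast: the right-hand (build) side is copied to every node
      - partitioned: both sides are hash-partitioned and each row is sent once
    """
    broadcast_cost = right_bytes * num_nodes
    partitioned_cost = left_bytes + right_bytes
    return "broadcast" if broadcast_cost <= partitioned_cost else "partitioned"


# A small dimension table joined to a large fact table favors broadcast.
small_dim = choose_join_distribution(10_000_000_000, 1_000_000, num_nodes=10)

# Two large tables favor a partitioned join.
two_large = choose_join_distribution(10_000_000_000, 8_000_000_000, num_nodes=10)
```

Under this model the first call returns "broadcast" and the second returns "partitioned".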
Impala Execution Engine
• Written in C++ for minimal cycle and memory overhead
• Leverages existing parallel DB research
• data-partitioned parallelism
• pipelined relational operators
• batch-at-a-time runtime
• Focused on speed and efficiency
• intrinsics/machine code for text parsing, hashing, etc.
• runtime code generation with LLVM
Kudu - Introduction
Kudu Design Goals
• High throughput for big scans (columnar storage and replication)
Goal: Within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication)
Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
• SQL query
• “NoSQL” style scan/insert/update (Java client)
Kudu Design Goals
… how effectively primary key filters can be pushed down to Kudu.
What do I use Kudu for?
We talked about how Kudu is made for SQL, allows fast scans, and allows fast mutability at scale. With that in context, let’s look at the variety of use cases done in Hadoop today and see where Kudu fits in.
If we look at Kudu in the above figure, we will see that many of the traditional SQL use cases …
Kudu Usage
• Table has a SQL-like schema
• Finite number of columns (unlike HBase/Cassandra)
• Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
• Some subset of columns makes up a possibly-composite primary key
• Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
• Insert(), Update(), Delete(), Scan()
• Integrations with Impala, Spark, MapReduce
• more to come!
Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
● Time Series
○ Examples: Stream market data; fraud detection & prevention; risk monitoring
○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
○ Examples: Network threat detection
○ Workload: Inserts, scans, lookups
● Online Reporting
○ Examples: ODS
○ Workload: Inserts, updates, scans, lookups
Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
● How do I handle failure during this process?
● How often do I reorganize incoming streaming data into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren’t interrupted by maintenance?
[Diagram labels: Incoming Data (Messaging System); New Partition; Most Recent Partition; Historic Data; HBase; Parquet File; Have we accumulated enough data?; Reorganize HBase file into Parquet; Reporting Request; Impala on HDFS]
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file
Real-Time Analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations
[Diagram labels: Incoming Data (Messaging System); Storage in Kudu; Historical and Real-time Data; Reporting Request]
Tables and Tablets
• Table is horizontally partitioned into tablets
• Range or hash partitioning
• Each tablet has N replicas (3 or 5), with Raft consensus
• Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
• Store data on local disks (no HDFS)
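The replica counts of 3 or 5 follow from how majority-based consensus works. A minimal sketch of the arithmetic, assuming only the standard majority rule used by Raft-style protocols (the function names are illustrative):

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority (quorum)."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - majority(replicas)


for n in (3, 5):
    print(f"{n} replicas: quorum of {majority(n)}, "
          f"survives {tolerated_failures(n)} failure(s)")
```

With 3 replicas a write needs 2 acknowledgements and one replica may fail; with 5 replicas it needs 3 and two may fail. An even replica count adds cost without tolerating more failures, which is why odd counts are the usual choice.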
Tablet design
• Inserts buffered in an in-memory store (like HBase’s memstore)
• Flushed to disk
• Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
• Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
• No per-row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on number of recent updates
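The MVCC idea (updates tagged with a timestamp rather than applied in place) can be sketched with a toy in-memory store. This is not Kudu's storage format; `MvccColumn` and its methods are hypothetical names, and integer timestamps stand in for real clock values.

```python
from typing import Any, Dict, List, Optional, Tuple


class MvccColumn:
    """Toy MVCC store: each key keeps a list of (timestamp, value) versions."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[int, Any]]] = {}

    def write(self, key: str, value: Any, timestamp: int) -> None:
        """Record a new version instead of overwriting in place."""
        versions = self._versions.setdefault(key, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0])  # keep versions ordered by timestamp

    def read_as_of(self, key: str, timestamp: int) -> Optional[Any]:
        """Return the latest value written at or before `timestamp`."""
        result = None
        for ts, value in self._versions.get(key, []):
            if ts > timestamp:
                break
            result = value
        return result


store = MvccColumn()
store.write("host1", 10, timestamp=100)
store.write("host1", 20, timestamp=200)

as_of_150 = store.read_as_of("host1", 150)  # sees only the first write
as_of_250 = store.read_as_of("host1", 250)  # sees the update
```

In this toy model a read has to walk past every version of a key, which loosely mirrors the slide's point that read cost grows with the number of recent updates.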
Impala & Kudu – Better together
Impala + Kudu Architecture
[Diagram labels: Impalad and Kudu Tablet Server (repeated for each node); Statestore; Catalog Service; Hive Metastore; Hadoop Namenode; Kudu Master]
Impala & Kudu integration – User features
• Table Create/Delete
• Advanced partitioning schemes
• Easily load/store data to/from Kudu:
– “CREATE TABLE kudu_table AS SELECT * FROM hdfs_table”
– “INSERT INTO hdfs_table SELECT * FROM kudu_table” (with hdfs_table stored as Parquet)
Impala & Kudu integration - Partitioning
• Hash partitioning
– PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH (timestamp) INTO 100 BUCKETS
• Range partitioning
– PRIMARY KEY (last_name, first_name) DISTRIBUTE BY RANGE (last_name, first_name)
• Range + Hash partitioning
– PRIMARY KEY (last_name, first_name) DISTRIBUTE BY RANGE (last_name, first_name) INTO 100 BUCKETS
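To make the two schemes concrete, here is a minimal sketch of how a key could be mapped to a partition under each one. It is an illustration only: Kudu's real hash function and key encoding differ, `zlib.crc32` is a stand-in, and the split points are hypothetical.

```python
import bisect
import zlib
from typing import List


def hash_bucket(key: str, num_buckets: int) -> int:
    """Hash partitioning: spread keys across buckets regardless of ordering."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_partition(key: str, split_points: List[str]) -> int:
    """Range partitioning: keys below split_points[0] go to partition 0,
    keys in [split_points[0], split_points[1]) go to partition 1, and so on."""
    return bisect.bisect_right(split_points, key)


# Hypothetical split points on last_name, giving four range partitions.
SPLITS = ["g", "n", "t"]
range_ids = [range_partition(name, SPLITS)
             for name in ("adams", "miller", "smith", "young")]

# Consecutive timestamps land in unrelated buckets under hash partitioning.
hash_ids = [hash_bucket(str(ts), 100) for ts in (1000, 1001, 1002)]
```

Range partitioning keeps neighboring keys in the same partition, which suits range scans. Hash partitioning scatters them, which helps avoid a write hotspot when the key is a steadily increasing timestamp.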
Impala & Kudu integration – Runtime Features
• Optimized data layout
• Predicate pushdown
• Data locality
• Tolerance to Kudu faults
Impala & Kudu integration – Roadmap
• Shared memory between Impala and Kudu
• Scan Tokens – Forward encoded partitioning information to the Kudu client
• Time-travel scans
• Memory layout matching
• More predicate pushdown (like Bloom filters).
Benchmarks
TPC-H (Analytics benchmark)
• 75 TS (tablet servers) + 1 master cluster
• 12 (spinning) disks each, enough RAM to fit dataset
• Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) AS revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01' GROUP BY n_name ORDER BY revenue DESC;
© Cloudera, Inc. All rights reserved.
- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)
What about Apache Phoenix?
• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)
Time (sec), plotted on a log scale:
          Load   TPCH Q1   COUNT(*)   COUNT(*) WHERE…   single-row lookup
Phoenix   2152   219       76         131               0.04
Kudu      1918   13.2      1.7        0.7               0.15
Parquet   155    9.3       1.4        1.5               1.37
What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops
Kudu & Impala @ Xiaomi
Mobile service monitoring and tracing tool
• Gather important RPC tracing events from mobile app and backend service.
• Service monitoring & troubleshooting tool.
Requirements
• High write throughput
– >5 billion records/day and growing
• Query latest data and quick response
– Identify and resolve issues quickly
• Can search for individual records
– Easy for troubleshooting
Benchmark
Environment
• 71-node cluster
• Hardware:
– CPU: E5-2620 2.1GHz * 24 cores
– Memory: 64GB
– Network: 1Gb
– Disk: 12 HDD
• Software: Hadoop 2.6 / Impala 2.1 / Kudu
Data
• 1 day of tracing data:
– ~2.6 B rows, ~270 bytes/row
– 17 columns, 5 key columns
Workload
• Mix of analytical and lookup-style queries
• Compared vs Parquet
Benchmark Results
Query latency:
          Q1    Q2    Q3    Q4    Q5    Q6
Kudu      1.4   2.0   2.3   3.1   1.3   0.9
Parquet   1.3   2.8   4.0   5.7   7.5   16.7
Bulk load using Impala (INSERT INTO):
          Total Time (s)   Throughput (total)   Throughput (per node)
Kudu      961.1            2.8M records/s       39.5k records/s
Parquet   114.6            23.5M records/s      331k records/s
* HDFS Parquet file replication = 3, Kudu table replication = 3
* Each query was run 5 times and the results averaged
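The bulk-load figures can be cross-checked with simple arithmetic: per-node throughput is total throughput divided by the 71 nodes, and total time multiplied by total throughput gives the implied row count. The sketch below uses only the numbers reported on the slides; the function names are illustrative.

```python
NODES = 71  # cluster size from the benchmark environment


def per_node_rate(total_rate: float, nodes: int) -> float:
    """Records per second handled by each node."""
    return total_rate / nodes


def implied_rows(seconds: float, total_rate: float) -> float:
    """Rows loaded, derived from load time multiplied by throughput."""
    return seconds * total_rate


# (total time in seconds, reported total throughput in records/s)
bulk_load = {
    "Kudu": (961.1, 2.8e6),
    "Parquet": (114.6, 23.5e6),
}

for name, (seconds, total_rate) in bulk_load.items():
    print(f"{name}: {per_node_rate(total_rate, NODES) / 1e3:.1f}k records/s per node, "
          f"~{implied_rows(seconds, total_rate) / 1e9:.2f} B rows")
```

Both formats come out at roughly 2.7 billion rows, in line with the stated ~2.6 B rows, and the per-node rates agree with the slide to within rounding of the reported totals.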
http://kudu.apache.org/
@ApacheKudu
http://impala.io/
@RideImpala
Thank you
David Alves
@dribeiroalves
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize Performance for Analytic Workloads, David Alves - Software Engineer - Cloudera

Building a Hadoop Data Warehouse with Impala
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
 

More from Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

More from Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Recently uploaded

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize Performance for Analytic Workloads, David Alves - Software Engineer - Cloudera

  • 1. How to use Impala & Kudu to optimize performance for Analytic Workloads (David Alves | david.alves@cloudera.com)
  • 2. Impala: A Modern, Open-Source SQL Engine
      • Implementation of an MPP SQL query engine for the Hadoop environment
      • Designed for performance: brand-new engine, written in C++
      • Maintains Hadoop flexibility by utilizing standard Hadoop components (HDFS, Kudu, HBase, Metastore, YARN)
      • Plays well with traditional BI tools: exposes/interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL)
  • 3. Kudu: Storage for Fast Analytics on Fast Data
      • New updatable column store for Hadoop
      • Currently incubating as an Apache project
      • Beta now available (kudu.apache.org)
  • 5. Impala Architecture: Distributed System
      • Daemon process (impalad) runs on every node with data
      • Each node handles user requests
          • Load balancer configuration for multi-user environments recommended
      • Metadata management: catalog service (single node)
      • System state repository and distribution: statestore (single node)
      • Catalog service and statestore are stateless
  • 6. Impala Query Execution
      • Query execution phases:
          • client requests arrive via ODBC/JDBC
          • query planner turns the request into a collection of plan fragments
          • coordinator initiates execution on remote impalad nodes
      • During execution:
          • intermediate results are streamed between query executors
          • query results are streamed back to the client
          • streaming is subject to the limitations imposed by blocking operators (top-n, aggregation, sorting)
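The difference between streaming and blocking operators on slide 6 can be illustrated with a small sketch. This is not Impala code: the operator names (`scan`, `filter_even`, `top_n`) and the integer rows are assumptions made for illustration. Generators stand in for operators that pass batches downstream as soon as they are ready, while top-n has to read all of its input before it can return anything.

```python
import heapq
from typing import Iterable, Iterator, List


def scan(rows: Iterable[int], batch_size: int = 2) -> Iterator[List[int]]:
    """Streaming source: yields rows in small batches as they become available."""
    batch: List[int] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def filter_even(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    """Streaming operator: each output batch depends on a single input batch."""
    for batch in batches:
        kept = [row for row in batch if row % 2 == 0]
        if kept:
            yield kept


def top_n(batches: Iterator[List[int]], n: int) -> List[int]:
    """Blocking operator: consumes every input batch before producing output."""
    all_rows = (row for batch in batches for row in batch)
    return heapq.nlargest(n, all_rows)


streamed = list(filter_even(scan([5, 8, 1, 4])))
result = top_n(filter_even(scan([5, 8, 1, 4, 10, 7, 2])), n=2)
print(streamed, result)
```

The filter can hand its first batch to the client before the scan has finished, whereas the top-n result only exists once the whole input has been seen.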
  • 8. Impala Query Execution
      • Planner turns request into collection of plan fragments
      • Coordinator initiates execution on remote impalad nodes
  • 9. Impala Query Execution
      • Intermediate results are streamed between impalad nodes
      • Query results are streamed back to client
  • 10. Query Planning: Overview
      • Two-phase planning process
          • single-node plan: tree of plan operators
          • partitioning of operator tree into plan fragments for parallel execution
      • Parallelization of operators across nodes
          • all query operators are fully distributed
      • Cost-based join order optimization
      • Cost-based join distribution optimization
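As a rough illustration of what a cost-based join distribution choice involves, the sketch below compares the network volume of two common strategies. The cost formulas are a deliberately simplified assumption, not Impala's actual planner model: a broadcast join is assumed to ship the right-hand input to every node, and a partitioned join is assumed to send each input across the network once. The function name and parameters are hypothetical.

```python
def choose_join_distribution(left_bytes: int, right_bytes: int, num_nodes: int) -> str:
    """Pick the strategy with the lower estimated network cost (simplified model)."""
    # Assumption: broadcasting copies the right-hand side to every node.
    broadcast_cost = right_bytes * num_nodes
    # Assumption: a partitioned (shuffle) join moves both inputs once.
    partitioned_cost = left_bytes + right_bytes
    return "broadcast" if broadcast_cost <= partitioned_cost else "partitioned"


# Small dimension table joined to a large fact table: broadcasting is cheaper.
small_right = choose_join_distribution(left_bytes=10_000, right_bytes=100, num_nodes=10)
# Two inputs of similar size: shuffling both is cheaper than copying one everywhere.
large_right = choose_join_distribution(left_bytes=10_000, right_bytes=5_000, num_nodes=10)
print(small_right, large_right)
```

Under this model the decision depends on table size estimates, which is why a cost-based planner needs statistics about the inputs.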
  • 11. Impala Execution Engine
      • Written in C++ for minimal cycle and memory overhead
      • Leverages existing parallel DB research
          • data-partitioned parallelism
          • pipelined relational operators
          • batch-at-a-time runtime
      • Focused on speed and efficiency
          • intrinsics/machine code for text parsing, hashing, etc.
          • runtime code generation with LLVM
  • 13. Kudu Design Goals
      • High throughput for big scans (columnar storage and replication). Goal: within 2x of Parquet
      • Low latency for short accesses (primary key indexes and quorum replication). Goal: 1 ms read/write on SSD
      • Database-like semantics (initially single-row ACID)
      • Relational data model
          • SQL query
          • “NoSQL” style scan/insert/update (Java client)
  • 15. Kudu Usage
      • Table has a SQL-like schema
          • Finite number of columns (unlike HBase/Cassandra)
          • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
          • Some subset of columns makes up a possibly-composite primary key
          • Fast ALTER TABLE
      • Java and C++ “NoSQL” style APIs
          • Insert(), Update(), Delete(), Scan()
      • Integrations with Impala, Spark, MapReduce
          • more to come!
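To make the "NoSQL" style of access on slide 15 concrete, here is a minimal in-memory sketch of a table with a composite primary key and insert/update/delete/scan operations. It is a toy written for illustration and does not use or mirror the real Kudu client API: the `ToyTable` class, its method signatures, and the `(host, timestamp)` key are all assumptions.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

Key = Tuple[str, int]   # hypothetical composite primary key: (host, timestamp)
Row = Dict[str, float]  # non-key columns


class ToyTable:
    """In-memory stand-in for a table keyed by a composite primary key."""

    def __init__(self) -> None:
        self._rows: Dict[Key, Row] = {}

    def insert(self, key: Key, row: Row) -> None:
        if key in self._rows:
            raise KeyError(f"duplicate primary key: {key}")
        self._rows[key] = dict(row)

    def update(self, key: Key, changes: Row) -> None:
        self._rows[key].update(changes)  # raises KeyError if the key is absent

    def delete(self, key: Key) -> None:
        del self._rows[key]

    def scan(
        self, predicate: Optional[Callable[[Key, Row], bool]] = None
    ) -> Iterator[Tuple[Key, Row]]:
        # Rows come back in primary-key order, optionally filtered.
        for key in sorted(self._rows):
            row = self._rows[key]
            if predicate is None or predicate(key, row):
                yield key, row


table = ToyTable()
table.insert(("web01", 1), {"cpu": 0.25})
table.insert(("web01", 2), {"cpu": 0.75})
table.insert(("web02", 1), {"cpu": 0.5})
table.update(("web01", 2), {"cpu": 0.9})
table.delete(("web02", 1))
busy = [key for key, row in table.scan(lambda k, r: r["cpu"] > 0.5)]
print(busy)
```

The primary key is what makes single-row updates and deletes addressable, which is the main difference from an append-only file format.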
  • 16. Kudu Use Cases
      Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
      ● Time Series
          ○ Examples: Streaming market data; fraud detection & prevention; risk monitoring
          ○ Workload: Inserts, updates, scans, lookups
      ● Machine Data Analytics
          ○ Examples: Network threat detection
          ○ Workload: Inserts, scans, lookups
      ● Online Reporting
          ○ Examples: ODS
          ○ Workload: Inserts, updates, scans, lookups
  • 17. Real-Time Analytics in Hadoop Today: Fraud Detection in the Real World = Storage Complexity
      Considerations:
      ● How do I handle failure during this process?
      ● How often do I reorganize incoming streaming data into a format appropriate for reporting?
      ● When reporting, how do I see data that has not yet been reorganized?
      ● How do I ensure that important jobs aren’t interrupted by maintenance?
      Diagram labels: Incoming Data (Messaging System); HBase; Parquet File; New Partition; Most Recent Partition; Historic Data; Reporting Request; Impala on HDFS
      Diagram flow: Have we accumulated enough data? Reorganize HBase file into Parquet:
      • Wait for running operations to complete
      • Define new Impala partition referencing the newly written Parquet file
  • 18. Real-Time Analytics in Hadoop with Kudu
      Improvements:
      ● One system to operate
      ● No cron jobs or background processes
      ● Handle late arrivals or data corrections with ease
      ● New data available immediately for analytics or operations
      Diagram labels: Incoming Data (Messaging System); Storage in Kudu; Historical and Real-time Data; Reporting Request
  • 19. Tables and Tablets
      • Table is horizontally partitioned into tablets
          • Range or hash partitioning
      • Each tablet has N replicas (3 or 5), with Raft consensus
          • Allow read from any replica, plus leader-driven writes with low MTTR
      • Tablet servers host tablets
          • Store data on local disks (no HDFS)
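The choice of 3 or 5 replicas on slide 19 follows from majority-quorum arithmetic, which is how Raft-style consensus generally works: a write needs acknowledgement from more than half of the replicas. A short sketch of that arithmetic (the function names are illustrative only):

```python
def majority(num_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return num_replicas // 2 + 1


def tolerated_failures(num_replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return num_replicas - majority(num_replicas)


for n in (3, 5):
    print(n, majority(n), tolerated_failures(n))
```

With 3 replicas a tablet keeps accepting writes after one failure, and with 5 replicas after two. An even replica count adds cost without raising the number of tolerated failures, which is why odd counts are the usual choice.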
  • 21. Tablet design
      • Inserts buffered in an in-memory store (like HBase’s memstore)
      • Flushed to disk
          • Columnar layout, similar to Apache Parquet
      • Updates use MVCC (updates tagged with timestamp, not in-place)
          • Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
      • Near-optimal read path for “current time” scans
          • No per-row branches, fast vectorized decoding and predicate evaluation
      • Performance worsens based on number of recent updates
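The MVCC idea on slide 21 (updates tagged with a timestamp rather than applied in place) can be sketched with a single versioned cell. This is a conceptual toy, not Kudu's storage format: the `VersionedCell` class and its integer timestamps are assumptions. It shows why an "as of" read is possible: older versions are kept and a read picks the latest version at or before the requested timestamp.

```python
import bisect
from typing import List, Optional


class VersionedCell:
    """Keeps every write with its timestamp instead of overwriting in place."""

    def __init__(self) -> None:
        self._timestamps: List[int] = []  # kept sorted
        self._values: List[str] = []

    def write(self, timestamp: int, value: str) -> None:
        index = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(index, timestamp)
        self._values.insert(index, value)

    def read_as_of(self, timestamp: int) -> Optional[str]:
        """Return the newest value written at or before `timestamp`."""
        index = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[index - 1] if index else None


cell = VersionedCell()
cell.write(10, "pending")
cell.write(20, "shipped")
print(cell.read_as_of(5), cell.read_as_of(15), cell.read_as_of(25))
```

The sketch also hints at the last bullet on the slide: the more versions a read has to consider, the more work it does.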
  • 22. Impala & Kudu – Better together
  • 23. Impala + Kudu Architecture
      Diagram labels: several worker nodes, each shown with an Impalad and a Kudu Tablet Server; Statestore; Catalog Service; Hive Metastore; Hadoop Namenode; Kudu Master
  • 24. Impala & Kudu integration – User features
      • Table Create/Delete
      • Advanced partitioning schemes
      • Easily load/store data to/from Kudu:
          – “Create table kudu_table as select * from hdfs_table”
          – “Insert into hdfs_table select * from kudu_table AS PARQUET”
  • 26. Impala & Kudu integration - Partitioning
      • Hash partitioning
          – PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
      • Range partitioning
          – PRIMARY KEY (last_name, first_name) DISTRIBUTE BY RANGE (last_name, first_name)
      • Range + Hash partitioning
          – PRIMARY KEY (last_name, first_name) DISTRIBUTE BY RANGE (last_name, first_name) INTO 100 BUCKETS
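How hash and range partitioning route a row to a partition can be sketched in a few lines. This is a conceptual illustration, not Kudu's implementation: the CRC32 hash, the split points, and the function names are assumptions chosen to keep the example deterministic and self-contained.

```python
import bisect
import zlib
from typing import List, Tuple


def hash_bucket(key: str, num_buckets: int) -> int:
    """Hash partitioning: spread keys evenly, with no regard for key order."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_partition(key: str, split_points: List[str]) -> int:
    """Range partitioning: index of the key range that contains `key`."""
    return bisect.bisect_right(split_points, key)


def combined_partition(
    last_name: str, num_buckets: int, split_points: List[str]
) -> Tuple[int, int]:
    """Range + hash: a row is placed by its range and by its hash bucket."""
    return range_partition(last_name, split_points), hash_bucket(last_name, num_buckets)


splits = ["g", "n", "t"]  # hypothetical split points, giving four ranges
ranges = [range_partition(name, splits) for name in ("alves", "miller", "smith", "young")]
bucket = hash_bucket("alves", 100)
print(ranges, bucket, combined_partition("alves", 100, splits))
```

Range partitioning keeps neighbouring keys together, which suits ordered scans, while hashing spreads sequential keys such as timestamps across tablets so that writes do not all land on one of them.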
  • 27. Impala & Kudu integration – Runtime Features
      • Optimized data layout
      • Predicate pushdown
      • Data locality
      • Tolerance to Kudu faults
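Predicate pushdown, listed on slide 27, means the filter is evaluated by the storage layer so that only matching rows travel to the query engine. The sketch below counts rows crossing that boundary with and without pushdown. It is a toy model: the `CountingStorage` class and its `scan` signature are assumptions, not an Impala or Kudu interface.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


class CountingStorage:
    """Toy storage layer that counts how many rows it hands to the engine."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows: List[Row] = list(rows)
        self.rows_returned = 0

    def scan(self, predicate: Optional[Predicate] = None) -> Iterator[Row]:
        for row in self._rows:
            if predicate is None or predicate(row):
                self.rows_returned += 1
                yield row


def wanted(row: Row) -> bool:
    return row["value"] >= 80


rows = [{"metric": i, "value": i * 10} for i in range(10)]

# Without pushdown: the engine receives every row and filters afterwards.
no_pushdown = CountingStorage(rows)
result_a = [row for row in no_pushdown.scan() if wanted(row)]

# With pushdown: the storage layer applies the predicate itself.
pushdown = CountingStorage(rows)
result_b = list(pushdown.scan(predicate=wanted))

print(no_pushdown.rows_returned, pushdown.rows_returned)
```

Both paths return the same rows. The difference is how much data is moved to produce them.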
  • 28. Impala & Kudu integration – Roadmap
      • Shared memory between Impala and Kudu
      • Scan Tokens
          – Forward encoded partitioning information to the Kudu client
      • Time-travel scans
      • Memory layout matching
      • More predicate pushdown (like Bloom filters)
  • 30. TPC-H (Analytics benchmark)
      • 75 tablet servers + 1 master cluster
          • 12 (spinning) disks each, enough RAM to fit dataset
          • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
      • TPC-H Scale Factor 100 (100GB)
      • Example query:
          SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
          FROM customer, orders, lineitem, supplier, nation, region
          WHERE c_custkey = o_custkey
            AND l_orderkey = o_orderkey
            AND l_suppkey = s_suppkey
            AND c_nationkey = s_nationkey
            AND s_nationkey = n_nationkey
            AND n_regionkey = r_regionkey
            AND r_name = 'ASIA'
            AND o_orderdate >= date '1994-01-01'
            AND o_orderdate < '1995-01-01'
          GROUP BY n_name
          ORDER BY revenue desc;
  • 31. © Cloudera, Inc. All rights reserved.
      - Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
      - Parquet likely to outperform Kudu for HDD-resident data (larger IO requests)
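Slide 31 summarizes the benchmark with a geometric mean, the usual way to average ratios such as per-query speedups, because one very fast or very slow query does not dominate it the way it would dominate an arithmetic mean. The sketch below shows the calculation. The per-query timings are hypothetical values invented for illustration and are not the benchmark's data.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """n-th root of the product, computed via logarithms for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-query run times in seconds (NOT the published results).
kudu_seconds = [1.0, 4.0, 2.0, 3.0]
parquet_seconds = [2.0, 2.0, 8.0, 3.0]

# Speedup above 1 means Kudu was faster on that query.
speedups = [p / k for p, k in zip(parquet_seconds, kudu_seconds)]
overall = geometric_mean(speedups)
print(speedups, round(overall, 3))
```

A statement like "outperforms by 31%" would correspond to an overall ratio of about 1.31 under this definition.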
  • 32. What about Apache Phoenix?
      • 10 node cluster (9 worker, 1 master)
      • HBase 1.0, Phoenix 4.3
      • TPC-H LINEITEM table only (6B rows)
      • Chart: Time (sec), log scale
          Operation            Phoenix   Kudu    Parquet
          Load                 2152      1918    155
          TPCH Q1              219       13.2    9.3
          COUNT(*)             76        1.7     1.4
          COUNT(*) WHERE…      131       0.7     1.5
          single-row lookup    0.04      0.15    1.37
  • 33. What about NoSQL-style random access? (YCSB)
      • YCSB 0.5.0-snapshot
      • 10 node cluster (9 worker, 1 master)
      • HBase 1.0
      • 100M rows, 10M ops
  • 34. Kudu & Impala @ Xiaomi
      Mobile service monitoring and tracing tool
      • Gather important RPC tracing events from mobile app and backend service
      • Service monitoring & troubleshooting tool
      Requirements
      • High write throughput: >5 billion records/day and growing
      • Query latest data and quick response: identify and resolve issues quickly
      • Can search for individual records: easy for troubleshooting
  • 35. Benchmark Environment
      • 71-node cluster
      • Hardware:
          • CPU: E5-2620 2.1GHz * 24 core
          • Memory: 64GB
          • Network: 1Gb
          • Disk: 12 HDD
      • Software: Hadoop 2.6 / Impala 2.1 / Kudu
      • Data: 1 day of tracing data: ~2.6 B rows, ~270 bytes/row, 17 columns, 5 key columns
      • Workload:
          • Mix of analytical and lookup-style queries
          • Compared vs Parquet
  • 36. Benchmark Results
      • Query latency:
          Query     Q1    Q2    Q3    Q4    Q5    Q6
          kudu      1.4   2.0   2.3   3.1   1.3   0.9
          parquet   1.3   2.8   4.0   5.7   7.5   16.7
      • Bulk load using Impala (INSERT INTO):
          System    Total Time(s)   Throughput(Total)   Throughput(per node)
          Kudu      961.1           2.8M records/s      39.5k records/s
          Parquet   114.6           23.5M records/s     331k records/s
      • HDFS Parquet file replication = 3, Kudu table replication = 3
      • Each query was run 5 times and the average taken
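The per-node throughput figures on slide 36 can be checked against the totals and the 71-node cluster size given on slide 35. The short calculation below uses only numbers reported on those two slides. The variable names are illustrative, and the results are rounded checks rather than new measurements.

```python
NUM_NODES = 71  # cluster size from the benchmark environment slide

# Reported total bulk-load throughput, in records per second.
reported_total = {"Kudu": 2.8e6, "Parquet": 23.5e6}

# Reported total bulk-load time, in seconds.
load_time_s = {"Kudu": 961.1, "Parquet": 114.6}

per_node = {system: total / NUM_NODES for system, total in reported_total.items()}
load_time_ratio = load_time_s["Kudu"] / load_time_s["Parquet"]

for system, value in per_node.items():
    print(f"{system}: {value / 1000:.1f}k records/s per node")
print(f"Kudu bulk load took {load_time_ratio:.1f}x as long as Parquet")
```

Dividing the totals by 71 gives roughly 39.4k and 331k records/s per node, in line with the slide's per-node column. The load-time ratio shows that bulk loading into Kudu was considerably slower than writing Parquet in this test, while the query latencies on the same slide favour Kudu for most queries.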