SlideShare a Scribd company logo
1 of 43
Download to read offline
Fast Analytics with
Apache Kudu
(incubating)
Ryan Bosshart//Systems Engineer
bosshart@cloudera.com
2
Business
Give me
realtime!!!
Hadoop Architect
3
How would we build an IOT Analytics System Today?
Click to enter confidentiality information
Kafka /
Pub-sub HDFS	
Analyst
App
Servers
Sensor
Sensor
Sensor
Spark
Streaming
4
What Makes This Hard?
Click to enter confidentiality information
Analyst
Duplicate Events
Late-Arriving Data
Data Center Replication
Partitioning
Random-reads
Compactions
Updates
Small Files
Sensor
Sensor
Sensor
Kafka /
Pub-sub HDFS	
App
Servers
Spark
Streaming
5
Real-Time	Analy1cs	in	Hadoop	Today	
Real1me	Analy1cs	in	the	Real	World	=	Storage	Complexity	
Considera*ons:	
●  How	do	I	handle	failure	
during	this	process?	
	
●  How	oEen	do	I	reorganize	
data	streaming	in	into	a	
format	appropriate	for	
repor1ng?	
	
●  When	repor1ng,	how	do	I	see	
data	that	has	not	yet	been	
reorganized?	
	
●  How	do	I	ensure	that	
important	jobs	aren’t	
interrupted	by	maintenance?	
New	Par11on	
Most	Recent	Par11on	
Historic	Data	
HBase	
Parquet	
File	
Have	we	
accumulated	
enough	data?	
Reorganize	
HBase	file	
into	Parquet	
•  Wait	for	running	opera1ons	to	complete		
•  Define	new	Impala	par11on	referencing	
the	newly	wriRen	Parquet	file	
Incoming	Data	
(Messaging	
System)	
Repor1ng	
Request	
Impala	on	HDFS
6
Previous storage landscape of the Hadoop ecosystem
HDFS (GFS) excels at:
•  Batch ingest only (eg hourly)
•  Efficiently scanning large amounts of
data (analytics)
HBase (BigTable) excels at:
•  Efficiently finding and writing
individual rows
•  Making data mutable
Gaps exist when these properties are
needed simultaneously
7
•  High throughput for big scans
Goal: Within 2x of Parquet
•  Low-latency for short accesses
Goal: 1ms read/write on SSD
•  Database-like semantics
(initially single-row ACID)
•  Relational data model
–  SQL queries are easy
–  “NoSQL” style scan/insert/update (Java/C++ client)
Kudu design goals
8
Kudu for Fast Analytics
Why Now
9
Major Changes in Storage Landscape
All	spinning	disks	
Limited	RAM	
	
	
SSD/NAND	cost	effec1ve	
RAM	much	cheaper	
Intel	3Dxpoint	
256GB,	512GB	RAM	common	
.		
	
The	next	boRleneck	is	CPU	
[2007ish] [2013ish] [2017+]
50
50000
10000000
1 1000 1000000
3D Xpoint
SSD
Spinning Disk
Seek Time (in nanoseconds)
3D Xpoint
SSD
Spinning Disk
10
IOT, Real-time, and Reporting Use-Cases
There are more use cases requiring a simultaneous combination of
sequential and random reads and writes
•  Machine data analytics
–  Example: IOT, Connected Cars, Network threat detection
–  Workload: Inserts, scans, lookups
•  Time series
–  Examples: Streaming market data, fraud detection / prevention, risk monitoring
–  Workload: Insert, updates, scans, lookups
•  Online reporting
–  Example: Operational data store (ODS)
–  Workload: Inserts, updates, scans, lookups
11
IOT Use-Cases
•  Analytical
–  R&D wants to know part performance
over time.
–  Train predictive models on machine or
part failure.
•  Real-time
–  Machine Service – e.g. grab an up-to-date
“diagnosis bundle” before or during
service.
–  Rolled out a software update – need to
find out performance ASAP!
12
IOT Use-Cases
•  Analytical
–  R&D wants to know optimal part
performance over time.
–  Train predictive models on machine or
part failure.
•  Real-time
–  Machine Service – e.g. grab an up-to-date
“diagnosis bundle” before or during
service.
–  Rolled out a software update – need to
find out performance ASAP!
fast, efficient scans
= HDFS
fast inserts/lookups
= HBase
13
Hybrid	big	data	analy1cs	pipeline	
Before	Kudu	
Connected
Cars
Kafka /
Pub-sub
Events
HBase
Operational
Consumer
HDFS (Storage)
Random	Reads	
Analyst
Analy1cs	
Snapshot	
&	Convert	to	
Parquet	
Compact	late	
arriving	data
14
Kudu-Based	Analy1cs	Pipeline	
Robots Kafka /
Pub-sub
Events
Kudu
ConsumerRandom	Reads	
Analyst
Analy1cs	
Kudu supports simultaneous combination of
sequential and random reads and writes
15
How it worksReplication and fault tolerance
16
Kudu Basic Design
•  Basic Construct: Tables
–  Tables broken down into Tablets (roughly equivalent to regions or partitions)
•  Typed storage
•  Maintains consistency via:
–  Multi-Version Concurrency Control (MVCC)
–  Raft Consensus1 to replicate operations
•  Architecture supports geographically disparate, active/active systems
–  Not in the initial implementation
1http://thesecretlivesofdata.com/raft/
17
Client
Meta Cache
18
Client
Hey Master! Where is the row for
‘ryan@cloudera.com’ in table “T”?Meta Cache
19
Client
Hey Master! Where is the row for
‘todd@cloudera.com’ in table “T”?
It’s part of tablet 2, which is on servers {Z,Y,X}.
BTW, here’s info on other tablets you might care
about: T1, T2, T3, …
Meta Cache
20
Client
Hey Master! Where is the row for
‘todd@cloudera.com’ in table “T”?
It’s part of tablet 2, which is on servers {Z,Y,X}.
BTW, here’s info on other tablets you might care
about: T1, T2, T3, …
Meta Cache
T1: …
T2: …
T3: …
21
Client
Hey Master! Where is the row for
‘todd@cloudera.com’ in table “T”?
It’s part of tablet 2, which is on servers {Z,Y,X}.
BTW, here’s info on other tablets you might care
about: T1, T2, T3, …
UPDATE
ryan@cloudera.com SET
…
Meta Cache
T1: …
T2: …
T3: …
22
Metadata
•  Replicated master
–  Acts as a tablet directory
–  Acts as a catalog (which tables exist, etc)
–  Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
•  Caches all metadata in RAM for high performance
•  Client configured with master addresses
–  Asks master for tablet locations as needed and caches them
23
Fault tolerance
•  Operations replicated using Raft consensus
–  Strict quorum algorithm. See Raft paper for details
•  Transient failures:
–  Follower failure: Leader can still achieve majority
–  Leader failure: automatic leader election (~5 seconds)
–  Restart dead TS within 5 min and it will rejoin transparently
•  Permanent failures
–  After 5 minutes, automatically creates a new follower replica and copies data
•  N replicas can tolerate maximum of (N-1)/2 failures
24
What Kudu is *NOT*
•  Not a SQL interface itself
– It’s just the storage layer
•  Not an application that runs on HDFS
– It’s an alternative, native Hadoop storage engine
•  Not a replacement for HDFS or HBase
– Select the right storage for the right use case
– Cloudera will continue to support and invest in all three
25
Kudu Trade-Offs (vs Hbase)
•  Random updates will be slower
– HBase model allows random updates without incurring a disk seek
– Kudu requires a key lookup before update, Bloom lookup before insert
•  Single-row reads may be slower
– Columnar design is optimized for scans
– Future: may introduce “column groups” for applications where single-row
access is more important
26
How it works
Replication and fault tolerance
27
Columnar storage
{25059873,
22309487,
23059861,
23010982}
Tweet_id
{newsycbot,
RideImpala,
fastly,
llvmorg}
User_name
{1442865158,
1442828307,
1442865156,
1442865155}
Created_at
{Visual exp…,
Introducing ..,
Missing July…,
LLVM 3.7….}
text
28
Columnar storage
{25059873,
22309487,
23059861,
23010982}
Tweet_id
{newsycbot,
RideImpala,
fastly,
llvmorg}
User_name
{1442865158,
1442828307,
1442865156,
1442865155}
Created_at
{Visual exp…,
Introducing ..,
Missing July…,
LLVM 3.7….}
text
SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’;
Only read 1 column
1GB 2GB 1GB 200GB
29
Columnar compression
{1442865158,
1442828307,
1442865156,
1442865155}
Created_at
Created_at Diff(created_at)
1442865158 n/a
1442828307 -36851
1442865156 36849
1442865155 -1
64 bits each 17 bits each
•  Many columns can compress to
a few bits per row!
•  Especially:
–  Timestamps
–  Time series values
–  Low-cardinality strings
•  Massive space savings and
throughput increase!
30
Handling inserts and updates
•  Inserts go to an in-memory row store (MemRowSet)
–  Durable due to write-ahead logging
–  Later flush to columnar format on disk
•  Updates go to in-memory “delta store”
–  Later flush to “delta files” on disk
–  Eventually “compact” into the previously-written columnar data files
•  Details elided here due to time constraints
–  Read the Kudu whitepaper at http://getkudu.io/kudu.pdf to learn more!
31
Integrations
32
Spark Integration (WIP, available in 0.9)
val df = sqlContext.read.options(kuduOptions)
.format("org.kududb.spark.kudu").load
val changedDF = df.limit(1)
.withColumn("key", df("key”).plus(100))
.withColumn("c2_s", lit("abc"))
changedDF.write.options(kuduOptions)
.mode("append")
.format("org.kududb.spark.kudu").save
33
Impala integration
•  CREATE	TABLE	…	DISTRIBUTE	BY	HASH(vehicle_id)	INTO	16	
BUCKETS	AS	SELECT	…	FROM	…	
•  INSERT/UPDATE/DELETE	
	
•  Optimizations like predicate pushdown, scan parallelism, plans for
more on the way
34
MapReduce integration
•  Most Kudu integration/correctness testing via MapReduce
•  Multi-framework cluster (MR + HDFS + Kudu on the same disks)
•  KuduTableInputFormat / KuduTableOutputFormat
– Support for pushing down predicates, column projections, etc.
35
Performance
36
TPC-H (analytics benchmark)
•  75 server cluster
–  12 (spinning) disks each, enough RAM to fit dataset
–  TPC-H Scale Factor 100 (100GB)
•  Example query:
–  SELECT	n_name,	sum(l_extendedprice	*	(1	-	l_discount))	as	revenue	FROM	customer,	orders,	
lineitem,	supplier,	nation,	region	WHERE	c_custkey	=	o_custkey	AND	l_orderkey	=	
o_orderkey	AND	l_suppkey	=	s_suppkey	AND	c_nationkey	=	s_nationkey	AND	s_nationkey	=	
n_nationkey	AND	n_regionkey	=	r_regionkey	AND	r_name	=	'ASIA'	AND	o_orderdate	>=	date	
'1994-01-01'	AND	o_orderdate	<	'1995-01-01’	GROUP	BY	n_name	ORDER	BY	revenue	desc;
37
38
Versus other NoSQL storage
•  Apache Phoenix: OLTP SQL engine built on HBase
•  10 node cluster (9 worker, 1 master)
•  TPC-H LINEITEM table only (6B rows)
2152
219
76
131
0.04
1918
13.2
1.7
0.7
0.15
155
9.3
1.4 1.5 1.37
0.01
0.1
1
10
100
1000
10000
Load TPCH Q1 COUNT(*)
COUNT(*)
WHERE…
single-row
lookup
Time(sec)
Phoenix
Kudu
Parquet
39
What about NoSQL-style random access? (YCSB)
•  YCSB 0.5.0-snapshot
•  10 node cluster
(9 worker, 1 master)
•  100M row data set
•  10M operations each
workload
40
Getting started with
Kudu
41
Getting started as a user
•  http://getkudu.io
•  kudu-user@googlegroups.com
•  http://getkudu-slack.herokuapp.com/
•  Quickstart VM
–  Easiest way to get started
–  Impala and Kudu in an easy-to-install VM
•  CSD and Parcels
–  For installation on a Cloudera Manager-managed cluster
42
Questions?
http://getkudu.io
bosshart@cloudera.com
43
BETA	SAFE	HARBOR	WARNING	
•  Kudu	is	BETA	(DO	NOT	PUT	IT	IN	PRODUCTION)		
•  Please	play	with	it,	and	let	us	know	your	feedback	
•  Please	consider	this	when	building	out	architectures	for	
second	half	of	2016	
•  Why?	
•  Storage	is	important	and	needs	to	be	stable		
•  (That	said:	we	have	not	experienced	data	loss.	
Kudu	is	reasonably	stable,	almost	no	crashes	
reported)	
•  S1ll	requires	some	expert	assistance,	and	you’ll	
probably	find	some	bugs

More Related Content

What's hot

A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Dataconomy Media
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Datamichaelguia
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data PlatformRakuten Group, Inc.
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architectureMartinStrycek
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...Yahoo Developer Network
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduJeremy Beard
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduDataWorks Summit
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 

What's hot (20)

Apache kudu
Apache kuduApache kudu
Apache kudu
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architecture
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Kudu demo
Kudu demoKudu demo
Kudu demo
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 

Viewers also liked

Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphDataWorks Summit
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with SparkSandy Ryza
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 

Viewers also liked (11)

Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 

Similar to Kudu - Fast Analytics on Fast Data

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Hadoop / Spark Conference Japan
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsKeeyong Han
 
Sql server 2016 it just runs faster sql bits 2017 edition
Sql server 2016 it just runs faster   sql bits 2017 editionSql server 2016 it just runs faster   sql bits 2017 edition
Sql server 2016 it just runs faster sql bits 2017 editionBob Ward
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 
3450 - Writing and optimising applications for performance in a hybrid messag...
3450 - Writing and optimising applications for performance in a hybrid messag...3450 - Writing and optimising applications for performance in a hybrid messag...
3450 - Writing and optimising applications for performance in a hybrid messag...Timothy McCormick
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP PerformanceBIOVIA
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
SQL Server It Just Runs Faster
SQL Server It Just Runs FasterSQL Server It Just Runs Faster
SQL Server It Just Runs FasterBob Ward
 
Aerospike Hybrid Memory Architecture
Aerospike Hybrid Memory ArchitectureAerospike Hybrid Memory Architecture
Aerospike Hybrid Memory ArchitectureAerospike, Inc.
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCoburn Watson
 
Quick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage ClusterQuick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage ClusterPatrick Quairoli
 

Similar to Kudu - Fast Analytics on Fast Data (20)

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
Sql server 2016 it just runs faster sql bits 2017 edition
Sql server 2016 it just runs faster   sql bits 2017 editionSql server 2016 it just runs faster   sql bits 2017 edition
Sql server 2016 it just runs faster sql bits 2017 edition
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Fast Analytics
Fast Analytics Fast Analytics
Fast Analytics
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
3450 - Writing and optimising applications for performance in a hybrid messag...
3450 - Writing and optimising applications for performance in a hybrid messag...3450 - Writing and optimising applications for performance in a hybrid messag...
3450 - Writing and optimising applications for performance in a hybrid messag...
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
SQL Server It Just Runs Faster
SQL Server It Just Runs FasterSQL Server It Just Runs Faster
SQL Server It Just Runs Faster
 
Aerospike Hybrid Memory Architecture
Aerospike Hybrid Memory ArchitectureAerospike Hybrid Memory Architecture
Aerospike Hybrid Memory Architecture
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Quick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage ClusterQuick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage Cluster
 

Recently uploaded

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 

Recently uploaded (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 

Kudu - Fast Analytics on Fast Data