© Cloudera, Inc. All rights reserved.
Introduction to Kudu:
Hadoop Storage for Fast Analytics
on Fast Data
Jeff Holoman
Sr. Systems Engineer
Agenda
What is Kudu? (Motivations & Design Goals)
Use Cases
Overview of Design & Internals
Simple Benchmark
Status & Getting Started
What is Kudu?
But First….
HDFS
Excels at…
• Efficiently scanning large amounts of data
• Accumulating data with high throughput
• Multiple SQL options
• All processing engines
However…
• Single-row access is problematic
• Mutation is problematic
• “Fast Data” access is problematic
GFS paper published in 2003!
HBase
Excels at…
• Efficiently finding and writing individual rows
• Accumulating data with high throughput
However…
• Scans are problematic
• High-cardinality access is problematic
• SQL support is so-so due to the above
Bigtable paper published in 2006!
In 2006…
DID NOT EXIST!
[Chart: RAM price per GB; y-axis from $0 to $200; labels 5, 10, 13, 15, 15]
Today…Changing Hardware Landscape
• Spinning disk -> solid state storage
• NAND flash: up to 450k read / 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping
• 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant
• 64->128->256GB over last few years
Takeaway 1:
The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.
Takeaway 2:
Column stores are feasible for random access.
Current Storage Landscape in Hadoop
Gaps exist when these properties
are needed simultaneously
The Hadoop
Storage “Gap”
The Kudu Elevator Pitch
Storage for Fast Analytics on Fast Data
• New updating column store for Hadoop
• Simplifies the architecture for building analytic applications on changing data
• Designed for fast analytic performance
• Natively integrated with Hadoop
• Apache-licensed open source (with pending ASF Incubator proposal)
• Beta now available
[Diagram: Kudu in the Hadoop stack. Storage: HDFS (filesystem), Kudu (relational), HBase (NoSQL). Ingest: Sqoop, Flume, Kafka. Security: Sentry. Resource management: YARN. Unified data services by workload: batch (Spark, Hive, Pig), stream (Spark), SQL (Impala), search (Solr), model (Spark), online (HBase), spanning data engineering, data discovery & analytics, and data apps.]
Kudu Design Goals
• High throughput for big scans
• Low latency for random accesses
• High CPU performance to better take advantage of RAM and Flash
• Single-column scan rate 10-100x faster than HBase
• High IO efficiency
• True column store with type-specific encodings
• Efficient analytics when only certain columns are accessed
• Expressive and evolvable data model
• Architecture that supports multi-data center operation
Using Kudu
• Table has a SQL-like schema
• Finite number of columns (unlike HBase/Cassandra)
• Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
• Some subset of columns makes up a possibly-composite primary key
• Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
• Insert(), Update(), Delete(), Scan()
• Integrations with MapReduce, Spark, and Impala
• more to come!
What Kudu is *NOT*
• Not a SQL interface itself
• It’s just the storage layer – “Bring Your Own SQL” (e.g. Impala or Spark)
• Not an application that runs on HDFS
• It’s an alternative, native Hadoop storage engine
• Colocation with HDFS is expected
• Not a replacement for HDFS or HBase
• Select the right storage for the right use case
• Cloudera will support and invest in all three
Use Cases for Kudu
Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of
sequential and random reads and writes, e.g.:
● Time Series
○ Examples: Stream market data; fraud detection & prevention; risk monitoring
○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
○ Examples: Network threat detection
○ Workload: Inserts, scans, lookups
● Online Reporting
○ Examples: ODS
○ Workload: Inserts, updates, scans, lookups
Industry Examples
Financial Services
• Streaming market data
• Real-time fraud detection & prevention
• Risk monitoring
Retail
• Real-time offers
• Location-based targeting
Public Sector
• Geospatial monitoring
• Risk and threat detection (real time)
Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
● How do I handle failure during this process?
● How often do I reorganize incoming streaming data into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren’t interrupted by maintenance?
[Diagram: Incoming data (messaging system) lands in HBase (new partition). Once enough data has accumulated, the HBase file is reorganized into a Parquet file: wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file. Parquet files hold the most recent partition and the historic data. Reporting requests go to Impala on HDFS.]
Real-Time Analytics in Hadoop with Kudu
Simpler Architecture, Superior Performance over Hybrid Approaches
[Diagram: incoming data (messaging system) and reporting requests are both served by a single system, Impala on Kudu]
Design & Internals
Kudu Basic Design
• Typed storage
• Basic Construct: Tables
• Tables broken down into Tablets (roughly equivalent to partitions)
• Maintains consistency through a Paxos-like quorum model (Raft)
• Architecture supports geographically disparate, active/active systems
Columnar Data Store
Logical table:
A   B   C
A1  B1  C1
A2  B2  C2
A3  B3  C3
Row-Based Storage: A1 B1 C1 A2 B2 C2 A3 B3 C3
Columnar Storage: A1 A2 A3 B1 B2 B3 C1 C2 C3
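The two layouts above can be sketched in a few lines of Python. This is an illustration only: the flat lists and variable names are assumptions made for the example and say nothing about Kudu's actual on-disk format.

```python
# Logical table from the slide: three rows with columns A, B, C.
rows = [
    ("A1", "B1", "C1"),
    ("A2", "B2", "C2"),
    ("A3", "B3", "C3"),
]

# Row-based storage: the values of one row sit next to each other.
row_layout = [value for row in rows for value in row]

# Columnar storage: the values of one column sit next to each other.
column_layout = [value for column in zip(*rows) for value in column]

# Reading only column B is one contiguous slice of the columnar layout...
n_rows = len(rows)
b_index = 1
column_b = column_layout[b_index * n_rows:(b_index + 1) * n_rows]

# ...but a strided walk across the whole buffer in the row layout.
column_b_from_rows = row_layout[b_index::len(rows[0])]
```

The contiguous slice is why a column store can scan a single column without reading the others.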
Tables and Tablets
• Table is horizontally partitioned into tablets
• Range or hash partitioning
• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5), with Raft consensus
• Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
• Store data on local disks (no HDFS)
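Hash partitioning as in the DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS example can be sketched as follows. This is a minimal sketch under stated assumptions: `bucket_for` is a hypothetical helper, and `zlib.crc32` is a stand-in chosen because it is deterministic; the slides do not say which hash function Kudu uses.

```python
import zlib

NUM_BUCKETS = 100  # from the slide's "INTO 100 BUCKETS"


def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map the hashed column value to a bucket index.

    crc32 is an assumed stand-in hash, not the one Kudu uses.
    """
    return zlib.crc32(str(timestamp).encode("utf-8")) % num_buckets


# Rows keyed by (host, metric, timestamp); only timestamp is hashed here.
sample_rows = [("host1", "cpu", 1000 + i) for i in range(1000)]
counts = [0] * NUM_BUCKETS
for _host, _metric, ts in sample_rows:
    counts[bucket_for(ts)] += 1
```

Because the bucket depends only on the hashed value, rows with consecutive timestamps tend to land in different buckets, so a stream of time-ordered writes is spread over many tablets instead of hitting one.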
Tables and Tablets(2)
CREATE TABLE customers (
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  order_count INT32,
  PRIMARY KEY (last_name, first_name)
)
• Specifying the split rows as (("b", ""), ("c", ""), ("d", ""), ..., ("z", "")) (25 split rows
total) will result in the creation of 26 tablets, with each tablet containing a range
of customer surnames all beginning with a given letter. This is an effective
partition schema for a workload where customers are inserted and updated
uniformly by last name, and scans are typically performed over a range of
surnames.
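The mapping from a primary key to one of the 26 tablets can be sketched with a binary search over the split rows. This is an illustrative sketch, not Kudu client code: `tablet_for` is a hypothetical helper, surnames are assumed to be lowercase, and each split row is assumed to be the inclusive lower bound of the tablet it starts.

```python
import bisect
import string

# 25 split rows: ("b", ""), ("c", ""), ..., ("z", "")
split_rows = [(letter, "") for letter in string.ascii_lowercase[1:]]


def tablet_for(last_name: str, first_name: str) -> int:
    """Return the index (0..25) of the tablet whose key range holds the row."""
    key = (last_name, first_name)
    # bisect_right counts the split rows that are <= key, which is exactly
    # the number of tablet boundaries at or below this key.
    return bisect.bisect_right(split_rows, key)
```

For example, `tablet_for("adams", "jo")` falls before the first split row and lands in tablet 0, while `tablet_for("baker", "al")` lands in tablet 1. A scan over a range of surnames only needs the tablets between the indexes of its two end keys.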
Client: Hey Master! Where is the row for ‘todd@cloudera.com’ in table “T”?
Master: It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …
Client: UPDATE todd@cloudera.com SET …
Client Meta Cache:
T1: …
T2: …
T3: …
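The lookup exchange above can be sketched as a client that asks the master once and then answers later lookups from its meta cache. Everything here is hypothetical: `FakeMaster`, `Client`, the method names, the tablet boundaries, and the server names for T1 and T3 are assumptions made for illustration and are not the Kudu client API.

```python
import bisect


class FakeMaster:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self):
        # (start_key, tablet_id, servers), sorted by start_key.
        self.tablets = [
            ("", "T1", ["A", "B", "C"]),
            ("m", "T2", ["Z", "Y", "X"]),
            ("u", "T3", ["D", "E", "F"]),
        ]
        self.lookups = 0

    def get_table_locations(self, table):
        self.lookups += 1
        return list(self.tablets)


class Client:
    def __init__(self, master):
        self.master = master
        self.meta_cache = {}  # table name -> sorted tablet list

    def locate(self, table, key):
        if table not in self.meta_cache:  # cache miss: ask the master
            self.meta_cache[table] = self.master.get_table_locations(table)
        tablets = self.meta_cache[table]
        starts = [start for start, _tablet_id, _servers in tablets]
        # Last tablet whose start key is <= the requested key.
        index = bisect.bisect_right(starts, key) - 1
        _start, tablet_id, servers = tablets[index]
        return tablet_id, servers


master = FakeMaster()
client = Client(master)
first = client.locate("T", "todd@cloudera.com")
second = client.locate("T", "todd@cloudera.com")
```

Both calls resolve to tablet T2 on servers Z, Y, X, but only the first one reaches the master; the second is served from the cache.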
Tablet Design
• Inserts buffered in an in-memory store (like HBase’s memstore)
• Flushed to disk
• Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
• Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
• No per row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on number of recent updates
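The MVCC idea (updates tagged with a timestamp rather than applied in place, and reads "as of" a timestamp) can be sketched as below. This is a toy model under stated assumptions: `MvccCell` and its methods are hypothetical names, timestamps are plain integers written in increasing order, and nothing here reflects Kudu's real storage structures.

```python
class MvccCell:
    """One cell whose updates are kept as (timestamp, value) versions."""

    def __init__(self):
        self.versions = []  # appended in increasing timestamp order

    def write(self, timestamp, value):
        # An update is tagged with its timestamp instead of overwriting.
        self.versions.append((timestamp, value))

    def read_as_of(self, timestamp):
        """Return the newest value written at or before `timestamp`."""
        result = None
        for ts, value in self.versions:
            if ts > timestamp:
                break
            result = value
        return result


cell = MvccCell()
cell.write(10, "bronze")
cell.write(20, "silver")
cell.write(30, "gold")
```

Here `cell.read_as_of(25)` returns "silver" and `cell.read_as_of(5)` returns `None`. In this toy model a read has to walk the stored versions, which gives some intuition for why many recent updates can slow reads down.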
Metadata
• Replicated master*
• Acts as a tablet directory (“META” table)
• Acts as a catalog (table schemas, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated
tablets)
• Caches all metadata in RAM for high performance
• 80-node load test, GetTableLocations RPC perf:
• 99th percentile: 68us, 99.99th percentile: 657us
• <2% peak CPU usage
• Client configured with master addresses
• Asks master for tablet locations as needed and caches them
Kudu Trade-Offs
• Random updates will be slower
• HBase model allows random updates without incurring a disk seek
• Kudu requires a key lookup before update, Bloom lookup before insert
• Single-row reads may be slower
• Columnar design is optimized for scans
• Future: may introduce “column groups” for applications where single-row
access is more important
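The "Bloom lookup before insert" step can be illustrated with a small Bloom filter placed in front of a key lookup. This is a generic sketch, not Kudu's implementation: the classes, the filter size, the number of hashes, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


class Table:
    def __init__(self):
        self.rows = {}
        self.bloom = BloomFilter()
        self.key_lookups = 0

    def insert(self, key, value):
        # Pay for the real key lookup only when the filter says "maybe".
        if self.bloom.might_contain(key):
            self.key_lookups += 1
            if key in self.rows:
                raise KeyError(f"duplicate primary key: {key}")
        self.rows[key] = value
        self.bloom.add(key)
```

The filter never reports a stored key as absent, so a negative answer lets the insert skip the more expensive key lookup; a positive answer may be a false positive and still needs the lookup.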
Fault tolerance
• Transient FOLLOWER failure:
• Leader can still achieve majority
• Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
• Followers expect to hear a heartbeat from their leader every 1.5 seconds
• 3 missed heartbeats: leader election!
• New LEADER is elected from remaining nodes within a few seconds
• Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
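The arithmetic behind these bullets is small enough to write down. A minimal sketch, assuming the "(N-1)/2" on the slide means integer division and that the election is triggered after three heartbeat intervals pass in silence; the function names are made up for this example.

```python
def tolerated_failures(n_replicas: int) -> int:
    """N replicas handle (N-1)/2 failures, using integer division."""
    return (n_replicas - 1) // 2


def majority(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


HEARTBEAT_INTERVAL_S = 1.5  # from the slide
MISSED_HEARTBEATS = 3       # from the slide


def election_trigger_delay_s() -> float:
    """Silence, in seconds, after which followers start an election."""
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS
```

With 3 replicas one failure is tolerated and 2 replicas form a majority; with 5 replicas the numbers are 2 and 3. Three missed heartbeats at 1.5 seconds each come to 4.5 seconds, consistent with a new leader being elected "within a few seconds".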
Fault tolerance (2)
• Permanent failure:
• Leader notices that a follower has been dead for 5 minutes
• Evicts that follower
• Master selects a new replica
• Leader copies the data over to the new one, which joins as a new FOLLOWER
Benchmarks
TPC-H (Analytics benchmark)
• 75 tablet servers (TS) + 1 master cluster
• 12 (spinning) disks each, enough RAM to fit dataset
• Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
  AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01'
GROUP BY n_name
ORDER BY revenue desc;
- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)
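For readers unfamiliar with the geometric mean used to summarize per-query results, here is how such a figure is typically computed. The speedup values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the benchmark's data, and the exact ratio the deck averaged is not stated.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-query ratios (Parquet time / Kudu time), not the deck's data.
speedups = [1.0, 2.0, 4.0]
summary = geometric_mean(speedups)
```

For these three ratios the geometric mean is about 2.0, whereas the arithmetic mean would be about 2.33; the geometric mean is less dominated by a single very fast or very slow query.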
What about Apache Phoenix?
• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)
[Chart: time in seconds, log scale from 0.01 to 10000]
                     Phoenix   Kudu    Parquet
Load                 2152      1918    155
TPCH Q1              219       13.2    9.3
COUNT(*)             76        1.7     1.4
COUNT(*) WHERE…      131       0.7     1.5
single-row lookup    0.04      0.15    1.37
What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops
Xiaomi Use Case
• Gather important RPC tracing events from mobile app and backend service
• Service monitoring & troubleshooting tool
• High write throughput
• >5 Billion records/day and growing
• Query the latest data with quick response
• Identify and resolve issues quickly
• Can search for individual records
• Easy for troubleshooting
Big Data Analytics Pipeline
Before Kudu
• Long pipeline: high latency (1 hour ~ 1 day), data conversion pains
• No ordering: log arrival (storage) order is not exactly the logical order, e.g. 2-3 days of logs must be read to get the data for 1 day
Big Data Analysis Pipeline
Simplified With Kudu
• ETL Pipeline (0~10s latency): apps that need to prevent backpressure or require ETL
• Direct Pipeline (no latency): apps that don’t require ETL and have no backpressure issues
OLAP scan
Side table lookup
Result store
Use Case 1: Benchmark
Environment
• 71 Node cluster
• Hardware
• CPU: E5-2620 2.1GHz * 24 core
• Memory: 64GB
• Network: 1Gb
• Disk: 12 HDD
• Software
• Hadoop 2.6 / Impala 2.1 / Kudu
Data
• 1 day of server side tracing data
• ~2.6 Billion rows
• ~270 bytes/row
• 17 columns, 5 key columns
Use Case 1: Benchmark Results
Query latency:
          Q1    Q2    Q3    Q4    Q5    Q6
kudu      1.4   2.0   2.3   3.1   1.3   0.9
parquet   1.3   2.8   4.0   5.7   7.5   16.7
Bulk load using Impala (INSERT INTO):
          Total Time (s)   Throughput (total)   Throughput (per node)
Kudu      961.1            2.8M records/s       39.5k records/s
Parquet   114.6            23.5M records/s      331k records/s
* HDFS Parquet file replication = 3, Kudu table replication = 3
* Each query was run 5 times and the average taken
Status & Getting Started
Basic Kudu value proposition
With Kudu:
• Ingest and serve data simultaneously
• Support analytic and real-time operations on the same data set
• Make existing storage architectures simpler, and enable new architectures that previously weren’t possible
Basic Features:
• High availability, no single point of failure
• Consistency by consensus, options for “tunable consistency”
• Horizontally scalable
• Efficient use of modern storage and processors
Current Status
✔ Completed all components core to the architecture
✔ Java and C++ API
✔ Impala, MapReduce, and Spark integration
✔ Support for SSDs and spinning disk
✔ Fault recovery
✔ Public beta available
Getting Started
Users:
Install the Beta or try a VM:
getkudu.io
Get help:
kudu-user@googlegroups.com
Read the white paper:
getkudu.io/kudu.pdf
Developers:
Contribute:
github.com/cloudera/kudu (commits)
gerrit.cloudera.org (reviews)
issues.cloudera.org (JIRAs going back to 2013)
Join the Dev list:
kudu-dev@googlegroups.com
Contributions/participation are welcome and encouraged!
Questions?
Appendix
LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to
on-disk files (HFile/SSTable)
• Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
• Shares some traits (memstores, compactions)
• More complex.
• Slower writes in exchange for faster reads (especially scans)
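The LSM behaviour described above (writes go to an in-memory map, flush to immutable files, and reads merge across them) can be sketched as a toy store. This is a simplified illustration with hypothetical names; real systems such as HBase add sorted files, indexes, and compaction, none of which is modelled here.

```python
class LsmStore:
    """Toy log-structured merge store: one memstore plus flushed files."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}
        self.disk_files = []  # oldest first; each one is treated as immutable
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # Inserts and updates both land in the in-memory map.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.disk_files.append(dict(self.memstore))
            self.memstore = {}

    def get(self, key):
        # A read merges on the fly: the newest data wins, so check the
        # memstore first, then every flushed file from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for flushed in reversed(self.disk_files):
            if key in flushed:
                return flushed[key]
        return None


store = LsmStore()
store.put("k1", "v1")
store.put("k2", "v2")      # second entry triggers a flush
store.put("k1", "v1-new")  # newer version of k1
store.put("k3", "v3")      # triggers a second flush
```

After these writes there are two flushed files, and `store.get("k1")` has to prefer the newer file to return "v1-new". The more files accumulate, the more places a read may need to check, which is the read cost the slide contrasts with Kudu's choice of slower writes for faster reads.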
Kudu storage – Compaction policy
• Solves an optimization problem (knapsack problem)
• Minimize “height” of rowsets for the average key lookup
• Bound on number of seeks for write or random-read
• Restrict total IO of any compaction to a budget (128MB)
• No long compactions, ever
• No “minor” vs “major” distinction
• Always be compacting or flushing
• Low IO priority maintenance threads
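Choosing what to compact under a fixed IO budget has the shape of a 0/1 knapsack problem, which can be sketched as below. This is a textbook dynamic-programming knapsack, not Kudu's actual policy: the candidate rowsets, their sizes, and their "benefit" scores are hypothetical, and only the 128MB budget comes from the slide.

```python
BUDGET_MB = 128  # per-compaction IO budget from the slide


def pick_rowsets(rowsets, budget_mb=BUDGET_MB):
    """0/1 knapsack over rowsets given as (name, size_mb, benefit).

    Returns (total_benefit, chosen_names) with total size <= budget_mb.
    Sizes are assumed to be positive integers.
    """
    # best[b] = (benefit, names) achievable with at most b MB of IO.
    best = [(0, [])] * (budget_mb + 1)
    for name, size_mb, benefit in rowsets:
        # Walk budgets downward so each rowset is used at most once.
        for b in range(budget_mb, size_mb - 1, -1):
            candidate = best[b - size_mb][0] + benefit
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - size_mb][1] + [name])
    return best[budget_mb]


# Hypothetical candidates: (name, size in MB, estimated reduction in height).
candidates = [("rs1", 64, 5), ("rs2", 64, 4), ("rs3", 100, 8), ("rs4", 28, 3)]
total_benefit, chosen = pick_rowsets(candidates)
```

With these made-up numbers the best choice within 128MB is rs3 plus rs4 (benefit 11), which beats rs1 plus rs2 (benefit 9). Capping every compaction at the budget is what keeps each one short.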
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 

Recently uploaded

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

Introduction to Apache Kudu

  • 1. 1© Cloudera, Inc. All rights reserved. Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data Jeff Holoman Sr. Systems Engineer
  • 2. Agenda
      • What is Kudu? (Motivations & Design Goals)
      • Use Cases
      • Overview of Design & Internals
      • Simple Benchmark
      • Status & Getting Started
  • 3. What is Kudu?
  • 4. But First…
  • 5. GFS paper published in 2003!
      • Excels at…
          • Efficiently scanning large amounts of data
          • Accumulating data with high throughput
          • Multiple SQL options
          • All processing engines
      • However…
          • Single-row access is problematic
          • Mutation is problematic
          • “Fast Data” access is problematic
  • 6. Bigtable paper published in 2006!
      • Excels at…
          • Efficiently finding and writing individual rows
          • Accumulating data with high throughput
      • However…
          • Scans are problematic
          • High-cardinality access is problematic
          • SQL support is so-so due to the above
  • 8. In 2006… DID NOT EXIST!
  • 9. [Chart: RAM price per GB; y-axis from $0 to $200]
  • 11. Today… Changing Hardware Landscape
      • Spinning disk -> solid state storage
          • NAND flash: up to 450k read / 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping
          • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
      • RAM is cheaper and more abundant
          • 64 -> 128 -> 256 GB over the last few years
      • Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.
      • Takeaway 2: Column stores are feasible for random access.
  • 12. Current Storage Landscape in Hadoop: The Hadoop Storage “Gap”
      • Gaps exist when these properties are needed simultaneously
  • 13. The Kudu Elevator Pitch: Storage for Fast Analytics on Fast Data
      • New updating column store for Hadoop
      • Simplifies the architecture for building analytic applications on changing data
      • Designed for fast analytic performance
      • Natively integrated with Hadoop
      • Apache-licensed open source (with pending ASF Incubator proposal)
      • Beta now available
      • [Diagram: the Hadoop stack, with Kudu (relational) in the storage layer alongside HDFS (filesystem) and HBase (NoSQL)]
  • 14. Kudu Design Goals
      • High throughput for big scans
      • Low latency for random accesses
      • High CPU performance to better take advantage of RAM and Flash
      • Single-column scan rate 10-100x faster than HBase
      • High IO efficiency
      • True column store with type-specific encodings
      • Efficient analytics when only certain columns are accessed
      • Expressive and evolvable data model
      • Architecture that supports multi-data center operation
  • 16. Using Kudu
      • Table has a SQL-like schema
          • Finite number of columns (unlike HBase/Cassandra)
          • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
          • Some subset of columns makes up a possibly-composite primary key
          • Fast ALTER TABLE
      • Java and C++ “NoSQL” style APIs
          • Insert(), Update(), Delete(), Scan()
      • Integrations with MapReduce, Spark, and Impala
          • More to come!
  • 17. What Kudu is *NOT*
      • Not a SQL interface itself
          • It’s just the storage layer: “Bring Your Own SQL” (e.g. Impala or Spark)
      • Not an application that runs on HDFS
          • It’s an alternative, native Hadoop storage engine
          • Colocation with HDFS is expected
      • Not a replacement for HDFS or HBase
          • Select the right storage for the right use case
          • Cloudera will support and invest in all three
  • 18. Use Cases for Kudu
  • 19. Kudu Use Cases
      • Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes, e.g.:
      • Time Series
          • Examples: Stream market data; fraud detection & prevention; risk monitoring
          • Workload: Inserts, updates, scans, lookups
      • Machine Data Analytics
          • Examples: Network threat detection
          • Workload: Inserts, scans, lookups
      • Online Reporting
          • Examples: ODS
          • Workload: Inserts, updates, scans, lookups
  • 20. Industry Examples
      • Financial Services
          • Streaming market data
          • Real-time fraud detection & prevention
          • Risk monitoring
      • Retail
          • Real-time offers
          • Location-based targeting
      • Public Sector
          • Geospatial monitoring
          • Risk and threat detection (real time)
  • 21. Real-Time Analytics in Hadoop Today: Fraud Detection in the Real World = Storage Complexity
      • Considerations:
          • How do I handle failure during this process?
          • How often do I reorganize data streaming in into a format appropriate for reporting?
          • When reporting, how do I see data that has not yet been reorganized?
          • How do I ensure that important jobs aren’t interrupted by maintenance?
      • [Diagram labels: Incoming Data (Messaging System); New Partition; Most Recent Partition; Historic Data; HBase; Parquet File; “Have we accumulated enough data?”; Reorganize HBase file into Parquet; Reporting Request; Impala on HDFS]
      • Reorganize HBase file into Parquet:
          • Wait for running operations to complete
          • Define new Impala partition referencing the newly written Parquet file
  • 22. Real-Time Analytics in Hadoop with Kudu: Simpler Architecture, Superior Performance over Hybrid Approaches
      • [Diagram labels: Incoming Data (Messaging System); Impala on Kudu; Reporting Request]
  • 23. Design & Internals
  • 24. Kudu Basic Design
      • Typed storage
      • Basic construct: Tables
          • Tables broken down into Tablets (roughly equivalent to partitions)
      • Maintains consistency through a Paxos-like quorum model (Raft)
      • Architecture supports geographically disparate, active/active systems
  • 25. Columnar Data Store
      • Table with columns A, B, C and rows (A1, B1, C1), (A2, B2, C2), (A3, B3, C3)
      • Row-Based Storage: A1 B1 C1 A2 B2 C2 A3 B3 C3
      • Columnar Storage: A1 A2 A3 B1 B2 B3 C1 C2 C3
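The difference between the two layouts on slide 25 can be sketched in a few lines of Python. This is only an illustration of the ordering of values, not Kudu's on-disk format; the function names are made up for the example.

```python
# Three rows of a table with columns A, B, C (values taken from slide 25).
rows = [("A1", "B1", "C1"), ("A2", "B2", "C2"), ("A3", "B3", "C3")]


def row_layout(rows):
    """Row-based storage: all values of one row are stored next to each other."""
    return [value for row in rows for value in row]


def columnar_layout(rows):
    """Columnar storage: all values of one column are stored next to each other."""
    return [value for column in zip(*rows) for value in column]


def column_positions_row(n_rows, n_cols, col):
    """Positions to read for one column in the row layout (strided access)."""
    return [r * n_cols + col for r in range(n_rows)]


def column_positions_columnar(n_rows, n_cols, col):
    """Positions to read for one column in the columnar layout (one contiguous run)."""
    return list(range(col * n_rows, (col + 1) * n_rows))


print(row_layout(rows))
print(columnar_layout(rows))
```

Reading only column B touches scattered positions in the row layout but one contiguous run in the columnar layout, which is why scans over a few columns favor the columnar form.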
  • 26. Tables and Tablets
      • Table is horizontally partitioned into tablets
          • Range or hash partitioning
          • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
      • Each tablet has N replicas (3 or 5), with Raft consensus
          • Allow read from any replica, plus leader-driven writes with low MTTR
      • Tablet servers host tablets
          • Store data on local disks (no HDFS)
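As a rough sketch of what "DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS" on slide 26 means, the snippet below maps a timestamp to one of 100 buckets. The slides do not say which hash function Kudu uses, so `zlib.crc32` is a stand-in chosen only because it is deterministic; `bucket_for` is a hypothetical helper name.

```python
import zlib

NUM_BUCKETS = 100  # from "INTO 100 BUCKETS" on the slide


def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a timestamp to a bucket. crc32 is an assumed stand-in, not Kudu's hash."""
    digest = zlib.crc32(str(timestamp).encode("utf-8"))
    return digest % num_buckets


# Consecutive timestamps land in different buckets, which spreads
# time-ordered inserts across tablets instead of hitting a single one.
for ts in (1000, 1001, 1002):
    print(ts, "->", bucket_for(ts))
```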
  • 27. Tables and Tablets (2)
      • Example schema:
            CREATE TABLE customers (
              first_name STRING NOT NULL,
              last_name STRING NOT NULL,
              order_count INT32,
              PRIMARY KEY (last_name, first_name)
            )
      • Specifying the split rows as (("b", ""), ("c", ""), ("d", ""), .., ("z", "")) (25 split rows total) will result in the creation of 26 tablets, with each tablet containing a range of customer surnames all beginning with a given letter. This is an effective partition schema for a workload where customers are inserted and updated uniformly by last name, and scans are typically performed over a range of surnames.
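The split-row arithmetic on slide 27 (25 split rows giving 26 tablets) can be checked with a short Python sketch. It assumes lowercase surnames and that each split row is the inclusive lower bound of the tablet that follows it; `tablet_index` is a hypothetical helper, not a Kudu API.

```python
from bisect import bisect_right
from string import ascii_lowercase

# The 25 split rows from the slide: ("b", ""), ("c", ""), ..., ("z", "").
SPLIT_ROWS = [(letter, "") for letter in ascii_lowercase[1:]]
NUM_TABLETS = len(SPLIT_ROWS) + 1  # 25 split rows -> 26 tablets


def tablet_index(last_name: str, first_name: str) -> int:
    """Index (0..25) of the tablet whose key range holds this primary key.

    The primary key is (last_name, first_name), compared as a tuple.
    """
    return bisect_right(SPLIT_ROWS, (last_name, first_name))


print(NUM_TABLETS)
print(tablet_index("adams", "jane"))   # surnames starting with "a"
print(tablet_index("baker", "al"))     # surnames starting with "b"
print(tablet_index("zimmer", "hans"))  # surnames starting with "z"
```

A range scan over surnames from "c" to "e" would only need the tablets with indexes 2 through 4 under this scheme.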
  • 28. Client, with an empty Meta Cache
  • 29. Client asks: “Hey Master! Where is the row for ‘todd@cloudera.com’ in table “T”?”
  • 30. Master answers: “It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …”
  • 31. Client stores the returned tablet locations in its Meta Cache (T1: …, T2: …, T3: …)
  • 32. Client issues “UPDATE todd@cloudera.com SET …” using the locations held in its Meta Cache
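The exchange on slides 28 to 32 amounts to a lookup with a client-side cache. The sketch below is a simplified model under stated assumptions: `Master`, `Client`, `get_table_locations` and the key ranges are all hypothetical, the whole table's locations are cached in one step, and cache invalidation is left out.

```python
from bisect import bisect_right


class Master:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, tables):
        # tables: {table_name: [(start_key, tablet_id, servers), ...]},
        # with each list sorted by start_key.
        self.tables = tables
        self.requests = 0

    def get_table_locations(self, table):
        self.requests += 1
        return list(self.tables[table])


class Client:
    def __init__(self, master):
        self.master = master
        self.meta_cache = {}  # table_name -> cached tablet locations

    def locate(self, table, key):
        """Return (tablet_id, servers) for a key, asking the master only on a cache miss."""
        if table not in self.meta_cache:
            self.meta_cache[table] = self.master.get_table_locations(table)
        tablets = self.meta_cache[table]
        starts = [start for start, _, _ in tablets]
        idx = bisect_right(starts, key) - 1
        _, tablet_id, servers = tablets[idx]
        return tablet_id, servers


master = Master({"T": [("", "T1", ("A", "B", "C")),
                       ("h", "T2", ("Z", "Y", "X")),
                       ("u", "T3", ("D", "E", "F"))]})
client = Client(master)
print(client.locate("T", "todd@cloudera.com"))  # first call contacts the master
print(client.locate("T", "alice@example.com"))  # second call is served from the cache
print(master.requests)
```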
  • 33. Tablet Design
      • Inserts buffered in an in-memory store (like HBase’s memstore)
      • Flushed to disk
          • Columnar layout, similar to Apache Parquet
      • Updates use MVCC (updates tagged with timestamp, not in-place)
          • Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
      • Near-optimal read path for “current time” scans
          • No per-row branches, fast vectorized decoding and predicate evaluation
      • Performance worsens based on number of recent updates
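The MVCC bullet on slide 33 (updates tagged with a timestamp rather than applied in place) can be illustrated with a toy model. This is a minimal sketch, not Kudu's storage code: `MvccRow` is a hypothetical class, and it assumes updates arrive in timestamp order.

```python
class MvccRow:
    """Toy MVCC row: a base version plus a list of timestamped updates."""

    def __init__(self, inserted_at, values):
        self.inserted_at = inserted_at
        self.base = dict(values)
        self.deltas = []  # (timestamp, {column: new_value}), in timestamp order

    def update(self, timestamp, changes):
        # The base row is never modified; the change is recorded with its timestamp.
        self.deltas.append((timestamp, dict(changes)))

    def read_as_of(self, timestamp):
        """Return the row as it looked at `timestamp`, or None if not yet inserted."""
        if timestamp < self.inserted_at:
            return None
        row = dict(self.base)
        for ts, changes in self.deltas:
            if ts > timestamp:
                break
            row.update(changes)
        return row


row = MvccRow(10, {"name": "todd", "visits": 1})
row.update(20, {"visits": 2})
row.update(30, {"visits": 3})
print(row.read_as_of(25))  # sees the update at 20 but not the one at 30
```

The loop in `read_as_of` also hints at why the slide says performance worsens with the number of recent updates: every read has to apply the pending changes.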
  • 34. Metadata
      • Replicated master*
          • Acts as a tablet directory (“META” table)
          • Acts as a catalog (table schemas, etc)
          • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
      • Caches all metadata in RAM for high performance
          • 80-node load test, GetTableLocations RPC perf:
              • 99th percentile: 68us, 99.99th percentile: 657us
              • <2% peak CPU usage
      • Client configured with master addresses
          • Asks master for tablet locations as needed and caches them
  • 35. Kudu Trade-Offs
      • Random updates will be slower
          • HBase model allows random updates without incurring a disk seek
          • Kudu requires a key lookup before update, Bloom lookup before insert
      • Single-row reads may be slower
          • Columnar design is optimized for scans
          • Future: may introduce “column groups” for applications where single-row access is more important
  • 36. Fault tolerance
      • Transient FOLLOWER failure:
          • Leader can still achieve majority
          • Restart follower TS within 5 min and it will rejoin transparently
      • Transient LEADER failure:
          • Followers expect to hear a heartbeat from their leader every 1.5 seconds
          • 3 missed heartbeats: leader election!
          • New LEADER is elected from remaining nodes within a few seconds
          • Restart within 5 min and it rejoins as a FOLLOWER
      • N replicas handle (N-1)/2 failures
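The numbers on slide 36 can be worked through in a few lines. The constants come from the slide; reading "(N-1)/2" as integer division and treating three missed heartbeats as 3 x 1.5 seconds are interpretations made for this sketch, and the function names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 1.5  # followers expect a heartbeat every 1.5 seconds
MISSED_HEARTBEATS = 3       # 3 missed heartbeats trigger a leader election


def tolerated_failures(n_replicas: int) -> int:
    """N replicas handle (N-1)/2 failures (rounded down)."""
    return (n_replicas - 1) // 2


def majority(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def election_trigger_delay_s() -> float:
    """Time without heartbeats before followers start an election."""
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS


for n in (3, 5):
    print(n, "replicas: majority", majority(n), "tolerates", tolerated_failures(n))
print("election starts after", election_trigger_delay_s(), "seconds of silence")
```

With 3 replicas a single failure still leaves a majority of 2, and with 5 replicas two failures still leave a majority of 3.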
  • 37. Fault tolerance (2)
      • Permanent failure:
          • Leader notices that a follower has been dead for 5 minutes
          • Evicts that follower
          • Master selects a new replica
          • Leader copies the data over to the new one, which joins as a new FOLLOWER
  • 38. Benchmarks
  • 39. TPC-H (Analytics benchmark)
      • 75 TS + 1 master cluster
          • 12 (spinning) disks each, enough RAM to fit dataset
          • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
      • TPC-H Scale Factor 100 (100GB)
      • Example query:
            SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
            FROM customer, orders, lineitem, supplier, nation, region
            WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
              AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
              AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
              AND r_name = 'ASIA'
              AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01'
            GROUP BY n_name
            ORDER BY revenue desc;
  • 40. TPC-H results
      • Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
      • Parquet likely to outperform Kudu for HDD-resident (larger IO requests)
  • 41. What about Apache Phoenix?
      • 10 node cluster (9 worker, 1 master)
      • HBase 1.0, Phoenix 4.3
      • TPC-H LINEITEM table only (6B rows)
      • [Chart: time in seconds, log scale from 0.01 to 10000]
          • Load: Phoenix 2152, Kudu 1918, Parquet 155
          • TPCH Q1: Phoenix 219, Kudu 13.2, Parquet 9.3
          • COUNT(*): Phoenix 76, Kudu 1.7, Parquet 1.4
          • COUNT(*) WHERE…: Phoenix 131, Kudu 0.7, Parquet 1.5
          • Single-row lookup: Phoenix 0.04, Kudu 0.15, Parquet 1.37
  • 42. What about NoSQL-style random access? (YCSB)
      • YCSB 0.5.0-snapshot
      • 10 node cluster (9 worker, 1 master)
      • HBase 1.0
      • 100M rows, 10M ops
  • 43. Xiaomi Use Case
      • Gather important RPC tracing events from mobile app and backend service
          • Service monitoring & troubleshooting tool
      • High write throughput
          • >5 billion records/day and growing
      • Query latest data with quick response
          • Identify and resolve issues quickly
      • Can search for individual records
          • Easy for troubleshooting
  • 44. Big Data Analytics Pipeline Before Kudu
      • Long pipeline: high latency (1 hour ~ 1 day), data conversion pains
      • No ordering: log arrival (storage) order is not exactly the logical order, e.g. read 2-3 days of log for data in 1 day
  • 45. Big Data Analysis Pipeline Simplified With Kudu
      • ETL pipeline (0~10s latency): apps that need to prevent backpressure or require ETL
      • Direct pipeline (no latency): apps that don’t require ETL and have no backpressure issues
      • [Diagram labels: OLAP scan; Side table lookup; Result store]
  • 46. Use Case 1: Benchmark Environment
      • 71 node cluster
      • Hardware
          • CPU: E5-2620 2.1GHz * 24 core; Memory: 64GB
          • Network: 1Gb; Disk: 12 HDD
      • Software
          • Hadoop 2.6 / Impala 2.1 / Kudu
      • Data
          • 1 day of server side tracing data
          • ~2.6 billion rows
          • ~270 bytes/row
          • 17 columns, 5 key columns
  • 47. Use Case 1: Benchmark Results
      • Bulk load using Impala (INSERT INTO):
          • Kudu: total time 961.1 s; total throughput 2.8M records/s; per-node throughput 39.5k records/s
          • Parquet: total time 114.6 s; total throughput 23.5M records/s; per-node throughput 331k records/s
      • Query latency:
          • Kudu: Q1 1.4, Q2 2.0, Q3 2.3, Q4 3.1, Q5 1.3, Q6 0.9
          • Parquet: Q1 1.3, Q2 2.8, Q3 4.0, Q4 5.7, Q5 7.5, Q6 16.7
      • HDFS Parquet file replication = 3, Kudu table replication = 3
      • Each query was run 5 times and the average taken
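The per-node bulk-load figures on slide 47 appear to be the total throughput divided by the 71 nodes listed on slide 46. That relationship is an inference, not something the slides state; the short check below shows it holds to within rounding.

```python
NODES = 71  # cluster size from the benchmark environment slide


def per_node(total_records_per_s: float, nodes: int = NODES) -> float:
    """Average throughput per node, assuming load is spread evenly."""
    return total_records_per_s / nodes


kudu_per_node = per_node(2.8e6)      # slide reports 39.5k records/s
parquet_per_node = per_node(23.5e6)  # slide reports 331k records/s

print(f"Kudu:    {kudu_per_node / 1e3:.1f}k records/s per node")
print(f"Parquet: {parquet_per_node / 1e3:.1f}k records/s per node")
```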
  • 48. Status & Getting Started
  • 49. Basic Kudu value proposition
      • With Kudu:
          • Ingest and serve data simultaneously
          • Support analytic and real-time operations on the same data set
          • Make existing storage architectures simpler, and enable new architectures that previously weren’t possible
      • Basic Features:
          • High availability, no single point of failure
          • Consistency by consensus, options for “tunable consistency”
          • Horizontally scalable
          • Efficient use of modern storage and processors
  • 50. Current Status
      ✔ Completed all components core to the architecture
      ✔ Java and C++ API
      ✔ Impala, MapReduce, and Spark integration
      ✔ Support for SSDs and spinning disk
      ✔ Fault recovery
      ✔ Public beta available
  • 51. Getting Started
      • Users:
          • Install the Beta or try a VM: getkudu.io
          • Get help: kudu-user@googlegroups.com
          • Read the white paper: getkudu.io/kudu.pdf
      • Developers:
          • Contribute: github.com/cloudera/kudu (commits), gerrit.cloudera.org (reviews), issues.cloudera.org (JIRAs going back to 2013)
          • Join the Dev list: kudu-dev@googlegroups.com
      • Contributions/participation are welcome and encouraged!
  • 52. Questions?
  • 53. Appendix
  • 56. LSM vs Kudu
      • LSM: Log Structured Merge (Cassandra, HBase, etc)
          • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
          • Reads perform an on-the-fly merge of all on-disk HFiles
      • Kudu
          • Shares some traits (memstores, compactions)
          • More complex
          • Slower writes in exchange for faster reads (especially scans)
  • 57. Kudu storage: Compaction policy
      • Solves an optimization problem (knapsack problem)
      • Minimize “height” of rowsets for the average key lookup
          • Bound on number of seeks for write or random-read
      • Restrict total IO of any compaction to a budget (128MB)
      • No long compactions, ever
      • No “minor” vs “major” distinction
      • Always be compacting or flushing
          • Low IO priority maintenance threads
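Slide 57 frames compaction selection as a knapsack problem under a 128MB IO budget. The sketch below is a generic 0/1 knapsack, not Kudu's actual policy: the rowset names, sizes, and "benefit" scores (standing in for the reduction in rowset height) are invented for illustration, and sizes are assumed to be whole megabytes.

```python
def pick_rowsets(rowsets, budget_mb=128):
    """Choose rowsets to compact so total benefit is maximal within the IO budget.

    rowsets: list of (name, size_mb, benefit) with integer size_mb.
    Returns (total_benefit, [names of chosen rowsets]).
    """
    # best[cap] = best (benefit, chosen names) using at most `cap` MB of IO.
    best = [(0.0, [])] * (budget_mb + 1)
    for name, size_mb, benefit in rowsets:
        # Walk capacities downward so each rowset is used at most once.
        for cap in range(budget_mb, size_mb - 1, -1):
            candidate = best[cap - size_mb][0] + benefit
            if candidate > best[cap][0]:
                best[cap] = (candidate, best[cap - size_mb][1] + [name])
    return best[budget_mb]


# Hypothetical candidates: (name, size in MB, benefit score).
candidates = [("a", 64, 3.0), ("b", 64, 2.5), ("c", 100, 4.0), ("d", 32, 1.0)]
benefit, chosen = pick_rowsets(candidates)
print(benefit, chosen)
```

Because the budget caps the IO of every compaction, no single pass can run long, which matches the "no long compactions, ever" bullet.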