DATA
“Uber, Your Hadoop Has Arrived”
Vinoth Chandar
Uber’s Mission
“Transportation as reliable as running water,
everywhere, for everyone”
400+ Cities 69 Countries
And growing...
Agenda
Bringing Hadoop To Uber
Hadoop Ecosystem
Challenges Ahead
Data @ Uber : Impact
1. City OPS
○ Data users operating a massive transportation system
2. Analysts & Execs
○ Marketing Spend, Forecasting
3. Engineers & Data Scientists
○ Fraud Detection & Discovery
4. Critical Business Operations
○ Incentive Payments/Background Checks
5. Fending Off Uber’s Legal/Regulatory Challenges
○ “You have to produce this data in X hours”
Data @ Uber : Circa 2014
(Architecture diagram) Kafka7 logs, Schemaless databases, and RDBMS tables fed the warehouse two ways: an uploader pushed data to Amazon S3 for EMR, while Wall-e ETL loaded Vertica. Vertica served ad hoc SQL (City Ops/DOPS, data scientists) and applications such as incentive payments, machine learning, safety, and background checks.
Pain Point #1: Data Reliability
- Free-form Python/Node objects -> heavily nested JSON
- Word-of-mouth schema communication
(Diagram) Producers (lots of engineers, lots of services) feed a $$$$ data pipeline run by the data team; the consumers are lots of City Ops.
Pain Point #2: System Scalability
- Kafka7 : Heavy Topics/No HA
- Wall-e : Celery workers unable to keep up with Kafka/Schemaless
- Vertica Queries : More & More Raw Data piling on
(Timeline: issues hit through H1 2014, H2 2014 & beyond)
Pain Point #3: Fragile Ingestion Model
- Multiple fetching from sources
- Painful backfills, since projections & transformations live in the pipelines
(Diagram) Each of trips_table1/2/3 fetching mezzanine separately into the warehouse VS fetching once into a shared “DataPool” that feeds all tables.
Pain Point #4: No Multi-DC Support
- No unified view of data, more complexity for consumers
- Wasteful use of WAN traffic
(Diagram) DC1 and DC2 each shipping data separately, with no global warehouse.
Hadoop Data Lake: Pain, Pain, Go Away!
- (Pain 1) Schematize All Data (old & new)
- Heatpipe/Schema Service/Paricon
- (Pain 2) All Infrastructure Shall Scale Horizontally
- Kafka8 & Hadoop
- Streamific/Sqoop (Deliver data to HDFS)
- Lizzie(Feed Vertica)/Komondor(Feed Hive)
- (Pain 3) Store raw data in nested glory in Hadoop
- JSON -> Avro records -> Parquet!
- (Pain 4) Global View Of All Data
- Unified tables! Yay!
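The “schematize all data” idea can be sketched minimally (the field names and types below are invented for illustration, not Uber’s actual Heatpipe schemas): events get validated against a declared schema at the pipeline’s edge instead of flowing through as free-form JSON.

```python
# Hedged sketch: enforce a declared schema at ingestion time, rejecting
# free-form events instead of letting them poison downstream tables.
REQUIRED = {"trip_id": str, "driver_id": str, "fare": float}  # hypothetical

def validate(event: dict) -> bool:
    """True only if every required field exists with the declared type."""
    return all(isinstance(event.get(k), t) for k, t in REQUIRED.items())

good = {"trip_id": "t1", "driver_id": "d9", "fare": 12.5}
bad = {"trip_id": "t1", "fare": "12.5"}  # missing field, wrong type
```

Rejected events would be routed to an error topic for producers to fix, which is the point of moving schema enforcement out of word-of-mouth and into the pipeline.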
Uber’s Schema Talk : Tomorrow, 2:40PM
Hadoop Ecosystem: Overview
Kafka8 Logs
Schemaless
Databases
SOA Tables
Vertica
Adhoc SQL
(Web UI)
Lizzie ETL
(Spark)
Streamific
Json,
Avro
Hive
(parquet)
Streamific
Sqoop
ETL
(Modeled
Tables)
Janus
Fraud (Hive)
Machine Learning (Spark)
Safety Apps (Spark)
Backfill Pipelines (Spark)
Hadoop
ETL Modeled Tables (Hive)
Back to
- Hive
- Kafka
- NoSQL
flat table
modeled table
Komondor (Spark)
Hadoop Ecosystem: Data Ingestion
Row Based
(HBase/SequenceFiles)
(Parquet)
Columnar
HDFS
Komondor
(Batch)
Kafka Logs
DB Redo Logs
DC1
DC2
DC1
DC2
Streamific
(Streaming,duh..)
Hadoop Ecosystem: Streamific
Long-running service
- Backfills/catch-up don’t hurt sources
Low-latency delivery into row-oriented storage
- HBase/HDFS Append**
Deployed/monitored the ‘uber’ way
- Can run on DCs without YARN etc.
Core (HA, checkpointing, ordering, throttling, at-least-once guarantees) + pluggable in/out streams.
Akka (peak 900 MB/sec), Helix (300K partitions)
(Diagram) Sources: Kafka, Schemaless; sinks: HBase, HDFS, Kafka, S3.
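The at-least-once plus checkpointing core can be illustrated with a toy sketch (this is not Streamific’s actual code): on restart the service resumes from the last durable checkpoint, so records may be re-delivered but are never lost.

```python
# Toy sketch of at-least-once delivery with checkpointing: a crash between
# delivery and checkpoint causes re-delivery (duplicates), never data loss.
class Stream:
    def __init__(self, records):
        self.records = records
        self.checkpoint = 0       # last durably acknowledged offset
        self.delivered = []

    def run(self, crash_at=None):
        offset = self.checkpoint  # resume from checkpoint, may re-read
        while offset < len(self.records):
            if crash_at is not None and offset == crash_at:
                return            # simulate a crash before checkpointing
            self.delivered.append(self.records[offset])
            offset += 1
            if offset % 2 == 0:   # checkpoint every 2 records
                self.checkpoint = offset

s = Stream(["a", "b", "c", "d"])
s.run(crash_at=3)  # delivers a, b, c; crashes with checkpoint stuck at 2
s.run()            # resumes at 2: re-delivers c (duplicate), then d
```

Downstream sinks like HBase make this cheap to tolerate, since re-writing the same row key is idempotent.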
Hadoop Ecosystem: Komondor
The YARN/Spark muscle
- Parquet writing is expensive
- 1->N mapping from raw to parquet/Hive table
Control data quality
- Schema enforcement
- Cleaning JSON
- Hive partitioning
File stitching
- Keeps NN happy & queries performant
Let’s “micro batch”?
- HDFS iNotify stability issues
(Diagram) Kafka logs and DB changelogs land as new files (HBase/HDFS). Komondor builds snapshot tables via full dump (Trips, partitioned by request date; User, partitioned by join date) and incremental tables (Kafka events, partitioned by event publish date; transaction history, partitioned by charge date) in HDFS.
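The file-stitching step can be sketched as a simple greedy packer (the sizes and the 128 MB target below are made up for illustration): many small ingested files get grouped into fewer, larger outputs so the NameNode tracks fewer objects and scans touch fewer files.

```python
# Hedged sketch of file stitching: greedily pack small files into groups of
# roughly `target` MB each, so each group can be rewritten as one large file.
def stitch(file_sizes, target=128):
    """Group file sizes (MB) into batches not exceeding `target` MB."""
    groups, current, size = [], [], 0
    for f in file_sizes:
        if size + f > target and current:
            groups.append(current)   # close the current batch
            current, size = [], 0
        current.append(f)
        size += f
    if current:
        groups.append(current)
    return groups

groups = stitch([10, 20, 100, 60, 60, 5])  # -> [[10, 20], [100], [60, 60, 5]]
```

A real implementation would also cap group count per partition and respect Parquet row-group boundaries, but the packing decision looks like this.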
Hadoop Ecosystem: Consuming Data
1. Adhoc SQL
a. Gateway service => Janus
i. Keep bad queries out!
ii. Choose YARN queues
b. Hits HiveServer2/Tez
2. Data Apps
a. Spark/SparkSQL via HiveContext
b. Support for saving results to Hive/Kafka
c. Monitoring/Debugging the ‘uber’ way
3. Lightweight Apps
a. Python apps hitting gateway
b. Fetch Small results via WebHDFS
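For the lightweight-app path, fetching a small result file needs nothing but HTTP against the WebHDFS REST API. A sketch (the host, port, path, and user below are placeholders, not Uber’s real endpoints):

```python
# Sketch of reading a small result file via WebHDFS: files are exposed under
# /webhdfs/v1/<path>?op=OPEN, so no Hadoop client libraries are needed.
def webhdfs_open_url(host, port, path, user):
    """Build the WebHDFS OPEN URL for an HDFS path."""
    return f"http://{host}:{port}/webhdfs/v1{path}?op=OPEN&user.name={user}"

url = webhdfs_open_url("namenode.example.com", 50070, "/results/q1.csv", "ops")
# with requests: body = requests.get(url).text
```

This is why it suits small results only: the bytes stream through HTTP rather than a parallel HDFS read.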
Hadoop Ecosystem: Feeding Data-marts
Vertica
- SparkSQL/Oozie ETL framework to produce flattened tables
- High Volume
- Simple projections/row-level xforms
- HiveQL to produce well-modelled tables
- + Complex joins
- Also lands tables into Hive
Real-time Dashboarding
- Batch layer for lambda architecture
- Memsql/Riak as the real-time stores
Hadoop Ecosystem: Some Numbers
HDFS
- (Total) 4 PB in 1 HDFS Cluster
- (Growth) ~3 TB/day and growing
YARN
- 6300 VCores (total)
- Number of daily jobs - 60K
- Number of compute hours daily - 78K (3250 days)
Hadoop Ecosystem: 2015 Wins
1. Hadoop is source-of-truth for analytics data
a. 100% of All analytics
2. Powered Critical Business Operations
a. Partner Incentive Payments
3. Unlocked Data
a. Data in Hadoop >> Data in Vertica
We (almost) caught up!
Hadoop Ecosystem: 2016 Challenges
1. Interactive SQL at Scale
a. Put the power of data in our City Ops’s hands
2. All-Active
a. Keep data apps working during failovers
3. Fresher Data in Hadoop
a. Trips in Hive lands in 6 hrs, but 1 hr in Vertica
4. Incremental Computation
a. Most Jobs run daily off raw tables
b. Intra hour jobs to build modeled tables
#1- Interactive SQL at Scale: Motivation
(Diagram) Vertica: fast, but can’t cheaply scale. Hive: powerful and scales reliably, but slowww….
#1- Interactive SQL at Scale: Presto
Fast
- (-er than SparkSQL, -errr than Hive-on-Tez)
Deployed at scale
- (FB/Netflix)
Lack of UDF interop
- Hive ⇔ Spark UDF interop is great!
Out-of-box geo support
- ESRI/Magellan
Other challenges:
- Heavy joins in 100K+ existing queries
- Vertica degrades more gracefully
- Colocation with Hadoop
- Network isolation
#1- Interactive SQL at Scale: Spark Notebooks
1. Great for data scientists!
- Iterative prototyping/exploration
2. Zeppelin/JupyterHub on HDFS
- Run off Mesos clusters
3. Of course, Spark Shell!
- Pipeline troubleshooting
#1- Interactive SQL at Scale: Plan
1. Get Presto up & running
- Work off “modelled tables” out of Hive
- Equivalent of Vertica usage today
2. Presto on Raw nested data
- Raw data in Hive (will be) available at low latency
- Uber’s scalable near real-time warehouse
3. Support Spark Notebook use cases
- Similar QoS issues hitting HDFS from Mesos
#2- All-Active: Motivation
(Diagram) Today: data from all DCs is replicated into a single global data lake. Low availability, SPOF; data copied in and out under high SLA; assumes unbounded WAN links; but less operational overhead.
#2- All-Active: Plan**
Same HA as online services (you fail over, with data readily available)
Maintain N Hadoop lakes?
(Diagram) Data is replicated to peer data centers and into global data lakes (solid lines).
#2- All-Active: Challenges
1. Cross DC replicator design
- File Level vs Record Level Replication
2. Policy Management
- Which data is worth replicating
- Which fields are PII?
3. Reducing Storage Footprint
- 9 copies!! ((2 local lakes + 1 global lake) × 3x HDFS replication)
- Federation across WAN?
4. Capacity Management for Failover
- Degraded mode or hot standby?
#3- Fresher Data in Hadoop: Motivation
1. Uber’s business is inherently ‘real-time’
- Uber’s City Ops need fresh data, to ‘debug’ Uber
2. All the Data is in Hadoop Anyway
- Reduce mindless data movement
3. Leverage Power Of Existing SQL Engines
- Standard SQL Interface & Mature Join support
#3- Fresher Data: Trips on Hadoop (Today)
(Diagram) Schemaless (dc1/dc2) cells stream via Streamific into HBase rows (tripid => row). New/updated trip rows reach the raw trips table in Hive through 10-min file uploads, landing in ~1 hr (tunable). A 6-hr snapshot job then rewrites the trips flat table, feeding the fact_trip modeled table; Vertica, by contrast, gets the same changes via incremental update in ~1 hr.
Snapshot: inefficient & slow
a) Snapshot job => 100s of mappers
b) Reads TBs, writes TBs
c) But, just X GB actual data per day!!!!
#3- Fresher Data: Modelled Tables In Hadoop
(Diagram) Building the fact_trip modeled table on top of the 6-hr snapshot path (another 1-2 hr snapshot in Hive) pushes end-to-end latency to ~7-8+ hrs; Vertica still serves it in ~1 hr.
Latency & inefficiency worsen further
a) Spark/Presto on modelled tables goes from 1-2 hrs to 7-8 hrs!!
b) Resource usage shoots up
#3- Fresher Data: Let’s incrementally update?
(Diagram) Instead of snapshotting, apply the new/updated trip rows from HBase to the Hive tables as 30-min incremental updates: the raw trips table lands in < 1 hr and the modeled trips table in ~1 hr, the same pattern as the Vertica load.
So problem solved, right?
a) Same pattern as Vertica load
b) Saves a bunch of resources
c) And shrinks down latency.
#3- Fresher Data: HDFS/Hive Updates are tedious
(Same diagram) So problem solved, right? Except: the 30-min incremental path now has to update records in place in Hive/HDFS.
#3- Fresher Data: Trip Updates Problem
(Diagram) Raw trips table in Hive, day-level partitions: 2010-2014, 2015/(01-05), 2015/(06-11), 2015/12/(01-31), 2016/01/02, 2016/01/03. The last 1 hr of new/updated trips adds new data to the latest partition and updates a few recent partitions, while the vast bulk of partitions is unaffected.
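The scatter pattern can be made concrete: given the last hour’s new/updated trips, only a handful of day-level partitions are actually touched, so a full snapshot rewrite is mostly wasted work. A sketch (trip IDs and dates below are invented):

```python
# Sketch: map the last hour's new/updated trip rows to the day partitions
# they belong to; only these partitions need rewriting.
def affected_partitions(updates):
    """Return the sorted set of day partitions touched by the updates."""
    return sorted({u["request_date"] for u in updates})

updates = [
    {"trip_id": "t1", "request_date": "2016/01/03"},  # brand-new trip
    {"trip_id": "t2", "request_date": "2016/01/02"},  # late update
    {"trip_id": "t3", "request_date": "2015/12/14"},  # very late update
]
parts = affected_partitions(updates)  # 3 partitions out of thousands
```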
#3- Fresher Data: HDFS/Hive Updates are tedious
(Same diagram) So problem solved, right? Yes, except for those Hive/HDFS updates. Good news: solve this & everything becomes simple.
#3- Fresher Data: Solving Updates
1. Simple Folder/Partition Re-writing
- Most commonly used approach in Hadoop land
2. File Level Updates
- Similar to #1, but at the file level
3. Record Level Updates
- Feels like a k-v store on parquet (and thus more complex)
- Similar to Kudu/Hive transactions
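The file-level approach (option 2) can be sketched with a hypothetical key-to-file index (the index and file names below are invented): only the files that actually contain updated records get rewritten, instead of whole partitions.

```python
# Sketch of file-level updates: an index records which file holds each
# record key, so an update batch rewrites only the touched files.
file_index = {"t1": "part-0001", "t2": "part-0001", "t3": "part-0042"}

def files_to_rewrite(updated_keys):
    """Return the sorted set of files containing any updated key."""
    return sorted({file_index[k] for k in updated_keys if k in file_index})

touched = files_to_rewrite(["t2", "t3"])  # 2 files, not a whole partition
```

Record-level updates (option 3) would go one step further and avoid rewriting even the untouched rows within those files, at the cost of key-value-store-like complexity on Parquet.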
#3- Fresher Data: Plan
● Pick File Level Update approach
○ Establish all the machinery (Custom InputFormat, Spark/Presto
Connectors)
○ Get latency down to 15mins - 1 hour
● Record Level Update approach, if needed
○ Study numbers from production
○ Switch will be transparent to consumers
● In Summary,
○ Unlocks interactive SQL on raw “nested” table at low latency
#4- Incremental Computation: Recurring Jobs
● State of the art : Consume Complete/Immutable
Partitions
- Determine when partition/time window is complete
- Trigger workflows waiting on that partition/time window
● As partition size shrinks, more incoming data lands in old time buckets
- With 1-min/10-min partitions, keep getting new data for old time buckets
● Fundamental tradeoff
- More Latency => More Completeness
#4- Incremental Computation: Use Cases
- Diff apps => diff needs
a. Hadoop has one mode: “completeness”
- Apps must be able to choose
a. job.trigger.atCompleteness(90)
b. job.schedule.atInterval(10 mins)
- Closest thing out there: Google Cloud Dataflow
(Chart) Apps plotted on a completeness-vs-latency spectrum: incentive payments, fraud detection, backfill pipelines, business dashboards/ETL, data science, safety apps, with latencies ranging from < 1 hr to days.
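The hypothetical job.trigger.atCompleteness(90) API above boils down to a trigger predicate like the following sketch (not a real scheduler): each app picks its own point on the completeness/latency spectrum.

```python
# Sketch: trigger a workflow once its time window is "complete enough",
# letting each app choose its completeness threshold.
def should_trigger(arrived, expected, threshold_pct):
    """True once arrived/expected reaches the app's completeness target."""
    return expected > 0 and 100.0 * arrived / expected >= threshold_pct

# fraud detection favours latency: fire at 90% complete
# incentive payments favour completeness: wait for 100%
```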
#4- Incremental Computation: Tailing Tables
● Adding a new style of consuming data
- Obtain new records loaded into table, since last run, across partitions
- Consume in 15-30 min batches
● Favours Latency - Providing new data quickly
- Consumer logic responsible for reconciliation with previous results
● Need a special marker to denote consumption point
- commitTime: For each record, the time at which it was last updated
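The tailing pattern can be sketched as: each run filters on the commitTime marker and advances a checkpoint, so it consumes only records committed since the last run (the records below are invented):

```python
# Sketch of tailing a table via a per-record commitTime marker: each batch
# picks up only records committed after the previous checkpoint.
def tail(records, last_commit_time):
    """Return records newer than the checkpoint, plus the next checkpoint."""
    new = [r for r in records if r["commitTime"] > last_commit_time]
    nxt = max((r["commitTime"] for r in new), default=last_commit_time)
    return new, nxt

records = [
    {"trip_id": "t1", "commitTime": 100},
    {"trip_id": "t2", "commitTime": 130},
    {"trip_id": "t3", "commitTime": 145},
]
new, ckpt = tail(records, last_commit_time=120)  # picks up t2 and t3
```

The book-keeping mentioned in the plan, mapping commitTime ranges to Hive partitions/files, is what keeps this filter from degenerating into a full scan.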
#4- Incremental Computation: Plan
● Add Metadata at Record-level to enable tailing
○ Extra book-keeping to map commitTime to Hive Partitions/Files
■ Avoid disastrous full scans
○ Can be combined with granular Hive partitions if needed
■ 15 min Hive Partitions => ~200K partitions for trip table
● Open Items:
○ Late arrival handling
■ Tracking when a time-window becomes complete
■ Design to (re)trigger workflows
○ Incrementally Recomputing aggregates
Summary
Today
- Living, breathing data ecosystem
- Catch(-ing) up to the state of the art
Tomorrow
- Push edges based on Uber’s needs
- Near Real-time Warehouse
- Incremental Compute
- All-Active
- Make Every Decision (Human/Machine) data driven
Thank you!

Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 

More from Vinoth Chandar

[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/HudiVinoth Chandar
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State DrivesVinoth Chandar
 
Composing and Executing Parallel Data Flow Graphs wth Shell Pipes
Composing and Executing Parallel Data Flow Graphs wth Shell PipesComposing and Executing Parallel Data Flow Graphs wth Shell Pipes
Composing and Executing Parallel Data Flow Graphs wth Shell PipesVinoth Chandar
 
Triple-Triple RDF Store with Greedy Graph based Grouping
Triple-Triple RDF Store with Greedy Graph based GroupingTriple-Triple RDF Store with Greedy Graph based Grouping
Triple-Triple RDF Store with Greedy Graph based GroupingVinoth Chandar
 
Distributeddatabasesforchallengednet
DistributeddatabasesforchallengednetDistributeddatabasesforchallengednet
DistributeddatabasesforchallengednetVinoth Chandar
 

More from Vinoth Chandar (7)

[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State Drives
 
Composing and Executing Parallel Data Flow Graphs wth Shell Pipes
Composing and Executing Parallel Data Flow Graphs wth Shell PipesComposing and Executing Parallel Data Flow Graphs wth Shell Pipes
Composing and Executing Parallel Data Flow Graphs wth Shell Pipes
 
Triple-Triple RDF Store with Greedy Graph based Grouping
Triple-Triple RDF Store with Greedy Graph based GroupingTriple-Triple RDF Store with Greedy Graph based Grouping
Triple-Triple RDF Store with Greedy Graph based Grouping
 
Distributeddatabasesforchallengednet
DistributeddatabasesforchallengednetDistributeddatabasesforchallengednet
Distributeddatabasesforchallengednet
 
Bluetube
BluetubeBluetube
Bluetube
 

Recently uploaded

Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosVictor Morales
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionSneha Padhiar
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfDrew Moseley
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organizationchnrketan
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESkarthi keyan
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
A brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProA brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProRay Yuan Liu
 
70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical training70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical trainingGladiatorsKasper
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTSneha Padhiar
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 

Recently uploaded (20)

Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitos
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based question
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organization
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
A brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProA brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision Pro
 
70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical training70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical training
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
 

Hadoop Strata Talk - Uber, your hadoop has arrived

  • 1. DATA “Uber, Your Hadoop Has Arrived” Vinoth Chandar
  • 2. Uber’s Mission “Transportation as reliable as running water, everywhere, for everyone” 400+ Cities 69 Countries And growing...
  • 3. Agenda Bringing Hadoop To Uber Hadoop Ecosystem Challenges Ahead
  • 4. Data @ Uber : Impact 1. City OPS ○ Data Users Operating a massive transportation system 2. Analysts & Execs ○ Marketing Spend, Forecasting 3. Engineers & Data Scientists ○ Fraud Detection & Discovery 4. Critical Business Operations ○ Incentive Payments/Background Checks 5. Fending Off Uber’s Legal/Regulatory Challenges ○ “You have to produce this data in X hours”
  • 5. Data @ Uber : Circa 2014 Kafka7 Logs Schemaless Databases RDBMS Tables Vertica Applications - Incentive Payments - Machine Learning - Safety - Background Checks uploader Amazon S3 EMR Wall-e ETL Adhoc SQL - City Ops/DOPS - Data Scientists
  • 6. Pain Point #1: Data Reliability - Free Form python/node objects -> heavily nested JSON - Word-of-mouth Schema communication Lots of Engineers & Lots of services Lots of City OPS Producers Data Team Consumers $$$$ Data Pipeline
  • 7. Pain Point #2: System Scalability - Kafka7 : Heavy Topics/No HA - Wall-e : Celery workers unable to keep up with Kafka/Schemaless - Vertica Queries : More & More Raw Data piling on H1 2014 H2 2014 & beyond
  • 8. Pain Point #3: Fragile Ingestion Model - Multiple fetching from sources - Painful Backfills, since projections & transformation are in pipelines mezzanine trips_table1 trips_table2 trips_table3 Warehouse mezzanine trips_table1 trips_table2 trips_table3 Warehouse VS DataPool?
  • 9. Pain Point #4: No Multi-DC Support - No Unified view of data, More complexity from consumer - Wasteful use of WAN traffic DC1 DC2 Global Warehouse
  • 10. Hadoop Data Lake: Pain,Pain Go Away! - (Pain 1) Schematize All Data (old & new) - Heatpipe/Schema Service/Paricon - (Pain 2) All Infrastructure Shall Scale Horizontally - Kafka8 & Hadoop - Streamific/Sqoop (Deliver data to HDFS) - Lizzie(Feed Vertica)/Komondor(Feed Hive) - (Pain 3) Store raw data in nested glory in Hadoop - Json -> Avro records -> Parquet! - (Pain 4) Global View Of All Data - Unified tables! Yay!
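The "schematize all data" fix above can be sketched as a tiny ingest-time validator. This is a minimal sketch under stated assumptions: the real pipeline (Heatpipe/Schema Service) validates against registered Avro schemas before writing Avro records that later become Parquet; here a plain dict of required fields and types stands in for the schema, and all names are hypothetical.

```python
# Minimal sketch of schema enforcement on ingest (hypothetical names).
# A dict of required fields/types stands in for a registered Avro schema.
REQUIRED = {"trip_id": str, "city_id": int, "fare": float}

def validate(event: dict) -> dict:
    """Reject free-form JSON that doesn't match the registered schema,
    instead of letting it silently break downstream consumers."""
    for field, ftype in REQUIRED.items():
        if field not in event:
            raise ValueError(f"missing field: {field}")
        if not isinstance(event[field], ftype):
            raise TypeError(f"bad type for {field}")
    return event

ok = validate({"trip_id": "t1", "city_id": 5, "fare": 12.5})
```

The point is where the check happens: enforcing the schema at the producer edge replaces the "word-of-mouth schema communication" pain with a contract both sides can rely on.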
  • 11. Uber’s Schema Talk : Tomorrow, 2:40PM
  • 12. Hadoop Ecosystem: Overview Kafka8 Logs Schemaless Databases SOA Tables Vertica Adhoc SQL (Web UI) Lizzie ETL (Spark) Streamific Json, Avro Hive (parquet) Streamific Sqoop ETL (Modeled Tables) Janus Fraud (Hive) Machine Learning (Spark) Safety Apps (Spark) Backfill Pipelines (Spark) Hadoop ETL Modeled Tables (Hive) Back to - Hive - Kafka - NoSQL flat table modeled table Komondor (Spark)
  • 13. Hadoop Ecosystem: Data Ingestion Row Based (HBase/SequenceFiles) (Parquet) Columnar HDFS Komondor (Batch) Kafka Logs DB Redo Logs DC1 DC2 DC1 DC2 Streamific (Streaming,duh..)
  • 14. Hadoop Ecosystem: Streamific Long-running service - Backfills/Catch-up don’t hurt sources Low Latency delivery into row-oriented storage - HBase/HDFS Append** Deployed/Monitored the ‘uber’ way. - Can run on DCs without YARN etc Core (HA, Checkpointing, Ordering, Throttling, At-least-once guarantees) + Pluggable In/Out streams. Akka (Peak 900MB/sec), Helix (300K partitions) HBase HDFS Kafka Kafka Schemaless S3
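The checkpointing and at-least-once guarantees on this slide come from one ordering decision: deliver first, then advance the checkpoint. A minimal sketch, with an in-memory list standing in for a Kafka partition and all names hypothetical:

```python
# Sketch of checkpointed, at-least-once consumption (hypothetical,
# in-memory stand-in for a Kafka partition and a checkpoint store).
class CheckpointedConsumer:
    def __init__(self, log):
        self.log = log      # ordered records, stands in for a Kafka topic
        self.offset = 0     # last checkpointed position
        self.sink = []      # stands in for HBase/HDFS row storage

    def poll(self, batch_size):
        batch = self.log[self.offset:self.offset + batch_size]
        self.sink.extend(batch)     # deliver first ...
        self.offset += len(batch)   # ... checkpoint second: a crash between
                                    # the two steps redelivers the batch,
                                    # giving at-least-once, never data loss
        return batch

c = CheckpointedConsumer(["r1", "r2", "r3"])
c.poll(2)
c.poll(2)
```

Flipping the two steps (checkpoint before delivery) would turn a crash into silent data loss, which is why the ordering matters more than the bookkeeping.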
  • 15. Hadoop Ecosystem: Komondor The YARN/Spark Muscle - Parquet writing is expensive - 1->N mapping from raw to parquet/Hive table Control Data Quality - Schema Enforcement - Cleaning JSON - Hive Partitioning File Stitching - Keeps NN happy & queries performant Let’s “micro batch”? - HDFS iNotify stability issues Kafka logs DB Changelogs Full Snapshot - Trips (partitioned by request date) - User (partitioned by join date) - Kafka events (partitioned by event publish date) - Transaction history (partitioned by charge date) Snapshot tables Incremental tables Full dump New Files (HBase) (HDFS) (HDFS)
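The file-stitching step above ("keeps NN happy") is essentially bin-packing small files into fewer outputs near a target size, since the HDFS NameNode pays a fixed memory cost per file. A minimal sketch under stated assumptions: it works on hypothetical (name, size) pairs rather than real HDFS files, and uses a simple greedy packing rather than whatever Komondor actually does.

```python
# Sketch of file stitching: pack many small files into fewer outputs
# close to a target size. (name, size) pairs stand in for HDFS files.
def stitch(files, target_bytes):
    """Greedily group small files into stitched outputs <= target size
    (a single file larger than the target gets its own group)."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: -f[1]):  # largest first
        if current and current_size + size > target_bytes:
            groups.append(current)          # close the full group
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

groups = stitch([("a", 60), ("b", 50), ("c", 40), ("d", 30)], target_bytes=100)
```

Fewer, larger files also means fewer splits per query, which is the "queries performant" half of the slide's claim.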
  • 16. Hadoop Ecosystem: Consuming Data 1. Adhoc SQL a. Gateway service => Janus i. Keep bad queries out! ii. Choose YARN queues b. Hits HiveServer2/Tez 2. Data Apps a. Spark/SparkSQL via HiveContext b. Support for saving results to Hive/Kafka c. Monitoring/Debugging the ‘uber’ way 3. Lightweight Apps a. Python apps hitting gateway b. Fetch Small results via WebHDFS
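For the lightweight-apps path (3b), fetching small results via WebHDFS is just an HTTP GET against the public WebHDFS REST API. A minimal sketch that only builds the request URL; `op=OPEN` and `user.name` are real WebHDFS parameters, while the host, port, path, and user below are hypothetical.

```python
# Sketch of the lightweight-client path: fetch a small result file over
# the WebHDFS REST API. Only the URL is built here; host/path/user are
# hypothetical, op=OPEN and user.name are standard WebHDFS parameters.
from urllib.parse import urlencode

def webhdfs_open_url(host, path, user):
    """Build the WebHDFS OPEN (read) URL for a file path."""
    qs = urlencode({"op": "OPEN", "user.name": user})
    return f"http://{host}:50070/webhdfs/v1{path}?{qs}"

url = webhdfs_open_url("namenode.example.com",
                       "/results/query123/part-00000", "cityops")
```

A Python app would then issue a plain GET on that URL, which keeps small-result consumers off the heavyweight JDBC/Spark paths entirely.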
  • 17. Hadoop Ecosystem: Feeding Data-marts Vertica - SparkSQL/Oozie ETL framework to produce flattened tables - High Volume - Simple projections/row-level xforms - HiveQL to produce well-modelled tables - + Complex joins - Also lands tables into Hive Real-time Dashboarding - Batch layer for lambda architecture - Memsql/Riak as the real-time stores
  • 18. Hadoop Ecosystem: Some Numbers HDFS - (Total) 4 PB in 1 HDFS Cluster - (Growth) ~3 TB/day and growing YARN - 6300 VCores (total) - Number of daily jobs - 60K - Number of compute hours daily - 78K (3250 days)
  • 19. Hadoop Ecosystem: 2015 Wins 1. Hadoop is source-of-truth for analytics data a. 100% of All analytics 2. Powered Critical Business Operations a. Partner Incentive Payments 3. Unlocked Data a. Data in Hadoop >> Data in Vertica We (almost) caught up!
  • 20. Hadoop Ecosystem: 2016 Challenges 1. Interactive SQL at Scale a. Put the power of data in our City Ops’s hands 2. All-Active a. Keep data apps working during failovers 3. Fresher Data in Hadoop a. Trips in Hive lands in 6 hrs, but 1 hr in Vertica 4. Incremental Computation a. Most Jobs run daily off raw tables b. Intra hour jobs to build modeled tables
  • 21. Hadoop Ecosystem: 2016 Challenges 1. Interactive SQL at Scale a. Put the power of data in our Ops’s hands 2. All-Active a. Keep data apps working during failovers 3. Fresher Data in Hadoop a. Trips in Hive lands in 6 hrs, but 1 hr in Vertica 4. Incremental Computation a. Most Jobs run daily off raw tables b. Intra hour jobs to build modeled tables
  • 22. #1- Interactive SQL at Scale: Motivation Vertica Fast Can’t cheaply scale Powerful, scales reliably Slowww…. Hive
  • 23. #1- Interactive SQL at Scale: Presto Fast - (-er than SparkSQL, -errr than Hive-on-tez) Deployed at Scale - (FB/Netflix) Lack of UDF Interop - Hive ⇔ Spark UDF interop is great! Out-of-box Geo support - ESRI/Magellan Other Challenges: - Heavy joins in 100K+ existing queries - Vertica degrades more gracefully - Colocation With Hadoop - Network isolation
  • 24. #1- Interactive SQL at Scale: Spark Notebooks 1. Great for data scientists! - Iterative prototyping/exploration 2. Zeppelin/JupyterHub on HDFS - Run off mesos clusters 3. Of course, Spark Shell! - Pipeline troubleshooting
  • 25. #1- Interactive SQL at Scale: Plan 1. Get Presto up & running - Work off “modelled tables” out of Hive - Equivalent of Vertica usage today 2. Presto on Raw nested data - Raw data in Hive (will be) available at low latency - Uber’s scalable near real-time warehouse 3. Support Spark Notebook use cases - Similar QoS issues hitting HDFS from Mesos
  • 26. Hadoop Ecosystem: 2016 Challenges 1. Interactive SQL at Scale a. Put the power of data in our Ops’s hands 2. All-Active a. Keep data apps working during failovers 3. Fresher Data in Hadoop a. Trips in Hive lands in 6 hrs, but 1 hr in Vertica 4. Incremental Computation a. Most Jobs run daily off raw tables b. Intra hour jobs to build modeled tables
  • 27. #2- All-Active: Motivation Low availability, SPOF Data From All DCs replicated to single global data lake Data copied in-out, high SLA Assumes unbounded WAN links Less Operational overhead
  • 28. #2- All-Active: Plan** Same HA as online services (You failover, with data readily available) Maintain N Hadoop Lakes? Data is replicated to peer data centers and into global data lakes (solid lines).
  • 29. #2- All-Active: Challenges 1. Cross DC replicator design - File Level vs Record Level Replication 2. Policy Management - Which data is worth replicating - Which fields are PII? 3. Reducing Storage Footprint - 9 copies!! (2 Local Lakes + 1 Global Lake = 3 * 3 times from HDFS) - Federation across WAN? 4. Capacity Management for Failover - Degraded mode or hot standby?
  • 30. Hadoop Ecosystem: 2016 Challenges 1. Interactive SQL at Scale a. Put the power of data in our Ops’s hands 2. All-Active a. Keep data apps working during failovers 3. Fresher Data in Hadoop a. Trips in Hive lands in 6 hrs, but 1 hr in Vertica 4. Incremental Computation a. Most Jobs run daily off raw tables b. Intra hour jobs to build modeled tables
  • 31. #3- Fresher Data in Hadoop: Motivation 1. Uber’s business is inherently ‘real-time’ - Uber’s City Ops fresh data, to ‘debug’ Uber. 2. All the Data is in Hadoop Anyway - Reduce mindless data movement 3. Leverage Power Of Existing SQL Engines - Standard SQL Interface & Mature Join support
  • 32. Vertica #3- Fresher Data: Trips on Hadoop (Today) Schemaless (dc1) Schemaless (dc2) Hadoop Streamific trips (raw table) Rows (tripid => row) (new/updated trip rows) Vertica Cells (Streaming) Cells (Streaming) Streaming 10 mins, file uploads 1 hr (tunable) 6 hr, snapshot Incremental Update Snapshot: Inefficient & Slow a) Snapshot job => 100s of Mappers b) Reads TBs, Writes TBs c) But, just X GB actual data per day!!!! HiveHBase Changes To HBase 6 hrs ~1 hr trips (flat table) fact_trip (modeled table)
  • 33. #3- Fresher Data: Modelled Tables In Hadoop Schemaless (dc1) Schemaless (dc2) Hadoop Streamific trips (raw table) fact_trip (modelled table) (~7-8+hrs) Rows (tripid => row) Vertica Cells (Streaming) Cells (Streaming) Streaming 6 hr, snapshot Latency & Inefficiency worsen further a)Spark/Presto on modelled tables goes from 1- 2 hrs to 7-8 hrs!! b)Resource Usage shoots up HiveHBase (new/updated trip rows) Changes To HBase fact_trip (modelled table) Hive 1-2 hr, snapshot 10 mins, file uploads 7-8+ hr
  • 34. #3- Fresher Data: Let’s incrementally update? Schemaless (dc1) Schemaless (dc2) Hadoop Streamific trips (raw table)Rows (tripid => row) Cells (Streaming) Cells (Streaming) Streaming So Problem Solved, right? a) Same pattern as Vertica load b) Saves a bunch of resources c) And shrinks down latency. Hive HBase (new/updated trip rows) Changes To HBase trips (modelled table) Hive 30 mins, Incremental Update Incremental Update 30 mins 10 mins, file uploads < 1 hr ~1 hr
  • 35. #3- Fresher Data: HDFS/Hive Updates are tedious Hadoop So Problem Solved, right? Except HBase Changes To HBase Hive Cells (Streaming) Streamific 10 mins, file uploads (new/updated trip rows) Incremental Update 30 mins 30 mins, Incremental Update trips (modelled table) trips (raw table) Schemaless (dc1) Schemaless (dc2) Rows (tripid => row) Cells (Streaming) Streaming Hive Update!
  • 36. #3- Fresher Data: Trip Updates Problem Raw Trips Table in Hive New trips/Updated Trips 2010-2014 2016/01/02 2016/01/03New Data Unaffected Data Updated Data 2015/12/(01-31) Incremental update 2015/(01-05) 2015/(06-11) Last 1 hr Day level partitions
  • 37. #3- Fresher Data: HDFS/Hive Updates are tedious Hadoop So Problem Solved, right? Yes, except… HBase Changes To HBase Hive Cells (Streaming) Streamific 10 mins, file uploads (new/updated trip rows) Incremental Update 30 mins 30 mins, Incremental Update trips (modelled table) trips (raw table) Schemaless (dc1) Schemaless (dc2) Rows (tripid => row) Cells (Streaming) Streaming Hive Update!Good News: Solve this & everything becomes
  • 38. #3- Fresher Data: Solving Updates 1. Simple Folder/Partition Re-writing - Most commonly used approach in Hadoop land 2. File Level Updates - Similar to a, but at file level 3. Record Level Updates - Feels like a k-v store on parquet (and thus more complex) - Similar to Kudu/Hive transactions
  • 39. #3- Fresher Data: Plan ● Pick File Level Update approach ○ Establish all the machinery (Custom InputFormat, Spark/Presto Connectors) ○ Get latency down to 15mins - 1 hour ● Record Level Update approach, if needed ○ Study numbers from production ○ Switch will be transparent to consumers ● In Summary, ○ Unlocks interactive SQL on raw “nested” table at low latency
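The file-level update approach chosen above can be sketched in a few lines: locate the files whose records were touched, rewrite only those files whole, and leave the rest of the multi-year table untouched. This is a hypothetical in-memory stand-in (dicts for files and records), not the custom InputFormat machinery the slide refers to.

```python
# Sketch of file-level updates (hypothetical in-memory stand-in):
# only files containing updated keys are rewritten; untouched files,
# i.e. the vast majority of a multi-year trips table, are left alone.
def apply_updates(files, updates):
    """files: {file_id: {key: record}}; updates: {key: new_record}.
    Returns the list of file_ids that had to be rewritten."""
    rewritten = []
    for file_id, records in files.items():
        touched = set(records) & set(updates)
        if touched:
            merged = dict(records)                      # read the old file...
            merged.update({k: updates[k] for k in touched})
            files[file_id] = merged                     # ...write it back whole
            rewritten.append(file_id)
    return rewritten

files = {"f1": {"t1": "old", "t2": "old"}, "f2": {"t3": "old"}}
rewritten = apply_updates(files, {"t2": "new"})
```

This is the middle ground of slide 38: cheaper than rewriting whole partitions, far simpler than record-level (key-value-on-Parquet) updates, and the consumer-facing tables don't change if the record-level approach is swapped in later.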
  • 40. Hadoop Ecosystem: 2016 Challenges 1. Interactive SQL at Scale a. Put the power of data in our Ops’s hands 2. All-Active a. Keep data apps working during failovers 3. Fresher Data in Hadoop a. Trips in Hive lands in 6 hrs, but 1 hr in Vertica 4. Incremental Computation a. Most Jobs run daily off raw tables b. Intra hour jobs to build modeled tables
  • 41. #4- Incremental Computation: Recurring Jobs ● State of the art: Consume Complete/Immutable Partitions - Determine when partition/time window is complete - Trigger workflows waiting on that partition/time window ● As partition size shrinks, incoming data keeps landing in old time buckets - With 1-min/10-min partitions, keep getting new data for old time buckets ● Fundamental tradeoff - More Latency => More Completeness
  • 42. #4- Incremental Computation: Use Cases - Diff apps => Diff needs a. Hadoop has one mode “completeness” - Apps must be able to choose a. job.trigger.atCompleteness(90) b. job.schedule.atInterval(10 mins) - Closest thing out there: Google Cloud Dataflow [Chart: use cases plotted by completeness vs latency: Incentive Payments, Fraud Detection, Backfill Pipelines, Business Dashboards/ETL, Data Science, Safety App, with latencies ranging from < 1 hr to Days]
  • 43. #4- Incremental Computation: Tailing Tables ● Adding a new style of consuming data - Obtain new records loaded into table, since last run, across partitions - Consume in 15-30 min batches ● Favours Latency - Providing new data quickly - Consumer logic responsible for reconciliation with previous results ● Need a special marker to denote consumption point - commitTime: For each record, the time at which it was last updated
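The tailing pattern above is simple to state in code: each run pulls only records whose commitTime is newer than the last run's checkpoint, then advances the checkpoint. A minimal sketch with a hypothetical flat record layout; the real design additionally maps commitTimes to Hive partitions/files to avoid full scans (slide 44).

```python
# Sketch of tailing a table by commitTime (hypothetical record layout):
# each run consumes only records committed since the previous checkpoint.
def tail(table, since):
    """Return records with commitTime > since, and the new checkpoint."""
    fresh = [r for r in table if r["commitTime"] > since]
    new_since = max((r["commitTime"] for r in fresh), default=since)
    return fresh, new_since

table = [
    {"trip_id": "t1", "commitTime": 100},
    {"trip_id": "t2", "commitTime": 205},
    {"trip_id": "t1", "commitTime": 230},  # same trip, updated later
]
fresh, ckpt = tail(table, since=200)
```

Note the updated t1 record comes through again on the second pass: the batch favours latency, and reconciling re-delivered keys with previous results is deliberately left to the consumer, exactly as the slide says.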
  • 44. #4- Incremental Computation: Plan ● Add Metadata at Record-level to enable tailing ○ Extra book-keeping to map commitTime to Hive Partitions/Files ■ Avoid disastrous full scans ○ Can be combined with granular Hive partitions if needed ■ 15 min Hive Partitions => ~200K partitions for trip table ● Open Items: ○ Late arrival handling ■ Tracking when a time-window become complete ■ Design to (re)trigger workflows ○ Incrementally Recomputing aggregates
  • 45. Hadoop Ecosystem: 2016 Challenges 1. Interactive SQL at Scale a. Put the power of data in our Ops’s hands 2. All-Active a. Keep data apps working during failovers 3. Fresher Data in Hadoop a. Trips in Hive lands in 6 hrs, but 1 hr in Vertica 4. Incremental Computation a. Most Jobs run daily off raw tables b. Intra hour jobs to build modeled tables
  • 46. Summary Today - Living, breathing data ecosystem - Catch(-ing) up to the state-of-art Tomorrow - Push edges based on Uber’s needs - Near Real-time Warehouse - Incremental Compute - All-Active - Make Every Decision (Human/Machine) data driven