Spark & Hadoop @ Uber
Who We Are
Early Engineers On Hadoop team @ Uber
Kelvin Chu, Reza Shiftehfar, Vinoth Chandar
Agenda
● Intro to Data @ Uber
● Trips Pipeline Into Warehouse
● Paricon
● INotify DStream
● Future
Uber’s Mission
“Transportation as reliable as running water,
everywhere, for everyone”
300+ Cities 60+ Countries
And growing...
Data @ Uber
● Impact of Data is Huge!
○ 2000+ Unique Users Operating a massive transportation system
● Running critical business operations
○ Payments, Fraud, Marketing Spend, Background Checks …
● Unique & Interesting Problems
○ Supply vs Demand - Growth
○ Geo-Temporal Analytics
● Latency Is King
○ Enormous business value in making data available asap
Data Architecture: Circa 2014
[Architecture diagram: Kafka logs, Schemaless databases, and RDBMS tables feed a Celery/Python ETL bulk uploader into Amazon S3; EMR processes the data into the OLAP warehouse, which serves applications and ad-hoc SQL.]
Challenges
● Scaling to high volume Kafka streams
○ eg: Event data coming from phones
● Merged Views of DB Changelogs across DCs
○ Some of the most important data - trips (duh!)
● Fragile ingestion model
○ Projections/Transformation in pipelines
○ Data Lake philosophy - raw data on HDFS, transform later using Spark
● Free-form JSON data → Data Breakages
● First order of business - Reliable Data
New World Order: Hadoop & Spark
[Architecture diagram: Kafka logs, Schemaless databases, and RDBMS tables land as raw data on HDFS/Amazon S3 via data delivery services; Spark and Spark Streaming jobs (Paricon, Spark SQL, Spark/Hive, Spark jobs on Oozie) cook the raw data into the OLAP warehouse, serving applications, ad-hoc SQL, and machine learning.]
Trips Pipeline : Problem
● Most Valuable Dataset in Uber (100% Accuracy)
● Trips stored in Uber’s ‘schemaless’ datastores (sharded MySQL), across DCs, cross-replicated
● Need a consolidated view across DCs, quickly (~1-2 hr end-to-end)
[Diagram: trip stores in DC1 and DC2 each take local writes and stay in sync via multi-master XDC replication.]
Trips Pipeline : Architecture
Trips Pipeline : ETL via SparkSQL
● Decouples raw ingestion from the relational Warehouse table model
○ Ability to provision multiple tables off the same data set
● Picks the latest changelog entries in the files
○ Applies them in order
● Applies projections & row-level transformations
○ Produces ingestible data for the Warehouse
● Uses HiveContext to gain access to UDFs
○ explode() etc. to flatten JSON arrays
● Scheduled Spark job via Oozie (see the sketch below)
○ Runs every hour (tunable)
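A minimal sketch (not Uber's actual job) of the shape of this hourly ETL: a HiveContext reads raw changelog JSON, keeps only the latest entry per trip, and uses the Hive explode() UDF to flatten a JSON array. All paths and field names here are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TripsEtl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("trips-etl"))
    val hc = new HiveContext(sc) // HiveContext unlocks Hive UDFs such as explode()

    // Hypothetical raw changelog location for one hour of data
    hc.jsonFile("hdfs:///raw/trips/changelog/2015/09/01/00")
      .registerTempTable("trip_changelog")

    // Keep only the latest changelog entry per trip
    hc.sql("""
      SELECT t.trip_uuid, t.payload, t.version
      FROM trip_changelog t
      JOIN (SELECT trip_uuid, MAX(version) AS v
            FROM trip_changelog
            GROUP BY trip_uuid) latest
        ON t.trip_uuid = latest.trip_uuid AND t.version = latest.v
    """).registerTempTable("trip_latest")

    // Flatten a hypothetical JSON array with explode(), then write to the warehouse
    hc.sql("""
      SELECT trip_uuid, item
      FROM trip_latest
      LATERAL VIEW explode(payload.fare_items) f AS item
    """).saveAsParquetFile("hdfs:///warehouse/trips")
  }
}
```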
Paricon : PARquet Inference and CONversion
● Running in production since February 2015
○ first Spark application at Uber
Motivation 1: Data Breakage & Evolution
[Diagram: upstream data producers write JSON to S3; downstream consumers read it; the data evolves over time ... and one day something breaks.]
Motivation 1: Why Schema
● Contract
○ multiple teams
○ producers
○ consumers
● Avoid data breakage
○ because we have schema evolution systems
● Data to persist in a typed manner
○ analytics
● Serve as documentation
○ understand data faster
● Unit testable
Paricon : Workflow
[Workflow diagram: Transfer → Infer → Convert → Validate. Input is JSON/Gzip on S3; inferred Avro schemas are reviewed and consumed through in-house schema repository and management systems; output is Parquet on HDFS.]
Motivation 2: Why Parquet
● Supports schema
● 2 to 4 times FASTER than JSON/Gzip (see the sketch below)
○ column pruning
■ wide tables at Uber
○ filter predicate push-down
○ compression
● Strong Spark support
○ SparkSQL
○ schema evolution
■ schema merging in Spark v1.3
■ merge old and new compatible schema versions
■ no “Alter table ...”
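A minimal sketch of why these properties pay off in SparkSQL: selecting a few columns of a wide Parquet table reads only those column chunks, and the filter can be pushed down into the scan. Paths and column names are assumptions.

```scala
import org.apache.spark.sql.SQLContext

object ParquetRead {
  def fareStats(sqlContext: SQLContext): Unit = {
    val trips = sqlContext.parquetFile("hdfs:///warehouse/trips") // hypothetical path
    trips
      .select("city_id", "fare")     // column pruning: other columns are never read
      .filter(trips("fare") > 100.0) // filter predicate pushed down to the Parquet scan
      .groupBy("city_id")
      .count()
      .show()
  }
}
```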
Paricon : Transfer
● distcp on Spark (sketched after this list)
○ only subset of command-line options currently
● Approach
○ compute the files list and assign them to RDD partitions
○ avoid stragglers by randomly grouping different dates
● Extras
○ Uber specific logic
■ filename conventions
■ backup policies
○ internal Spark ecosystem
○ faster homegrown delta computation
○ get around s3a problem in Hadoop 2.6
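A minimal sketch of the "distcp on Spark" idea, under stated assumptions: the driver lists source files up front, shuffles them so one heavy date cannot dominate a partition, and each task copies its share with the Hadoop FileSystem API. Error handling, filename conventions, and delta computation are omitted.

```scala
import scala.util.Random
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}
import org.apache.spark.SparkContext

object SparkDistcp {
  def run(sc: SparkContext, files: Seq[String],
          destDir: String, parallelism: Int): Unit = {
    val shuffled = Random.shuffle(files) // randomize dates to avoid stragglers
    sc.parallelize(shuffled, parallelism).foreachPartition { part =>
      val conf = new Configuration()
      part.foreach { src =>
        val srcPath = new Path(src)
        val dstPath = new Path(destDir, srcPath.getName)
        FileUtil.copy(srcPath.getFileSystem(conf), srcPath,
                      dstPath.getFileSystem(conf), dstPath,
                      false /* deleteSource */, conf)
      }
    }
  }
}
```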
Paricon : Infer
● Infer by JsonRDD
○ but not directly
● Challenge: Data is dirty
○ garbage in garbage out
● Two-pass approach (sketched below)
○ first: data cleaning
○ second: JsonRDD inference
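A minimal sketch of the two passes, assuming cleaning rules can be modeled as String => Option[String] functions: pass one cleans the raw JSON lines, pass two lets SparkSQL's jsonRDD infer a schema from the cleaned records.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType

object TwoPassInfer {
  def inferCleanSchema(sqlContext: SQLContext,
                       raw: RDD[String],
                       rules: Seq[String => Option[String]]): StructType = {
    // Pass 1: apply every cleaning rule; drop records any rule rejects
    val cleaned = raw.flatMap { line =>
      rules.foldLeft(Option(line))((acc, rule) => acc.flatMap(rule))
    }
    // Pass 2: jsonRDD samples the cleaned records and infers a schema
    sqlContext.jsonRDD(cleaned).schema
  }
}
```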
Paricon : Infer
● Data cleaning
○ structured as a rules-based engine (see the sketch below)
○ each rule is an expectation
○ all rules are heuristics
■ based on business domain knowledge
○ the rules are pluggable based on topics
● Struct@JsonRDD vs Avro:
○ illegal characters in field names
○ repeating group names
○ more
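A hedged sketch of what such a rules engine could look like: each rule encodes one expectation, and rule sets are plugged in per topic. All names here are illustrative, not Paricon's actual interfaces.

```scala
// One rule = one expectation; None means the record fails it
trait CleaningRule {
  def name: String
  def apply(record: String): Option[String]
}

// Example heuristic: keep only records that look like JSON objects
object MustBeJsonObject extends CleaningRule {
  val name = "must-be-json-object"
  def apply(record: String): Option[String] = {
    val t = record.trim
    if (t.startsWith("{") && t.endsWith("}")) Some(t) else None
  }
}

object TopicRules {
  // Pluggable per topic, based on business domain knowledge
  val rulesByTopic: Map[String, Seq[CleaningRule]] = Map(
    "trips"    -> Seq(MustBeJsonObject /*, domain-specific rules */),
    "payments" -> Seq(MustBeJsonObject)
  )
}
```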
Paricon : Convert
● Incremental conversion (sketched below)
○ assign days to RDD partitions
○ computation and checkpoint unit: day
○ new job or after failure: work on those partial days only
● Largest amount of code among the four tasks
○ multiple source formats (encoded vs non-encoded)
○ data cleaning based on inferred schema
○ home grown JSON decoder for Avro
○ file stitching
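A minimal sketch of the incremental scheme, with checkpoint storage left abstract: days map to RDD partitions, each completed day is checkpointed, and a new run (or a restart after failure) only touches days that lack a checkpoint marker.

```scala
import org.apache.spark.SparkContext

object IncrementalConvert {
  def run(sc: SparkContext,
          days: Seq[String],          // e.g. "2015-09-01"
          isDone: String => Boolean,  // checkpoint lookup (assumed helper)
          markDone: String => Unit,   // checkpoint write (assumed helper)
          convertDay: String => Unit  // JSON -> Parquet for one day
         ): Unit = {
    val pending = days.filterNot(isDone) // resume: skip already-converted days
    if (pending.nonEmpty) {
      sc.parallelize(pending, pending.size).foreach { day =>
        convertDay(day) // computation unit: one day
        markDone(day)   // checkpoint unit: one day
      }
    }
  }
}
```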
Stitching : Motivation
[Chart: number of files vs. file size, with the HDFS block size marked.]
● Inefficient for HDFS
● Many large files
○ break them up
● But a lot more small files
○ stitch them
Stitching : Goal
[Diagram: each Parquet file holds a single Parquet block and fits just under one HDFS block.]
● One Parquet block per file
● Parquet file slightly smaller than the HDFS block
Stitching : Algorithms
● Algo1: Estimate a constant before conversion
○ pros: easy to do
○ cons: does not work well with temporal variation
● Algo2: Estimate during conversion, per RDD partition (sketched below)
○ each day has its own estimate
○ may even self-tune during the day
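A hedged sketch of the Algo2 idea: inside a day's partition, keep a running estimate of bytes per record and derive how many records fit in a file that stays just under the HDFS block size. The headroom value is an assumption.

```scala
class StitchEstimator(hdfsBlockSize: Long, headroom: Long = 4L * 1024 * 1024) {
  private var records = 0L
  private var bytes = 0L
  private val target = hdfsBlockSize - headroom // stay slightly under one block

  // Update the running estimate as records are written
  def observe(recordBytes: Long): Unit = { records += 1; bytes += recordBytes }

  // Self-tuning estimate of how many records to put in one stitched file
  def recordsPerFile: Long =
    if (bytes == 0) 1L
    else math.max(1L, target * records / bytes)
}
```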
Stitching : Experiments
● Cost model (shown as a formula on the slide):
○ N: number of Parquet files
○ Si: size of the i-th Parquet file
○ B: HDFS block size
○ First part: local I/O - files slightly smaller than the HDFS block
○ Second part: network I/O - penalty of files spilling over a block
● Benchmark queries
Paricon : Validate
● Modeled as a “source and converted tables join” (sketched below)
○ equi-join on primary key
○ compare the counts
○ compare the columns content
● SparkSQL
○ easy to implement
○ hard to tune for performance
● Debugging tools
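A minimal SparkSQL sketch of the validation join, with assumed table and column names: compare total counts, then count joined rows whose compared columns differ.

```scala
import org.apache.spark.sql.SQLContext

object Validate {
  def run(sqlContext: SQLContext): Unit = {
    // Compare the counts
    val srcCount = sqlContext.table("source_table").count()
    val dstCount = sqlContext.table("converted_table").count()

    // Compare the column contents via an equi-join on the primary key
    val mismatches = sqlContext.sql("""
      SELECT COUNT(*) FROM source_table s
      JOIN converted_table c ON s.primary_key = c.primary_key
      WHERE s.col_a <> c.col_a OR s.col_b <> c.col_b
    """).collect().head.getLong(0)

    println(s"source=$srcCount converted=$dstCount mismatching rows=$mismatches")
  }
}
```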
Some Production Numbers
● Inferred: >120 topics
● Converted: >40 topics
● Largest single job so far
○ process 15TB compressed (140TB uncompressed) data
○ one single topic
○ recover from multiple failures by checkpoints
● Numbers are increasing ...
Lessons
● Implement custom finer checkpointing
○ S3 data network fee
○ jobs/tasks failure -> download all data repeatedly
○ to save money and time
● There is no perfect data cleaning
○ 100% clean is not needed often
● Schema parsing implementation
○ tricky and takes much time for testing
Komondor: Problem Statement
● Current Kafka->HDFS ingestion service does too much
work:
○ Consume from Kafka -> Write Sequence Files -> Convert to Parquet ->
Upload to HDFS in a Hive-compatible way
○ Parquet generation needs a lot of memory
○ Local writing and uploading is slow
● Need to decouple raw ingestion from consumable data
○ Move heavy lifting into Spark -> Keep raw-data delivery service lean
● Streaming job to keep converting raw data into Parquet as it lands!
Komondor: Kafka Ingestion Service
[Diagram: Komondor streams raw data from Kafka into HDFS (streaming raw data delivery); a streaming ingestion job cooks the raw data into consumable data, followed by batch verification & file stitching.]
Komondor: Goals
● Fast raw data into permanent storage
● Spark Streaming Ingestor to ‘cook’ raw data
○ For now, Parquet generation
○ But opens up polyglot world for ORC, RCFile,....
● De-duplication of raw data before consumption
○ Shields downstream consumers from at-least-once delivery of pipelines
○ Simply replay events for an entire day, in the event of pipeline outages
● Improved wellness of HDFS
○ Avoiding too many small files in HDFS
○ File stitcher job to combine small files from past days
INotify DStream: Komondor De-Duplication
INotify DStream: Motivation
● Streaming Job to pick up raw data files
○ Keeps end-to-end latency low vs batch job
● Spark Streaming FileDStream not sufficient
○ Only works one directory deep
■ We have at least two levels: <topic>/<dc>/
○ Provides the file contents directly
■ Loses valuable information in file name. eg: partition num
○ Checkpoint contains an entire file list
■ Will not scale to millions of files
○ Too much overhead to run one Job Per Topic
INotify DStream: HDFS INotify
● Similar to Linux iNotify to watch file system changes
● Exposes the HDFS Edit Log as an event stream
○ CREATE, CLOSE, APPEND, RENAME, METADATA, UNLINK events
○ Introduced in Hadoop Summit 2015
● Provides a transaction id
○ Clients can use it to resume from a given position (see the sketch below)
● Event Log Purged every time the FSImage is uploaded
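A minimal sketch of tailing HDFS inotify, assuming Hadoop 2.7+, where HdfsAdmin exposes the edit log as a resumable event stream and take() returns an EventBatch carrying the transaction id.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.client.HdfsAdmin
import org.apache.hadoop.hdfs.inotify.Event

object TailHdfsEvents {
  def run(nameNodeUri: String, lastReadTxid: Long): Unit = {
    val admin  = new HdfsAdmin(URI.create(nameNodeUri), new Configuration())
    val stream = admin.getInotifyEventStream(lastReadTxid) // resume from a saved txid
    while (true) {
      val batch = stream.take() // blocks until the NameNode has new events
      batch.getEvents.foreach {
        case e: Event.CreateEvent => println(s"txid=${batch.getTxid} CREATE ${e.getPath}")
        case e: Event.CloseEvent  => println(s"txid=${batch.getTxid} CLOSE  ${e.getPath}")
        case _                    => () // APPEND, RENAME, METADATA, UNLINK ...
      }
    }
  }
}
```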
INotify DStream: Implementation
● Provides the HDFS INotify events as a Spark DStream
○ Implementation very similar to KafkaDirectDStream
● Checkpointing is straightforward:
○ Transactions have unique IDs
○ Just save the transaction ID to permanent storage (shape sketched below)
● Filed SPARK-10555, vote up if you think it is useful :)
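A hedged sketch of the overall shape (not the SPARK-10555 code): a receiver-less InputDStream whose compute() turns the events that arrived since the last remembered transaction id into an RDD, so recovery only needs that one id, much like KafkaDirectDStream resumes from offsets.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// pollBatch: given the last seen txid, return (new file paths, new highest txid)
class InotifyLikeDStream(ssc: StreamingContext,
                         pollBatch: Long => (Seq[String], Long))
  extends InputDStream[String](ssc) {

  @volatile private var lastTxid = 0L // in practice restored from durable storage

  override def start(): Unit = ()
  override def stop(): Unit = ()

  override def compute(validTime: Time): Option[RDD[String]] = {
    val (paths, newTxid) = pollBatch(lastTxid)
    lastTxid = newTxid // checkpointing = persisting this single id
    Some(ssc.sparkContext.parallelize(paths))
  }
}
```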
INotify DStream: Early Results
● Pretty stable when running on YARN
● HDFS iNotify reads ALL events from NameNode
● Have to add filtering
○ to catch only events of interest (paths/extensions)
○ Performed at Spark level
● Memory usage increases on NN when iNotify is running
INotify DStream: Future Uses, XDC Replication
● Open possibility, provided INotify is a charm in production
● Uber is thinking about an all active-active data architecture
○ This means n HDFS clusters that need to be in-sync
● Typical batch-based distcp creates bursty network
utilization
○ Or go through scheduling trouble to smooth it out
○ INotify DStream provides way to keep shipping files as they land
○ Power of Spark to do any heavy lifting such as filtering sensitive data
Future/Ongoing Work
Our engines are revved up
Forecast: Sunny & Awesome with lots of Spark!
Future/Ongoing Work
● Spark SQL Based ETL-Platform
○ Powers all tables into warehouse
● Open up SQL-On-Hadoop via Spark SQL/Hive
○ Spark Shell is already so nifty!
● Machine Learning Platform using Spark
○ MLLib /GraphX Possibilities
● Standardized Support For Spark jobs
● Apollo: Next Gen Real-time analytics using Spark
Streaming
○ Watch for our next talk! ;)
We Are Hiring!!! :)
Thank You
(Special kudos to Uber Facilities & Security)
Questions?
Extra Slides
Trips Pipeline : Consolidating Changelogs
● Data model, very similar to BigTable/HBase (sketched below)
○ row_key : uuid for trip
○ col_key : One column in trip record
○ version & body : version of the column & json blob
○ cell : Unique tuple of {row_key, col_key, version}
● Provides REST endpoint to tail cell change log for
every shard
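A hedged sketch of this cell data model and the reconciliation it enables; the field types are assumptions.

```scala
object CellModel {
  case class CellKey(rowKey: String, // uuid for the trip
                     colKey: String, // one column in the trip record
                     version: Long)  // version of that column

  case class Cell(key: CellKey, body: String /* JSON blob */)

  // Reconciliation: keep only the latest version per (row, column)
  def latestCells(cells: Seq[Cell]): Map[(String, String), Cell] =
    cells.groupBy(c => (c.key.rowKey, c.key.colKey))
         .mapValues(_.maxBy(_.key.version))
         .toMap
}
```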
Trips Pipeline : Challenge
● Existing ingestion turned cell changes into Warehouse upserts
○ Losing the version information
○ Unable to reject older (& duplicate) cell changes in logs coming from XDC replication
{ trip-xxx :
  { “FARE” => { f1: {body: 12.35, version: 11},
                f2: {body: val2, version: 10} },
    “ETA”  => { f3: {body: val3, version: 13},
                f4: {body: val4, version: 10} } } }

trip-uuid | FARE_f1 | ETA_f3 | ETA_f4
trip-xxx  | 12.35   | 4      | 5
trip-xyz  | 14.50   | 2      | 1
Spark At Uber
● Today
○ Paricon: Turn Historical Json Into Parquet Gold Mine
○ Streamio/Spark SQL : Deliver Global View of Trip Database
into Warehouse in near real-time
● Tomorrow
○ INotify DStream :
■ Komondor - The ‘Uber’ data ingestor
■ XDC Data Replicator
○ Adhoc SQL Access to data: Hive On Spark/Spark SQL
○ Spark Apps: Directly accessing data on HDFS
Trips Pipeline : Raw Row Images in HDFS
● Streamio : Generic connector of partitioned streams
○ Pluggable in & out stream implementations
● Tails cell changes from both DCs into a Kafka topic
● Uses HBase to construct full row image (latest value
for each column for a trip)
○ Logs ‘row changelog’ to HDFS
● Preserves version of latest cell for each column/row
○ Can efficiently de-duplicate/reconcile.
● Extensible to all Schemaless datastores