Scaling ML Feature Engineering with Apache Spark at Facebook
Cheng Su & Sameer Agarwal
Facebook Inc.
About Us
▪ Sameer Agarwal
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Committer (Spark Core/SQL)
▪ Previously at Databricks and UC Berkeley
▪ Cheng Su
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Contributor (Spark SQL)
▪ Previously worked on Hive & Hadoop at Facebook
Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
Machine Learning at Facebook¹
Data → Features → Training → Inference (figure labels: Model, Predictions)
¹ Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018
This Talk
1. Data Layouts (Tables and Physical Encodings)
2. Feature Reaping
3. Feature Injection
Data Layouts (Tables and Physical Encodings)
Training Data Table
- Table to store data for ML training
- Huge volume (multiple PBs/day)
- Schema:
  userId: BIGINT
  adId: BIGINT
  features: MAP<INT, DOUBLE>
  …
Feature Tables
- Tables to store all possible features (many of them aren’t promoted to the training data table)
- Smaller volume (low 100s of TBs/day)
- Schema:
  userId: BIGINT
  features: MAP<INT, DOUBLE>
  …
- Example features shown in the figure: gender, likes, age, state, country, …
Data Layouts (Tables and Physical Encodings)
1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
2. Feature Reaping: Removing unnecessary features (id and value) from training data. Think “deleting existing keys from a map”
(Figure: Feature Injection brings features such as gender, likes, age, state, country from the Feature Tables into the Training Data Table; Feature Reaping removes features from the Training Data Table.)
Background: Apache ORC
▪ Stripe (Row Group)
▪ Rows are divided into multiple groups
▪ Stream
▪ Columns are stored separately
▪ PRESENT, DATA, LENGTH stream for each column
▪ Different encoding and compression strategy for each column
How is a Feature Map Stored in ORC?
▪ Key and value are stored as separate streams/columns
  - Raw Data
    - Row 1: (k1, v1)
    - Row 2: (k1, v2), (k2, v3)
    - Row 3: (k1, v5), (k2, v4)
  - Streams
    - Key stream: k1, k1, k2, k1, k2
    - Value stream: v1, v2, v3, v5, v4
▪ Each stream is individually encoded and compressed
▪ Reading or deleting specific keys (i.e., feature reaping) becomes a problem
  - Need to read (decompress and decode) and re-write ALL keys and values
(Figure: column tree for features: MAP<INT, DOUBLE>. STRUCT (col -1, node: 0) → MAP (col 0, node: 1) → INT keys (col 0, node: 2) holding k1, k1, k2, k1, k2, and DOUBLE values (col 0, node: 3) holding v1, v2, v3, v5, v4.)
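The cost of reaping in this layout can be illustrated with a small sketch. It is a simplification, not the ORC writer: streams are plain Python lists, there is no compression, and the function names `encode_map_column` and `reap_keys` are made up for illustration. It reproduces the key and value streams from the slide and shows that deleting one key forces a pass over every entry of every stream.

```python
from typing import Dict, List, Set, Tuple

Streams = Tuple[List[int], List[str], List[str]]  # (lengths, keys, values)


def encode_map_column(rows: List[Dict[str, str]]) -> Streams:
    """Encode a map column as a LENGTH stream plus one key and one value stream."""
    lengths: List[int] = []
    keys: List[str] = []
    values: List[str] = []
    for row in rows:
        lengths.append(len(row))          # number of map entries in this row
        for k, v in row.items():
            keys.append(k)
            values.append(v)
    return lengths, keys, values


def reap_keys(streams: Streams, reaped: Set[str]) -> Streams:
    """Delete keys: every entry is decoded, checked, and re-written."""
    lengths, keys, values = streams
    new_lengths: List[int] = []
    new_keys: List[str] = []
    new_values: List[str] = []
    pos = 0
    for n in lengths:
        kept = 0
        for k, v in zip(keys[pos:pos + n], values[pos:pos + n]):
            if k not in reaped:
                new_keys.append(k)
                new_values.append(v)
                kept += 1
        new_lengths.append(kept)
        pos += n
    return new_lengths, new_keys, new_values


rows = [{"k1": "v1"}, {"k1": "v2", "k2": "v3"}, {"k1": "v5", "k2": "v4"}]
streams = encode_map_column(rows)
print(streams[1])  # key stream
print(streams[2])  # value stream
print(reap_keys(streams, {"k2"}))
```

Because keys from all rows are interleaved in one stream, there is no way to drop k2 without touching the entries for k1 as well.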
Introducing: ORC Flattened Map
▪ Values that correspond to each key are stored as separate streams
  - Raw Data
    - Row 1: (k1, v1)
    - Row 2: (k1, v2), (k2, v3)
    - Row 3: (k1, v5), (k2, v4)
  - Streams
    - k1 stream: v1, v2, v5
    - k2 stream: NULL, v3, v4
  - Stores map like a struct
▪ Each key’s value stream is individually encoded and compressed
▪ Reading or deleting specific keys becomes very efficient!
(Figure: column tree for features: MAP<INT, DOUBLE>. STRUCT (col -1, node: 0) → MAP (col 0, node: 1) → Value (k1) (col 0, node: 3, seq: 1) holding v1, v2, v5, and Value (k2) (col 0, node: 3, seq: 2) holding NULL, v3, v4.)
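A minimal sketch of the flattened layout, under the same simplifying assumptions (plain Python lists instead of encoded and compressed ORC streams; `encode_flattened` and `reap_flattened` are hypothetical names). Each key gets its own value stream, with None standing in for NULL where a row lacks the key, so reaping a key amounts to dropping one stream.

```python
from typing import Dict, List, Optional, Set

FlatStreams = Dict[str, List[Optional[str]]]  # key -> one value per row


def encode_flattened(rows: List[Dict[str, str]]) -> FlatStreams:
    """Store one value stream per key; missing keys become None (NULL)."""
    all_keys: List[str] = []
    for row in rows:
        for k in row:
            if k not in all_keys:
                all_keys.append(k)
    return {k: [row.get(k) for row in rows] for k in all_keys}


def reap_flattened(streams: FlatStreams, reaped: Set[str]) -> FlatStreams:
    """Drop whole streams; surviving streams are reused without decoding."""
    return {k: s for k, s in streams.items() if k not in reaped}


rows = [{"k1": "v1"}, {"k1": "v2", "k2": "v3"}, {"k1": "v5", "k2": "v4"}]
streams = encode_flattened(rows)
print(streams)

after = reap_flattened(streams, {"k2"})
print(after)
print(after["k1"] is streams["k1"])  # the k1 stream object is untouched
```

Reading a subset of features works the same way: only the streams for the requested keys need to be opened.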
Feature Reaping
▪ Feature Reaping frameworks generate Spark SQL queries based on table name, partitions, and reaped feature ids
▪ For each reaping SQL query, Spark has special customization in the query planner, execution engine and commit protocol
▪ Each Spark task launches a SQL transform process, and uses a native/C++ binary to do efficient flat map operations
(Figure: two Spark Java executors, each running a C++ reaper transform; input files training_data_v1_1.orc and training_data_v1_2.orc, output files training_data_v2_1.orc and training_data_v2_2.orc.)
Performance
CPU cost for flat map vs naïve solution*
▪ Case 1
  ▪ Input data size: 20PB
  ▪ # of reaped features: 200
  ▪ # total features: ~1k
  ▪ Flat map is 14x better than naïve
▪ Case 2
  ▪ Input data size: 300PB
  ▪ # of reaped features: 200
  ▪ # total features: ~10k
  ▪ Flat map is 89x better than naïve
(Figure: two bar charts of CPU (days), Naïve vs Flat Map; y-axis 0 to 50,000 for the 20PB case and 0 to 2,000,000 for the 300PB case.)
*Naïve solution: A Spark SQL query that re-writes all data, removing the required features from the map column with a UDF/lambda.
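To make the naïve baseline concrete, here is a hedged sketch of how a framework might generate such a query from a table name, a partition, and a list of reaped feature ids. The helper `naive_reaping_query`, the table names, and the partition column are assumptions for illustration and are not taken from the talk; the columns userId, adId, and features come from the schema slide, which elides the remaining columns. The lambda uses `map_filter`, a Spark SQL higher-order function available from Spark 3.0.

```python
from typing import Dict, Iterable


def naive_reaping_query(
    source_table: str,
    target_table: str,
    partition: Dict[str, str],
    reaped_ids: Iterable[int],
) -> str:
    """Build a Spark SQL string that re-writes a partition without the reaped keys."""
    ids = ", ".join(str(i) for i in sorted(set(reaped_ids)))
    if not ids:
        raise ValueError("at least one feature id must be reaped")
    spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
    where = " AND ".join(f"{col} = '{val}'" for col, val in partition.items())
    return (
        f"INSERT OVERWRITE TABLE {target_table} PARTITION ({spec})\n"
        f"SELECT userId, adId,\n"
        f"       map_filter(features, (k, v) -> k NOT IN ({ids})) AS features\n"
        f"FROM {source_table}\n"
        f"WHERE {where}"
    )


print(
    naive_reaping_query(
        "training_data_v1",          # hypothetical table names
        "training_data_v2",
        {"ds": "2020-10-28"},        # hypothetical partition column
        [7, 3],
    )
)
```

A query of this shape decodes and re-encodes every map entry in the partition, which is the work the flat map reaper avoids.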
Data Layouts (Tables and Physical Encodings)
Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
Requirements:
1. Allow fast ML training experimentation
2. Save storage space
Introducing: Aligned Tables!
(Figure: Feature Injection from the Feature Tables (gender, likes, age, state, country, …) into the Training Data Table.)
Introducing: Aligned Table
▪ Intuition: Store the output of the join between the training table and the feature table in 2 separate row-by-row aligned tables
▪ An aligned table is a table that has the same layout as the original table
  - Same number of files
  - Same file names
  - Same number of rows (and their order) in each file
Example:
training table
  file_1.orc          file_2.orc
  id  features        id  features
  1   ...             3   ...
  2   ...             4   ...
  5   ...             6   ...
feature table
  file_1.orc
  id  feature
  1   f1
  2   f2
  4   f4
  6   f6
aligned table
  file_1.orc          file_2.orc
  id  feature         id  feature
  1   f1              3   NULL
  2   f2              4   f4
  5   NULL            6   f6
Query Plan for Aligned Table
▪ Scan (training table)
▪ Scan (feature table)
▪ Project (…, file_name, row_order)
▪ Join (LEFT OUTER)
▪ Shuffle (file_name)
▪ Sort (file_name, row_order)
▪ InsertIntoHadoopFsRelationCommand (Aligned Table)
(Figure: the same training table, feature table, and aligned table as on the previous slide.)
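The plan operators can be walked through on the example data with a small in-memory sketch. This is not Spark code: files are dictionaries, the join is a dictionary lookup, and `build_aligned_table` is a hypothetical name. It also assumes the Project step tags rows of the training table scan with their file name and row order, which the slide suggests but does not state outright.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

TrainingFiles = Dict[str, List[int]]                    # file name -> row ids in file order
AlignedFiles = Dict[str, List[Tuple[int, Optional[str]]]]


def build_aligned_table(training: TrainingFiles, features: Dict[int, str]) -> AlignedFiles:
    # Project: tag each training row with (file_name, row_order).
    projected = [
        (file_name, row_order, row_id)
        for file_name, ids in training.items()
        for row_order, row_id in enumerate(ids)
    ]
    # Join (LEFT OUTER): training rows without a match get None (NULL).
    joined = [(f, order, rid, features.get(rid)) for f, order, rid in projected]
    # Shuffle (file_name): bring all rows of one input file together.
    by_file = defaultdict(list)
    for f, order, rid, feat in joined:
        by_file[f].append((order, rid, feat))
    # Sort (file_name, row_order), then write one output file per input file.
    return {
        f: [(rid, feat) for _, rid, feat in sorted(rows, key=lambda r: r[0])]
        for f, rows in sorted(by_file.items())
    }


training = {"file_1.orc": [1, 2, 5], "file_2.orc": [3, 4, 6]}
features = {1: "f1", 2: "f2", 4: "f4", 6: "f6"}
for name, rows in build_aligned_table(training, features).items():
    print(name, rows)
```

The left outer join is what keeps unmatched training rows (ids 5 and 3 here) in place, so every aligned file has exactly as many rows as its training file.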
Reading Aligned Tables
▪ FB-ORC aligned table row-by-row merge reader
▪ Read each aligned table file with the corresponding original table file in one task
▪ Read row-by-row according to row order
▪ Merge aligned table columns per row with corresponding original table columns per row
(Figure: reader task 1 reads file_1.orc of the training table (ids 1, 2, 5) together with file_1.orc of the aligned table (f1, f2, NULL); reader task 2 reads file_2.orc of the training table (ids 3, 4, 6) together with file_2.orc of the aligned table (NULL, f4, f6).)
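The merge reader can be sketched as a positional zip over the two files handled by one task. This is an illustration under assumptions, not the FB-ORC reader: rows are tuples, `merge_reader_task` is a hypothetical name, and the id check is only there to make a misaligned pair of files fail loudly.

```python
from typing import Dict, Iterator, List, Optional, Tuple

TrainingRow = Tuple[int, Dict[int, float]]   # (id, features map)
AlignedRow = Tuple[int, Optional[str]]       # (id, injected feature or None)


def merge_reader_task(
    training_rows: List[TrainingRow],
    aligned_rows: List[AlignedRow],
) -> Iterator[Tuple[int, Dict[int, float], Optional[str]]]:
    """Merge one training file with its aligned file, row by row, with no join."""
    if len(training_rows) != len(aligned_rows):
        raise ValueError("files are not aligned: row counts differ")
    for (row_id, feats), (aligned_id, feature) in zip(training_rows, aligned_rows):
        if row_id != aligned_id:
            raise ValueError(f"files are not aligned at id {row_id}")
        yield row_id, feats, feature


# Reader task 1 from the figure: file_1.orc of both tables.
training_file_1 = [(1, {10: 0.5}), (2, {10: 0.1}), (5, {11: 2.0})]  # feature maps are made up
aligned_file_1 = [(1, "f1"), (2, "f2"), (5, None)]
for merged in merge_reader_task(training_file_1, aligned_file_1):
    print(merged)
```

Since rows are matched by position, the read path needs no shuffle, hash table, or key lookup; the join cost was paid once when the aligned table was written.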
End to End Performance
1. Baseline 1: Left Outer Join
▪ LEFT OUTER join that materializes new columns/sub-fields into training table
▪ Cons: Reads and overwrites ALL columns of training table every time
2. Baseline 2: Lookup Hash Join
▪ Load feature table(s) into a distributed hash table (Laser¹)
▪ Lookup hash join while reading training table
▪ Cons:
  ▪ Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency
  ▪ Needs a lookup hash join each time the training table is read
Aligned Tables vs Left Outer Join
Compute Savings: 15x
Storage Savings: 30x
Aligned Tables vs Lookup Hash Join
Compute Savings: 1.5x
Storage Savings: 2.1x
¹ Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
Future Work
▪ Better Spark SQL interface for ML primitives (e.g., UPSERTs)
▪ Onboarding more ML use cases to Spark
▪ Batch Inference
▪ Training
MERGE training_table
PARTITION(ds='2020-10-28', pipeline='...', ts)
USING (
  SELECT ...) AS f
ON features[0][0] = f.key
WHEN MATCHED THEN UPDATE
  SET float_features = MAP_CONCAT(float_features, f.densefeatures)
Thank you!

Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Scaling Machine Learning Feature Engineering in Apache Spark at Facebook

  • 1. Scaling ML Feature Engineering with Apache Spark at Facebook Cheng Su & Sameer Agarwal Facebook Inc.
  • 2. About Us ▪ Sameer Agarwal ▪ Software Engineer at Facebook (Data Platform Team) ▪ Apache Spark Committer (Spark Core/SQL) ▪ Previously at Databricks and UC Berkeley ▪ Cheng Su ▪ Software Engineer at Facebook (Data Platform Team) ▪ Apache Spark Contributor (Spark SQL) ▪ Previously worked on Hive & Hadoop at Facebook
  • 3. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 4. Machine Learning at Facebook¹: pipeline diagram with the stages Data, Features, Training, and Inference, plus the labels Model and Predictions.
      ¹ Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018.
  • 5. Machine Learning at Facebook¹: the same pipeline diagram, annotated with "This Talk":
      1. Data Layouts (Tables and Physical Encodings)
      2. Feature Reaping
      3. Feature Injection
  • 6. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 7. Data Layouts (Tables and Physical Encodings)
      ▪ Training Data Table
        - Table to store data for ML training
        - Huge volume (multiple PBs/day)
        - Schema: userId: BIGINT, adId: BIGINT, features: MAP<INT, DOUBLE>, …
      ▪ Feature Tables
        - Tables to store all possible features (many of them aren’t promoted into the training data table)
        - Smaller volume (low 100s of TBs/day)
        - Schema: userId: BIGINT, features: MAP<INT, DOUBLE>, …
      [Diagram: example features gender, likes, age, state, country, …]
  • 8. Data Layouts (Tables and Physical Encodings)
      1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
      [Diagram: Feature Injection from the Feature Tables into the Training Data Table; example features gender, likes, age, state, country, …]
  • 9. Data Layouts (Tables and Physical Encodings)
      1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
      2. Feature Reaping: Removing unnecessary features (id and value) from training data. Think “deleting existing keys from a map”
      [Diagram: Feature Injection and Feature Reaping between the Training Data Table and the Feature Tables; example features gender, likes, age, state, country, …]
  • 10. Background: Apache ORC
      ▪ Stripe (Row Group)
        ▪ Rows are divided into multiple groups
      ▪ Stream
        ▪ Columns are stored separately
        ▪ PRESENT, DATA, and LENGTH streams for each column
        ▪ Different encoding and compression strategy for each column
  • 11. How is a Feature Map Stored in ORC?
      ▪ Key and value are stored as separate streams/columns
        - Raw Data
          - Row 1: (k1, v1)
          - Row 2: (k1, v2), (k2, v3)
          - Row 3: (k1, v5), (k2, v4)
        - Streams
          - Key stream: k1, k1, k2, k1, k2
          - Value stream: v1, v2, v3, v5, v4
      ▪ Each stream is individually encoded and compressed
      ▪ Reading or deleting specific keys (i.e., feature reaping) becomes a problem
        - Need to read (decompress and decode) and rewrite ALL keys and values
      [Diagram: features: MAP<INT, DOUBLE> as a column tree: STRUCT (col -1, node: 0); MAP (col 0, node: 1); INT keys (col 0, node: 2) holding k1, k1, k2, k1, k2; DOUBLE values (col 0, node: 3) holding v1, v2, v3, v5, v4]
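To make the cost on slide 11 concrete, here is a minimal Python sketch of the standard map layout, using the three rows from the slide. It is a toy model, not ORC code: the function names `encode_map_column` and `reap_keys` are hypothetical, string stand-ins replace the INT keys and DOUBLE values, and real ORC streams are also encoded and compressed. The point it illustrates is that dropping one key forces a pass over every entry of the key and value streams.

```python
from typing import Dict, List, Set, Tuple

FeatureMap = Dict[str, str]                        # toy stand-in for MAP<INT, DOUBLE>
Streams = Tuple[List[int], List[str], List[str]]   # (lengths, keys, values)


def encode_map_column(rows: List[FeatureMap]) -> Streams:
    """Flatten per-row maps into one length, one key, and one value stream."""
    lengths: List[int] = []
    keys: List[str] = []
    values: List[str] = []
    for row in rows:
        lengths.append(len(row))          # number of map entries in this row
        for key, value in row.items():
            keys.append(key)
            values.append(value)
    return lengths, keys, values


def reap_keys(streams: Streams, reaped: Set[str]) -> Streams:
    """Remove keys by walking and rewriting every entry of all three streams."""
    lengths, keys, values = streams
    new_lengths: List[int] = []
    new_keys: List[str] = []
    new_values: List[str] = []
    pos = 0
    for n in lengths:
        kept = 0
        for i in range(pos, pos + n):     # every entry is visited, reaped or not
            if keys[i] not in reaped:
                new_keys.append(keys[i])
                new_values.append(values[i])
                kept += 1
        new_lengths.append(kept)
        pos += n
    return new_lengths, new_keys, new_values


rows = [{"k1": "v1"}, {"k1": "v2", "k2": "v3"}, {"k1": "v5", "k2": "v4"}]
streams = encode_map_column(rows)
reaped_streams = reap_keys(streams, {"k2"})
```

In this layout the work done by `reap_keys` grows with the total number of map entries, not with the number of reaped features.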
  • 12. Introducing: ORC Flattened Map
      ▪ Values that correspond to each key are stored as separate streams
        - Raw Data
          - Row 1: (k1, v1)
          - Row 2: (k1, v2), (k2, v3)
          - Row 3: (k1, v5), (k2, v4)
        - Streams
          - k1 stream: v1, v2, v5
          - k2 stream: NULL, v3, v4
        - Stores map like a struct
      ▪ Each key’s value stream is individually encoded and compressed
      ▪ Reading or deleting specific keys becomes very efficient!
      [Diagram: features: MAP<INT, DOUBLE> as a column tree: STRUCT (col -1, node: 0); MAP (col 0, node: 1); Value (k1) (col 0, node: 3, seq: 1) holding v1, v2, v5; Value (k2) (col 0, node: 3, seq: 2) holding NULL, v3, v4]
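The flattened layout on slide 12 can be sketched in the same toy style. This is an illustration under stated assumptions, not the FB-ORC implementation: the names `encode_flattened`, `drop_streams`, and `decode_flattened` are hypothetical, `None` stands in for NULL, and encoding and compression are left out. Each key gets its own value stream, so reaping a key amounts to leaving its stream out.

```python
from typing import Dict, List, Optional, Set

FeatureMap = Dict[str, str]                        # toy stand-in for MAP<INT, DOUBLE>
FlatStreams = Dict[str, List[Optional[str]]]       # one value stream per key


def encode_flattened(rows: List[FeatureMap]) -> FlatStreams:
    """Store one value stream per key, with None where a row lacks the key."""
    key_order: List[str] = []
    for row in rows:
        for key in row:
            if key not in key_order:
                key_order.append(key)
    return {key: [row.get(key) for row in rows] for key in key_order}


def drop_streams(streams: FlatStreams, reaped: Set[str]) -> FlatStreams:
    """Reap keys by omitting their streams; the kept streams are not rewritten."""
    return {key: stream for key, stream in streams.items() if key not in reaped}


def decode_flattened(streams: FlatStreams, num_rows: int) -> List[FeatureMap]:
    """Rebuild the per-row maps from the per-key streams."""
    rows: List[FeatureMap] = [{} for _ in range(num_rows)]
    for key, stream in streams.items():
        for i, value in enumerate(stream):
            if value is not None:
                rows[i][key] = value
    return rows


rows = [{"k1": "v1"}, {"k1": "v2", "k2": "v3"}, {"k1": "v5", "k2": "v4"}]
flat = encode_flattened(rows)
reaped = drop_streams(flat, {"k2"})
```

Here `reaped["k1"]` is the very same list object as `flat["k1"]`, which mirrors the claim that streams of surviving keys do not need to be decoded or rewritten.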
  • 13. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 14. Feature Reaping
      ▪ Feature Reaping frameworks generate Spark SQL queries based on table name, partitions, and reaped feature ids
      ▪ For each reaping SQL query, Spark applies special customizations in the query planner, execution engine, and commit protocol
      ▪ Each Spark task launches a SQL transform process and uses a native C++ binary to do efficient flat map operations
      [Diagram: two Spark Java executors, each running a C++ reaper transform; files shown: training_data_v1_1.orc, training_data_v1_2.orc, training_data_v2_1.orc, training_data_v2_2.orc]
  • 15. Performance
      ▪ Case 1
        ▪ Input data size: 20PB
        ▪ # of reaped features: 200
        ▪ # total features: ~1k
        ▪ [Chart: CPU cost (CPU days) for flat map vs naïve solution*; flat map is 14x better on 20PB data]
      ▪ Case 2
        ▪ Input data size: 300PB
        ▪ # of reaped features: 200
        ▪ # total features: ~10k
        ▪ [Chart: CPU cost (CPU days) for flat map vs naïve solution*; flat map is 89x better on 300PB data]
      * Naïve solution: a Spark SQL query that rewrites all data, removing the reaped features from the map column with a UDF/lambda.
  • 16. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 17. Data Layouts (Tables and Physical Encodings)
      ▪ Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
      ▪ Requirements:
        1. Allow fast ML training experimentation
        2. Save storage space
      [Diagram: Feature Injection from the Feature Tables into the Training Data Table; example features gender, likes, age, state, country, …]
  • 18. Data Layouts (Tables and Physical Encodings)
      ▪ Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
      ▪ Requirements:
        1. Allow fast ML training experimentation
        2. Save storage space
      ▪ Introducing: Aligned Tables!
      [Diagram: Feature Injection from the Feature Tables into the Training Data Table; example features gender, likes, age, state, country, …]
  • 19. Introducing: Aligned Table
      ▪ Intuition: Store the output of the join between the training table and the feature table in 2 separate row-by-row aligned tables
      ▪ An aligned table is a table that has the same layout as the original table
        - Same number of files
        - Same file names
        - Same number of rows (and their order) in each file
      ▪ Example (from the slide diagram):
        - training table, file_1.orc (id, features): (1, ...), (2, ...), (5, ...)
        - training table, file_2.orc (id, features): (3, ...), (4, ...), (6, ...)
        - feature table, file_1.orc (id, feature): (1, f1), (2, f2), (4, f4), (6, f6)
        - aligned table, file_1.orc (id, feature): (1, f1), (2, f2), (5, NULL)
        - aligned table, file_2.orc (id, feature): (3, NULL), (4, f4), (6, f6)
  • 20. Query Plan for Aligned Table
      ▪ Plan operators, in the order shown on the slide:
        - Scan (training table)
        - Scan (feature table)
        - Project (…, file_name, row_order)
        - Join (LEFT OUTER)
        - Shuffle (file_name)
        - Sort (file_name, row_order)
        - InsertIntoHadoopFsRelationCommand (Aligned Table)
      [Diagram: the same training table, feature table, and aligned table example as on slide 19]
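The plan on slide 20 can be traced on the slide's own example data with a small pure-Python sketch. It is only an illustration of the operator sequence, not Spark code, and it rests on assumptions: the function name `build_aligned_table` is hypothetical, the Project step is assumed to tag training-table rows with `file_name` and `row_order`, and the join output is deliberately reversed to mimic a join that gives no ordering guarantee.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Slide example: training table files hold row ids; feature table maps id -> feature.
training_files: Dict[str, List[int]] = {
    "file_1.orc": [1, 2, 5],
    "file_2.orc": [3, 4, 6],
}
feature_table: Dict[int, str] = {1: "f1", 2: "f2", 4: "f4", 6: "f6"}


def build_aligned_table(
    training: Dict[str, List[int]], features: Dict[int, str]
) -> Dict[str, List[Tuple[int, Optional[str]]]]:
    # Scan + Project: tag every training row with its file name and row order.
    tagged = [
        (file_name, row_order, row_id)
        for file_name, ids in training.items()
        for row_order, row_id in enumerate(ids)
    ]
    # Join (LEFT OUTER): rows without a matching feature get None (NULL).
    joined = [(f, o, i, features.get(i)) for f, o, i in tagged]
    joined.reverse()  # simulate a join output with no ordering guarantee

    # Shuffle (file_name): bring all rows of one original file together.
    partitions: Dict[str, list] = defaultdict(list)
    for file_name, row_order, row_id, feature in joined:
        partitions[file_name].append((row_order, row_id, feature))

    # Sort (file_name, row_order), then write one aligned file per original file.
    aligned = {}
    for file_name in sorted(partitions):
        rows = sorted(partitions[file_name], key=lambda r: r[0])
        aligned[file_name] = [(row_id, feature) for _, row_id, feature in rows]
    return aligned


aligned_table = build_aligned_table(training_files, feature_table)
```

The result has the same file names and the same per-file row order as the training table, which is the alignment property the slide describes.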
  • 21. Reading Aligned Tables
      ▪ FB-ORC aligned table row-by-row merge reader
        ▪ Read each aligned table file with the corresponding original table file in one task
        ▪ Read row-by-row according to row order
        ▪ Merge aligned table columns per row with corresponding original table columns per row
      [Diagram: reader task 1 reads file_1.orc of the training table and of the aligned table; reader task 2 reads file_2.orc of both]
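A row-by-row merge read over one pair of files can be sketched as follows. This is a toy model rather than the FB-ORC reader: the name `merge_read` and the feature values in the training rows are hypothetical, and the id and row-count checks are sanity checks added for this sketch, not something the slides describe.

```python
from typing import Dict, Iterator, List, Optional, Tuple

TrainingRow = Tuple[int, Dict[int, float]]   # (id, features)
AlignedRow = Tuple[int, Optional[str]]       # (id, feature), None for NULL


def merge_read(
    training_rows: List[TrainingRow], aligned_rows: List[AlignedRow]
) -> Iterator[Tuple[int, Dict[int, float], Optional[str]]]:
    """Walk one training file and its aligned file in lockstep, merging columns."""
    if len(training_rows) != len(aligned_rows):
        raise ValueError("aligned file must have the same number of rows")
    for (train_id, features), (aligned_id, feature) in zip(training_rows, aligned_rows):
        if train_id != aligned_id:           # sketch-only sanity check
            raise ValueError("row order mismatch between files")
        yield train_id, features, feature


# One reader task handles one file pair, e.g. file_1.orc from slide 19.
training_file_1 = [(1, {10: 0.5}), (2, {10: 0.7}), (5, {11: 1.0})]  # hypothetical values
aligned_file_1 = [(1, "f1"), (2, "f2"), (5, None)]
merged = list(merge_read(training_file_1, aligned_file_1))
```

Because the two files share row order, no join or lookup by id is needed at read time; the reader only zips the two row sequences.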
  • 22. End to End Performance
      1. Baseline 1: Left Outer Join
        ▪ LEFT OUTER join that materializes new columns/sub-fields into the training table
        ▪ Cons: Reads and overwrites ALL columns of the training table every time
  • 23. End to End Performance
      1. Baseline 1: Left Outer Join
        ▪ LEFT OUTER join that materializes new columns/sub-fields into the training table
        ▪ Cons: Reads and overwrites ALL columns of the training table every time
      Aligned Tables vs Left Outer Join: Compute Savings 15x, Storage Savings 30x
  • 24. End to End Performance
      1. Baseline 1: Left Outer Join
        ▪ LEFT OUTER join that materializes new columns/sub-fields into the training table
        ▪ Cons: Reads and overwrites ALL columns of the training table every time
      2. Baseline 2: Lookup Hash Join
        ▪ Load feature table(s) into a distributed hash table (Laser¹)
        ▪ Lookup hash join while reading the training table
        ▪ Cons:
          ▪ Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency
          ▪ Needs a lookup hash join each time the training table is read
      Aligned Tables vs Left Outer Join: Compute Savings 15x, Storage Savings 30x
      ¹ Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
  • 25. End to End Performance
      1. Baseline 1: Left Outer Join
        ▪ LEFT OUTER join that materializes new columns/sub-fields into the training table
        ▪ Cons: Reads and overwrites ALL columns of the training table every time
      2. Baseline 2: Lookup Hash Join
        ▪ Load feature table(s) into a distributed hash table (Laser¹)
        ▪ Lookup hash join while reading the training table
        ▪ Cons:
          ▪ Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency
          ▪ Needs a lookup hash join each time the training table is read
      Aligned Tables vs Left Outer Join: Compute Savings 15x, Storage Savings 30x
      Aligned Tables vs Lookup Hash Join: Compute Savings 1.5x, Storage Savings 2.1x
      ¹ Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
  • 26. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 27. Future Work
      ▪ Better Spark SQL interface for ML primitives (e.g., UPSERTs)
      ▪ Onboarding more ML use cases to Spark
        ▪ Batch Inference
        ▪ Training
      Example statement shown on the slide:
        MERGE training_table PARTITION(ds='2020-10-28', pipeline='...', ts)
        USING ( SELECT ...) AS f
        ON features[0][0] = f.key
        WHEN MATCHED THEN UPDATE SET float_features = MAP_CONCAT(float_features, f.densefeatures)
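The intent of the MERGE statement on slide 27 (update the feature map of matched rows, leave other rows alone) can be modeled in a few lines of Python. This is a loose sketch of the upsert semantics only, not of Spark SQL behavior: the function name `merge_features` and the row field `key` are hypothetical, the match condition is simplified to an equality on that field, and letting injected values win on a key collision is an assumption of this sketch rather than a statement about `MAP_CONCAT`.

```python
from typing import Dict, List

Row = Dict[str, object]  # a row with a "key" and a "float_features" map


def merge_features(
    training_rows: List[Row], updates: Dict[int, Dict[int, float]]
) -> List[Row]:
    """For matched rows, union the injected features into float_features."""
    merged: List[Row] = []
    for row in training_rows:
        injected = updates.get(row["key"])
        if injected is None:
            merged.append(row)                       # no match: row is unchanged
        else:
            combined = {**row["float_features"], **injected}  # sketch: injected wins
            merged.append({**row, "float_features": combined})
    return merged


training_rows = [
    {"key": 1, "float_features": {10: 0.5}},
    {"key": 2, "float_features": {10: 0.7}},
]
updates = {1: {20: 3.0}}     # hypothetical dense features keyed by the join key
result = merge_features(training_rows, updates)
```

The first row gains feature 20 while the second is returned untouched, which is the "adding new keys to a map" behavior that feature injection needs.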
  • 28. Thank you! Your feedback is important to us. Don’t forget to rate and review the sessions.