Machine learning feature engineering is one of the most critical workloads on Spark at Facebook and serves as a means of improving the quality of each of the prediction models we have in production. Over the last year, we've added several features to Spark core/SQL to provide first-class support for Feature Injection and Feature Reaping. Feature Injection is an important prerequisite to (offline) ML training, where the base features are injected/aligned with new/experimental features, with the goal of improving model performance over time. From a query engine's perspective, this can be thought of as a LEFT OUTER join between the base training table and the feature table which, if implemented naively, could get extremely expensive. As part of this work, we added native support for writing indexed/aligned tables in Spark: if the data in the base table and the injected feature can be aligned during writes, the join itself can be performed inexpensively.
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
1. Scaling ML Feature Engineering with Apache Spark at Facebook
Cheng Su & Sameer Agarwal
Facebook Inc.
2. About Us
▪ Sameer Agarwal
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Committer (Spark Core/SQL)
▪ Previously at Databricks and UC Berkeley
▪ Cheng Su
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Contributor (Spark SQL)
▪ Previously worked on Hive & Hadoop at Facebook
3. Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
4. Machine Learning at Facebook1
Data → Features → Training → Model → Inference → Predictions
1Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018
5. Machine Learning at Facebook1
Data → Features → Training → Model → Inference → Predictions
This Talk
1. Data Layouts (Tables and Physical Encodings)
2. Feature Reaping
3. Feature Injection
6. Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
7. Data Layouts (Tables and Physical Encodings)
Training Data Table
- Table to store data for ML training
- Huge volume (multiple PBs/day)
userId: BIGINT
adId: BIGINT
features: MAP<INT, DOUBLE>
…
Feature Tables
- Tables to store all possible features (many of them aren't promoted to the training data table)
- Smaller volume (low 100s of TBs/day)
userId: BIGINT
features: MAP<INT, DOUBLE>
…
(Example feature keys: gender, age, state, country, likes, …)
8. Data Layouts (Tables and Physical Encodings)
1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
9. Data Layouts (Tables and Physical Encodings)
1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
2. Feature Reaping: Removing unnecessary features (id and value) from training data. Think “deleting existing keys from a map”
10. Background: Apache ORC
▪ Stripe (Row Group)
▪ Rows are divided into multiple groups
▪ Stream
▪ Columns are stored separately
▪ PRESENT, DATA, LENGTH streams for each column
▪ Different encoding and compression strategy for each column
11. How is a Feature Map Stored in ORC?
▪ Key and value are stored as separate streams/columns
- Raw Data
- Row 1: (k1, v1)
- Row 2: (k1, v2), (k2, v3)
- Row 3: (k1, v5), (k2, v4)
- Streams
- Key stream: k1, k1, k2, k1, k2
- Value stream: v1, v2, v3, v5, v4
▪ Each stream is individually encoded and compressed
▪ Reading or deleting specific keys (i.e., feature reaping) becomes a problem
- Need to read (decompress and decode) and re-write ALL keys and values
[Diagram: schema tree for features: MAP<INT, DOUBLE>: STRUCT (col -1, node 0) → MAP (col 0, node 1) with key INT (col 0, node 2) and value DOUBLE (col 0, node 3); key stream: k1, k1, k2, k1, k2; value stream: v1, v2, v3, v5, v4]
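To make the layout concrete, here is a minimal pure-Python sketch (function and variable names are invented; this is not the ORC writer) of how a regular map column is serialized: keys and values from all rows are concatenated into one key stream and one value stream, plus a per-row length, so no single key can be read or dropped in isolation.

```python
def encode_map_column(rows):
    """Flatten a list of per-row maps into key/value/length streams,
    mimicking how ORC stores a regular MAP column."""
    key_stream, value_stream, length_stream = [], [], []
    for row in rows:
        length_stream.append(len(row))  # how many entries this row has
        for k, v in row.items():
            key_stream.append(k)
            value_stream.append(v)
    return key_stream, value_stream, length_stream

# The three-row example from the slide above
rows = [
    {"k1": "v1"},
    {"k1": "v2", "k2": "v3"},
    {"k1": "v5", "k2": "v4"},
]
keys, values, lengths = encode_map_column(rows)
# keys    -> ["k1", "k1", "k2", "k1", "k2"]
# values  -> ["v1", "v2", "v3", "v5", "v4"]
# lengths -> [1, 2, 2]
```

Because every key and value lands in the same pair of streams, touching any one key requires decompressing and decoding all of them.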
12. Introducing: ORC Flattened Map
▪ Values that correspond to each key are stored as separate streams
- Raw Data
- Row 1: (k1, v1)
- Row 2: (k1, v2), (k2, v3)
- Row 3: (k1, v5), (k2, v4)
- Streams
- k1 stream: v1, v2, v5
- k2 stream: NULL, v3, v4
- Stores map like a struct
▪ Each key’s value stream is individually encoded and compressed
▪ Reading or deleting specific keys becomes very efficient!
[Diagram: schema tree for features: MAP<INT, DOUBLE>: STRUCT (col -1, node 0) → MAP (col 0, node 1) with one value stream per key: k1 (col 0, node 3, seq 1): v1, v2, v5; k2 (col 0, node 3, seq 2): NULL, v3, v4]
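A corresponding pure-Python sketch of the flattened-map layout (again illustrative, with invented names): each distinct key gets its own value stream, with None standing in for NULL where a row lacks that key.

```python
def encode_flattened_map(rows):
    """Store a map column struct-like: one value stream per distinct key,
    padded with None (NULL) for rows that do not contain the key."""
    key_order = []  # preserve first-seen key order
    for row in rows:
        for k in row:
            if k not in key_order:
                key_order.append(k)
    return {k: [row.get(k) for row in rows] for k in key_order}

rows = [
    {"k1": "v1"},
    {"k1": "v2", "k2": "v3"},
    {"k1": "v5", "k2": "v4"},
]
streams = encode_flattened_map(rows)
# streams -> {"k1": ["v1", "v2", "v5"], "k2": [None, "v3", "v4"]}
```

Since each key's values now live in their own stream, a single key can be read, skipped, or deleted without decoding any other key.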
13. Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
14. Feature Reaping
▪ Feature Reaping frameworks generate Spark SQL queries based on table name, partitions, and reaped feature ids
▪ For each reaping SQL query, Spark has special customization in the query planner, execution engine, and commit protocol
▪ Each Spark task launches a SQL transform process and uses a native/C++ binary to do efficient flat map operations
[Diagram: each Spark Java executor pipes rows through a C++ reaper transform process, rewriting training_data_v1_1.orc and training_data_v1_2.orc into training_data_v2_1.orc and training_data_v2_2.orc]
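The cost asymmetry the C++ reaper exploits can be sketched in pure Python (illustrative only, not the actual reaper): with the flattened-map layout, reaping a feature just drops its value stream, while the regular layout forces every row's full map to be decoded and rewritten.

```python
def reap_flattened(streams, reaped_keys):
    """Flattened layout: drop the reaped keys' value streams.
    Surviving streams can be copied as-is, never decoded."""
    return {k: s for k, s in streams.items() if k not in reaped_keys}

def reap_naive(rows, reaped_keys):
    """Regular layout: every row's map must be decoded, filtered,
    and re-encoded, touching all keys and values."""
    return [
        {k: v for k, v in row.items() if k not in reaped_keys}
        for row in rows
    ]

streams = {"k1": ["v1", "v2", "v5"], "k2": [None, "v3", "v4"]}
rows = [{"k1": "v1"}, {"k1": "v2", "k2": "v3"}, {"k1": "v5", "k2": "v4"}]

# Reaping feature k2 from the flattened layout leaves k1 untouched
# reap_flattened(streams, {"k2"}) -> {"k1": ["v1", "v2", "v5"]}
```

The work in `reap_flattened` scales with the number of reaped streams, not with the total data volume, which is where the 14x-89x CPU savings on the next slide come from.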
15. Performance
[Charts: CPU cost (days) for flat map vs naïve solution*: 14x better on 20PB of data, 89x better on 300PB of data]
▪ Case 1
▪ Input data size: 20PB
▪ # of reaped features: 200
▪ # total features: ~1k
▪ Case 2
▪ Input data size: 300PB
▪ # of reaped features: 200
▪ # total features: ~10k
*Naïve solution: a Spark SQL query that re-writes all data, removing the reaped features from the map column with a UDF/lambda.
16. Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
17. Data Layouts (Tables and Physical Encodings)
Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
Requirements:
1. Allow fast ML training experimentation
2. Save storage space
18. Data Layouts (Tables and Physical Encodings)
Introducing: Aligned Tables!
19. Introducing: Aligned Table
▪ Intuition: Store the output of the join between the training table and the feature table in 2 separate row-by-row aligned tables
▪ An aligned table is a table that has the same layout as the original table
- Same number of files
- Same file names
- Same number of rows (and their order) in each file.
training table
- file_1.orc: (id, features) = (1, …), (2, …), (5, …)
- file_2.orc: (id, features) = (3, …), (4, …), (6, …)
feature table
- file_1.orc: (id, feature) = (1, f1), (2, f2), (4, f4), (6, f6)
aligned table
- file_1.orc: (id, feature) = (1, f1), (2, f2), (5, NULL)
- file_2.orc: (id, feature) = (3, NULL), (4, f4), (6, f6)
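The aligned-table write can be sketched as a pure-Python stand-in for the LEFT OUTER join plus per-file ordering (function and variable names are invented): for each training file, emit one aligned file with the same name, row count, and row order, filling NULL (None) for ids absent from the feature table.

```python
def build_aligned_files(training_files, feature_rows):
    """training_files: {file_name: [id, ...]} in on-disk row order.
    feature_rows: {id: injected_feature_value}.
    Returns aligned files mirroring the training files' names,
    row counts, and row order; a LEFT OUTER join in disguise."""
    return {
        fname: [(rid, feature_rows.get(rid)) for rid in ids]
        for fname, ids in training_files.items()
    }

# The example from the slide above
training = {"file_1.orc": [1, 2, 5], "file_2.orc": [3, 4, 6]}
features = {1: "f1", 2: "f2", 4: "f4", 6: "f6"}
aligned = build_aligned_files(training, features)
# aligned["file_1.orc"] -> [(1, "f1"), (2, "f2"), (5, None)]
# aligned["file_2.orc"] -> [(3, None), (4, "f4"), (6, "f6")]
```

Because the aligned files mirror the training files row for row, reading them back together needs no join keys at all, only positional merging.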
20. Query Plan for Aligned Table
Scan (training table)
Scan (feature table)
→ Project (…, file_name, row_order)
→ Join (LEFT OUTER)
→ Shuffle (file_name)
→ Sort (file_name, row_order)
→ InsertIntoHadoopFsRelationCommand (Aligned Table)
21. Reading Aligned Tables
▪ FB-ORC aligned table row-by-row merge reader
▪ Read each aligned table file with the corresponding original table file in one task
▪ Read row-by-row according to row order
▪ Merge aligned table columns per row with corresponding original table columns per row
[Diagram: reader task 1 merges training file_1.orc with its aligned file_1.orc row by row; reader task 2 does the same for file_2.orc]
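A minimal sketch of the row-by-row merge read (illustrative only, not the FB-ORC reader; names are invented): zip each training file against its identically named aligned file and fold the injected feature into the row's feature map.

```python
def merge_read(training_rows, aligned_rows, new_feature_key):
    """Merge one training file with its aligned file positionally.
    training_rows: [(id, {feature_key: value, ...}), ...]
    aligned_rows:  [(id, injected_value_or_None), ...]"""
    merged = []
    for (rid, feats), (aid, value) in zip(training_rows, aligned_rows):
        assert rid == aid  # guaranteed by the aligned-table write path
        out = dict(feats)
        if value is not None:  # NULL means the feature table had no row
            out[new_feature_key] = value
        merged.append((rid, out))
    return merged

# file_1.orc from the earlier example, with a base feature map per row
training_rows = [(1, {"base": 0.1}), (2, {"base": 0.2}), (5, {"base": 0.5})]
aligned_rows = [(1, "f1"), (2, "f2"), (5, None)]
merged = merge_read(training_rows, aligned_rows, "new")
```

No hashing, shuffling, or key comparison happens at read time; alignment turns the join into a linear zip over two files.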
22. End to End Performance
1. Baseline 1: Left Outer Join
▪ LEFT OUTER join that materializes new columns/sub-fields into training table
▪ Cons: Reads and overwrites ALL columns of training table every time
23. End to End Performance
Aligned Tables vs Left Outer Join
Compute Savings: 15x
Storage Savings: 30x
24. End to End Performance
2. Baseline 2: Lookup Hash Join
▪ Load feature table(s) into a distributed hash table (Laser1)
▪ Lookup hash join while reading training table
▪ Cons:
▪ Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency
▪ Needs a lookup hash join each time the training table is read
1Laser: a distributed hash table service built on top of RocksDB; see https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
25. End to End Performance
Aligned Tables vs Lookup Hash Join
Compute Savings: 1.5x
Storage Savings: 2.1x
26. Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
27. Future Work
▪ Better Spark SQL interface for ML primitives (e.g., UPSERTs)
▪ Onboarding more ML use cases to Spark
▪ Batch Inference
▪ Training
MERGE training_table
PARTITION(ds='2020-10-28', pipeline='...', ts)
USING (
SELECT ...) AS f
ON features[0][0] = f.key
WHEN MATCHED THEN UPDATE
SET float_features = MAP_CONCAT(float_features, f.densefeatures)