Scaling ML Feature Engineering with Apache Spark at Facebook
Cheng Su & Sameer Agarwal
Facebook Inc.
Machine Learning feature engineering is one of the most critical workloads on Spark at Facebook and serves as a means of improving the quality of each of the prediction models we have in production. Over the last year, we've added several features to Spark core/SQL to provide first-class support for Feature Injection and Feature Reaping. Feature Injection is an important prerequisite to (offline) ML training, in which the base features are injected/aligned with new/experimental features, with the goal of improving model performance over time. From a query engine's perspective, this can be thought of as a LEFT OUTER join between the base training table and the feature table, which, if implemented naively, can get extremely expensive. As part of this work, we added native support for writing indexed/aligned tables in Spark: if the data in the base table and the injected feature can be aligned during writes, the join itself can be performed inexpensively.

Slide 1: Scaling ML Feature Engineering with Apache Spark at Facebook
Cheng Su & Sameer Agarwal, Facebook Inc.
Slide 2: About Us
▪ Sameer Agarwal
  ▪ Software Engineer at Facebook (Data Platform Team)
  ▪ Apache Spark Committer (Spark Core/SQL)
  ▪ Previously at Databricks and UC Berkeley
▪ Cheng Su
  ▪ Software Engineer at Facebook (Data Platform Team)
  ▪ Apache Spark Contributor (Spark SQL)
  ▪ Previously worked on Hive & Hadoop at Facebook
Slide 3: Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
Slide 4: Machine Learning at Facebook¹
[Pipeline diagram: Data → Features → Training → Model → Inference → Predictions]
¹Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018.
Slide 5: Machine Learning at Facebook¹
[Same pipeline diagram; "This Talk" highlights:]
1. Data Layouts (Tables and Physical Encodings)
2. Feature Reaping
3. Feature Injection
¹Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018.
Slide 6: Agenda (section divider; same items as Slide 3)
Slide 7: Data Layouts (Tables and Physical Encodings)
▪ Training Data Table
  ▪ Table to store data for ML training
  ▪ Huge volume (multiple PBs/day)
  ▪ Schema: userId: BIGINT, adId: BIGINT, features: MAP<INT, DOUBLE>, …
▪ Feature Tables (e.g., gender, likes, age, state, country, …)
  ▪ Tables to store all possible features (many of them aren't promoted into the training data table)
  ▪ Smaller volume (low 100s of TBs/day)
  ▪ Schema: userId: BIGINT, features: MAP<INT, DOUBLE>, …
Slides 8–9: Data Layouts (Tables and Physical Encodings)
1. Feature Injection: extending base features with new/experimental features to improve model performance. Think "adding new keys to a map." (Features flow from the feature tables into the training data table.)
2. Feature Reaping: removing unnecessary features (id and value) from training data. Think "deleting existing keys from a map."
Slide 10: Background: Apache ORC
▪ Stripe (Row Group)
  ▪ Rows are divided into multiple groups
▪ Stream
  ▪ Columns are stored separately
  ▪ PRESENT, DATA, and LENGTH streams for each column
▪ Different encoding and compression strategy for each column
Slide 11: How is a Feature Map Stored in ORC?
▪ Keys and values are stored as separate streams/columns
  ▪ Raw data:
    Row 1: (k1, v1)
    Row 2: (k1, v2), (k2, v3)
    Row 3: (k1, v5), (k2, v4)
  ▪ Streams:
    Key stream: k1, k1, k2, k1, k2
    Value stream: v1, v2, v3, v5, v4
▪ Each stream is individually encoded and compressed
▪ Reading or deleting specific keys (i.e., feature reaping) becomes a problem: you need to read (decompress and decode) and rewrite ALL keys and values
[Diagram: features: MAP<INT, DOUBLE> stored as a STRUCT root over a MAP column whose INT key child and DOUBLE value child hold the key and value streams]
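The layout above can be modeled in a few lines of plain Python. This is a sketch of the idea only, not ORC's actual on-disk encoding: one key stream and one value stream for the whole column, plus a LENGTH stream recording entries per row. Note how dropping a single key forces every stream to be decoded and rewritten.

```python
# Sketch (not ORC's real encoding) of a MAP column laid out as shared streams.
def encode_map_column(rows):
    """rows: list of dicts, one dict of key -> value per row."""
    key_stream, value_stream, lengths = [], [], []
    for row in rows:
        lengths.append(len(row))          # LENGTH stream: entries per row
        for k, v in row.items():
            key_stream.append(k)          # key DATA stream
            value_stream.append(v)        # value DATA stream
    return key_stream, value_stream, lengths

def delete_key(key_stream, value_stream, lengths, dead_key):
    """Reaping one key means decoding and rewriting *every* stream."""
    new_keys, new_vals, new_lens, i = [], [], [], 0
    for n in lengths:
        kept = [(k, v)
                for k, v in zip(key_stream[i:i + n], value_stream[i:i + n])
                if k != dead_key]
        i += n
        new_lens.append(len(kept))
        new_keys.extend(k for k, _ in kept)
        new_vals.extend(v for _, v in kept)
    return new_keys, new_vals, new_lens

rows = [{'k1': 'v1'}, {'k1': 'v2', 'k2': 'v3'}, {'k1': 'v5', 'k2': 'v4'}]
ks, vs, ls = encode_map_column(rows)
# ks == ['k1', 'k1', 'k2', 'k1', 'k2']; vs == ['v1', 'v2', 'v3', 'v5', 'v4']
```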
Slide 12: Introducing: ORC Flattened Map
▪ The values that correspond to each key are stored as separate streams
  ▪ Raw data:
    Row 1: (k1, v1)
    Row 2: (k1, v2), (k2, v3)
    Row 3: (k1, v5), (k2, v4)
  ▪ Streams:
    k1 stream: v1, v2, v5
    k2 stream: NULL, v3, v4
  ▪ Stores the map like a struct
▪ Each key's value stream is individually encoded and compressed
▪ Reading or deleting specific keys becomes very efficient!
[Diagram: features: MAP<INT, DOUBLE> stored as a STRUCT root with one value stream per key (k1 and k2), each under its own sequence id]
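A matching sketch of the flattened-map idea (again plain Python, not the real FB-ORC implementation): one value stream per distinct key, with None standing in for NULL. Reaping a key is now just dropping its stream; no other stream is touched.

```python
# Sketch of the flattened-map layout: map stored like a struct.
def encode_flattened_map(rows):
    """One value stream per distinct key; None marks rows without the key."""
    keys = sorted({k for row in rows for k in row})
    return {k: [row.get(k) for row in rows] for k in keys}

rows = [{'k1': 'v1'}, {'k1': 'v2', 'k2': 'v3'}, {'k1': 'v5', 'k2': 'v4'}]
streams = encode_flattened_map(rows)
# streams == {'k1': ['v1', 'v2', 'v5'], 'k2': [None, 'v3', 'v4']}
streams.pop('k2')  # reaping k2: drop one stream, leave every other stream alone
```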
Slide 13: Agenda (section divider; same items as Slide 3)
Slide 14: Feature Reaping
▪ Feature reaping frameworks generate Spark SQL queries based on table name, partitions, and reaped feature ids
▪ For each reaping SQL query, Spark has special customization in the query planner, execution engine, and commit protocol
▪ Each Spark task launches a SQL transform process and uses a native C++ binary to do efficient flattened-map operations
[Diagram: Spark Java executors pipe files such as training_data_v1_1.orc and training_data_v1_2.orc through C++ reaper transforms, producing training_data_v2_1.orc and training_data_v2_2.orc]
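The query-generation step might look roughly like the following sketch. The real Facebook framework, its SQL dialect, and the hint name REAP_FEATURES are not public, so every identifier here is hypothetical and for illustration only.

```python
# Hypothetical sketch of a reaping framework templating a Spark SQL query
# from a table name, a partition spec, and the feature ids to remove.
def build_reaping_query(table, partition, reaped_ids):
    preds = ' AND '.join(f"{k} = '{v}'" for k, v in partition.items())
    ids = ', '.join(str(i) for i in sorted(reaped_ids))
    # REAP_FEATURES is a made-up hint standing in for the framework's
    # internal mechanism that routes the query to the C++ reaper transform.
    return (
        f"INSERT OVERWRITE TABLE {table} PARTITION ({preds}) "
        f"SELECT /*+ REAP_FEATURES({ids}) */ * FROM {table} WHERE {preds}"
    )
```

A framework would call this once per (table, partition) pair scheduled for reaping, then submit the resulting query to Spark.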
Slide 15: Performance
▪ Case 1: 20PB input data, 200 reaped features out of ~1k total: flattened map is 14x cheaper (in CPU days) than the naïve solution*
▪ Case 2: 300PB input data, 200 reaped features out of ~10k total: 89x cheaper
*Naïve solution: a Spark SQL query that rewrites all the data, removing the reaped features from the map column with a UDF/lambda.
Slide 16: Agenda (section divider; same items as Slide 3)
Slides 17–18: Data Layouts (Tables and Physical Encodings)
Feature Injection: extending base features with new/experimental features to improve model performance. Think "adding new keys to a map."
Requirements:
1. Allow fast ML training experimentation
2. Save storage space
Introducing: Aligned Tables!
Slide 19: Introducing: Aligned Table
▪ Intuition: store the output of the join between the training table and the feature table in two separate, row-by-row aligned tables
▪ An aligned table is a table with the same layout as the original table:
  ▪ Same number of files
  ▪ Same file names
  ▪ Same number of rows (and their order) in each file
[Example: the training table has file_1.orc (ids 1, 2, 5) and file_2.orc (ids 3, 4, 6); the feature table maps ids 1, 2, 4, 6 to f1, f2, f4, f6; the aligned table therefore has file_1.orc with (f1, f2, NULL) and file_2.orc with (NULL, f4, f6)]
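The slide's example can be reproduced with a small sketch: a LEFT OUTER join keyed by id whose output order and file layout are dictated by the training table, so the aligned table mirrors it file-for-file and row-for-row (plain Python, with None standing in for NULL).

```python
# Sketch: build an aligned table in memory. training_files maps each file
# name to the ordered row ids it contains; feature_rows is (id, value) pairs.
def build_aligned_table(training_files, feature_rows):
    features = dict(feature_rows)  # id -> feature value (the join key side)
    # For every training file, emit one aligned row per training row, in the
    # same order; ids missing from the feature table become None (LEFT OUTER).
    return {fname: [features.get(row_id) for row_id in ids]
            for fname, ids in training_files.items()}

training_files = {'file_1.orc': [1, 2, 5], 'file_2.orc': [3, 4, 6]}
feature_rows = [(1, 'f1'), (2, 'f2'), (4, 'f4'), (6, 'f6')]
aligned = build_aligned_table(training_files, feature_rows)
# aligned == {'file_1.orc': ['f1', 'f2', None],
#             'file_2.orc': [None, 'f4', 'f6']}
```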
Slide 20: Query Plan for Aligned Table
Scan (training table) and Scan (feature table), then:
Project (…, file_name, row_order) → Join (LEFT OUTER) → Shuffle (file_name) → Sort (file_name, row_order) → InsertIntoHadoopFsRelationCommand (aligned table)
[Uses the same training/feature/aligned table example as Slide 19]
Slide 21: Reading Aligned Tables
▪ FB-ORC aligned-table row-by-row merge reader
  ▪ Reads each aligned table file with the corresponding original table file in one task
  ▪ Reads row by row according to row order
  ▪ Merges each aligned table row's columns with the corresponding original table row's columns
[Diagram: reader task 1 pairs the training and aligned copies of file_1.orc; reader task 2 pairs the copies of file_2.orc]
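A minimal sketch of the merge read, assuming (as the slide states) that the paired files have identical row counts and ordering; the column name injected_feature is made up for illustration.

```python
# Sketch: zip a training file with its same-named aligned file and splice
# the injected column into each row, one pair of rows at a time.
def merge_read(training_file_rows, aligned_file_rows):
    # Identical row counts/order hold by construction of the aligned table.
    assert len(training_file_rows) == len(aligned_file_rows)
    for base_row, injected in zip(training_file_rows, aligned_file_rows):
        yield {**base_row, 'injected_feature': injected}

merged = list(merge_read([{'id': 1}, {'id': 2}, {'id': 5}],
                         ['f1', 'f2', None]))
# merged[0] == {'id': 1, 'injected_feature': 'f1'}
```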
Slides 22–25: End to End Performance
1. Baseline 1: Left Outer Join
  ▪ A LEFT OUTER join that materializes new columns/sub-fields into the training table
  ▪ Cons: reads and overwrites ALL columns of the training table every time
  ▪ Aligned tables vs. left outer join: 15x compute savings, 30x storage savings
2. Baseline 2: Lookup Hash Join
  ▪ Load the feature table(s) into a distributed hash table (Laser¹)
  ▪ Perform a lookup hash join while reading the training table
  ▪ Cons:
    ▪ Adds an external dependency on a distributed hash table; impacts latency, reliability, and efficiency
    ▪ Needs a lookup hash join each time the training table is read
  ▪ Aligned tables vs. lookup hash join: 1.5x compute savings, 2.1x storage savings
¹Laser: a distributed hash table service built on top of RocksDB; see https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
Slide 26: Agenda (section divider; same items as Slide 3)
Slide 27: Future Work
▪ Better Spark SQL interface for ML primitives (e.g., UPSERTs):

  MERGE training_table PARTITION(ds='2020-10-28', pipeline='...', ts)
  USING (SELECT ...) AS f
  ON features[0][0] = f.key
  WHEN MATCHED THEN UPDATE SET float_features = MAP_CONCAT(float_features, f.densefeatures)

▪ Onboarding more ML use cases to Spark
  ▪ Batch Inference
  ▪ Training
Slide 28: Thank you! Your feedback is important to us. Don't forget to rate and review the sessions.
