The document discusses the challenges of migrating a production pipeline from a legacy Big Data platform to Spark. It presents an approach using CyFlow, a framework built on Spark that allows component reuse and defines dependencies through a directed acyclic graph (DAG). Key challenges addressed include maintaining semantics during code conversion, meeting real-time constraints, and reducing costs. Metrics for validation include Jaccard similarity and precision/recall. Performance is tuned by aggregating state, modifying partitions, caching data, and unpersisting unneeded dataframes.
The Pill for Your Migration Hell
1. The Pill for Your Migration Hell
Roy Levin & Tomer Koren
Microsoft
2. Session Goals
• Discuss the challenges in migrating a production pipeline from a legacy Big Data platform to Spark
• Present a methodical approach to accomplishing this task
• Measure quality, optimize performance and scalability, and decide on the 'Definition of Done'
5. Background
[Architecture diagram: Raw Data 1..M → Preprocessing → Processed Data 1..K → Detection Pipelines 1..N. Within each pipeline: Feature Engineering 1..N → Features → Classification-Based Detection and Time Series Anomaly Detection; a State Manager tracks Previous and Current State; the resulting Alerts flow to the Alert Publisher.]
11. Maintaining Semantics
Schema changes:

Legacy:
Source IP   Destination IP   Source Hostname   Destination Hostname
1.0.0.1     2.0.0.2          Vm1               Vm2
3.0.0.3     4.0.0.4          Vm3

Spark:
Source IP   Destination IP   Source Hostname Exist?   Hostname   Host Id
1.0.0.1     2.0.0.2          True                     Vm1        111-11
1.0.0.1     2.0.0.2          False                    Vm2        222-22
3.0.0.3     4.0.0.4          True                     Vm3        333-33
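One way to compare outputs across such a schema change is to fold the Spark per-hostname rows back into the legacy one-row-per-flow shape before diffing. The sketch below is illustrative: the column names (`source_ip`, `is_source`, etc.) are assumptions, not the real production schema.

```python
# Hypothetical sketch: collapse the Spark schema (one row per hostname)
# back into the legacy schema (one row per flow) for field-by-field
# comparison. Column names are assumed for illustration.

def spark_rows_to_legacy(rows):
    """rows: dicts with source_ip, destination_ip, is_source, hostname."""
    flows = {}
    for r in rows:
        key = (r["source_ip"], r["destination_ip"])
        flow = flows.setdefault(
            key,
            {"source_ip": key[0], "destination_ip": key[1],
             "source_hostname": None, "destination_hostname": None})
        # 'Source Hostname Exist?' tells us which legacy column this row fills.
        if r["is_source"]:
            flow["source_hostname"] = r["hostname"]
        else:
            flow["destination_hostname"] = r["hostname"]
    return list(flows.values())

spark_rows = [
    {"source_ip": "1.0.0.1", "destination_ip": "2.0.0.2", "is_source": True,  "hostname": "Vm1"},
    {"source_ip": "1.0.0.1", "destination_ip": "2.0.0.2", "is_source": False, "hostname": "Vm2"},
    {"source_ip": "3.0.0.3", "destination_ip": "4.0.0.4", "is_source": True,  "hostname": "Vm3"},
]
legacy_view = spark_rows_to_legacy(spark_rows)
```

In a real pipeline the same fold would be a `groupBy` on the IP pair, but a plain-Python version makes the intended semantics explicit.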
12. Real Time Constraint
• Multiple ETL pipelines that run on an hourly basis
• The largest data feed contains ~4 TB of events per hour
• Running time should stay under 60 minutes to avoid accumulated latency
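A quick back-of-the-envelope check shows what that budget implies: 4 TB ingested within a 60-minute window requires over a gigabyte per second of sustained processing throughput (treating TB as TiB here for the arithmetic).

```python
# Rough throughput implied by the real-time constraint above:
# ~4 TB of events must be processed within the 60-minute budget.
feed_tb_per_hour = 4
budget_minutes = 60

bytes_per_hour = feed_tb_per_hour * 1024**4          # TiB -> bytes
required_gib_per_sec = bytes_per_hour / (budget_minutes * 60) / 1024**3
print(f"{required_gib_per_sec:.2f} GiB/s sustained")  # ~1.14 GiB/s
```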
15. View Legacy Code in terms of High-Level Components
[Same architecture diagram as slide 5, now read as a decomposition into high-level components: Preprocessing, Feature Engineering 1..N, Classification-Based Detection, Time Series Anomaly Detection, State Manager, Alert Publisher.]
16. View Legacy Code in terms of High-Level Components
[Repeat of the slide-15 diagram, with individual components highlighted.]
20. Managing it all
• This migration process needs to be done for every component of every detection
• Some components are reused across detections
• What about the connections between the components?
• These dependencies need to be represented and validated as well
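Representing those inter-component dependencies explicitly, as CyFlow does with a DAG, makes them easy to validate and order. A minimal sketch using Python's standard-library topological sorter (component names are illustrative, not CyFlow's actual API):

```python
# Minimal sketch: represent component dependencies as a DAG, validate
# there are no cycles, and derive a legal execution order.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each component maps to the set of components it depends on.
deps = {
    "preprocessing": set(),
    "feature_engineering": {"preprocessing"},
    "classification_detection": {"feature_engineering"},
    "alert_publisher": {"classification_detection"},
}

# static_order() raises CycleError if the dependencies are not a DAG,
# otherwise yields components in a dependency-respecting order.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The same structure also answers "what about the connections": a missing or cyclic edge is caught before anything runs.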
25-27. CyFlow (table built up across three slides)

                     Without CyFlow (using notebooks)        With CyFlow
Deployment           Multiple notebooks - one per component  Deploy the DAG
Unit Testing         No framework                            Standalone Spark with pyunit
Shared utility code  Import whl files - hard to maintain     Use a repository
Structure            Implicit, according to schedules        Explicit and visually depicted
Typing               No schema checks                        Schema checks before running
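The "schema checks before running" row is the kind of validation that can be done entirely offline: walk the DAG in run order and confirm every component's declared input columns are produced upstream, before anything is submitted to the cluster. A sketch of the idea (function and component names are illustrative):

```python
# Illustrative sketch of pre-run schema checking: verify each component's
# declared input columns are produced by some upstream component.

def check_schemas(components):
    """components: list of (name, input_cols, output_cols) in run order.
    Returns a list of human-readable errors; empty means the plan is valid."""
    available = set()
    errors = []
    for name, inputs, outputs in components:
        missing = set(inputs) - available
        if missing:
            errors.append(f"{name}: missing columns {sorted(missing)}")
        available |= set(outputs)
    return errors

pipeline = [
    ("preprocessing", [], ["source_ip", "destination_ip"]),
    ("feature_engineering", ["source_ip"], ["conn_count"]),
    ("detection", ["conn_count", "geo"], ["alert"]),  # 'geo' is never produced
]
print(check_schemas(pipeline))  # flags the missing 'geo' column
```

Catching this before a run matters most when a single run costs hours of cluster time.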
30. Validation -- A Closer Look
[Diagram: CosmosOutputData2 and SparkOutputData2 feed into a Comparator]
Recall our challenges:
• Legacy ML models
• Non-reproducible ML models
• Non-deterministic semantics (e.g., based on row numbers)
• Some non-translatable schema changes
• Some randomly generated UUIDs (e.g., AlertIds)
How much should we invest in achieving full parity?
31. Decide Based on Soft Metric of the Final Output

Alerts generated by Legacy Component (y):
#   Resource Id
1   Res1
2   Res2
3   Res3
4   Res4
5   Res5
6   Res6
7   Res7

Alerts generated by Spark Component (y′):
#   Resource Id
1   Res1
2   Res3
3   Res4
4   Res5
5   Res6
6   Res8
7   Res9
8   Res10

Jaccard Similarity:  js = |y ∩ y′| / |y ∪ y′| = 5/10 = 0.5
Precision:           pr = |y ∩ y′| / |y′|     = 5/8  ≈ 0.63
Recall:              re = |y ∩ y′| / |y|      = 5/7  ≈ 0.71

For y ∩ y′ we write an alert content validator (for the rest of the column values)
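Recomputing the three soft metrics directly from the two alert tables above takes only a few lines of set arithmetic:

```python
# Soft metrics over the alert Resource Ids from the two tables above.
legacy = {"Res1", "Res2", "Res3", "Res4", "Res5", "Res6", "Res7"}           # y
spark  = {"Res1", "Res3", "Res4", "Res5", "Res6", "Res8", "Res9", "Res10"}  # y'

inter = legacy & spark          # 5 alerts agree on Resource Id

jaccard   = len(inter) / len(legacy | spark)  # 5 / 10 = 0.5
precision = len(inter) / len(spark)           # 5 / 8  ~ 0.63
recall    = len(inter) / len(legacy)          # 5 / 7  ~ 0.71
```

The full check is two-stage: set overlap on a key column picks the candidate matches, and a content validator then compares the remaining column values within y ∩ y′.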
33. Measure Running Time on Actual Load
14 hours!!
[CyFlow DAG: Processed Data 1 and Processed Data 2 → Feature Engineering → Classification-Based Detection → Alert Publisher → Published Alerts]
34. Finding the culprit
[Break the Feature Engineering step apart and time each stage separately: Transformation 1 → Intermediate Result 1, then Transformation 2 → Intermediate Result 2. One transformation accounts for 13 hours!!, the other only 20 minutes.]
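The technique on this slide, materializing intermediate results so each transformation can be timed in isolation, can be sketched in plain Python (in Spark you would force evaluation with an action such as `count` or by writing the intermediate result out):

```python
# Illustrative stand-in for per-stage timing: force evaluation between
# transformations so each stage's cost is measured separately.
import time

def timed(name, transform, data):
    start = time.perf_counter()
    result = list(transform(data))   # force materialization, like a Spark action
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.3f}s over {len(result)} rows")
    return result

rows = range(10_000)
intermediate_1 = timed("transformation 1", lambda xs: (x * 2 for x in xs), rows)
intermediate_2 = timed("transformation 2", lambda xs: (x + 1 for x in xs), intermediate_1)
```

Without the forced materialization, lazy evaluation would attribute the whole 14 hours to whichever action finally triggered the computation.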
35. Tuning Performance
• Aggregate hourly state into daily state (end of day)
[Diagram: 24 hourly slices (Partitions 1..N, each holding state for 12:00 AM, 01:00 AM, ..., 11:00 PM) flow into a Daily Aggregator, producing one daily state per partition]
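The hourly-to-daily aggregation can be sketched as a simple fold of 24 hourly per-key states into one daily state. The data shape here (per-key counters) is an assumption for illustration; the production state is presumably richer.

```python
# Sketch of the hourly -> daily state aggregation: fold 24 hourly
# per-key counters into a single end-of-day state. The counter shape
# is assumed for illustration.
from collections import Counter

def aggregate_daily(hourly_states):
    """hourly_states: list of 24 dicts mapping key -> count."""
    daily = Counter()
    for state in hourly_states:
        daily.update(state)          # merge counts key by key
    return dict(daily)

hourly = [{"vm1": 1, "vm2": 2}] * 24  # 24 identical hourly slices
daily = aggregate_daily(hourly)
```

Downstream consumers then read one state per partition per day instead of 24, which is what cuts the per-run cost.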
36. Tuning Performance
• Modify the default partition count (number of cores × 3)
• Use broadcasting when UDFs with large signatures are reused
• Cache (beware of memory failures)
• Unpersist: remove DataFrames from the cache when no longer needed
38. Summary
• Present the challenges in migrating a large-scale legacy Big Data system to Spark
  • Preserving semantics
  • Real-time constraints
  • COGS
• Introduce CyFlow, a framework built over Apache Spark that allows component reuse & connectivity
• Discuss different validation strategies
• Reducing runtime and COGS: aspects of Spark performance tuning