The document discusses the challenges of migrating a production pipeline from a legacy Big Data platform to Spark. It presents an approach using CyFlow, a framework built on Spark that allows component reuse and defines dependencies through a directed acyclic graph (DAG). Key challenges addressed include maintaining semantics during code conversion, meeting real-time constraints, and reducing costs. Metrics for validation include Jaccard similarity and precision/recall. Performance is tuned by aggregating state, modifying partitions, caching data, and unpersisting unneeded dataframes.
The Pill for Your Migration Hell
1. The Pill for Your Migration Hell
Roy Levin & Tomer Koren
Microsoft
2. Session Goals
• Discuss the challenges in migrating a production pipeline from a legacy Big Data platform to Spark
• Present a methodical approach to accomplishing this task
• Measure quality, optimize performance and scalability, and decide on the 'Definition of Done'
5. Background
[Architecture diagram: Raw Data 1..M → Preprocessing → Processed Data 1..K → Detection Pipelines 1..N. Within each pipeline: Feature Engineering 1..N → Features → Classification-Based Detection and Time Series Anomaly Detection; a State Manager tracks Previous and Current State; the resulting Alerts flow to the Alert Publisher.]
11. Maintaining Semantics
Schema changes:

Legacy:
Source IP   Destination IP   Source Hostname   Destination Hostname
1.0.0.1     2.0.0.2          Vm1               Vm2
3.0.0.3     4.0.0.4          Vm3

Spark:
Source IP   Destination IP   Source Hostname Exist?   Hostname   Host Id
1.0.0.1     2.0.0.2          True                     Vm1        111-11
1.0.0.1     2.0.0.2          False                    Vm2        222-22
3.0.0.3     4.0.0.4          True                     Vm3        333-33
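One way to compare outputs across such a schema change is to fold the Spark per-hostname rows back into the legacy one-row-per-flow shape before diffing. The sketch below is illustrative: the column names (`source_ip`, `is_source`, etc.) are assumptions, not the real production schema.

```python
# Hypothetical sketch: collapse the Spark schema (one row per hostname)
# back into the legacy schema (one row per flow) for field-by-field
# comparison. Column names are assumed for illustration.

def spark_rows_to_legacy(rows):
    """rows: dicts with source_ip, destination_ip, is_source, hostname."""
    flows = {}
    for r in rows:
        key = (r["source_ip"], r["destination_ip"])
        flow = flows.setdefault(
            key,
            {"source_ip": key[0], "destination_ip": key[1],
             "source_hostname": None, "destination_hostname": None})
        # 'Source Hostname Exist?' tells us which legacy column this row fills.
        if r["is_source"]:
            flow["source_hostname"] = r["hostname"]
        else:
            flow["destination_hostname"] = r["hostname"]
    return list(flows.values())

spark_rows = [
    {"source_ip": "1.0.0.1", "destination_ip": "2.0.0.2", "is_source": True,  "hostname": "Vm1"},
    {"source_ip": "1.0.0.1", "destination_ip": "2.0.0.2", "is_source": False, "hostname": "Vm2"},
    {"source_ip": "3.0.0.3", "destination_ip": "4.0.0.4", "is_source": True,  "hostname": "Vm3"},
]
legacy_view = spark_rows_to_legacy(spark_rows)
```

In a real pipeline the same fold would be a `groupBy` on the IP pair, but a plain-Python version makes the intended semantics explicit.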
12. Real Time Constraint
• Multiple ETL pipelines that run on an hourly basis
• The largest data feed contains ~4 TB of events per hour
• Running time should stay under 60 minutes to avoid accumulated latency
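A quick back-of-the-envelope check shows what that budget implies: 4 TB ingested within a 60-minute window requires over a gigabyte per second of sustained processing throughput (treating TB as TiB here for the arithmetic).

```python
# Rough throughput implied by the real-time constraint above:
# ~4 TB of events must be processed within the 60-minute budget.
feed_tb_per_hour = 4
budget_minutes = 60

bytes_per_hour = feed_tb_per_hour * 1024**4          # TiB -> bytes
required_gib_per_sec = bytes_per_hour / (budget_minutes * 60) / 1024**3
print(f"{required_gib_per_sec:.2f} GiB/s sustained")  # ~1.14 GiB/s
```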
15. View Legacy Code in terms of High-Level Components
[Same architecture diagram as slide 5, now read as a decomposition into high-level components: Preprocessing, Feature Engineering 1..N, Classification-Based Detection, Time Series Anomaly Detection, State Manager, Alert Publisher.]
16. View Legacy Code in terms of High-Level Components
[Repeat of the slide-15 diagram, with individual components highlighted.]
20. Managing it all
• This migration process needs to be done for every component of every detection
• Some components are reused across detections
• What about the connections between the components?
• These dependencies need to be represented and validated as well
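Representing those inter-component dependencies explicitly, as CyFlow does with a DAG, makes them easy to validate and order. A minimal sketch using Python's standard-library topological sorter (component names are illustrative, not CyFlow's actual API):

```python
# Minimal sketch: represent component dependencies as a DAG, validate
# there are no cycles, and derive a legal execution order.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each component maps to the set of components it depends on.
deps = {
    "preprocessing": set(),
    "feature_engineering": {"preprocessing"},
    "classification_detection": {"feature_engineering"},
    "alert_publisher": {"classification_detection"},
}

# static_order() raises CycleError if the dependencies are not a DAG,
# otherwise yields components in a dependency-respecting order.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The same structure also answers "what about the connections": a missing or cyclic edge is caught before anything runs.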
25-27. CyFlow (table built up across three slides)

                     Without CyFlow (using notebooks)        With CyFlow
Deployment           Multiple notebooks - one per component  Deploy the DAG
Unit Testing         No framework                            Standalone Spark with pyunit
Shared utility code  Import whl files - hard to maintain     Use a repository
Structure            Implicit, according to schedules        Explicit and visually depicted
Typing               No schema checks                        Schema checks before running
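The "schema checks before running" row is the kind of validation that can be done entirely offline: walk the DAG in run order and confirm every component's declared input columns are produced upstream, before anything is submitted to the cluster. A sketch of the idea (function and component names are illustrative):

```python
# Illustrative sketch of pre-run schema checking: verify each component's
# declared input columns are produced by some upstream component.

def check_schemas(components):
    """components: list of (name, input_cols, output_cols) in run order.
    Returns a list of human-readable errors; empty means the plan is valid."""
    available = set()
    errors = []
    for name, inputs, outputs in components:
        missing = set(inputs) - available
        if missing:
            errors.append(f"{name}: missing columns {sorted(missing)}")
        available |= set(outputs)
    return errors

pipeline = [
    ("preprocessing", [], ["source_ip", "destination_ip"]),
    ("feature_engineering", ["source_ip"], ["conn_count"]),
    ("detection", ["conn_count", "geo"], ["alert"]),  # 'geo' is never produced
]
print(check_schemas(pipeline))  # flags the missing 'geo' column
```

Catching this before a run matters most when a single run costs hours of cluster time.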
30. Validation -- A Closer Look
[Diagram: CosmosOutputData2 and SparkOutputData2 feed into a Comparator]
Recall our challenges:
• Legacy ML models
• Non-reproducible ML models
• Non-deterministic semantics (e.g., based on row numbers)
• Some non-translatable schema changes
• Some randomly generated UUIDs (e.g., AlertIds)
How much should we invest in achieving full parity?
31. Decide Based on Soft Metric of the Final Output

Alerts generated by Legacy Component (y):
#   Resource Id
1   Res1
2   Res2
3   Res3
4   Res4
5   Res5
6   Res6
7   Res7

Alerts generated by Spark Component (y′):
#   Resource Id
1   Res1
2   Res3
3   Res4
4   Res5
5   Res6
6   Res8
7   Res9
8   Res10

Jaccard Similarity:  js = |y ∩ y′| / |y ∪ y′| = 5/10 = 0.5
Precision:           pr = |y ∩ y′| / |y′|     = 5/8  ≈ 0.63
Recall:              re = |y ∩ y′| / |y|      = 5/7  ≈ 0.71

For y ∩ y′ we write an alert content validator (for the rest of the column values)
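Recomputing the three soft metrics directly from the two alert tables above takes only a few lines of set arithmetic:

```python
# Soft metrics over the alert Resource Ids from the two tables above.
legacy = {"Res1", "Res2", "Res3", "Res4", "Res5", "Res6", "Res7"}           # y
spark  = {"Res1", "Res3", "Res4", "Res5", "Res6", "Res8", "Res9", "Res10"}  # y'

inter = legacy & spark          # 5 alerts agree on Resource Id

jaccard   = len(inter) / len(legacy | spark)  # 5 / 10 = 0.5
precision = len(inter) / len(spark)           # 5 / 8  ~ 0.63
recall    = len(inter) / len(legacy)          # 5 / 7  ~ 0.71
```

The full check is two-stage: set overlap on a key column picks the candidate matches, and a content validator then compares the remaining column values within y ∩ y′.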
33. Measure Running Time on Actual Load
14 hours!!
[CyFlow DAG: Processed Data 1 and Processed Data 2 → Feature Engineering → Classification-Based Detection → Alert Publisher → Published Alerts]
34. Finding the culprit
[Break the Feature Engineering step apart and time each stage separately: Transformation 1 → Intermediate Result 1, then Transformation 2 → Intermediate Result 2. One transformation accounts for 13 hours!!, the other only 20 minutes.]
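The technique on this slide, materializing intermediate results so each transformation can be timed in isolation, can be sketched in plain Python (in Spark you would force evaluation with an action such as `count` or by writing the intermediate result out):

```python
# Illustrative stand-in for per-stage timing: force evaluation between
# transformations so each stage's cost is measured separately.
import time

def timed(name, transform, data):
    start = time.perf_counter()
    result = list(transform(data))   # force materialization, like a Spark action
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.3f}s over {len(result)} rows")
    return result

rows = range(10_000)
intermediate_1 = timed("transformation 1", lambda xs: (x * 2 for x in xs), rows)
intermediate_2 = timed("transformation 2", lambda xs: (x + 1 for x in xs), intermediate_1)
```

Without the forced materialization, lazy evaluation would attribute the whole 14 hours to whichever action finally triggered the computation.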
35. Tuning Performance
• Aggregate hourly state into daily state (end of day)
[Diagram: 24 hourly slices (Partitions 1..N, each holding state for 12:00 AM, 01:00 AM, ..., 11:00 PM) flow into a Daily Aggregator, producing one daily state per partition]
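The hourly-to-daily aggregation can be sketched as a simple fold of 24 hourly per-key states into one daily state. The data shape here (per-key counters) is an assumption for illustration; the production state is presumably richer.

```python
# Sketch of the hourly -> daily state aggregation: fold 24 hourly
# per-key counters into a single end-of-day state. The counter shape
# is assumed for illustration.
from collections import Counter

def aggregate_daily(hourly_states):
    """hourly_states: list of 24 dicts mapping key -> count."""
    daily = Counter()
    for state in hourly_states:
        daily.update(state)          # merge counts key by key
    return dict(daily)

hourly = [{"vm1": 1, "vm2": 2}] * 24  # 24 identical hourly slices
daily = aggregate_daily(hourly)
```

Downstream consumers then read one state per partition per day instead of 24, which is what cuts the per-run cost.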
36. Tuning Performance
• Modify the default partition count (number of cores × 3)
• Use broadcasting when UDFs with large signatures are reused
• Cache (beware of memory failures)
• Unpersist: remove DataFrames from the cache when no longer needed
38. Summary
• Present the challenges in migrating a large-scale legacy Big Data system to Spark
  • Preserving semantics
  • Real-time constraints
  • COGS
• Introduce CyFlow, a framework built over Apache Spark that allows component reuse & connectivity
• Discuss different validation strategies
• Reducing runtime and COGS: aspects of Spark performance tuning