The Pill for your
Migration Hell
Roy Levin & Tomer Koren
Microsoft
Session Goals
• Discuss the challenges in migrating a production pipeline from a legacy Big Data platform to Spark.
• Present a methodical approach to accomplish this task.
• Measure quality, optimize performance and scalability, and decide on the 'Definition of Done'.
Background
On Premises
Azure Security Center
[Figure: detection system architecture. Raw Data 1..M pass through Preprocessing into Processed Data 1..K, which feed Detection Pipelines 1..N. The pipelines contain Feature Engineering 1..N (producing Features), Classification Based Detection, Time Series Anomaly Detection, and a State Manager holding the Current State and Previous States. The resulting Alerts go to the Alert Publisher.]
Agenda
Discussing the Challenges
Formalization
Defining Metrics & Testing
Tuning Performance
Summarization
Expectations Vs Reality
[Figure: the Legacy Pipeline, expectation versus reality]
Challenges
• Maintaining Semantics (Same Input == Same Output)
• Real Time Constraint (Over 100 TB per Day)
• Cost of goods sold (COGS)
Maintaining Semantics
Code Conversion:
• Rewrite UDFs (C# -> Python)
• Rewrite Transformations (USQL -> Pyspark)
[Figure: side-by-side code samples, USQL and PySpark]
Maintaining Semantics
Code Conversion:
• Different Datatypes:
• Different ML Libraries
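As one illustration of how datatype semantics can drift in a C# -> Python rewrite (an example chosen here, not one taken from the slides): C# integer `/` and `%` truncate toward zero, while Python's `//` and `%` floor. A minimal sketch of helpers a ported UDF could use to keep the C# behaviour; the function names are hypothetical.

```python
def csharp_int_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as C# `/` does on ints."""
    q = abs(a) // abs(b)
    # Same sign -> positive quotient, different signs -> negative quotient.
    return q if (a >= 0) == (b >= 0) else -q


def csharp_int_mod(a: int, b: int) -> int:
    """Remainder whose sign follows the dividend, as C# `%` does on ints."""
    return a - b * csharp_int_div(a, b)


# Python floors, so the native operators give different answers
# for negative operands.
python_quotient = -7 // 2               # -4
ported_quotient = csharp_int_div(-7, 2)  # -3
python_remainder = -7 % 2                # 1
ported_remainder = csharp_int_mod(-7, 2)  # -1
```

A difference like this only shows up on negative inputs, which is why comparing outputs on real data matters more than reading the two code versions side by side.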
Maintaining Semantics
Schema Changes:
Source IP   Destination IP   Source Hostname   Destination Hostname
1.0.0.1     2.0.0.2          Vm1               Vm2
3.0.0.3     4.0.0.4          Vm3

Source IP   Destination IP   Source Hostname Exist?   Hostname   Host Id
1.0.0.1     2.0.0.2          True                     Vm1        111-11
1.0.0.1     2.0.0.2          False                    Vm2        222-22
3.0.0.3     4.0.0.4          True                     Vm3        333-33

(Slide labels for the two schemas: Spark, Legacy)
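A schema change like this one can be bridged by a small translation step before outputs are compared. The sketch below is plain Python rather than Spark, and it assumes the boolean column marks whether the hostname belongs to the source side; the field names and the `long_to_wide` function are hypothetical. The Host Id column has no counterpart in the four-column schema, so it is dropped.

```python
from collections import defaultdict


def long_to_wide(rows):
    """Collapse one-row-per-host records into one-row-per-connection records."""
    grouped = defaultdict(dict)
    for r in rows:
        key = (r["source_ip"], r["destination_ip"])
        # Assumption: True means the hostname is the source host.
        role = "source_hostname" if r["is_source"] else "destination_hostname"
        grouped[key][role] = r["hostname"]
    return [
        {
            "source_ip": src,
            "destination_ip": dst,
            "source_hostname": hosts.get("source_hostname"),
            "destination_hostname": hosts.get("destination_hostname"),
        }
        for (src, dst), hosts in grouped.items()
    ]


long_rows = [
    {"source_ip": "1.0.0.1", "destination_ip": "2.0.0.2",
     "is_source": True, "hostname": "Vm1", "host_id": "111-11"},
    {"source_ip": "1.0.0.1", "destination_ip": "2.0.0.2",
     "is_source": False, "hostname": "Vm2", "host_id": "222-22"},
    {"source_ip": "3.0.0.3", "destination_ip": "4.0.0.4",
     "is_source": True, "hostname": "Vm3", "host_id": "333-33"},
]
wide_rows = long_to_wide(long_rows)
```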
Real Time Constraint
• Multiple ETL pipelines that run on an hourly basis
• The largest data feed contains ~4TB of events per hour
• Running time should be less than 60 minutes to avoid accumulated latency
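The hourly budget translates into a sustained throughput floor. A back-of-the-envelope sketch, assuming decimal units (1 TB = 10^12 bytes), which the slides do not specify:

```python
TB = 10**12  # decimal terabyte; an assumption about the units on the slide
GB = 10**9


def required_throughput_gb_per_s(bytes_per_run, deadline_minutes):
    """Minimum sustained throughput needed to finish a run before the deadline."""
    return bytes_per_run / (deadline_minutes * 60) / GB


# Largest feed: ~4 TB must be processed in under 60 minutes.
largest_feed_gb_per_s = required_throughput_gb_per_s(4 * TB, 60)

# Overall volume: 100 TB per day averages out to this many TB per hour.
average_tb_per_hour = 100 / 24
```

This gives roughly 1.1 GB/s for the largest feed alone, and any run that overshoots the hour pushes its delay onto the next one.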
Cost of goods sold (COGS)
View Legacy Code in terms of
High-Level Components
[Figure: the detection architecture diagram from the Background slides, with its boxes treated as high-level components.]
[Figure: the same diagram, with a legend distinguishing Cosmos components from Spark components.]
[Figure: the Feature Engineering component in both versions. Inputs: CosmosInputData1, CosmosInputData2. Cosmos Feature Engineering produces CosmosOutputData1 and CosmosOutputData2; Spark Feature Engineering produces SparkOutputData1 and SparkOutputData2.]
Schema mismatch!
Validating the Migrated Components
[Figure: the same Cosmos and Spark Feature Engineering components and datasets as above. A Translator and a Comparator are attached to each pair of corresponding outputs (CosmosOutputData1 with SparkOutputData1, CosmosOutputData2 with SparkOutputData2).]
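A comparator of this kind can be sketched as a multiset comparison that ignores row order and skips columns known to differ by design. This is plain Python over lists of dicts, not the speakers' implementation; `compare`, `row_key`, and the sample column names are hypothetical.

```python
from collections import Counter


def row_key(row, ignore):
    """Order-independent, hashable form of a row without the ignored columns."""
    return tuple(sorted((k, v) for k, v in row.items() if k not in ignore))


def compare(legacy_rows, spark_rows, ignore=()):
    """Count rows that match, and rows that appear on one side only."""
    legacy = Counter(row_key(r, ignore) for r in legacy_rows)
    spark = Counter(row_key(r, ignore) for r in spark_rows)
    return {
        "matching": sum((legacy & spark).values()),
        "only_in_legacy": sum((legacy - spark).values()),
        "only_in_spark": sum((spark - legacy).values()),
    }


legacy_output = [
    {"ip": "1.0.0.1", "score": 3, "alert_id": "a"},
    {"ip": "3.0.0.3", "score": 5, "alert_id": "b"},
]
spark_output = [
    {"ip": "3.0.0.3", "score": 5, "alert_id": "x"},
    {"ip": "1.0.0.1", "score": 4, "alert_id": "y"},
]
# alert_id is randomly generated on each run, so it is excluded.
report = compare(legacy_output, spark_output, ignore={"alert_id"})
```

Using counters rather than sets keeps duplicate rows visible, so a component that emits the same row twice is not reported as matching one that emits it once.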
Managing it all
• This migration process needs to be done for every component of
every detection
• Some components are reused across detections
• What about the connections between the components?
• These dependencies need to be represented and validated as well
Introducing CyFlow
Component Reuse & Connectivity
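MultiTransformer per Component
[Figure: a MultiTransformer takes a MultiDataItem as input and produces a MultiDataItem as output. A MultiDataItem is a set of named DataItems (name1, name2, …, namen on the input side; name1, name2, …, namem on the output side); a DataItem holds a DataFrame or a Model.]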
DAG of Components
[Figure: several MultiTransformers (MT) wired together through their MultiDataItems according to their dependencies; the whole graph forms a DagMultiTransformer.]
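The idea of named inputs and outputs wired into a DAG can be sketched in a few lines. The class names below are borrowed from the slides, but the constructor arguments, the `transform` method, and the wiring by item name are all assumptions made for illustration; this is not the CyFlow API, and plain lists stand in for DataFrames.

```python
from graphlib import TopologicalSorter  # Python 3.9+


class MultiTransformer:
    """A component that maps named input items to named output items."""

    def __init__(self, name, inputs, outputs, fn):
        self.name, self.inputs, self.outputs, self.fn = name, inputs, outputs, fn

    def transform(self, items):
        result = self.fn(**{n: items[n] for n in self.inputs})
        missing = set(self.outputs) - set(result)
        if missing:
            raise KeyError(f"{self.name} did not produce {sorted(missing)}")
        return result


class DagMultiTransformer:
    """Runs components in dependency order, inferred from item names."""

    def __init__(self, transformers):
        self.transformers = {t.name: t for t in transformers}

    def transform(self, items):
        producer = {o: t.name for t in self.transformers.values() for o in t.outputs}
        # Map each component to the components that produce its inputs.
        graph = {
            t.name: {producer[i] for i in t.inputs if i in producer}
            for t in self.transformers.values()
        }
        items = dict(items)
        for name in TopologicalSorter(graph).static_order():
            items.update(self.transformers[name].transform(items))
        return items


dag = DagMultiTransformer([
    MultiTransformer("detect", ["features"], ["alerts"],
                     lambda features: {"alerts": [f for f in features if f > 10]}),
    MultiTransformer("features", ["processed"], ["features"],
                     lambda processed: {"features": [x * x for x in processed]}),
    MultiTransformer("preprocess", ["raw"], ["processed"],
                     lambda raw: {"processed": [x for x in raw if x is not None]}),
])
result = dag.transform({"raw": [1, None, 4, 2]})
```

The components are listed in the wrong order on purpose: the execution order comes from the declared inputs and outputs, not from the order of registration.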
Stateful MultiTransformers
[Figure: a StatefulMultiTransformer outputs dataset1 and dataset2. At iteration i it reads the slice of output dataset1 written at iteration i-1 and writes the slice for iteration i.]
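A stateful component can be sketched as one whose first output at iteration i-1 becomes part of its input at iteration i. The class name follows the slide; the interface, the `update_counts` function, and the choice of per-host counts as state are hypothetical.

```python
class StatefulMultiTransformer:
    """Feeds the previous iteration's dataset1 back in as state."""

    def __init__(self, fn, initial_state):
        self.fn = fn
        self.state = initial_state

    def transform(self, batch):
        outputs = self.fn(self.state, batch)
        # The dataset1 slice of iteration i is the state for iteration i+1.
        self.state = outputs["dataset1"]
        return outputs


def update_counts(previous, batch):
    """dataset1: running per-host counts; dataset2: hosts never seen before."""
    counts = dict(previous)
    for host in batch:
        counts[host] = counts.get(host, 0) + 1
    new_hosts = [h for h in batch if h not in previous]
    return {"dataset1": counts, "dataset2": new_hosts}


transformer = StatefulMultiTransformer(update_counts, initial_state={})
first = transformer.transform(["vm1", "vm2"])
second = transformer.transform(["vm2", "vm3"])
```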
CyFlow

                     Without CyFlow (using notebooks)         With CyFlow
Deployment           Multiple notebooks - one per component   Deploy the DAG
Unit Testing         No framework                             Standalone Spark with pyunit
Shared utility code  Import whl files - hard to maintain      Use a repository
Structure            Implicit, according to schedules         Explicit and visually depicted
Typing               No schema checks                         Schema checks before running
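A schema check before running can be as simple as comparing a declared column-to-type mapping with the actual one. A minimal sketch, with a hypothetical `check_schema` helper; in PySpark the actual mapping could be obtained with `dict(df.dtypes)`, which yields column names and type strings.

```python
def check_schema(declared, actual):
    """Return a list of readable schema problems (empty when everything matches)."""
    problems = []
    for column, dtype in declared.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != dtype:
            problems.append(f"{column}: expected {dtype}, got {actual[column]}")
    return problems


declared = {"source_ip": "string", "event_count": "bigint"}
actual = {"source_ip": "string", "event_count": "int", "extra": "string"}
problems = check_schema(declared, actual)
```

Running this for every edge of the DAG before any data is processed turns a mismatch into an immediate failure instead of one that surfaces hours into a job.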
Validating the Migrated Components - revisited
[Figure: the validation diagram again: Cosmos and Spark Feature Engineering, their input and output datasets, and a Translator and Comparator per output pair.]
Validation -- A Closer Look
[Figure: CosmosOutputData2 and SparkOutputData2 feeding the Comparator.]
Recall our challenges
• Legacy ML models
• Non-reproducible ML models
• Non-deterministic semantics (e.g., based on row numbers)
• Some non-translatable schema changes
• Some randomly generated UUIDs (e.g., AlertIds)
How much to invest in achieving full parity?
Decide Based on Soft Metric of the Final Output
Alerts generated by Legacy Component ($y$)
#   Resource Id
1   Res1
2   Res2
3   Res3
4   Res4
5   Res5
6   Res6
7   Res7

Alerts generated by Spark Component ($y'$)
#   Resource Id
1   Res1
2   Res3
3   Res4
4   Res5
5   Res6
6   Res8
7   Res9
8   Res10

Jaccard Similarity
$js = \frac{|y \cap y'|}{|y \cup y'|} = \frac{5}{10} = 0.50$

Precision
$pr = \frac{|y \cap y'|}{|y'|} = \frac{5}{8} \approx 0.63$

Recall
$rc = \frac{|y \cap y'|}{|y|} = \frac{5}{7} \approx 0.71$

For $y \cap y'$ we write an alert content validator (for the rest of the column values).
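The three soft metrics are straightforward to compute once each component's alerts are reduced to a set of resource ids. A minimal sketch on the example lists above, where the intersection holds 5 resources and the union holds 10; the function name is hypothetical.

```python
def soft_metrics(y, y_prime):
    """Jaccard similarity, precision and recall of y_prime measured against y."""
    common = y & y_prime
    return {
        "jaccard": len(common) / len(y | y_prime),
        "precision": len(common) / len(y_prime),
        "recall": len(common) / len(y),
    }


legacy_alerts = {"Res1", "Res2", "Res3", "Res4", "Res5", "Res6", "Res7"}
spark_alerts = {"Res1", "Res3", "Res4", "Res5", "Res6", "Res8", "Res9", "Res10"}

metrics = soft_metrics(legacy_alerts, spark_alerts)
```

Here the legacy output plays the role of ground truth, so precision asks how many Spark alerts the legacy component also raised, and recall asks how many legacy alerts Spark reproduced.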
Measure Running Time on Actual Load
14 Hours !!
[Figure: CyFlow DAG. Processed Data 1 and Processed Data 2 flow through Feature Engineering, Classification Based Detection and the Alert Publisher, ending in Published Alerts.]
Finding the culprit
[Figure: Feature Engineering broken down into Transformation 1, which produces Intermediate Result 1, and Transformation 2, which produces Intermediate Result 2, each timed separately.]
Measured running times: 20 minutes and 13 hours !!
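Isolating the slow step comes down to timing each transformation on its own. A generic sketch with hypothetical stage names and toy stages; with Spark, each stage would also need to force evaluation (for example by writing its intermediate result), because transformations are lazy and would otherwise be charged to a later step.

```python
import time


def time_stages(stages, data):
    """Run stages in order and record the wall-clock seconds each one takes."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


stages = [
    ("transformation_1", lambda rows: [r * 2 for r in rows]),
    ("transformation_2", lambda rows: sorted(rows, reverse=True)),
]
result, timings = time_stages(stages, [3, 1, 2])
slowest = max(timings, key=timings.get)
```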
Tuning Performance
• Aggregate hourly state into daily state (end of day)
[Figure: 24 hourly slices (12:00 AM, 01:00 AM, 02:00 AM, …, 11:00 PM), each holding the State of Partition 1 … Partition N, are fed to a Daily Aggregator, which produces one Daily State per partition.]
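The daily aggregation can be sketched as folding the 24 hourly slices into one state per partition. The sketch assumes the state is a set of additive counters, which the slides do not specify; `aggregate_daily` and the partition names are hypothetical.

```python
from collections import Counter


def aggregate_daily(hourly_slices):
    """Merge hourly {partition: Counter} slices into one daily slice."""
    daily = {}
    for hourly in hourly_slices:
        for partition, state in hourly.items():
            # Counter.update adds counts instead of overwriting them.
            daily.setdefault(partition, Counter()).update(state)
    return daily


hourly_slices = [
    {"partition_1": Counter({"vm1": 1}), "partition_n": Counter({"vm2": 2})}
    for _ in range(24)
]
daily = aggregate_daily(hourly_slices)
```

Later runs can then read a single daily slice per past day rather than 24 hourly ones, which cuts the number of small inputs each job has to open.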
Tuning Performance
• Modify the default number of partitions (number of cores x 3)
• Use broadcasting when UDFs with large signatures are reused
• Cache (beware of memory failures)
• Unpersist: remove a DataFrame from the cache when it is no longer needed
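These four levers map onto standard PySpark calls. The helper below is a sketch that takes an existing `SparkSession` and `DataFrame` as arguments, so it imports nothing itself; `tune_and_run`, `lookup`, and `job` are hypothetical names, and applying the cores x 3 rule to `spark.sql.shuffle.partitions` is an assumption about which partition setting the slide refers to.

```python
def tune_and_run(spark, df, lookup, total_cores, job):
    """Apply the tuning steps around a job and release resources afterwards."""
    partitions = total_cores * 3  # rule of thumb from the slide

    # Number of partitions used for shuffles in DataFrame operations.
    spark.conf.set("spark.sql.shuffle.partitions", str(partitions))

    # Ship the large read-only object to each executor once,
    # instead of serializing it with every task that uses it.
    shared = spark.sparkContext.broadcast(lookup)

    # Cache a DataFrame that the job reads more than once.
    cached = df.repartition(partitions).cache()
    try:
        return job(cached, shared)
    finally:
        # Free executor memory as soon as the data is no longer needed.
        cached.unpersist()
        shared.unpersist()
```

Inside `job`, a UDF would read the shared object through `shared.value`. The `finally` block reflects the caching warning on the slide: cached data that is never released competes with the rest of the job for executor memory.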
Summary
• Present the challenges in migrating a large-scale legacy Big Data
system to Spark
• Preserving Semantics
• Real-time constraints
• COGS
• Introducing CyFlow, a framework built over Apache Spark that
allows component reuse & connectivity
• Discuss different validation strategies
• Reducing runtime and COGS: aspects of Spark performance tuning
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

More Related Content

What's hot

Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemDatabricks
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksDatabricks
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestDatabricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...Databricks
 
Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeDatabricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)Apache Apex
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Databricks
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureDatabricks
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...Databricks
 

What's hot (20)

Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving System
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Omid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBaseOmid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBase
 
Google cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache FlinkGoogle cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache Flink
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
 
Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta Lake
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
 

Similar to The Pill for Your Migration Hell

Declarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsDeclarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsMonal Daxini
 
Keynote: Machine Learning for Design Automation at DAC 2018
Keynote:  Machine Learning for Design Automation at DAC 2018Keynote:  Machine Learning for Design Automation at DAC 2018
Keynote: Machine Learning for Design Automation at DAC 2018Manish Pandey
 
Threading Successes 03 Gamebryo
Threading Successes 03   GamebryoThreading Successes 03   Gamebryo
Threading Successes 03 Gamebryoguest40fc7cd
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
BigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexBigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexThomas Weise
 
A Framework to Evaluate the Effectiveness of Different Load Testing Analysis ...
A Framework to Evaluate the Effectiveness of Different Load Testing Analysis ...A Framework to Evaluate the Effectiveness of Different Load Testing Analysis ...
A Framework to Evaluate the Effectiveness of Different Load Testing Analysis ...Zhen Ming (Jack) Jiang
 
Madeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareMadeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareESUG
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaData Con LA
 
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Community
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applicationsKnoldus Inc.
 
Endeca Performance Considerations
Endeca Performance ConsiderationsEndeca Performance Considerations
Endeca Performance ConsiderationsCirrus10
 
Case Study: Credit Card Core System with Exalogic, Exadata, Oracle Cloud Mach...
Case Study: Credit Card Core System with Exalogic, Exadata, Oracle Cloud Mach...Case Study: Credit Card Core System with Exalogic, Exadata, Oracle Cloud Mach...
Case Study: Credit Card Core System with Exalogic, Exadata, Oracle Cloud Mach...Hirofumi Iwasaki
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Toronto-Oracle-Users-Group
 
Spark Streaming Early Warning Use Case
Spark Streaming Early Warning Use CaseSpark Streaming Early Warning Use Case
Spark Streaming Early Warning Use Caserandom_chance
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsThomas Weise
 

Similar to The Pill for Your Migration Hell (20)

Declarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsDeclarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data models
 
Keynote: Machine Learning for Design Automation at DAC 2018
Keynote:  Machine Learning for Design Automation at DAC 2018Keynote:  Machine Learning for Design Automation at DAC 2018
Keynote: Machine Learning for Design Automation at DAC 2018
 
Threading Successes 03 Gamebryo
Threading Successes 03   GamebryoThreading Successes 03   Gamebryo
Threading Successes 03 Gamebryo
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
BigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexBigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache Apex
 
A Framework to Evaluate the Effectiveness of Different Load Testing Analysis ...
A Framework to Evaluate the Effectiveness of Different Load Testing Analysis ...A Framework to Evaluate the Effectiveness of Different Load Testing Analysis ...
A Framework to Evaluate the Effectiveness of Different Load Testing Analysis ...
 
Madeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareMadeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable Hardware
 
COBOL to Apache Spark
COBOL to Apache SparkCOBOL to Apache Spark
COBOL to Apache Spark
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
Endeca Performance Considerations
Endeca Performance ConsiderationsEndeca Performance Considerations
Endeca Performance Considerations
 
Case Study: Credit Card Core System with Exalogic, Exadata, Oracle Cloud Mach...
Case Study: Credit Card Core System with Exalogic, Exadata, Oracle Cloud Mach...Case Study: Credit Card Core System with Exalogic, Exadata, Oracle Cloud Mach...
Case Study: Credit Card Core System with Exalogic, Exadata, Oracle Cloud Mach...
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
 
Spark Streaming Early Warning Use Case
Spark Streaming Early Warning Use CaseSpark Streaming Early Warning Use Case
Spark Streaming Early Warning Use Case
 
bluespec talk
bluespec talkbluespec talk
bluespec talk
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
The Pill for Your Migration Hell

  • 1. The Pill for Your Migration Hell. Roy Levin & Tomer Koren, Microsoft
  • 2. Session Goals: discuss the challenges in migrating a production pipeline from a legacy Big Data platform to Spark; present a methodical approach to accomplishing this task; measure quality, optimize performance and scalability, and decide on the ‘Definition of Done’.
  • 5. Background [architecture diagram; components shown: Raw Data 1..M, Preprocessing, Processed Data 1..K, Detection Pipelines 1..N, Feature Engineering 1..N, Features, Classification-Based Detection, Anomaly Detection, Time Series, State Manager, Current State, Previous States, Alerts, Alert Publisher]
  • 6. Agenda: Discussing the Challenges; Formalization; Defining Metrics & Testing; Tuning Performance; Summarization
  • 8. Challenges: Maintaining Semantics (same input == same output); Real-Time Constraint (over 100 TB per day); Cost of Goods Sold (COGS)
  • 9. Maintaining Semantics. Code conversion: rewrite UDFs (C# -> Python); rewrite transformations (U-SQL -> PySpark). [side-by-side code comparison: U-SQL and PySpark]
  • 10. Maintaining Semantics. Code conversion: different data types; different ML libraries.
  • 11. Maintaining Semantics. Schema changes (two tables, labeled Spark and Legacy):
      Source IP | Destination IP | Source Hostname | Destination Hostname
      1.0.0.1 | 2.0.0.2 | Vm1 | Vm2
      3.0.0.3 | 4.0.0.4 | Vm3 | (empty)

      Source IP | Destination IP | Source Hostname Exist? | Hostname | Host Id
      1.0.0.1 | 2.0.0.2 | True | Vm1 | 111-11
      1.0.0.1 | 2.0.0.2 | False | Vm2 | 222-22
      3.0.0.3 | 4.0.0.4 | True | Vm3 | 333-33
  • 12. Real-Time Constraint: multiple ETL pipelines run on an hourly basis; the largest data feed contains ~4 TB of events per hour; running time should be less than 60 minutes to avoid accumulated latency.
  • 13. Cost of goods sold (COGS)
  • 14. Agenda: Discussing the Challenges; Formalization; Defining Metrics & Testing; Tuning Performance; Summarization
  • 15-16. View Legacy Code in Terms of High-Level Components [same architecture diagram as slide 5 (Background)]
  • 17. View Legacy Code in Terms of High-Level Components [diagram: Cosmos Component, Spark Component]
  • 18. Validating the Migrated Components [diagram: Cosmos Feature Engineering and Spark Feature Engineering; datasets CosmosInputData1, CosmosInputData2, CosmosOutputData1, CosmosOutputData2, SparkOutputData1, SparkOutputData2; annotation: Schema mismatch!]
  • 19. Validating the Migrated Components [diagram: Cosmos Feature Engineering and Spark Feature Engineering; datasets CosmosInputData1, CosmosInputData2, CosmosOutputData1, CosmosOutputData2, SparkOutputData1, SparkOutputData2; two Translators and two Comparators added on the outputs]
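The Translator and Comparator boxes on this slide can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `translate` and `compare`, the column names, and the rows-as-dicts representation are all assumptions made for the example; the real components operate on Cosmos and Spark datasets.

```python
from collections import Counter

def translate(rows, column_map):
    """Rename legacy columns to the migrated schema (hypothetical mapping).

    Columns that have no entry in column_map are dropped.
    """
    return [{column_map[k]: v for k, v in row.items() if k in column_map}
            for row in rows]

def compare(expected, actual, ignore=()):
    """Order-insensitive comparison of two row sets.

    Returns (missing, unexpected): rows only in `expected` and rows only
    in `actual`, after dropping the columns listed in `ignore`.
    """
    def key(row):
        return tuple(sorted((k, v) for k, v in row.items() if k not in ignore))

    exp = Counter(key(r) for r in expected)
    act = Counter(key(r) for r in actual)
    return list((exp - act).elements()), list((act - exp).elements())

# Hypothetical outputs of the legacy and migrated components.
legacy_out = [
    {"SrcIp": "1.0.0.1", "Score": 0.9, "AlertId": "a-1"},
    {"SrcIp": "3.0.0.3", "Score": 0.4, "AlertId": "a-2"},
]
spark_out = [
    {"source_ip": "3.0.0.3", "score": 0.4, "alert_id": "x-7"},
    {"source_ip": "1.0.0.1", "score": 0.9, "alert_id": "x-8"},
]
column_map = {"SrcIp": "source_ip", "Score": "score", "AlertId": "alert_id"}

translated = translate(legacy_out, column_map)
# Randomly generated ids cannot match, so they are excluded from the comparison.
missing, unexpected = compare(translated, spark_out, ignore=("alert_id",))
strict_missing, strict_unexpected = compare(translated, spark_out)
```

Ignoring selected columns is one way to get a meaningful comparison when some values, such as randomly generated ids, can never be reproduced exactly.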
  • 20. Managing It All: this migration process needs to be done for every component of every detection; some components are reused across detections; what about the connections between the components? These dependencies need to be represented and validated as well.
  • 21. Introducing CyFlow: Component Reuse & Connectivity
  • 22. MultiTransformer per Component [diagram: a MultiTransformer between an input MultiDataItem (DataItems name1, name2, …, name_n) and an output MultiDataItem (DataItems name1, name2, …, name_m); DataItem kinds shown: DataFrame, Model]
  • 23. DAG of Components [diagram: several MultiTransformers (MT) linked by dependencies inside a DagMultiTransformer, between an input MultiDataItem (DataItems name1, name2, …, name_n) and an output MultiDataItem (DataItems name1, name2, …, name_m); DataItem kinds shown: DataFrame, Model]
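One way to picture a DAG of MultiTransformers is the sketch below. It is not CyFlow's API, which the slides do not show: the class bodies, constructor arguments, and the use of plain Python lists in place of DataFrames and Models are assumptions for illustration. Dependencies are derived from which component produces each named data item, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) fixes the execution order.

```python
from graphlib import TopologicalSorter

class MultiTransformer:
    """A component mapping named input data items to named output data items."""

    def __init__(self, name, inputs, outputs, fn):
        self.name, self.inputs, self.outputs, self.fn = name, inputs, outputs, fn

    def transform(self, items):
        result = self.fn({k: items[k] for k in self.inputs})
        missing = set(self.outputs) - set(result)
        if missing:
            raise ValueError(f"{self.name} did not produce {sorted(missing)}")
        return {k: result[k] for k in self.outputs}

class DagMultiTransformer:
    """Runs MultiTransformers in dependency order (hypothetical sketch)."""

    def __init__(self, transformers):
        self.transformers = {t.name: t for t in transformers}

    def _order(self):
        # Map every data item name to the component that produces it.
        producer = {out: t.name
                    for t in self.transformers.values() for out in t.outputs}
        # Each component depends on the producers of its inputs.
        graph = {t.name: {producer[i] for i in t.inputs if i in producer}
                 for t in self.transformers.values()}
        return list(TopologicalSorter(graph).static_order())

    def transform(self, items):
        items = dict(items)
        for name in self._order():
            items.update(self.transformers[name].transform(items))
        return items

pre = MultiTransformer("preprocessing", ["raw"], ["processed"],
                       lambda d: {"processed": [x for x in d["raw"] if x is not None]})
fe = MultiTransformer("feature_engineering", ["processed"], ["features"],
                      lambda d: {"features": [x * 2 for x in d["processed"]]})
det = MultiTransformer("detection", ["features"], ["alerts"],
                       lambda d: {"alerts": [x for x in d["features"] if x > 4]})

# Components are registered in arbitrary order; the DAG sorts them.
dag = DagMultiTransformer([det, fe, pre])
result = dag.transform({"raw": [1, None, 3, 5]})
```

Because the connections are explicit, a cycle or an unproduced input surfaces before any data is processed, which is the kind of dependency validation slide 20 asks for.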
  • 24. Stateful MultiTransformers [diagram: a StatefulMultiTransformer with inputs dataset1, dataset2 and the slice from iteration i-1, producing the slice of iteration i of output-dataset1]
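The stateful pattern on this slide, where iteration i reads the slice written by iteration i-1, can be sketched as below. The state here is a per-resource event count held in a dict; that choice, and the name `stateful_step`, are assumptions for illustration only, since the slides do not describe the actual state contents.

```python
def stateful_step(previous_state, hourly_events):
    """Build the state slice for iteration i from the slice of iteration i-1."""
    state = dict(previous_state)  # never mutate the previous slice
    for resource in hourly_events:
        state[resource] = state.get(resource, 0) + 1
    return state

# Three hypothetical hourly batches of resource ids.
batches = [["vm1", "vm2"], ["vm1"], ["vm3", "vm1"]]

state = {}    # empty state before the first iteration
slices = []   # one persisted slice per iteration
for events in batches:
    state = stateful_step(state, events)
    slices.append(state)
```

Keeping every slice immutable means any iteration can be re-run from its predecessor, which helps when comparing the migrated output with the legacy one hour by hour.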
  • 25-27. CyFlow. Using notebooks: without CyFlow vs. with CyFlow (table built up across three slides):
      Aspect | Without CyFlow | With CyFlow
      Deployment | Multiple notebooks, one per component | Deploy the DAG
      Unit testing | No framework | Standalone Spark with pyunit
      Shared utility code | Import whl files (hard to maintain) | Use a repository
      Structure | Implicit, according to schedules | Explicit and visually depicted
      Typing | No schema checks | Schema checks before running
  • 28. Agenda: Discussing the Challenges; Formalization; Defining Metrics & Testing; Tuning Performance; Summarization
  • 29. Validating the Migrated Components, revisited [diagram: Cosmos Feature Engineering and Spark Feature Engineering; datasets CosmosInputData1, CosmosInputData2, CosmosOutputData1, CosmosOutputData2, SparkOutputData1, SparkOutputData2; two Translators and two Comparators on the outputs]
  • 30. Validation: A Closer Look [diagram: CosmosOutputData2, SparkOutputData2, Comparator]. Recall our challenges: legacy ML models; non-reproducible ML models; non-deterministic semantics (e.g., based on row numbers); some non-translatable schema changes; some randomly generated UUIDs (e.g., AlertIds). How much to invest in achieving full parity?
  • 31. Decide Based on a Soft Metric of the Final Output.
      Alerts generated by the Legacy component ($y$), Resource Id: Res1, Res2, Res3, Res4, Res5, Res6, Res7.
      Alerts generated by the Spark component ($y'$), Resource Id: Res1, Res3, Res4, Res5, Res6, Res8, Res9, Res10.
      Jaccard similarity: $js = \frac{|y \cap y'|}{|y \cup y'|} = \frac{5}{10} = 0.50$
      Precision: $pr = \frac{|y \cap y'|}{|y'|} = \frac{5}{8} \approx 0.63$
      Recall: $rc = \frac{|y \cap y'|}{|y|} = \frac{5}{7} \approx 0.71$
      For $y \cap y'$ we write an alert content validator (for the rest of the column values).
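The three soft metrics can be computed directly from the two sets of alerted resources. The sketch below uses the resource ids listed on this slide; the function name `soft_metrics` is an assumption, and both alert sets are assumed to be non-empty. For these lists the intersection has 5 resources and the union has 10, so the Jaccard similarity comes out at 0.5.

```python
def soft_metrics(legacy_alerts, spark_alerts):
    """Jaccard similarity, precision and recall of Spark alerts vs. legacy alerts."""
    y, y_new = set(legacy_alerts), set(spark_alerts)
    common = y & y_new
    return {
        "jaccard": len(common) / len(y | y_new),
        "precision": len(common) / len(y_new),  # share of Spark alerts also raised by legacy
        "recall": len(common) / len(y),         # share of legacy alerts reproduced by Spark
    }

legacy = [f"Res{i}" for i in range(1, 8)]  # Res1 .. Res7
spark = ["Res1", "Res3", "Res4", "Res5", "Res6", "Res8", "Res9", "Res10"]

metrics = soft_metrics(legacy, spark)
# jaccard = 5/10 = 0.5, precision = 5/8 = 0.625, recall = 5/7
```

Treating the legacy output as the reference makes precision and recall read the usual way: low precision means the migrated component raises extra alerts, low recall means it misses alerts the legacy component raised.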
  • 32. Agenda: Discussing the Challenges; Formalization; Defining Metrics & Testing; Tuning Performance; Summarization
  • 33. Measure Running Time on Actual Load: 14 hours!! [diagram: CyFlow DAG with Feature Engineering, Classification-Based Detection and Alert Publisher; datasets Processed Data 1, Processed Data 2, Published Alerts]
  • 34. Finding the Culprit [diagram: Feature Engineering split into Transformation 1 and Transformation 2, with Intermediate Result 1 and Intermediate Result 2; measured running times: 20 minutes and 13 hours!!]
  • 35. Tuning Performance: aggregate hourly state into daily state (end of day). [diagram: 24 hourly slices (Partition 1..N State at 12:00 AM, 01:00 AM, 02:00 AM, …, 11:00 PM) feed a Daily Aggregator, which produces Partition 1..N State Daily]
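The end-of-day aggregation can be sketched as folding 24 hourly state slices into one daily slice. The slides do not say how states are merged, so this example assumes additive per-resource counts; the function name `aggregate_daily` and the dict-based state are likewise illustrative assumptions.

```python
from collections import Counter

HOURS_PER_DAY = 24

def aggregate_daily(hourly_states):
    """Merge 24 hourly state slices into one daily slice (additive counts assumed)."""
    if len(hourly_states) != HOURS_PER_DAY:
        raise ValueError(f"expected {HOURS_PER_DAY} hourly slices, got {len(hourly_states)}")
    daily = Counter()
    for state in hourly_states:
        daily.update(state)  # Counter.update adds counts key by key
    return dict(daily)

# Hypothetical hourly slices for a single partition.
hourly = [{"vm1": 1} for _ in range(HOURS_PER_DAY)]
hourly[5] = {"vm1": 2, "vm2": 4}

daily_state = aggregate_daily(hourly)
```

With a daily slice available, a job that looks back over many days would read one slice per past day rather than 24, which is presumably where the runtime saving comes from.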
  • 36. Tuning Performance: modify the default number of partitions (number of cores x 3); use broadcasting when UDFs with large signatures are reused; cache (beware of memory failures); unpersist: remove a DataFrame from the cache when it is no longer needed.
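The tuning steps above might look roughly like the helpers below. This is a sketch under stated assumptions, not the authors' code: it assumes "default partition" refers to `spark.sql.shuffle.partitions`, it approximates the core count with `sparkContext.defaultParallelism`, and the helper names are hypothetical. The helpers only call methods that exist on a PySpark `SparkSession` and `DataFrame` (`conf.set`, `sparkContext.broadcast`, `cache`, `unpersist`), so they take those objects as arguments and need no import themselves.

```python
def tune_shuffle_partitions(spark, factor=3):
    """Set the shuffle partition count to (available cores x factor)."""
    cores = spark.sparkContext.defaultParallelism  # approximation of total cores
    partitions = cores * factor
    spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
    return partitions

def broadcast_lookup(spark, lookup):
    """Ship a large read-only object to the executors once, for reuse inside UDFs."""
    return spark.sparkContext.broadcast(lookup)  # read it in a UDF via .value

def run_with_cache(df, *actions):
    """Cache df while several actions reuse it, then release the memory."""
    df = df.cache()
    try:
        return [action(df) for action in actions]
    finally:
        df.unpersist()  # free executor memory as soon as df is no longer needed
```

A typical call would be `run_with_cache(features_df, lambda d: d.count(), lambda d: d.write.parquet(path))`, where `features_df` and `path` are placeholders; the `finally` clause makes sure the cache is released even if one of the actions fails.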
  • 37. Agenda: Discussing the Challenges; Formalization; Defining Metrics & Testing; Tuning Performance; Summarization
  • 38. Summary: presented the challenges in migrating a large-scale legacy Big Data system to Spark (preserving semantics, real-time constraints, COGS); introduced CyFlow, a framework built over Apache Spark that allows component reuse & connectivity; discussed different validation strategies; reducing runtime and COGS: aspects of Spark performance tuning.
  • 39. Feedback: your feedback is important to us. Don’t forget to rate and review the sessions.