FlinkML: Large-scale Machine
Learning with Apache Flink
Theodore Vasiloudis, SICS
SICS Data Science Day
October 21st, 2015
Apache Flink
What is Apache Flink?
● Large-scale data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
○ true streaming dataflow engine
○ custom memory manager
○ native iterations
○ cost-based optimizer
What is Apache Flink?
What does Flink give us?
● Expressive APIs
● Pipelined stream processor
● Closed loop iterations
Expressive APIs
● Main distributed data abstraction: DataSet
● Program using functional-style transformations, creating a Dataflow.
case class Word(word: String, frequency: Int)
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word").sum("frequency")
  .print()
Pipelined Stream Processor
Iterate in the Dataflow
Iterate by looping
● A loop in the client submits one job per iteration step
● Reuse data by caching in memory or disk
Iterate in the Dataflow
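A minimal sketch of iterating inside the dataflow with Flink's batch Scala API: the loop is part of the job graph, so a single job runs all iteration steps. Here gradientStep is a placeholder update (not part of Flink); iterate and map are the actual DataSet operations.

import org.apache.flink.api.scala._

object NativeIterationSketch {
  // Placeholder update step; a real job would compute a gradient from the training data.
  def gradientStep(w: Array[Double]): Array[Double] = w.map(_ * 0.9)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val initial: DataSet[Array[Double]] = env.fromElements(Array(1.0, 1.0))

    // The loop lives inside the dataflow: one job, 100 iteration steps,
    // no per-step job submission and no re-reading of cached data.
    val result = initial.iterate(100) { weights => weights.map(gradientStep _) }

    result.print()
  }
}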
Delta iterations
Learn more in Vasia’s Gelly talk!
Large-scale Machine Learning
What do we mean?
● Small-scale learning
○ We have a small-scale learning problem when the active budget constraint is the number of examples.
● Large-scale learning
○ We have a large-scale learning problem when the active budget constraint is the computing time.
Source: Léon Bottou
What do we mean?
● What about the complexity of the problem?
Deep learning
“When you get to a trillion [parameters], you’re getting to something that’s got a chance of really understanding some stuff.” - Hinton, 2013
Source: Wired Magazine
What do we mean?
● We have a large-scale learning problem when the active budget constraint is the computing time and/or the model complexity.
FlinkML
● New effort to bring large-scale machine learning to Flink
● Goals:
○ Truly scalable implementations
○ Keep glue code to a minimum
○ Ease of use
FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ SVM
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
● Pre-processing
○ Polynomial features
○ Feature scaling
● sklearn-like ML pipelines
FlinkML API
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...

val mlr = MultipleLinearRegression()
  .setStepsize(0.01)
  .setIterations(100)
  .setConvergenceThreshold(0.001)

mlr.fit(trainingData)

// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
FlinkML Pipelines
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()

// Construct pipeline of standard scaler, polynomial features and multiple linear regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)

// Train pipeline
pipeline.fit(trainingData)

// Calculate predictions
val predictions = pipeline.predict(testingData)
State of the art in large-scale ML
Alternating Least Squares
R (Users ✕ Items) ≅ X ✕ Y
Naive Alternating Least Squares
Blocked Alternating Least Squares
Blocked ALS performance
FlinkML blocked ALS performance
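A minimal usage sketch of FlinkML's ALS on (user, item, rating) triples. The data and parameter values here are illustrative only, not those used in the benchmark above.

import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment

// Ratings as (userID, itemID, rating) triples
val ratings: DataSet[(Int, Int, Double)] = env.fromElements(
  (1, 10, 4.0), (1, 11, 2.0), (2, 10, 5.0))

val als = ALS()
  .setNumFactors(10)
  .setIterations(10)
  .setLambda(0.1)
  .setBlocks(4) // number of blocks used by the blocked ALS implementation

als.fit(ratings)

// Predict ratings for unseen (user, item) pairs
val predictions: DataSet[(Int, Int, Double)] = als.predict(env.fromElements((2, 11)))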
Going beyond SGD in large-scale optimization
● Beyond SGD → Use Primal-Dual framework
● Slow updates → Immediately apply local updates
● Average over batch size → Average over K (nodes) << batch size
CoCoA: Communication-Efficient Coordinate Ascent
Primal-dual framework
Source: Smith (2014)
Immediately Apply Updates
Source: Smith (2014)
Average over nodes (K) instead of batches
Source: Smith (2014)
CoCoA: Communication-Efficient Coordinate Ascent
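Putting the three ideas together, a schematic of the CoCoA outer loop might look like the following. This is a sketch of the algorithmic structure only, not FlinkML's SVM implementation; localDualAscent stands in for the approximate local solver (e.g. a few SDCA passes).

object CoCoASketch {
  // Schematic CoCoA outer loop over K data partitions.
  def cocoa(
      partitions: Seq[Array[(Array[Double], Double)]], // K local datasets of (features, label)
      outerIterations: Int,
      dims: Int): Array[Double] = {
    var w = Array.fill(dims)(0.0)
    val k = partitions.size
    for (_ <- 1 to outerIterations) {
      // Each node solves its local dual subproblem approximately and applies its
      // updates locally right away, returning its change to the primal vector.
      val deltas = partitions.map(local => localDualAscent(local, w))
      // Communication step: average the K local changes (over K nodes, not over the batch size).
      w = w.indices.map(i => w(i) + deltas.map(_(i)).sum / k).toArray
    }
    w
  }

  // Placeholder for the approximate local solver.
  def localDualAscent(data: Array[(Array[Double], Double)], w: Array[Double]): Array[Double] =
    Array.fill(w.length)(0.0)
}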
CoCoA performance
Source: Jaggi (2014)
Available on FlinkML: SVM
Achieving model parallelism: The parameter server
● The parameter server is essentially a distributed key-value store with two basic commands: push and pull
○ push updates the model
○ pull retrieves a (lazily) updated model
● Allows us to store a model across multiple nodes, and to read and update it as needed (see the interface sketch below).
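A minimal sketch of that interface; the names here are illustrative and not the API of Li et al.'s system.

// Parameter-server interface: a distributed key-value store with push and pull.
trait ParameterServer[K, V] {
  def push(updates: Map[K, V]): Unit // send model updates to the servers
  def pull(keys: Seq[K]): Map[K, V]  // retrieve the (lazily) updated values
}

// Toy single-process stand-in that applies pushed gradients additively,
// just to make the contract concrete.
class LocalParameterServer extends ParameterServer[Int, Double] {
  private val model = scala.collection.mutable.Map.empty[Int, Double].withDefaultValue(0.0)

  override def push(updates: Map[Int, Double]): Unit =
    updates.foreach { case (key, delta) => model(key) += delta }

  override def pull(keys: Seq[Int]): Map[Int, Double] =
    keys.map(key => key -> model(key)).toMap
}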
Architecture of a parameter server communicating with groups of workers.
Source: Li (2014)
Comparison with other large-scale learning systems.
Source: Li (2014)
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous Parallel
○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
○ Every worker can work individually, updating the model as needed.
○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
○ Relax the constraints, so the slowest workers can be up to K iterations behind the fastest ones (sketched below).
○ Allows for progress, while keeping convergence guarantees.
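A toy sketch of the bounded-staleness rule itself: a worker may start its next iteration only while it is at most `staleness` iterations ahead of the slowest worker. Names and structure are illustrative, not taken from any specific system.

// Tracks per-worker iteration counts and enforces the bounded-staleness rule.
class SspClock(staleness: Int, numWorkers: Int) {
  private val clocks = Array.fill(numWorkers)(0)

  // Called by a worker when it finishes an iteration.
  def tick(workerId: Int): Unit = clocks(workerId) += 1

  // BSP is the special case staleness = 0; ASP corresponds to unbounded staleness.
  def mayProceed(workerId: Int): Boolean =
    clocks(workerId) - clocks.min <= staleness
}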
Dealing with stragglers: SSP Iterations
Source: Ho et al. (2013)
SSP Iterations in Flink: Lasso Regression
Source: Peel et al. (2015)
To be merged soon into FlinkML
Current and future work on FlinkML
Coming soon
● Tooling
○ Evaluation & cross-validation framework
○ Predictive Model Markup Language
● Algorithms
○ Quad-tree kNN search
○ Efficient streaming decision trees
○ k-means and extensions
○ Column-wise statistics, histograms
FlinkML Roadmap
● Hyper-parameter optimization
● More communication-efficient optimization algorithms
● Generalized Linear Models
● Latent Dirichlet Allocation
Future of FlinkML
● Streaming ML
○ Flink already has SAMOA bindings.
○ We plan to kickstart the streaming ML library of Flink, and develop new algorithms.
● “Computation efficient” learning
○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.
Recent large-scale learning systems
Source: Xing (2015)
How to get here?
Demo?
Thank you.
@thvasilo
tvas@sics.se
References
● Flink Project: flink.apache.org
● FlinkML Docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Léon Bottou: Learning with Large Datasets
● Wired: Computer Brain Escapes Google's X Lab to Supercharge Search
● Smith (2014): CoCoA AMPCAMP Presentation
● CMU Petuum: Petuum Project
● Jaggi (2014): “Communication-Efficient Distributed Dual Coordinate Ascent.” NIPS 2014.
● Li (2014): “Scaling Distributed Machine Learning with the Parameter Server.” OSDI 2014.
● Ho (2013): “More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server.” NIPS 2013.
● Peel (2015): “Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism.” IEEE BigData 2015.
● Xing (2015): “Petuum: A New Platform for Distributed Machine Learning on Big Data.” KDD 2015.
I would like to thank Professor Eric Xing for his permission to use parts of the structure from his great tutorial on large-scale machine learning: A New Look at the System, Algorithm and Theory Foundations of Distributed Machine Learning.
