Build a Deep Learning Pipeline on Apache Spark for Ads Optimization
1. Build Deep Learning Pipelines on Apache Spark for Ads Optimization
Big Data Consultant & Senior Data Scientist
Craig Chao
chaocraig@gmail.com
Slideshare: Craig Chao
2. Agenda
! Prolog
! Data Become a Weapon of New Colonialism
! Why Not TensorFlow but Deep Learning on Apache Spark?
! Data Engineer * Data Science
! ML Pipelines on Apache Spark
! ML & DL for Ads Optimization
! Deep Learning on Apache Spark
! Conclusion
3. Prolog
! Data Become a Weapon of New Colonialism
! Why Not TensorFlow but Deep Learning on Apache Spark?
! Data Engineer * Data Science
4. Data Become a Weapon of New Colonialism
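SF Express and Cainiao cut off each other's data interfaces
Who owns the user data from Tencent apps on Huawei phones?
iFlytek, hailed by MIT in the US as "China's smartest company"
A face-recognition "food-stealing gadget"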
A Judge Just Ordered LinkedIn to Allow Scraping (08/2017)
5. Data Become a Weapon of New Colonialism
Src: https://twitter.com/jason_kint/
Src: https://www.iab.com/insights/iab-internet-advertising-revenue-report-conducted-by-pricewaterhousecoopers-pwc-2/
10. Data Developer/Engineer vs. Data Scientist
Src: https://www.stitchdata.com/resources/reports/the-state-of-data-engineering/ https://www.oreilly.com/ideas/2016-data-science-salary-survey-results
5 ~ 10 : 1
11. ML Pipelines on Apache Spark
Src: https://dzone.com/articles/distingish-pop-music-from-heavy-metal-using-apache6
12. ML Pipelines on Apache Spark
! DataFrame
  ! An ML dataset holding a variety of data types
! Transformer
  ! An algorithm that transforms one DataFrame into another DataFrame
! Estimator
  ! An algorithm that is fit on a DataFrame to produce a Transformer
! Pipeline
  ! Chains multiple Transformers and Estimators together to specify an ML workflow
! Parameter
  ! Parameters belong to specific instances of Estimators and Transformers
  ! Any parameters in the ParamMap override parameters previously specified via setter methods
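The relationship between these abstractions can be sketched without Spark at all. The following is a toy, library-free imitation, assuming rows are plain dictionaries; the class names `Tokenizer`, `KeywordClassifier`, `KeywordModel`, `Pipeline`, `PipelineModel` and the `min_count` parameter are hypothetical illustrations and not the `pyspark.ml` API.

```python
# Toy imitation of the pipeline concepts above (NOT the pyspark.ml API).
# A "dataset" here is just a list of dicts; all names are hypothetical.

class Tokenizer:
    """Transformer: turns one dataset into another by adding a 'words' column."""
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class KeywordModel:
    """Transformer produced by fitting KeywordClassifier."""
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [dict(r, prediction=int(any(w in self.keywords for w in r["words"])))
                for r in rows]


class KeywordClassifier:
    """Estimator: fit() learns from a dataset and returns a Transformer."""
    def __init__(self):
        self.params = {"min_count": 1}

    def set_min_count(self, value):  # setter-style parameter
        self.params["min_count"] = value
        return self

    def fit(self, rows, param_map=None):
        # Values in param_map override values set earlier through setters.
        params = {**self.params, **(param_map or {})}
        positive, negative = {}, set()
        for r in rows:
            for w in set(r["words"]):
                if r["label"] == 1:
                    positive[w] = positive.get(w, 0) + 1
                else:
                    negative.add(w)
        keywords = {w for w, c in positive.items()
                    if c >= params["min_count"] and w not in negative}
        return KeywordModel(keywords)


class PipelineModel:
    """Fitted pipeline: a chain of Transformers only."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains Transformers and Estimators; fitting replaces Estimators by their models."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows, param_map=None):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows, param_map)
            fitted.append(stage)
            rows = stage.transform(rows)
        return PipelineModel(fitted)


train = [
    {"text": "click buy now", "label": 1},
    {"text": "buy cheap ads", "label": 1},
    {"text": "read the news", "label": 0},
]
pipeline = Pipeline([Tokenizer(), KeywordClassifier().set_min_count(1)])
model = pipeline.fit(train, param_map={"min_count": 2})  # overrides the setter
predictions = model.transform([{"text": "Buy today"}, {"text": "click here"}])
print([p["prediction"] for p in predictions])
```

Because the parameter map raises `min_count` to 2, only "buy" (seen in both positive rows) survives as a keyword, so the second test row is not flagged even though the setter alone would have allowed "click".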
13. ML Pipelines on Apache Spark
Src: https://dzone.com/articles/distingish-pop-music-from-heavy-metal-using-apache6
14. ML Pipelines on Apache Spark
Raw unknown lyrics → After Cleanser → After StopWordsRemover → After Stemmer → After Word2Vec → After LogisticRegression → Pop or Heavy Metal?
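The first three stages of this flow can be illustrated with plain Python. This is a minimal sketch under stated assumptions: the stop-word list, the suffix-stripping rule and the sample lyric are all made up for illustration, and the Word2Vec and LogisticRegression stages are left out because they need a trained model.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the talk).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "my", "i"}


def cleanse(text):
    """Cleanser: lower-case the text and replace anything but letters/whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def remove_stop_words(words):
    """StopWordsRemover: drop very common words that carry little signal."""
    return [w for w in words if w not in STOP_WORDS]


def stem(word):
    """Stemmer: very crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(lyrics):
    """Run the three text stages in order, as a pipeline would."""
    words = cleanse(lyrics).split()
    return [stem(w) for w in remove_stop_words(words)]


tokens = preprocess("Screaming in the night, I burned the bridges!")
print(tokens)
```

Each function corresponds to one stage, so swapping in a better stemmer or stop-word list only touches that stage, which is the reuse argument behind pipelines.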
20. ML & DL for Ads Optimization
        Rose  Navy  Olive
Alice    0    +4     0
Bob      0     0    +2
Carol   -1     0    -2
Dave    +3     0     0
[Figure labels: (Alice), (Blue), (Navy), (Periwinkle)]
21. ML & DL for Ads Optimization
• Optimizing X and Y simultaneously is non-convex and hard
• If X or Y is fixed, the problem becomes a system of linear equations: convex and easy
• Initialize Y with random values
• Solve for X
• Fix X, solve for Y
• Repeat ("Alternating")
[Figure labels: X, Yᵀ]
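The alternating steps above can be sketched in a few lines. This is a simplified illustration, not the implementation used in the talk: it assumes a single latent factor (rank 1), so each "solve" collapses from a linear system to a scalar division, and it treats every cell of the small rating matrix from the earlier slide as observed. The regularization strength `LAMBDA` is an assumed value.

```python
import random

# Rating matrix from the earlier slide
# (rows: Alice, Bob, Carol, Dave; columns: Rose, Navy, Olive).
R = [
    [0, 4, 0],
    [0, 0, 2],
    [-1, 0, -2],
    [3, 0, 0],
]
LAMBDA = 0.1  # assumed regularization strength


def loss(R, x, y):
    """Regularized squared error of the rank-1 approximation R[u][i] ~ x[u] * y[i]."""
    err = sum((R[u][i] - x[u] * y[i]) ** 2
              for u in range(len(x)) for i in range(len(y)))
    reg = LAMBDA * (sum(v * v for v in x) + sum(v * v for v in y))
    return err + reg


def solve_users(R, y):
    """With Y fixed, each x[u] has a closed-form least-squares solution."""
    denom = sum(v * v for v in y) + LAMBDA
    return [sum(R[u][i] * y[i] for i in range(len(y))) / denom
            for u in range(len(R))]


def solve_items(R, x):
    """With X fixed, each y[i] has a closed-form least-squares solution."""
    denom = sum(v * v for v in x) + LAMBDA
    return [sum(R[u][i] * x[u] for u in range(len(x))) / denom
            for i in range(len(R[0]))]


rng = random.Random(0)
y = [rng.uniform(-1, 1) for _ in range(3)]  # initialize Y with random values
x = [0.0] * 4
losses = []
for _ in range(10):                          # repeat ("alternating")
    x = solve_users(R, y)                    # fix Y, solve for X
    y = solve_items(R, x)                    # fix X, solve for Y
    losses.append(loss(R, x, y))

print(losses[0], losses[-1])
```

Since each half-step exactly minimizes the objective with the other factor held fixed, the recorded loss never increases from one sweep to the next. With more than one latent factor, the scalar divisions become small regularized linear systems, one per user or item.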
22. ML & DL for Ads Optimization
$A_{m \times n} = S_{m \times k}\,\Sigma_{k \times k}\,T'_{k \times n}$
[Figure panels: Singular Value Decomposition (SVD); Context-aware Matrix Factorization]
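As a worked note on the SVD panel, assuming the dimension labels in the diagram mean that $A$ is $m \times n$ and that only $k$ singular values are kept, a single predicted entry is a sum over the $k$ latent factors. The entry-level symbols $s_{uf}$, $\sigma_f$ and $t_{if}$ are introduced here for illustration and do not appear on the slide.

```latex
A_{m \times n} \;\approx\; S_{m \times k}\,\Sigma_{k \times k}\,T'_{k \times n},
\qquad
\hat{a}_{ui} \;=\; \sum_{f=1}^{k} s_{uf}\,\sigma_f\,t_{if},
\qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
\sigma_1 \ge \dots \ge \sigma_k \ge 0
```

Keeping the $k$ largest singular values gives the best rank-$k$ approximation of $A$ in the Frobenius norm, which is why a small $k$ can still capture most of the structure in a rating matrix.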
24. ML & DL for Ads Optimization
DeepWalk (2014)
A Multi-View Deep Learning (2015)
25. ML & DL for Ads Optimization
Wide & Deep Learning Models (Google, 2016)
Deep Candidate Generation Model (YouTube, 2016)
Session-based Recommendation with RNN (2016)
27. Deep Learning on Apache Spark
Apache SystemML
! Apache Top-Level Project
! Declarative Large-Scale Machine Learning
! OS: Linux, macOS, Windows
! Written in: Java
! Open-sourced by IBM in 2015
A machine learning platform optimal for big data
28. Deep Learning on Apache Spark
Apache SystemML
https://github.com/dusenberrymw/systemml-nn/blob/master/nn/examples/mnist_lenet.dml
Built-in NN modules
30. Deep Learning on Apache Spark: MS MMLSpark
! Seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV
! CNTK Model Gallery
! https://www.microsoft.com/en-us/cognitive-toolkit/features/model-gallery/
! Including GAN, Reinforcement Learning, ResNet152…
31. Deep Learning on Apache Spark: MS MMLSpark
It implicitly converts the data into the format expected by the algorithm: it tokenizes and hashes strings, one-hot encodes categorical variables, assembles the features into a vector, and so on.
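To make the implicit conversions concrete, here is a hand-written sketch of the same three steps in plain Python. It is not MMLSpark's API: the function names, the bucket count, the column names `ad_text`, `device` and `bid`, and the sample row are all assumptions chosen for illustration.

```python
import hashlib

NUM_HASH_BUCKETS = 8  # assumed size of the hashed text feature space


def hash_tokens(text, num_buckets=NUM_HASH_BUCKETS):
    """Tokenize on whitespace and count tokens per hash bucket (the hashing trick)."""
    vec = [0.0] * num_buckets
    for token in text.lower().split():
        # md5 gives a hash that is stable across runs, unlike Python's built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % num_buckets] += 1.0
    return vec


def one_hot(value, categories):
    """One-hot encode a categorical value against a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]


def featurize(row, categories):
    """Assemble hashed text, one-hot category and numeric column into one vector."""
    return (hash_tokens(row["ad_text"])
            + one_hot(row["device"], categories)
            + [float(row["bid"])])


DEVICES = ["desktop", "mobile", "tablet"]  # hypothetical category list
row = {"ad_text": "Cheap flights to Tokyo", "device": "mobile", "bid": 0.35}
features = featurize(row, DEVICES)
print(len(features), features)
```

The value of doing this implicitly is that the model stage always receives a single numeric vector, regardless of how many string, categorical or numeric columns the raw ad data contains.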
32. Deep Learning on Apache Spark: MS MMLSpark
ML Pipeline to evaluate CNTK model.
Windows Azure Storage Blob
33. Deep Learning on Apache Spark: Databricks
! Founded by the creators of Apache Spark; CEO Ali Ghodsi is an adjunct professor at UC Berkeley
! Total funding: $100M+
! Imports models from TF, MXNet, Keras, PyTorch, Caffe, CNTK, Theano, JCuda
35. Deep Learning on Apache Spark: Databricks
Build a NN model from scratch
Easy on a driver-only cluster, complicated on distributed nodes.
36. Deep Learning on Apache Spark: DL4J
! DeepLearning4J is a Java-based toolkit for building, training and deploying neural networks
! An open-source, distributed deep-learning project in Java and Scala, spearheaded by the people at Skymind
! ND4J is the Java scientific computing engine powering its matrix manipulations; ND4S is its Scala wrapper
! Includes RL and model import from Keras (Theano, TensorFlow, Caffe and CNTK)
Machine learning models are served in production with Skymind's model server.
Secure, Scalable, Stable, Debuggable, Certified
38. Deep Learning on Apache Spark: BigDL
! A distributed deep learning library for Apache Spark released by Intel®
! Can load pre-trained Caffe or Torch models
! Uses Intel MKL (Intel® Math Kernel Library) and multi-threaded programming in each Spark task
39. Deep Learning on Apache Spark: BigDL
Build a NN model from scratch
40. Deep Learning on Apache Spark
                        | BigDL | DL4J | Databricks | MMLSpark | SystemML
Vendor                  | Intel | DeepLearning4J | Databricks | Microsoft | Apache
Pre-trained models      | Caffe/Torch/TensorFlow | Keras, TensorFlow, Caffe and Theano | TF, MXNet, Keras, PyTorch, Caffe, CNTK, Theano, JCuda | CNTK Gallery/Keras | DML/Caffe2DML
Train a NN from scratch | Y | Y | Y | N | Y / DML
Notebook                | Python/Scala | Scala / Reactive | Python/Scala/R/SQL | Python/Scala | Python/Scala
Free                    | Y | N / if model server | N | Y | Y
Usability               | High | High | High | Middle | Low
Docker                  | Y | Y / Spark Notebook | N | Y | Y
Cloud                   | Y / (AWS, Azure, Cloudera…) | N | Y / AWS | Azure | N
Source: Craig Chao, DataConf 2017
41. Conclusions
! Data Wars
! Unified Data Platform
! Data Engineers/Developers are key roles
! Reusable/Portable ML Pipelines
! DL has deep layers of hidden factors
! DL models for Ads/RecSys
! Code-level intro to DL solutions on Apache Spark