SlideShare a Scribd company logo
1 of 28
Download to read offline
Ruifeng Zheng, JD.COM
Yanbo Liang, Hortonworks
Apache Spark and Machine
Learning Boosts Revenue
Growth for Online Retailers
#MLSAIS16
About us
Ruifeng Zheng ruifengz@foxmail.com
– Senior Software Engineer in Intelligent Advertising Lab at JD.COM
– Apache Spark, Scikit-Learn & XGBoost contributor
– SparkLibFM & SparkGBM Author
Yanbo Liang ybliang8@gmail.com
– Staff Software Engineer at Hortonworks
– Apache Spark PMC member
– Tensorflow & XGBoost contributor
2#MLSAIS16
Outline
• What are the problems?
• How we solve it?
• The lessons learned.
• Where is the gap?
• Enhancements
– ALS with warm start
– SparkGBM – a new GBM impl atop Spark
• Future work
3#MLSAIS16
About JD.com & Wiwin
JD.com
China’s largest online retailer
China’s largest e-commerce delivery system
300+ million active users
Billions of SKUs on shelves, in thousands of categories
WiWin Team in Business Growth Dept.
Supply Data-Mining Services for Top Brands
4#MLSAIS16
Business scenarios
– User Segmentation
– Cross Selling
– Purchase Prediction
5#MLSAIS16
User Segmentation
Demand: Help Brands to measure marketing
campaigns (beyond ROI and GMV)
6#MLSAIS16
Impression,
Click, View,
Search, Ask,
Follow, Order,
Comment,
Cart, etc
Input
4A Model
4A status of
each user
Aware
Appeal Act
Advocate
User Segmentation
7#MLSAIS16
V1: RDD only
V2: DataFrame + RDD(only used in complexed operations)
~3.2x speedup
save 40% memory footprint
Cross-Selling
Demand: Help Brands to find potential co-operators
among millions of brands
8#MLSAIS16
Cross-Selling
9#MLSAIS16
Flexible pattern #SPARK-13385
rules like {A,B,C} -> {X,Y}
Measure how many more
times X and Y occur together
than expected
Purchase Prediction
Demand: Help Brands to better target potential
users
10#MLSAIS16
Different from tradition recommendation:
– For each user, select several items
– For several items, select millions of
users
Purchase Prediction - Ranking
11#MLSAIS16
Pipeline
12#MLSAIS16
CRISP-DM
In-house data
processing
toolchain
Spark-Shell (Jupyter)
Spark-SQL
MLlib
GraphX
MLlib
Spark-Streaming
Lessons Learned - 1
Multi-Column processing
– Imputer
#SPARK-21690
– ApproxQuantile
#SPARK-14352
– Bucketizer
#SPARK-22797
13#MLSAIS16
0
1
2
3
4
5
6
7
8
1 col 10 cols 100 cols
Training Time of Imputer(sec)
One-Col Multi-Col
Lessons Learned - 2
RDD & DataFrame are Complementary
ETL and data transformation -> DataFrame
Complex logic containing lots of aggregation -> RDD
14#MLSAIS16
Lessons Learned - 3
Parallelized Cross-Validation
15#MLSAIS16
https://bryancutler.github.io/cv-parallel/
0
10000
20000
30000
40000
50000
60000
0M 1M 2M 3M 4M 5M 6M 7M 8M
Serial Parallel=5
GAP Warm Start
– Resume training
– Accelerate convergence
– Stable solution
Callback after each iteration
– Early stop
– Model checkpoint
Compact Numeric Format
16#MLSAIS16
ALS
GBM
ALS
17#MLSAIS16
ALS – Warm start
18#MLSAIS16
Item factors
in solution T
Initial Item factors to
train solution T+1
Randomized
C off shelves
D on shelves
ALS – Warm start
0.8
1.6
1 2 3 4 5 6 7 8 9 10
RMSE
ALS ALS (warm start)
19#MLSAIS16
0.83
0.80
Save ~40% training time
GBM - Life is short, you need GBM
Objective	in	t-th Iteration:
2345 6 = 8
9:;
5<;
= >9, @>9
5<;
+ B5 C9 + Ω(B5)
20#MLSAIS16
previous
prediction
training loss: how well
model fit on training
data
regularization:
control model
complexity
base model
to be added
in Iteration t
GBM - Impls
21#MLSAIS16
•Tree as base model
•First-order approximation
GBT
•Second-order
approximation
•L1 & L2 regularization
•Shrinkage
•Column sampling
•Sparsity-aware split
finding
XGBoost •Histogram subtraction
•Discrete bins
LightGBM
DMLC-Rabit
MS-DMTK
Dedicated ML frameworks result in extra costs in
Deployment, Maintenance & Monitoring
SparkGBM https://github.com/zhengruifeng/SparkGBM
To be a scalable and efficient GBM atop Spark
22#MLSAIS16
Second-order approximation
L1 & L2 regularization
Shrinkage
Column sampling
Sparsity-aware
Binned data
Histogram subtraction
Codebase legacy:
Model save/load
Periodic checkpointer
…
SparkGBM - Features
Compatible with MLLib pipeline
Warm start
Early stop
User-defined functions (RDD only)
– Objection
– Evaluation
– Callback: Early stopping, Model checkpoint
23#MLSAIS16
SparkGBM – API 1
24#MLSAIS16
GBMRegressor & GBMClassifier
val gbmr = new GBMRegressor
gbmr.setBoostType("dart")
.setObjectiveFunc("square")
.setEvaluateFunc(Array("rmse", "mae"))
.setRegAlpha(0.1)
.setRegLambda(0.5)
.setDropRate(0.1)
.setEarlyStopIters(10)
.setInitialModelPath(path)
Gradient boosting & DART
Objective
Regularization
Early stop
Warm start
SparkGBM – API 2
25#MLSAIS16
GBMRegressionModel & GBMClassificationModel
val model1 = gbmr.fit(train)
val model2 = gbmr.fit(train, test)
model2.setFirstTrees(5)
model2.transform(test)
model2.setEnableOneHot(true)
model2.leaf(test)
Train without validation,
early stop is disabled
Train with validation, early
stop is enabled
Using first 5 trees for
following computation
Prediction
Feature transformation
by index of leaf/path
SparkGBM – Performance
26#MLSAIS16
0
500
1000
1500
2000
2500
3000
MLlib-GBT SparkGBM XGboost4J
Training Time(sec)
Training Time(sec) Allreduce in Rabit
Reduce &
Broadcast
Future work
• Warm start in other algorithms
– Use K-Means to initialize GMM
• ALS enhancements
– Improve the solution stability
• SparkGBM enhancements
– Add features from XGBoost & LightGBM, i.e. softmax to
support multi-class classification
27#MLSAIS16
Thank you!
28#MLSAIS16

More Related Content

What's hot

Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Databricks
 
Building A Feature Factory
Building A Feature FactoryBuilding A Feature Factory
Building A Feature FactoryDatabricks
 
Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Sparkcarl_pulley
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkDatabricks
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Databricks
 
AutoML Toolkit – Deep Dive
AutoML Toolkit – Deep DiveAutoML Toolkit – Deep Dive
AutoML Toolkit – Deep DiveDatabricks
 
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Databricks
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelinesjeykottalam
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
 
Scott Clark, CEO, SigOpt, at The AI Conference 2017
Scott Clark, CEO, SigOpt, at The AI Conference 2017Scott Clark, CEO, SigOpt, at The AI Conference 2017
Scott Clark, CEO, SigOpt, at The AI Conference 2017MLconf
 
AI-Assisted Feature Selection for Big Data Modeling
AI-Assisted Feature Selection for Big Data ModelingAI-Assisted Feature Selection for Big Data Modeling
AI-Assisted Feature Selection for Big Data ModelingDatabricks
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowDatabricks
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosSpark Summit
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLJen Aman
 
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...Databricks
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksDatabricks
 
Conversational AI with Transformer Models
Conversational AI with Transformer ModelsConversational AI with Transformer Models
Conversational AI with Transformer ModelsDatabricks
 

What's hot (20)

Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Building A Feature Factory
Building A Feature FactoryBuilding A Feature Factory
Building A Feature Factory
 
Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Spark
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
 
AutoML Toolkit – Deep Dive
AutoML Toolkit – Deep DiveAutoML Toolkit – Deep Dive
AutoML Toolkit – Deep Dive
 
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Scott Clark, CEO, SigOpt, at The AI Conference 2017
Scott Clark, CEO, SigOpt, at The AI Conference 2017Scott Clark, CEO, SigOpt, at The AI Conference 2017
Scott Clark, CEO, SigOpt, at The AI Conference 2017
 
AI-Assisted Feature Selection for Big Data Modeling
AI-Assisted Feature Selection for Big Data ModelingAI-Assisted Feature Selection for Big Data Modeling
AI-Assisted Feature Selection for Big Data Modeling
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflow
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
 
Conversational AI with Transformer Models
Conversational AI with Transformer ModelsConversational AI with Transformer Models
Conversational AI with Transformer Models
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 

Similar to Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers with Ruifeng Zheng and Yanbo Liang

StudySapuri Data Analytics Platform with Treasure Data
StudySapuri Data Analytics Platform with Treasure DataStudySapuri Data Analytics Platform with Treasure Data
StudySapuri Data Analytics Platform with Treasure DataTetsuo Yamabe
 
MLSEV. Use Case: Robotic Process Automation and Machine Learning
MLSEV. Use Case: Robotic Process Automation and Machine LearningMLSEV. Use Case: Robotic Process Automation and Machine Learning
MLSEV. Use Case: Robotic Process Automation and Machine LearningBigML, Inc
 
Open ETL for Real-Time Decision Making with Shuai Yuan
Open ETL for Real-Time Decision Making with Shuai YuanOpen ETL for Real-Time Decision Making with Shuai Yuan
Open ETL for Real-Time Decision Making with Shuai YuanDatabricks
 
AMPL Workshop, part 2: From Formulation to Deployment
AMPL Workshop, part 2: From Formulation to DeploymentAMPL Workshop, part 2: From Formulation to Deployment
AMPL Workshop, part 2: From Formulation to DeploymentBob Fourer
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkKostas Tzoumas
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger AnalyticsItzhak Kameli
 
Tips to get the most out of OpenERP
Tips to get the most out of OpenERPTips to get the most out of OpenERP
Tips to get the most out of OpenERPAudaxis
 
Tips to get the most out of OpenERP. Jean Luc Delsaute & Coralie Girardet, Au...
Tips to get the most out of OpenERP. Jean Luc Delsaute & Coralie Girardet, Au...Tips to get the most out of OpenERP. Jean Luc Delsaute & Coralie Girardet, Au...
Tips to get the most out of OpenERP. Jean Luc Delsaute & Coralie Girardet, Au...Odoo
 
MLSD18. Real-World Use Case I
MLSD18. Real-World Use Case IMLSD18. Real-World Use Case I
MLSD18. Real-World Use Case IBigML, Inc
 
R4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning FrameworkR4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning FrameworkAlok Singh
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduceAmazon Web Services
 
Intro to Quantitative Investment (Lecture 2 of 6)
Intro to Quantitative Investment (Lecture 2 of 6)Intro to Quantitative Investment (Lecture 2 of 6)
Intro to Quantitative Investment (Lecture 2 of 6)Adrian Aley
 
eCommerce Case Studies - A Little Book of Success
eCommerce Case Studies - A Little Book of SuccesseCommerce Case Studies - A Little Book of Success
eCommerce Case Studies - A Little Book of SuccessDivante
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationTatsuhiro Chiba
 
London measure camp12-konstantinos-papadopoulos
London measure camp12-konstantinos-papadopoulosLondon measure camp12-konstantinos-papadopoulos
London measure camp12-konstantinos-papadopoulosKonstantinos Papadopoulos
 
Transforming Direct Spend Sourcing Processes in Discrete Manufacturing - 56558
Transforming Direct Spend Sourcing Processes in Discrete Manufacturing - 56558Transforming Direct Spend Sourcing Processes in Discrete Manufacturing - 56558
Transforming Direct Spend Sourcing Processes in Discrete Manufacturing - 56558SAP Ariba Live 2018
 
Presentation on erp by Khurram Waseem Khan mba 2nd semester hu
Presentation on erp by Khurram Waseem Khan mba 2nd semester   huPresentation on erp by Khurram Waseem Khan mba 2nd semester   hu
Presentation on erp by Khurram Waseem Khan mba 2nd semester hukhurram wasim khan
 
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentationCary Millsap
 

Similar to Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers with Ruifeng Zheng and Yanbo Liang (20)

StudySapuri Data Analytics Platform with Treasure Data
StudySapuri Data Analytics Platform with Treasure DataStudySapuri Data Analytics Platform with Treasure Data
StudySapuri Data Analytics Platform with Treasure Data
 
MLSEV. Use Case: Robotic Process Automation and Machine Learning
MLSEV. Use Case: Robotic Process Automation and Machine LearningMLSEV. Use Case: Robotic Process Automation and Machine Learning
MLSEV. Use Case: Robotic Process Automation and Machine Learning
 
Open ETL for Real-Time Decision Making with Shuai Yuan
Open ETL for Real-Time Decision Making with Shuai YuanOpen ETL for Real-Time Decision Making with Shuai Yuan
Open ETL for Real-Time Decision Making with Shuai Yuan
 
AMPL Workshop, part 2: From Formulation to Deployment
AMPL Workshop, part 2: From Formulation to DeploymentAMPL Workshop, part 2: From Formulation to Deployment
AMPL Workshop, part 2: From Formulation to Deployment
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
Tips to get the most out of OpenERP
Tips to get the most out of OpenERPTips to get the most out of OpenERP
Tips to get the most out of OpenERP
 
Tips to get the most out of OpenERP. Jean Luc Delsaute & Coralie Girardet, Au...
Tips to get the most out of OpenERP. Jean Luc Delsaute & Coralie Girardet, Au...Tips to get the most out of OpenERP. Jean Luc Delsaute & Coralie Girardet, Au...
Tips to get the most out of OpenERP. Jean Luc Delsaute & Coralie Girardet, Au...
 
MLSD18. Real-World Use Case I
MLSD18. Real-World Use Case IMLSD18. Real-World Use Case I
MLSD18. Real-World Use Case I
 
R4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning FrameworkR4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning Framework
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce
 
Intro to Quantitative Investment (Lecture 2 of 6)
Intro to Quantitative Investment (Lecture 2 of 6)Intro to Quantitative Investment (Lecture 2 of 6)
Intro to Quantitative Investment (Lecture 2 of 6)
 
eCommerce Case Studies - A Little Book of Success
eCommerce Case Studies - A Little Book of SuccesseCommerce Case Studies - A Little Book of Success
eCommerce Case Studies - A Little Book of Success
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark application
 
London measure camp12-konstantinos-papadopoulos
London measure camp12-konstantinos-papadopoulosLondon measure camp12-konstantinos-papadopoulos
London measure camp12-konstantinos-papadopoulos
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Transforming Direct Spend Sourcing Processes in Discrete Manufacturing - 56558
Transforming Direct Spend Sourcing Processes in Discrete Manufacturing - 56558Transforming Direct Spend Sourcing Processes in Discrete Manufacturing - 56558
Transforming Direct Spend Sourcing Processes in Discrete Manufacturing - 56558
 
Sap infosys fico
Sap infosys ficoSap infosys fico
Sap infosys fico
 
Presentation on erp by Khurram Waseem Khan mba 2nd semester hu
Presentation on erp by Khurram Waseem Khan mba 2nd semester   huPresentation on erp by Khurram Waseem Khan mba 2nd semester   hu
Presentation on erp by Khurram Waseem Khan mba 2nd semester hu
 
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 

Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers with Ruifeng Zheng and Yanbo Liang

  • 1. Ruifeng Zheng, JD.COM Yanbo Liang, Hortonworks Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers #MLSAIS16
  • 2. About us Ruifeng Zheng ruifengz@foxmail.com – Senior Software Engineer in Intelligent Advertising Lab at JD.COM – Apache Spark, Scikit-Learn & XGBoost contributor – SparkLibFM & SparkGBM Author Yanbo Liang ybliang8@gmail.com – Staff Software Engineer at Hortonworks – Apache Spark PMC member – Tensorflow & XGBoost contributor 2#MLSAIS16
  • 3. Outline • What are the problems? • How we solve it? • The lessons learned. • Where is the gap? • Enhancements – ALS with warm start – SparkGBM – a new GBM impl atop Spark • Future work 3#MLSAIS16
  • 4. About JD.com & Wiwin JD.com China’s largest online retailer China’s largest e-commerce delivery system 300+ million active users Billions of SKUs on shelves, in thousands of categories WiWin Team in Business Growth Dept. Supply Data-Mining Services for Top Brands 4#MLSAIS16
  • 5. Business scenarios – User Segmentation – Cross Selling – Purchase Prediction 5#MLSAIS16
  • 6. User Segmentation Demand: Help Brands to measure marketing campaigns (beyond ROI and GMV) 6#MLSAIS16 Impression, Click, View, Search, Ask, Follow, Order, Comment, Cart, etc Input 4A Model 4A status of each user Aware Appeal Act Advocate
  • 7. User Segmentation 7#MLSAIS16 V1: RDD only V2: DataFrame + RDD(only used in complexed operations) ~3.2x speedup save 40% memory footprint
  • 8. Cross-Selling Demand: Help Brands to find potential co-operators among millions of brands 8#MLSAIS16
  • 9. Cross-Selling 9#MLSAIS16 Flexible pattern #SPARK-13385 rules like {A,B,C} -> {X,Y} Measure how many more times X and Y occur together than expected
  • 10. Purchase Prediction Demand: Help Brands to better target potential users 10#MLSAIS16 Different from tradition recommendation: – For each user, select several items – For several items, select millions of users
  • 11. Purchase Prediction - Ranking 11#MLSAIS16
  • 13. Lessons Learned - 1 Multi-Column processing – Imputer #SPARK-21690 – ApproxQuantile #SPARK-14352 – Bucketizer #SPARK-22797 13#MLSAIS16 0 1 2 3 4 5 6 7 8 1 col 10 cols 100 cols Training Time of Imputer(sec) One-Col Multi-Col
  • 14. Lessons Learned - 2 RDD & DataFrame are Complementary ETL and data transformation -> DataFrame Complex logic containing lots of aggregation -> RDD 14#MLSAIS16
  • 15. Lessons Learned - 3 Parallelized Cross-Validation 15#MLSAIS16 https://bryancutler.github.io/cv-parallel/ 0 10000 20000 30000 40000 50000 60000 0M 1M 2M 3M 4M 5M 6M 7M 8M Serial Parallel=5
  • 16. GAP Warm Start – Resume training – Accelerate convergence – Stable solution Callback after each iteration – Early stop – Model checkpoint Compact Numeric Format 16#MLSAIS16 ALS GBM
  • 18. ALS – Warm start 18#MLSAIS16 Item factors in solution T Initial Item factors to train solution T+1 Randomized C off shelves D on shelves
  • 19. ALS – Warm start 0.8 1.6 1 2 3 4 5 6 7 8 9 10 RMSE ALS ALS (warm start) 19#MLSAIS16 0.83 0.80 Save ~40% training time
  • 20. GBM - Life is short, you need GBM Objective in t-th Iteration: 2345 6 = 8 9:; 5<; = >9, @>9 5<; + B5 C9 + Ω(B5) 20#MLSAIS16 previous prediction training loss: how well model fit on training data regularization: control model complexity base model to be added in Iteration t
  • 21. GBM - Impls 21#MLSAIS16 •Tree as base model •First-order approximation GBT •Second-order approximation •L1 & L2 regularization •Shrinkage •Column sampling •Sparsity-aware split finding XGBoost •Histogram subtraction •Discrete bins LightGBM DMLC-Rabit MS-DMTK Dedicated ML frameworks result in extra costs in Deployment, Maintenance & Monitoring
  • 22. SparkGBM https://github.com/zhengruifeng/SparkGBM To be a scalable and efficient GBM atop Spark 22#MLSAIS16 Second-order approximation L1 & L2 regularization Shrinkage Column sampling Sparsity-aware Binned data Histogram subtraction Codebase legacy: Model save/load Periodic checkpointer …
  • 23. SparkGBM - Features Compatible with MLLib pipeline Warm start Early stop User-defined functions (RDD only) – Objection – Evaluation – Callback: Early stopping, Model checkpoint 23#MLSAIS16
  • 24. SparkGBM – API 1 24#MLSAIS16 GBMRegressor & GBMClassifier val gbmr = new GBMRegressor gbmr.setBoostType("dart") .setObjectiveFunc("square") .setEvaluateFunc(Array("rmse", "mae")) .setRegAlpha(0.1) .setRegLambda(0.5) .setDropRate(0.1) .setEarlyStopIters(10) .setInitialModelPath(path) Gradient boosting & DART Objective Regularization Early stop Warm start
  • 25. SparkGBM – API 2 25#MLSAIS16 GBMRegressionModel & GBMClassificationModel val model1 = gbmr.fit(train) val model2 = gbmr.fit(train, test) model2.setFirstTrees(5) model2.transform(test) model2.setEnableOneHot(true) model2.leaf(test) Train without validation, early stop is disabled Train with validation, early stop is enabled Using first 5 trees for following computation Prediction Feature transformation by index of leaf/path
  • 26. SparkGBM – Performance 26#MLSAIS16 0 500 1000 1500 2000 2500 3000 MLlib-GBT SparkGBM XGboost4J Training Time(sec) Training Time(sec) Allreduce in Rabit Reduce & Broadcast
  • 27. Future work • Warm start in other algorithms – Use K-Means to initialize GMM • ALS enhancements – Improve the solution stability • SparkGBM enhancements – Add features from XGBoost & LightGBM, i.e. softmax to support multi-class classification 27#MLSAIS16