Personalized product recommendation, cross-sell selection, customer churn and purchase prediction are becoming increasingly important for e-commerce companies such as JD.com, one of the world's largest B2C online retailers with more than 300 million active users. Apache Spark is a powerful tool and a standard framework/API for machine learning feature generation, model training and model evaluation, so JD machine learning engineers can easily experiment with ML pipelines and rapidly adjust models to meet demand.
Spark MLlib provides many useful machine learning algorithms with high-performance implementations. However, there is still a gap between our requirements and the existing MLlib algorithms. We therefore implemented and open-sourced some algorithms of our own, such as a new GBM (gradient boosting machine) implementation named SparkGBM, which draws on valuable lessons from XGBoost and LightGBM. Under the Spark MLlib pipeline framework, we can plug our algorithms into the whole ML pipeline and benefit from a unified machine learning platform. In this talk, we will discuss how we use Apache Spark to prototype machine learning models for user segmentation, targeting and personalization at our group, the revenue growth it brings, and the lessons we learned.
We'll also share the various pitfalls we have encountered when using Apache Spark, and how we collaborated with the community to overcome these difficulties.
Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers with Ruifeng Zheng and Yanbo Liang
1. Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers
Ruifeng Zheng, JD.COM
Yanbo Liang, Hortonworks
#MLSAIS16
2. About us
Ruifeng Zheng ruifengz@foxmail.com
– Senior Software Engineer in Intelligent Advertising Lab at JD.COM
– Apache Spark, Scikit-Learn & XGBoost contributor
– SparkLibFM & SparkGBM Author
Yanbo Liang ybliang8@gmail.com
– Staff Software Engineer at Hortonworks
– Apache Spark PMC member
– TensorFlow & XGBoost contributor
3. Outline
• What are the problems?
• How do we solve them?
• The lessons learned.
• Where is the gap?
• Enhancements
– ALS with warm start
– SparkGBM – a new GBM impl atop Spark
• Future work
4. About JD.com & WiWin
JD.com
China’s largest online retailer
China’s largest e-commerce delivery system
300+ million active users
Billions of SKUs on shelves, in thousands of categories
WiWin Team in Business Growth Dept.
Supply Data-Mining Services for Top Brands
6. User Segmentation
Demand: Help Brands to measure marketing
campaigns (beyond ROI and GMV)
[Diagram] Input: impression, click, view, search, ask, follow, order, comment, cart, etc. → 4A Model → Output: the 4A status of each user (Aware, Appeal, Act, Advocate)
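To make the 4A model concrete, here is a minimal sketch of how observed interaction events might be mapped to a 4A stage. The stage names come from the slide; the event-to-stage rules below are illustrative assumptions for this sketch, not JD's actual segmentation logic.

```scala
// 4A stages from the slide: Aware, Appeal, Act, Advocate.
object FourA extends Enumeration {
  val Aware, Appeal, Act, Advocate = Value
}

// Hypothetical rules: post-purchase engagement -> Advocate; purchase intent -> Act;
// active browsing -> Appeal; impressions only -> Aware.
def fourAStatus(events: Set[String]): FourA.Value =
  if (events.exists(Set("comment", "follow")))                   FourA.Advocate
  else if (events.exists(Set("order", "cart")))                  FourA.Act
  else if (events.exists(Set("click", "view", "search", "ask"))) FourA.Appeal
  else                                                           FourA.Aware
```

A brand could then count users per stage to measure a campaign beyond ROI and GMV.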
10. Purchase Prediction
Demand: Help Brands to better target potential
users
Different from traditional recommendation:
– For each user, select several items
– For several items, select millions of users
16. GAP
Warm start
– Resume training
– Accelerate convergence
– Stable solution
Callback after each iteration
– Early stop
– Model checkpoint
Compact numeric format
(gaps addressed in ALS and GBM)
18. ALS – Warm start
[Diagram] Item factors from solution T are reused as the initial item factors to train solution T+1; item C, now off shelves, is dropped, while item D, newly on shelves, gets randomized initial factors.
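The carry-over step in the diagram can be sketched as follows. This is an illustrative sketch, not Spark's ALS API: the function name, map-based representation, and random-initialization scale are assumptions for this example.

```scala
import scala.util.Random

// Build initial item factors for run T+1 from the factors learned in run T:
// items still on shelves reuse their old factors, new items are randomized,
// and items that went off shelves are simply dropped.
def warmStartItemFactors(
    prevFactors: Map[String, Array[Float]], // item factors from solution T
    itemsOnShelves: Set[String],            // items currently on shelves
    rank: Int,
    seed: Long = 42L): Map[String, Array[Float]] = {
  val rng = new Random(seed)
  itemsOnShelves.iterator.map { item =>
    item -> prevFactors.getOrElse(item, Array.fill(rank)(rng.nextFloat() * 0.01f))
  }.toMap
}
```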
19. ALS – Warm start
[Chart] RMSE over iterations 1–10: plain ALS converges to RMSE ≈ 0.83, ALS with warm start to ≈ 0.80; warm start saves ~40% training time.
20. GBM - Life is short, you need GBM
Objective in the t-th iteration:

  Obj^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t-1) + f_t(x_i)) + Ω(f_t)

– ŷ_i^(t-1): previous prediction
– l: training loss, measuring how well the model fits the training data
– f_t: base model to be added in iteration t
– Ω(f_t): regularization, controlling model complexity
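Under a second-order approximation of this objective (the XGBoost-style derivation referenced on the next slide), each leaf gets a closed-form optimal weight from its gradient sum G and hessian sum H. The sketch below follows that standard derivation with L1 (alpha) and L2 (lambda) regularization; it is illustrative, not SparkGBM's exact code.

```scala
// Soft-thresholding implements the L1 penalty: small gradient sums are zeroed out.
def thresholdL1(g: Double, alpha: Double): Double =
  math.signum(g) * math.max(math.abs(g) - alpha, 0.0)

// Optimal leaf weight under the second-order approximation:
//   w* = -T_alpha(G) / (H + lambda)
def optimalLeafWeight(gradSum: Double, hessSum: Double,
                      regAlpha: Double, regLambda: Double): Double =
  -thresholdL1(gradSum, regAlpha) / (hessSum + regLambda)
```

For example, with G = 10, H = 4, no L1 and lambda = 1, the leaf weight is -10 / 5 = -2.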
21. GBM - Impls
GBT (Spark MLlib)
• Tree as base model
• First-order approximation

XGBoost (DMLC-Rabit)
• Second-order approximation
• L1 & L2 regularization
• Shrinkage
• Column sampling
• Sparsity-aware split finding

LightGBM (MS-DMTK)
• Histogram subtraction
• Discrete bins

Dedicated ML frameworks result in extra costs in deployment, maintenance & monitoring
22. SparkGBM https://github.com/zhengruifeng/SparkGBM
To be a scalable and efficient GBM atop Spark
Second-order approximation
L1 & L2 regularization
Shrinkage
Column sampling
Sparsity-aware
Binned data
Histogram subtraction
Inherited from the Spark codebase:
– Model save/load
– Periodic checkpointer
– …
23. SparkGBM - Features
Compatible with MLLib pipeline
Warm start
Early stop
User-defined functions (RDD only)
– Objective
– Evaluation
– Callback: Early stopping, Model checkpoint
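The early-stopping callback can be summarized by a small sketch: stop training once the validation metric has gone a given number of rounds without improving. The class name and method signature are assumptions for illustration; SparkGBM's actual callback API may differ.

```scala
// Track the best validation metric seen so far; after `patience` consecutive
// iterations without improvement, signal that training should stop.
final class EarlyStop(patience: Int) {
  private var best = Double.MaxValue
  private var badRounds = 0

  /** Call after each iteration with the validation metric (lower is better);
    * returns true when training should stop. */
  def shouldStop(metric: Double): Boolean = {
    if (metric < best) { best = metric; badRounds = 0 }
    else badRounds += 1
    badRounds >= patience
  }
}
```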
24. SparkGBM – API 1
GBMRegressor & GBMClassifier

val gbmr = new GBMRegressor
gbmr.setBoostType("dart")                // gradient boosting & DART
  .setObjectiveFunc("square")            // objective
  .setEvaluateFunc(Array("rmse", "mae")) // evaluation metrics
  .setRegAlpha(0.1)                      // L1 regularization
  .setRegLambda(0.5)                     // L2 regularization
  .setDropRate(0.1)
  .setEarlyStopIters(10)                 // early stop
  .setInitialModelPath(path)             // warm start
25. SparkGBM – API 2
GBMRegressionModel & GBMClassificationModel

val model1 = gbmr.fit(train)       // train without validation; early stop disabled
val model2 = gbmr.fit(train, test) // train with validation; early stop enabled

model2.setFirstTrees(5)            // use only the first 5 trees in following computation
model2.transform(test)             // prediction
model2.setEnableOneHot(true)
model2.leaf(test)                  // feature transformation by index of leaf/path
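The leaf-based feature transformation behind model.leaf with one-hot enabled works like this: each tree routes a sample to exactly one leaf, and the per-tree leaf indices are one-hot encoded into one binary vector. A minimal sketch, assuming a fixed number of leaves per tree (the function name and layout are illustrative, not SparkGBM's code):

```scala
// One-hot encode per-tree leaf indices: the output has one block of size
// `leavesPerTree` per tree, with a single 1.0 at the leaf the sample reached.
def leafOneHot(leafIndices: Array[Int], leavesPerTree: Int): Array[Double] = {
  val vec = Array.fill(leafIndices.length * leavesPerTree)(0.0)
  leafIndices.zipWithIndex.foreach { case (leaf, tree) =>
    vec(tree * leavesPerTree + leaf) = 1.0 // hot position for this tree's leaf
  }
  vec
}
```

These sparse binary features are commonly fed to a downstream linear model.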
27. Future work
• Warm start in other algorithms
– Use K-Means to initialize GMM
• ALS enhancements
– Improve the solution stability
• SparkGBM enhancements
– Add features from XGBoost & LightGBM, e.g. softmax to support multi-class classification