Personalized product recommendation, cross-sell selection, customer churn and purchase prediction are becoming increasingly important for e-commerce companies such as JD.com, one of the world's largest B2C online retailers with more than 300 million active users. Apache Spark is a powerful tool and a standard framework/API for machine learning feature generation, model training and model evaluation, so JD machine learning engineers can easily experiment with ML pipelines and rapidly adjust models to meet demand.
Spark MLlib provides many useful machine learning algorithms with high-performance implementations. However, there is still a gap between our requirements and the existing MLlib algorithms. We therefore implemented and open-sourced some algorithms of our own, such as a new GBM (gradient boosting machine) implementation named SparkGBM, which draws on valuable lessons from XGBoost and LightGBM. Under the Spark MLlib pipeline framework, we can plug our algorithms into the whole ML pipeline and benefit from a unified machine learning platform. In this talk, we will discuss how we use Apache Spark to prototype machine learning models for user segmentation, targeting and personalization at our group, the revenue growth it brings, and the lessons we learned.
We'll also share the various pitfalls we have encountered when using Apache Spark, and how we collaborated with the community to overcome these difficulties.
Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers with Ruifeng Zheng and Yanbo Liang
1. Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers
Ruifeng Zheng, JD.COM
Yanbo Liang, Hortonworks
#MLSAIS16
2. About us
Ruifeng Zheng ruifengz@foxmail.com
– Senior Software Engineer in Intelligent Advertising Lab at JD.COM
– Apache Spark, Scikit-Learn & XGBoost contributor
– SparkLibFM & SparkGBM Author
Yanbo Liang ybliang8@gmail.com
– Staff Software Engineer at Hortonworks
– Apache Spark PMC member
– TensorFlow & XGBoost contributor
3. Outline
• What are the problems?
• How do we solve them?
• The lessons learned.
• Where is the gap?
• Enhancements
– ALS with warm start
– SparkGBM – a new GBM impl atop Spark
• Future work
4. About JD.com & WiWin
JD.com
China’s largest online retailer
China’s largest e-commerce delivery system
300+ million active users
Billions of SKUs on shelves, in thousands of categories
WiWin Team in Business Growth Dept.
Supply Data-Mining Services for Top Brands
6. User Segmentation
Demand: Help Brands to measure marketing
campaigns (beyond ROI and GMV)
[Diagram] Input: impression, click, view, search, ask, follow, order, comment, cart, etc. → 4A Model → Output: the 4A status of each user (Aware, Appeal, Act, Advocate)
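To make the 4A model concrete, here is a minimal sketch of how observed interaction events might be mapped to a 4A stage. The stage names come from the slide; the event-to-stage rules below are illustrative assumptions for this sketch, not JD's actual segmentation logic.

```scala
// 4A stages from the slide: Aware, Appeal, Act, Advocate.
object FourA extends Enumeration {
  val Aware, Appeal, Act, Advocate = Value
}

// Hypothetical rules: post-purchase engagement -> Advocate; purchase intent -> Act;
// active browsing -> Appeal; impressions only -> Aware.
def fourAStatus(events: Set[String]): FourA.Value =
  if (events.exists(Set("comment", "follow")))                   FourA.Advocate
  else if (events.exists(Set("order", "cart")))                  FourA.Act
  else if (events.exists(Set("click", "view", "search", "ask"))) FourA.Appeal
  else                                                           FourA.Aware
```

A brand could then count users per stage to measure a campaign beyond ROI and GMV.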
10. Purchase Prediction
Demand: Help Brands to better target potential
users
Different from traditional recommendation:
– For each user, select several items
– For several items, select millions of users
16. GAP
Warm start
– Resume training
– Accelerate convergence
– Stable solution
Callback after each iteration
– Early stop
– Model checkpoint
Compact numeric format
(gaps addressed in ALS and GBM)
18. ALS – Warm start
[Diagram] Item factors from solution T are reused as the initial item factors to train solution T+1; item C, now off shelves, is dropped, while item D, newly on shelves, gets randomized initial factors.
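The carry-over step in the diagram can be sketched as follows. This is an illustrative sketch, not Spark's ALS API: the function name, map-based representation, and random-initialization scale are assumptions for this example.

```scala
import scala.util.Random

// Build initial item factors for run T+1 from the factors learned in run T:
// items still on shelves reuse their old factors, new items are randomized,
// and items that went off shelves are simply dropped.
def warmStartItemFactors(
    prevFactors: Map[String, Array[Float]], // item factors from solution T
    itemsOnShelves: Set[String],            // items currently on shelves
    rank: Int,
    seed: Long = 42L): Map[String, Array[Float]] = {
  val rng = new Random(seed)
  itemsOnShelves.iterator.map { item =>
    item -> prevFactors.getOrElse(item, Array.fill(rank)(rng.nextFloat() * 0.01f))
  }.toMap
}
```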
19. ALS – Warm start
[Chart] RMSE over iterations 1–10: plain ALS converges to RMSE ≈ 0.83, ALS with warm start to ≈ 0.80; warm start saves ~40% training time.
20. GBM - Life is short, you need GBM
Objective in the t-th iteration:

  Obj^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t-1) + f_t(x_i)) + Ω(f_t)

– ŷ_i^(t-1): previous prediction
– l: training loss, measuring how well the model fits the training data
– f_t: base model to be added in iteration t
– Ω(f_t): regularization, controlling model complexity
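Under a second-order approximation of this objective (the XGBoost-style derivation referenced on the next slide), each leaf gets a closed-form optimal weight from its gradient sum G and hessian sum H. The sketch below follows that standard derivation with L1 (alpha) and L2 (lambda) regularization; it is illustrative, not SparkGBM's exact code.

```scala
// Soft-thresholding implements the L1 penalty: small gradient sums are zeroed out.
def thresholdL1(g: Double, alpha: Double): Double =
  math.signum(g) * math.max(math.abs(g) - alpha, 0.0)

// Optimal leaf weight under the second-order approximation:
//   w* = -T_alpha(G) / (H + lambda)
def optimalLeafWeight(gradSum: Double, hessSum: Double,
                      regAlpha: Double, regLambda: Double): Double =
  -thresholdL1(gradSum, regAlpha) / (hessSum + regLambda)
```

For example, with G = 10, H = 4, no L1 and lambda = 1, the leaf weight is -10 / 5 = -2.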
21. GBM - Impls
GBT (Spark MLlib)
• Tree as base model
• First-order approximation

XGBoost (DMLC-Rabit)
• Second-order approximation
• L1 & L2 regularization
• Shrinkage
• Column sampling
• Sparsity-aware split finding

LightGBM (MS-DMTK)
• Histogram subtraction
• Discrete bins

Dedicated ML frameworks result in extra costs in deployment, maintenance & monitoring
22. SparkGBM https://github.com/zhengruifeng/SparkGBM
To be a scalable and efficient GBM atop Spark
Second-order approximation
L1 & L2 regularization
Shrinkage
Column sampling
Sparsity-aware
Binned data
Histogram subtraction
Inherited from the Spark codebase:
– Model save/load
– Periodic checkpointer
– …
23. SparkGBM - Features
Compatible with MLLib pipeline
Warm start
Early stop
User-defined functions (RDD only)
– Objective
– Evaluation
– Callback: Early stopping, Model checkpoint
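The early-stopping callback can be summarized by a small sketch: stop training once the validation metric has gone a given number of rounds without improving. The class name and method signature are assumptions for illustration; SparkGBM's actual callback API may differ.

```scala
// Track the best validation metric seen so far; after `patience` consecutive
// iterations without improvement, signal that training should stop.
final class EarlyStop(patience: Int) {
  private var best = Double.MaxValue
  private var badRounds = 0

  /** Call after each iteration with the validation metric (lower is better);
    * returns true when training should stop. */
  def shouldStop(metric: Double): Boolean = {
    if (metric < best) { best = metric; badRounds = 0 }
    else badRounds += 1
    badRounds >= patience
  }
}
```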
24. SparkGBM – API 1
GBMRegressor & GBMClassifier

val gbmr = new GBMRegressor
gbmr.setBoostType("dart")                // gradient boosting & DART
  .setObjectiveFunc("square")            // objective
  .setEvaluateFunc(Array("rmse", "mae")) // evaluation metrics
  .setRegAlpha(0.1)                      // L1 regularization
  .setRegLambda(0.5)                     // L2 regularization
  .setDropRate(0.1)
  .setEarlyStopIters(10)                 // early stop
  .setInitialModelPath(path)             // warm start
25. SparkGBM – API 2
GBMRegressionModel & GBMClassificationModel

val model1 = gbmr.fit(train)       // train without validation; early stop disabled
val model2 = gbmr.fit(train, test) // train with validation; early stop enabled

model2.setFirstTrees(5)            // use only the first 5 trees in following computation
model2.transform(test)             // prediction
model2.setEnableOneHot(true)
model2.leaf(test)                  // feature transformation by index of leaf/path
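The leaf-based feature transformation behind model.leaf with one-hot enabled works like this: each tree routes a sample to exactly one leaf, and the per-tree leaf indices are one-hot encoded into one binary vector. A minimal sketch, assuming a fixed number of leaves per tree (the function name and layout are illustrative, not SparkGBM's code):

```scala
// One-hot encode per-tree leaf indices: the output has one block of size
// `leavesPerTree` per tree, with a single 1.0 at the leaf the sample reached.
def leafOneHot(leafIndices: Array[Int], leavesPerTree: Int): Array[Double] = {
  val vec = Array.fill(leafIndices.length * leavesPerTree)(0.0)
  leafIndices.zipWithIndex.foreach { case (leaf, tree) =>
    vec(tree * leavesPerTree + leaf) = 1.0 // hot position for this tree's leaf
  }
  vec
}
```

These sparse binary features are commonly fed to a downstream linear model.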
27. Future work
• Warm start in other algorithms
– Use K-Means to initialize GMM
• ALS enhancements
– Improve the solution stability
• SparkGBM enhancements
– Add features from XGBoost & LightGBM, e.g. softmax to support multi-class classification