This talk is composed of three major parts: the iterative creation of a recommender engine, the labeling of images, and the post-processing of images.
After introducing the main topic, labeling images to improve recommendation engine performance, we start with a discussion of recommendation engines. We briefly describe the “classical” recommender systems (collaborative filtering, content-based filtering) along with their advantages and limitations. We then describe the re-ranking approach we used to combine different engines into one. Re-ranking is a method (used by Google, for example) that takes the different rankings as features and optimizes a certain loss. In our case we combine our different recommendations through a logistic regression that predicts the probability of purchase for each (user, sale) tuple. This version of the engine led to +7% revenue per customer and is now running in production.
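The re-ranking step can be sketched in a few lines of scikit-learn. Everything below is hypothetical toy data (two base recommenders, random scores, simulated purchase labels); the point is only to show the mechanism: base-recommender scores become features of a logistic regression that predicts purchase probability for each (user, sale) pair.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: for each (user, sale) pair, the scores produced by
# two base recommenders (e.g. collaborative filtering and content-based).
n_pairs = 1000
cf_score = rng.random(n_pairs)   # collaborative-filtering score (made up)
cb_score = rng.random(n_pairs)   # content-based score (made up)
features = np.column_stack([cf_score, cb_score])

# Simulated purchase labels: purchases are more likely when both scores are high.
purchase = (rng.random(n_pairs) < 0.5 * (cf_score + cb_score)).astype(int)

# The re-ranking "meta model": a logistic regression that predicts the
# probability of purchase for each (user, sale) pair from the base scores.
meta = LogisticRegression().fit(features, purchase)
proba = meta.predict_proba(features)[:, 1]

# Re-rank: sales are then displayed in decreasing order of predicted probability.
ranking = np.argsort(-proba)
print(proba.shape, ranking[:5])
```

In production the features would be the actual scores or ranks produced by each engine, and the model would be trained on historical purchase logs rather than simulated labels.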
We then explain why we wanted to use image information. It seemed that sales with certain kinds of images were performing better than others. If we had labels on all images, we could use them in a content-based recommender system (itself used in the re-ranking engine). We then describe how to label our images using pre-trained models, transfer learning and external APIs. We also show how easy it is to steal these APIs.
The final part deals with the post-processing of the images. Since most pre-trained models only output fine-grained class predictions, we need to reshape these into broad themes that can be used in our engine. We use Non-negative Matrix Factorization (NMF) for this purpose and show that it yields very interpretable results. We conclude by comparing the different engines visually.
The key takeaways (more information in the pitch part) are these:
- Machine learning: overview of recommender systems, re-ranking, how to label images, transfer learning.
- Do iterative data science. Start simple, then try more complex systems.
- Avoid rushing into deep learning without checking what you can find on the Internet. Use pre-trained models and transfer learning.
There is a lot of hype around deep learning and image recognition. However, there are not that many success stories among pure-play web companies. In our case, we explain how we started with simple recommender systems before improving them gradually and finally using image information.
One of the key takeaways is the following: do iterative data science. Always prefer shipping a minimum viable product before creating something complex. At our clients, we commonly see teams rushing into image projects for the sole purpose of doing deep learning, without a clear ROI in mind.
We insist on the fact that deep learning is not an end in itself. Here, it boils down to making new information available in the system. In this sense, deep learning methods are just an extension of Business Intelligence.
2. Outline
• Introduction and Context
• Iterative building of a recommender system
• Labeling images: pragmatic deep learning for dummies
• Post processing, AKA images for BI on steroids
• Results
• More images!
3. Dataiku
• Founded in 2013
• 90+ employees, 100+ clients
• Paris, New York, London, San Francisco, Singapore
Data science software editor of Dataiku DSS
DESIGN: load and prepare your data (PREPARE), build your models (MODEL), visualize and share your work (ANALYSE)
PRODUCTION: re-execute your workflow with ease (AUTOMATE), follow your production environment (MONITOR), get predictions in real time (SCORE)
4. E-business vacation retailer
Discount luxury: negotiates the best price for its clients
Key figures:
• 18 million clients
• Hundreds of sales opened every day
• Sale image is paramount
• Purchase is impulsive
5. Specificities
Highly temporary sales
-> Classical recommender systems fail
-> Sales are tied to calendar events (Christmas, ski, summer)
Expensive products
-> Few recurrent buyers
-> Appearance counts a lot
9. One Meta Model to Rule Them All
• Describe: recommenders as features
• Combine: machine learning to optimize purchasing probability
• Recommend
10. Recommender system for Home Page Ordering
• Cleaning, combining and enrichment of data: the application automatically runs and compiles heterogeneous data (customer visits, purchases, sales images, sales information)
• Recommendation engines: generation of recommendations based on user behaviour
• Meta model: combines the recommendations to directly optimize purchasing probability
• Optimization of home display: every customer is shown the 10 sales he is most likely to buy
• Batch scoring every night
• +7% revenue (A/B testing)
11. Why use images?
We want to distinguish “Sun and Beach” from “Ski”.
A picture is worth a thousand words.
12. Integrating Image Information
Sales images -> labeling model -> label sets such as:
• Pool + Palm Trees + Hotel + Mountains
• Pool + Forest + Hotel + Sea
• Sea + Beach + Forest + Hotel
-> sales description vectors -> CONTENT-BASED recommender system
13. Image Labelling for the Recommendation Engine: Pragmatic Deep Learning for “Dummies”
14. Using Deep Learning Models: Common Issues
• “I don’t have a GPU server”
• “I don’t have a deep learning expert”
• “I don’t have labelled data” (or too few)
• “I don’t have the time to wait for model training”
• “I don’t want to pay for private APIs” / “I’m afraid their labelling will change over time”
15. “I don’t have labelled data” (or too few)
-> Is there similar data?
Solution 1: Pre-trained models
• PLACES DATABASE: 205 categories, 2.5M images
• SUN DATABASE: 307 categories, 110K images
16. Solution 1: Pre-trained models
If there is open data, there is an open pre-trained model!
• Kudos to the community
• Check the licensing
Example with Places (Caffe Model Zoo):
• tower: 0.53, skyscraper: 0.26
• swimming_pool/outdoor: 0.65, inn/outdoor: 0.06
17. Solution 2: Transfer Learning
Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson, http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
18. Solution 2: Transfer Learning
Leverage existing knowledge!
• Pre-trained model: VGG16 from the Caffe Model Zoo, trained on the PLACES / SUN databases (GPU). Outputs e.g. tower: 0.53, skyscraper: 0.26.
• Training on our data (optional, GPU)
• Transferred data: last convolutional layer features
• Re-trained model: 2 fully connected layers in TensorFlow (CPU)
Accuracy: 72%, top-5 accuracy: 90% — better than the state of the art on our dataset alone
19. Post Treatment & Results
(Or how we transfer the labelling information)
Using image information for BI on steroids
21. Image content detection
Topic scores determine the importance of topics in an image.
First image (TOPIC: TOPIC SCORE (%)):
• Golf course – Fairway – Putting green: 31
• Hotel – Inn – Apartment building outdoor: 30
• Swimming pool – Lido deck – Hot tub outdoor: 22
• Beach – Coast – Harbor: 17
Second image (TOPIC: TOPIC SCORE (%)):
• Tower – Skyscraper – Office building: 62
• Bridge – River – Viaduct: 38
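A sketch of how fine-grained label scores can be grouped into broad topics with Non-negative Matrix Factorization, as in the post-processing step described in the abstract. The image-by-label score matrix below is synthetic, and the label names and group structure are illustrative only.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)

# Hypothetical image x label matrix: each row is an image, each column the
# score of one fine-grained label from the pre-trained model. Related labels
# (e.g. "beach"/"coast", or "tower"/"skyscraper") tend to fire together.
labels = ["beach", "coast", "harbor", "tower", "skyscraper", "office"]
n_images = 200
seaside = rng.random((n_images // 2, 1)) * [[0.9, 0.8, 0.6, 0.05, 0.05, 0.0]]
urban = rng.random((n_images // 2, 1)) * [[0.05, 0.0, 0.1, 0.9, 0.7, 0.6]]
scores = np.vstack([seaside, urban]) + 0.01 * rng.random((n_images, 6))

# NMF factorizes scores ~ W @ H: each row of H is a broad "topic" over the
# fine-grained labels, and each row of W gives an image's topic scores
# (the kind of table shown on the slide).
model = NMF(n_components=2, init="nndsvda", max_iter=500)
W = model.fit_transform(scores)
H = model.components_

for topic in H:
    top = [labels[i] for i in np.argsort(-topic)[:3]]
    print("topic:", " - ".join(top))
```

Because both factors are constrained to be non-negative, each topic is an additive mix of labels, which is what makes the resulting themes easy to read off and name.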
22. Results?
1) Visits:
• France and Morocco
• Pools displayed
2) First recommendation:
• Mostly France & Mediterranean
• Fails to display pools
3) Images-only recommendation:
• Pools all around the world
• Does not respect budget
4) Third column = the right mix
23. Conclusion
Do iterative data science!
• Start simple and grow
• Evaluate at each step
Image labelling = BI on steroids
Transfer learning:
• Kick-start your project
• Gain time and money
• Any data scientist can do it
Deep learning: don't start from scratch!
• Is there existing data?
• Is there a pre-trained model?
24. Learned along the way / What's next?
Attractiveness = % visits with tag / % sales with tag
For ski sales, indoor pictures perform better.
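The attractiveness metric above is a simple ratio of shares; a small helper makes it concrete. The numbers in the example are made up for illustration.

```python
# Attractiveness of a tag, as defined on the slide:
#   attractiveness = (% of visits on sales carrying the tag)
#                  / (% of sales carrying the tag)
# A value above 1 means sales with that tag attract more visits than their
# share of the catalogue would predict.

def attractiveness(visits_with_tag, total_visits, sales_with_tag, total_sales):
    """Ratio of the tag's visit share to its catalogue share."""
    visit_share = visits_with_tag / total_visits
    sale_share = sales_with_tag / total_sales
    return visit_share / sale_share

# Illustrative numbers: a tag appears on 5% of sales but draws 8% of visits,
# so it over-performs (attractiveness of roughly 1.6).
score = attractiveness(visits_with_tag=800, total_visits=10_000,
                       sales_with_tag=50, total_sales=1_000)
print(score)
```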
29. What about APIs? Use them for generating labels!
How to steal a model:
• 1) Score part of the database for training
• 2) Train a model
• 3) Score your entire database!
(Or don't, it's illegal)
But I only have 5000 requests?
-> Use transfer learning!
30. What about APIs? Use them for generating labels!
Experiment:
• 5000 requests on the API
-> 4500 for training, 500 for validation
-> 180 classes to predict
• Transfer learning with the MIT Places pre-trained model
• scikit-learn multilabel model
• One-vs-the-rest
• Untuned logistic regression
(demo, not used in any real project)
(Or don't, it's illegal)
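The experiment can be approximated as follows. The feature vectors are random stand-ins for the transfer-learning features, the targets stand in for the API's answers, and the class count is reduced from 180 to 20 to keep the demo fast; also, where the slide describes a multilabel setup, this sketch uses single-label targets for simplicity. The model itself is the untuned one-vs-rest logistic regression from the slide.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical setup: 5000 images were sent to the external API; we keep our
# transfer-learning features (random stand-ins here) and the API's labels,
# split 4500 for training and 500 for validation.
n, dim, n_classes = 5000, 64, 20   # slide uses 180 classes; 20 keeps the demo fast
X = rng.normal(size=(n, dim))
true_w = rng.normal(size=(dim, n_classes))
y = np.argmax(X @ true_w, axis=1)          # pretend these are the API's labels
X_tr, y_tr, X_val, y_val = X[:4500], y[:4500], X[4500:], y[4500:]

# The "stolen" model: an untuned one-vs-rest logistic regression trained to
# imitate the API's labelling, then usable on the whole database for free.
clone = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
print("agreement with API on validation:", clone.score(X_val, y_val))
```

The point of the demo is that even a simple untuned linear model, trained on a few thousand API responses over good transfer features, reproduces much of the API's behaviour.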
31. What about APIs? Results
• Accuracy: 95%
• Recall: 80%
• Precision: 75%
First image (label: probability):
• landscape: 1.0000, sunset: 0.9998
• sky: 1.0000, no person: 0.9996
• outdoors: 1.0000, water: 0.9990
• nature: 1.0000, park: 0.9849
• rock: 1.0000, river: 0.9678
• travel: 1.0000, scenic: 0.8031
Second image (label: probability):
• beach: 1.0000, ocean: 1.0000
• summer: 1.0000, relaxation: 1.0000
• sand: 1.0000, island: 1.0000
• tropical: 1.0000, idyllic: 1.0000
• travel: 1.0000, seashore: 0.9998
• seascape: 1.0000, water: 0.9997
(demo, not used in any real project)