This document discusses building an impersonal recommendation system using big data. It describes different recommendation approaches: collaborative filtering, knowledge-based, content-based, and hybrid. An impersonal recommender provides suggestions without user profiles by analyzing customer purchase histories to find associations between related items. The document proposes using Apache Hadoop to store and process large datasets and to generate the association rules that power recommendations. Elasticsearch would store and serve the rules, and online evaluation would be used to improve the recommender.
2. Agenda
1. Recommendation system overview
2. Different approaches to building a recommendation system
3. Impersonal recommendation system in theory
4. Impersonal recommendation system in practice
3. Recommendation system
The goal of a recommendation system is to predict the degree
to which a user will like or dislike a set of items, such as
goods or services.
Recommendation systems have become extremely common in
recent years, and are applied in a variety of applications and
fields. The most popular domains are goods, movies, music,
news, books, research articles, search queries, social tags,
restaurants, financial services, life insurance and people
(social networks and online dating).
4. Examples of using
recommenders
Amazon uses a recommendation system to
increase sales (by 35%) and suggests goods
based on the user’s previous experience and
on frequently bought goods
Netflix suggests movies based on the behaviour
of similar users and the user’s previous ratings
(result: 2 of 3 movies are watched after a
recommendation)
5. Examples of using
recommenders
Pandora Radio suggests music based on
the user’s previous experience
In 2012, Target predicted a woman’s
pregnancy before a medical test, based
on changes in her shopping
behaviour
6. Possible approaches
There are several quite different approaches to building a
recommendation system:
❖ Collaborative filtering – based on user interactions (likes,
views, purchases); extremely popular in online services, shops, etc.
❖ Knowledge-based – based on rules and knowledge about items and
user needs; commonly used for impersonal recommendations
❖ Content-based – suggestions come from the similarity of items;
commonly used to suggest text articles, songs
❖ Hybrid – combines the other approaches
7. Collaborative filtering
❖ Collaborative filtering systems, also known as social-filtering
systems, aggregate data about customers’ preferences or
purchasing habits. They then give recommendations based on
similarity between users or in overall behaviour patterns.
❖ For example, Netflix uses a tuned collaborative filtering
algorithm to suggest movies. If user U1 likes movie M1,
and user U2 likes movies M1 and M2, then movie M2
will be recommended to user U1
8. Collaborative filtering
❖ Users’ behaviour history (views, clicks, purchases) is required to
implement a collaborative filtering recommender. The idea is to
find users with similar preferences and give each of them
recommendations based on what the similar users prefer.
❖ In practice, this approach requires access to user profiles and
the capability (both technical and legal) to save each action. After
that, analysis can be run to get a list of preferences for each user.
❖ There is a cold-start problem: it is not possible to produce
recommendations for a new user, because similar users are
not yet known
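The U1/U2 example and the cold-start problem can be sketched in a few lines of Python. This is only a minimal illustration, not Netflix's actual algorithm: the `likes` data, the helper names `jaccard` and `recommend_for`, and the choice of Jaccard similarity are all assumptions made for the example.

```python
def jaccard(a, b):
    """Similarity of two sets of liked items: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_for(user, likes):
    """Rank items liked by similar users that `user` has not liked yet."""
    own = likes[user]
    scores = {}
    for other, other_likes in likes.items():
        if other == user:
            continue
        similarity = jaccard(own, other_likes)
        if similarity == 0:
            continue  # no overlap in behaviour, so this user tells us nothing
        for item in other_likes - own:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties are broken by item name for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical interaction history (illustrative data only).
likes = {
    "U1": {"M1"},
    "U2": {"M1", "M2"},
    "U3": {"M3"},
    "new_user": set(),
}

print(recommend_for("U1", likes))        # M2 is suggested via U2
print(recommend_for("new_user", likes))  # empty list: the cold-start problem
```

A user with no recorded actions has zero similarity to everyone, so the sketch returns nothing for them, which is exactly the cold-start situation described above.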
9. Knowledge based
recommenders
❖ Suggest products/services based on inferences about a
user’s preferences and needs. There are several different
types of these systems: some use prebuilt/already
known rules, others build the rules dynamically.
❖ Unlike collaborative filtering, this approach doesn’t require
user profiles. Recommendations can be given based on
predefined or dynamically created rules.
❖ This approach can be used not only for online applications,
but also for offline use cases such as retail
10. Knowledge based
recommenders
For example, there is a recommender built by Yhat that
suggests a new sort of beer to try based on knowledge about
beer (e.g. a user who likes a light lager with a known aroma,
palate, etc. will like the similar beer XXX).
http://jeweell.com/ct/food/1133467-beer.html
11. Content-based
recommenders
❖ Content-based recommenders build on machine learning
research (particularly clustering and classification). They are
commonly used by news aggregators to suggest new stories the
user might like to read and to cluster them into groups.
❖ For example, Google News recommendations for the article:
12. Hybrid approach
❖ Combines the previously described methods to reach the best
performance.
❖ There is the well-known cold-start problem, when the algorithm
doesn’t have the data to give recommendations for a new
user/product. It can be solved by using different
approaches for new and for well-known
users or products. For example, goods might be
recommended by collaborative filtering, while a knowledge-based
recommender is used for new users/products (for which we
don’t have history yet)
13. Which approach to choose?
❖ A thorough analysis is required to choose the
right approach for each use case.
❖ Several approaches may solve the same problem,
and the right one is not easy to choose, because many
factors influence the performance of recommendations
and different goals may lead to different approaches.
14. Which approach to choose?
Let’s imagine a user living in Lviv with some café preferences. He is making a short
weekend trip to London. What could be recommended to him in London?
All previously mentioned
approaches are applicable to
answer this question:
•Collaborative filtering
•Knowledge-based
•Content-based
15. Impersonal recommender
The idea of an impersonal recommender is to give
recommendations not for a particular user, but in general.
For instance, it may be used in retail to find complementary
goods. Some cases are far from obvious:
Wal-Mart discovered that diapers are sold together with
expensive beer on Friday evenings. Placing them together
led to geometric growth in sales.
16. Impersonal recommender
Applicable in different areas:
• retail, by employees to increase
revenue/sales
• e-commerce, as a low-budget
approach to making
recommendations on a web site
• interactive navigation kiosks
http://smartcity.prom.ua/g2763766-interaktivnyj-sensornyj-kiosk
17. Data Science way of getting
things done
http://www.tomatosphere.org/teacher-resources/teachers-guide/principal-investigation/scientific-method.cfm
18. The problem
There is a history of customers’ actions:
{a1, a2, a3}
{a2, a3, a5, a6}
{a4, a2}
{a1, a5, a6, a3}
{a3, a5, a2}
…
What should we suggest to a customer who has already
performed {a2, a5} (let’s assume that order doesn’t matter)?
19. Naive approach: frequent
item-sets
Affinity analysis is used to build frequent item-sets, which
are widely used in market basket analysis.
Several algorithms have been created to perform affinity
analysis (Apriori, FP-Growth).
Unfortunately, this is not enough for recommendations: frequent item-sets don’t
filter out already-purchased goods and “cannibals”.
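To make the limitation concrete, here is a small Python sketch run on the action history from "The problem" slide. It counts item-sets by brute force rather than with Apriori or FP-Growth, which is fine for five transactions but would not scale; the function name `frequent_itemsets` and the thresholds `min_count` and `max_size` are assumptions chosen for the illustration.

```python
from collections import Counter
from itertools import combinations

# The action history from "The problem" slide.
transactions = [
    {"a1", "a2", "a3"},
    {"a2", "a3", "a5", "a6"},
    {"a4", "a2"},
    {"a1", "a5", "a6", "a3"},
    {"a3", "a5", "a2"},
]


def frequent_itemsets(transactions, min_count=2, max_size=3):
    """Count every item-set of up to `max_size` items; keep the frequent ones."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


frequent = frequent_itemsets(transactions)
basket = frozenset({"a2", "a5"})

# Item-sets that overlap the basket still contain what the customer already has.
for itemset, n in sorted(frequent.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    if itemset & basket:
        print(sorted(itemset), n)
```

The listing includes {a2, a5} itself and sets such as {a2, a3}, so the raw item-sets have no direction and do not say which item to suggest next.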
20. Next step: association rules
Association rules are actively used in market basket analysis and may be
effectively used for creating recommendations.
A general association rule looks like:
A => B,
meaning that a purchase of A usually leads to a purchase of B (the rule is user-independent).
A rule has several statistical characteristics (support, confidence, lift) that
show the strength of the rule and may be used for high-quality recommendations
21. Rules for recommendation
It’s not enough to build rules; they must also be interpreted correctly afterwards. The
most important properties of each rule are support (shows how frequent, and so how
important, the rule is), confidence (how much you can rely on the rule) and lift.
Lift is derived from Bayes’ theorem and shows the positive/negative correlation
between the head and the tail of a rule:
head => tail
Thresholds for all these parameters must be chosen for each particular case. In general,
• lift < 1 means negative correlation (the rule works, but has a negative effect)
• lift ~ 1 means no correlation (the rule doesn’t work)
• lift > 1 means positive correlation
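The three metrics can be computed directly from the example history in "The problem" slide. The sketch below uses the usual definitions: support is the share of transactions containing both head and tail, confidence is that count divided by the count of transactions containing the head, and lift is confidence divided by the support of the tail. The function names and the thresholds `min_confidence` and `min_lift` are assumptions for the illustration, and only rules whose head is the whole basket are considered; a real rule miner would also try subsets of the basket.

```python
transactions = [
    {"a1", "a2", "a3"},
    {"a2", "a3", "a5", "a6"},
    {"a4", "a2"},
    {"a1", "a5", "a6", "a3"},
    {"a3", "a5", "a2"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(head, tail, transactions):
    """Return (support, confidence, lift) for the rule head => tail."""
    n = len(transactions)
    n_head = count(head, transactions)
    n_tail = count(tail, transactions)
    n_both = count(head | tail, transactions)
    support = n_both / n                                 # P(head and tail)
    confidence = n_both / n_head if n_head else 0.0      # P(tail | head)
    lift = confidence / (n_tail / n) if n_tail else 0.0  # P(tail | head) / P(tail)
    return support, confidence, lift


def recommend(basket, transactions, min_confidence=0.5, min_lift=1.0):
    """Suggest single items for `basket` using rules basket => {item}."""
    candidates = set().union(*transactions) - basket  # skip already-chosen items
    rules = []
    for item in sorted(candidates):
        support, confidence, lift = rule_metrics(basket, {item}, transactions)
        if confidence >= min_confidence and lift > min_lift:
            rules.append((item, support, confidence, lift))
    # Most reliable rules first, then strongest lift, then item name.
    rules.sort(key=lambda r: (-r[2], -r[3], r[0]))
    return rules


for item, support, confidence, lift in recommend({"a2", "a5"}, transactions):
    print(f"{{a2, a5}} => {item}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

On the five example transactions this suggests a3 first (it appears in every transaction that contains both a2 and a5) and a6 second, while a1 and a4 are dropped because they never occur together with the basket.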
22. Online recommender
evaluation
Of course, the recommender does not
end with generating rules.
The remaining task is to evaluate
the quality of the generated rules. This
makes it possible not only to
compare different models (using
A/B testing), but also to use the
clickstream to improve the rules.
http://www.sitedoublers.com/blog/multivariate-test-victorias-secret
23. Online recommender
improvement
❖ Users’ reactions to recommendations may be used to
improve the quality of the recommender. For example, it’s
possible to save successful/ignored recommendations
and use this information to improve newly generated
recommendations.
❖ This is quite important, because user preferences are not
stable and change over time.
24. Technology stack
❖ There are a lot of existing solutions for building
association rules, made by Oracle, SAS, Microsoft, etc.
❖ However, in the new world of unstructured/semi-structured data
and growing data volumes, this is not enough. The quality of a recommender
depends on the amount of data used to train it. The more
data is available for analysis, the better.
❖ Enterprise data warehouses (EDW) tend to adopt Hadoop as the main storage and processing
system
❖ Here comes the Hadoop-centric solution…
26. Apache Hadoop
Hadoop is designed to store and process petabytes of data and is an
ideal choice as the foundation for a recommendation system.
The Hadoop ecosystem provides a wide range of tools for efficient data processing as
well as a specialised library for data science needs (Apache Mahout), e.g. for
building a recommender
Apache Hadoop is an open-source software framework for storage
and large-scale processing of data-sets on clusters of commodity
hardware.
27. Elasticsearch
Elasticsearch is a search server based on Lucene. It provides a
distributed, multitenant-capable full-text search engine with a
RESTful web interface and schema-free JSON documents. It
is scalable and offers near real-time search.
Elasticsearch is used by GitHub, Foursquare, Etsy, SoundCloud,
Xing and Wikimedia and can handle several TB of data.
Elasticsearch will be used for keeping rules and serving requests
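One possible way to keep rules as JSON documents and look them up for a basket is sketched below. The slides do not specify a document layout, so everything here is an assumption: the field names (`head`, `head_size`, `tail`, `support`, `confidence`, `lift`) are hypothetical, `head` and `tail` are assumed to be mapped as keyword fields, and the query relies on the `terms_set` query of the Elasticsearch query DSL, which requires a version that supports it. The code only builds the JSON bodies and does not call any client library.

```python
import json


def rule_to_document(head, tail, support, confidence, lift):
    """Build a JSON-serialisable document for one rule head => tail."""
    head = sorted(head)
    return {
        "head": head,
        "head_size": len(head),  # lets the query require that all head items match
        "tail": sorted(tail),
        "support": support,
        "confidence": confidence,
        "lift": lift,
    }


def build_query(basket, size=5):
    """Search body: rules whose whole head is in the basket, best lift first."""
    basket = sorted(basket)
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{
                    "terms_set": {
                        "head": {
                            "terms": basket,
                            # at least `head_size` basket items must match the head
                            "minimum_should_match_field": "head_size",
                        }
                    }
                }],
                # never suggest something the customer already has
                "must_not": [{"terms": {"tail": basket}}],
            }
        },
        "sort": [{"lift": {"order": "desc"}}],
    }


print(json.dumps(rule_to_document({"a2", "a5"}, {"a3"}, 0.4, 1.0, 1.25), indent=2))
print(json.dumps(build_query({"a2", "a5"}), indent=2))
```

Storing the size of the head next to the head itself is a design choice that lets the search engine do the subset check, so the serving layer only has to read the `tail` of the returned rules.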