This document discusses real-time recommendation systems and describes the Sifarish recommendation engine implementation. Sifarish uses Hadoop, Storm, and Redis to process both batch and real-time recommendations. It generates recommendations through content-based analysis, social recommendations based on user behavior, and real-time processing of new user event data through Storm. Sifarish provides features like implicit rating generation, item correlation analysis, time-sensitive recommendations, and business goal injection for generating personalized recommendations at scale.
2. CONTENTS
Recommendation processing concepts
Hadoop, Storm & Redis based Recommendation
Engine implementation in ‘Sifarish’.
Content based recommendation and social
recommendation
Key distinguishing features of ‘Sifarish’ compared
to Apache Mahout
Real time Social Recommendations
3. HADOOP AT 30,000 FT
Power of functional programming and parallel
processing join hands to create Hadoop
Basically parallel processing framework running on
cluster of commodity machines
Stateless functional programming because
processing of each row of data does not depend
upon any other row or any state
Divide and conquer parallel processing. Data gets
partitioned and each partition gets processed by a
separate mapper or reducer task.
4. STORM AT 30,000 FT
Clustered framework for scalable real time stream
processing
Like Hadoop, parallel processing framework running
on cluster of commodity machines
Instead of processes as in Hadoop, uses a combination
of processes and threads for parallelism
Unlike 2 processing stages in Hadoop (map and
reduce) there can be multiple processing stages
defined in a Storm topology.
Unlike a Hadoop job, a topology once deployed runs
continuously.
5. REDIS AT 30,000 FT
It’s a wonderful glue for the Big Data ecosystem
Can be thought of as a distributed data
structure server
Can be used as a list, queue, cache etc.
Supports master slave replication
There is no sharding support
6. RECOMMENDATION SYSTEMS
• You know recommender systems if you have visited
Amazon or Netflix.
• Very computationally intensive, ideal for Big Data
processing.
• In memory-based recommendation engines, the entire
data set is used directly, e.g., user-behavior-based
recommendation (a.k.a. social recommendation) or
content-based recommendation. This is our focus.
• In model-based recommendation, a model is first built by
training on the data and then used to make predictions,
e.g., Bayesian, decision tree
7. CONTENT BASED RECOMMENDATION
Recommendation is based on innate attributes of
items under consideration
Each item is considered to be a point in an n
dimensional feature space, where the item has n
attributes
Distance between items in n dimensional space is
computed to find similarities between items.
Similarity is inversely proportional to distance
Attributes can be numerical, categorical or text.
Not effective for cross-sell recommendation
Essential for boot strapping recommender system
8. CONTENT BASED RECOMMENDATION
Distance between numerical attributes is simply
the difference in values
Distance between categorical attributes is 0 if
same, 1 otherwise
Distance between text attributes is based on
either jaccard distance or cosine distance
Distance between corresponding attributes is
aggregated to find distance between items
Different weights can be assigned to different
attributes for the aggregation to control the
contribution of particular attribute
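The per-attribute distances and their weighted aggregation described above can be sketched as follows. This is a minimal illustration, not Sifarish's implementation: the schema layout, the function names, the division of numerical differences by a value range (to keep every attribute distance in [0, 1]), and the use of word-set Jaccard distance for text are all assumptions made for the example.

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the sets of lowercase words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def item_distance(x: dict, y: dict, schema: list) -> float:
    """Weighted average of per-attribute distances between two items.

    schema is a hypothetical list of (name, kind, weight, value_range);
    value_range is only used for numerical attributes.
    """
    total, weight_sum = 0.0, 0.0
    for name, kind, weight, value_range in schema:
        if kind == "numerical":
            # Difference in values, scaled by the attribute's range (assumption).
            d = abs(x[name] - y[name]) / value_range
        elif kind == "categorical":
            d = 0.0 if x[name] == y[name] else 1.0
        elif kind == "text":
            d = jaccard_distance(x[name], y[name])
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        total += weight * d
        weight_sum += weight
    return total / weight_sum


schema = [
    ("price", "numerical", 1.0, 100.0),
    ("brand", "categorical", 2.0, None),
    ("title", "text", 1.0, None),
]
a = {"price": 20.0, "brand": "acme", "title": "red running shoe"}
b = {"price": 70.0, "brand": "acme", "title": "blue running shoe"}
print(item_distance(a, b, schema))
```

Raising the weight of an attribute (here `brand`) makes agreement or disagreement on that attribute dominate the aggregated distance.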
9. COLD START
When bootstrapping a business no user behavior
data is available.
Content based recommendation is the only option.
Distance calculation is performed between user
profile and items.
Two different kinds of entities. Attributes from one
entity are mapped to attributes of the other entity.
User profile may have been provided explicitly by
user or derived from user behavior e.g. pages
visited, search terms etc.
10. WARM START
Refers to the case when some limited amount of
interaction data is available
The user may have browsed and / or bought some
item
We use content based recommendation again, but
we find similarities between items of same type
(e.g., product)
Use the SameTypeSimilarity MR to find the distance
between items for all possible pairs
11. SOCIAL RECOMMENDATION
• Customers are fully engaged and significant amount of
user behavior data is available
• Recommendation algorithms are based on user behavior
data only
• Consider a matrix of user and item. Items are rows and
users are columns a.k.a utility matrix. The matrix is
sparse
• The cell value could be boolean e.g., whether user has
purchased an item or shown interest in some way
• The cell value could also be numeric, representing a rating.
The rating could be explicit or derived from user
behavior data
12. SOCIAL RECOMMENDATION
• The purpose of recommenders is to fill in the blanks
in the utility matrix
• If a user has rated A, then enough users must have
rated A as well as other items, for recommendation
to be effective
• Effective in cross sell recommendation.
• The utility matrix is dynamic causing drift in the
underlying model.
• Periodic re-computation is necessary depending
upon the rate of change
13. DISTANCE BASED SOCIAL RECOMMENDATION
• Consider rows of the utility matrix, which are item
vectors. Each vector is n dimensional if there are n users
• We can find distances between pairs of item vectors
14. ITEM CORRELATION
• We can find distances between pairs of item
vectors, using the distance algorithms discussed
earlier.
• ItemDynamicAttributeSimilarity is the MR used.
Distance or correlation algorithm can be
configured to Jaccard, Cosine or Pearson.
• This is known as item based correlation. The other,
although less preferred, approach is user based
correlation.
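As an illustration of item based correlation, the sketch below computes cosine and Jaccard similarity between two item vectors stored sparsely as `{userID: rating}` dictionaries. The sparse representation and the function names are assumptions for this example; Pearson correlation is omitted for brevity.

```python
import math


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse item vectors."""
    common = u.keys() & v.keys()
    dot = sum(u[k] * v[k] for k in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_similarity(u: dict, v: dict) -> float:
    """Fraction of users who engaged with both items (ratings ignored)."""
    union = u.keys() | v.keys()
    if not union:
        return 0.0
    return len(u.keys() & v.keys()) / len(union)


# Two item vectors: only user u2 has rated both items.
i1 = {"u1": 3.0, "u2": 4.0}
i2 = {"u2": 4.0, "u3": 3.0}
print(cosine_similarity(i1, i2))   # 16 / (5 * 5) = 0.64
print(jaccard_similarity(i1, i2))  # 1 shared user out of 3
```

Jaccard suits boolean cell values, while cosine makes use of numeric ratings.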
16. IMPLICIT RATING ESTIMATE
• Generally users don’t explicitly rate items. Explicit
ratings tend to be biased because users with extreme
views tend to rate more
• The MR ImplicitRatingEstimator converts user
engagement data (e.g., browsing a product
description page or product review page, placing an item
in the shopping cart, etc.) to a rating value.
• This is an optional processing phase, necessary only
when explicit rating data is not available
17. RATING PREDICTOR
• Based on the rating by a user u1 for item i1, the rating
for an item i2 is predicted using the correlation
between i1 and i2
• The MR job for rating prediction is UtilityPredictor
• The correlation between items can be
multiplicative or additive. The type of correlation to
be used can be set through a configuration
parameter.
• For multiplicative correlation, the algorithms are
Jaccard, Cosine or Pearson, as mentioned earlier.
• The next slide is on additive correlation
18. ADDITIVE ITEM CORRELATION
• Also known as Slope One Recommender
• If a set of users have rated two items i1 and i2, we
find the average rating difference between i1
and i2.
• If a user has a rating for i2, we can predict the rating
for i1 based on the average of the difference
• The steps can be repeated, e.g. find average rating
difference between i3 and i1 and if the user has
rating for i3, get another prediction for rating of i1.
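The additive (Slope One) steps above can be sketched as follows. The nested dictionary layout `{userID: {itemID: rating}}` and the function names are assumptions for this example, and the final unweighted mean over per-item predictions is one simple way to combine them; weighted variants exist.

```python
def average_difference(ratings: dict, target: str, other: str):
    """Mean of (rating[target] - rating[other]) over users who rated both."""
    diffs = [
        r[target] - r[other]
        for r in ratings.values()
        if target in r and other in r
    ]
    return sum(diffs) / len(diffs) if diffs else None


def predict_slope_one(ratings: dict, user: str, target: str):
    """Predict the user's rating for target from each item the user has rated."""
    predictions = []
    for other, rating in ratings[user].items():
        if other == target:
            continue
        diff = average_difference(ratings, target, other)
        if diff is not None:
            predictions.append(rating + diff)
    return sum(predictions) / len(predictions) if predictions else None


ratings = {
    "u1": {"i1": 5, "i2": 3, "i3": 4},
    "u2": {"i1": 4, "i2": 3, "i3": 2},
    "u3": {"i2": 2, "i3": 3},  # u3 has not rated i1
}
# i1 is rated 1.5 higher than i2 on average, and 1.5 higher than i3.
print(predict_slope_one(ratings, "u3", "i1"))  # (3.5 + 4.5) / 2 = 4.0
```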
19. AGGREGATION OF PREDICTED RATING
• If a user u1 has rated items i1, i2, ..., i5, all of them
could be correlated to an item i9. All 5 items will
contribute towards prediction of rating for the item
i9
• The MR UtilityAggregator aggregates predicted
rating.
• We can take either the average or the median of all
predicted ratings during aggregation. The choice can be made
through configuration
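A minimal sketch of the aggregation step, assuming the predicted ratings arrive as `(userID, itemID, rating)` tuples. The function name and the `method` parameter are hypothetical stand-ins for the configuration choice mentioned above.

```python
from statistics import mean, median


def aggregate_predictions(predictions, method="average"):
    """Combine all predicted ratings for the same (user, item) pair."""
    grouped = {}
    for user, item, rating in predictions:
        grouped.setdefault((user, item), []).append(rating)
    combine = {"average": mean, "median": median}[method]
    return {key: combine(values) for key, values in grouped.items()}


# Three rated items each contribute a prediction for item i9.
predictions = [("u1", "i9", 4.0), ("u1", "i9", 5.0), ("u1", "i9", 1.5)]
print(aggregate_predictions(predictions, "average"))  # {('u1', 'i9'): 3.5}
print(aggregate_predictions(predictions, "median"))   # {('u1', 'i9'): 4.0}
```

The median is less sensitive to a single outlying prediction, such as the 1.5 in this example.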
20. BUSINESS GOAL INJECTION
• This is an optional processing phase, where items
are associated with scores indicative of business
interest (e.g. preferring items with excess
inventory) in recommending an item
• Final recommendation score is a weighted average
between predicted rating and the business goal
score. The relative weights are configurable.
• The MR for this processing is BusinessGoalInjector
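The weighted average described above can be illustrated as follows. This sketch assumes that predicted ratings and business goal scores are on the same scale; the function name, the default weights, and the default score for items without a business goal are hypothetical.

```python
def inject_business_goal(predicted, goal_scores,
                         rating_weight=0.7, goal_weight=0.3,
                         default_goal=0.0):
    """Blend predicted ratings with business goal scores per item."""
    total_weight = rating_weight + goal_weight
    return {
        item: (rating_weight * rating
               + goal_weight * goal_scores.get(item, default_goal)) / total_weight
        for item, rating in predicted.items()
    }


predicted = {"i1": 4.0, "i2": 3.0}
goal_scores = {"i1": 1.0, "i2": 5.0}  # e.g. i2 has excess inventory
final = inject_business_goal(predicted, goal_scores, 0.5, 0.5)
print(final)  # {'i1': 2.5, 'i2': 4.0}
```

With equal weights, the item with the lower predicted rating but the higher business score ends up ranked first.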
21. GROUP BY USER
• This is an optional task that groups the
recommended items produced by the
processing steps discussed so far by user ID
• The MR class TextSorter performs this task
22. TIME SENSITIVE RECOMMENDATION
• A timestamp is associated with rating
data. Each cell in the rating matrix has
an associated timestamp.
• When processing, past rating data
beyond a specified time window is
discarded.
• Time window can be specified as a
configuration parameter.
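As a sketch of the time window filter, assume each rating is a `(userID, itemID, rating, timestamp)` tuple with the timestamp in epoch seconds. The tuple layout and the function name are assumptions for this example.

```python
def within_window(ratings, now, window_seconds):
    """Keep only ratings whose timestamp falls inside the time window."""
    cutoff = now - window_seconds
    return [r for r in ratings if r[3] >= cutoff]


ratings = [
    ("u1", "i1", 4.0, 100),  # too old, will be discarded
    ("u1", "i2", 3.0, 900),
]
print(within_window(ratings, now=1000, window_seconds=200))
# [('u1', 'i2', 3.0, 900)]
```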
23. USER SEGMENTATION
• When the user population is not homogeneous, it
is better to segment the users by clustering
or other means
• Separate utility matrix should be built for
each segment.
• Ratings should be predicted for each
segment separately by running the MR
pipeline for each segment
24. KEY DISTINGUISHING FEATURES OF SIFARISH
•Implicit rating generation from explicit user
engagement events for social recommendation
•Semantic matching using RDF model for knowledge
representation for content based recommendation
•Supports time window, location attributes for content
based recommendation
•Time sensitive social recommendation
•Business goal infused social recommendation
•Real time social recommendation
•Serendipity and novelty in social recommendation
(planned)
26. REAL TIME
RECOMMENDATION PROCESSING FLOW
• 1 - Copy historical event click stream data to HDFS
• 2 - Copy the output of the MR pipeline, i.e. the item
correlation matrix, to the Redis cache. This needs to be done
whenever the correlation matrix is recomputed
• 3 - Copy event mapping metadata to Redis cache.
This is a one-time operation.
• 4 - Write real time event click stream data to Redis
queue
• 5 - Storm consumes event mapping metadata from
Redis cache when the Storm topology starts up.
27. REAL TIME
RECOMMENDATION PROCESSING FLOW
• 6 - Storm consumes item correlation matrix from
Redis cache
• 7 - Storm consumes event click stream data from
Redis queue
• 8 - Storm writes recommended items for a user to
Redis queue or cache
• 9 - Application server consumes recommended
items from Redis queue or cache
28. REAL TIME
RECOMMENDATION PROCESSING
• Only recent user engagement data is used. Recency
is defined per session, by time window or event
count.
• However, historical user engagement event data is used
to compute the item correlation matrix using Hadoop.
• Historical user engagement event data is converted
to implicit rating by Hadoop MR which is consumed
by several more Hadoop MR to generate the item
correlation matrix.
• Item correlation matrix is saved in Redis as a map
for later consumption by Storm
29. REAL TIME RECOMMENDATION
PROCESSING
• Storm ingests real time user engagement click
stream data from a Redis queue and uses the item
correlation matrix generated by Hadoop to make
real time recommendations
• Storm writes recommended items to another Redis
queue or cache
• In the next several slides we will go through some
details of the steps involved
30. GENERATE IMPLICIT RATING
• As mentioned earlier this is generated by a Hadoop MR
ImplicitRatingEstimator.
• Uses pre-processed click stream data consisting of
(userID, sessionID, eventType, timestamp).
• There are different event types indicative of user’s level
of intent or interest for an item e.g. purchased item, in
checkout, placed in shopping cart, browsed from search
results etc.
• Events with the strongest intent level are extracted from the
click stream along with the counts for such events. This
information is mapped to an implicit rating based on some
heuristics.
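The heuristic described above can be sketched as follows. Everything specific here is hypothetical: the event type names, their intent ordering, and the rating table are invented for illustration and are not taken from ImplicitRatingEstimator. The sketch also assumes each event carries an itemID and ignores the session and timestamp fields.

```python
from collections import Counter

# Hypothetical intent levels: a higher number means stronger intent.
INTENT = {"browse": 1, "cart": 2, "checkout": 3, "purchase": 4}

# Hypothetical rating table: rating for 1, 2, or 3+ occurrences of the event.
RATING = {
    "browse": [1, 2, 2],
    "cart": [3, 3, 4],
    "checkout": [4, 4, 5],
    "purchase": [5, 5, 5],
}


def implicit_ratings(events):
    """Map (userID, itemID, eventType) events to one rating per (user, item)."""
    counts = Counter(events)
    strongest = {}
    for (user, item, event_type), count in counts.items():
        best = strongest.get((user, item))
        # Keep only the event type with the strongest intent, with its count.
        if best is None or INTENT[event_type] > INTENT[best[0]]:
            strongest[(user, item)] = (event_type, count)
    return {
        key: RATING[event_type][min(count, 3) - 1]
        for key, (event_type, count) in strongest.items()
    }


events = [
    ("u1", "i1", "browse"),
    ("u1", "i1", "cart"),
    ("u1", "i1", "cart"),
    ("u1", "i2", "browse"),
]
print(implicit_ratings(events))  # {('u1', 'i1'): 3, ('u1', 'i2'): 1}
```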
31. CONVERTING IMPLICIT RATING TO A
COMPACT FORM
• Implicit rating generated in the previous step is of the
format (userID, itemID, rating)
• However, the item correlation generating MR
ItemDynamicAttributeSimilarity expects data in a
compact format as (itemID1, userID1:rating1,
userID2:rating2,..)
• The format conversion is done through the Hadoop MR
CompactRatingFormatter. It’s essentially a group by
operation.
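The group by operation can be illustrated in a few lines. This is an in-memory sketch of the format conversion only, not the MapReduce job; the function name and the comma-separated output are assumptions based on the formats listed above.

```python
from collections import defaultdict


def compact_ratings(triples):
    """Group (userID, itemID, rating) triples into one line per item."""
    by_item = defaultdict(list)
    for user, item, rating in triples:
        by_item[item].append(f"{user}:{rating}")
    return [",".join([item] + pairs) for item, pairs in sorted(by_item.items())]


triples = [("u1", "i1", 5), ("u2", "i1", 3), ("u1", "i2", 4)]
for line in compact_ratings(triples):
    print(line)
# i1,u1:5,u2:3
# i2,u1:4
```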
32. ITEM CORRELATION
• The MR ItemDynamicAttributeSimilarity generates item
correlation with the output format (itemID1, itemID2,
corr1)
• There are many configuration parameters involved, the
most important being the correlation algorithm, the choices
being Jaccard, Cosine and Pearson
• For real time processing the correlation data needs to
be in a sparse matrix form.
• The MR CorrelationMatrixBuilder does the necessary
transformation with the output being of the format
(itemID1, itemID2:corr1, itemID3:corr2,….)
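A small in-memory sketch of the pair-to-row transformation follows. It assumes each item pair appears once in the input and that the correlation is symmetric, so the value is written into both rows; whether CorrelationMatrixBuilder does the same is not stated in the source. Function names are hypothetical.

```python
from collections import defaultdict


def build_correlation_rows(pairs):
    """Turn (itemID1, itemID2, corr) pairs into sparse rows keyed by item."""
    rows = defaultdict(dict)
    for a, b, corr in pairs:
        rows[a][b] = corr
        rows[b][a] = corr  # assumption: correlation is symmetric
    return dict(rows)


def format_row(item, row):
    """Render one row as itemID1,itemID2:corr1,itemID3:corr2,..."""
    return ",".join([item] + [f"{other}:{corr}" for other, corr in sorted(row.items())])


rows = build_correlation_rows([("i1", "i2", 0.8), ("i1", "i3", 0.5)])
print(format_row("i1", rows["i1"]))  # i1,i2:0.8,i3:0.5
```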
33. CACHING ITEM CORRELATION
• Item correlation matrix is loaded into a Redis map
using a python script. The map key is the item ID
and the value is the list of correlated itemIDs along
with corresponding correlation coefficients
• A Storm bolt reads the correlated items and
coefficients from Redis, when it receives a new user
engagement tuple from the Redis queue. The Storm
bolt also caches the correlation values in an
in-memory Google Guava cache.
34. CACHING USER EVENT TO RATING
MAPPING METADATA
• This mapping meta data is used by Storm bolt
to convert real time user engagement event
data to implicit rating
• The metadata JSON file content is loaded into
a Redis cache by a python script.
• This is a one-time operation. However, it
needs to be reloaded if the metadata is
changed.
35. STORM PROCESSING
• A Storm Redis Spout consumes user event data from a
Redis queue.
• The event data is distributed across multiple Storm Bolt
instances. The data is partitioned by userID (field grouped
in Storm terminology)
• On receipt of the event data, the bolt estimates the rating
based on recent user engagement event click stream data
• It also looks up the corresponding row of the item
correlation matrix from the in memory Google Guava
cache using itemID as the key
• Guava cache loads from Redis cache in case of cache miss.
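The load-on-miss behavior of the in-memory cache can be sketched in Python as a small read-through cache. This is only an analogy for the Guava cache used by the bolt: the class, the `parse_row` helper, and the plain dictionary standing in for the Redis map are all hypothetical.

```python
from collections import OrderedDict


class LoadingCache:
    """Bounded cache that calls a loader function on a miss (LRU eviction)."""

    def __init__(self, loader, max_size=1000):
        self._loader = loader
        self._max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = self._loader(key)  # cache miss: fall back to the backing store
        self._data[key] = value
        if len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value


def parse_row(text):
    """Parse 'itemID2:corr1,itemID3:corr2' into {itemID: corr}."""
    row = {}
    for pair in text.split(","):
        item, corr = pair.split(":")
        row[item] = float(corr)
    return row


# A plain dict stands in for the Redis map keyed by item ID.
fake_redis = {"i1": "i2:0.8,i3:0.5"}
cache = LoadingCache(lambda item_id: parse_row(fake_redis[item_id]))
print(cache.get("i1"))  # loaded from the backing store
print(cache.get("i1"))  # served from memory
```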
36. STORM PROCESSING
• Predicted ratings are calculated for items correlated with the
item in the user event using the estimated rating and the item
correlation row vector
• The predicted rating vector is aggregated with the cumulative
predicted rating vector
• The cumulative predicted rating vectors are sorted by rating
value and the top n items along with associated predicted
ratings are written to a Redis queue.
• Output is written to the Redis queue in the format
(userID1,itemID1:rating1,itemID2:rating2)
• Optionally, the recommendation output can be written to a
Redis cache with userID as the key and recommended items as
the value
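The prediction, accumulation, and top n steps above can be sketched as follows. The source does not say how the new predictions are aggregated with the cumulative vector, so the sketch assumes a simple sum of `estimated_rating * correlation`; the function names and the two-decimal output formatting are also assumptions.

```python
import heapq


def update_recommendations(cumulative, estimated_rating, correlation_row, top_n=3):
    """Fold one event's predicted ratings into the user's cumulative vector."""
    for item, corr in correlation_row.items():
        # Assumption: predictions are accumulated by summation.
        cumulative[item] = cumulative.get(item, 0.0) + estimated_rating * corr
    return heapq.nlargest(top_n, cumulative.items(), key=lambda kv: kv[1])


def format_output(user, ranked):
    """Render as userID1,itemID1:rating1,itemID2:rating2."""
    return ",".join([user] + [f"{item}:{rating:.2f}" for item, rating in ranked])


cumulative = {}
update_recommendations(cumulative, 4.0, {"i2": 0.5, "i3": 0.25})
ranked = update_recommendations(cumulative, 2.0, {"i3": 1.0, "i4": 0.5}, top_n=2)
print(format_output("u1", ranked))  # u1,i3:3.00,i2:2.00
```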
37. EVENT CLICK STREAM
• The Storm bolt maintains a window of recent user
engagement event click stream data in an in-memory
cache
• The click stream data expiry in the window can be
managed in several ways driven by configuration
• If data is expired by session, whenever a new session is
encountered for a user, the window is cleared
• If data is expired by time span, any event data older than
the time span is discarded from the window
• If data is expired by a maximum count, older data is
discarded from the window when the window size
exceeds limit
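The three expiry policies can be sketched with a single per-user window class. The class name, the policy strings, and the parameter names are hypothetical, chosen only to mirror the configuration options described above.

```python
from collections import deque


class EventWindow:
    """Per-user window of recent events with a configurable expiry policy."""

    def __init__(self, policy, max_age=None, max_count=None):
        self.policy = policy        # "session", "time", or "count"
        self.max_age = max_age      # seconds, used by the "time" policy
        self.max_count = max_count  # used by the "count" policy
        self.session = None
        self.events = deque()

    def add(self, session_id, timestamp, event):
        if self.policy == "session" and session_id != self.session:
            self.events.clear()  # a new session starts with an empty window
        self.session = session_id
        self.events.append((timestamp, event))
        if self.policy == "time":
            # Drop events older than the time span, relative to the newest event.
            while self.events and self.events[0][0] < timestamp - self.max_age:
                self.events.popleft()
        elif self.policy == "count":
            while len(self.events) > self.max_count:
                self.events.popleft()


window = EventWindow("count", max_count=2)
for t, e in [(1, "browse"), (2, "cart"), (3, "checkout")]:
    window.add("s1", t, e)
print(list(window.events))  # [(2, 'cart'), (3, 'checkout')]
```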