_EECSE6893_001_2014_3_+Yelper-Final+Project+Report

Columbia University E6893 Big Data Analytics Fall 2014 Final Report
Yelp-er: Analyzing Yelp Data
Naman Jain, Naatasha Kenkre, Rhea Goel, Sanket Jain
Computer Science Department
Columbia University
nj2303@columbia.edu, nsk2141@columbia.edu, rg2936@columbia.edu, sj2659@columbia.edu
Abstract—This project analyzes data provided by the
consumer-centric, web-based platform called Yelp that
connects local businesses with users. In an attempt to highlight
how big data analytics can benefit various stakeholders of a
business model, we perform three tasks: one, to benefit the
users, two, to aid the businesses, and three, to help Yelp
improve their own product. First, we generate a query-based
HeatMap to better visualize the search results. Next, we
perform semantic analysis and topic modeling (using LDA) on
user reviews to help businesses identify the latest trends. Last, we
propose a gamification model for Yelp to increase user
engagement.
Keywords- Yelp; HeatMap; Semantic Analysis; Topic
Modeling; LDA; Gamification
I. INTRODUCTION
Retailers and consumer-packaged-goods (CPG) companies
have long had access to vast amounts of transaction data:
every day, companies capture information about every SKU
sold to every customer at every store. In addition,
companies regularly use sophisticated market-research
techniques to analyze the latest market trends and usage
patterns. The knowledge gained from these very large volumes
of data has become a key driver of growth for
consumer-centric companies. Big data analytics offers several
advantages, such as targeted marketing, improved user
engagement, reduced consumer churn, and better-informed
portfolio strategy and product development.
Yelp is one such consumer-centric platform that connects
people with great local businesses. Founded in 2004, Yelp
has come a long way: it has accumulated a total of 67 million
reviews of local businesses and had about 139 million
monthly unique visitors by Q3 2014. It has therefore
generated massive amounts of data with tremendous
potential; this was our primary motivation behind the
project idea.
In our project, we try to demonstrate how such huge
amounts of data can be harnessed to benefit different
facets/stakeholders of a company. Therefore, we perform
three tasks: one, to benefit the users, two, to aid the
businesses, and three, to help Yelp improve their own
product.
First, from the user's perspective, we realized that a user might
want to find out where the most bars or night clubs are, where
Chinese restaurants are concentrated, or to run any other
cuisine- or interest-based query. Currently, Yelp only
provides a list of search results, which can be sorted by
distance from the current location, price, etc. However, there is
no visualization of all the search results. We therefore felt
the need for a mechanism to help users identify hubs
for their interests in the city. We use HeatMaps for this
purpose and highlight the areas on Google Maps with the
largest number of search results for a particular query. This makes
the application more usable by enhancing the look and feel
of the user interface.
Next, for the businesses, we perform semantic analysis and
topic modeling on user reviews to identify the aspects of
their business that are discussed most often. For
instance, for a restaurant, we can identify which dishes
Yelpers talk about the most, or what the overall
opinion of the place is in terms of service, food,
staff, etc. This can help businesses figure out their strengths
and weaknesses. Consequently, they can monetize
the insights about what attracts users.
Last but not least, we propose a gamification model for
Yelp itself. Gamification is the application of
typical elements of game playing (e.g., point scoring,
competition with others, rules of play) to other areas of
activity, typically as an online marketing technique to
encourage engagement with a product or service. This
user-engagement strategy has been used widely in the past
few years and is generally regarded as effective. Our
motivation was our prior experience with a
similar platform commonly used in some major cities of
India, called Zomato. Zomato took restaurant reviews to a
whole new level, with users competing to visit
more places, earn more points, and become a ‘better’ foodie.
Yelp, on the other hand, incorporates this idea only through
the ‘Elite’ community, which is an exclusive,
in-the-know crew. Hence, we extend the concept to all users of
Yelp, aiming to drive better customer retention and lifetime
value. This could also help Yelp increase its customer base,
market value, brand recognition and user loyalty.

The remainder of this report covers related work, the data
source and its description, and the algorithms and
software used for each of the three tasks.
II. RELATED WORKS
Given that Yelp is a popular consumer review platform,
most related work has been centered on user review content
analysis.
The most widely studied topic is the correlation of user reviews
with the reputation or average rating of restaurants or
other businesses on Yelp, such as the optimal aggregation of
consumer ratings by Dai et al. [5]. Another
interesting study is [1], in which Luca studies
how a restaurant's rating is correlated with its
revenue by combining Yelp user review data with
revenue data obtained from government sources. One
important finding of the study was that a one-star increase in
Yelp rating leads to a 5-9 percent increase in revenue.
Another common area of study is the credibility of reviews
on this consumer website. Online reviews have become a
valuable resource for decision-making. However, their
usefulness brings forth the problem of deceptive opinion
spam, in which business owners commit review fraud, either
by leaving positive reviews for themselves or negative
reviews for their competitors. There has been significant
research on the extent and patterns of
review fraud on Yelp. For instance, in [2] Luca and Zervas
highlight the extent of review fraud and suggest that a
business’s decision to commit review fraud responds to
competition and reputation incentives. Likewise, Mukherjee
et al. [3] try to identify the algorithm behind Yelp's
proprietary fake review filter and find
that behavioral features are much more effective than
linguistic features.
Furthermore, there has been significant linguistic analysis as
well, mostly revolving around sentiment analysis and topic
modeling [8], [9]. Another area tries to determine which
reviews are more likely to be relied upon by most users, or
what factors make a review more useful. For example,
Tucker analyzes the influence of elite and
non-elite members of the community, using speech code theory
to explain and evaluate how computer users communicate
by posting reviews on Yelp [4]. The study reveals that,
overall, opinion leaders on Yelp, a group of regular users
who have earned elite status in the community, did carry
more authority with review readers than non-elite members.
In addition, there has been research not based on user
review analysis, such as [6], which studies why
individuals use the website Yelp.com from a uses and
gratifications perspective. The results of this survey-based
study indicate that individuals overwhelmingly use
Yelp.com for information-seeking purposes, followed by
entertainment, convenience, interpersonal utility, and passing
time.
III. SYSTEM OVERVIEW
This year, Yelp introduced an “Academic Dataset”, a
deep dataset drawn from its wealth of data for
research-minded academics. Unlike standard benchmark
datasets, it has real-world relevance for research
projects.
The Challenge Dataset includes data from Phoenix, Las
Vegas, Madison, Waterloo and Edinburgh:
• 42,153 businesses
• 320,002 business attributes
• 31,617 check-in sets
• 252,898 users
• 955,999 edge social graph
• 403,210 tips
• 1,125,458 reviews
The question that arises is what we can learn from
this dataset, and what we could predict from such varied
data. There are myriad options for what we can do
with the given dataset. We could guess a review's rating
from its text alone. We could take all of the reviews of a
business and predict when it will be the busiest, or when the
business is open. We could predict whether a business is good for
kids, has Wi-Fi, or offers parking. We could also predict what
makes a review useful, funny, or cool, or figure out which
business a user is likely to review next. Other tasks include
predicting business categories with a clustering
algorithm, predicting star ratings using sentiment analysis,
or building a visualization of great local businesses.
Clearly, there are innumerable ways to extract meaningful
information from this data. Given our time constraints
and our knowledge at this point, we chose the three
tasks described above.
The following is a brief description of what the dataset
provides:
Yelp provides reviews of the 250 closest businesses.
The dataset provided to us is a single
gzip-compressed file composed of one JSON object per
line. Every object contains a 'type' field, which indicates
whether it is a business, a user, or a review.
We deal with three types of objects:
• Business Objects: The business objects contain
basic information about local businesses. The
'business_id' field is a
unique identifier for the business and can be used with
the Yelp API to fetch even more information for
visualizations. The other fields are as
follows:
o name : the full business name
o neighborhoods : a list of
neighborhood names (might be empty)
o full_address : the localized address of the
business
o city & state : the city and state
in which the business resides
o latitude & longitude : the latitude and
longitude of the business location
o stars : the star rating of the business
o review_count : the number of reviews
written for this business
o photo_url : the URL of the picture
associated with the business
• Review Objects:
The review objects contain the review text, the star
rating, and information on votes that Yelp users
have cast on the review. The user_id and
business_id fields are of substantial use: user_id
associates a review with others by the same user,
and business_id associates it with other
reviews of the same business.
Like the business object, the review object
has its own fields:
o user_id : the identifier of the user
who wrote the review
o stars : the star rating, an integer between
1 and 5
o text : the review text
o date : the date the review was written
o votes : consists of useful, cool and
funny, each of which contains the number of
votes of that kind that users have cast on
the review
• User Objects:
The user objects contain aggregate information
about a single user across all of Yelp (including
businesses and reviews not in this dataset). The
fields present in this object are as follows:
o type : user
o user_id : the unique user identifier, as used in
the review object
o name : the first and last name of the user
o review_count : the number of reviews
written by this user
o average_stars : the floating-point
average of the star ratings given by this user
o votes (useful, funny and cool) : a count
of useful, funny and cool votes across all the
reviews by this user
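As a minimal sketch of how a file in this layout could be read (this is an illustration, not the code used in the project), the snippet below groups the JSON objects by their 'type' field and indexes reviews by 'business_id'. The function names `load_objects` and `reviews_by_business` are hypothetical, as is the file path passed in by the caller.

```python
import gzip
import json
from collections import defaultdict


def load_objects(path):
    """Group the JSON objects of a gzip-compressed, one-object-per-line file by 'type'."""
    by_type = defaultdict(list)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            obj = json.loads(line)
            by_type[obj.get("type", "unknown")].append(obj)
    return by_type


def reviews_by_business(reviews):
    """Index review objects by the business they refer to."""
    index = defaultdict(list)
    for review in reviews:
        index[review["business_id"]].append(review)
    return index
```

Grouping by 'type' first keeps the later tasks independent: the HeatMap only needs the business objects, topic modeling only the reviews, and gamification only the users.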
IV. ALGORITHM AND SOFTWARE PACKAGE DESCRIPTION
Since we performed three tasks, we discuss each task in
detail in this section and talk about the algorithms and
software used.
I. Query-based HeatMap
HeatMaps are a graphical
representation of data in which the individual values
contained in a matrix are represented as colors. We
use Google Maps as our base and highlight hubs
for specific queries. In this case, the higher the
density of results for the search term (e.g., Sushi) in a
particular area, the sharper the color gradient of that
area.
The overall process flow involved the following
steps:
• Filter Restaurants: Yelp hosts a number of
businesses including doctors, hair stylists,
and other services. For this task, we
worked only with restaurants and
with cuisine-based queries.
• Extract location coordinates: Using the
dataset given in the JSON format, we
extracted the latitude and longitude of
each business appearing in a query’s
search results.
• Integrate Google Maps API: These
coordinates were then passed to the script
that generates the HeatMap, which uses
the Google Maps API.
• HeatMap for query results in chosen city:
Finally, we created a web-based UI to
exhibit this functionality. The user can
change the query and city from the
drop-down menus and see the relevant areas
highlighted on the map.
We used the Google Maps JavaScript API v3 for
this purpose. This is a JavaScript API provided by
Google that can overlay a HeatMap on Google
Maps. Using it, we can visualize the HeatMap on an
interactive map of the world
(https://www.google.com/maps). We also provide
options to change the gradient, map view
(Satellite/Terrain), radius, color, etc.
When the user changes the search query or the
desired city from the drop-down menus,
the HeatMap for the given search query is
generated in real time.
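The filtering and coordinate-extraction steps can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual script: the `categories` field is assumed to be a list of category names on each business object (it is not among the fields listed in Section III), and the function names and the 0.01-degree cell size are hypothetical choices. The grid count is only a rough stand-in for the density that the HeatMap renders.

```python
import math
from collections import Counter


def query_points(businesses, city, query):
    """Return (latitude, longitude) pairs for restaurants in `city` that match `query`."""
    wanted = query.lower()
    points = []
    for business in businesses:
        # 'categories' is an assumed field holding a list of category names.
        categories = [c.lower() for c in business.get("categories", [])]
        if business.get("city") != city or "restaurants" not in categories:
            continue
        if wanted in categories:
            points.append((business["latitude"], business["longitude"]))
    return points


def grid_density(points, cell_deg=0.01):
    """Count points per grid cell; the cell size in degrees is an illustrative choice."""
    counts = Counter()
    for lat, lng in points:
        cell = (math.floor(lat / cell_deg), math.floor(lng / cell_deg))
        counts[cell] += 1
    return counts
```

The list returned by `query_points` is the kind of input that would be handed to the script that draws the HeatMap; cells with higher counts correspond to the hubs highlighted on the map.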

Figure 1. HeatMap for search query “Sushi” in Phoenix
A sample screenshot of a HeatMap for “Sushi” in
Phoenix is shown in Figure 1.
II. Semantic Analysis and Topic Modeling
The overall process flow for this task can be described
as follows:
• Extraction of reviews: We first extract
reviews for all the restaurants (again, we focus
only on restaurants, but this is easily
extendable to all businesses) and treat each
review as a separate document.
• Topic Modeling (LDA): We then perform
topic modeling using the most common
algorithm, Latent Dirichlet Allocation.
• Top K words: We then identify relevant topics
and choose the top K words that represent each
topic with a good relevance measure.
• Word Cloud: Finally, we visualize our results
using word clouds.
Topic Modeling has been in use for a while now, and is
a great tool for language processing and text analysis.
In machine learning and natural language processing, a
topic model is a type of statistical model for
discovering the abstract "topics" that occur in a
collection of documents. In other words, topic models are
a suite of algorithms that uncover the hidden thematic
structure in document collections. These algorithms
help us develop new ways to search, browse and
summarize large archives of texts. A "topic" consists of
a cluster of words that frequently occur together. Using
contextual clues, topic models can connect words with
similar meanings and distinguish between uses of
words with multiple meanings.
We have already highlighted the commercial value of
such analysis. It can help businesses figure out their
strengths and weaknesses. This way they can leverage
their strengths, and earn profits by discovering their
Unique Selling Point (USP).
We used Latent Dirichlet Allocation (LDA), which is
perhaps the most common topic model currently in use.
It is a generalization of probabilistic latent semantic
indexing (PLSI) developed by David Blei,
Andrew Ng, and Michael I. Jordan in 2002, allowing
documents to have a mixture of topics [10]. Other topic
models are generally extensions of LDA, such as
Pachinko allocation, which improves on LDA by
modeling correlations between topics in addition to the
word correlations which constitute topics.
LDA is a generative model that allows sets of observations
to be explained by unobserved groups that explain why
some parts of the data are similar. For example, if
observations are words collected into documents, it
posits that each document is a mixture of a small
number of topics and that each word's creation is
attributable to one of the document's topics.
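The generative story described above is commonly written as follows. The notation is introduced here for illustration and does not appear elsewhere in this report: K is the number of topics, D the number of documents, N_d the number of words in document d, and alpha and beta are Dirichlet hyperparameters (this is the smoothed variant, which also places a prior on the topic-word distributions).

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K
  && \text{(word distribution of topic } k\text{)} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D
  && \text{(topic mixture of document } d\text{)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d
  && \text{(topic of the } n\text{-th word)} \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) &&
  && \text{(the observed word)}
\end{align*}
```

Only the words w are observed; inference recovers the topic mixtures and the topic-word distributions, which is what yields the topics and the per-topic word relevance used in this task.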

Figure 2. Word Cloud for identified topic ‘Nightlife/Bar’
Many open-source software packages for topic
modeling have been released. We have used MALLET,
which is a Java-based package for statistical NLP,
document classification, clustering, topic modeling,
information extraction, and other machine learning
applications to text. The MALLET topic modeling toolkit
contains efficient, sampling-based implementations of
LDA, Pachinko Allocation and Hierarchical LDA. It
helped us identify topics, along with the strength of each
topic and the relevance of each word within each topic.
A sample word cloud for a topic ‘nightlife/bar’ is
shown in Figure 2.
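The top-K selection step can be sketched as below. This is a hedged illustration only: it assumes the per-topic word relevance has already been collected into a plain dictionary, and it does not reproduce MALLET's output format. The function name `top_k_words` and the sample topic labels are hypothetical.

```python
def top_k_words(topic_word_weights, k=10):
    """Return the k highest-weighted (word, weight) pairs per topic.

    Ties are broken alphabetically so that the result is deterministic.
    """
    top = {}
    for topic, weights in topic_word_weights.items():
        ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
        top[topic] = ranked[:k]
    return top
```

In a word cloud, each selected word would then be drawn at a size proportional to its weight.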
III. Gamification
We propose a gamification model for Yelp.com based on an
analysis of users' activity on the website. For instance,
review count, i.e., the number of reviews written by a user,
is a good measure of the user's involvement on the website as a
contributor. The website typically sees two kinds of
traffic: readers and
contributors. Yelp’s aim should be to encourage more and
more writers or contributors in order to increase its market
value.
Similarly, the number of fans and friends of a user is a good
indicator of how good the user is at networking. Identifying
the relatively more social users can have several
advantages as well.
However, the number of reviews written by a user alone is
not good enough, as we also need a mechanism to check the
quality of the content provided. Here, the number of votes
and compliments can be used, as these numbers
suggest how useful the content provided by these
Yelpers is.
Another important measure of trust is the length of time the
user has been active on Yelp. A person who
has been on Yelp for several years and has written only a
handful of reviews is probably not as active as someone
who joined only recently and has already written more than
two reviews every month. A regular reader might find
reviews written by the latter kind of user more reliable
and up-to-date.
Thus, we use all the available information to assign tags to
users, such as ‘Popular’, ‘Social’, ‘Newbie’, ‘Lazybones’,
‘Super Active’, and ‘Dependable’.
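One way such tags could be derived is sketched below. This is not the rule set used in the project: every threshold is a hypothetical value chosen for illustration, and the fields `fans`, `friends` and `yelping_since` (a 'YYYY-MM' join date) are assumed to be present on the user object in addition to the fields listed in Section III.

```python
from datetime import date

# Hypothetical cut-offs chosen only for illustration; the report does not state them.
DEFAULT_THRESHOLDS = {
    "popular_fans": 10,
    "social_friends": 50,
    "newbie_months": 3,
    "active_reviews_per_month": 2.0,
    "lazy_reviews_per_month": 0.1,
    "dependable_votes_per_review": 1.0,
}


def months_active(yelping_since, today):
    """Whole months between a 'YYYY-MM' join date and `today`, at least 1."""
    year, month = (int(part) for part in yelping_since.split("-")[:2])
    return max(1, (today.year - year) * 12 + (today.month - month))


def assign_tags(user, today, thresholds=DEFAULT_THRESHOLDS):
    """Assign gamification tags to one user object (assumed field names)."""
    t = thresholds
    months = months_active(user["yelping_since"], today)
    reviews = user.get("review_count", 0)
    rate = reviews / months  # reviews written per month on Yelp
    useful = user.get("votes", {}).get("useful", 0)

    tags = []
    if user.get("fans", 0) >= t["popular_fans"]:
        tags.append("Popular")
    if len(user.get("friends", [])) >= t["social_friends"]:
        tags.append("Social")
    if months <= t["newbie_months"]:
        tags.append("Newbie")
    elif rate <= t["lazy_reviews_per_month"]:
        tags.append("Lazybones")
    if rate > t["active_reviews_per_month"]:
        tags.append("Super Active")
    if reviews and useful / reviews >= t["dependable_votes_per_review"]:
        tags.append("Dependable")
    return tags
```

Using reviews per month rather than the raw review count reflects the point made above: a recent member writing more than two reviews a month counts as more active than a long-standing member with only a handful of reviews.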
As discussed before, such a model helps encourage activity
and users’ contributions to the community, thereby
helping Yelp increase its customer base and brand value
and, perhaps, strengthen its position in a competitive market.
V. EXPERIMENT RESULTS

The tasks we performed were all unsupervised and, as far as
our literature review shows, had not been attempted on Yelp
data before. Hence, we did
not have any ground truth to compare our results with. We
therefore adopted qualitative evaluation strategies for
each task.
To evaluate the usefulness of the query-based HeatMaps, we
conducted an online survey with several questions
comparing the current and the proposed user interface for
search query results. We also asked the users to rate, on a
scale of 1 to 5, the usefulness of HeatMaps in this scenario;
65% of users rated it 4 or 5, and the average rating
was 3.5. Further, about 73% of users agreed that this
model is visually more appealing than the current list view
of search results. In addition, about 95% of users said that they
have, at least once, wanted to know about the hubs of a
particular cuisine in their city. This suggests that such an
addition to the Yelp UI is desired, and that HeatMaps are a
promising means of providing it.
Next, for topic modeling, we inspected the
identified topics and found that they were relevant and made
sense. We also ran the
results by our project leader to get his agreement.
Last, for the gamification model, we wanted our algorithm
to assign a reasonable percentage of users
to each category. For instance, we do not
want most people to be tagged ‘Lazybones’, as that would be
demotivating. On the flip side, we also do not want almost
everyone to be declared ‘Super Active’, as that diminishes the
value of the tag. Hence, we decided on a percentage of users
that we want to belong to each category and modified
our algorithm based on these values, thereby
ensuring that the gamification model is neither too lenient
nor too harsh.
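One simple way to turn a target percentage into a cut-off is to read it off the ranked scores, as in the sketch below. This is an assumption about how such a calibration could be done, not a description of the procedure we followed; the function names and the example share are hypothetical.

```python
def threshold_for_top_share(values, target_share):
    """Cut-off such that roughly `target_share` of `values` are at or above it.

    Ties at the cut-off can push the actual share slightly above the target.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < target_share <= 1:
        raise ValueError("target_share must be in (0, 1]")
    ranked = sorted(values, reverse=True)
    count = max(1, int(len(ranked) * target_share))
    return ranked[count - 1]


def tag_share(values, threshold):
    """Fraction of users whose score reaches the threshold."""
    return sum(1 for value in values if value >= threshold) / len(values)
```

For example, passing every user's reviews-per-month figure with a target share of 0.2 returns the rate a user must reach for about one fifth of users to earn a tag such as ‘Super Active’; `tag_share` can then be used to check the resulting split.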
VI. CONCLUSION
The project tries to highlight the power of big data analytics
by analyzing the Yelp dataset and performing three different
tasks to benefit all stakeholders of its business model: the
Yelpers, the business owners, and Yelp.com itself. We
visualize search query results using a HeatMap,
perform topic modeling on user reviews for businesses to
understand the latest trends, and propose a gamification
model to encourage user participation.
ACKNOWLEDGMENT
The authors would like to thank Professor Ching-Yung Lin,
who supported us throughout the course of this project. We
are thankful for his inspiring guidance, invaluable
constructive criticism and friendly advice during the project
work. We are sincerely grateful to Bhavdeep Sethi, leader of
the Retail team, for sharing his truthful and illuminating
views on a number of issues related to the project. We
express our gratitude to him, and all the other members of
the teaching staff, who have always been very responsive in
providing necessary information, and without whose
generous support this project would not have taken the shape
it has.
REFERENCES
[1] Luca, Michael. "Reviews, Reputation, and Revenue: The Case of
Yelp.com." Harvard Business School Working Paper, No. 12-016,
September 2011.
[2] Luca, Michael and Zervas, Georgios, Fake It Till You Make It:
Reputation, Competition, and Yelp Review Fraud (November 26,
2014). Harvard Business School NOM Unit Working Paper No. 14-
006.
[3] Mukherjee, A.; Venkataraman, V.; Liu, B.; Glance, N. “What Yelp
Fake Review Filter Might Be Doing?”, International AAAI
Conference on Weblogs and Social Media, North America, Jun. 2013.
[4] Tiana Tucker (2011). “Online Word of Mouth: Characteristics of
Yelp.com Reviews”. The Elon Journal of Undergraduate Research in
Communications Vol. 2, No. 1, 37-42.
[5] Weijia Dai, Ginger Z. Jin, Jungmin Lee, Michael Luca (2012).
“Optimal Aggregation of Consumer Ratings: An Application to
Yelp.com”. NATIONAL BUREAU OF ECONOMIC
RESEARCH, NBER WORKING PAPER SERIES, Paper No.
18567, JEL No. D8,L15,L86.
[6] Amy Hicks, Stephen Comp, Jeannie Horovitz, Madeline
Hovarter, Maya Miki and Jennifer L. Bevan. “Why people use
Yelp.com: An exploration of uses and gratifications”. Computers in
Human Behavior, Volume 28, Issue 6, November 2012, Pages
2274–2279.
[7] Maria R. Ebling and Ramón Cáceres. “Gaming and Augmented
Reality Come to Location-Based Services”, IEEE Computer
Society, Issue No. 01, January-March (2010 vol. 9), pp. 5-6.
[8] Geoffrey Levine and Gerald DeJong. “Automatic Topic Model
Adaptation for Sentiment Analysis in Structured Domains”
Proceedings of the 5th International Joint Conference on Natural
Language Processing, Chiang Mai, Thailand, November 8 – 13, 2011,
75-83.
[9] Bin Lu; Ott, M.; Cardie, C.; Tsou, B.K., "Multi-aspect Sentiment
Analysis with Topic Models," Data Mining Workshops (ICDMW),
2011 IEEE 11th International Conference on , vol., no., pp.81,88, 11-
11 Dec. 2011.
[10] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I.
(January 2003). "Latent Dirichlet allocation". Journal of Machine
Learning Research 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.
