Detect Deceptive Hotel Reviews Using Text Classification

Spot Deceptive TripAdvisor Hotel Reviews
By: Yousef Fadila
Project Notebook:
https://github.com/yousef-fadila/cs548-project-5/blob/master/notebook.ipynb
CS548: Text Mining Project

Motivation - Fake reviews in the news
TripAdvisor warns of hotels posting fake reviews
http://abcnews.go.com/Technology/story?id=8094231
Twitter campaign takes aim at fake restaurant reviews on TripAdvisor
https://www.theguardian.com/travel/2015/oct/24/twitter-campaign-targets-fake-tripadvisor-restaurant-reviews

Datasets
Deceptive Opinion Spam Corpus
Consists of:
400 deceptive positive reviews
400 deceptive negative reviews
⇒ Written by Amazon Mechanical Turk workers
400 truthful positive reviews
400 truthful negative reviews
⇒ From trusted users on TripAdvisor
TripAdvisor Hotel-reviews
Consists of:
878,561 reviews from 4,333 hotels, crawled from TripAdvisor.
⇒ Includes metadata (hotel name, rating, stars, location, ...)

Outline
Guiding Questions:
1. Which is more prevalent among the 200,000 sample reviews: positive deceptive or negative deceptive reviews?
2. Which hotel star rating most commonly has deceptive reviews? Which are the top ten hotels with deceptive positive reviews?
3. Is there enough support to claim that deceptive positive reviews are used to cover previous negative reviews?
Extra:
1. Would a 2-step approach based on domain knowledge (like the one presented in the anomaly detection showcase) improve the accuracy of the text classification model?
2. Demo: Try it yourself.
3. Are computers better than humans at detecting deceptive reviews?

Text Classification Model
1. n-gram range (1, 3): unigrams, bigrams and trigrams
2. min_df=3
3. max_df=0.96
4. LinearSVC classifier
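
The settings above can be sketched as a scikit-learn pipeline. This is a minimal sketch, not the notebook's code: the slide names the n-gram range, min_df, max_df and LinearSVC, but the choice of CountVectorizer (rather than, say, a TF-IDF vectorizer) is an assumption, and the toy reviews are invented for illustration and are not taken from the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_model():
    """Bag-of-n-grams features followed by a linear SVM.

    ngram_range, min_df and max_df come from the slide;
    CountVectorizer itself is an assumption.
    """
    return make_pipeline(
        CountVectorizer(ngram_range=(1, 3), min_df=3, max_df=0.96),
        LinearSVC(),
    )


# Invented toy reviews (NOT from the Deceptive Opinion Spam Corpus).
# Words repeat on purpose so that some terms survive min_df=3.
truthful = [
    "the bathroom was clean and the price was fair",
    "check-in was quick and the bathroom was clean",
    "the price was fair and check-in was quick",
    "clean bathroom fair price quick check-in",
]
deceptive = [
    "my husband and i loved our vacation here",
    "our vacation was amazing my husband loved it",
    "i loved this place on my business trip vacation",
    "my husband loved the vacation amazing",
]
texts = truthful + deceptive
labels = ["truthful"] * len(truthful) + ["deceptive"] * len(deceptive)

model = build_model().fit(texts, labels)
print(model.predict(["the bathroom was clean", "my husband loved our vacation"]))
```

With min_df=3 a term must appear in at least three reviews to be kept, and with max_df=0.96 a term appearing in more than 96% of reviews is dropped, so very rare and near-universal n-grams never reach the classifier.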

Positive deceptive vs. negative deceptive ratio
1. Which is more prevalent among the 200,000 sample reviews: positive deceptive or negative deceptive reviews?
Answer: Positive deceptive reviews are more prevalent.

Hotel Stars-Rating vs. Deceptive reviews rate
2. Which hotel star rating most commonly has deceptive reviews? Which hotels rank highest by ratio of deceptive positive reviews?
Top “deceptive” Hotels:
********Inn Houston
******** York Hotel
********ose Hotel
********a Inn Houston Wirt Road
********lmonico

Frequent Sequences Leading to Positive Deceptive Reviews
1. Pick 20 hotels with deceptive reviews.
2. Export all reviews of the selected hotels to an ARFF file.
3. Set the sequence ID to the hotel ID.
4. Run the GSP algorithm in Weka.
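
Steps 2 and 3 above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's export code: the attribute names (hotel_id, review_label), the four-value label set, and the use of the review date for ordering are all hypothetical. It also assumes hotel IDs are simple tokens without spaces or commas, and that GSP in Weka is run on nominal attributes with the first attribute as the sequence ID.

```python
from typing import Iterable, Tuple

# Hypothetical label set: one nominal value per (truthfulness, polarity) pair.
LABELS = [
    "truthful_positive",
    "truthful_negative",
    "deceptive_positive",
    "deceptive_negative",
]


def reviews_to_arff(reviews: Iterable[Tuple[str, str, str]]) -> str:
    """Render (hotel_id, iso_date, label) triples as ARFF text.

    Rows are sorted by hotel and then by date, so each hotel's
    reviews form one time-ordered sequence keyed by hotel_id.
    """
    rows = sorted(reviews, key=lambda r: (r[0], r[1]))
    hotel_ids = sorted({hotel for hotel, _, _ in rows})
    lines = [
        "@relation hotel_review_sequences",
        "",
        "@attribute hotel_id {" + ",".join(hotel_ids) + "}",
        "@attribute review_label {" + ",".join(LABELS) + "}",
        "",
        "@data",
    ]
    for hotel, _, label in rows:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        lines.append(f"{hotel},{label}")
    return "\n".join(lines) + "\n"


sample = [
    ("h2", "2015-01-02", "deceptive_positive"),
    ("h1", "2015-03-01", "deceptive_positive"),
    ("h1", "2015-01-15", "truthful_negative"),
]
print(reviews_to_arff(sample))
```

Using the hotel ID as the sequence ID means a pattern such as "negative review followed by deceptive positive review" is counted once per hotel in which it occurs, which matches the guiding question about deceptive positives covering earlier negatives.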

2 Step Approach
1. Would a 2-step approach based on domain knowledge (like the one presented in the anomaly detection showcase) improve the accuracy of the text classification model?
What features could be used to distinguish deceptive from truthful reviews?
False positives vs. false negatives.
Supervised vs. unsupervised.

Content Based Features
Some online reviews are too good to be true; Cornell computers spot 'opinion spam' http://bit.ly/2g6ou9X
"The researchers then applied computer analysis based on subtle features of text. Truthful hotel reviews, for example, are more likely to use concrete words relating to the hotel, like "bathroom," "check-in" or "price." Deceivers write more about things that set the scene, like "vacation," "business trip" or "my husband." Truth-tellers and deceivers also differ in the use of keywords referring to human behavior and personal life, and sometimes in features like the amount of punctuation or frequency of "large words." In parallel with previous analysis of imaginative vs. informative writing, deceivers use more verbs and truth-tellers use more nouns."
Features to extract from the review text:
1) Amount of punctuation
2) Total nouns minus total verbs
3) Length of the review
4) Adjective and adverb ratio
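
The four features above might be computed as in this sketch. The function name and the dictionary keys are hypothetical. It assumes the review has already been part-of-speech tagged with Penn Treebank tags by some external tagger, that "length" means character count, and that the "adjective and adverb ratio" is the share of adjective and adverb tokens among all tokens; the slides do not pin down any of these choices.

```python
import string
from typing import Dict, List, Tuple


def content_features(text: str, tagged: List[Tuple[str, str]]) -> Dict[str, float]:
    """Compute the four content-based features for one review.

    `tagged` is a list of (token, Penn Treebank tag) pairs produced
    by any POS tagger; the tagger itself is outside this sketch.
    """
    punctuation = sum(ch in string.punctuation for ch in text)
    nouns = sum(tag.startswith("NN") for _, tag in tagged)       # NN, NNS, NNP, NNPS
    verbs = sum(tag.startswith("VB") for _, tag in tagged)       # VB, VBD, VBG, ...
    adjectives = sum(tag.startswith("JJ") for _, tag in tagged)  # JJ, JJR, JJS
    adverbs = sum(tag.startswith("RB") for _, tag in tagged)     # RB, RBR, RBS
    n_tokens = len(tagged)
    return {
        "punctuation": punctuation,
        "nouns_minus_verbs": nouns - verbs,
        "length": len(text),  # assumption: length in characters
        "adj_adv_ratio": (adjectives + adverbs) / n_tokens if n_tokens else 0.0,
    }


text = "The room was very clean!"
tagged = [("The", "DT"), ("room", "NN"), ("was", "VBD"),
          ("very", "RB"), ("clean", "JJ"), ("!", ".")]
print(content_features(text, tagged))
```

Keeping the tagger outside the function makes the feature arithmetic easy to check by hand and lets any tagger that emits Penn Treebank tags be plugged in.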

Unsupervised Anomaly Detection (AD) Followed by a Supervised Classifier
No improvement!

2nd Try: One Single Step Supervised Model
Merge the “bag of words” features and the extracted content-based features into one feature set for a single supervised classifier.
No improvement!

3rd Try: Change Topology
Two supervised text classification models:
Positive vs. negative: based only on “bag of words”.
Deceptive vs. truthful: uses both “bag of words” and content-based features.

3rd Try: Change Topology - Result
Overall improvement of 7%!

Demo: Try it yourself
www.yousef.fadila.net/cs548
REST API:
POST request to: www.yousef.fadila.net/cs548/review_checker
Payload: {'review_text': text}
Sample response: {"result": "Likely Fake"}
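
A client for the endpoint above might look like the following sketch, using only the Python standard library. The slide gives the path, the payload key and a sample response; the "http://" scheme, the JSON encoding of the payload and the Content-Type header are assumptions, and the helper names are hypothetical. The service may also no longer be online.

```python
import json
import urllib.request

# Endpoint path from the slide; the "http://" scheme is an assumption.
ENDPOINT = "http://www.yousef.fadila.net/cs548/review_checker"


def build_request(review_text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one review.

    Assumes the service accepts a JSON body; the slide only shows
    the payload as {'review_text': text}.
    """
    body = json.dumps({"review_text": review_text}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_review(review_text: str, timeout: float = 10.0) -> str:
    """Send the request and return the "result" field of the response."""
    with urllib.request.urlopen(build_request(review_text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["result"]


# Building the request is side-effect free; check_review() needs network access.
request = build_request("Great hotel, my husband loved it!")
print(request.get_method(), request.full_url)
```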

Computers vs. Humans
Are computers better than humans at detecting deceptive reviews?
Survey of WPI students
74 WPI students responded.
Students were given 5 positive reviews and asked to decide whether each one was truthful or deceptive.
The list intentionally includes reviews that the model from the 1st experiment did not classify correctly.

Computers vs. Humans - Result
This is neither a scientific study nor a statistical one!
This is only a game! In fact, it is an unfair game, because the reviews come from the same dataset the model was trained on.
The purpose of the game is to show whether humans' truth bias (assuming that what they are reading is true until they find evidence to the contrary) could affect their ability to spot deceptive reviews.
Review   Computers   Humans
1        1           1
2        1           0
3        0           0
4        1           1
5        1           1
Total    4           3
