Quora ML Workshop: Sock Puppets and Hoaxes on the Web

Be Nice, Be Respectful:
Protecting Online Spaces with Applied
Machine Learning

Sockpuppets and Hoaxes on the Web
An Army of Me: Sockpuppets in Online Discussion Communities. S. Kumar, J.
Cheng, J. Leskovec and V.S. Subrahmanian. Proceedings of World Wide Web
Conference, 2017 (WWW 2017). Best Paper Award Honorable Mention.
Disinformation on the Web: Impact, Characteristics, and Detection of
Wikipedia Hoaxes. S. Kumar, R. West, J. Leskovec and V.S. Subrahmanian.
Proceedings of World Wide Web Conference, 2016 (WWW 2016)
Srijan Kumar
@srijankedia
Computer Science, Stanford University
Joint works with Robert West, Justin Cheng, Jure Leskovec, V.S. Subrahmanian

• Web enables social interaction
• Web is no longer a static library that
people passively browse
• Web is a place where people:
– Act as prosumers, i.e., content producers and content
consumers
– Interact with other people:
• Internet forums, Blogs, Social networks, Twitter,
Wikis, Podcasts, Slide sharing, Bookmark sharing, Product
reviews, Comments, …
Web allows...

…but there’s also a dark side
to the web

Time (2016); The Atlantic (2016); BBC (2015), Vanity Fair (2017), Digital Trends (2017)
Not everyone has good intentions…

Web: Source of false information

Modeling and Detection of
Misbehavior and Misinformation on
the Web
This talk:

Challenges in analyzing malicious behavior
Data imbalance: only a small proportion of behavior (< 10%) is malicious
Limited labels: little is known about malicious behavior
Deceptive behavior: malicious behavior tends to masquerade as benign
VS Subrahmanian and Srijan Kumar. Predicting Human Behavior: The Next Frontiers. Science 2017.

Sockpuppets in online discussions
An Army of Me: Sockpuppets in Online Discussion Communities. S. Kumar, J.
Cheng, J. Leskovec and V.S. Subrahmanian. Proceedings of World Wide Web
Conference, 2017 (WWW 2017). Best Paper Award Honorable Mention.

Eric_17 April 28 2013, 12AM
Thanks. I knew Marvel fans would try to flame me, but they
have nothing other than “oh that’s your opinion” instead of
coming up with their own argument
Fellstrike April 29 2013, 6PM
Quit talking to yourself, *******. Get back on your
meds if you’re going to do that
bdiaz209 April 28 2013, 11PM
Possibly the best blog I’ve ever read major props to you
bdiaz209 posts only on this discussion to
support and defend Eric_17
Sockpuppets in Discussions

Data: Sockpuppets
2.9M users
2.1M articles
62M posts

Defining sockpuppets
No ground truth sockpuppet labels! (Surprise?!)
We adopt the definition currently used by Wikipedia, after
validating it statistically for our task:
Sockpuppets are accounts that post from the
same IP address in the same discussion very close
in time (within 15 min), in at least 3 different instances.
Note: we use IP addresses for the definition, but not for detection
3,656 sockpuppets
1,623 puppetmasters
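
The definition above can be sketched in code. This is only an illustrative reading of it, not the authors' implementation: the post record layout (user, ip, discussion_id, timestamp) is hypothetical, and "at least 3 different instances" is assumed here to mean at least 3 distinct discussions.

```python
from collections import defaultdict
from itertools import combinations

WINDOW_SECONDS = 15 * 60  # "very close in time (15 min)"
MIN_INSTANCES = 3         # "at least 3 different instances" (assumed: distinct discussions)


def find_sockpuppet_pairs(posts):
    """posts: iterable of (user, ip, discussion_id, unix_timestamp) tuples.

    Returns a set of frozenset({user_a, user_b}) account pairs that satisfy
    the definition under the assumptions stated above.
    """
    # Group posts that share both an IP address and a discussion.
    by_ip_discussion = defaultdict(list)
    for user, ip, discussion, ts in posts:
        by_ip_discussion[(ip, discussion)].append((ts, user))

    # For each account pair, record the discussions where they co-posted in time.
    instances = defaultdict(set)
    for (_ip, discussion), entries in by_ip_discussion.items():
        entries.sort()
        for (t1, u1), (t2, u2) in combinations(entries, 2):
            if u1 != u2 and t2 - t1 <= WINDOW_SECONDS:
                instances[frozenset((u1, u2))].add(discussion)

    return {pair for pair, seen in instances.items() if len(seen) >= MIN_INSTANCES}
```

The pairwise comparison is quadratic within each (IP, discussion) group, which is acceptable for a sketch but would need a sliding window at the scale of 62M posts.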

Characteristics of sockpuppets

How to compare sockpuppets & ordinary users?
For each sockpuppet, match an
ordinary user that makes
similar number of posts
on
similar discussions
We have to match!
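
One way such matching could look is sketched below. The slide does not say how "similar" is measured, so both criteria here are assumptions: post counts must be within a hypothetical ratio of each other, and discussion overlap is scored with Jaccard similarity.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of discussion ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_ordinary_users(sockpuppets, ordinary, max_count_ratio=1.5):
    """Greedily pair each sockpuppet with one unused ordinary user.

    Both arguments map user -> (num_posts, set_of_discussion_ids).
    max_count_ratio is a hypothetical tolerance on the post-count ratio.
    """
    available = dict(ordinary)
    matches = {}
    for sock, (n_posts, discussions) in sockpuppets.items():
        best, best_score = None, -1.0
        for user, (m_posts, other) in available.items():
            lo, hi = sorted((n_posts, m_posts))
            if lo == 0 or hi / lo > max_count_ratio:
                continue  # post counts are too different
            score = jaccard(discussions, other)
            if score > best_score:
                best, best_score = user, score
        if best is not None:
            matches[sock] = best
            del available[best]  # each ordinary user is matched at most once
    return matches
```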

Smoothzilla Feb 5 2013, 3PM
Thanks for your support!!!!
Falcon-X32 Feb 5 2013, 3PM
I agree. You are absolutely right!
jakey008 Feb 5 2013, 2PM
should have read the reviews first :(
ricobeans27 Feb 5 2013, 3PM
Couldn’t agree more.
Interact more with each other
p < 10^-3
Upvote each other more
p < 10^-3
Relation between pair of sockpuppets

Do puppetmasters lead double lives?
Double life hypothesis:
Puppetmaster maintains distinct personality for the two sockpuppets
More similar Less similar
Ordinary Sockpuppet 1 Sockpuppet 2
Similarity is measured as cosine similarity between user posts’ features: LIWC, sentiment,
number of words, etc.
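
The similarity measure named above is standard cosine similarity. A minimal sketch follows; the feature vectors in the usage lines are made-up placeholders, not values from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical per-user vectors, e.g. [LIWC count, sentiment, number of words].
ordinary = [0.2, 0.1, 35.0]
sock_1 = [0.6, -0.4, 12.0]
sock_2 = [0.5, -0.3, 14.0]
sock_to_sock = cosine_similarity(sock_1, sock_2)
sock_to_ordinary = cosine_similarity(sock_1, ordinary)
```

In practice the features would likely need to be standardized first, since a large-valued feature such as word count otherwise dominates the angle.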

Alternate hypothesis:
Puppetmaster operates both sockpuppets similarly
Less similar More similar
Ordinary Sockpuppet 1 Sockpuppet 2
Similarity is measured as cosine similarity between user posts’ features: LIWC, sentiment,
number of words, etc.

Both sockpuppets are more similar to
each other
p < 10^-3
“Good sock/Bad sock” not common
Non-sockpuppet Sockpuppet 1 Sockpuppet 2

Why are sockpuppets created?
Only for deception?

Deceptiveness
[Figure: histogram of the Levenshtein distance between usernames (x-axis, 0 to 20) against the number of pairs (y-axis, 0 to 300), comparing sock pairs with random pairs. Non-pretenders (e.g. srijan, srijan2): 2/3; pretenders (e.g. srijan, theRealBatman): 1/3.]
Hypothesis: Deceptive sockpuppets of the same master have very different usernames.
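
For reference, the Levenshtein (edit) distance used to compare usernames can be computed with the standard dynamic program. This is a generic sketch, not code from the study.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


close_pair = levenshtein("srijan", "srijan2")
distant_pair = levenshtein("srijan", "theRealBatman")
```

A pair like srijan / srijan2 differs by a single appended character, while srijan / theRealBatman is at least 7 edits apart because of the length difference alone.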

srijan Feb 5 2013, 3PM
i agree.. these morons dont know a thing
theRealBatman Feb 5 2013, 3PM
YOU ARE STUPID AND A *****
best article i have read!!!
ricobeans27 Feb 5 2013, 3PM
But this article doesn’t make any sense
More opinionated
p < 10^-3
Swear more
p < 10^-3
Downvoted and reported more
p < 10^-3
Pretender vs Non-pretender Sockpuppets

How are sockpuppets used?
Do sockpuppets always support
one another?

Neutral sockpuppets
why so?
best article ever!
We quantify the amount of support by counting assenting, negation and dissenting words from LIWC
60%
Neutral

Supporter sockpuppets
Totally agree!!
best article ever!
60%
Neutral
30%
Supporter

Dissenter sockpuppets
60%
Neutral
30%
Supporter
10%
Dissenter
I don’t think so
best article ever!
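
The word-counting idea behind the three categories can be illustrated roughly as follows. LIWC is a licensed lexicon, so the tiny word lists below are hypothetical stand-ins, and the decision rule (assenting words minus negation and dissenting words) is an assumption; the slides do not give the exact rule or thresholds.

```python
import re

# Hypothetical stand-ins for the LIWC categories mentioned on the slide.
ASSENT = {"agree", "yes", "absolutely", "right", "totally", "exactly"}
NEGATION = {"no", "not", "never", "don't", "doesn't", "can't"}
DISSENT = {"disagree", "wrong", "nonsense"}


def classify_reply(text):
    """Label a reply as 'supporter', 'dissenter', or 'neutral' (assumed rule)."""
    normalized = text.lower().replace("\u2019", "'")  # straighten curly apostrophes
    words = re.findall(r"[a-z']+", normalized)
    assenting = sum(w in ASSENT for w in words)
    opposing = sum(w in NEGATION or w in DISSENT for w in words)
    if assenting > opposing:
        return "supporter"
    if opposing > assenting:
        return "dissenter"
    return "neutral"
```

Applied to the example replies on these slides, this rule labels "Totally agree!!" as a supporter, "I don’t think so" as a dissenter, and "why so?" as neutral.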

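Supportiveness and Deceptiveness
[Figure: stacked bars of the probability of being a pretender (y-axis, 0.0 to 1.0), split into Pretender and Non-pretender, for each sockpuppet type. Bar segments as extracted: Dissenter 0.58 / 0.42; Neutral 0.70 / 0.30; Supporter 0.74 / 0.26.]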
Deception is important to create
an illusion of public consensus

Features
Post: number of words, characters, etc., LIWC counts, readability, sentiment, …
Community: number of upvotes and downvotes, fraction of reported posts, is account reported, …
Activity: number of posts, number of replies, reciprocity of posts, age of account, …
Note: we are not using the IP based features

Is an account a sockpuppet?
[Figure: AUC by feature set (axis from 0.5 to 1.0, baseline marked)]
Post: 0.57
Community: 0.54
Activity: 0.59
All: 0.68
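
As a reminder of what the AUC numbers measure: AUC is the probability that a randomly chosen positive example (here, a sockpuppet) receives a higher classifier score than a randomly chosen negative one, so 0.5 corresponds to random guessing. A minimal pure-Python sketch, with made-up labels and scores in the usage lines:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half).

    labels: 1 for positive (e.g. sockpuppet), 0 for negative.
    scores: classifier scores, higher meaning more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores, for illustration only.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.3, 0.6, 0.1])
```

The double loop is quadratic; a rank-based formulation gives the same value in O(n log n) for large evaluation sets.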

Do two accounts belong to the same person?
[Figure: AUC by feature set (axis from 0.5 to 1.0, baseline marked)]
Post: 0.80
Community: 0.56
Activity: 0.86
All: 0.91

Sockpuppetry
Benign usage by
Non-pretender sockpuppets:
Primarily created to separate
interest, are respectful,
operate similarly, and are
neutral towards each other
Malicious usage by
Pretender sockpuppets:
Primarily created to create
an illusion of consensus,
are abusive, and support &
defend each other
Conclusion

Disinformation on the Web
Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia
Hoaxes. S. Kumar, R. West, J. Leskovec and V.S. Subrahmanian. Proceedings
of World Wide Web Conference, 2016 (WWW 2016)

Types of false information
Misinformation: honest mistake
Disinformation: deliberate lie to mislead
Hoax: “deliberately fabricated falsehood made to masquerade as truth” (Wikipedia)

Why Wikipedia?
The free encyclopedia that anyone can edit
Easy to add (false)
information
• Freely accessible
• Large reach
• Major source of
information for
many

Why care about false information?
If humans are good at identifying false information, then we
don’t need to worry.
Results
● 64 public successful hoaxes
● Pair with similar legitimate articles
● 320 random hoax/non-hoax pairs x 10 raters on Mechanical Turk
Random: 50%; Human: 66%; Classifier: 86%
If a hoax “looks” like a genuine Wikipedia article, it is
assumed to be credible.
Accurate detection needs non-appearance features.

Data: Wikipedia Hoaxes
Hoax article vs hoax facts
21,218 hoax articles
Hoax lifecycle:

Impact of hoaxes: Quantify their impact?
Characteristics of hoaxes: What are the hoaxes like?
Detection of hoaxes: Can we find them?

Impact of hoaxes
“The worst hoaxes are those which
(a) last for a long time,
(b) receive significant traffic,
(c) are relied upon by credible news media.”
Jimmy Wales on Quora
Most hoaxes are caught soon, but
some hoaxes (~1%) are impactful
along all three axes

Impact of hoaxes: Most hoaxes are caught soon, but some hoaxes are impactful!
Characteristics of hoaxes: What are the hoaxes like?
Detection of hoaxes: Can we find them?

Hoax
Successful hoax: pass patrol, survive for a month, viewed frequently
Failed hoax: flagged and deleted during patrol
Non-hoax
Wrongly flagged: temporarily flagged
Legitimate articles: never flagged

Characteristics of hoaxes
Hoax articles are longer, but they consist mostly of plain text
and have fewer web and wiki links.
Features:
o Plain-text length
o Plain-text-to-markup ratio
o Wiki-link density
o Web-link density
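
The four appearance features listed above could be approximated from raw wikitext along these lines. This is a rough sketch under stated assumptions, not the authors' feature extractor: length is counted in words, densities are links per 100 words, the ratio compares plain-text characters to total wikitext characters, and the regex-based markup stripping is deliberately simplistic.

```python
import re

WIKI_LINK = re.compile(r"\[\[[^\[\]]+\]\]")      # [[Target]] or [[Target|label]]
WEB_LINK = re.compile(r"https?://[^\s\]|<>]+")   # bare or bracketed URLs


def strip_markup(wikitext):
    """Very rough plain-text approximation of a wikitext string."""
    text = re.sub(r"<ref[^>]*>.*?</ref>", " ", wikitext, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)                  # remaining HTML-like tags
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)           # non-nested templates
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]+)\]\]", r"\1", text)  # keep link label
    text = re.sub(r"\[https?://\S+\s*([^\]]*)\]", r"\1", text)          # keep link label
    text = re.sub(r"https?://\S+", " ", text)             # drop bare URLs
    text = re.sub(r"'{2,}|={2,}", "", text)               # bold/italic and heading marks
    return re.sub(r"\s+", " ", text).strip()


def appearance_features(wikitext):
    plain = strip_markup(wikitext)
    n_words = len(plain.split())
    per_100_words = 100.0 / n_words if n_words else 0.0
    return {
        "plain_text_length": n_words,
        "plain_text_to_markup_ratio": len(plain) / len(wikitext) if wikitext else 0.0,
        "wiki_link_density": len(WIKI_LINK.findall(wikitext)) * per_100_words,
        "web_link_density": len(WEB_LINK.findall(wikitext)) * per_100_words,
    }
```

A real pipeline would more likely use a proper wikitext parser, since nested templates, tables, and infoboxes defeat simple regular expressions.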
Appearance:
how the article looks
Link-network:
how the article
connects
Support:
how other articles
refer to it
Editor:
who created the
article

CC = 0: incoherent article
CC > 0: coherent article
Hoax articles are less coherent than non-hoax articles.
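
The slide does not expand "CC". Assuming it denotes a clustering coefficient over the articles that a page links to, the coherence measure could be sketched as the fraction of linked-article pairs that also link to each other. The link graph below is hypothetical.

```python
from itertools import combinations


def ego_clustering_coefficient(article, links):
    """Fraction of pairs of articles linked from `article` that are themselves
    connected by a link in either direction.

    links: dict mapping an article title to the set of titles it links to.
    """
    neighbors = sorted(links.get(article, set()) - {article})
    if len(neighbors) < 2:
        return 0.0
    pairs = list(combinations(neighbors, 2))
    connected = sum(
        1 for a, b in pairs
        if b in links.get(a, set()) or a in links.get(b, set())
    )
    return connected / len(pairs)


# Hypothetical link graph: the hoax links to unrelated pages, the real article does not.
links = {
    "hoax": {"A", "B", "C"},
    "real": {"X", "Y", "Z"},
    "X": {"Y"}, "Y": {"Z"}, "Z": {"X"},
}
hoax_cc = ego_clustering_coefficient("hoax", links)
real_cc = ego_clustering_coefficient("real", links)
```

Under this reading, an article whose outgoing links point to mutually unrelated pages gets CC = 0, matching the "incoherent article" case above.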
Appearance:
hoaxes mostly have
text and few
references.
Link-network:
how the article
connects
Support:
how other articles
refer to it
Editor:
who created the
article

Other articles refer to hoax articles far less often than to
non-hoax articles. When a reference does exist, it was made
recently by the hoaxster or from an IP address.
Features:
o Number of prior mentions
o Time since first mention
o Creator of first mention
Appearance:
hoaxes mostly have
text and few
references.
Link-network:
hoaxes have
incoherent
wikilinks.
Support:
how other articles refer
to it
Editor:
who created the
article

Hoax creators are more recently registered, and
have less editing experience.
Features:
o Creator’s account age
o Creator’s experience
Appearance:
hoaxes mostly have
text and few
references.
Link-network:
hoaxes have
incoherent
wikilinks.
Support:
hoaxes have few,
recent, suspicious
mentions.
Editor:
who created the
article

Impact of hoaxes: Most hoaxes are caught soon, but some hoaxes are impactful!
Characteristics of hoaxes: Hoaxes are different from non-hoaxes in many respects!
Detection of hoaxes: Can we find them?

Detection of hoaxes
Will a hoax get past patrol? AUC = 71% (appearance features)
Is an article a hoax? AUC = 98% (editor and network features)
Is an article flagged as hoax really one? AUC = 86% (editor and support features)

We discovered new hoaxes!
Steve Moertel
American popcorn
entrepreneur
6 years 11 months!
Flagged by us, deleted by Wikipedia administrators

Overall Conclusions
• Misbehavior and misinformation are prevalent on the web, and
compromise safety and integrity
• Sockpuppets: Used for both benign and malicious purposes.
Malicious ones attempt to create an illusion of consensus
• Hoaxes: Successfully mislead readers, are impactful, and
have distinct characteristics
• Detection is important and feature engineering works
experimentally

You may also be interested in
• Tutorials on misbehavior and misinformation:
– Data-Driven Approaches towards Malicious Behavior Modeling. Jiang et al.,
SIGKDD 2017
– Antisocial Behavior on the Web: Characterization and Detection. Kumar et al.,
WWW 2017
• Vandals in Wikipedia
– VEWS: A Wikipedia Vandal Early Warning System. Kumar et al., SIGKDD 2015
• Language and deception
– Linguistic Harbingers of Betrayal: A Case Study on an Online Strategic Game.
Niculae et al., ACL 2015
• Social network algorithm for troll detection
– Accurately Detecting Trolls in Slashdot Zoo via Decluttering. Kumar et al.,
ASONAM 2014
More details at: http://cs.stanford.edu/~srijan

MIS2 workshop at WSDM 2018
Submit your awesome papers, tools, demos, and more!
Full research papers, short papers, works in progress, extended
abstracts are welcome!
MIS2: Misinformation and Misbehavior
Mining on the Web
Feb 9, 2018 at Los Angeles, CA
Held in conjunction with WSDM 2018
Submissions due: Nov 20, 2017
Notifications due: Dec 14, 2017
Website: http://snap.stanford.edu/mis2/
