Discovering Context
1. Discovering Context: Classifying tweets through a semantic transform based on Wikipedia
Yegin Genc, Yasuaki Sakamoto, and Jeffrey V. Nickerson
2. "So I'm told by a reputable person they have killed Osama Bin Laden. …"
"I hate how my phone has this stupid … spell check …"
3. Twitter functions as a large sensor system and can increase our awareness of our surroundings
4. "So I'm told by a reputable person they have killed Osama Bin Laden. …"
Terrorism (?): important
"I hate how my phone has this stupid … spell check …"
Irritating technology: NOT IMPORTANT
5. How to classify?
message1 → transform → T(message1)
message2 → transform → T(message2)
distance(T(m1), T(m2))
d(message1, message2) ∝ d(T(message1), T(message2))
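The idea on this slide, comparing messages through their transforms rather than directly, can be sketched in a few lines. This is only an illustration: the transform `word_set` and the distance `jaccard_distance` are hypothetical stand-ins chosen for brevity, not the Wikipedia-based transform or the distance used in the talk.

```python
def transformed_distance(m1, m2, transform, distance):
    """Compare two messages in the transformed space: d(T(m1), T(m2))."""
    return distance(transform(m1), transform(m2))


def word_set(message):
    # Hypothetical stand-in for the transform T: the set of lowercased words.
    return set(message.lower().split())


def jaccard_distance(a, b):
    # Hypothetical stand-in for d: 1 minus the share of words in common.
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


same = transformed_distance(
    "iPad unveiled today", "today iPad unveiled", word_set, jaccard_distance
)
different = transformed_distance(
    "iPad unveiled today", "aid for Haiti", word_set, jaccard_distance
)
```

Because the transform and the distance are passed in as arguments, the same function could be reused with any other pair, which mirrors the slide's point that classification quality depends on the choice of T.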
6. A Two-Step Approach
STEP 1: FINDING WIKI PAGES
Tweet 1 → Wiki Page 1 (WP1)
Tweet 2 → Wiki Page 2 (WP2)
Tweet 3 → Wiki Page 3 (WP3)
...
Tweet n → Wiki Page n (WPn)
STEP 2: CALCULATING DISTANCE
Pairwise distances between the Wiki pages WP1, WP2, WP3, ..., WPn: d12, d13, d1n, d2n, d32, d3n
8. Tweet:
RT ashajayy Rest in peace JD Salinger Catcher in the Rye is one of my absolute
favourite books Sad day
Words:
Rest, peace, JD, Salinger, Catcher, Rye, absolute, favourite, books, Sad, day
Candidate Pages Hits
//en.wikipedia.org/wiki/J.D._Salinger 290
//en.wikipedia.org/wiki/J._D._Salinger 289
//en.wikipedia.org/wiki/books 145
//en.wikipedia.org/wiki/Doris_Day 138
//en.wikipedia.org/wiki/peace 131
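As a rough sketch of Step 1 on this example: the snippet below reduces the tweet to its content words and picks the candidate page with the most hits. The stop-word list and both function names are assumptions made for illustration, and the hit counts are copied from the slide rather than computed, since the slide does not say how hits are obtained.

```python
import re

# Assumption: a tiny illustrative stop-word list, just large enough for this tweet.
STOP_WORDS = {"rt", "in", "the", "is", "one", "of", "my"}


def content_words(tweet, usernames=()):
    """Return the words of a tweet, dropping stop words and the given usernames."""
    skip = STOP_WORDS | {name.lower() for name in usernames}
    tokens = re.findall(r"[A-Za-z]+", tweet)
    return [token for token in tokens if token.lower() not in skip]


def top_candidate(hits_by_page):
    """Return the candidate page with the highest hit count."""
    return max(hits_by_page, key=hits_by_page.get)


tweet = (
    "RT ashajayy Rest in peace JD Salinger Catcher in the Rye "
    "is one of my absolute favourite books Sad day"
)
words = content_words(tweet, usernames=["ashajayy"])

# Hit counts as listed on the slide.
candidate_hits = {
    "//en.wikipedia.org/wiki/J.D._Salinger": 290,
    "//en.wikipedia.org/wiki/J._D._Salinger": 289,
    "//en.wikipedia.org/wiki/books": 145,
    "//en.wikipedia.org/wiki/Doris_Day": 138,
    "//en.wikipedia.org/wiki/peace": 131,
}
best_page = top_candidate(candidate_hits)
```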
10. Method
Tweets → Distance Matrix → MDS → Discriminant Analysis → Accuracy Rate

Tweets
- T1 (Topic 1)
- T2 (Topic 1)
- T3 (Topic 2)
...

Distance Matrix
     T1    T2    T3
T1   0     d12   d13
T2   d21   0     d23
T3   d31   d32   0

MDS
     X     Y
T1   t1x   t1y
T2   t2x   t2y
T3   t3x   t3y

SED → DSED → Acc. SED
LSA → DLSA → Acc. LSA
Wikipedia → DWIKI → Acc. WIKI
11. Other Techniques
String Edit Distance (SED)
Minimum number of edits needed to transform one string into the other
Kitten → sitten (subst. of 's' for 'k')
SED = 1
Latent Semantic Analysis (LSA)
Natural language processing technique for classification based on term occurrences in documents
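A minimal sketch of string edit distance, assuming the usual Levenshtein definition (single-character insertions, deletions, and substitutions, each costing 1); the slide does not state which edit operations were allowed. The helper `distance_matrix` is a hypothetical name showing how such a measure could fill the distance matrix from the Method slide.

```python
def string_edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def distance_matrix(items, distance):
    """Pairwise distances between all items, as a list of rows."""
    return [[distance(x, y) for y in items] for x in items]


kitten_sitten = string_edit_distance("kitten", "sitten")
matrix = distance_matrix(["kitten", "sitten", "sitting"], string_edit_distance)
```

The matrix is symmetric with zeros on the diagonal, matching the layout of the distance matrix shown on the Method slide.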
12. Data
Without Noise
Category        Count
J.D. Salinger   15
iPad            15
Haiti           15
TOTAL           45

With Noise
Category        Count
J.D. Salinger   15
iPad            15
Haiti           15
Random          55
TOTAL           100

Example tweets:
RT @ashajayy Rest in peace, JD Salinger. Catcher in the Rye is one of my absolute favourite books. Sad day.
@JMNelis I fear I may have killed him because I talked about how I hate "Catcher." (1/2)
'Catcher In The Rye' Author J.D. Salinger Dies At 91 - The author of The Catcher in the Rye died of natural causes,... http//ow.ly/16rETF
iPad..not so appealing to me (Yet!) It's basically the MacBook&iPhone combined.I have both so don't think i'll be getting the iPad soon.
Have u seen it?Apple iPad Tablet Steve Jobs Unveils Visionary Computer http//bit.ly/9IslTP
The new Apple formula Hype
What Yall think about me buying a whole bunch of sour patch kids and giving them to haiti i bet they would be HAPPY!
Please ReTweet (http//caltweet.com/4gx ) - Lets ALL really AID Haiti
RT @UNC_Health_Care Video Want to help the #Haitian patients at #UNC Hospitals? Here's how. http//bit…
@Alitas_Way naw im kiddin but ma'am it really looks great on u
Please come to our Legal Studies Open House on Tuesday February 2nd from 6-730pm.Please call for exact location and to RSVP …
Most impressive stat for Warner is he holds the top 3 most passing yards in a superbowl. Three games three most passing yards in 40
13. Tweets without noise:
[Figure: three scatter plots (SED, LSA, Wiki) of Coordinate 1 against Coordinate 2; legend: J.D. Salinger, iPad, Haiti]
Technique J. D. Salinger iPad Haiti
String Edit Distance .67 .13 .60
Latent Semantic Analysis .67 .73 .80
Wikipedia .93 .87 .80
14. Tweets with noise:
[Figure: three scatter plots (SED, LSA, Wiki) of Coordinate 1 against Coordinate 2; legend: J.D. Salinger, iPad, Haiti, Random]
Technique J. D. Salinger iPad Haiti
Latent Semantic Analysis .60 .60 .20
Wikipedia .93 .87 .73
15. Conclusion
Wikipedia Space shows promising results in
defining similarity of short text
– Socially constructed
– Large space
– Immune to noise
16. Future Work
• Adaptive classification
– What we consider as noise may contain useful
information depending on the context
• Improved mapping and distance calculations
• Utilizing other social aspects of Wikipedia
We study how we can categorize messages streaming through Twitter. These messages, called tweets, come in at a rate of more than 600 a second [2], and are often cryptic. By recognizing new and useful topics in this noisy environment, we may provide automated tools with pragmatic uses: Twitter functions as a large sensor system and can increase our awareness of our surroundings. Humans are experts in recognizing new and useful messages while ignoring others. They do this by extracting meaning from messages, categorizing messages with related meaning into the same topics, and noticing information that does not fit any existing categories. Attempts to automate this fundamental ability of cognition using semantic models still leave room for improvement (e.g., [1]).
Accuracy (hit plus correct rejection) of classifying 45 tweets with known categories when 55 randomly sampled tweets are added.
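One way to read "hit plus correct rejection" is sketched below for a single category: a hit is a tweet of that category classified as such, and a correct rejection is a tweet of another category (or a random tweet) not assigned to it. The function name, the labels in the example, and the choice of dividing by the total number of tweets are all assumptions; the caption does not state the denominator, so this is not claimed to reproduce the reported figures.

```python
def accuracy(predicted, actual, category):
    """Share of tweets that are hits or correct rejections for one category."""
    hits = 0
    correct_rejections = 0
    for predicted_label, actual_label in zip(predicted, actual):
        if actual_label == category and predicted_label == category:
            hits += 1
        elif actual_label != category and predicted_label != category:
            correct_rejections += 1
    return (hits + correct_rejections) / len(actual)


# Hypothetical labels for four tweets, for illustration only.
actual = ["iPad", "iPad", "Haiti", "Random"]
predicted = ["iPad", "Haiti", "Haiti", "Random"]
ipad_accuracy = accuracy(predicted, actual, "iPad")
haiti_accuracy = accuracy(predicted, actual, "Haiti")
```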
Forty-five tweets with known categories mapped onto two-dimensional planes using multidimensional scaling of the between-tweet distances based on String Edit Distance, LSA and Wikipedia. An x is a tweet about J. D. Salinger and a triangle is a tweet about the iPad.