Pinpointing Location Focus in Microblogs

Pinpointing Locational Focus in Microblogs
Jie Yin, Sarvnaz Karimi, John Lingad
November 2014
DIGITAL PRODUCTIVITY FLAGSHIP

Where is it happening?
For those monitoring social media to
• send help in an emergency
• avoid certain areas (e.g., because of traffic)
• recommend services (ads)

CSIRO: positive impact | Pinpointing Locational Focus in Tweets | Sarvnaz Karimi3 |
Find it on the map!

Locational focus
Locational focus: Macquarie Centre, North Ryde, New South Wales, Australia
Location mentions: Sydney, Macquarie Centre
Author Location: Brisbane, Australia

Some tweets mention multiple locations: not easy to identify the focus
Mary river, Queensland, Australia
Mary river, Queensland, Australia

Some tweets have no locational focus
There is an unknown location
(Ambiguity)
No specific focus
(World Level?)

To find locational focus, we have two tasks:
1. Find mentions of locations
2. Aggregate these to infer the main focus

Finding location mentions
1. Where to look for the location mentions
• Location mentions can be in the tweet text and/or in hashtags
• Some hashtags are concatenated words or abbreviations, e.g., #QLDflood = QLD + flood
• Tweet text may mention a geographical location, such as Sydney, or a Point-of-Interest (POI), such as an organisation's name or a shop
• Authors’ locations in their profile (not exactly a location mention)
2. How to find these mentions?
• Hashtag segmentation
• Named Entity Recognition

Location mention extraction
• Related work
• NER tools for formal text, such as Stanford NER and OpenNLP, are highly
accurate (considered a solved problem).
• NER specific for Twitter: TwiNER [Wang et al.,2012], TwitterNLP [Ritter et al.,
2011]
• Retrained NER tools for Twitter [Lingad et al., 2013] – Location and
Organisation entities only
• In this work:
• Segmented the hashtags using a simple greedy maximal matching heuristic,
with an English dictionary augmented with place name abbreviations
• Used the retrained Stanford NER, keeping only LOC and ORG entities
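The hashtag segmentation step can be illustrated with a minimal sketch of greedy maximal (longest-first) matching. The function name, the tiny vocabulary, and the single-character fallback for unknown text are assumptions made for illustration; the slides do not give the actual dictionary or implementation.

```python
def segment_hashtag(hashtag, vocabulary):
    """Split a hashtag into known words by greedy longest-match, left to right."""
    text = hashtag.lstrip("#").lower()
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a known word is found.
        for j in range(len(text), i, -1):
            if text[i:j] in vocabulary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Assumption: when no dictionary word starts here, emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabulary: English words plus place name abbreviations.
VOCAB = {"qld", "nsw", "vic", "flood", "fire", "sydney"}

print(segment_hashtag("#QLDflood", VOCAB))  # ['qld', 'flood']
```

Greedy matching is simple and fast, but it can pick a long wrong word when a shorter split would be correct, which is one reason to treat this as a heuristic.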

Inferring locational focus
Given a list of location mentions, determine what the focus is.
For example:
If the mentions are VIC, NSW, QLD, WA, then the focus is Australia
If the mentions are Swanston St, RMIT, then the focus is RMIT University,
Melbourne, VIC, Australia
• Requires knowledge of the geographical locations as well as POIs
and their relationships/hierarchy.
• Data sources: Gazetteer Australia 2010, GeoNames New Zealand, OpenStreetMap

Gazetteer as a tree
Specific POI
City/Suburb/Town/
Non-Specific POI
(e.g., river, highway)
State/Territory/
Region
Country

Inference algorithm: Where on the map?
• Step 1: Query the gazetteer with the location mentions, and return the
matching results (partial or exact) as full paths in the
gazetteer tree
• Step 2: Create an inference tree using the returned paths
• Step 3: Propagate the scores in the tree
• Step 4: Find a maximum scoring path
Goal: find the finest level of granularity possible (the lowest level in the tree)
Assumption: more possible matches found within a geographical
region indicate that this region on the map is more likely the focus

Querying the gazetteer tree
• Location mentions: Sydney, Macquarie Centre
• Author Location: Brisbane, Australia
• Gazetteer querying returns:
- brisbane, queensland, australia
- south brisbane, queensland, australia
- macquarie centre, north ryde, new south wales, australia
- macquarie university, macquarie park, new south wales, australia
- ...
Each returned result gets a matching score based on the
Jaccard similarity of the query and the matched node.
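As a rough illustration of the matching score, the sketch below computes Jaccard similarity over lowercase word tokens. Whether the original system compared word tokens, characters, or n-grams is not stated in the slides, so the tokenisation here is an assumption, and the function name is hypothetical.

```python
def jaccard(query, node_name):
    """Jaccard similarity between two strings, compared as sets of word tokens."""
    a = set(query.lower().split())
    b = set(node_name.lower().split())
    if not a and not b:
        return 0.0  # Assumption: two empty strings count as no match.
    return len(a & b) / len(a | b)


print(jaccard("Brisbane", "south brisbane"))            # partial match
print(jaccard("Macquarie Centre", "macquarie centre"))  # exact match
```

Under this tokenisation, the partial match "south brisbane" scores lower than the exact match "brisbane" for the query "Brisbane", which is the behaviour the partial-or-exact querying step relies on.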

Building the inference tree
[Figure: inference tree built from the matched paths]
Matched paths:
- brisbane, queensland, australia
- macquarie centre, north ryde, new south wales, australia
Tree:
earth
  australia
    queensland
      brisbane (leaf score)
    new south wales
      north ryde
        macquarie centre (leaf score)
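One way to picture the tree construction is to merge the returned paths into nested dictionaries under a common root. This is a minimal sketch, assuming paths are comma-separated strings that run from the most specific place to the country; the function names and the dictionary layout are illustrative, not taken from the original system.

```python
def build_inference_tree(paths):
    """Merge gazetteer paths (most specific place first) into one tree rooted at 'earth'."""
    root = {"name": "earth", "children": {}}
    for path in paths:
        parts = [part.strip() for part in path.split(",")]
        node = root
        # Paths read leaf -> country, so walk them in reverse to go root -> leaf.
        for name in reversed(parts):
            node = node["children"].setdefault(name, {"name": name, "children": {}})
    return root


def render(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + node["name"]]
    for child in node["children"].values():
        lines.extend(render(child, depth + 1))
    return lines


tree = build_inference_tree([
    "brisbane, queensland, australia",
    "macquarie centre, north ryde, new south wales, australia",
])
print("\n".join(render(tree)))
```

Shared prefixes such as "australia" are stored once, so every additional match inside a region adds a branch under that region's node.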

Propagating scores to the parents and finding the maximal path
• More branches within a sub-tree increase the chance of their
parent being on the maximal path
• Bottom-up scoring of parents, from the leaves to the root
• Parent score = current score + 0.5 * score of the highest-scoring
child
• Top-down selection of the maximal path, with entropy as the
termination condition: if the entropy of the children's scores is higher
than a pre-defined threshold, the algorithm stops at that level.
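The two passes described above can be sketched as follows. The parent update rule comes from the slide; the `Node` class, the normalisation of child scores into a probability distribution before computing entropy, the use of base-2 logarithms, and all numeric leaf scores and thresholds are assumptions made for illustration.

```python
import math


class Node:
    def __init__(self, name, score=0.0, children=None):
        self.name = name
        self.score = score
        self.children = children or []


def propagate(node):
    """Bottom-up pass: parent score = own score + 0.5 * best child score."""
    if node.children:
        best = max(propagate(child) for child in node.children)
        node.score += 0.5 * best
    return node.score


def entropy(scores):
    """Shannon entropy (bits) of scores normalised to sum to 1 (an assumption)."""
    total = sum(scores)
    if total <= 0:
        return 0.0
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log2(p) for p in probs)


def select_path(root, threshold):
    """Top-down pass: follow the best child until the children are too uncertain."""
    path = [root.name]
    node = root
    while node.children:
        if entropy([c.score for c in node.children]) > threshold:
            break  # Children are too evenly scored; stop at this level.
        node = max(node.children, key=lambda c: c.score)
        path.append(node.name)
    return path


def example_tree():
    # Hypothetical leaf scores, chosen only to make the example readable.
    north_ryde = Node("north ryde", children=[Node("macquarie centre", 4.0)])
    nsw = Node("new south wales", children=[north_ryde, Node("sydney", 1.0)])
    qld = Node("queensland", children=[Node("brisbane", 1.0)])
    return Node("earth", children=[Node("australia", children=[qld, nsw])])


root = example_tree()
propagate(root)
print(select_path(root, threshold=0.95))
print(select_path(root, threshold=0.90))
```

With these made-up scores, the looser threshold lets the selection descend all the way to the POI, while the stricter one stops at the country level because the two states are not separated clearly enough.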

[Figure: score propagation in the inference tree]
Matched paths:
- brisbane, queensland, australia
- macquarie centre, north ryde, new south wales, australia
- sydney, new south wales, australia
Tree nodes: earth, australia, queensland, brisbane, new south wales, sydney, north ryde, macquarie centre, macquarie university
Leaf scores: A, B, C, D
Propagated scores: 0.5*A, 0.5*Max(B,C), 0.5*Max(0.5*B,D)
Leaf score = w * 2^level * Jaccard similarity
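The leaf score formula can be tried out with a small sketch. The slides do not define w or level here, so two assumptions are made: w is taken to be the information-source weight (0.6 for text, 0.3 for hashtag, 0.1 for user location, as reported with the accuracy results), and level is taken to be the granularity level, with country as 1 and POI as 4. The function and constant names are hypothetical.

```python
# Assumed source weights (text, hashtag, user location).
SOURCE_WEIGHTS = {"text": 0.6, "hashtag": 0.3, "user_location": 0.1}


def leaf_score(source, level, jaccard_similarity):
    """Leaf score = w * 2^level * Jaccard similarity."""
    return SOURCE_WEIGHTS[source] * (2 ** level) * jaccard_similarity


# An exact POI match from the tweet text versus an exact city match
# from the author's profile location.
print(leaf_score("text", 4, 1.0))
print(leaf_score("user_location", 3, 1.0))
```

Under these assumptions, the 2^level factor rewards finer-grained matches, which fits the stated goal of finding the most specific location possible.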

Dataset & annotation
• Queried Twitter with keywords such as fire, earthquake, storm,
hurricane
• Randomly sampled 7,000 tweets
• Two annotation steps:
1. Identify location mentions
2. Identify locational focus (based on tweet and author location)
• Three annotators per tweet; only tweets with majority agreement
(2 out of 3) were kept in the final set.
• Tweets whose focus was not within Australia or New Zealand
were removed.
• A small set of tweets for which the focus was marked as impossible to
detect was also removed.
• Final set: 1,398 tweets (80 kept for parameter tuning)
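The majority-agreement filter can be sketched in a few lines. This assumes each annotator's judgement is reduced to a single comparable label per tweet; the function name and the example labels are hypothetical.

```python
from collections import Counter


def majority_label(labels, min_votes=2):
    """Return the label chosen by at least `min_votes` annotators, else None."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


print(majority_label(["Brisbane", "Brisbane", "Sydney"]))  # Brisbane
print(majority_label(["Brisbane", "Sydney", "Perth"]))     # None
```

Tweets for which the function returns None would be dropped, mirroring the 2-out-of-3 rule described above.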

Baseline: Yahoo! PlaceFinder*
• A service that accepted queries and returned a list of matching
places in the form of country, state, city, poi
• A query to the service was similar to a database query:
SELECT * FROM geo.placefinder WHERE text = "<query text>"
And we chose the query text to be
(a) tweet (text & hashtag) and user location from their profile
(b) the list of location mentions from one tweet (human
annotated)
* As it was called in Jan 2014

Accuracy with manual location mentions (without NER)
Accuracy (%)            All    Text   Hashtag   User Location
Level 1 - Country       89.9   35.3   45.2      71.6
Level 2 - State         73.5   29.3   37.4      36.3
Level 3 - City/Suburb   51.0   24.5   12.4       4.9
Level 4 - POI           29.7   11.7    8.1       1.8
No focus                58.5   95.8   96.4      63.2
User location was most useful at the country level, but did not contribute much at other
levels of granularity.
POI was the hardest level, with only ~30% correctly identified.
All = 0.6 text + 0.3 hashtag + 0.1 user location

Accuracy with location mentions extracted using NER
Accuracy (%)        Level 1   Level 2   Level 3   Level 4   No Focus
PlaceFinder (a)     87.9      58.6      22.9      21.0       0.3
PlaceFinder (b)     87.8      59.1      23.5      18.8      25.5
Our Alg. No NER     89.9      73.5      51.0      29.7      58.5
Our Alg. With NER   91.3      65.7      47.0      24.9      53.4
(a) The whole tweet was queried (b) Location mentions were queried (no NER)
Country-level focus was the easiest, with all settings performing similarly.
PlaceFinder was consistently worse at the other levels, but that could also be an effect of
our gazetteer hierarchy.

The sources of errors in our algorithm
• Annotation mistakes: human annotators missed some of
the mentions.
• Some streets and POIs were missing from the gazetteer.
• Heavily misspelled place names were not corrected
in our pre-processing step.
• We favoured lower (finer) granularities in our scoring, which
introduced wrong POIs that were not needed.
• Gazetteer bias: if one mention had many matches in a
region, the path through that region could wrongly get stronger.

What we learnt and what’s next?
• Finding locational focus is difficult, even for humans (low
agreement in annotation)
• Our method was accurate at the country level (about 90%), but accuracy
dropped at the state and city levels and was lowest at the POI level (about 30%).
• All three information sources (text, hashtag, and author location)
contribute to finding the focus, but at different levels.
• How to make it better?
• Incorporate some context, e.g., tweets that share hashtags, are replies,
or are temporally close
• Learn the weights of the different information sources

Related Studies
• TwitterStand: geotagging the content of tweets. Used the GeoNames
gazetteer and heuristic rules to find and disambiguate the location
focus.
• Kinsella et al. [2011]: learning language models of locations
