This document discusses mining social media and user-generated data to extract geographic points of interest (POIs). It presents methods for using structured mentions of POIs from sources such as Foursquare and Wikipedia to train models that extract POIs from unstructured web text. Evaluation data is created to assess extraction accuracy. Models trained on bootstrapped web snippets achieve up to 88% precision on unlabelled data and reflect the POIs people actually visit. A hybrid approach uses gazetteers and Flickr language models to accurately locate extracted POIs. The methods allow discovering new POIs and increasing coverage of diverse POI types from crowdsourced information.
4. Motivation
§ Geographic Points of Interest (POIs) are valuable representations of important places in the world around us.
§ Browsing and search of POIs increasingly important
› Web search
› Mobile
› Navigation
5. Where do POIs come from?
§ Editing listings coming from NMAs, commercial directories, etc.
› Costly process
› Expensive to maintain freshness
› Coverage
§ Do they reflect the kind of places that people are interested in looking for?
6. Can we get them from the web?
§ Un/semi-structured mentions of POIs throughout text on the web
› Lots of context
§ Structured mentions of POIs in micro-blogging systems and Wikipedia articles
› Easy to extract
7. When is a POI not a POI?
1 The White House is at 1600 Pennsylvania Avenue, Washington DC.
2 The White House released a statement today suggesting the moon is made of cheese.
3 The people living in the white house at the end of the street turned out to be Martians.
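Only the first sentence mentions a POI in the geographic sense; a sequence tagger must learn this distinction. A minimal sketch of the BIO token labelling mentioned in the notes, with gold labels hand-assigned for illustration (they are not output of the actual tagger):

```python
# Illustrative gold BIO labels for the slide's first two sentences.
# B-POI/I-POI mark tokens inside a POI mention; O marks everything else.
# In sentence 2 "White House" stands for the organisation, not the place,
# so it receives no POI label. Sentence 2 is truncated for brevity.
sentence_1 = [
    ("The", "O"), ("White", "B-POI"), ("House", "I-POI"), ("is", "O"),
    ("at", "O"), ("1600", "O"), ("Pennsylvania", "O"), ("Avenue", "O"),
    (",", "O"), ("Washington", "O"), ("DC", "O"), (".", "O"),
]
sentence_2 = [
    ("The", "O"), ("White", "O"), ("House", "O"), ("released", "O"),
    ("a", "O"), ("statement", "O"), ("today", "O"), (".", "O"),
]

def extract_pois(tagged):
    """Collect contiguous B-POI/I-POI spans into surface strings."""
    pois, current = [], []
    for token, label in tagged:
        if label == "B-POI":
            if current:
                pois.append(" ".join(current))
            current = [token]
        elif label == "I-POI" and current:
            current.append(token)
        else:
            if current:
                pois.append(" ".join(current))
                current = []
    if current:
        pois.append(" ".join(current))
    return pois

print(extract_pois(sentence_1))  # ['White House']
print(extract_pois(sentence_2))  # []
```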
12. Can we bootstrap using social media?
§ Train Conditional Random Fields (CRF) using web snippets bootstrapped from structured mentions in micro-blog entries
› Extract POI, use as query to search engine
› Resultant snippets filtered to those that contain the POI
› Sanitise
§ Also from geocoded Wikipedia articles (according to Yago2)
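The bootstrapping loop above can be sketched as follows. This is a minimal illustration, not the talk's implementation: `fake_search` stands in for the Bing API, and the ten-snippets-per-query cap comes from the speaker notes.

```python
import html
import re

def sanitise(snippet):
    """Strip markup remnants and normalise whitespace in a raw snippet."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", snippet))
    return re.sub(r"\s+", " ", text).strip()

def bootstrap_snippets(poi, search, max_snippets=10):
    """Query a search engine with the POI name and keep only sanitised
    snippets that actually contain the POI mention (case-insensitive)."""
    kept = []
    for raw in search(poi):
        clean = sanitise(raw)
        if poi.lower() in clean.lower():
            kept.append(clean)
        if len(kept) == max_snippets:
            break
    return kept

# Hypothetical stand-in for the Bing search API used in the talk.
def fake_search(query):
    return [
        "...left the <b>Marriott Hotel</b> that he remembered...",
        "An unrelated snippet about cheese.",
    ]

print(bootstrap_snippets("Marriott Hotel", fake_search))
# ['...left the Marriott Hotel that he remembered...']
```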
13. Ground Truth Data
§ Created by manual assessors given explicit instructions
› 1,337 examples of POIs in (some) context
› 1,066 unique POIs
› Inter-assessor agreement:

Ground Truth Assessor   Precision   Recall   F-Measure
1                       0.749       0.792    0.770
2                       0.814       0.716    0.762
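The F-Measure column is the harmonic mean of precision and recall (F1); a quick check reproduces the table's values:

```python
def f_measure(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.749, 0.792), 3))  # 0.770 (assessor 1)
print(round(f_measure(0.814, 0.716), 3))  # 0.762 (assessor 2)
```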
15. Features
§ Lexical
› Word identity, shape, position, etc.
§ Grammatical
› Part of Speech, Apache OpenNLP
§ Statistical
› Normalised Point-wise Mutual Information of mobile search query logs
§ Geographic
› Gazetteer attributes from Yahoo! Placemaker
› http://developer.yahoo.com/geo/placemaker/
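The statistical feature can be illustrated with a toy query log. This is a sketch only: the talk computed NPMI over mobile search query logs, whereas the queries and counts below are invented, and bigram marginals are used for the unigram probabilities so the score stays in [-1, 1].

```python
import math
from collections import Counter

# Toy stand-in for the mobile search query log used in the talk.
queries = ["statue of liberty", "statue of liberty tickets", "liberty insurance"]

bigrams = Counter()
for q in queries:
    toks = q.split()
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1

total = sum(bigrams.values())
first = Counter()   # marginal counts of left-hand words
second = Counter()  # marginal counts of right-hand words
for (a, b), c in bigrams.items():
    first[a] += c
    second[b] += c

def npmi(w1, w2):
    """Normalised PMI in [-1, 1]: pmi(w1, w2) / -log p(w1, w2)."""
    p_xy = bigrams[(w1, w2)] / total
    pmi = math.log(p_xy / ((first[w1] / total) * (second[w2] / total)))
    return pmi / -math.log(p_xy)

print(round(npmi("statue", "of"), 3))  # 1.0 -- "statue" only ever precedes "of"
print(round(npmi("liberty", "insurance"), 3))  # 0.613
```

A high NPMI between adjacent words suggests they belong to one multi-word unit, which is useful evidence when deciding whether tokens form a single POI name.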
16. Process Overview
[Pipeline diagram] POI mentions are extracted from three sources: geocoded Wikipedia articles (according to Yago2), Foursquare check-ins, and Gowalla check-ins. Each mention is used as a query to a search engine (Bing); the resulting raw web snippets are processed, and CRF models are trained on them, yielding a Wikipedia-based, a Foursquare-based, and a Gowalla-based POI tagger.

Example snippet: "… was only after he had left the Marriott Hotel that he remembered…"
17. Results
Training Data   Testing Data   Precision   Recall
Y! Placemaker   Manual Data    0.237       0.228
Wikipedia       Manual Data    0.514       0.337
Foursquare      Manual Data    0.276       0.655
Gowalla         Manual Data    0.360       0.414
Wikipedia       10-fold CV     0.879       0.955
Foursquare      10-fold CV     0.689       0.468
Gowalla         10-fold CV     0.857       0.868
18. Language Modelling
§ Partition the world into 1 km cells
§ For each cell, create a language model from Flickr photos taken in that area:

P(t | θ_L) = c_user(t, L) / |L|,   where |L| = Σ_{t_i ∈ L} c_user(t_i, L)

(c_user(t, L) is the number of unique users who use term t in cell L)

§ Treat the problem as IR: match a POI (query) against the cells (documents)
› Return centroid of best matching cell
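A minimal sketch of this cell language model, assuming toy (cell, user, tag) photo records invented for illustration; the real system builds models over 1 km × 1 km Flickr cells and returns the centroid of the winning cell rather than its id:

```python
import math
from collections import defaultdict

# Toy photo records: (cell_id, user_id, tag). Counts are over *unique users*
# per term, as in the slide's definition of c_user(t, L).
photos = [
    ("cell_A", "u1", "marriott"), ("cell_A", "u2", "marriott"),
    ("cell_A", "u1", "hotel"),
    ("cell_B", "u3", "beach"), ("cell_B", "u4", "beach"),
    ("cell_B", "u3", "hotel"),
]

c_user = defaultdict(set)  # (term, cell) -> set of users who used the term there
for cell, user, tag in photos:
    c_user[(tag, cell)].add(user)

cells = {cell for cell, _, _ in photos}

def p_term(t, L):
    """P(t | theta_L) = c_user(t, L) / |L|, as on the slide."""
    size_L = sum(len(users) for (_, c), users in c_user.items() if c == L)
    return len(c_user.get((t, L), set())) / size_L

def locate(poi_terms):
    """Rank cells by query-likelihood of the POI name (log-space, with a
    small constant to avoid log(0)); return the best-matching cell."""
    def score(L):
        return sum(math.log(p_term(t, L) + 1e-9) for t in poi_terms)
    return max(cells, key=score)

print(locate(["marriott", "hotel"]))  # cell_A
```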
19. Performance
Median distance (km)    Placemaker   Cascade   Geo Scope   # Examples
Placemaker POIs         0.29         0.29      0.29        134
Placemaker Other Locs   4.19         2.90      2.12        131
All Known Locs          1.17         0.82      0.79        265
New Locations           -            439.0     5.88        130
All Data                -            1.20      0.96        395
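The cascade described in the speaker notes uses the high-precision gazetteer when it knows the POI and falls back to the language model otherwise. A minimal sketch, with hypothetical coordinates and a stub in place of the Flickr language model:

```python
def cascade_locate(poi, gazetteer, lm_locate):
    """Cascade model: trust the gazetteer (Yahoo! Placemaker in the talk)
    when it covers the POI; fall back to the Flickr language model for
    new locations the gazetteer has never seen."""
    coords = gazetteer.get(poi)
    if coords is not None:
        return coords, "gazetteer"
    return lm_locate(poi), "language-model"

# Hypothetical data for illustration only.
gazetteer = {"White House": (38.8977, -77.0365)}
lm_locate = lambda poi: (40.7, -74.0)  # stand-in for the Flickr LM

print(cascade_locate("White House", gazetteer, lm_locate))   # gazetteer hit
print(cascade_locate("Joe's Diner", gazetteer, lm_locate))   # LM fallback
```

This matches the table's pattern: the gazetteer alone cannot place New Locations at all, while the cascade keeps gazetteer precision on known locations and still returns an answer for new ones.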
20. Conclusions and Implications
§ POIs are valuable, but useful ones are difficult to define
§ Generating evaluation data is hard
§ Can use web snippets bootstrapped with check-ins, and Wikipedia articles, to train a POI tagger
› Up to 88% precision on unlabelled data
› Reflects the POIs users visit
› Easily updated
› Can be located accurately using a hybrid gazetteer + Flickr language model technique
21. Benefits of this approach
§ Discover POIs:
› that we already know about (replace/extend existing sources)
› that we didn't already know about (novel POIs)
› of more diverse types (increasing coverage)
› that are fresher
§ Increase relevance of local and hyperlocal search using the wisdom of the crowds
22. Research Areas
- Automatic POI detection in UGC
- Learning how users refer to places
- Localising media
- Generating evaluation data
- (This is hard)
- Multi-source combination
- Quality & Credibility
23. Adam Rae
adamrae@yahoo-inc.com
Thank you
Vanessa Murdock
Adrian Popescu
Hugues Bouchard
Editor's Notes
What is a POI? POIs have names, locations, a category, and context (depending on the envisaged use-case). A point of interest (POI) is a focused geographic entity such as a landmark, a school, an historical building, or a business.
Examples were drawn from news articles from the U.S. and the U.K., plus a small number of examples from Yahoo! Answers and a small number of queries submitted to a search engine. The inter-assessor agreement was 73.9%. In total, 1,337 of the examples they annotated contained POIs, which yielded 1,066 unique POIs.
Learn the set of feature weights Λ (big lambda) which maximises the label sequence probability: the probability of a label sequence y given an observed sequence x, with Z a normalising factor and F(y, x) the set of feature functions computed over the observations and the label transitions.
Up to ten snippets per query. Use BIO encoding.
All three models are statistically significantly higher than the baseline.
c_user(t, L) is the number of unique users who use the term t in the cell L; |L| is the sum of the user frequency of all terms in the location. It makes sense to use highly precise extant info when available, so use the LM in combination with Placemaker (gazetteer): the cascade model.
Median distances in kilometres
Re-finding existing POIs allows us to get context from social media as well as confirm our model's performance. Novel POIs are valuable, extending our knowledge of what is out there. We are not restricted by the biases of existing sources such as commercial enterprises, or by narrow POI criteria.
Wild text (web snippets, tweets, news, etc.) varies in cleanliness and consistency depending on the source. Automatically detecting POIs in UGC content ("Corner of forth and main"). Discussion on the subjective nature of POI/location etc. is very application-dependent (how to evaluate discovery tasks?). Open questions for discussion: localising media; manual annotation data for POI detection (how hard is it for humans?); analytics; combinations of sources.