This document discusses mining social media and user-generated data to extract geographic points of interest (POIs). It presents methods for using structured mentions of POIs from sources such as Foursquare and Wikipedia to train models that extract POIs from unstructured web text. Evaluation data is created to assess extraction accuracy. Models trained on bootstrapped web snippets achieve up to 88% precision on unlabelled data and reflect the POIs people actually visit. A hybrid approach uses gazetteers and Flickr language models to accurately locate extracted POIs. The methods allow discovering new POIs and increasing coverage of diverse POI types from crowdsourced information.
4. Motivation
§ Geographic Points of Interest (POIs) are valuable representations of important places in the world around us.
§ Browsing and search of POIs increasingly important
› Web search
› Mobile
› Navigation
5. Where do POIs come from?
§ Editing listings coming from NMAs, commercial directories, etc.
› Costly process
› Expensive to maintain freshness
› Coverage
§ Do they reflect the kind of places that people are interested in looking for?
6. Can we get them from the web?
§ Un/semi-structured mentions of POIs throughout text on the web
› Lots of context
§ Structured mentions of POIs in micro-blogging systems and Wikipedia articles
› Easy to extract
7. When is a POI not a POI?
1 The White House is at 1600 Pennsylvania Avenue, Washington DC.
2 The White House released a statement today suggesting the moon is made of cheese.
3 The people living in the white house at the end of the street turned out to be Martians.
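Only the first sentence mentions a POI in the geographic sense; a sequence tagger must learn this distinction. A minimal sketch of the BIO token labelling mentioned in the notes, with gold labels hand-assigned for illustration (they are not output of the actual tagger):

```python
# Illustrative gold BIO labels for the slide's first two sentences.
# B-POI/I-POI mark tokens inside a POI mention; O marks everything else.
# In sentence 2 "White House" stands for the organisation, not the place,
# so it receives no POI label. Sentence 2 is truncated for brevity.
sentence_1 = [
    ("The", "O"), ("White", "B-POI"), ("House", "I-POI"), ("is", "O"),
    ("at", "O"), ("1600", "O"), ("Pennsylvania", "O"), ("Avenue", "O"),
    (",", "O"), ("Washington", "O"), ("DC", "O"), (".", "O"),
]
sentence_2 = [
    ("The", "O"), ("White", "O"), ("House", "O"), ("released", "O"),
    ("a", "O"), ("statement", "O"), ("today", "O"), (".", "O"),
]

def extract_pois(tagged):
    """Collect contiguous B-POI/I-POI spans into surface strings."""
    pois, current = [], []
    for token, label in tagged:
        if label == "B-POI":
            if current:
                pois.append(" ".join(current))
            current = [token]
        elif label == "I-POI" and current:
            current.append(token)
        else:
            if current:
                pois.append(" ".join(current))
                current = []
    if current:
        pois.append(" ".join(current))
    return pois

print(extract_pois(sentence_1))  # ['White House']
print(extract_pois(sentence_2))  # []
```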
12. Can we bootstrap using social media?
§ Train Conditional Random Fields (CRF) using web snippets bootstrapped from structured mentions in micro-blog entries
› Extract POI, use as query to search engine
› Resultant snippets filtered to those that contain the POI
› Sanitise
§ Also from geocoded Wikipedia articles (according to Yago2)
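The bootstrapping loop above can be sketched as follows. This is a minimal illustration, not the talk's implementation: `fake_search` stands in for the Bing API, and the ten-snippets-per-query cap comes from the speaker notes.

```python
import html
import re

def sanitise(snippet):
    """Strip markup remnants and normalise whitespace in a raw snippet."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", snippet))
    return re.sub(r"\s+", " ", text).strip()

def bootstrap_snippets(poi, search, max_snippets=10):
    """Query a search engine with the POI name and keep only sanitised
    snippets that actually contain the POI mention (case-insensitive)."""
    kept = []
    for raw in search(poi):
        clean = sanitise(raw)
        if poi.lower() in clean.lower():
            kept.append(clean)
        if len(kept) == max_snippets:
            break
    return kept

# Hypothetical stand-in for the Bing search API used in the talk.
def fake_search(query):
    return [
        "...left the <b>Marriott Hotel</b> that he remembered...",
        "An unrelated snippet about cheese.",
    ]

print(bootstrap_snippets("Marriott Hotel", fake_search))
# ['...left the Marriott Hotel that he remembered...']
```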
13. Ground Truth Data
§ Created by manual assessors given explicit instructions
› 1,337 examples of POIs in (some) context
› 1,066 unique POIs
› Inter-assessor agreement:

Ground Truth Assessor   Precision   Recall   F-Measure
1                       0.749       0.792    0.770
2                       0.814       0.716    0.762
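The F-Measure column is the harmonic mean of precision and recall (F1); a quick check reproduces the table's values:

```python
def f_measure(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.749, 0.792), 3))  # 0.770 (assessor 1)
print(round(f_measure(0.814, 0.716), 3))  # 0.762 (assessor 2)
```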
15. Features
§ Lexical
› Word identity, shape, position, etc.
§ Grammatical
› Part of Speech, Apache OpenNLP
§ Statistical
› Normalised Point-wise Mutual Information of mobile search query logs
§ Geographic
› Gazetteer attributes from Yahoo! Placemaker
› http://developer.yahoo.com/geo/placemaker/
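The statistical feature can be illustrated with a toy query log. This is a sketch only: the talk computed NPMI over mobile search query logs, whereas the queries and counts below are invented, and bigram marginals are used for the unigram probabilities so the score stays in [-1, 1].

```python
import math
from collections import Counter

# Toy stand-in for the mobile search query log used in the talk.
queries = ["statue of liberty", "statue of liberty tickets", "liberty insurance"]

bigrams = Counter()
for q in queries:
    toks = q.split()
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1

total = sum(bigrams.values())
first = Counter()   # marginal counts of left-hand words
second = Counter()  # marginal counts of right-hand words
for (a, b), c in bigrams.items():
    first[a] += c
    second[b] += c

def npmi(w1, w2):
    """Normalised PMI in [-1, 1]: pmi(w1, w2) / -log p(w1, w2)."""
    p_xy = bigrams[(w1, w2)] / total
    pmi = math.log(p_xy / ((first[w1] / total) * (second[w2] / total)))
    return pmi / -math.log(p_xy)

print(round(npmi("statue", "of"), 3))  # 1.0 -- "statue" only ever precedes "of"
print(round(npmi("liberty", "insurance"), 3))  # 0.613
```

A high NPMI between adjacent words suggests they belong to one multi-word unit, which is useful evidence when deciding whether tokens form a single POI name.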
16. Process Overview
[Pipeline diagram] POI mentions are extracted from three sources: geocoded Wikipedia articles (according to Yago2), Foursquare check-ins, and Gowalla check-ins. Each mention is used as a query to a search engine (Bing); the resulting raw web snippets are processed, and CRF models are trained on them, yielding a Wikipedia-based, a Foursquare-based, and a Gowalla-based POI tagger.

Example snippet: "… was only after he had left the Marriott Hotel that he remembered…"
17. Results
Training Data   Testing Data   Precision   Recall
Y! Placemaker   Manual Data    0.237       0.228
Wikipedia       Manual Data    0.514       0.337
Foursquare      Manual Data    0.276       0.655
Gowalla         Manual Data    0.360       0.414
Wikipedia       10-fold CV     0.879       0.955
Foursquare      10-fold CV     0.689       0.468
Gowalla         10-fold CV     0.857       0.868
18. Language Modelling
§ Partition the world into 1 km cells
§ For each cell, create a language model from Flickr photos taken in that area:

P(t | θ_L) = c_user(t, L) / |L|,   where |L| = Σ_{t_i ∈ L} c_user(t_i, L)

(c_user(t, L) is the number of unique users who use term t in cell L)

§ Treat the problem as IR: match a POI (query) against the cells (documents)
› Return centroid of best matching cell
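A minimal sketch of this cell language model, assuming toy (cell, user, tag) photo records invented for illustration; the real system builds models over 1 km × 1 km Flickr cells and returns the centroid of the winning cell rather than its id:

```python
import math
from collections import defaultdict

# Toy photo records: (cell_id, user_id, tag). Counts are over *unique users*
# per term, as in the slide's definition of c_user(t, L).
photos = [
    ("cell_A", "u1", "marriott"), ("cell_A", "u2", "marriott"),
    ("cell_A", "u1", "hotel"),
    ("cell_B", "u3", "beach"), ("cell_B", "u4", "beach"),
    ("cell_B", "u3", "hotel"),
]

c_user = defaultdict(set)  # (term, cell) -> set of users who used the term there
for cell, user, tag in photos:
    c_user[(tag, cell)].add(user)

cells = {cell for cell, _, _ in photos}

def p_term(t, L):
    """P(t | theta_L) = c_user(t, L) / |L|, as on the slide."""
    size_L = sum(len(users) for (_, c), users in c_user.items() if c == L)
    return len(c_user.get((t, L), set())) / size_L

def locate(poi_terms):
    """Rank cells by query-likelihood of the POI name (log-space, with a
    small constant to avoid log(0)); return the best-matching cell."""
    def score(L):
        return sum(math.log(p_term(t, L) + 1e-9) for t in poi_terms)
    return max(cells, key=score)

print(locate(["marriott", "hotel"]))  # cell_A
```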
19. Performance
Median distance (km)    Placemaker   Cascade   Geo Scope   # Examples
Placemaker POIs         0.29         0.29      0.29        134
Placemaker Other Locs   4.19         2.90      2.12        131
All Known Locs          1.17         0.82      0.79        265
New Locations           -            439.0     5.88        130
All Data                -            1.20      0.96        395
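The cascade described in the speaker notes uses the high-precision gazetteer when it knows the POI and falls back to the language model otherwise. A minimal sketch, with hypothetical coordinates and a stub in place of the Flickr language model:

```python
def cascade_locate(poi, gazetteer, lm_locate):
    """Cascade model: trust the gazetteer (Yahoo! Placemaker in the talk)
    when it covers the POI; fall back to the Flickr language model for
    new locations the gazetteer has never seen."""
    coords = gazetteer.get(poi)
    if coords is not None:
        return coords, "gazetteer"
    return lm_locate(poi), "language-model"

# Hypothetical data for illustration only.
gazetteer = {"White House": (38.8977, -77.0365)}
lm_locate = lambda poi: (40.7, -74.0)  # stand-in for the Flickr LM

print(cascade_locate("White House", gazetteer, lm_locate))   # gazetteer hit
print(cascade_locate("Joe's Diner", gazetteer, lm_locate))   # LM fallback
```

This matches the table's pattern: the gazetteer alone cannot place New Locations at all, while the cascade keeps gazetteer precision on known locations and still returns an answer for new ones.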
20. Conclusions and Implications
§ POIs are valuable, but useful ones are difficult to define
§ Generating evaluation data is hard
§ Can use web snippets bootstrapped with check-ins, and Wikipedia articles, to train a POI tagger
› Up to 88% precision on unlabelled data
› Reflects the POIs users visit
› Easily updated
› Can be located accurately using a hybrid gazetteer + Flickr language model technique
21. Benefits of this approach
§ Discover POIs:
› that we already know about (replace/extend existing sources)
› that we didn't already know about (novel POIs)
› of more diverse types (increasing coverage)
› that are fresher
§ Increase relevance of local and hyperlocal search using the wisdom of the crowds
22. Research Areas
- Automatic POI detection in UGC
- Learning how users refer to places
- Localising media
- Generating evaluation data
- (This is hard)
- Multi-source combination
- Quality & Credibility
23. Adam Rae
adamrae@yahoo-inc.com
Thank you
Vanessa Murdock
Adrian Popescu
Hugues Bouchard
Editor's Notes
What is a POI? POIs have names, locations, a category, and context (depending on the envisaged use-case). A point of interest (POI) is a focused geographic entity such as a landmark, a school, an historical building, or a business.
Examples were drawn from news articles from the U.S. and the U.K., plus a small number of examples from Yahoo! Answers and a small number of queries submitted to a search engine. The inter-assessor agreement was 73.9%. In total, 1,337 of the examples they annotated contained POIs, which yielded 1,066 unique POIs.
Learn the set of feature weights Λ (big lambda) which maximises the label sequence probability: the probability of a label sequence y given an observed sequence x, with Z a normalising factor and F(y, x) the set of feature functions computed over the observations and the label transitions.
Up to ten snippets per query. Use BIO encoding.
All three models are statistically significantly higher than the baseline.
c_user(t, L) is the number of unique users who use the term t in the cell L; |L| is the sum of the user frequency of all terms in the location. It makes sense to use highly precise extant info when available, so use the LM in combination with Placemaker (gazetteer): the cascade model.
Median distances in kilometres
Re-finding existing POIs allows us to get context from social media as well as confirm our model's performance. Novel POIs are valuable, extending our knowledge of what is out there. We are not restricted by the biases of existing sources such as commercial enterprises, or by narrow POI criteria.
Wild text (web snippets, tweets, news, etc.) varies in cleanliness and consistency depending on the source. Automatically detecting POIs in UGC content ("Corner of forth and main"). Discussion on the subjective nature of POI/location etc. is very application-dependent (how to evaluate discovery tasks?). Open questions for discussion: localising media; manual annotation data for POI detection (how hard is it for humans?); analytics; combinations of sources.