Adam Rae
Vanessa Murdock, Adrian Popescu, Hugues Bouchard
     SIGIR 2012, Portland, Oregon, Entities Session

Mining the Web for Points of Interest
Using social media to increase our knowledge of the world

"I’m at Adam’s Bar…"
Contents

§ Motivation

§ Point Of Interest (POI) extraction using user
   generated data

§ POI localisation using social media

§ Conclusions
Motivation
§ Geographic Points of Interest are valuable
   representations of important places in the world
   around us.

§ Browsing and search
   of POIs increasingly
   important
 ›    Web search
 ›    Mobile
 ›    Navigation
Where do POIs come from?

§ Editing listings coming from NMAs, commercial
   directories etc.
 ›    Costly process
 ›    Expensive to maintain freshness
 ›    Coverage
§ Do they reflect the kind of
   places that people are
   interested in looking for?
Can we get them from the web?
§ Un/semi-structured mentions of POIs throughout
   text on web
 ›    Lots of context

§ Structured mentions of POIs in micro blogging
   systems and Wikipedia articles
 ›    Easy to extract
When is a POI not a POI?

1  The White House is at 1600 Pennsylvania
   Avenue, Washington DC.

2  The White House released a statement today
   suggesting the moon is made of cheese.

3  The people living in the white house at the end
   of the street turned out to be Martians.
Europe According to Foursquare
The World According to Foursquare
The World According to Gowalla
The World According to Wikipedia
Can we bootstrap using social media?

§ Train Conditional Random Fields (CRF) using
   web snippets bootstrapped from structured
   mentions in micro-blog entries
 ›    Extract POI, use as query to search engine
 ›    Resultant snippets filtered to those that contain POI
 ›    Sanitise


§ Also from geocoded Wikipedia articles (according
   to Yago2)
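
The filtering and labelling steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes whitespace tokenisation, exact case-sensitive matching of the POI name, and the BIO labelling scheme mentioned in the editor's notes. The function names `bio_tag` and `contains_poi` are hypothetical.

```python
def bio_tag(tokens, poi_tokens):
    """Label tokens B-POI / I-POI / O by exact matching of the POI name."""
    labels = ["O"] * len(tokens)
    n = len(poi_tokens)
    if n == 0:
        return labels
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == poi_tokens:
            labels[i] = "B-POI"              # first token of the mention
            for k in range(i + 1, i + n):
                labels[k] = "I-POI"          # remaining tokens of the mention
            i += n
        else:
            i += 1
    return labels


def contains_poi(labels):
    """Snippet filter: keep only snippets with at least one POI mention."""
    return "B-POI" in labels


# Example snippet from the Process Overview slide.
snippet = "was only after he had left the Marriott Hotel that he remembered".split()
labels = bio_tag(snippet, ["Marriott", "Hotel"])
if contains_poi(labels):
    for token, label in zip(snippet, labels):
        print(token, label)
```

Snippets labelled this way would form the training sequences for the CRF; any sanitising of the snippet text would happen before this step.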
Ground Truth Data
§ Created by manual assessors given explicit
   instructions
 ›    1,337 examples of POIs in (some) context
 ›    1,066 unique POIs
 ›    Inter-assessor agreement:

      Ground Truth Assessor   Precision   Recall   F-Measure
      1                       0.749       0.792    0.770
      2                       0.814       0.716    0.762
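
The F-Measure column can be reproduced from the precision and recall columns. A minimal sketch, assuming F-Measure here means the balanced F1 score (the harmonic mean of precision and recall); the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per ground-truth assessor, taken from the table above.
assessor_scores = {1: (0.749, 0.792), 2: (0.814, 0.716)}
for assessor, (p, r) in assessor_scores.items():
    print(assessor, f"{f_measure(p, r):.3f}")
```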
Sequential Tagging Model

$$p(Y \mid X, \lambda) = \frac{1}{Z(X)} \exp\Big( \sum_j \lambda_j F_j(Y, X) \Big)$$

$$\operatorname*{argmax}_{\Lambda} \left\{ \frac{1}{Z(X)} \exp\Big( \sum_j \lambda_j F_j(Y, X) \Big) \right\}$$
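
To make the model concrete, the sketch below evaluates p(Y | X, λ) for a toy three-token input by enumerating every label sequence to obtain Z(X). The four feature functions and their weights are invented for illustration, and brute-force enumeration is only feasible for very short sequences. Learning the weights Λ (the argmax above) is not shown.

```python
import itertools
import math

LABELS = ["O", "B-POI", "I-POI"]


def features(y, x):
    """Global feature vector F(Y, X), summed over positions (hypothetical features)."""
    cap_begin = sum(1 for t, w in enumerate(x) if w[0].isupper() and y[t] == "B-POI")
    lower_out = sum(1 for t, w in enumerate(x) if w[0].islower() and y[t] == "O")
    b_then_i = sum(1 for t in range(1, len(y)) if y[t - 1] == "B-POI" and y[t] == "I-POI")
    orphan_i = sum(1 for t in range(len(y)) if y[t] == "I-POI" and (t == 0 or y[t - 1] == "O"))
    return [cap_begin, lower_out, b_then_i, orphan_i]


def score(y, x, lam):
    """Weighted feature sum: sum_j lambda_j * F_j(Y, X)."""
    return sum(l * f for l, f in zip(lam, features(y, x)))


def all_sequences(x):
    return list(itertools.product(LABELS, repeat=len(x)))


def probability(y, x, lam):
    """p(Y | X, lambda) with Z(X) computed by brute-force enumeration."""
    z = sum(math.exp(score(c, x, lam)) for c in all_sequences(x))
    return math.exp(score(y, x, lam)) / z


x = ["the", "Marriott", "Hotel"]
lam = [1.0, 1.0, 2.0, -3.0]  # illustrative weights, not learned values
best = max(all_sequences(x), key=lambda c: score(c, x, lam))
print(best, probability(best, x, lam))
```

A practical tagger would replace the enumeration with dynamic programming (forward algorithm for Z(X), Viterbi for decoding).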
Features
§ Lexical
 ›    Word identity, shape, position, etc.
§ Grammatical
 ›    Part of Speech, Apache OpenNLP
§ Statistical
 ›    Normalised Point-wise Mutual Information of mobile
      search query logs
§ Geographic
 ›    Gazetteer attributes from Yahoo! Placemaker
 ›    http://developer.yahoo.com/geo/placemaker/
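
Two of the feature families above can be sketched in a few lines. The word-shape mapping and the feature dictionary layout are assumptions for illustration. The NPMI function uses the standard definition, pmi(x, y) / -log p(x, y); how the statistic was computed over the mobile query logs is not stated in the slides, so the probabilities are plain inputs here.

```python
import math
import re


def word_shape(word):
    """Map uppercase letters to X, lowercase to x, digits to d."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def lexical_features(tokens, i):
    """Lexical features for token i (hypothetical feature names)."""
    return {
        "word": tokens[i].lower(),
        "shape": word_shape(tokens[i]),
        "position": i,
        "is_first": i == 0,
    }


def npmi(p_xy, p_x, p_y):
    """Normalised point-wise mutual information, in the range [-1, 1]."""
    if not 0.0 < p_xy < 1.0:
        raise ValueError("p_xy must lie strictly between 0 and 1")
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


tokens = ["left", "the", "Marriott", "Hotel"]
print(lexical_features(tokens, 2))
print(npmi(0.1, 0.1, 0.1))    # terms that always co-occur
print(npmi(0.01, 0.1, 0.1))   # statistically independent terms
```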
Process Overview

§ Geocoded Wikipedia Articles → Extract Article Titles → Search Engine (Bing) → Wikipedia Bootstrapped Raw Web Snippets → Snippet Processing → CRF Model Training → Wikipedia based POI Tagger
§ Check-Ins (Foursquare) → Extract POI Mentions → Search Engine (Bing) → Foursquare Bootstrapped Raw Web Snippets → Snippet Processing → CRF Model Training → Foursquare based POI Tagger
§ Check-Ins (Gowalla) → Extract POI Mentions → Search Engine (Bing) → Gowalla Bootstrapped Raw Web Snippets → Snippet Processing → CRF Model Training → Gowalla based POI Tagger

§ Example snippet: "… was only after he had left the Marriott Hotel that he remembered…"
Results

Training Data   Testing Data   Precision   Recall

Y! Placemaker   Manual Data    0.237       0.228
Wikipedia       Manual Data    0.514       0.337
Foursquare      Manual Data    0.276       0.655
Gowalla         Manual Data    0.360       0.414
Wikipedia       10-fold CV     0.879       0.955
Foursquare      10-fold CV     0.689       0.468
Gowalla         10-fold CV     0.857       0.868
Language Modelling
§ Partition the world into 1km cells
§ For each, create model from Flickr photos taken
   in that area

$$P(t \mid \theta_L) = \frac{c_{user}(t, L)}{|L|}, \qquad |L| = \sum_{t_i \in L} c_{user}(t_i, L)$$

§ Treat problem as IR, match a POI (query) against
   the cells (document)
 ›    Return centroid of best matching cell
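
A minimal sketch of the cell language model and the matching step. The photo records, cell identifiers and the `floor` probability for unseen terms are assumptions (the slides do not say how unseen terms are handled), and mapping the winning cell to its centroid coordinates is left out.

```python
import math
from collections import defaultdict


def build_cell_models(photos):
    """photos: iterable of (cell_id, user_id, tags). Returns {cell: {term: P(t | theta_L)}}."""
    users = defaultdict(lambda: defaultdict(set))
    for cell, user, tags in photos:
        for term in tags:
            users[cell][term].add(user)      # count each user once per term and cell
    models = {}
    for cell, term_users in users.items():
        counts = {term: len(u) for term, u in term_users.items()}   # c_user(t, L)
        total = sum(counts.values())                                # |L|
        models[cell] = {term: c / total for term, c in counts.items()}
    return models


def best_cell(query_terms, models, floor=1e-6):
    """Return the cell whose model gives the POI name the highest log-likelihood."""
    def log_likelihood(model):
        return sum(math.log(model.get(term, floor)) for term in query_terms)
    return max(models, key=lambda cell: log_likelihood(models[cell]))


# Invented example data: (cell, user, photo tags).
photos = [
    ("cell_a", "u1", ["eiffel", "tower"]),
    ("cell_a", "u2", ["eiffel", "paris"]),
    ("cell_a", "u1", ["eiffel"]),
    ("cell_b", "u3", ["tower", "bridge"]),
    ("cell_b", "u4", ["bridge", "london"]),
]
models = build_cell_models(photos)
print(best_cell(["eiffel", "tower"], models))
```

Counting unique users rather than raw tag occurrences keeps a single prolific photographer from dominating a cell's model.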
Performance

Median distance in kilometres:

Data set                Placemaker   Cascade   Geo Scope   # Examples
Placemaker POIs         0.29         0.29      0.29        134
Placemaker Other Locs   4.19         2.90      2.12        131
All Known Locs          1.17         0.82      0.79        265
New Locations           -            439.0     5.88        130
All Data                -            1.20      0.96        395
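
According to the editor's notes, the figures in this table are median distances in kilometres. The sketch below shows one way such a figure could be computed from predicted and true coordinates. The haversine great-circle distance and the Earth radius of 6371 km are assumptions, since the slides do not state the distance measure, and the coordinates are invented.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(predicted, actual):
    """Median distance between predicted and true locations."""
    return median(haversine_km(*p, *a) for p, a in zip(predicted, actual))


# Invented (lat, lon) pairs for three POIs.
predicted = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
actual = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(round(median_error_km(predicted, actual), 2))
```

The median is less sensitive than the mean to a few badly misplaced POIs, which is one plausible reason to report it.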
Conclusions and Implications

§  POIs are valuable, but useful ones difficult to define

§  Generating evaluation data is hard

§  Can use web snippets bootstrapped with
    check-ins, and articles on Wikipedia to train POI
    tagger
 ›    Up to 88% precision on unlabelled data
 ›    Reflect the POIs users visit
 ›    Easily updated
 ›    Can be located accurately using hybrid gazetteer + Flickr
      language model technique
Benefits of this approach
§ Discover POIs:
 ›    that we already know about (replace/extend existing
      sources)
 ›    we didn’t already know about (novel POIs)
 ›    of more diverse types (increasing coverage)
 ›    that are fresher


§ Increase relevance of local and hyperlocal search
   using wisdom of the crowds
Research Areas
-  Automatic POI detection in UGC
-  Learning how users refer to places
-  Localising media
-  Generating evaluation data
 -    (This is hard)
-  Multi-source combination
-  Quality & Credibility
Thank you
Adam Rae (adamrae@yahoo-inc.com)
Vanessa Murdock
Adrian Popescu
Hugues Bouchard

Editor's Notes

  1. What is a POI? POIs have names, locations, category, context (depends on envisaged use-case). A point of interest (POI) is a focused geographic entity such as a landmark, a school, an historical building, or a business.
  2. news articles from the U.S. and the U.K., but also included a small number of examples from Yahoo! Answers and a small number of queries submitted to a search engine. The inter-assessor agreement was 73.9%. In total, 1,337 of the examples they annotated contained POIs, which yielded 1,066 unique POIs.
  3. Learn the set of feature weights (big) lambda which maximises the label sequence probability. p(Y | X, λ) is the probability of a label sequence Y, given an observed sequence X. Z is a normalising factor. F(Y,X) is the set of feature functions computed over the observations and the label transitions.
  4. Up to ten snippets per query. Use BIO.
  5. All three models are statistically significantly higher than the baseline.
  6. C_user(t,L) is the number of unique users who use the term ‘t’ in the cell ‘L’. |L| is the sum of the user frequency of all terms in the location. It makes sense to use highly precise extant info when available, so use the LM in combination with Placemaker (gazetteer) = cascade model.
  7. Median distances in kilometres.
  8. Re-finding existing POIs allows us to get context from social media as well as confirm our model’s performance. Novel POIs are valuable, extending our knowledge of what is out there. Not restricted by the biases of existing sources like commercial enterprises or narrow criteria POIs.
  9. Wild text: web snippets, Tweets, news, etc.; varies in cleanliness and consistency depending on source. Automatically detecting POIs in UGC content (“Corner of forth and main”). Discussion on the subjective nature of POI/location etc., very application-dependent. (How to evaluate discovery tasks?) Discussion: open questions. Localising them. Talking about manual annotation data for POI detection. (How hard is it for humans?) Analytics. Combinations of sources.