SlideShare a Scribd company logo
1 of 33
Semantic Search at Yahoo
P R E S E N T E D B Y P e t e r M i k a , S r . R e s e a r c h S c i e n t i s t , Y a h o o L a b s ⎪ A p r i l 1 6 , 2 0 1 4
The Semantic Web (2001-)
4/16/20142
 Part of Tim Berners-Lee‟s
original proposal for the Web
 Beginning of a research community
› Ontology engineering
› Logical inference
› Agents, web services
 Rough start in deployment
› Misplaced expectations
› Lack of adoption
 The Semantic Web, May 2001
 “At the doctor's office, Lucy instructed her
Semantic Web agent through her handheld Web
browser. The agent promptly retrieved
information about Mom's prescribed treatment
from the doctor's agent, looked up several lists
of providers, and checked for the ones in-plan
for Mom's insurance within a 20-mile radius of
her home and with a rating of excellent or very
good on trusted rating services. It then began
trying to find a match between available
appointment times (supplied by the agents of
individual providers through their Web sites) and
Pete's and Lucy's busy schedules.”
 (The emphasized keywords indicate terms
whose semantics, or meaning, were defined for
the agent through the Semantic Web.)
4/16/20143
Misplaced expectations?
Lack of adoption
 Standardization ahead of adoption
› URI, RDF, RDF/XML, RDFa, JSON-LD,
OWL, RIF, SPARQL, OWL-S, POWDER …
 Chicken and egg problem
› No users/use cases, hence no data
› No data, because no users/use cases
 By 2007, some modest progress
› Metadata in HTML: microformats
› Linked Data: simplifying the stack
Web search by 2007
5
 Large classes of queries are solved to perfection
 Improvements in web search are harder and harder to come by
› Relevance models, hyperlink structure and interaction data
› Combination of features using machine learning
› Heavy investment in computational power
• real-time indexing, instant search, datacenters and edge services
 Language issues
› Multiple interpretations
• jaguar
• paris hilton
› Secondary meaning
• george bush (and I mean the beer brewer
in Arizona)
› Subjectivity
• reliable digital camera
• paris hilton sexy
› Imprecise or overly precise searches
• jim hendler
 Complex needs
› Missing information
• brad pitt zombie
• florida man with 115 guns
• 35 year old computer scientist living in
barcelona
› Category queries
• countries in africa
• barcelona nightlife
› Transactional or computational queries
• 120 dollars in euros
• digital camera under 300 dollars
• world temperature in 2020
Poorly solved information needs remain
Many of these queries would
not be asked by users, who
learned over time what search
technology can and can not
do.
Web search by 2007
7
 Are there even any true keyword queries?
› Lyrics, quotes and bugs… anything else?
 Remaining challenges are not computational, but in modeling user
cognition
› Need a deeper understanding of the query, the content and/or the world at large
Microsearch internal prototype (2007)
Personal and
private
homepage
of the same
person
(clear from the
snippet but it
could be also
automatically
de-duplicated)
Conferences
he plans to attend
and his vacations
from homepage
plus bio events
from LinkedIn
Geolocation
Enhanced Results
 Computing abstracts is hard
› Summarization of HTML
• Template detection
• Selecting relevant snippets
• Composing readable text
› Efficiency constraints
 Structured data to replace or complement text summary
› Key/value pairs
› Deep links
› Image or Video
Yahoo SearchMonkey (2008)
1. Extract structured data
› Semantic Web markup
• Example:
<span property=“vcard:city”>Santa Clara</span>
<span property=“vcard:region”>CA</span>
› Information Extraction
2. Presentation
› Fixed presentation templates
• One template per object type
› Applications
• Third-party modules to display data (SearchMonkey)
Effectiveness of enhanced results
 Explicit user feedback
› Side-by-side editorial evaluation (A/B testing)
• Editors are shown a traditional search result and enhanced result for the same page
• Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)
 Implicit user feedback
› Click-through rate analysis
• Long dwell time limit of 100s (Ciemiewicz et al. 2010)
• 15% increase in „good‟ clicks
› User interaction model
• Enhanced results lead users to relevant documents (IV) even though less likely to clicked than
textual (III)
• Enhanced results effectively reduce bad clicks!
 See
› Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR
2011: 725-734
Adoption among search providers
 Google announces Rich Snippets - June, 2009
› Faceted search for recipes - Feb, 2011
 Bing tiles – Feb, 2011
 Facebook‟s Like button and the Open Graph Protocol (2010)
› Shows up in profiles and news feed
› Site owners can later reach users who have liked an object
schema.org
 Agreement on a shared set of schemas for common types of web
content
› Bing, Google, and Yahoo! as initial founders (June, 2011)
• Yandex joins schema.org in Nov, 2011
› Similar in intent to sitemaps.org
• Use a single format to communicate the same information to all three search engines
 schema.org covers areas of interest to all search engines
› Business listings (local), creative works (video), recipes, reviews and more
› Microdata, RDFa, JSON-LD syntax
 Collaborative effort
› Growing number of 3rd party contributions
› schema.org discussions at public-vocabs@w3.org
Adoption among publishers
 R.V. Guha: Light at the end of the tunnel (ISWC 2013 keynote)
› Over 15% of all pages now have schema.org markup
› Over 5 million sites, over 25 billion entity references
› In other words
• Same order of magnitude as the web
 See also
› P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012
• Based on Bing US corpus
• 31% of webpages, 5% of domains contain some metadata
› WebDataCommons
• Based on CommonCrawl Nov 2013
• 26% of webpages, 14% of domains contain some metadata
Semantic Search
 Active research field at the intersection of IR, NLP, DB and SemWeb
› ESAIR at SIGIR, SemSearch at ESWC/WWW, EOS and JIWES at SIGIR, Semantic Search
at VLDB
 Exploiting semantic understanding in the retrieval process
› User intent and resources are represented using semantic models
• Not just symbolic representations
› Semantic models are exploited in the matching and ranking of resources
 Tasks
› information extraction
› information reconciliation/tracking
› query understanding
› retrieving/ranking entities/attributes/relations
› result presentation
Information extraction and reconciliation
 Ontology management
› Editorially maintained OWL ontology with 300+ classes
› Covering the domains of interest of Yahoo
 Information extraction
› Automated information extraction
• e.g. wrapper induction
› Metadata from HTML pages
• Focused crawler
› Public datasets (e.g. Dbpedia)
› Proprietary data
 Data fusion
› Manual mapping from the source schemas to the ontology
› Supervised entity reconciliation
• Kedar Bellare, Carlo Curino, Ashwin Machanavajihala, Peter Mika, Mandar Rahurkar, Aamod Sane:
WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013
• Michael J. Welch, Aamod Sane, Chris Drome: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012
 Curation and quality assessment
Yahoo‟s Knowledge Graph
Chicago Cubs
Chicago
Barack Obama
Carlos Zambrano
10% off tickets
for
plays for
plays in
lives in
Brad Pitt
Angelina Jolie
Steven Soderbergh
George Clooney
Ocean’s Twelve
partner
directs
casts in
E/R
casts
in
takes place in
Fight Club
casts in
Dust Brothers
casts
in
music by
Nicolas Torzec: Making knowledge reusable at Yahoo!:
a Look at the Yahoo! Knowledge Base (SemTech 2013)
Query understanding
21
 ~70% of queries contain a named entity
 Entity linking in queries and query sessions
› Online as input to ranking
› Semantic log mining
• Laura Hollink, Peter Mika, Roi Blanco: Web usage mining with semantic analysis. WWW 2013:
561-570
 See tutorial on Entity Linking and Retrieval by Edgar Meij, Krisztián
Balog and Daan Odijk
list search
related entity finding
entity search
SemSearch 2010/11
list completion
SemSearch 2011
TREC ELC taskTREC REF-LOD task
semantic search
Common retrieval tasks in Semantic Search
question-answering
QALD 2012/13/14
Entity Retrieval evaluation
 SemSearch challenge (2010/2011)
› Queries
• 50 entity-mention queries selected from the Search Query Tiny Sample v1.0 dataset, provided by the
Yahoo! Webscope program
› Data
• Billion Triples Challenge 2009 data set
• Combination of crawls of multiple semantic search engines
› Evaluation
• Mechanical Turk
 See report:
› Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S.
Thompson, Thanh Tran: Repeatable and reliable semantic search evaluation. J. Web Sem.
21: 14-29 (2013)
Glimmer: open-source retrieval engine over RDF data
› Extension of MG4J from University of Milano
› Indexing
• MapReduce-based
• Horizontal indexing (subject/predicate/object fields)
• Vertical indexing (one field per predicate)
› Retrieval
• BM25F with machine-learned weights for properties and domains
• 52% improvement over the best system in SemSearch 2010
› Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF
Data. International Semantic Web Conference (1) 2011: 83-97
› https://github.com/yahoo/Glimmer/
 Entity-seeking queries make up
40-50% of the query volume
› Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc
object retrieval in the web of data. WWW 2010: 771-
780
› Thomas Lin, Patrick Pantel, Michael Gamon, Anitha
Kannan, Ariel Fuxman: Active objects: actions for
entity-centric search. WWW 2012: 589-598
 Show a summary of the most
likely information-needs
› Including related entities for navigation
› Roi Blanco, Berkant Barla
Cambazoglu, Peter Mika, Nicolas
Torzec: Entity Recommendations in
Web Search. ISWC 2013
Application:
entity displays in web search
Application: personalization in online news
 Entity linking
 Entity ranking according to relevance to the document
New applications
Mobile search on the rise
 Information access on-the-go requires hands-free operation
› Driving, walking, gym, etc.
• Americans spend 540 hours a year in their cars [1] vs. 348 hours browsing the Web [2]
 ~50% of queries are coming from mobile devices (and growing)
› Changing habits, e.g. iPad usage peaks before bedtime
› Limitations in input/output
[1] http://answers.google.com/answers/threadview?id=392456
[2] http://articles.latimes.com/2012/jun/22/business/la-fi-tn-top-us-brands-news-web-sites-20120622
Mobile search challenges and opportunities
29
 Interaction
› Question-answering
› Support for interactive retrieval
› Spoken-language access
› Task completion
 Contextualization
› Personalization
› Geo
› Context (work/home/travel)
• Try getaviate.com
Interactive, conversational voice search
 Parlance EU project
› Complex dialogs within a domain
• Requires complete semantic understanding
 Complete system (mixed license)
› Automated Speech Recognition (ASR)
› Spoken Language Understanding (SLU)
› Interaction Management
› Knowledge Base
› Natural Language Generation (NLG)
› Text-to-Speech (TTS)
 Video
Task completion
31
 We would like to help our users in task completion
› But we have trained our users to talk in nouns
• Retrieval performance decreases by adding verbs to queries
› We need to understand what the available actions are
 Ongoing work in schema.org in modeling actions
› Understand what actions can be taken on a page
› Help users in mapping their query to potential actions
› Applications in web search, email etc.
THING
THING
Applications
Email (Gmail) SERP (Yandex)
Q&A
 Many thanks to members of the Semantic Search team
at Yahoo Labs Barcelona and to Yahoos around the world
 Contact me
› pmika@yahoo-inc.com
› @pmika
› www.slideshare.net/pmika
› Ask about our internships and other opportunities

More Related Content

What's hot

Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Bradley Allen
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital Transformation
Trey Grainger
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
Trey Grainger
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
Trey Grainger
 
Search and social patents for 2012 and beyond
Search and social patents for 2012 and beyondSearch and social patents for 2012 and beyond
Search and social patents for 2012 and beyond
Bill Slawski
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011
sssw2011
 

What's hot (20)

Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through Entities
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistants
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the Web
 
Knowledge Integration in Practice
Knowledge Integration in PracticeKnowledge Integration in Practice
Knowledge Integration in Practice
 
What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital Transformation
 
Smoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papersSmoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papers
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)
 
From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the Web
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)
 
Search and social patents for 2012 and beyond
Search and social patents for 2012 and beyondSearch and social patents for 2012 and beyond
Search and social patents for 2012 and beyond
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered Search
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011
 

Viewers also liked

Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
toncho11
 
Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
Atul Shridhar
 
Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015
Kerstin Forsberg
 

Viewers also liked (20)

Future of Search | Yury Lifshits, Yahoo! Research
Future of Search | Yury Lifshits, Yahoo! ResearchFuture of Search | Yury Lifshits, Yahoo! Research
Future of Search | Yury Lifshits, Yahoo! Research
 
Yahoo Search Marketing Help
Yahoo Search Marketing HelpYahoo Search Marketing Help
Yahoo Search Marketing Help
 
An unconventional loading strategy for YUI 3
An unconventional loading strategy for YUI 3An unconventional loading strategy for YUI 3
An unconventional loading strategy for YUI 3
 
EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...
EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...
EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...
 
Hackathon s pb
Hackathon s pbHackathon s pb
Hackathon s pb
 
Investigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log AnalysisInvestigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log Analysis
 
Advanced Internet searching
Advanced  Internet searchingAdvanced  Internet searching
Advanced Internet searching
 
search engines
search enginessearch engines
search engines
 
Semantics and linked data at astra zeneca
Semantics and linked data at astra zenecaSemantics and linked data at astra zeneca
Semantics and linked data at astra zeneca
 
[WEBINAR] Boost Your Paid Search ROI with Bing and Yahoo! Search
[WEBINAR] Boost Your Paid Search ROI with Bing and Yahoo! Search[WEBINAR] Boost Your Paid Search ROI with Bing and Yahoo! Search
[WEBINAR] Boost Your Paid Search ROI with Bing and Yahoo! Search
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to Delivery
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
 
Ontological approach for improving semantic web search results
Ontological approach for improving semantic web search resultsOntological approach for improving semantic web search results
Ontological approach for improving semantic web search results
 
Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
 
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
 
Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015
 
A Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval TechniquesA Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval Techniques
 
In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Web 3.0
Web 3.0Web 3.0
Web 3.0
 

Similar to Semantic Search at Yahoo

(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”
icwe2015
 

Similar to Semantic Search at Yahoo (20)

Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015
 
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015
 
(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”
 
Semantic mark-up with schema.org: helping search engines understand the Web
Semantic mark-up with schema.org: helping search engines understand the WebSemantic mark-up with schema.org: helping search engines understand the Web
Semantic mark-up with schema.org: helping search engines understand the Web
 
Social Networks and the Semantic Web: a retrospective of the past 10 years
Social Networks and the Semantic Web: a retrospective of the past 10 yearsSocial Networks and the Semantic Web: a retrospective of the past 10 years
Social Networks and the Semantic Web: a retrospective of the past 10 years
 
Employees, Business Partners and Bad Guys: What Web Data Reveals About Person...
Employees, Business Partners and Bad Guys: What Web Data Reveals About Person...Employees, Business Partners and Bad Guys: What Web Data Reveals About Person...
Employees, Business Partners and Bad Guys: What Web Data Reveals About Person...
 
Design the Search Experience
Design the Search ExperienceDesign the Search Experience
Design the Search Experience
 
Smashing silos ia-ux-meetup-mar112014
Smashing silos ia-ux-meetup-mar112014Smashing silos ia-ux-meetup-mar112014
Smashing silos ia-ux-meetup-mar112014
 
Not Your Mom's SEO
Not Your Mom's SEONot Your Mom's SEO
Not Your Mom's SEO
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis Panel
 
Walk Before You Run: Prerequisites to Linked Data
Walk Before You Run: Prerequisites to Linked DataWalk Before You Run: Prerequisites to Linked Data
Walk Before You Run: Prerequisites to Linked Data
 
Evaluating search engines
Evaluating search enginesEvaluating search engines
Evaluating search engines
 
Large-Scale Semantic Search
Large-Scale Semantic SearchLarge-Scale Semantic Search
Large-Scale Semantic Search
 
3 Understanding Search
3 Understanding Search3 Understanding Search
3 Understanding Search
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By Design
 
Advanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU InvestigatorsAdvanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU Investigators
 
Bearish SEO: Defining the User Experience for Google’s Panda Search Landscape
Bearish SEO: Defining the User Experience for Google’s Panda Search LandscapeBearish SEO: Defining the User Experience for Google’s Panda Search Landscape
Bearish SEO: Defining the User Experience for Google’s Panda Search Landscape
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Practical Approaches to Sharing Information
Practical Approaches to Sharing InformationPractical Approaches to Sharing Information
Practical Approaches to Sharing Information
 

More from Peter Mika (6)

Making the Web searchable
Making the Web searchableMaking the Web searchable
Making the Web searchable
 
Publishing data on the Semantic Web
Publishing data on the Semantic WebPublishing data on the Semantic Web
Publishing data on the Semantic Web
 
Hack U Barcelona 2011
Hack U Barcelona 2011Hack U Barcelona 2011
Hack U Barcelona 2011
 
Semantic Search Summer School2009
Semantic Search Summer School2009Semantic Search Summer School2009
Semantic Search Summer School2009
 
Year of the Monkey: Lessons from the first year of SearchMonkey
Year of the Monkey: Lessons from the first year of SearchMonkeyYear of the Monkey: Lessons from the first year of SearchMonkey
Year of the Monkey: Lessons from the first year of SearchMonkey
 
Semantic Web Austin Yahoo
Semantic Web Austin YahooSemantic Web Austin Yahoo
Semantic Web Austin Yahoo
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

Semantic Search at Yahoo

  • 1. Semantic Search at Yahoo P R E S E N T E D B Y P e t e r M i k a , S r . R e s e a r c h S c i e n t i s t , Y a h o o L a b s ⎪ A p r i l 1 6 , 2 0 1 4
  • 2. The Semantic Web (2001-) 4/16/20142  Part of Tim Berners-Lee‟s original proposal for the Web  Beginning of a research community › Ontology engineering › Logical inference › Agents, web services  Rough start in deployment › Misplaced expectations › Lack of adoption
  • 3.  The Semantic Web, May 2001  “At the doctor's office, Lucy instructed her Semantic Web agent through her handheld Web browser. The agent promptly retrieved information about Mom's prescribed treatment from the doctor's agent, looked up several lists of providers, and checked for the ones in-plan for Mom's insurance within a 20-mile radius of her home and with a rating of excellent or very good on trusted rating services. It then began trying to find a match between available appointment times (supplied by the agents of individual providers through their Web sites) and Pete's and Lucy's busy schedules.”  (The emphasized keywords indicate terms whose semantics, or meaning, were defined for the agent through the Semantic Web.) 4/16/20143 Misplaced expectations?
  • 4. Lack of adoption  Standardization ahead of adoption › URI, RDF, RDF/XML, RDFa, JSON-LD, OWL, RIF, SPARQL, OWL-S, POWDER …  Chicken and egg problem › No users/use cases, hence no data › No data, because no users/use cases  By 2007, some modest progress › Metadata in HTML: microformats › Linked Data: simplifying the stack
  • 5. Web search by 2007 5  Large classes of queries are solved to perfection  Improvements in web search are harder and harder to come by › Relevance models, hyperlink structure and interaction data › Combination of features using machine learning › Heavy investment in computational power • real-time indexing, instant search, datacenters and edge services
  • 6.  Language issues › Multiple interpretations • jaguar • paris hilton › Secondary meaning • george bush (and I mean the beer brewer in Arizona) › Subjectivity • reliable digital camera • paris hilton sexy › Imprecise or overly precise searches • jim hendler  Complex needs › Missing information • brad pitt zombie • florida man with 115 guns • 35 year old computer scientist living in barcelona › Category queries • countries in africa • barcelona nightlife › Transactional or computational queries • 120 dollars in euros • digital camera under 300 dollars • world temperature in 2020 Poorly solved information needs remain Many of these queries would not be asked by users, who learned over time what search technology can and can not do.
  • 7. Web search by 2007 7  Are there even any true keyword queries? › Lyrics, quotes and bugs… anything else?  Remaining challenges are not computational, but in modeling user cognition › Need a deeper understanding of the query, the content and/or the world at large
  • 8. Microsearch internal prototype (2007) Personal and private homepage of the same person (clear from the snippet but it could be also automatically de-duplicated) Conferences he plans to attend and his vacations from homepage plus bio events from LinkedIn Geolocation
  • 9. Enhanced Results  Computing abstracts is hard › Summarization of HTML • Template detection • Selecting relevant snippets • Composing readable text › Efficiency constraints  Structured data to replace or complement text summary › Key/value pairs › Deep links › Image or Video
  • 10. Yahoo SearchMonkey (2008) 1. Extract structured data › Semantic Web markup • Example: <span property=“vcard:city”>Santa Clara</span> <span property=“vcard:region”>CA</span> › Information Extraction 2. Presentation › Fixed presentation templates • One template per object type › Applications • Third-party modules to display data (SearchMonkey)
  • 11. Effectiveness of enhanced results  Explicit user feedback › Side-by-side editorial evaluation (A/B testing) • Editors are shown a traditional search result and enhanced result for the same page • Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)  Implicit user feedback › Click-through rate analysis • Long dwell time limit of 100s (Ciemiewicz et al. 2010) • 15% increase in „good‟ clicks › User interaction model • Enhanced results lead users to relevant documents (IV) even though less likely to clicked than textual (III) • Enhanced results effectively reduce bad clicks!  See › Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734
  • 12. Adoption among search providers  Google announces Rich Snippets - June, 2009 › Faceted search for recipes - Feb, 2011  Bing tiles – Feb, 2011  Facebook‟s Like button and the Open Graph Protocol (2010) › Shows up in profiles and news feed › Site owners can later reach users who have liked an object
  • 13. schema.org  Agreement on a shared set of schemas for common types of web content › Bing, Google, and Yahoo! as initial founders (June, 2011) • Yandex joins schema.org in Nov, 2011 › Similar in intent to sitemaps.org • Use a single format to communicate the same information to all three search engines  schema.org covers areas of interest to all search engines › Business listings (local), creative works (video), recipes, reviews and more › Microdata, RDFa, JSON-LD syntax  Collaborative effort › Growing number of 3rd party contributions › schema.org discussions at public-vocabs@w3.org
  • 14. Adoption among publishers  R.V. Guha: Light at the end of the tunnel (ISWC 2013 keynote) › Over 15% of all pages now have schema.org markup › Over 5 million sites, over 25 billion entity references › In other words • Same order of magnitude as the web  See also › P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012 • Based on Bing US corpus • 31% of webpages, 5% of domains contain some metadata › WebDataCommons • Based on CommonCrawl Nov 2013 • 26% of webpages, 14% of domains contain some metadata
  • 15. Semantic Search  Active research field at the intersection of IR, NLP, DB and SemWeb › ESAIR at SIGIR, SemSearch at ESWC/WWW, EOS and JIWES at SIGIR, Semantic Search at VLDB  Exploiting semantic understanding in the retrieval process › User intent and resources are represented using semantic models • Not just symbolic representations › Semantic models are exploited in the matching and ranking of resources  Tasks › information extraction › information reconciliation/tracking › query understanding › retrieving/ranking entities/attributes/relations › result presentation
  • 16. Information extraction and reconciliation  Ontology management › Editorially maintained OWL ontology with 300+ classes › Covering the domains of interest of Yahoo  Information extraction › Automated information extraction • e.g. wrapper induction › Metadata from HTML pages • Focused crawler › Public datasets (e.g. Dbpedia) › Proprietary data  Data fusion › Manual mapping from the source schemas to the ontology › Supervised entity reconciliation • Kedar Bellare, Carlo Curino, Ashwin Machanavajihala, Peter Mika, Mandar Rahurkar, Aamod Sane: WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013 • Michael J. Welch, Aamod Sane, Chris Drome: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012  Curation and quality assessment
  • 17.
  • 18.
  • 19.
  • 20. Yahoo‟s Knowledge Graph Chicago Cubs Chicago Barack Obama Carlos Zambrano 10% off tickets for plays for plays in lives in Brad Pitt Angelina Jolie Steven Soderbergh George Clooney Ocean’s Twelve partner directs casts in E/R casts in takes place in Fight Club casts in Dust Brothers casts in music by Nicolas Torzec: Making knowledge reusable at Yahoo!: a Look at the Yahoo! Knowledge Base (SemTech 2013)
  • 21. Query understanding 21  ~70% of queries contain a named entity  Entity linking in queries and query sessions › Online as input to ranking › Semantic log mining • Laura Hollink, Peter Mika, Roi Blanco: Web usage mining with semantic analysis. WWW 2013: 561-570  See tutorial on Entity Linking and Retrieval by Edgar Meij, Krisztián Balog and Daan Odijk
  • 22. list search related entity finding entity search SemSearch 2010/11 list completion SemSearch 2011 TREC ELC taskTREC REF-LOD task semantic search Common retrieval tasks in Semantic Search question-answering QALD 2012/13/14
  • 23. Entity Retrieval evaluation  SemSearch challenge (2010/2011) › Queries • 50 entity-mention queries selected from the Search Query Tiny Sample v1.0 dataset, provided by the Yahoo! Webscope program › Data • Billion Triples Challenge 2009 data set • Combination of crawls of multiple semantic search engines › Evaluation • Mechanical Turk  See report: › Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran: Repeatable and reliable semantic search evaluation. J. Web Sem. 21: 14-29 (2013)
  • 24. Glimmer: open-source retrieval engine over RDF data › Extension of MG4J from University of Milano › Indexing • MapReduce-based • Horizontal indexing (subject/predicate/object fields) • Vertical indexing (one field per predicate) › Retrieval • BM25F with machine-learned weights for properties and domains • 52% improvement over the best system in SemSearch 2010 › Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data. International Semantic Web Conference (1) 2011: 83-97 › https://github.com/yahoo/Glimmer/
  • 25.  Entity-seeking queries make up 40-50% of the query volume › Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010: 771- 780 › Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects: actions for entity-centric search. WWW 2012: 589-598  Show a summary of the most likely information-needs › Including related entities for navigation › Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013 Application: entity displays in web search
  • 26. Application: personalization in online news  Entity linking  Entity ranking according to relevance to the document
  • 28. Mobile search on the rise  Information access on-the-go requires hands-free operation › Driving, walking, gym, etc. • Americans spend 540 hours a year in their cars [1] vs. 348 hours browsing the Web [2]  ~50% of queries are coming from mobile devices (and growing) › Changing habits, e.g. iPad usage peaks before bedtime › Limitations in input/output [1] http://answers.google.com/answers/threadview?id=392456 [2] http://articles.latimes.com/2012/jun/22/business/la-fi-tn-top-us-brands-news-web-sites-20120622
  • 29. Mobile search challenges and opportunities 29  Interaction › Question-answering › Support for interactive retrieval › Spoken-language access › Task completion  Contextualization › Personalization › Geo › Context (work/home/travel) • Try getaviate.com
  • 30. Interactive, conversational voice search  Parlance EU project › Complex dialogs within a domain • Requires complete semantic understanding  Complete system (mixed license) › Automated Speech Recognition (ASR) › Spoken Language Understanding (SLU) › Interaction Management › Knowledge Base › Natural Language Generation (NLG) › Text-to-Speech (TTS)  Video
  • 31. Task completion 31  We would like to help our users in task completion › But we have trained our users to talk in nouns • Retrieval performance decreases by adding verbs to queries › We need to understand what the available actions are  Ongoing work in schema.org in modeling actions › Understand what actions can be taken on a page › Help users in mapping their query to potential actions › Applications in web search, email etc. THING THING
  • 33. Q&A  Many thanks to members of the Semantic Search team at Yahoo Labs Barcelona and to Yahoos around the world  Contact me › pmika@yahoo-inc.com › @pmika › www.slideshare.net/pmika › Ask about our internships and other opportunities

Editor's Notes

  1. Non-trivial problems to solve without publisher assistance
  2. When the competition is copying you, you know that you are doing something right.
  3. Facebook invited, but continues to pursue OGP