SlideShare a Scribd company logo
1 of 69
Download to read offline
Leaving No Valuable Data Behind:
the Crazy Ideas and the Business
Xin Luna Dong, Amazon ML
MLConf @ Seattle, 5/2017
Acknowledgement
● Google
Fan Bu
Van Dang
Evgeniy Gabrilovich
Geremy Heitz
Wilko Horn
Kevin Lerman
Camillo Lugaresi
Kevin Murphy
Shaohua Sun
Ali Tamur
Wei Zhang
Sreeram Balakrishnan
Shawn Jeffrey
Anno Langen
Yang Li
Rod McChesney
Crystal Sno Riley
Mike Shwe
Anna Wolferman
● DB Comm
Laure Berti-Equille
Xiao Cheng
Anish Das Sarma
Alon Halevy
Furong Li
Xian Li
Kenneth Lyons
Alexandra Meliou
Weiyi Meng
Ravali Pochampally
Barna Saha
Divesh Srivastava
Xiaolan Wang
Knowledge Is Power
● Many Knowledge Bases (KB)
Using KB in Search
Using KB in Social Media
Using KB in Recommendation
What is a Knowledge Base
● Entities, entity types
○ An entity is an instance of multiple types
○ Entity types organized in a hierarchy
What is a Knowledge Base
● Entities, entity types
● Predicates, (sub, pred, obj) triples
○ A triple describes an attribute of an entity,
or, the relationship between two entities
What is a Knowledge Base
● Entities, entity types
● Predicates, (sub, pred, obj) triples
● Knowledge base: graph with entity nodes
and predicate-labeled edges
Advantages over Traditional DBs
● Easy to model complex relationships in
the real world
● Easy to extend schema
● Easy to specify rules and make inference
Existing Knowledge Bases [DGH+14]
Name
# of Entity
Types
# Predicates # Entities
# Confident
Triples
Knowledge
Vault (KV)
1100 4469 45M 271M
DeepDive 4 34 2.7M 7M
NELL 271 306 5.1M 0.435M
PROSPERA 11 14 N/A 0.1M
Yago2 350,000 100 9.8M 150M
Freebase 1500 35,000 40M 637M
Knowledge
Graph
1500 35,000 570M 18,000M70B
Freebase Statistics
● 2.3B triples on 130M entities
● Break-down
#Triples
Total 2.3B
Name/Alias 1.3B
Type 341M
Webpages 88M
Description 31M
Facts 482M (20%)
● Rich knowledge for head entities in head
verticals. E.g., (Freebase)
Category I. Head Entities in Head
Verticals
Vertical
Percentage
>= 5 facts
Example entity #Facts
country 80% USA 151K
person 43% Barack Obama 1.5K
business 21% Google Inc. 1K
film 73% Frozen 200
album 66% American Idiot 21
Category II. Tail Entities in Head Verticals
● #Facts/Entity in Freebase
○ 40% entities with no fact
○ 56% entities with <3 facts
● Example 1
Category II. Tail Entities in Head Verticals
Freebase
● Example 2
Category II. Tail Entities in Head Verticals
Freebase www.bookdepository.com
● 100 sample tail verticals (Freebase)
○ Example verticals: philosopher, profession,
yoga_poses, pokemon_characters
○ Entities collected from 1-3 authoritative
sources for the vertical
○ In total 17K entities; 6.5K (40%) entities not
in Freebase
○ No vertical-related attributes (~1K in total)
Category III. Head Entities in Tail
Verticals
● Example: Aquamarine (March gemstone)
○ No entity in Freebase
○ No gemstone-related attributes
Category III. Head Entities in Tail
Verticals
Wikipedia
http://www.gemselect.com/
Gap Between KBs and World Knowledge
A1
A2
A3
A4
A5
A6
... ... An UNKNOWN ATTRIBUTES
E1
E2
EXISTING
KNOWLEDGEE3
E4
E5
E6
...
UNKNOWN VALUES
Em
UNKNOWN
ENTITIES
Head knowledge mainly collected
by manual curation or importing
large data sets
How to collect long-tail knowledge
in a scalable way?
Mission
Leaving NO Valuable Data Behind
Mission of a data-integration researcher:
Science is to test crazy ideas;
Engineering is to bring these
ideas into Business
–Andreas Holzinger
The Crazy Ideas
Science is to test crazy ideas
Crazy Idea I. Knowledge Vault
● Conventional approach
Manually curate knowledge from a few major data
sources; e.g., Google Knowledge Graph
● Knowledge Vault
Automatically extract knowledge from the Web
Knowledge Vault: Automatically
extracting knowledge from Web
#Triples
3.2B
(0.3B w. pr>=0.7)
#URLs
2.5B
(28M Websites)
#Extractors 16
[SIGKDD, 2014]
[VLDB, 2014]
Four Types of Web Sources
Knowledge Extraction
● Texts/DOM: distant supervision
● Web tables/lists: schema mapping
● Annotations: semi-automatic mapping
Pattern 1: X “born” Y
→ (X, /people/person
/date_of_birth, Y)
Statistics for Data Sources
As of 11/2013
Knowledge Quality
● Gold standard: Freebase under LCWA
(Local Closed-World Assumption)
○ If (s,p,o) exists in FB: true
○ Otherwise,
■ If (s,p) exists in FB: false (Freebase
knowledge is locally complete)
■ Otherwise: UNKNOWN
Knowledge Quality
● Gold standard: Freebase under LCWA
(Local Closed-World Assumption)
● Well-calibrated probabilities
Crazy Idea II. Knowledge-Based Trust
● Conventional approaches
Evaluate trustworthiness of sources by exogenous
signals: hyperlinks, click-rate, etc.; e.g., PageRank
● Knowledge-based trust
Evaluate by endogenous signals: the correctness of
its factual information
Crazy Idea II: Knowledge-Based Trust:
Evaluating Trustworthiness of Factual Info
Fact 1
Fact 2
Fact 3
Fact 4
Fact 5
Fact 6
Fact 7
Fact 8
Fact 9
Fact 10
...
Accu 0.7
✓
✓
✘
✓
✘
✓
✓
✓
✓
✘
...
Crazy Idea II: Knowledge-Based Trust:
Evaluating Trustworthiness of Factual Info
1.0
1.0
1.0
1.0
0.9
0.9
0.8
0.2
0.1
0.1
...
Fact 1
Fact 2
Fact 3
Fact 4
Fact 5
Fact 6
Fact 7
Fact 8
Fact 9
Fact 10
...
Accu 0.7
✓
✓
✘
✓
✘
✓
✓
✓
✓
✘
...
Triple 1
Triple 2
Triple 3
Triple 4
Triple 5
Triple 6
Triple 7
Triple 8
Triple 9
Triple 10
...
1.0
0.9
0.3
0.8
0.4
0.8
0.9
1.0
0.7
0.2
...
Triple
Corr
Extraction
Corr
Accu 0.73
Knowledge-Based Trust (KBT)
Trustworthiness in [0,1] for 5.6M websites and
119M webpages
[VLDB, 2015]
Knowledge-Based Trust vs. PageRank
Often unpopular
sources w. high
trustworthiness
Correlated
scoresOften sources
w. low accuracy
KBT for Gossip Websites
http://www.ebizmba.com/articles/g
ossip-websites
Gossip Websites
Domain
www.eonline.com
perezhilton.com
radaronline.com
www.zimbio.com
mediatakeout.com
gawker.com
www.popsugar.com
www.people.com
www.tmz.com
www.fishwrapper.com
celebrity.yahoo.com
wonderwall.msn.com
hollywoodlife.com
www.wetpaint.com
14 out of 15 have a
PageRank among top
15% of the websites
All have
knowledge-based trust
in bottom 50%
KBT for Social-Media Webpages
Knowledge Vault in Media
Knowledge-Based Trust in Media
Limitations of the Crazy Ideas
A1
A2
A3
A4
A5
A6
... ... An UNKNOWN ATTRIBUTES
E1
E2
EXISTING
KNOWLEDGEE3
E4
E5
E6
...
UNKNOWN VALUES
Em
UNKNOWN
ENTITIES
Limitations of the Crazy Ideas
A1
A2
A3
A4
A5
A6
... ... An UNKNOWN ATTRIBUTES
E1
E2
EXISTING
KNOWLEDGEE3
E4
E5
E6
...
Em
UNKNOWN
ENTITIES
● Focusing on existing entities and attributes
○ Training data contain only FB predicates
○ Entities need to be annotated as FB entities
● KV: Among the 0.3B high-confidence triples
○ 0.18B triples not in KG
○ KG contains 18B triples (100X) [KDD’14]
● KBT: Compute reliable KBT for <20% websites
and <<5% webpages
The Business
Engineering is to bring these
ideas into Business
● Input: Knowledge triples and their provenances (i.e.,
which extractor extracts from which source)
● Output: a probability in [0,1] for each triple
The Model Behind the Crazy Ideas:
Knowledge Fusion
(S, P)
Model I. Single-Truth Model
Prov1 Prov2 Prov3
Jagadish UM ATT UM
Dewitt MS MS UW
Bernstein MSR MSR MSR
Carey UCI ATT BEA
Franklin UCB UCB UMD
Researcher affiliation
[VLDB, 2009]
Prov1 Prov2 Prov3
Jagadish UM ATT UM
Dewitt MS MS UW
Bernstein MSR MSR MSR
Carey UCI ATT BEA
Franklin UCB UCB UMD
Voting--Trust the majority.
Researcher affiliation
Model I. Single-Truth Model [VLDB, 2009]
Researcher affiliation
Prov1 Prov2 Prov3
Jagadish UM ATT UM
Dewitt MS MS UW
Bernstein MSR MSR MSR
Carey UCI ATT BEA
Franklin UCB UCB UMD
Model I. Single-Truth Model [VLDB, 2009]
Researcher affiliation
Prov1 Prov2 Prov3
Jagadish UM ATT UM
Dewitt MS MS UW
Bernstein MSR MSR MSR
Carey UCI ATT BEA
Franklin UCB UCB UMD
Quality-based--Give higher votes to more
accurate sources.
Model I. Single-Truth Model [VLDB, 2009]
Model II. Multi-Truth Model
Harry Potter actors/actresses
Harry
Potter
Prov1 Prov2 Prov3
Daniel ✓ ✓ ✓
Emma ✓ ✓
Rupert ✓ ✓
Jonny ✓
Eric ✓
[Sigmod, 2014]
Voting--Trust the majority.
Harry Potter actors/actresses
Harry
Potter
Prov1 Prov2 Prov3
Daniel ✓ ✓ ✓
Emma ✓ ✓
Rupert ✓ ✓
Jonny ✓
Eric ✓
Model II. Multi-Truth Model [Sigmod, 2014]
Harry
Potter
Prov1
(high rec)
Prov2
(high prec)
Prov3
(med prec/rec)
Daniel ✓ ✓ ✓
Emma ✓ ✓
Rupert ✓ ✓
Jonny ✓
Eric ✓
Harry Potter actors/actresses
Model II. Multi-Truth Model [Sigmod, 2014]
Quality-based--More likely to be correct if provided
by high-precision provenances; more likely to be
wrong if not provided by high-recall provenances
Harry Potter actors/actresses
Harry
Potter
Prov1
(high rec)
Prov2
(high prec)
Prov3
(med prec/rec)
Daniel ✓ ✓ ✓
Emma ✓ ✓
Rupert ✓ ✓
Jonny ✓
Eric ✓
Model II. Multi-Truth Model [Sigmod, 2014]
Multi-Layer Graphical Model
Observations
● Xewdv
: whether extractor e
extracts from source w the (d,v)
item-value pair
Latent variables
● Cwdv
: whether source w indeed
provides (d,v) pair
● Vd
: the correct value(s) for d
Parameters
● Aw
: Trust of source w
● Pe
: Precision of extractor e
● Re
: Recall of extractor eExtractor
Web
source
Data item
Library of Fusion Models
● Application 1: Google Now email extraction
○ Single-truth model
○ Prec = 0.999, Rec = 0.993
○ Remove 84% errors by rule-based fusion
● Application 2: Entity type identification
○ Multi-truth model
○ Prec = 0.91, Rec = 0.98
● Goal: Collecting knowledge for tail
verticals (e.g., yoga pose, hindu deity)
Lightweight Vertical Project
● Goal: Collecting knowledge for tail
verticals (e.g., yoga pose, hindu deity)
● Method
○ Step 1. Decide interesting tail verticals and up to
3 sources for each vertical
○ Step 2. Have the crowd collect triples from the
given sources through annotation tools
Lightweight Vertical Project
● Goal: Collecting knowledge for tail
verticals (e.g., yoga pose, hindu deity)
● Method
○ Step 1. Decide interesting tail verticals and up to
3 sources for each vertical
○ Step 2. Have the crowd collect triples from the
given sources through annotation tools
○ Step 3. Heavy curation to reach 99.9% precision
Lightweight Vertical Project
● Challenge 1. Find interesting verticals and
relevant high-quality sources
● Challenge 2: Detect errors from curation
and from sources
Challenges in Lightweight Verticals
● Challenge 1. Find interesting verticals and
relevant high-quality sources
● Challenge 2: Detect errors from curation
and from sources
○ Solution: Triangulate from 3 web sources
■ Evidence quality: Prec = 0.2, Rec = 0.65
■ Fusion quality: Prec = 0.85, Rec = 0.5
Challenges in Lightweight Verticals
● Knowledge in 100+ tail verticals
○ 2.2M triples
○ 10K entities, ~700 predicates
○ millions of daily registered users
● Most vs. least popular tail vertical
Knowledge Collected on Tail Verticals
Contributions of Lightweight Verticals
A1
A2
A3
A4
A5
A6
... ... An UNKNOWN ATTRIBUTES
E1
E2
EXISTING
KNOWLEDGEE3
E4
E5
E6
...
UNKNOWN VALUES
Em
UNKNOWN
ENTITIES
Contributions of Lightweight Verticals
A1
A2
A3
A4
A5
A6
... ... An
UNKNOWN
ATTRIBUTES
E1
E2
EXISTING
KNOWLEDGEE3
E4
E5
E6
...
UNKNOWN VALUES
Em
5K NEW ENTITIES AND ~700 ATTRIBUTES
UNKNOWN
ENTITIES
The New Commission
Building a Product Graph at Amazon
● Goal: Build the authoritative knowledge
base for every product in the world
Generic KG
PG
Building a Product Graph at Amazon
● Goal: Build the authoritative knowledge
base for every product in the world
Generic KG
PG
(movie,
music,
book)
Building a Product Graph at Amazon
● Goal: Build the authoritative knowledge
base for every product in the world
Generic KG
Product
Graph
Movie,
music,
book,
etc.
Challenges in Building Product Graph I
● No major sources to curate product data
from
○ Wikipedia does not help too much
○ A lot of structured data buried in text
descriptions in Catalog
○ Retailers gaming with the system so even
more noisy data
Challenges in Building Product Graph II
● Large number of new products everyday
○ Curation is impossible
○ Freshness is a big challenge
Challenges in Building Product Graph II
● Large number of product categories
○ A lot of work to manually define ontology
○ Hard to catch the trend of new product
categories and properties
● Human-in-the-loop knowledge learning
○ Crowd-sourcing evaluation
○ Active learning
○ Close IE vs. Open IE
● Hands-off-the-wheel data integration
○ Machine learning for data quality
○ Building generic systems for data cleaning
and data integration
Our Solution
We are HIRING!!
lunadong@amazon.com
Questions?
THANK YOU!

More Related Content

Viewers also liked

Irina Rish, Researcher, IBM Watson, at MLconf NYC 2017
Irina Rish, Researcher, IBM Watson, at MLconf NYC 2017Irina Rish, Researcher, IBM Watson, at MLconf NYC 2017
Irina Rish, Researcher, IBM Watson, at MLconf NYC 2017
MLconf
 
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
MLconf
 
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
MLconf
 
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017
MLconf
 
Jeff Bradshaw, Founder, Adaptris
Jeff Bradshaw, Founder, AdaptrisJeff Bradshaw, Founder, Adaptris
Jeff Bradshaw, Founder, Adaptris
MLconf
 
Scott Clark, CEO, SigOpt, at MLconf Seattle 2017
Scott Clark, CEO, SigOpt, at MLconf Seattle 2017Scott Clark, CEO, SigOpt, at MLconf Seattle 2017
Scott Clark, CEO, SigOpt, at MLconf Seattle 2017
MLconf
 
Caroline Sinders, Online Harassment Researcher, Wikimedia at The AI Conferenc...
Caroline Sinders, Online Harassment Researcher, Wikimedia at The AI Conferenc...Caroline Sinders, Online Harassment Researcher, Wikimedia at The AI Conferenc...
Caroline Sinders, Online Harassment Researcher, Wikimedia at The AI Conferenc...
MLconf
 

Viewers also liked (20)

Sanjeev Satheesj, Research Scientist, Baidu at The AI Conference 2017
Sanjeev Satheesj, Research Scientist, Baidu at The AI Conference 2017Sanjeev Satheesj, Research Scientist, Baidu at The AI Conference 2017
Sanjeev Satheesj, Research Scientist, Baidu at The AI Conference 2017
 
Anjuli Kannan, Software Engineer, Google at MLconf SF 2016
Anjuli Kannan, Software Engineer, Google at MLconf SF 2016Anjuli Kannan, Software Engineer, Google at MLconf SF 2016
Anjuli Kannan, Software Engineer, Google at MLconf SF 2016
 
Irina Rish, Researcher, IBM Watson, at MLconf NYC 2017
Irina Rish, Researcher, IBM Watson, at MLconf NYC 2017Irina Rish, Researcher, IBM Watson, at MLconf NYC 2017
Irina Rish, Researcher, IBM Watson, at MLconf NYC 2017
 
Alex Dimakis, Associate Professor, Dept. of Electrical and Computer Engineeri...
Alex Dimakis, Associate Professor, Dept. of Electrical and Computer Engineeri...Alex Dimakis, Associate Professor, Dept. of Electrical and Computer Engineeri...
Alex Dimakis, Associate Professor, Dept. of Electrical and Computer Engineeri...
 
Stephanie deWet, Software Engineer, Pinterest at MLconf SF 2016
Stephanie deWet, Software Engineer, Pinterest at MLconf SF 2016Stephanie deWet, Software Engineer, Pinterest at MLconf SF 2016
Stephanie deWet, Software Engineer, Pinterest at MLconf SF 2016
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
 
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
 
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
 
Ross Goodwin, Technologist, Sunspring, MLconf NYC 2017
Ross Goodwin, Technologist, Sunspring, MLconf NYC 2017Ross Goodwin, Technologist, Sunspring, MLconf NYC 2017
Ross Goodwin, Technologist, Sunspring, MLconf NYC 2017
 
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017
 
Jeff Bradshaw, Founder, Adaptris
Jeff Bradshaw, Founder, AdaptrisJeff Bradshaw, Founder, Adaptris
Jeff Bradshaw, Founder, Adaptris
 
Scott Clark, CEO, SigOpt, at MLconf Seattle 2017
Scott Clark, CEO, SigOpt, at MLconf Seattle 2017Scott Clark, CEO, SigOpt, at MLconf Seattle 2017
Scott Clark, CEO, SigOpt, at MLconf Seattle 2017
 
Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017
Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017
Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017
 
Mayur Thakur, Managing Director, Goldman Sachs, at MLconf NYC 2017
Mayur Thakur, Managing Director, Goldman Sachs, at MLconf NYC 2017Mayur Thakur, Managing Director, Goldman Sachs, at MLconf NYC 2017
Mayur Thakur, Managing Director, Goldman Sachs, at MLconf NYC 2017
 
Alexandra Johnson, Software Engineer, SigOpt, at MLconf NYC 2017
Alexandra Johnson, Software Engineer, SigOpt, at MLconf NYC 2017Alexandra Johnson, Software Engineer, SigOpt, at MLconf NYC 2017
Alexandra Johnson, Software Engineer, SigOpt, at MLconf NYC 2017
 
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
 
Layla El Asri, Research Scientist, Maluuba
Layla El Asri, Research Scientist, Maluuba Layla El Asri, Research Scientist, Maluuba
Layla El Asri, Research Scientist, Maluuba
 
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016
 
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
 
Caroline Sinders, Online Harassment Researcher, Wikimedia at The AI Conferenc...
Caroline Sinders, Online Harassment Researcher, Wikimedia at The AI Conferenc...Caroline Sinders, Online Harassment Researcher, Wikimedia at The AI Conferenc...
Caroline Sinders, Online Harassment Researcher, Wikimedia at The AI Conferenc...
 

Similar to Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017

The Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistThe Incredible Disappearing Data Scientist
The Incredible Disappearing Data Scientist
Rebecca Bilbro
 

Similar to Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017 (20)

Knowledge Integration in Practice
Knowledge Integration in PracticeKnowledge Integration in Practice
Knowledge Integration in Practice
 
Stardog Linked Data Catalog
Stardog Linked Data CatalogStardog Linked Data Catalog
Stardog Linked Data Catalog
 
Stardog Linked Data Catalog
Stardog Linked Data CatalogStardog Linked Data Catalog
Stardog Linked Data Catalog
 
Entity Search: The Last Decade and the Next
Entity Search: The Last Decade and the NextEntity Search: The Last Decade and the Next
Entity Search: The Last Decade and the Next
 
Privacy-preserving Data Mining in Industry: Practical Challenges and Lessons ...
Privacy-preserving Data Mining in Industry: Practical Challenges and Lessons ...Privacy-preserving Data Mining in Industry: Practical Challenges and Lessons ...
Privacy-preserving Data Mining in Industry: Practical Challenges and Lessons ...
 
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3
 
Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through Entities
 
Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slides
 
How to Use Data Effectively by Abra Sr. Business Analyst
How to Use Data Effectively by Abra Sr. Business AnalystHow to Use Data Effectively by Abra Sr. Business Analyst
How to Use Data Effectively by Abra Sr. Business Analyst
 
What data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsWhat data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientists
 
Building Entities & Connections in 2020
Building Entities & Connections in 2020 Building Entities & Connections in 2020
Building Entities & Connections in 2020
 
Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at Scale
 
Entity-Relationship Queries over Wikipedia
Entity-Relationship Queries over WikipediaEntity-Relationship Queries over Wikipedia
Entity-Relationship Queries over Wikipedia
 
Explainability and bias in AI
Explainability and bias in AIExplainability and bias in AI
Explainability and bias in AI
 
The Magical Art of Extracting Meaning From Data
The Magical Art of Extracting Meaning From DataThe Magical Art of Extracting Meaning From Data
The Magical Art of Extracting Meaning From Data
 
Knowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sureKnowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sure
 
Data Science Workshop - day 1
Data Science Workshop - day 1Data Science Workshop - day 1
Data Science Workshop - day 1
 
The Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistThe Incredible Disappearing Data Scientist
The Incredible Disappearing Data Scientist
 
Building AI Applications using Knowledge Graphs
Building AI Applications using Knowledge GraphsBuilding AI Applications using Knowledge Graphs
Building AI Applications using Knowledge Graphs
 

More from MLconf

Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
MLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
MLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
MLconf
 

More from MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data Collection
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017

  • 1. Leaving No Valuable Data Behind: the Crazy Ideas and the Business Xin Luna Dong, Amazon ML MLConf @ Seattle, 5/2017
  • 2. Acknowledgement ● Google Fan Bu Van Dang Evgeniy Gabrilovich Geremy Heitz Wilko Horn Kevin Lerman Camillo Lugaresi Kevin Murphy Shaohua Sun Ali Tamur Wei Zhang Sreeram Balakrishnan Shawn Jeffrey Anno Langen Yang Li Rod McChesney Crystal Sno Riley Mike Shwe Anna Wolferman ● DB Comm Laure Berti-Equille Xiao Cheng Anish Das Sarma Alon Halevy Furong Li Xian Li Kenneth Lyons Alexandra Meliou Weiyi Meng Ravali Pochampally Barna Saha Divesh Srivastava Xiaolan Wang
  • 3. Knowledge Is Power ● Many Knowledge Bases (KB)
  • 4. Using KB in Search
  • 5. Using KB in Social Media
  • 6. Using KB in Recommendation
  • 7. What is a Knowledge Base ● Entities, entity types ○ An entity is an instance of multiple types ○ Entity types organized in a hierarchy
  • 8. What is a Knowledge Base ● Entities, entity types ● Predicates, (sub, pred, obj) triples ○ A triple describes an attribute of an entity, or, the relationship between two entities
  • 9. What is a Knowledge Base ● Entities, entity types ● Predicates, (sub, pred, obj) triples ● Knowledge base: graph with entity nodes and predicate-labeled edges
  • 10. Advantages over Traditional DBs ● Easy to model complex relationships in the real world ● Easy to extend schema ● Easy to specify rules and make inference
  • 11. Existing Knowledge Bases [DGH+14] Name # of Entity Types # Predicates # Entities # Confident Triples Knowledge Vault (KV) 1100 4469 45M 271M DeepDive 4 34 2.7M 7M NELL 271 306 5.1M 0.435M PROSPERA 11 14 N/A 0.1M Yago2 350,000 100 9.8M 150M Freebase 1500 35,000 40M 637M Knowledge Graph 1500 35,000 570M 18,000M70B
  • 12. Freebase Statistics ● 2.3B triples on 130M entities ● Break-down #Triples Total 2.3B Name/Alias 1.3B Type 341M Webpages 88M Description 31M Facts 482M (20%)
  • 13. ● Rich knowledge for head entities in head verticals. E.g., (Freebase) Category I. Head Entities in Head Verticals Vertical Percentage >= 5 facts Example entity #Facts country 80% USA 151K person 43% Barack Obama 1.5K business 21% Google Inc. 1K film 73% Frozen 200 album 66% American Idiot 21
  • 14. Category II. Tail Entities in Head Verticals ● #Facts/Entity in Freebase ○ 40% entities with no fact ○ 56% entities with <3 facts
  • 15. ● Example 1 Category II. Tail Entities in Head Verticals Freebase
  • 16. ● Example 2 Category II. Tail Entities in Head Verticals Freebase www.bookdepository.com
  • 17. ● 100 sample tail verticals (Freebase) ○ Example verticals: philosopher, profession, yoga_poses, pokemon_characters ○ Entities collected from 1-3 authoritative sources for the vertical ○ In total 17K entities; 6.5K (40%) entities not in Freebase ○ No vertical-related attributes (~1K in total) Category III. Head Entities in Tail Verticals
  • 18. ● Example: Aquamarine (March gemstone) ○ No entity in Freebase ○ No gemstone-related attributes Category III. Head Entities in Tail Verticals Wikipedia http://www.gemselect.com/
  • 19. Gap Between KBs and World Knowledge A1 A2 A3 A4 A5 A6 ... ... An UNKNOWN ATTRIBUTES E1 E2 EXISTING KNOWLEDGEE3 E4 E5 E6 ... UNKNOWN VALUES Em UNKNOWN ENTITIES Head knowledge mainly collected by manual curation or importing large data sets How to collect long-tail knowledge in a scalable way?
  • 20. Mission Leaving NO Valuable Data Behind Mission of a data-integration researcher:
  • 21. Science is to test crazy ideas; Engineering is to bring these ideas into Business –Andreas Holzinger
  • 22. The Crazy Ideas Science is to test crazy ideas
  • 23. Crazy Idea I. Knowledge Vault ● Conventional approach Manually curate knowledge from a few major data sources; e.g., Google Knowledge Graph ● Knowledge Vault Automatically extract knowledge from the Web
  • 24. Knowledge Vault: Automatically extracting knowledge from Web #Triples 3.2B (0.3B w. pr>=0.7) #URLs 2.5B (28M Websites) #Extractors 16 [SIGKDD, 2014] [VLDB, 2014]
  • 25. Four Types of Web Sources
  • 26. Knowledge Extraction ● Texts/DOM: distant supervision ● Web tables/lists: schema mapping ● Annotations: semi-automatic mapping Pattern 1: X “born” Y → (X, /people/person /date_of_birth, Y)
  • 27. Statistics for Data Sources As of 11/2013
  • 28. Knowledge Quality ● Gold standard: Freebase under LCWA (Local Closed-World Assumption) ○ If (s,p,o) exists in FB: true ○ Otherwise, ■ If (s,p) exists in FB: false (Freebase knowledge is locally complete) ■ Otherwise: UNKNOWN
  • 29. Knowledge Quality ● Gold standard: Freebase under LCWA (Local Closed-World Assumption) ● Well-calibrated probabilities
  • 30. Crazy Idea II. Knowledge-Based Trust ● Conventional approaches Evaluate trustworthiness of sources by exogenous signals: hyperlinks, click-rate, etc.; e.g., PageRank ● Knowledge-based trust Evaluate by endogenous signals: the correctness of its factual information
  • 31. Crazy Idea II: Knowledge-Based Trust: Evaluating Trustworthiness of Factual Info Fact 1 Fact 2 Fact 3 Fact 4 Fact 5 Fact 6 Fact 7 Fact 8 Fact 9 Fact 10 ... Accu 0.7 ✓ ✓ ✘ ✓ ✘ ✓ ✓ ✓ ✓ ✘ ...
  • 32. Crazy Idea II: Knowledge-Based Trust: Evaluating Trustworthiness of Factual Info 1.0 1.0 1.0 1.0 0.9 0.9 0.8 0.2 0.1 0.1 ... Fact 1 Fact 2 Fact 3 Fact 4 Fact 5 Fact 6 Fact 7 Fact 8 Fact 9 Fact 10 ... Accu 0.7 ✓ ✓ ✘ ✓ ✘ ✓ ✓ ✓ ✓ ✘ ... Triple 1 Triple 2 Triple 3 Triple 4 Triple 5 Triple 6 Triple 7 Triple 8 Triple 9 Triple 10 ... 1.0 0.9 0.3 0.8 0.4 0.8 0.9 1.0 0.7 0.2 ... Triple Corr Extraction Corr Accu 0.73
  • 33. Knowledge-Based Trust (KBT) Trustworthiness in [0,1] for 5.6M websites and 119M webpages [VLDB, 2015]
  • 34. Knowledge-Based Trust vs. PageRank Often unpopular sources w. high trustworthiness Correlated scoresOften sources w. low accuracy
  • 35. KBT for Gossip Websites http://www.ebizmba.com/articles/g ossip-websites Gossip Websites Domain www.eonline.com perezhilton.com radaronline.com www.zimbio.com mediatakeout.com gawker.com www.popsugar.com www.people.com www.tmz.com www.fishwrapper.com celebrity.yahoo.com wonderwall.msn.com hollywoodlife.com www.wetpaint.com 14 out of 15 have a PageRank among top 15% of the websites All have knowledge-based trust in bottom 50%
  • 39. Limitations of the Crazy Ideas A1 A2 A3 A4 A5 A6 ... ... An UNKNOWN ATTRIBUTES E1 E2 EXISTING KNOWLEDGEE3 E4 E5 E6 ... UNKNOWN VALUES Em UNKNOWN ENTITIES
  • 40. Limitations of the Crazy Ideas A1 A2 A3 A4 A5 A6 ... ... An UNKNOWN ATTRIBUTES E1 E2 EXISTING KNOWLEDGEE3 E4 E5 E6 ... Em UNKNOWN ENTITIES ● Focusing on existing entities and attributes ○ Training data contain only FB predicates ○ Entities need to be annotated as FB entities ● KV: Among the 0.3B high-confidence triples ○ 0.18B triples not in KG ○ KG contains 18B triples (100X) [KDD’14] ● KBT: Compute reliable KBT for <20% websites and <<5% webpages
  • 41. The Business Engineering is to bring these ideas into Business
  • 42. ● Input: Knowledge triples and their provenances (i.e., which extractor extracts from which source) ● Output: a probability in [0,1] for each triple The Model Behind the Crazy Ideas: Knowledge Fusion (S, P)
  • 43. Model I. Single-Truth Model Prov1 Prov2 Prov3 Jagadish UM ATT UM Dewitt MS MS UW Bernstein MSR MSR MSR Carey UCI ATT BEA Franklin UCB UCB UMD Researcher affiliation [VLDB, 2009]
  • 44. Prov1 Prov2 Prov3 Jagadish UM ATT UM Dewitt MS MS UW Bernstein MSR MSR MSR Carey UCI ATT BEA Franklin UCB UCB UMD Voting--Trust the majority. Researcher affiliation Model I. Single-Truth Model [VLDB, 2009]
  • 45. Researcher affiliation Prov1 Prov2 Prov3 Jagadish UM ATT UM Dewitt MS MS UW Bernstein MSR MSR MSR Carey UCI ATT BEA Franklin UCB UCB UMD Model I. Single-Truth Model [VLDB, 2009]
  • 46. Researcher affiliation Prov1 Prov2 Prov3 Jagadish UM ATT UM Dewitt MS MS UW Bernstein MSR MSR MSR Carey UCI ATT BEA Franklin UCB UCB UMD Quality-based--Give higher votes to more accurate sources. Model I. Single-Truth Model [VLDB, 2009]
  • 47. Model II. Multi-Truth Model Harry Potter actors/actresses Harry Potter Prov1 Prov2 Prov3 Daniel ✓ ✓ ✓ Emma ✓ ✓ Rupert ✓ ✓ Jonny ✓ Eric ✓ [Sigmod, 2014]
  • 48. Voting--Trust the majority. Harry Potter actors/actresses Harry Potter Prov1 Prov2 Prov3 Daniel ✓ ✓ ✓ Emma ✓ ✓ Rupert ✓ ✓ Jonny ✓ Eric ✓ Model II. Multi-Truth Model [Sigmod, 2014]
  • 49. Harry Potter Prov1 (high rec) Prov2 (high prec) Prov3 (med prec/rec) Daniel ✓ ✓ ✓ Emma ✓ ✓ Rupert ✓ ✓ Jonny ✓ Eric ✓ Harry Potter actors/actresses Model II. Multi-Truth Model [Sigmod, 2014]
  • 50. Quality-based--More likely to be correct if provided by high-precision provenances; more likely to be wrong if not provided by high-recall provenances Harry Potter actors/actresses Harry Potter Prov1 (high rec) Prov2 (high prec) Prov3 (med prec/rec) Daniel ✓ ✓ ✓ Emma ✓ ✓ Rupert ✓ ✓ Jonny ✓ Eric ✓ Model II. Multi-Truth Model [Sigmod, 2014]
  • 51. Multi-Layer Graphical Model Observations ● Xewdv : whether extractor e extracts from source w the (d,v) item-value pair Latent variables ● Cwdv : whether source w indeed provides (d,v) pair ● Vd : the correct value(s) for d Parameters ● Aw : Trust of source w ● Pe : Precision of extractor e ● Re : Recall of extractor eExtractor Web source Data item
  • 52. Library of Fusion Models ● Application 1: Google Now email extraction ○ Single-truth model ○ Prec = 0.999, Rec = 0.993 ○ Remove 84% errors by rule-based fusion ● Application 2: Entity type identification ○ Multi-truth model ○ Prec = 0.91, Rec = 0.98
  • 53. ● Goal: Collecting knowledge for tail verticals (e.g., yoga pose, hindu deity) Lightweight Vertical Project
  • 54. ● Goal: Collecting knowledge for tail verticals (e.g., yoga pose, hindu deity) ● Method ○ Step 1. Decide interesting tail verticals and up to 3 sources for each vertical ○ Step 2. Have the crowd collect triples from the given sources through annotation tools Lightweight Vertical Project
  • 55. ● Goal: Collecting knowledge for tail verticals (e.g., yoga pose, hindu deity) ● Method ○ Step 1. Decide interesting tail verticals and up to 3 sources for each vertical ○ Step 2. Have the crowd collect triples from the given sources through annotation tools ○ Step 3. Heavy curation to reach 99.9% precision Lightweight Vertical Project
  • 56. ● Challenge 1. Find interesting verticals and relevant high-quality sources ● Challenge 2: Detect errors from curation and from sources Challenges in Lightweight Verticals
  • 57. ● Challenge 1. Find interesting verticals and relevant high-quality sources ● Challenge 2: Detect errors from curation and from sources ○ Solution: Triangulate from 3 web sources ■ Evidence quality: Prec = 0.2, Rec = 0.65 ■ Fusion quality: Prec = 0.85, Rec = 0.5 Challenges in Lightweight Verticals
  • 58. ● Knowledge in 100+ tail verticals ○ 2.2M triples ○ 10K entities, ~700 predicates ○ millions of daily registered users ● Most vs. least popular tail vertical Knowledge Collected on Tail Verticals
  • 59. Contributions of Lightweight Verticals A1 A2 A3 A4 A5 A6 ... ... An UNKNOWN ATTRIBUTES E1 E2 EXISTING KNOWLEDGEE3 E4 E5 E6 ... UNKNOWN VALUES Em UNKNOWN ENTITIES
  • 60. Contributions of Lightweight Verticals A1 A2 A3 A4 A5 A6 ... ... An UNKNOWN ATTRIBUTES E1 E2 EXISTING KNOWLEDGEE3 E4 E5 E6 ... UNKNOWN VALUES Em 5K NEW ENTITIES AND ~700 ATTRIBUTES UNKNOWN ENTITIES
  • 62. Building a Product Graph at Amazon ● Goal: Build the authoritative knowledge base for every product in the world Generic KG PG
  • 63. Building a Product Graph at Amazon ● Goal: Build the authoritative knowledge base for every product in the world Generic KG PG (movie, music, book)
  • 64. Building a Product Graph at Amazon ● Goal: Build the authoritative knowledge base for every product in the world Generic KG Product Graph Movie, music, book, etc.
  • 65. Challenges in Building Product Graph I ● No major sources to curate product data from ○ Wikipedia does not help too much ○ A lot of structured data buried in text descriptions in Catalog ○ Retailers gaming with the system so even more noisy data
  • 66. Challenges in Building Product Graph II ● Large number of new products everyday ○ Curation is impossible ○ Freshness is a big challenge
  • 67. Challenges in Building Product Graph II ● Large number of product categories ○ A lot of work to manually define ontology ○ Hard to catch the trend of new product categories and properties
  • 68. ● Human-in-the-loop knowledge learning ○ Crowd-sourcing evaluation ○ Active learning ○ Close IE vs. Open IE ● Hands-off-the-wheel data integration ○ Machine learning for data quality ○ Building generic systems for data cleaning and data integration Our Solution We are HIRING!! lunadong@amazon.com