To optimally interpret most natural language queries, its important to understand a highly-nuanced, contextual interpretation of the domain-specific phrases, entities, commands, and relationships represented or implied within the search and within your domain.
In this talk, we'll walk through such a search system powered by Solr's Text Tagger and Semantic Knowledge graph. We'll have fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "best bbq near activate" into:
{!func}mul(min(popularity,1),100) bbq^0.91032 ribs^0.65674 brisket^0.63386 doc_type:"restaurant" {!geofilt d=50 sfield="coordinates_pt" pt="38.916120,-77.045220"}
We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding like this within your search engine.
2. Trey Grainger
Chief Algorithms Officer
• Previously: SVP of Engineering @ Lucidworks; Director of Engineering @ CareerBuilder
• Georgia Tech – MBA, Management of Technology
• Furman University – BA, Computer Science, Business, & Philosophy
• Stanford University – Information Retrieval & Web Search
Other fun projects:
• Co-author of Solr in Action, plus numerous research publications
• Advisor to Presearch, the decentralized search engine
• Lucene / Solr contributor
About Me
4. Agenda
• About Lucidworks
• What is AI-powered Search?
• What is Natural Language Search?
• What is a Knowledge Graph (and related terminology)?
• Philosophy of Language (enough to get the approach…)
• Semantic Knowledge Graphs
• Knowledge Graph Goals for Natural Language Search
• Relevant Tools
• Solr Text Tagger
• Solr’s Semantic Knowledge Graph
• Automated Graph Generation
• Demos!
5. The Search & AI Conference
COMPANY BEHIND
Who are we?
300+ CUSTOMERS ACROSS THE
FORTUNE 1000
400+EMPLOYEES
OFFICES IN
San Francisco, CA (HQ)
Raleigh-Durham, NC
Cambridge, UK
Bangalore, India
Hong Kong
Employ about
40% of the active
committers on
the Solr project
40%
Contribute over
70% of Solr's open
source codebase
70%
DEVELOP & SUPPORT
Apache
15. Proudly built with open-source
tech at its core: Apache Solr &
Apache Spark
Personalizes search
with applied
machine learning
Proven on the
world’s biggest
information systems
22. What is a Knowledge Graph?
(vs. Ontology vs. Taxonomy vs. Synonyms, etc.)
23.
24. Overly Simplistic Definitions
Alternative Labels: Substitute words with identical meanings
[ CTO => Chief Technology Officer; specialise => specialize ]
Synonyms List: Provides substitute words that can be used to represent
the same or very similar things
[ human => homo sapien, mankind; food => sustenance, meal ]
Taxonomy: Classifies things into Categories
[ john is Human; Human is Mammal; Mammal is Animal ]
Ontology: Defines relationships between types of things
[ animal eats food; human is animal ]
Knowledge Graph: Instantiation of an
Ontology (contains the things that are related)
[ john is human; john eats food ]
In practice, there is significant overlap…
25.
26. What sort of Knowledge Graph can
help us with the kinds of problems we
encounter in Search use cases?
29. But… unstructured data is really
more like “hyper-structured”
data. It is a graph that contains
much more structure than typical
“structured data.”
30. Structured Data
Employees Table
id name company start_date
lw100 Trey
Grainger
1234 2016-02-01
dis2 Mickey
Mouse
9123 1928-11-28
tsla1 Elon Musk 5678 2003-07-01
Companies Table
id name start_date
1234 Lucidworks 2016-02-01
5678 Tesla 1928-11-28
9123 Disney 2003-07-01
Discrete
Values
Continuous
Values
Foreign
Key
31. Unstructured Data
Trey Grainger works at Lucidworks.
He is speaking at the Activate 2019 conference.
#Activate19 (Activate) is being held in
Washington, DC September 9-12, 2019. Trey got
his masters from Georgia Tech.
32. Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Unstructured Data
33. Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Foreign Key?
34. Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Fuzzy Foreign Key? (Entity Resolution)
35. Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Fuzzier Foreign Key? (metadata, latent features)
36. Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Fuzzier Foreign Key? (metadata, latent features)
Not so fast!
37.
38.
39. Giant Graph of Relationships...
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
40. How do we easily harness this
“semantic graph” of relationships
within unstructured information?
42. id: 1
job_title: Software Engineer
desc: software engineer at a
great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at
hospital doing hard work
skills: oncology, phlebotemy
id: 3
job_title: Java Developer
desc: a software engineer or a
java engineer doing work
skills: java, scala, hibernate
field doc term
desc
1
a
at
company
engineer
great
software
2
a
at
doing
hard
hospital
nurse
registered
work
3
a
doing
engineer
java
or
software
work
job_title 1
Software
Engineer
… … …
Terms-Docs Inverted IndexDocs-Terms Forward IndexDocuments
Source: Trey Grainger,
Khalifeh AlJadda,
Mohammed Korayem,
Andries Smith.“The Semantic
Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any
relationship within a domain”.
DSAA 2016.
Knowledge
Graph
field term postings
list
doc pos
desc
a
1 4
2 1
3 1, 5
at
1 3
2 4
company 1 6
doing
2 6
3 8
engineer
1 2
3 3, 7
great 1 5
hard 2 7
hospital 2 5
java 3 6
nurse 2 3
or 3 4
registered 2 2
software
1 1
3 2
work
2 10
3 9
job_title java developer 3 1
… … … …
43. Serves as a “data science toolkit” API that allows dynamically navigating and pivoting through multiple levels of
relationships between items in a domain.
Semantic Knowledge Graph API
Core similarity engine, exposed via API
Any product can leverage the core relationship scoring
engine to score any list of entities against any other
list
Full domain support
Keywords, categories, tags, based upon any field on your
documents. Graph is build automatically from the
content representing your domain.
Intersections, overlaps, & relationship
scoring, many levels deep
Users can either provide a list of items to score, or else
have the system dynamically discover the most related
items (or both).
Knowledge
Graph
44. Source: Trey Grainger,
Khalifeh AlJadda,
Mohammed Korayem,
Andries Smith.“The
Semantic Knowledge
Graph: A compact, auto-
generated model for real-
time traversal and ranking of
any relationship within a
domain”. DSAA 2016.
Knowledge
Graph
Set-theory View
Graph View
How the Graph Traversal Works
skill:
Java
skill:
Scala
skill:
Hibernate
skill:
Oncology
has_related_skill
has_related_skill
has_related_skill
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill:
Scala
skill:
Hibernate
skill:
Oncology
Data Structure View
Java
Scala Hibernate
docs
1, 2, 6
docs
3, 4
Oncology
doc 5
45. DOI: 10.1109/DSAA.2016.51
Conference: 2016 IEEE International Conference on
Data Science and Advanced Analytics (DSAA)
Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
Graph Traversal
Data Structure View
Graph View
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
job_title:
Software
Engineer
job_title:
Data
Scientist
job_title:
Java
Developer
……
Inverted Index
Lookup
Forward Index
Lookup
Forward Index
Lookup
Inverted Index
Lookup
Java
Java
Developer
Hibernate
Scala
Software
Engineer
Data
Scientist
has_related_skill has_related_skill
has_related_skill
has_related_job_title
has_related_job_title
has_related_job_title
has_related_job_title
has_related_job_title
has_related_job_title
46. Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term scored against it’s context. The more
commonly the term appears within it’s foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords”, "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness": -0.3802 "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
+
-
Foreground Query:
"Hadoop"
Knowledge
Graph
47. Related term vector (for query concept expansion)
http://localhost:8983/solr/stack-exchange-health/skg
52. Use Case: Query Disambiguation
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
53. Use Case: Query Disambiguation
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
59. What does “love” mean in the context of “hug”?
http://localhost:8983/solr/thesaurus/skg
"embrace"
60. What does “love” mean in the context of “child”?
http://localhost:8983/solr/thesaurus/skg
61. So what’s the end goal here?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
"machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d") AND
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
63. I talked about integrating these last year:
See my Activate 2018 talk on
“How to Build a Semantic Search System”
For details on extended Lucidworks Fusion capabilities.
64. For Today’s talk, I’ll only focus on these two
• Solr Text Tagger
• Semantic Knowledge Graph
65.
66.
67. Graph Query Parser
• Query-time, cyclic aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion
of root and/or leaves
• Limitations: distributed queries only traverse intra-shard docs
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&
q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge
traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
69. Demo Data
Places (also includes geonames database)
Entities (includes search commands)
Text Content
[ Web crawl of restaurant and product reviews sites ]
75. Named Entity Recognition (NER)
NER translates…
Barack Obama was the president of the United States of America. Before that, Obama was a
senator.
into…
<person id="barack_obama">Barack Obama</person> was the <role>president</role> of the
<country id="usa">United States of America</country>. Before that, <person
id="barack_obama">Obama</person> was a <role>senator</role>.
In Solr, this would become:
text: Barack Obama was the president of the United States of America. Before that, Obama was a
senator.
person: Barack Obama
country: United States of America
role: [ president, senator ]
79. popular barbeque near Activate
(popular same as "good", "top", "best")
Hotels near Activate
hotels near popular BBQ in Washington DC
BBQ near airports near Activate
hotels near movie theaters in Washington DC …
And that’s really just the beginning!