Building a semantic search system - one that can correctly parse and interpret end-user intent and return the ideal results for users’ queries - is not an easy task. It requires semantically parsing the terms, phrases, and structure within queries, disambiguating polysemous terms, correcting misspellings, expanding to conceptually synonymous or related concepts, and rewriting queries in a way that maps the correct interpretation of each end user’s query into the ideal representation of features and weights that will return the best results for that user. Not only that, but the above must often be done within the confines of a very specific domain - ripe with its own jargon and linguistic and conceptual nuances.
This talk will walk through the anatomy of a semantic search system and how each of the pieces described above fit together to deliver a final solution. We'll leverage several recently-released capabilities in Apache Solr (the Semantic Knowledge Graph, Solr Text Tagger, Statistical Phrase Identifier) and Lucidworks Fusion (query log mining, misspelling job, word2vec job, query pipelines, relevancy experiment backtesting) to show you an end-to-end working Semantic Search system that can automatically learn the nuances of any domain and deliver a substantially more relevant search experience.
%in Harare+277-882-255-28 abortion pills for sale in Harare
How to Build a Semantic Search System
1. How to Build a Semantic
Search System
Trey Grainger
SVP of Engineering, Lucidworks
@treygrainger
#Activate18 #ActivateSearch
2. Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• Georgia Tech – MBA, Management of Technology
• Furman University – BA, Computer Science, Business, & Philosophy
• Stanford University – Information Retrieval & Web Search
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Advisor to Presearch, the decentralized search engine
• Lucene / Solr contributor
About Me
8. My Three Philosophical Assertions
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical
“structured data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.
9. Assertion 1:
Unstructured data is actually
“hyper-structured” data. It is a
graph that contains much
more structure than typical
“structured data.”
10. Structured Data
Employees Table
id name company start_date
lw100 Trey
Grainger
1234 2016-02-01
dis2 Mickey
Mouse
9123 1928-11-28
tsla1 Elon Musk 5678 2003-07-01
Companies Table
id name start_date
1234 Lucidworks 2016-02-01
5678 Tesla 1928-11-28
9123 Disney 2003-07-01
Discrete
Values
Continuous
Values
Foreign
Key
11. Unstructured Data
Trey Grainger works at Lucidworks.
He is speaking at Activate 2018. #ActivateSearch
(Activate) is being held in Montreal October 15-18,
2018. Trey got his masters from Georgia Tech.
12. Trey Grainger works for Lucidworks.
He is speaking at the Activate 2018.
#ActivateSearch
(Activate) is being held in Montreal
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Unstructured Data
13. Trey Grainger works for Lucidworks.
He is speaking at the Activate 2018.
#ActivateSearch
(Activate) is being held in Montreal
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Foreign Key?
14. Trey Grainger works for Lucidworks.
He is speaking at the Activate 2018.
#ActivateSearch
(Activate) is being held in Montreal
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Fuzzy Foreign Key? (Entity Resolution)
15. Trey Grainger works for Lucidworks.
He is speaking at the Activate 2018.
#ActivateSearch
(Activate) is being held in Montreal
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Fuzzier Foreign Key? (metadata, latent features)
16. Fuzzier Foreign Key? (metadata, latent features)
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2018.
#ActivateSearch
(Activate) is being held in Montreal
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Not so fast!
17.
18.
19. Giant Graph of Relationships...
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2018.
#ActivateSearch
(Activate) is being held in Montreal
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
20. Assertion 1 (Summary):
Unstructured data is actually
“hyper-structured” data. It is a
graph that contains much
more structure than typical
“structured data.”
21. Assertion 2:
That graph is very rich, but is a
compression of meaning into a
lossy format. Much of data science
is essentially the decompression
from this lossy format into a
reconstituted form.
23. How do we easily harness this
“semantic graph” of relationships
within unstructured information?
24. Search Engines are really good at
querying across character sequences,
term sequences, and documents
Example Queries:
c?o CTO, CEO, CFO, …
"VP Engineering"~2 “VP of Engineering”,
VP Engineering” ,“Engineering VP”,
“VP of Infrastructure Engineering”
(Microsoft OR MS) AND Word “MS Word”, “Microsoft Word”
26. id: 1
job_title: Software Engineer
desc: software engineer at a
great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at
hospital doing hard work
skills: oncology, phlebotemy
id: 3
job_title: Java Developer
desc: a software engineer or a
java engineer doing work
skills: java, scala, hibernate
field doc term
desc
1
a
at
company
engineer
great
software
2
a
at
doing
hard
hospital
nurse
registered
work
3
a
doing
engineer
java
or
software
work
job_title 1
Software
Engineer
… … …
Terms-Docs Inverted IndexDocs-Terms Forward IndexDocuments
Source: Trey Grainger,
Khalifeh AlJadda,
Mohammed Korayem,
Andries Smith.“The Semantic
Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any
relationship within a domain”.
DSAA 2016.
Knowledge
Graph
field term postings
list
doc pos
desc
a
1 4
2 1
3 1, 5
at
1 3
2 4
company 1 6
doing
2 6
3 8
engineer
1 2
3 3, 7
great 1 5
hard 2 7
hospital 2 5
java 3 6
nurse 2 3
or 3 4
registered 2 2
software
1 1
3 2
work
2 10
3 9
job_title java developer 3 1
… … … …
28. DOI: 10.1109/DSAA.2016.51
Conference: 2016 IEEE International Conference on
Data Science and Advanced Analytics (DSAA)
Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
Graph Traversal
Data Structure View
Graph View
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
job_title:
Software
Engineer
job_title:
Data
Scientist
job_title:
Java
Developer
……
Inverted Index
Lookup
Forward Index
Lookup
Forward Index
Lookup
Inverted Index
Lookup
Java
Java
Developer
Hibernate
Scala
Software
Engineer
Data
Scientist
has_related_skill has_related_skill
has_related_skill
has_related_job_title
has_related_job_title
has_related_job_title
has_related_job_title
has_related_job_title
has_related_job_title
29. Search engines also do relevancy ranking
Score(q, d) =
∑ idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 – b + b · |d| / avgdl )
t in q
Where:
t = term; d = document; q = query; i = index
tf(t in d) = numTermOccurrencesInDocument ½
idf(t) = 1 + log (numDocs / (docFreq + 1))
|d| = ∑ 1
t in d
avgdl = = ( ∑ |d| ) / ( ∑ 1 ) )
d in i d in i
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency
saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document
normalization.
30. Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term scored against it’s context. The more
commonly the term appears within it’s foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords”, "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness": -0.3802 "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
+
-
Foreground Query:
"Hadoop"
Knowledge
Graph
31. Assertion 2 (Summary):
That graph is very rich, but is a
compression of meaning into a
lossy format. Much of data science
is essentially the decompression
from this lossy format into a
reconstituted form.
34. Ambiguity
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
35. Use Case: Query Disambiguation
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
36. Disambiguated meanings (represented as term vectors)
Example Related Keywords (Disambiguated Meanings)
architect 1: enterprise architect, java architect, data architect, oracle, java, .net
2: architectural designer, architectural drafter, autocad, autocad drafter, designer,
drafter, cad, engineer
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic,
photoshop, video
2: graphic, web designer, design, web design, graphic design, graphic designer
3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe,
structural designer, revit
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
37. Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
41. What does “love” mean in the context of “hug”?
http://localhost:8983/solr/thesaurus/skg
42. What does “love” mean in the context of “child”?
http://localhost:8983/solr/thesaurus/skg
43. My Three Assertions (Recap)
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical
“structured data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.
45. So why all the philosophy?
Because it’s much more important to intuitively understand the
kinds of problem we’re trying to solve with Semantic Search than to
jump head-first into the Solution.
Because otherwise we may build the wrong thing, which can
sometimes be worse than not doing anything.
And once you have an intuitive sense of the problems you need to
solve, you can confidently use the tools I’m about to describe to
build the right solution for your specific domain.
46. So what’s the end goal here?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
"machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d") AND
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
48. In 2018, Lucidworks has added the
following capabilities to Solr:
• Solr Text Tagger
• Semantic Knowledge Graph
• Statistical Phrase Identifier
49. which all integrate seamlessly
with the following in Fusion:
• Fusion Semantic Query Pipelines
• Fusion AI Synonyms Job
• Fusion AI Token & Phrase Spell Correction Job
• Fusion AI Head/Tail Analysis Job
• Fusion Phrase Identification Job
• Fusion Query Rules Engine
51. … and enables domain experts to easily
accept or adjust the built in AI… …completely deferring to Fusion’s AI, or
trusting it above a certain
confidence level, or
even manually
approving every
suggestion.
58. Semantic Query Parsing
Identification of phrases in queries using two steps:
1) Check a dictionary of known terms that is continuously built,
cleaned, and refined based upon common inputs from
interactions with real users of the system. We use the Solr Text
Tagger (already covered) for this at query time.*
2) Also invoke a probabilistic query parser to dynamically
identify unknown phrases using statistics from a corpus of data
(language model)
3) Final algorithm to choose the best merge when the two
approaches disagree.
*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation
through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
59. Probabilistic Query Parsing
Goal: given a query, predict which
combinations of keywords should be
combined together as phrases
Example:
senior java developer hadoop
Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop”
"senior java", "developer hadoop”
senior, "java developer", hadoop
senior, java, "developer hadoop" Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization,
and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.