SlideShare a Scribd company logo
1 of 106
Download to read offline
Trey Grainger
Chief Algorithms Officer
Thought Vectors and Knowledge Graphs
in AI-powered Search
October 15, 2019
Trey Grainger
Chief Algorithms Officer
• Previously: SVP of Engineering @ Lucidworks; Director of Engineering @ CareerBuilder
• Georgia Tech – MBA, Management of Technology
• Furman University – BA, Computer Science, Business, & Philosophy
• Stanford University – Information Retrieval & Web Search
Other fun projects:
• Co-author of Solr in Action, plus numerous research publications
• Advisor to Presearch, the decentralized search engine
• Lucene / Solr contributor
About Me
http://aiPoweredSearch.com
... is my new book!
• About Lucidworks
• What is AI-powered Search?
• What are Thought Vectors?
• Vector Search in Apache Solr
• What is a Knowledge Graph (and related terminology)?
• Philosophy of Language
• Semantic Knowledge Graphs
• Implementing Knowledge Graph-based Search
• Solr Text Tagger
• Solr’s Semantic Knowledge Graph
• Demos!
Agenda
The Search & AI Conference
COMPANY BEHIND
Who are we?
300+ CUSTOMERS ACROSS THE
FORTUNE 1000
400+EMPLOYEES
OFFICES IN
San Francisco, CA (HQ)
Raleigh-Durham, NC
Cambridge, UK
Bangalore, India
Hong Kong
Employ about
40% of the active
committers on
the Solr project
40%
Contribute over
70% of Solr's open
source codebase
70%
DEVELOP & SUPPORT
Apache
Industry’s most powerful
Intelligent Search & Discovery Platform.
Proudly built with open-source
tech at its core: Apache Solr &
Apache Spark
Personalizes search
with applied
machine learning
Proven on the
world’s biggest
information systems
What is
?
AI?
Machine Learning?
Data Science?
Neural Search?
AI-Powered Search?
Deep Learning?
AI-powered Search
AI-powered Search
Question / Answer
Systems
Virtual Assistants
• Signals Boosting Models
• Learning to Rank
• Semantic Search
• Collaborative Filtering
• Personalized Search
• Content Clustering
• NLP / Entity Resolution
• Semantic Knowledge Graphs
• Document Classification
• etc.
• Neural Search
• Word Embeddings
• Vector Search
• Image / Voice Search
• etc.
• Question / Answer Systems
• Virtual Assistants
• Chatbots
• Rules-based Relevancy
• etc.
Tonight, we’ll touch only on these…
Question / Answer
Systems
Virtual Assistants
• Signals Boosting Models
• Learning to Rank
• Semantic Search
• Collaborative Filtering
• Personalized Search
• Content Clustering
• NLP / Entity Resolution
• Semantic Knowledge Graphs
• Document Classification
• etc.
• Neural Search
• Word Embeddings
• Vector Search
• Image / Voice Search
• etc.
• Question / Answer Systems
• Virtual Assistants
• Chatbots
• Rules-based Relevancy
• etc.
AI-powered Search
Search Engine Basics
Term Documents
a doc1 [2x]
brown doc3 [1x] , doc5 [1x]
cat doc4 [1x]
cow doc2 [1x] , doc5 [1x]
… ...
once doc1 [1x], doc5 [1x]
over doc2 [1x], doc3 [1x]
the doc2 [2x], doc3 [2x],
doc4[2x], doc5 [1x]
… …
Document Content Field
doc1 once upon a time, in a land far,
far away
doc2 the cow jumped over the moon.
doc3 the quick brown fox jumped
over the lazy dog.
doc4 the cat in the hat
doc5 The brown cow said “moo”
once.
… …
What you SEND to Lucene/Solr:
How the content is INDEXED into
Lucene/Solr (conceptually):
An inverted index (“how a search engine works”)
/solr/collection/select/?q=apache solr
Term Documents
… …
apache
doc1, doc3, doc4,
doc5
…
hadoop doc2, doc4, doc6
… …
solr
doc1, doc3, doc4,
doc7, doc8
… …
doc5
doc7 doc8
doc1 doc3
doc4
solr
apache
apache solr
Matching queries to documents
BM25 (Relevance Scoring between Query and Documents)
Score(q, d) =
∑ idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 – b + b · |d| / avgdl )
t in q
Where:
t = term; d = document; q = query; i = index
tf(t in d) = numTermOccurrencesInDocument ½
idf(t) = 1 + log (numDocs / (docFreq + 1))
|d| = ∑ 1
t in d
avgdl = = ( ∑ |d| ) / ( ∑ 1 ) )
d in i d in i
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.
The standard for
enterprise
search.
of Fortune 500
uses Solr.
90%
Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more
*source: Solr in Action, chapter 2
What is a Thought Vector?
Sentence Embeddings:
[ 2, 3, 2, 4, 2, 1, 5, 3 ]
[ 5, 3, 2, 3, 4, 0, 3, 4 ]
. . .
Document Embedding:
[ 4, 1, 4, 2, 1, 2, 4, 3 ]
Word Embeddings:
[ 5, 1, 3, 4, 2, 1, 5, 3 ]
[ 4, 1, 3, 0, 1, 1, 4, 2 ]
. . .
Paragraph Embeddings:
[ 5, 1, 4, 1, 0, 2, 4, 0 ]
[ 1, 1, 4, 2, 1, 0, 0, 0 ]
. . .
Thought Vectors
apple caffeine cheese coffee drink donut food juice pizza tea water
cappuccino 0 0 0 0 0 0 0 0 0 0 0
apple 1 0 0 0 0 0 0 0 0 0 0
juice 0 0 0 0 0 0 0 1 0 0 0
cheese 0 0 1 0 0 0 0 0 0 0 0
pizza 0 0 0 0 0 0 0 0 1 0 0
donut 0 0 0 0 0 1 0 0 0 0 0
green 0 0 0 0 0 0 0 0 0 0 0
tea 0 0 0 0 0 0 0 0 0 1 0
bread 0 0 1 0 0 0 0 0 0 0 0
sticks 0 0 0 0 0 0 0 0 0 0 0
exact term lookup in inverted indexquery
Single Term Searches (as a Vector)
Combined Vector
query
Multi-term Query Vectors
juice 0 0 0 0 0 0 0 1 0 0 0
apple 1 0 0 0 0 0 0 0 0 0 0
+
apple juice 1 0 0 0 0 0 0 1 0 0 0
apple caffeine cheese coffee drink donut food juice pizza tea water
latte 0 0 0 0 0 0 0 0 0 0 0
cappuccino 0 0 0 0 0 0 0 0 0 0 0
apple juice 1 0 0 0 0 0 0 1 0 0 0
cheese pizza 0 0 1 0 0 0 0 0 1 0 0
donut 0 0 0 0 0 1 0 0 0 0 0
soda 0 0 0 0 0 0 0 0 0 0 0
green tea 0 0 0 0 0 0 0 0 0 1 0
water 0 0 0 0 0 0 0 0 0 0 1
cheese bread
sticks
0 0 1 0 0 0 0 0 0 0 0
cinnamon sticks 0 0 0 0 0 0 0 0 0 0 0
exact term lookup in inverted indexquery
Multi-term Searches
food drink dairy bread caffeine sweet calories healthy
apple juice 0 5 0 0 0 4 4 3
cappuccino 0 5 3 0 4 1 2 3
cheese bread
sticks
5 0 4 5 0 1 4 2
cheese pizza 5 0 4 4 0 1 5 2
cinnamon
bread sticks
5 0 1 5 0 3 4 2
donut 5 0 1 5 0 4 5 1
green tea 0 5 0 0 2 1 1 5
latte 0 5 4 0 4 1 3 3
soda 0 5 0 0 3 5 5 0
water 0 5 0 0 0 0 0 5
Dimensionality Reduction
Phrase: Vector:
apple juice: [ 0, 5, 0, 0, 0, 4, 4, 3 ]
cappuccino: [ 0, 5, 3, 0, 4, 1, 2, 3 ]
cheese bread sticks: [ 5, 0, 4, 5, 0, 1, 4, 2 ]
cheese pizza: [ 5, 0, 4, 4, 0, 1, 5, 2 ]
cinnamon bread sticks: [ 5, 0, 4, 5, 0, 1, 4, 2 ]
donut: [ 5, 0, 1, 5, 0, 4, 5, 1 ]
green tea: [ 0, 5, 0, 0, 2, 1, 1, 5 ]
latte: [ 0, 5, 4, 0, 4, 1, 3, 3 ]
soda: [ 0, 5, 0, 0, 3, 5, 5, 0 ]
water: [ 0, 5, 0, 0, 0, 0, 0, 5 ]
Ranked Results: Green Tea
0.94 water
0.85 cappuccino
0.80 latte
0.78 apple juice
0.60 soda
… …
0.19 donut
Vector Similarity Scores:
Vector Similarity (a, b):
cos(θ) = a · b
|a| × |b|
Ranked Results: Cheese Pizza
0.99 cheese bread sticks
0.91 cinnamon bread sticks
0.89 donut
0.47 latte
0.46 apple juice
… …
0.19 water
Vector Similarity Scoring
Vector Similarity Scores:
Performance Considerations
Problem: Vector Scoring is Slow
• Unlike keyword search, which looks up pre-indexed answers to queries, Vector Search must instead calculate
similarities between the query vector and every document’s vectors to determine best matches, which is
slow at scale.
Solution: Quantized Vectors
• “Quantization” is the process for mapping vectors features to discrete values.
• Creating “tokens” which map to a similar vector space, enables matching on those tokens to perform an ANN
(Approximate Nearest Neighbor) search
• This enables converting vector scoring into a search problem (term lookup and scoring), which is fast again,
at the expense of some recall and scoring accuracy
Recommended Approach: Quantized Vector Search + Vector Similarity Reranking
• Combine the best of both worlds by running an initial ANN search on a quantized vector representation, and
then re-rank the top-N results using full Vector similarity scoring.
ANN Benchmarks
(Approximate Nearest Neighbor)
Implementation Options
Option 1: Streaming Expressions
curl -X POST -H "Content-Type: application/json" 
http://localhost:8983/solr/food/update?commit=true 
--data-binary ' [
{"id": "1", "name_s":"donut", "vector_fs":[5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0]},
{"id": "2", "name_s":"apple juice",
"vector_fs":[1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0]},
{"id": "3", "name_s":"cappuccino",
"vector_fs":[0.0,5.0,3.0,0.0,4.0,1.0,2.0,3.0]},
{"id": "4", "name_s":"cheese pizza",
"vector_fs":[5.0,0.0,4.0,4.0,0.0,1.0,5.0,2.0]},
{"id": "5", "name_s":"green tea",
"vector_fs":[0.0,5.0,0.0,0.0,2.0,1.0,1.0,5.0]},
{"id": "6", "name_s":"latte", "vector_fs":[0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0]},
{"id": "7", "name_s":"soda", "vector_fs":[0.0,5.0,0.0,0.0,3.0,5.0,5.0,0.0]},
{"id": "8", "name_s":"cheese bread sticks",
"vector_fs":[5.0,0.0,4.0,5.0,0.0,1.0,4.0,2.0]},
{"id": "9", "name_s":"water", "vector_fs":[0.0,5.0,0.0,0.0,0.0,0.0,0.0,5.0]},
{"id": "10", "name_s":"cinnamon bread sticks",
"vector_fs":[5.0,0.0,1.0,5.0,0.0,3.0,4.0,2.0]}
] '
Send Documents to Solr:
Streaming Expressions
Option 2:
Streaming Expressions Query Parser
http://localhost:8983/solr/food/select?q=*:*&fl=id,name_s&
fq={!streaming_expression}top(
select(
search(food, q="*:*", fl="id,vector_fs", sort="id asc"),
cosineSimilarity(vector_fs, array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as cos, id),
n=5, sort="cos desc”
)
{ "responseHeader":{
… },
"response":{"numFound":5,"start":0,"docs":[
{ "name_s":"donut", "id":"1"},
{ "name_s":"apple juice", "id":"2"},
{ "name_s":"cheese pizza", "id":"4"},
{ "name_s":"cheese bread sticks", "id":"8"},
{ "name_s":"cinnamon bread sticks", "id":"10"}]
}}
Request:
Response:
Streaming Expressions Query Parser
Option 3:
Solr Vector Scoring Plugin
Send Documents to Solr:
curl -X POST -H "Content-Type: application/json"
http://localhost:8983/solr/{your-collection-name}/update?commit=true --
data-binary ‘
[
{"name":"example 0", "vector":"0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33"},
{"name":"example 1", "vector":"0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25"},
{"name":"example 2", "vector":"0|1.11 1|0.6 2|1.47 3|1.99 4|2.91 5|1.01"},
{"name":"example 3", "vector":"0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9"},
{"name":"example 4", "vector":"0|4.01 1|3.69 2|2 3|4.36 4|1.09 5|0.1"},
{"name":"example 5", "vector":"0|0.64 1|3.95 2|1.03 3|1.65 4|0.99 5|0.09"}
]'
Solr Vector Scoring Plugin
Request:
Response:
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector
vector="0.1,4.75,0.3,1.2,0.7,4.0”
}
{ "responseHeader":{ "status":0, "QTime":1}},
"response":{ "numFound":6,"start":0,"maxScore":0.99984086,
"docs":[
{ "name":["example 3"], "vector":["0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 "],
"score":0.99984086},
{ "name":["example 0"], "vector":["0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "], "score":0.7693964},
{ "name":["example 5"], "vector":["0|0.64 1|3.95 2|1.03 3|1.65 4|0.99 5|0.09 "], "score":0.76322395},
{ "name":["example 4"], "vector":["0|4.01 1|3.69 2|2 3|4.36 4|1.09 5|0.1 "], "score":0.5328145},
{ "name":["example 1"], "vector":["0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25 "], "score":0.48513117},
{ "name":["example 2"], "vector":["0|1.11 1|0.6 2|1.47 3|1.99 4|2.91 5|1.01 "], "score":0.44909418}]
}}
Solr Vector Scoring Plugin
Option 4:
Solr Vector Scoring + LSH Plugin
Send Documents to Solr:
Solr Vector Scoring + LSH Plugin
curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/{your-collection-
name}/update?update.chain=LSH&commit=true --data-binary ‘
[
{"id":"1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33"},
{"id":"2", "vector":"3.54,0.4,4.16,4.88,4.28,4.25"}
]'
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector
vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true"
reRankDocs="5"}&fl=name,score,vector,_vector_,_lsh_hash_
Request:
Response:
Solr Vector Scoring + LSH Plugin
{
"responseHeader":{ "status":0, "QTime":8, "response":{"numFound":1,"start":0,"maxScore":36.65736,
"docs":[
{ "id": "1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33",
"_vector_":"/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==",
"_lsh_hash_":["0_8", "1_35", "2_7", "3_10", "4_2", "5_35", "6_16", "7_30", "8_27", "9_12", "10_7",
"11_32", "12_48", "13_36", "14_10", "15_7", "16_42", "17_5", "18_3", "19_2", "20_1",
"21_0", "22_24", "23_18", "24_42", "25_31", "26_35", "27_8", "28_1", "29_24", "30_47",
"31_14", "32_22", "33_39", "34_0", "35_34", "36_34", "37_39", "38_27", "39_27",
"40_45", "41_10", "42_21", "43_34", "44_41", "45_9", "46_31", "47_0", "48_4", "49_43"],
"score":36.65736}
] } }
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector
vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true"
reRankDocs="5"}&fl=name,score,vector,_vector_,_lsh_hash_
Request:
Option 5 (Work in Progress):
First-class Vector Fields in Lucene/Solr
Now In Progress
Vector Encoders
• Take queries, documents, sentences, paragraphs, etc. and
transform them into vectors.
• Usually leverage deep learning, which can discover rich language
usage rules and map them to combinations of features in the
vector
• Popular Libraries:
• Bert
• Elmo
• Universal Sentence Encoder
• Word2Ve
• Sentence2Vec
• Glove
• fastText
Vector Encoders
Query Type Likely Outcome
Obscure keyword combinations
Q. (software OR hardware) AND enginee*
• Keyword search succeeds
• Vector Search fails
Natural Language Queries
Q. Can my wife drive on my insurance?
• Keyword search might get
lucky, but probably fails
• Vector Search succeeds
Fuzzy Language Queries
Q. famous french tower
• Keyword search mismatch
yields poor results
• Vector Search succeeds
Structured Relationship Queries
Q. popular barbeque near Activate
• Keyword search fails
• Vector search fails
• Need a Knowledge Graph!
Keyword Search vs. Vector Search
What is a Knowledge Graph?
(vs. Ontology vs. Taxonomy vs. Synonyms, etc.)
Overly Simplistic Definitions
Alternative Labels: Substitute words with identical meanings
[ CTO => Chief Technology Officer; specialise => specialize ]
Synonyms List: Provides substitute words that can be used to represent
the same or very similar things
[ human => homo sapien, mankind; food => sustenance, meal ]
Taxonomy: Classifies things into Categories
[ john is Human; Human is Mammal; Mammal is Animal ]
Ontology: Defines relationships between types of things
[ animal eats food; human is animal ]
Knowledge Graph: Instantiation of an
Ontology (contains the things that are related)
[ john is human; john eats food ]
In practice, there is significant overlap…
What sort of Knowledge Graph can
help us with the kinds of problems we
encounter in Search use cases?
most often used in
reference to
“free text”
But… unstructured data is really
more like “hyper-structured”
data. It is a graph that contains
much more structure than typical
“structured data.”
Structured Data
Employees Table
id name company start_date
lw100 Trey
Grainger
1234 2016-02-01
dis2 Mickey
Mouse
9123 1928-11-28
tsla1 Elon Musk 5678 2003-07-01
Companies Table
id name start_date
1234 Lucidworks 2016-02-01
5678 Tesla 1928-11-28
9123 Disney 2003-07-01
Discrete
Values
Continuous
Values
Foreign
Key
Unstructured Data
Trey Grainger works at Lucidworks.
He is speaking at the Activate 2019 conference.
#Activate19 (Activate) is being held in
Washington, DC September 9-12, 2019. Trey got
his masters from Georgia Tech.
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Unstructured Data
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Foreign Key?
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Fuzzy Foreign Key? (Entity Resolution)
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Fuzzier Foreign Key? (metadata, latent features)
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Fuzzier Foreign Key? (metadata, latent features)
Not so fast!
Giant Graph of Relationships...
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2019
conference.
#Activate19
(Activate) is being held in Washington, DC
September 9-12, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
How do we easily harness this
“semantic graph” of relationships
within unstructured information?
Semantic Knowledge Graph
id: 1
job_title: Software Engineer
desc: software engineer at a
great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at
hospital doing hard work
skills: oncology, phlebotemy
id: 3
job_title: Java Developer
desc: a software engineer or a
java engineer doing work
skills: java, scala, hibernate
field doc term
desc
1
a
at
company
engineer
great
software
2
a
at
doing
hard
hospital
nurse
registered
work
3
a
doing
engineer
java
or
software
work
job_title 1
Software
Engineer
… … …
Terms-Docs Inverted IndexDocs-Terms Forward IndexDocuments
Source: Trey Grainger,
Khalifeh AlJadda,
Mohammed Korayem,
Andries Smith.“The Semantic
Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any
relationship within a domain”.
DSAA 2016.
Knowledge
Graph
field term postings
list
doc pos
desc
a
1 4
2 1
3 1, 5
at
1 3
2 4
company 1 6
doing
2 6
3 8
engineer
1 2
3 3, 7
great 1 5
hard 2 7
hospital 2 5
java 3 6
nurse 2 3
or 3 4
registered 2 2
software
1 1
3 2
work
2 10
3 9
job_title java developer 3 1
… … … …
Source: Trey Grainger,
Khalifeh AlJadda,
Mohammed Korayem,
Andries Smith.“The
Semantic Knowledge
Graph: A compact, auto-
generated model for real-
time traversal and ranking of
any relationship within a
domain”. DSAA 2016.
Knowledge
Graph
Set-theory View
Graph View
How the Graph Traversal Works
skill:
Java
skill:
Scala
skill:
Hibernate
skill:
Oncology
has_related_skill
has_related_skill
has_related_skill
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill:
Scala
skill:
Hibernate
skill:
Oncology
Data Structure View
Java
Scala Hibernate
docs
1, 2, 6
docs
3, 4
Oncology
doc 5
DOI: 10.1109/DSAA.2016.51
Conference: 2016 IEEE International Conference on
Data Science and Advanced Analytics (DSAA)
Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
Graph Traversal
Data Structure View
Graph View
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
job_title:
Software
Engineer
job_title:
Data
Scientist
job_title:
Java
Developer
……
Inverted Index
Lookup
Forward Index
Lookup
Forward Index
Lookup
Inverted Index
Lookup
Java
Java
Developer
Hibernate
Scala
Software
Engineer
Data
Scientist
has_related_skill has_related_skill
has_related_skill
has_related_job_title
has_related_job_title
has_related_job_title
has_related_job_title
has_related_job_title
has_related_job_title
Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term scored against it’s context. The more
commonly the term appears within it’s foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords”, "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness": -0.3802 "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
+
-
Foreground Query:
"Hadoop"
Knowledge
Graph
Related term vector (for query concept expansion)
http://localhost:8983/solr/stack-exchange-health/skg
Content-based Recommendations (More Like This on Steroids)
http://localhost:8983/solr/job-postings/skg
Who’s in Love with Jean Grey?
Differentiating related terms
Misspellings: managr => manager
Synonyms: cpa => certified public accountant
rn => registered nurse
r.n. => registered nurse
Ambiguous Terms*: driver => driver (trucking) ~80% likelihood
driver => driver (software) ~20% likelihood
Related Terms: r.n. => nursing, bsn
hadoop => mapreduce, hive, pig
*differentiated based upon user and query context
Thought Exercise
What do you think of when I say the
word “driver”?
What about “architect”?
Use Case: Query Disambiguation
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
Use Case: Query Disambiguation
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
A few methodologies:
1) Query Log Mining
2) Semantic Knowledge Graph
Knowledge Graph
Semantic Knowledge Graph: Discovering ambiguous phrases
1) Exact same concept, but use
a document classification
field (i.e. category) as the first
level of your graph, and the
related terms as the second
level to which you traverse.
2) Has the benefit that you don’t need query logs to mine, but it will be representative
of your data, as opposed to your user’s intent, so the quality depends on how clean
and representative your documents are.
Additional Benefit: Multi-dimensional disambiguation and dynamic materialization of
categories. Effectively an dynamically-materialized probabilistic graphical model
Disambiguation by Category Example
Meaning 1: Restaurant => bbq, brisket, ribs, pork, …
Meaning 2: Outdoor Equipment => bbq, grill, charcoal, propane, …
Disambiguated meanings (represented as term vectors)
Example Related Keywords (Disambiguated Meanings)
architect 1: enterprise architect, java architect, data architect, oracle, java, .net
2: architectural designer, architectural drafter, autocad, autocad drafter, designer,
drafter, cad, engineer
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic,
photoshop, video
2: graphic, web designer, design, web design, graphic design, graphic designer
3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe,
structural designer, revit
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Every term or phrase is a
Context-dependent cluster of
meaning with an ambiguous label
What does “love” mean?
http://localhost:8983/solr/thesaurus/skg
What does “love” mean in the context of “hug”?
http://localhost:8983/solr/thesaurus/skg
"embrace"
What does “love” mean in the context of “child”?
http://localhost:8983/solr/thesaurus/skg
So what’s the end goal here?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
"machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d") AND
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
Semantic Search Components:
• Apache Solr
• Solr Text Tagger
• Semantic Knowledge Graph
• Statistical Phrase Identifier
• Fusion Semantic Query Pipelines
• Fusion AI Synonyms Job
• Fusion AI Token & Phrase Spell Correction Job
• Fusion AI Head/Tail Analysis Job
• Fusion AI Phrase Identification Job
• Fusion Query Rules Engine
See my Activate 2018 talk on
“How to Build a Semantic Search System”
For details on extended Lucidworks Fusion capabilities.
For Today’s talk, I’ll only focus on these two
• Solr Text Tagger
• Semantic Knowledge Graph
Graph Query Parser
• Query-time, cyclic aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion
of root and/or leaves
• Limitations: distributed queries only traverse intra-shard docs
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&
q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge
traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
Example Query:
Demo!
Demo Data
Places (also includes geonames database)
Entities (includes search commands)
Text Content
[ Web crawl of restaurant and product reviews sites ]
Find Location (Graph Query)
http://localhost:8983/solr/POI/select
Graph Traversal converted to Facet
http://localhost:8983/solr/POI/select
For Remaining keywords, find doc type + related terms
http://localhost:8983/solr/POI/select
Full Knowledge Graph Traversal in Single Request!
popular barbeque near Activate
(popular same as "good", "top", "best")
Hotels near Activate
hotels near popular BBQ in Boston
BBQ near airports near Activate
hotels near movie theaters in Boston …
And that’s really just the beginning!
But it’s unfortunately also the end
of our presentation for today : (
We operationalize AI for the
largest businesses on the planet.
Questions?
Trey Grainger
trey@lucidworks.com
@treygrainger
Other presentations:
http://www.treygrainger.com
40% Discount code: mtpapache19
http://aiPoweredSearch.com
http://solrinaction.com
Books:
Thank You!

More Related Content

What's hot

Coronavirus and Future of SEO: Digital Marketing and Remote Culture
Coronavirus and Future of SEO: Digital Marketing and Remote CultureCoronavirus and Future of SEO: Digital Marketing and Remote Culture
Coronavirus and Future of SEO: Digital Marketing and Remote CultureKoray Tugberk GUBUR
 
Search Query Processing: The Secret Life of Queries, Parsing, Rewriting & SEO
Search Query Processing: The Secret Life of Queries, Parsing, Rewriting & SEOSearch Query Processing: The Secret Life of Queries, Parsing, Rewriting & SEO
Search Query Processing: The Secret Life of Queries, Parsing, Rewriting & SEOKoray Tugberk GUBUR
 
Automating Google Lighthouse
Automating Google LighthouseAutomating Google Lighthouse
Automating Google LighthouseHamlet Batista
 
SEO Case Study - Hangikredi.com From 12 March to 24 September Core Update
SEO Case Study - Hangikredi.com From 12 March to 24 September Core UpdateSEO Case Study - Hangikredi.com From 12 March to 24 September Core Update
SEO Case Study - Hangikredi.com From 12 March to 24 September Core UpdateKoray Tugberk GUBUR
 
BrightonSEO March 2021 | Dan Taylor, Image Entity Tags
BrightonSEO March 2021 | Dan Taylor, Image Entity TagsBrightonSEO March 2021 | Dan Taylor, Image Entity Tags
BrightonSEO March 2021 | Dan Taylor, Image Entity TagsDan Taylor
 
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...Dawn Anderson MSc DigM
 
Antifragility in Digital Marketing
Antifragility in Digital MarketingAntifragility in Digital Marketing
Antifragility in Digital MarketingElias Dabbas
 
AI-powered Semantic SEO by Koray GUBUR
AI-powered Semantic SEO by Koray GUBURAI-powered Semantic SEO by Koray GUBUR
AI-powered Semantic SEO by Koray GUBURAnton Shulke
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Trey Grainger
 
Opinion-based Article Ranking for Information Retrieval Systems: Factoids and...
Opinion-based Article Ranking for Information Retrieval Systems: Factoids and...Opinion-based Article Ranking for Information Retrieval Systems: Factoids and...
Opinion-based Article Ranking for Information Retrieval Systems: Factoids and...Koray Tugberk GUBUR
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflowDatabricks
 
Lexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEOLexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEOKoray Tugberk GUBUR
 
How to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With PythonHow to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With Pythonsearchsolved
 
Semantic Search Engine: Semantic Search and Query Parsing with Phrases and En...
Semantic Search Engine: Semantic Search and Query Parsing with Phrases and En...Semantic Search Engine: Semantic Search and Query Parsing with Phrases and En...
Semantic Search Engine: Semantic Search and Query Parsing with Phrases and En...Koray Tugberk GUBUR
 
Disambiguating Equiprobability in SEO Dawn Anderson Friends of Search 2020
Disambiguating Equiprobability in SEO Dawn Anderson Friends of Search 2020Disambiguating Equiprobability in SEO Dawn Anderson Friends of Search 2020
Disambiguating Equiprobability in SEO Dawn Anderson Friends of Search 2020Dawn Anderson MSc DigM
 
Slawski New Approaches for Structured Data:Evolution of Question Answering
Slawski   New Approaches for Structured Data:Evolution of Question Answering Slawski   New Approaches for Structured Data:Evolution of Question Answering
Slawski New Approaches for Structured Data:Evolution of Question Answering Bill Slawski
 
Passage indexing is likely more important than you think
Passage indexing is likely more important than you thinkPassage indexing is likely more important than you think
Passage indexing is likely more important than you thinkDawn Anderson MSc DigM
 
Discover, pa’ tipos como tú: Los 13 factores para disparar tu tráfico
Discover, pa’ tipos como tú: Los 13 factores para disparar tu tráficoDiscover, pa’ tipos como tú: Los 13 factores para disparar tu tráfico
Discover, pa’ tipos como tú: Los 13 factores para disparar tu tráficoClara Soteras
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkDatabricks
 
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Christopher Gutknecht
 

What's hot (20)

Coronavirus and Future of SEO: Digital Marketing and Remote Culture
Coronavirus and Future of SEO: Digital Marketing and Remote CultureCoronavirus and Future of SEO: Digital Marketing and Remote Culture
Coronavirus and Future of SEO: Digital Marketing and Remote Culture
 
Search Query Processing: The Secret Life of Queries, Parsing, Rewriting & SEO
Search Query Processing: The Secret Life of Queries, Parsing, Rewriting & SEOSearch Query Processing: The Secret Life of Queries, Parsing, Rewriting & SEO
Search Query Processing: The Secret Life of Queries, Parsing, Rewriting & SEO
 
Automating Google Lighthouse
Automating Google LighthouseAutomating Google Lighthouse
Automating Google Lighthouse
 
SEO Case Study - Hangikredi.com From 12 March to 24 September Core Update
SEO Case Study - Hangikredi.com From 12 March to 24 September Core UpdateSEO Case Study - Hangikredi.com From 12 March to 24 September Core Update
SEO Case Study - Hangikredi.com From 12 March to 24 September Core Update
 
BrightonSEO March 2021 | Dan Taylor, Image Entity Tags
BrightonSEO March 2021 | Dan Taylor, Image Entity TagsBrightonSEO March 2021 | Dan Taylor, Image Entity Tags
BrightonSEO March 2021 | Dan Taylor, Image Entity Tags
 
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
 
Antifragility in Digital Marketing
Antifragility in Digital MarketingAntifragility in Digital Marketing
Antifragility in Digital Marketing
 
AI-powered Semantic SEO by Koray GUBUR
AI-powered Semantic SEO by Koray GUBURAI-powered Semantic SEO by Koray GUBUR
AI-powered Semantic SEO by Koray GUBUR
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)
 
Opinion-based Article Ranking for Information Retrieval Systems: Factoids and...
Opinion-based Article Ranking for Information Retrieval Systems: Factoids and...Opinion-based Article Ranking for Information Retrieval Systems: Factoids and...
Opinion-based Article Ranking for Information Retrieval Systems: Factoids and...
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflow
 
Lexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEOLexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEO
 
How to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With PythonHow to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With Python
 
Semantic Search Engine: Semantic Search and Query Parsing with Phrases and En...
Semantic Search Engine: Semantic Search and Query Parsing with Phrases and En...Semantic Search Engine: Semantic Search and Query Parsing with Phrases and En...
Semantic Search Engine: Semantic Search and Query Parsing with Phrases and En...
 
Disambiguating Equiprobability in SEO Dawn Anderson Friends of Search 2020
Disambiguating Equiprobability in SEO Dawn Anderson Friends of Search 2020Disambiguating Equiprobability in SEO Dawn Anderson Friends of Search 2020
Disambiguating Equiprobability in SEO Dawn Anderson Friends of Search 2020
 
Slawski New Approaches for Structured Data:Evolution of Question Answering
Slawski   New Approaches for Structured Data:Evolution of Question Answering Slawski   New Approaches for Structured Data:Evolution of Question Answering
Slawski New Approaches for Structured Data:Evolution of Question Answering
 
Passage indexing is likely more important than you think
Passage indexing is likely more important than you thinkPassage indexing is likely more important than you think
Passage indexing is likely more important than you think
 
Discover, pa’ tipos como tú: Los 13 factores para disparar tu tráfico
Discover, pa’ tipos como tú: Los 13 factores para disparar tu tráficoDiscover, pa’ tipos como tú: Los 13 factores para disparar tu tráfico
Discover, pa’ tipos como tú: Los 13 factores para disparar tu tráfico
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
 

Similar to Thought Vectors and Knowledge Graphs in AI-powered Search

The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered SearchTrey Grainger
 
The Next Generation of AI-Powered Search
The Next Generation of AI-Powered SearchThe Next Generation of AI-Powered Search
The Next Generation of AI-Powered SearchLucidworks
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsIlkay Altintas, Ph.D.
 
AI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementAI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementTrey Grainger
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systemsTrey Grainger
 
Text Mining & Sentiment Analysis made easy, with Azure and Power BI
Text Mining & Sentiment Analysis made easy, with Azure and Power BIText Mining & Sentiment Analysis made easy, with Azure and Power BI
Text Mining & Sentiment Analysis made easy, with Azure and Power BISanil Mhatre
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
THAT Conference 2021 - State-of-the-art Search with Azure Cognitive Search
THAT Conference 2021 - State-of-the-art Search with Azure Cognitive SearchTHAT Conference 2021 - State-of-the-art Search with Azure Cognitive Search
THAT Conference 2021 - State-of-the-art Search with Azure Cognitive SearchBrian McKeiver
 
Web technology: Web search
Web technology: Web searchWeb technology: Web search
Web technology: Web searchVictor de Boer
 
Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval Abhay Ratnaparkhi
 
How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkCaserta
 
PostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty databasePostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty databaseBarry Jones
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Text Mining & Sentiment Analysis with Power BI & Azure
Text Mining & Sentiment Analysis with Power BI & AzureText Mining & Sentiment Analysis with Power BI & Azure
Text Mining & Sentiment Analysis with Power BI & AzureSanil Mhatre
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 

Similar to Thought Vectors and Knowledge Graphs in AI-powered Search (20)

The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
 
The Next Generation of AI-Powered Search
The Next Generation of AI-Powered SearchThe Next Generation of AI-Powered Search
The Next Generation of AI-Powered Search
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Everything You Wish You Knew About Search
Everything You Wish You Knew About SearchEverything You Wish You Knew About Search
Everything You Wish You Knew About Search
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable Workflows
 
AI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementAI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge Management
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
 
Text Mining & Sentiment Analysis made easy, with Azure and Power BI
Text Mining & Sentiment Analysis made easy, with Azure and Power BIText Mining & Sentiment Analysis made easy, with Azure and Power BI
Text Mining & Sentiment Analysis made easy, with Azure and Power BI
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
THAT Conference 2021 - State-of-the-art Search with Azure Cognitive Search
THAT Conference 2021 - State-of-the-art Search with Azure Cognitive SearchTHAT Conference 2021 - State-of-the-art Search with Azure Cognitive Search
THAT Conference 2021 - State-of-the-art Search with Azure Cognitive Search
 
Web technology: Web search
Web technology: Web searchWeb technology: Web search
Web technology: Web search
 
Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval
 
How Google works
How Google worksHow Google works
How Google works
 
How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on Spark
 
PostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty databasePostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty database
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Text Mining & Sentiment Analysis with Power BI & Azure
Text Mining & Sentiment Analysis with Power BI & AzureText Mining & Sentiment Analysis with Power BI & Azure
Text Mining & Sentiment Analysis with Power BI & Azure
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 

More from Trey Grainger

Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationTrey Grainger
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Trey Grainger
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceTrey Grainger
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Trey Grainger
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AITrey Grainger
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphTrey Grainger
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for MeaningTrey Grainger
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesTrey Grainger
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphTrey Grainger
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation EnginesTrey Grainger
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsTrey Grainger
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrTrey Grainger
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemTrey Grainger
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelTrey Grainger
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge GraphTrey Grainger
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemTrey Grainger
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Trey Grainger
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineLeveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineTrey Grainger
 
Semantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrSemantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrTrey Grainger
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 

More from Trey Grainger (20)

Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital Transformation
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative Space
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AI
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge Graph
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for Meaning
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation Engines
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge Graph
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis Panel
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineLeveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
 
Semantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrSemantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/Solr
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 

Recently uploaded

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Thought Vectors and Knowledge Graphs in AI-powered Search

  • 1. Trey Grainger Chief Algorithms Officer Thought Vectors and Knowledge Graphs in AI-powered Search October 15, 2019
  • 2. Trey Grainger Chief Algorithms Officer • Previously: SVP of Engineering @ Lucidworks; Director of Engineering @ CareerBuilder • Georgia Tech – MBA, Management of Technology • Furman University – BA, Computer Science, Business, & Philosophy • Stanford University – Information Retrieval & Web Search Other fun projects: • Co-author of Solr in Action, plus numerous research publications • Advisor to Presearch, the decentralized search engine • Lucene / Solr contributor About Me
  • 4. • About Lucidworks • What is AI-powered Search? • What are Thought Vectors? • Vector Search in Apache Solr • What is a Knowledge Graph (and related terminology)? • Philosophy of Language • Semantic Knowledge Graphs • Implementing Knowledge Graph-based Search • Solr Text Tagger • Solr’s Semantic Knowledge Graph • Demos! Agenda
  • 5. The Search & AI Conference COMPANY BEHIND Who are we? 300+ CUSTOMERS ACROSS THE FORTUNE 1000 400+EMPLOYEES OFFICES IN San Francisco, CA (HQ) Raleigh-Durham, NC Cambridge, UK Bangalore, India Hong Kong Employ about 40% of the active committers on the Solr project 40% Contribute over 70% of Solr's open source codebase 70% DEVELOP & SUPPORT Apache
  • 6. Industry’s most powerful Intelligent Search & Discovery Platform.
  • 7. Proudly built with open-source tech at its core: Apache Solr & Apache Spark Personalizes search with applied machine learning Proven on the world’s biggest information systems
  • 9. AI? Machine Learning? Data Science? Neural Search? AI-Powered Search? Deep Learning?
  • 10.
  • 12. AI-powered Search Question / Answer Systems Virtual Assistants • Signals Boosting Models • Learning to Rank • Semantic Search • Collaborative Filtering • Personalized Search • Content Clustering • NLP / Entity Resolution • Semantic Knowledge Graphs • Document Classification • etc. • Neural Search • Word Embeddings • Vector Search • Image / Voice Search • etc. • Question / Answer Systems • Virtual Assistants • Chatbots • Rules-based Relevancy • etc.
  • 13. Tonight, we’ll touch only on these… Question / Answer Systems Virtual Assistants • Signals Boosting Models • Learning to Rank • Semantic Search • Collaborative Filtering • Personalized Search • Content Clustering • NLP / Entity Resolution • Semantic Knowledge Graphs • Document Classification • etc. • Neural Search • Word Embeddings • Vector Search • Image / Voice Search • etc. • Question / Answer Systems • Virtual Assistants • Chatbots • Rules-based Relevancy • etc. AI-powered Search
  • 15. Term Documents a doc1 [2x] brown doc3 [1x] , doc5 [1x] cat doc4 [1x] cow doc2 [1x] , doc5 [1x] … ... once doc1 [1x], doc5 [1x] over doc2 [1x], doc3 [1x] the doc2 [2x], doc3 [2x], doc4[2x], doc5 [1x] … … Document Content Field doc1 once upon a time, in a land far, far away doc2 the cow jumped over the moon. doc3 the quick brown fox jumped over the lazy dog. doc4 the cat in the hat doc5 The brown cow said “moo” once. … … What you SEND to Lucene/Solr: How the content is INDEXED into Lucene/Solr (conceptually): An inverted index (“how a search engine works”)
  • 16. /solr/collection/select/?q=apache solr Term Documents … … apache doc1, doc3, doc4, doc5 … hadoop doc2, doc4, doc6 … … solr doc1, doc3, doc4, doc7, doc8 … … doc5 doc7 doc8 doc1 doc3 doc4 solr apache apache solr Matching queries to documents
  • 17. BM25 (Relevance Scoring between Query and Documents) Score(q, d) = ∑ idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 – b + b · |d| / avgdl ) t in q Where: t = term; d = document; q = query; i = index tf(t in d) = numTermOccurrencesInDocument ½ idf(t) = 1 + log (numDocs / (docFreq + 1)) |d| = ∑ 1 t in d avgdl = = ( ∑ |d| ) / ( ∑ 1 ) ) d in i d in i k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point. b = Free parameter. Usually ~0.75. Increases impact of document normalization.
  • 18. The standard for enterprise search. of Fortune 500 uses Solr. 90%
  • 19. Key Solr Features: ● Multilingual Keyword search ● Relevancy Ranking of results ● Faceting & Analytics (nested / relational) ● Highlighting ● Spelling Correction ● Autocomplete/Type-ahead Prediction ● Sorting, Grouping, Deduplication ● Distributed, Fault-tolerant, Scalable ● Geospatial search ● Complex Function queries ● Recommendations (More Like This) ● Graph Queries and Traversals ● SQL Query Support ● Streaming Aggregations ● Batch and Streaming processing ● Highly Configurable / Plugins ● Learning to Rank ● Building machine-learning models ● … many more *source: Solr in Action, chapter 2
  • 20.
  • 21.
  • 22. What is a Thought Vector?
  • 23. Sentence Embeddings: [ 2, 3, 2, 4, 2, 1, 5, 3 ] [ 5, 3, 2, 3, 4, 0, 3, 4 ] . . . Document Embedding: [ 4, 1, 4, 2, 1, 2, 4, 3 ] Word Embeddings: [ 5, 1, 3, 4, 2, 1, 5, 3 ] [ 4, 1, 3, 0, 1, 1, 4, 2 ] . . . Paragraph Embeddings: [ 5, 1, 4, 1, 0, 2, 4, 0 ] [ 1, 1, 4, 2, 1, 0, 0, 0 ] . . . Thought Vectors
  • 24. apple caffeine cheese coffee drink donut food juice pizza tea water cappuccino 0 0 0 0 0 0 0 0 0 0 0 apple 1 0 0 0 0 0 0 0 0 0 0 juice 0 0 0 0 0 0 0 1 0 0 0 cheese 0 0 1 0 0 0 0 0 0 0 0 pizza 0 0 0 0 0 0 0 0 1 0 0 donut 0 0 0 0 0 1 0 0 0 0 0 green 0 0 0 0 0 0 0 0 0 0 0 tea 0 0 0 0 0 0 0 0 0 1 0 bread 0 0 1 0 0 0 0 0 0 0 0 sticks 0 0 0 0 0 0 0 0 0 0 0 exact term lookup in inverted indexquery Single Term Searches (as a Vector)
  • 25. Combined Vector query Multi-term Query Vectors juice 0 0 0 0 0 0 0 1 0 0 0 apple 1 0 0 0 0 0 0 0 0 0 0 + apple juice 1 0 0 0 0 0 0 1 0 0 0
  • 26. apple caffeine cheese coffee drink donut food juice pizza tea water latte 0 0 0 0 0 0 0 0 0 0 0 cappuccino 0 0 0 0 0 0 0 0 0 0 0 apple juice 1 0 0 0 0 0 0 1 0 0 0 cheese pizza 0 0 1 0 0 0 0 0 1 0 0 donut 0 0 0 0 0 1 0 0 0 0 0 soda 0 0 0 0 0 0 0 0 0 0 0 green tea 0 0 0 0 0 0 0 0 0 1 0 water 0 0 0 0 0 0 0 0 0 0 1 cheese bread sticks 0 0 1 0 0 0 0 0 0 0 0 cinnamon sticks 0 0 0 0 0 0 0 0 0 0 0 exact term lookup in inverted indexquery Multi-term Searches
  • 27. food drink dairy bread caffeine sweet calories healthy apple juice 0 5 0 0 0 4 4 3 cappuccino 0 5 3 0 4 1 2 3 cheese bread sticks 5 0 4 5 0 1 4 2 cheese pizza 5 0 4 4 0 1 5 2 cinnamon bread sticks 5 0 1 5 0 3 4 2 donut 5 0 1 5 0 4 5 1 green tea 0 5 0 0 2 1 1 5 latte 0 5 4 0 4 1 3 3 soda 0 5 0 0 3 5 5 0 water 0 5 0 0 0 0 0 5 Dimensionality Reduction
  • 28. Phrase: Vector: apple juice: [ 0, 5, 0, 0, 0, 4, 4, 3 ] cappuccino: [ 0, 5, 3, 0, 4, 1, 2, 3 ] cheese bread sticks: [ 5, 0, 4, 5, 0, 1, 4, 2 ] cheese pizza: [ 5, 0, 4, 4, 0, 1, 5, 2 ] cinnamon bread sticks: [ 5, 0, 4, 5, 0, 1, 4, 2 ] donut: [ 5, 0, 1, 5, 0, 4, 5, 1 ] green tea: [ 0, 5, 0, 0, 2, 1, 1, 5 ] latte: [ 0, 5, 4, 0, 4, 1, 3, 3 ] soda: [ 0, 5, 0, 0, 3, 5, 5, 0 ] water: [ 0, 5, 0, 0, 0, 0, 0, 5 ] Ranked Results: Green Tea 0.94 water 0.85 cappuccino 0.80 latte 0.78 apple juice 0.60 soda … … 0.19 donut Vector Similarity Scores: Vector Similarity (a, b): cos(θ) = a · b |a| × |b| Ranked Results: Cheese Pizza 0.99 cheese bread sticks 0.91 cinnamon bread sticks 0.89 donut 0.47 latte 0.46 apple juice … … 0.19 water Vector Similarity Scoring
  • 29. Vector Similarity Scores: Performance Considerations Problem: Vector Scoring is Slow • Unlike keyword search, which looks up pre-indexed answers to queries, Vector Search must instead calculate similarities between the query vector and every document’s vectors to determine best matches, which is slow at scale. Solution: Quantized Vectors • “Quantization” is the process for mapping vectors features to discrete values. • Creating “tokens” which map to a similar vector space, enables matching on those tokens to perform an ANN (Approximate Nearest Neighbor) search • This enables converting vector scoring into a search problem (term lookup and scoring), which is fast again, at the expense of some recall and scoring accuracy Recommended Approach: Quantized Vector Search + Vector Similarity Reranking • Combine the best of both worlds by running an initial ANN search on a quantized vector representation, and then re-rank the top-N results using full Vector similarity scoring.
  • 32. Option 1: Streaming Expressions
  • 33. curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/food/update?commit=true --data-binary ' [ {"id": "1", "name_s":"donut", "vector_fs":[5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0]}, {"id": "2", "name_s":"apple juice", "vector_fs":[1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0]}, {"id": "3", "name_s":"cappuccino", "vector_fs":[0.0,5.0,3.0,0.0,4.0,1.0,2.0,3.0]}, {"id": "4", "name_s":"cheese pizza", "vector_fs":[5.0,0.0,4.0,4.0,0.0,1.0,5.0,2.0]}, {"id": "5", "name_s":"green tea", "vector_fs":[0.0,5.0,0.0,0.0,2.0,1.0,1.0,5.0]}, {"id": "6", "name_s":"latte", "vector_fs":[0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0]}, {"id": "7", "name_s":"soda", "vector_fs":[0.0,5.0,0.0,0.0,3.0,5.0,5.0,0.0]}, {"id": "8", "name_s":"cheese bread sticks", "vector_fs":[5.0,0.0,4.0,5.0,0.0,1.0,4.0,2.0]}, {"id": "9", "name_s":"water", "vector_fs":[0.0,5.0,0.0,0.0,0.0,0.0,0.0,5.0]}, {"id": "10", "name_s":"cinnamon bread sticks", "vector_fs":[5.0,0.0,1.0,5.0,0.0,3.0,4.0,2.0]} ] ' Send Documents to Solr: Streaming Expressions
  • 34.
  • 36. http://localhost:8983/solr/food/select?q=*:*&fl=id,name_s& fq={!streaming_expression}top( select( search(food, q="*:*", fl="id,vector_fs", sort="id asc"), cosineSimilarity(vector_fs, array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as cos, id), n=5, sort="cos desc” ) { "responseHeader":{ … }, "response":{"numFound":5,"start":0,"docs":[ { "name_s":"donut", "id":"1"}, { "name_s":"apple juice", "id":"2"}, { "name_s":"cheese pizza", "id":"4"}, { "name_s":"cheese bread sticks", "id":"8"}, { "name_s":"cinnamon bread sticks", "id":"10"}] }} Request: Response: Streaming Expressions Query Parser
  • 37. Option 3: Solr Vector Scoring Plugin
  • 38. Send Documents to Solr: curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/{your-collection-name}/update?commit=true -- data-binary ‘ [ {"name":"example 0", "vector":"0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33"}, {"name":"example 1", "vector":"0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25"}, {"name":"example 2", "vector":"0|1.11 1|0.6 2|1.47 3|1.99 4|2.91 5|1.01"}, {"name":"example 3", "vector":"0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9"}, {"name":"example 4", "vector":"0|4.01 1|3.69 2|2 3|4.36 4|1.09 5|0.1"}, {"name":"example 5", "vector":"0|0.64 1|3.95 2|1.03 3|1.65 4|0.99 5|0.09"} ]' Solr Vector Scoring Plugin
  • 39. Request: Response: http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0” } { "responseHeader":{ "status":0, "QTime":1}}, "response":{ "numFound":6,"start":0,"maxScore":0.99984086, "docs":[ { "name":["example 3"], "vector":["0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 "], "score":0.99984086}, { "name":["example 0"], "vector":["0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "], "score":0.7693964}, { "name":["example 5"], "vector":["0|0.64 1|3.95 2|1.03 3|1.65 4|0.99 5|0.09 "], "score":0.76322395}, { "name":["example 4"], "vector":["0|4.01 1|3.69 2|2 3|4.36 4|1.09 5|0.1 "], "score":0.5328145}, { "name":["example 1"], "vector":["0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25 "], "score":0.48513117}, { "name":["example 2"], "vector":["0|1.11 1|0.6 2|1.47 3|1.99 4|2.91 5|1.01 "], "score":0.44909418}] }} Solr Vector Scoring Plugin
  • 40. Option 4: Solr Vector Scoring + LSH Plugin
  • 41. Send Documents to Solr: Solr Vector Scoring + LSH Plugin curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/{your-collection- name}/update?update.chain=LSH&commit=true --data-binary ‘ [ {"id":"1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33"}, {"id":"2", "vector":"3.54,0.4,4.16,4.88,4.28,4.25"} ]' http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true" reRankDocs="5"}&fl=name,score,vector,_vector_,_lsh_hash_ Request:
  • 42. Response: Solr Vector Scoring + LSH Plugin { "responseHeader":{ "status":0, "QTime":8, "response":{"numFound":1,"start":0,"maxScore":36.65736, "docs":[ { "id": "1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33", "_vector_":"/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==", "_lsh_hash_":["0_8", "1_35", "2_7", "3_10", "4_2", "5_35", "6_16", "7_30", "8_27", "9_12", "10_7", "11_32", "12_48", "13_36", "14_10", "15_7", "16_42", "17_5", "18_3", "19_2", "20_1", "21_0", "22_24", "23_18", "24_42", "25_31", "26_35", "27_8", "28_1", "29_24", "30_47", "31_14", "32_22", "33_39", "34_0", "35_34", "36_34", "37_39", "38_27", "39_27", "40_45", "41_10", "42_21", "43_34", "44_41", "45_9", "46_31", "47_0", "48_4", "49_43"], "score":36.65736} ] } } http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true" reRankDocs="5"}&fl=name,score,vector,_vector_,_lsh_hash_ Request:
  • 43. Option 5 (Work in Progress): First-class Vector Fields in Lucene/Solr
  • 46. • Take queries, documents, sentences, paragraphs, etc. and transform them into vectors. • Usually leverage deep learning, which can discover rich language usage rules and map them to combinations of features in the vector • Popular Libraries: • Bert • Elmo • Universal Sentence Encoder • Word2Ve • Sentence2Vec • Glove • fastText Vector Encoders
  • 47. Query Type Likely Outcome Obscure keyword combinations Q. (software OR hardware) AND enginee* • Keyword search succeeds • Vector Search fails Natural Language Queries Q. Can my wife drive on my insurance? • Keyword search might get lucky, but probably fails • Vector Search succeeds Fuzzy Language Queries Q. famous french tower • Keyword search mismatch yields poor results • Vector Search succeeds Structured Relationship Queries Q. popular barbeque near Activate • Keyword search fails • Vector search fails • Need a Knowledge Graph! Keyword Search vs. Vector Search
  • 48. What is a Knowledge Graph? (vs. Ontology vs. Taxonomy vs. Synonyms, etc.)
  • 49.
  • 50. Overly Simplistic Definitions Alternative Labels: Substitute words with identical meanings [ CTO => Chief Technology Officer; specialise => specialize ] Synonyms List: Provides substitute words that can be used to represent the same or very similar things [ human => homo sapien, mankind; food => sustenance, meal ] Taxonomy: Classifies things into Categories [ john is Human; Human is Mammal; Mammal is Animal ] Ontology: Defines relationships between types of things [ animal eats food; human is animal ] Knowledge Graph: Instantiation of an Ontology (contains the things that are related) [ john is human; john eats food ] In practice, there is significant overlap…
  • 51.
  • 52. What sort of Knowledge Graph can help us with the kinds of problems we encounter in Search use cases?
  • 53.
  • 54. most often used in reference to “free text”
  • 55. But… unstructured data is really more like “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.”
  • 56. Structured Data Employees Table id name company start_date lw100 Trey Grainger 1234 2016-02-01 dis2 Mickey Mouse 9123 1928-11-28 tsla1 Elon Musk 5678 2003-07-01 Companies Table id name start_date 1234 Lucidworks 2016-02-01 5678 Tesla 1928-11-28 9123 Disney 2003-07-01 Discrete Values Continuous Values Foreign Key
  • 57. Unstructured Data Trey Grainger works at Lucidworks. He is speaking at the Activate 2019 conference. #Activate19 (Activate) is being held in Washington, DC September 9-12, 2019. Trey got his masters from Georgia Tech.
  • 58. Trey Grainger works for Lucidworks. He is speaking at the Activate 2019 conference. #Activate19 (Activate) is being held in Washington, DC September 9-12, 2019. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Unstructured Data
  • 59. Trey Grainger works for Lucidworks. He is speaking at the Activate 2019 conference. #Activate19 (Activate) is being held in Washington, DC September 9-12, 2019. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Foreign Key?
  • 60. Trey Grainger works for Lucidworks. He is speaking at the Activate 2019 conference. #Activate19 (Activate) is being held in Washington, DC September 9-12, 2019. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Fuzzy Foreign Key? (Entity Resolution)
  • 61. Trey Grainger works for Lucidworks. He is speaking at the Activate 2019 conference. #Activate19 (Activate) is being held in Washington, DC September 9-12, 2019. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Fuzzier Foreign Key? (metadata, latent features)
  • 62. Trey Grainger works for Lucidworks. He is speaking at the Activate 2019 conference. #Activate19 (Activate) is being held in Washington, DC September 9-12, 2019. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Fuzzier Foreign Key? (metadata, latent features) Not so fast!
  • 63.
  • 64.
  • 65. Giant Graph of Relationships... Trey Grainger works for Lucidworks. He is speaking at the Activate 2019 conference. #Activate19 (Activate) is being held in Washington, DC September 9-12, 2019. Trey got his masters degree from Georgia Tech. Trey’s Voicemail
  • 66. How do we easily harness this “semantic graph” of relationships within unstructured information?
  • 68. id: 1 job_title: Software Engineer desc: software engineer at a great company skills: .Net, C#, java id: 2 job_title: Registered Nurse desc: a registered nurse at hospital doing hard work skills: oncology, phlebotemy id: 3 job_title: Java Developer desc: a software engineer or a java engineer doing work skills: java, scala, hibernate field doc term desc 1 a at company engineer great software 2 a at doing hard hospital nurse registered work 3 a doing engineer java or software work job_title 1 Software Engineer … … … Terms-Docs Inverted IndexDocs-Terms Forward IndexDocuments Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph field term postings list doc pos desc a 1 4 2 1 3 1, 5 at 1 3 2 4 company 1 6 doing 2 6 3 8 engineer 1 2 3 3, 7 great 1 5 hard 2 7 hospital 2 5 java 3 6 nurse 2 3 or 3 4 registered 2 2 software 1 1 3 2 work 2 10 3 9 job_title java developer 3 1 … … … …
  • 69. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto- generated model for real- time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Set-theory View Graph View How the Graph Traversal Works skill: Java skill: Scala skill: Hibernate skill: Oncology has_related_skill has_related_skill has_related_skill doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 skill: Java skill: Java skill: Scala skill: Hibernate skill: Oncology Data Structure View Java Scala Hibernate docs 1, 2, 6 docs 3, 4 Oncology doc 5
  • 70. DOI: 10.1109/DSAA.2016.51 Conference: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Graph Traversal Data Structure View Graph View doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 skill: Java skill: Java skill: Scala skill: Hibernate skill: Oncology doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 job_title: Software Engineer job_title: Data Scientist job_title: Java Developer …… Inverted Index Lookup Forward Index Lookup Forward Index Lookup Inverted Index Lookup Java Java Developer Hibernate Scala Software Engineer Data Scientist has_related_skill has_related_skill has_related_skill has_related_job_title has_related_job_title has_related_job_title has_related_job_title has_related_job_title has_related_job_title
  • 71. Scoring of Node Relationships (Edge Weights) Foreground vs. Background Analysis Every term scored against it’s context. The more commonly the term appears within it’s foreground context versus its background context, the more relevant it is to the specified foreground context. countFG(x) - totalDocsFG * probBG(x) z = -------------------------------------------------------- sqrt(totalDocsFG * probBG(x) * (1 - probBG(x))) { "type":"keywords”, "values":[ { "value":"hive", "relatedness":0.9773, "popularity":369 }, { "value":"java", "relatedness":0.9236, "popularity":15653 }, { "value":".net", "relatedness":0.5294, "popularity":17683 }, { "value":"bee", "relatedness":0.0, "popularity":0 }, { "value":"teacher", "relatedness":-0.2380, "popularity":9923 }, { "value":"registered nurse", "relatedness": -0.3802 "popularity":27089 } ] } We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus) + - Foreground Query: "Hadoop" Knowledge Graph
  • 72. Related term vector (for query concept expansion) http://localhost:8983/solr/stack-exchange-health/skg
  • 73. Content-based Recommendations (More Like This on Steroids) http://localhost:8983/solr/job-postings/skg
  • 74. Who’s in Love with Jean Grey?
  • 75. Differentiating related terms Misspellings: managr => manager Synonyms: cpa => certified public accountant rn => registered nurse r.n. => registered nurse Ambiguous Terms*: driver => driver (trucking) ~80% likelihood driver => driver (software) ~20% likelihood Related Terms: r.n. => nursing, bsn hadoop => mapreduce, hive, pig *differentiated based upon user and query context
  • 76. Thought Exercise What do you think of when I say the word “driver”? What about “architect”?
  • 77. Use Case: Query Disambiguation Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015. Example Related Keywords (representing multiple meanings) driver truck driver, linux, windows, courier, embedded, cdl, delivery architect autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer … …
  • 78. Use Case: Query Disambiguation Example Related Keywords (representing multiple meanings) driver truck driver, linux, windows, courier, embedded, cdl, delivery architect autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  • 79. A few methodologies: 1) Query Log Mining 2) Semantic Knowledge Graph Knowledge Graph
  • 80. Semantic Knowledge Graph: Discovering ambiguous phrases 1) Exact same concept, but use a document classification field (i.e. category) as the first level of your graph, and the related terms as the second level to which you traverse. 2) Has the benefit that you don’t need query logs to mine, but it will be representative of your data, as opposed to your user’s intent, so the quality depends on how clean and representative your documents are. Additional Benefit: Multi-dimensional disambiguation and dynamic materialization of categories. Effectively an dynamically-materialized probabilistic graphical model
  • 81. Disambiguation by Category Example Meaning 1: Restaurant => bbq, brisket, ribs, pork, … Meaning 2: Outdoor Equipment => bbq, grill, charcoal, propane, …
  • 82. Disambiguated meanings (represented as term vectors) Example Related Keywords (Disambiguated Meanings) architect 1: enterprise architect, java architect, data architect, oracle, java, .net 2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video 2: graphic, web designer, design, web design, graphic design, graphic designer 3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  • 83. Every term or phrase is a Context-dependent cluster of meaning with an ambiguous label
  • 84. What does “love” mean? http://localhost:8983/solr/thesaurus/skg
  • 85. What does “love” mean in the context of “hug”? http://localhost:8983/solr/thesaurus/skg "embrace"
  • 86. What does “love” mean in the context of “child”? http://localhost:8983/solr/thesaurus/skg
  • 87. So what’s the end goal here? User’s Query: machine learning research and development Portland, OR software engineer AND hadoop, java Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java) Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java Semantically Expanded Query: "machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence") AND ("research and development"^10 OR "r&d") AND AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
  • 88. Semantic Search Components: • Apache Solr • Solr Text Tagger • Semantic Knowledge Graph • Statistical Phrase Identifier • Fusion Semantic Query Pipelines • Fusion AI Synonyms Job • Fusion AI Token & Phrase Spell Correction Job • Fusion AI Head/Tail Analysis Job • Fusion AI Phrase Identification Job • Fusion Query Rules Engine
  • 89. See my Activate 2018 talk on “How to Build a Semantic Search System” For details on extended Lucidworks Fusion capabilities.
  • 90. For Today’s talk, I’ll only focus on these two • Solr Text Tagger • Semantic Knowledge Graph
  • 91.
  • 92.
  • 93. Graph Query Parser • Query-time, cyclic aware graph traversal is able to rank documents based on relationships • Provides controls for depth, filtering of results and inclusion of root and/or leaves • Limitations: distributed queries only traverse intra-shard docs Examples: • http://localhost:8983/solr/graph/query?fl=id,score& q={!graph from=in_edge to=out_edge}id:A • http://localhost:8983/solr/my_graph/query?fl=id& q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A • http://localhost:8983/solr/my_graph/query?fl=id& q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
  • 95. Demo!
  • 96. Demo Data Places (also includes geonames database) Entities (includes search commands) Text Content [ Web crawl of restaurant and product reviews sites ]
  • 97. Find Location (Graph Query) http://localhost:8983/solr/POI/select
  • 98. Graph Traversal converted to Facet http://localhost:8983/solr/POI/select
  • 99. For Remaining keywords, find doc type + related terms http://localhost:8983/solr/POI/select
  • 100. Full Knowledge Graph Traversal in Single Request!
  • 101.
  • 102. popular barbeque near Activate (popular same as "good", "top", "best") Hotels near Activate hotels near popular BBQ in Boston BBQ near airports near Activate hotels near movie theaters in Boston … And that’s really just the beginning!
  • 103. But it’s unfortunately also the end of our presentation for today : (
  • 104. We operationalize AI for the largest businesses on the planet.
  • 106. Trey Grainger trey@lucidworks.com @treygrainger Other presentations: http://www.treygrainger.com 40% Discount code: mtpapache19 http://aiPoweredSearch.com http://solrinaction.com Books: Thank You!