Searching for Meaning

Searching for Meaning:
The hidden structure in unstructured data
Trey Grainger
SVP of Engineering, Lucidworks
Southern Data Science Conference
2018.04.13

Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Advisor to Presearch, the decentralized search engine
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene / Solr contributor
About Me

Based in San Francisco, offices
and employees worldwide
Fusion, the platform for building
data-driven, smart apps
Over 400 customers running our
commercial software
Consulting and support for
organizations using Solr
Produces the world’s largest open
source user conference dedicated
to Lucene/Solr
Lucidworks is the primary commercial
contributor to the Apache Solr project
Employs over 40% of the active
committers on the Solr project
Contributes over 70% of Solr's
open source codebase
40%
70%

Fusion powers search for the brightest companies in the world.

most often used in
reference to

My Three Assertions
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical “structured
data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.

Assertion 1:
Unstructured data is actually “hyper-
structured” data. It is a graph that
contains much more structure than
typical “structured data.”
Southern
Data Science

Structured Data
Employees Table
id name company start_date
lw100 Trey
Grainger
1234 2016-02-01
dis2 Mickey
Mouse
9123 1928-11-28
tsla1 Elon
Musk
5678 2003-07-01
Companies Table
id name start_date
1234 Lucidworks 2016-02-01
5678 Tesla 1928-11-28
9123 Disney 2003-07-01
Discrete
Values
Continuous
Values
Foreign
Key
Southern
Data Science

Unstructured Data
Trey Grainger works at Lucidworks.
He is speaking at the SDSC 2018. Southern Data
Science Conference (SDSC) is being held in Atlanta
April 12-14, 2018. Trey got his masters from
Georgia Tech.
Southern
Data Science

Unstructured Data
(SDSC) is being held in Atlanta
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.
Southern
Data Science
He is speaking at the SDSC 2018.
Trey’s Voicemail

Foreign Key?
April 12-14, 2018.
Georgia Tech.
Southern
Data Science
Trey’s Voicemail

Fuzzy Foreign Key? (Entity Resolution)
April 12-14, 2018.
Georgia Tech.
Southern
Data Science
Trey’s Voicemail

Fuzzier Foreign Key? (metadata, latent features)
April 12-14, 2018.
Georgia Tech.
Trey’s Voicemail
Southern
Data Science

Fuzzier Foreign Key? (metadata, latent features)
April 12-14, 2018.
Georgia Tech.
Trey’s Voicemail
Southern
Data Science
Not so Fast!

Giant Graph of Relationships...
Trey Grainger works for Lucidworks.
April 12-14, 2018.
Georgia Tech.
Southern
Data Science
Trey’s Voicemail

Assertion 1 (Summary):
Unstructured data is actually “hyper-
structured” data. It is a graph that
contains much more structure than
typical “structured data.”
Southern
Data Science

Assertion 2:
That graph is very rich, but is a
compression of meaning into a lossy
format. Much of data science is
essentially the decompression from
this lossy format into a reconstituted
form.
Southern
Data Science

Southern
Data Science
01
Semantic Data Encoded into Free Text Content
e en eng engi engineer engineers
engineer engineersNode Type: Term
software
engineer
software
engineers
electrical
engineering
engineer
engineering software
…
…
…
Node Type:
Character Sequence
Node Type:
Term Sequence
Node Type:
Document
id: 1
text: looking for a software
engineerwith degree in
computer science or
electrical engineering
id: 2
text: apply to be a software
engineer and work with
other great software
engineers
id: 3
text: start a great careerin
electrical engineering
…
…

How do we easily harness this
“semantic graph” or relationships
within unstructured information?
Southern
Data Science

Search Engines are really good at querying
across characters sequences, term sequences,
and documents
Example Queries:
c?o CTO, CEO, CFO, …
"VP Engineering"~2 “VP of Engineering”,
VP Engineering” ,“Engineering VP”,
“VP of Infrastructure Engineering”
(Microsoft OR MS) AND Word “MS Word”, “Microsoft Word”

Term Documents
a doc1 [2x]
brown doc3 [1x] , doc5 [1x]
cat doc4 [1x]
cow doc2 [1x] , doc5 [1x]
… ...
once doc1 [1x], doc5 [1x]
over doc2 [1x], doc3 [1x]
the doc2 [2x], doc3 [2x],
doc4[2x], doc5 [1x]
… …
Document Content Field
doc1 once upon a time, in a land far,
far away
doc2 the cow jumped over the moon.
doc3 the quick brown fox jumped over
the lazy dog.
doc4 the cat in the hat
doc5 The brown cow said “moo”
once.
… …
What you SEND to Lucene/Solr:
How the content is INDEXED into
Lucene/Solr (conceptually):
An inverted index (“how a search engine works”)
Southern
Data Science

/solr/collection/select/?q=apache solr
Term Documents
… …
apache
doc1, doc3, doc4,
doc5
…
hadoop doc2, doc4, doc6
… …
solr
doc1, doc3, doc4,
doc7, doc8
… …
doc5
doc7 doc8
doc1 doc3
doc4
solr
apache
apache solr
Matching queries to documents
Southern
Data Science

Search engines also do relevancy ranking (query to doc)
Score(q, d) =
∑ idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 – b + b · |d| / avgdl )
t in q
Where:
t = term; d = document; q = query; i = index
tf(t in d) = numTermOccurrencesInDocument ½
idf(t) = 1 + log (numDocs / (docFreq + 1))
|d| = ∑ 1
t in d
avgdl = = ( ∑ |d| ) / ( ∑ 1 ) )
d in i d in i
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency
saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document
normalization.

DOI: 10.1109/DSAA.2016.51
Conference: 2016 IEEE International Conference on
Data Science and Advanced Analytics (DSAA)

• “A compact, auto-generated model for real-time traversal and
ranking of any relationship within a domain”
• A multi-dimensional term-to-term (vs. term-to-document) search
engine
• A tool which enables knowledge modeling and reasoning, natural language
processing, anomaly detection, data cleansing, semantic search, analytics,
data classification, root cause analysis, and recommendations systems
• It’s kind of like Word2Vec, but vectors (or matrices) are generated
on the fly and are better suited for interpreting the nuanced intent of
typical search queries
What is the Semantic Knowledge Graph?

Southern
Data Science
Knowledge
Graph

Southern
Data Science
id: 1
job_title: Software Engineer
desc: software engineer at a
great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at
hospital doing hard work
skills: oncology, phlebotemy
id: 3
job_title: Java Developer
desc: a software engineer or a
java engineer doing work
skills: java, scala, hibernate
field doc term
desc
1
a
at
company
engineer
great
software
2
a
at
doing
hard
hospital
nurse
registered
work
3
a
doing
engineer
java
or
software
work
job_title 1
Software
Engineer
… … …
Terms-Docs Inverted IndexDocs-Terms Forward IndexDocuments
Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
field term postings list
doc pos
desc
a
1 4
2 1
3 1, 5
at
1 3
2 4
company 1 6
doing
2 6
3 8
engineer
1 2
3 3, 7
great 1 5
hard 2 7
hospital 2 5
java 3 6
nurse 2 3
or 3 4
registered 2 2
software
1 1
3 2
work
2 10
3 9
job_title java developer 3 1
… … … …

Southern
Data Science
Knowledge
Graph
Set-theory View
Graph View
How the Graph Traversal Works
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
Data Structure View
Java
Scala Hibernate
docs
1, 2, 6
docs
3, 4
Oncology
doc 5

Southern
Data Science
Knowledge
Graph
Multi-level Traversal
Data Structure View
Graph View
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
job_title:
Software
Engineer
job_title:
Data
Scientist
job_title:
Java
Developer
……
Inverted Index
Lookup
Forward Index
Lookup
Forward Index
Lookup
Inverted Index
Lookup
Java
Java
Developer
Hibernate
Scala
Software
Engineer
Data
Scientist
has_related_job_title

Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term scored against it’s context. The more
commonly the term appears within it’s foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords”, "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness": -0.3802 "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
+
-
Foreground Query:
"Hadoop"
Knowledge
Graph

Southern
Data Science
Knowledge
Graph
Multi-level Graph Traversal with Scores
software engineer*
(materialized node)
Java
C#
.NET
.NET
Developer
Java
Developer
Hibernate
ScalaVB.NET
Software
Engineer
Data
Scientist
Skill
Nodes
has_related_skillStarting
Node
Skill
Nodes
has_related_skill Job Title
Nodes
0.90
0.88 0.93
0.93
0.34
0.74
0.91
0.89
0.74
0.89
0.780.72
0.48
0.93
0.76
0.83
0.80
0.64
0.61
0.780.55

Southern
Data Science
Related term vector (for query concept expansion)
http://localhost:8983/solr/stack-exchange-health/skg

Southern
Data Science
Who’s in Love with Jean Grey?

Assertion 2 (Summary):
That graph is very rich, but is a
compression of meaning into a lossy
format. Much of data science is
essentially the decompression from
this lossy format into a reconstituted
form.
Southern
Data Science

Assertion 3:
Every instance of a word or phrase you
ever encounter has a unique meaning.
Southern
Data Science

Thought Exercise
What do you think of when I say the
word “driver”?
Southern
Data Science

Ambiguity
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Southern
Data Science

Use Case: Query Disambiguation
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
Southern
Data Science

Disambiguated meanings (represented as term vectors)
Example Related Keywords (Disambiguated Meanings)
architect 1: enterprise architect, java architect, data architect, oracle, java, .net
2: architectural designer, architectural drafter, autocad, autocad drafter, designer,
drafter, cad, engineer
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic,
photoshop, video
2: graphic, web designer, design, web design, graphic design, graphic designer
3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe,
structural designer, revit
… …
Southern
Data Science

Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
Southern
Data Science

Thought Exercise
What do you think of when I say the
word “Apple”?
Southern
Data Science

Every term or phrase is a
Context-dependent cluster of
meaning with an ambiguous label
Southern
Data Science

Southern
Data Science
What does “love” mean?
http://localhost:8983/solr/thesaurus/skg

Southern
Data Science
What does “love” mean in the context of “hug”?

Southern
Data Science
What does “love” mean in the context of “child”?

My Three Assertions (Recap)
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical “structured
data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.

Why do we care?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d") AND
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)

Contact Info
Trey Grainger
trey.grainger@lucidworks.com
@treygrainger
http://solrinaction.com
Other presentations:
http://www.treygrainger.com
Discount code: ctwsdsc18
Southern
Data Science

Searching for Meaning

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Searching for Meaning

Similar to Searching for Meaning (20)

Recently uploaded

Recently uploaded (20)

Searching for Meaning