Search engines, and Apache Solr in particular, are quickly shifting the focus away from “big data” systems storing massive amounts of raw (but largely unharnessed) content, to “smart data” systems where the most relevant and actionable content is quickly surfaced instead. Apache Solr is the blazing-fast and fault-tolerant distributed search engine leveraged by 90% of Fortune 500 companies. As a community-driven open source project, Solr brings in diverse contributions from many of the top companies in the world, particularly those for whom returning the most relevant results is mission critical.
Out of the box, Solr includes advanced capabilities like learning to rank (machine-learned ranking), graph queries and distributed graph traversals, job scheduling for processing batch and streaming data workloads, the ability to build and deploy machine learning models, and a wide variety of query parsers and functions that let you easily build highly relevant and domain-specific semantic search, recommendations, or personalized search experiences. These days, Solr even enables you to run SQL queries directly against it, mixing and matching the full power of Solr’s free-text, geospatial, and other search capabilities with a query language already known by most developers (and which many external systems can use to query Solr directly).
Due to the community-oriented nature of Solr, the ecosystem of capabilities also spans well beyond just the core project. In this talk, we’ll also cover several other projects within the larger Apache Lucene/Solr ecosystem that further enhance Solr’s smart data capabilities: bi-directional integration of Apache Spark and Solr’s capabilities, large-scale entity extraction, semantic knowledge graphs for discovering, traversing, and scoring meaningful relationships within your data, auto-generation of domain-specific ontologies, running SPARQL queries against Solr on RDF triples, probabilistic identification of key phrases within a query or document, conceptual search leveraging Word2Vec, and even Lucidworks’ own Fusion project which extends Solr to provide an enterprise-ready smart data platform out of the box.
We’ll dive into how all of these capabilities can fit within your data science toolbox, and you’ll come away with a solid understanding of how to build highly relevant “smart data” applications leveraging these key technologies.
The Apache Solr Smart Data Ecosystem
1. The Apache Solr Smart Data Ecosystem
Trey Grainger
SVP of Engineering, Lucidworks
DFW Data Science
2017.01.09
2. Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
About Me
7. Lucidworks enables Search-Driven Everything
Platform capabilities: Data Acquisition, Indexing & Streaming, Smart Access API, Recommendations & Alerts, Analytics & Insights, Extreme Relevancy
Use cases: Customer Service, Research Portal, Digital Content, Customer Insights, Fraud Surveillance, Online Retail
• Access all your data in a number of ways from one place.
• Secure storage and processing from Solr and Spark.
• Acquire data from any source with pre-built connectors and adapters.
Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.
16. • Over 50 connectors to integrate all your data
• Robust parsing framework to seamlessly ingest all your document types
• Point-and-click indexing configuration and iterative simulation of results, for full control over your ETL process
• Your security model enforced end-to-end, from ingest to search, across your different datasources
18. • Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.
• Machine-driven relevancy: Signals aggregation learns and automatically tunes relevancy and drives recommendations out of the box.
• Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.
• Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.
21. • 75% decrease in development time
• Licensing costs cut by 50%
“With Fusion’s out-of-the-box capabilities, we skipped months in our dev cycle so we could focus our team where they would have the most impact. We cut our licensing costs by 50% and improved application usability. The Lucidworks professional services team amplified our success even further. We’re all Fusion from here on out!”
— Lourduraju Pamishetty, Senior IT Application Architect
22. • Seamless integration of your entire search & analytics platform
• All capabilities exposed through secured APIs, so you can use our UI or build your own.
• End-to-end security policies can be applied out of the box to every aspect of your search ecosystem.
• Distributed, fault-tolerant scaling and supervision of your entire search application
23. Core Services: NLP, Recommenders / Signals, Blob Storage, Pipelines, Scheduling, Alerting / Messaging, Connectors, REST API, Admin UI, Lucidworks View
Data sources: logs, files, web, databases, cloud
28. The inverted index

What you SEND to Lucene/Solr:

Document  Content Field
doc1      once upon a time, in a land far, far away
doc2      the cow jumped over the moon.
doc3      the quick brown fox jumped over the lazy dog.
doc4      the cat in the hat
doc5      The brown cow said “moo” once.
…         …

How the content is INDEXED into Lucene/Solr (conceptually):

Term      Documents
a         doc1 [2x]
brown     doc3 [1x], doc5 [1x]
cat       doc4 [1x]
cow       doc2 [1x], doc5 [1x]
…         …
once      doc1 [1x], doc5 [1x]
over      doc2 [1x], doc3 [1x]
the       doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…         …
31. Text Analysis in Solr
A text field in Lucene/Solr has an Analyzer containing:
① Zero or more CharFilters: take the incoming text and “clean it up” before it is tokenized
② One Tokenizer: splits the incoming text into a Token Stream containing zero or more Tokens
③ Zero or more TokenFilters: examine and optionally modify each Token in the Token Stream
*From Solr in Action, Chapter 6
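As a sketch of how these three stages are wired together, a fieldType definition in a Solr schema might look like the following (the field type name is arbitrary and the exact filters chosen here are illustrative; the factory classes are standard Solr ones):

```xml
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <!-- CharFilter: strips HTML markup before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Tokenizer: splits the text into a Token Stream -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- TokenFilters: examine and optionally modify each Token -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
```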
41. When Stemming Goes Awry
Fixing stemming mistakes:
• Unfortunately, every stemmer has problem cases that aren’t handled as you would expect
• Thankfully, stemmers can be overridden:
  - KeywordMarkerFilter: protects a list of terms you specify from being stemmed
  - StemmerOverrideFilter: applies a list of custom term mappings you specify
Alternate strategy:
• Use lemmatization (root-form analysis) instead of stemming
• Commercial vendors help tremendously in this space
• The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages
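A sketch of how the two override filters slot into an analyzer chain, placed before the stemmer so they take effect first (protwords.txt and stemdict.txt are hypothetical file names; the factory classes and attributes are Solr’s standard ones):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- terms listed in protwords.txt pass through unstemmed -->
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <!-- explicit term-to-stem mappings, one tab-separated pair per line -->
  <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```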
43. Classic Lucene Relevancy Algorithm (now switched to BM25):
*Source: Solr in Action, chapter 3

score(q, d) = [ Σ over t in q of ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ) ] · coord(q, d) · queryNorm(q)

Where:
t = term; d = document; q = query; f = field
tf(t in d) = sqrt(numTermOccurrencesInDocument)
idf(t) = 1 + log(numDocs / (docFreq + 1))
coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
sumOfSquaredWeights = q.getBoost()² · Σ over t in q of (idf(t) · t.getBoost())²
norm(t, d) = d.getBoost() · lengthNorm(f) · f.getBoost()
44. TF * IDF
• Term Frequency: “How well does a term describe a document?”
  - Measure: how often the term occurs in the document
• Inverse Document Frequency: “How important is the term overall?”
  - Measure: how rare the term is across all documents
*Source: Solr in Action, chapter 3
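A minimal sketch of the two statistics, using the classic Lucene formulas from the previous slide (square-root-damped tf, smoothed idf); the corpus numbers here are invented for illustration:

```python
import math

def tf(term_count_in_doc):
    # classic Lucene: square-root damping of the raw term count
    return math.sqrt(term_count_in_doc)

def idf(num_docs, doc_freq):
    # rarer terms across the corpus score higher
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# a very common word vs. a rare one, each appearing 3x in a doc,
# in a hypothetical 1,000-document corpus
common_weight = tf(3) * idf(num_docs=1000, doc_freq=990)
rare_weight = tf(3) * idf(num_docs=1000, doc_freq=10)
```

The rare term ends up with a much larger weight even though both occur equally often in the document, which is exactly the intuition behind the two questions above.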
45. That’s great, but what about domain-specific knowledge?
News search: popularity and freshness drive relevance
Restaurant search: geographical proximity and price range are critical
Ecommerce: likelihood of a purchase is key
Movie search: more popular titles are generally more relevant
Job search: category of job, salary range, and geographical proximity matter
TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!
46. Consider what you know about users
John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.
Irene is a bartender in Dublin and is only interested in jobs within 10km of her location in the food service industry.
Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.
Jane is a nurse educator in Boston seeking a salary between $40K and $60K.
*Example from chapter 16 of Solr in Action
54. The Three C’s
Content: keywords and other features in your documents
Collaboration: how others have chosen to interact with your system
Context: available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”
56. Examples of Reflected Intelligence
● Recommendation algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank: using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
58. How to Measure Relevancy?
A = retrieved documents; C = relevant documents; B = their intersection (documents both retrieved and relevant)
Precision = B/A
Recall = B/C
Problem: assume precision = 90% and recall = 100%, but the 10% irrelevant documents were ranked at the top of the retrieved results. Is that OK?
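Both measures fall directly out of set intersection; a toy sketch with made-up document IDs:

```python
# A = retrieved set, C = relevant set, B = their intersection
retrieved = {"doc1", "doc2", "doc3", "doc4", "doc5"}  # A
relevant = {"doc2", "doc3", "doc4", "doc5", "doc6"}   # C
overlap = retrieved & relevant                         # B

precision = len(overlap) / len(retrieved)  # B / A
recall = len(overlap) / len(relevant)      # B / C
```

Note that neither number looks at *where* in the ranking the relevant documents appear, which is exactly the problem the slide raises and what nDCG addresses next.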
59. Normalized Discounted Cumulative Gain
Given ranking:
Rank  Relevancy
3     0.95
1     0.70
2     0.60
4     0.45
Ideal ranking:
Rank  Relevancy
1     0.95
2     0.85
3     0.80
4     0.65
• Position is considered in quantifying relevancy.
• A labeled dataset is required.
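A compact sketch of the computation (the log2 position discount is one common choice; the relevancy values below are illustrative rather than the slide’s exact tables):

```python
import math

def dcg(relevancies):
    # sum of relevancy discounted by log2 of the (1-indexed) position
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevancies, start=1))

given = [0.70, 0.60, 0.95, 0.45]     # observed ranking, by position
ideal = sorted(given, reverse=True)  # same docs, best-first
ndcg = dcg(given) / dcg(ideal)       # normalized to [0, 1]
```

Because the most relevant document (0.95) sits at position 3 instead of position 1, nDCG lands below 1.0, capturing exactly the position sensitivity that plain precision/recall miss.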
61. Learning to Rank (LTR)
● Applies machine learning techniques to discover the combination of features that provides the best ranking.
● Requires a labeled set of documents with relevancy scores for a given set of queries.
● Features used for ranking are usually more computationally expensive than the ones used for matching.
● Typically re-ranks a subset of the matched documents (e.g. the top 1,000).
63. Common LTR Algorithms
• RankNet* (neural network, boosted trees)
• LambdaMART* (set of regression trees)
• SVM Rank** (SVM classifier)
* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf
64. LambdaMART Example
Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016.
66. Obtaining Relevancy Judgements
Typical methodologies:
1) Hire employees, contractors, or interns
   Pros: accuracy
   Cons: expensive; not scalable (cost- or man-power-wise); data becomes stale
2) Crowdsource
   Pros: less cost, more scalable
   Cons: less accurate; data still becomes stale
Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016.
67. Reflected Intelligence: Possible to infer relevancy judgements?
[Figure: a ranked result list (Doc1 through Doc4) for a query, shown alongside different users’ click patterns on Doc1-Doc3 (e.g. 0, 1, 1 vs. 1, 0, 0), from which relevancy judgements can be inferred]
Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016.
74. Building a Taxonomy of Entities
Many ways to generate this:
• Topic modelling
• Clustering of documents
• Statistical analysis of interesting phrases (Word2Vec / GloVe / Dice Conceptual Search)
• Buy a dictionary (often doesn’t work for domain-specific search problems)
• Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain*
* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
82. Probabilistic Query Parser
Goal: given a query, predict which combinations of keywords should be combined together as phrases
Example: senior java developer hadoop
Possible parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"
Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
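Enumerating the candidate parsings is a simple recursive segmentation over adjacent tokens; a probabilistic parser would then score each candidate against a corpus-derived language model (the scoring step is omitted in this sketch):

```python
def segmentations(tokens):
    # yield every way of grouping adjacent tokens into phrases
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        head = " ".join(tokens[:i])       # first phrase candidate
        for rest in segmentations(tokens[i:]):
            yield [head] + rest

parses = list(segmentations("senior java developer hadoop".split()))
# an n-token query has 2^(n-1) contiguous segmentations
```

For the four-token example this yields eight candidates, i.e. the slide’s list plus the remaining grouping (senior, "java developer hadoop").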
84. Semantic Query Parsing
Identification of phrases in queries using two steps:
1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.*
2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model).
* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
87. Knowledge Graph: Semantic Data Encoded into Free Text Content
[Figure: graph nodes of several types. Character sequences: e, en, eng, engi, … ; terms: engineer, engineers; term sequences: software engineer, software engineers, electrical engineering, … ; documents:]
id: 1 | text: looking for a software engineer with degree in computer science or electrical engineering
id: 2 | text: apply to be a software engineer and work with other great software engineers
id: 3 | text: start a great career in electrical engineering
88. Knowledge Graph

Documents:
id: 1 | job_title: Software Engineer | desc: software engineer at a great company | skills: .Net, C#, java
id: 2 | job_title: Registered Nurse | desc: a registered nurse at hospital doing hard work | skills: oncology, phlebotemy
id: 3 | job_title: Java Developer | desc: a software engineer or a java engineer doing work | skills: java, scala, hibernate

Terms-Docs Inverted Index (field / term → doc: positions):
desc / a → doc1: 4; doc2: 1; doc3: 1, 5
desc / at → doc1: 3; doc2: 4
desc / company → doc1: 6
desc / doing → doc2: 6; doc3: 8
desc / engineer → doc1: 2; doc3: 3, 7
desc / great → doc1: 5
desc / hard → doc2: 7
desc / hospital → doc2: 5
desc / java → doc3: 6
desc / nurse → doc2: 3
desc / or → doc3: 4
desc / registered → doc2: 2
desc / software → doc1: 1; doc3: 2
desc / work → doc2: 10; doc3: 9
job_title / java developer → doc3: 1
…

Docs-Terms Forward Index (field / doc → terms):
desc / doc1 → a, at, company, engineer, great, software
desc / doc2 → a, at, doing, hard, hospital, nurse, registered, work
desc / doc3 → a, doing, engineer, java, or, software, work
job_title / doc1 → Software Engineer
…

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
89. Knowledge Graph: How the Graph Traversal Works
[Figure: three views of the same traversal. Set-theory view: docs 1-6 grouped into overlapping sets by skill (Java, Scala, Hibernate, Oncology). Graph view: skill nodes connected through the documents containing them. Data structure view: Java → docs 1, 2, 6; Scala, Hibernate → docs 3, 4; Oncology → doc 5]
Source: Grainger et al., “The Semantic Knowledge Graph”. DSAA 2016.
91. Knowledge Graph: Multi-level Traversal
[Figure: starting from skill "Java", an inverted index lookup finds the matching docs, a forward index lookup gathers co-occurring skills (Java, Scala, Hibernate, Oncology), and repeating the inverted/forward index lookups gathers related job titles (Java Developer, Software Engineer, Data Scientist) via has_related_job_title edges]
Source: Grainger et al., “The Semantic Knowledge Graph”. DSAA 2016.
92. Knowledge Graph: Scoring Nodes in the Graph
Foreground vs. Background Analysis: every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

        countFG(x) - totalDocsFG * probBG(x)
z = ------------------------------------------------
    sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))

Foreground Query: "Hadoop"
{ "type":"keywords", "values":[
  { "value":"hive", "relatedness": 0.9765, "popularity":369 },
  { "value":"spark", "relatedness": 0.9634, "popularity":15653 },
  { "value":".net", "relatedness": 0.5417, "popularity":17683 },
  { "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
  { "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
  { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }

Source: Grainger et al., “The Semantic Knowledge Graph”. DSAA 2016.
93. Knowledge Graph: Multi-level Graph Traversal with Scores
[Figure: starting from the materialized node "software engineer*", a has_related_skill traversal scores skill nodes (Java, C#, .NET, Hibernate, Scala, VB.NET) with relatedness scores between 0.34 and 0.93, and a further has_related_job_title traversal scores job title nodes (.NET Developer, Java Developer, Software Engineer, Data Scientist)]
Source: Grainger et al., “The Semantic Knowledge Graph”. DSAA 2016.
94. Knowledge Graph Use Case: Document Summarization
Experiment: pass in raw text (extracting phrases as needed), and rank the phrases’ similarity to the documents using the SKG. Additionally, the graph can be traversed to “related” entities/keyword phrases NOT found in the original document.
Applications: content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring.
101. Streaming API: Nuts and Bolts
• Relies on docValues (column-oriented data structure) and the /export handler
• Extreme read performance (8-10x faster than queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture:
  - SQL interface tier
  - Worker tier (scale a pool of worker “nodes” independently of the data collection)
  - Data tier (Solr collection)
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
102. Streaming Expressions - Examples
• Shortest-path graph traversal
• Parallel batch processing
• Train a logistic regression model
• Distributed joins
• Rapid export of all search results
• Pull results from an external database
• Classifying search results
Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions, http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
104. SQL is a natural extension for Solr’s parallel computing engine
• SQL is the ubiquitous language for analytics
• People: less training and easier to understand
• Tools! Solr as a JDBC data source (DbVisualizer, Apache Zeppelin, and SQuirreL SQL)
• Query planning / optimization can evolve iteratively
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
105. Mental Warm-up
Give me the top 5 action movies with rating of 4 or better

Solr facet query:
/select?q=*:*
&fq=genre_ss:action
&fq=rating_i:[4 TO *]
&facet=true
&facet.limit=5
&facet.mincount=1
&facet.field=title_s

{ ...
"facet_counts":{
  "facet_fields":{
    "title_s":[
      "Star Wars (1977)",501,
      "Return of the Jedi (1983)",379,
      "Godfather, The (1972)",351,
      "Raiders of the Lost Ark (1981)",348,
      "Empire Strikes Back, The (1980)",293]},
...}}

SQL equivalent:
SELECT title_s, COUNT(*) as cnt
FROM movielens
WHERE genre_ss='action'
AND rating_i='[4 TO *]'
GROUP BY title_s
ORDER BY cnt desc
LIMIT 5

{"result-set":{"docs":[
  {"title_s":"Star Wars (1977)","cnt":501},
  {"title_s":"Return of the Jedi (1983)","cnt":379},
  {"title_s":"Godfather, The (1972)","cnt":351},
  {"title_s":"Raiders of the Lost Ark (1981)","cnt":348},
  {"title_s":"Empire Strikes Back, The (1980)","cnt":293},
  {"EOF":true,"RESPONSE_TIME":42}]}}

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
106. SQL Examples

Give me the avg rating for men and women over 30 for romance movies:
SELECT gender_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
FROM movielens
WHERE genre_ss='romance' AND age_i='[30 TO *]'
GROUP BY gender_s
ORDER BY num_ratings desc

Give me the top 5 rated movies with at least 100 ratings:
SELECT title_s, genre_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
FROM movielens
GROUP BY title_s, genre_s
HAVING num_ratings >= 100
ORDER BY avg_rating desc
LIMIT 5

Give me the set of unique users that have rated documentaries:
SELECT DISTINCT(user_id_i) as user_id
FROM movielens
WHERE genre_ss='documentary'
ORDER BY user_id desc
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
107. Streaming Expression Example: hashJoin

parallel(workers,
  hashJoin(
    search(movielens, q=*:*,
      fl="user_id_i,movie_id_i,rating_i",
      sort="movie_id_i asc",
      partitionKeys="movie_id_i"),
    hashed=search(movielens_movies, q=*:*,
      fl="movie_id_i,title_s,genre_s",
      sort="movie_id_i asc",
      partitionKeys="movie_id_i"),
    on="movie_id_i"
  ),
  workers="4",
  sort="movie_id_i asc"
)

• The small “right” side of the join gets loaded into memory on each worker node
• Each shard is queried by N workers, so 4 workers x 4 shards means 16 queries (usually all replicas per shard are hit)
• A workers collection isolates parallel computation nodes from data nodes
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
108. How we use the Solr streaming API in Fusion
• The spark-solr project uses the streaming API to pull data from Solr into Spark jobs if docValues are enabled; see: https://github.com/lucidworks/spark-solr
• Perform aggregations of “signals”, e.g. clicks, to compute boosts and recommendations using Spark
• Custom Scala script jobs to perform complex analysis on data in Solr, e.g. sessionize request logs
• Power rich data visualizations using Fusion’s SQL engine, powered by SparkSQL + Solr streaming aggregations
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
110. Comparing SQL Capabilities (Fusion vs. Solr vs. Hive vs. Drill vs. SparkSQL)

Secret Sauce:
• Fusion: SparkSQL benefits + Solr benefits + enterprise security
• Solr: push complex query constructs into the engine (full text, spatial, relevancy, graph, functions, etc.)
• Hive: mature SQL solution for the Hadoop stack
• Drill: execute SQL over NoSQL data sources
• SparkSQL: Spark core (optimized shuffle, in-memory, etc.), integration of other APIs: ML, Streaming, GraphX

SQL Features:
• Fusion: Maturing; Solr: Evolving; Hive: Mature; Drill: Maturing; SparkSQL: Maturing

Scaling:
• Fusion: linear (shards and replicas), backed by inverted index
• Solr: linear (shards and replicas), backed by inverted index
• Hive: limited by Hadoop infrastructure (table scans)
• Drill: good, but need to benchmark
• SparkSQL: memory intensive; scale out using Spark cluster, backed by RDDs

Integration w/ external systems:
• Fusion: Analytics Catalog API, JDBC driver, ODBC bridge
• Solr: JDBC stream source
• Hive: external tables / plugin API
• Drill: many drivers available
• SparkSQL: DataSource API, many systems supported
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
113. Graph Use Cases
• Anomaly detection / fraud detection
• Recommenders
• Social network analysis
• Graph search
• Access control
• Relationship discovery / scoring
Examples:
o Find all draft blog posts about “Parallel SQL” written by a developer
o Find all tweets mentioning “Solr” by me or people I follow
o Find 3-star hotels in NYC my friends stayed in last year
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
114. Solr Graph Timeline
• Some data is much more naturally represented as a graph structure
• Solr 6.0: introduced the Graph Query Parser
• Solr 6.1: introduced graph Streaming Expressions
• Solr 6.3: current version
• TBD: Semantic Knowledge Graph (patch available)
115. Graph Query Parser
• Query-time, cycle-aware graph traversal able to rank documents based on relationships
• Provides controls for depth, filtering of results, and inclusion of root and/or leaves
• Limitation: single node/shard only
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
116. Graph Streaming Expressions
• Part of Solr’s broader Streaming Expressions capability
• Implements a powerful, breadth-first traversal
• Works across shards AND collections
• Supports aggregations
• Cycle aware

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'expr=…' "http://localhost:18984/solr/movielens/stream"
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
117. All movies that user 389 watched
expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
118. All movies that viewers of a specific movie watched
expr:gatherNodes(movielens,
gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true"
)
Movie 161: “The Air Up There”
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
119. Collaborative Filtering
expr=top(n="5", sort="count(*) desc",
gatherNodes(movielens,
top(n="30", sort="count(*) desc",
gatherNodes(movielens,
search(movielens, q="user_id_i:305", fl="movie_id_i",
sort="movie_id_i asc", qt="/export"),
walk="movie_id_i->movie_id_i", gather="user_id_i",
maxDocFreq="10000", count(*)
)
),
walk="node->user_id_i", gather="movie_id_i", count(*)
)
)
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
120. Comparing Graph Choices (Solr vs. Elastic Graph vs. Neo4j vs. Spark GraphX)

Best Use Case:
• Solr: QParser: predefined relationships as filters; Expressions: fast, query-based, distributed graph ops
• Elastic Graph: limited to sequential, term-relatedness exploration only
• Neo4j: graph ops and querying that fit on a single node
• Spark GraphX: large-scale, iterative graph ops

Common Graph Algorithms (e.g. Pregel, Traversal):
• Solr: Partial; Elastic Graph: No; Neo4j: Yes; Spark GraphX: Yes

Scaling:
• Solr: QParser: co-located shards only; Expressions: yes
• Elastic Graph: Yes; Neo4j: Master/Replica; Spark GraphX: Yes

Commercial License Required:
• Solr: No; Elastic Graph: Yes; Neo4j: GPLv3; Spark GraphX: No

Visualizations:
• Solr: GraphML support (e.g. Gephi); Elastic Graph: Kibana; Neo4j: Neo4j browser; Spark GraphX: 3rd party
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.