Measuring Relevance in the Negative Space
Trey Grainger
Chief Algorithms Officer, Lucidworks
@treygrainger
Trey Grainger
Chief Algorithms Officer
• Previously: SVP of Engineering @ Lucidworks; Director of Engineering @ CareerBuilder
• Georgia Tech – MBA, Management of Technology
• Furman University – BA, Computer Science, Business, & Philosophy
• Stanford University – Information Retrieval & Web Search
Other fun projects:
• Co-author of Solr in Action, plus numerous research publications
• Advisor to Presearch, the decentralized search engine
• Lucene / Solr contributor
About Me
Agenda
• Fraudulent AI
• Adversarial Machine Learning
• Cancer
• War
• Bikinis
• Brainwashing
• Alt-right
• White Supremacism
• Time Travel
• Avengers Endgame
Spoilers
• Negative Space
• Dark Data
• Pornography
• Global Warming
• Algorithmic Bias
• Diet & Exercise
• Self-crashing Cars
• Racism
• Sexism
Who are we?
230 CUSTOMERS ACROSS THE
FORTUNE 1000
400+ EMPLOYEES
OFFICES IN
San Francisco, CA (HQ)
Raleigh-Durham, NC
Cambridge, UK
Bangalore, India
Hong Kong
The Search & AI Conference
COMPANY BEHIND
Employ about
40% of the active
committers on
the Solr project
40%
Contribute over
70% of Solr's open
source codebase
70%
DEVELOP & SUPPORT
Apache
The standard for
enterprise
search.
of Fortune 500
uses Solr.
90%
Industry’s most powerful
Intelligent Search & Discovery Platform.
Let the most respected
analysts in the world
speak on our behalf
Dassault Systèmes
Mindbreeze
Coveo
Microsoft
Attivio
Expert System
Smartlogic
Sinequa
IBM
IHS Markit
Funnelback
Micro Focus
COMPLETENESS OF VISION
ABILITY TO EXECUTE
CHALLENGERS LEADERS
NICHE PLAYERS VISIONARIES
Source: June 2018 Gartner Magic Quadrant report on Insight Engines.
© Gartner, Inc.
Goals of this Talk
1. Help identify patterns for uncovering overlooked
data hidden in plain sight
2. Point out current failures and dangers of
overlooking this negative space.
3. Discuss applications to my field (information
retrieval) and how my company is working to
overcome some of these failures in our own
technology.
So what is
?
Negative Space in Data Science
• Definition: “The missing or hidden data that gives shape to the
data you do have”
• If you think of your data within a vector space, then it’s closely analogous
to negative space in art (art is just usually projected onto two
dimensions)
• “Negative” is a polysemous word. It can mean
“undesirable/bad” or it can mean “taken away/not there”.
• This talk intentionally uses both senses to make the point that
not leveraging missing or hidden data often leads to
bad/undesirable outcomes.
Data
System Generated
Human Generated
Application Generated
Content
Index
Facet,
Topic &
Cluster
Query
Rule
Matching
Natural
Language
Machine
Learning
Boosted
Results
Signals
Search & Discovery
Customer Analytics
Digital Commerce
40%
of the S&P 500 will be extinct in 10 years
Filling in the Negative
Space
aka: connecting the dots, or traversing the knowledge graph
https://svs.gsfc.nasa.gov/30919
What is this a picture of?
Stars in the Sky Lights on a Map
Mouse Brain
with Dementia Jellyfish Larvae
https://svs.gsfc.nasa.gov/30919
Any idea?
How about now?
If we zoom out a little bit…
And if we keep zooming out…
We see a map of all lights in the world
And similar patterns emerge in other
contexts…
Let’s explore airline flight patterns…
https://xkcd.com/1138/
Heatmap
Watson: “You appeared to [see a good deal] which was quite invisible to me”
Sherlock: “Not invisible but unnoticed, Watson. You did not know
where to look, and so you missed all that was important.”
The Adventures of Sherlock Holmes, Adventure III: A Case of Identity, Sir Arthur Conan Doyle
Head?
Pipe?
Coat Collar? Back of Hat?
Hat?
Smoke?
Nose?
Abstract Concept of
Detective with Pipe
Specific hypothesis from Experience (leveraging social cue that this is probably a well-known answer)
Detective (Deerstalker) Hat!
Final Answer + conceptual context
Fighting Algorithmic Bias
aka: slapping ourselves in the face for a bit
Ok, Google…
Is Agave Nectar good for you?
So I bought a few…
…and then one day I checked again…
!
Ok, so AI can definitely be wrong,
but can it be malicious?
Racist Algorithms?
Sexist Algorithms?
Creepy Algorithms?
Negligent Algorithms?
Fraudulent Algorithms?
Malicious Algorithms?
Adversarial
Machine Learning
“Adversarial Patch”, Tom B. Brown et al., 2017.
Racist Algorithms?
Sexist Algorithms?
Creepy Algorithms?
Negligent Algorithms?
Fraudulent Algorithms?
Malicious Algorithms?
Fraudulent Algorithms?
“Adversarial Attacks on Medical
Machine Learning”, Samuel G.
Finlayson et al., 2019.
Fraudulent Algorithms?
Negligent Algorithms?
Negligent Algorithms?
Racist Algorithms?
Sexist Algorithms?
Racist Algorithms?
Sexist Algorithms?
Sexist Algorithms?
Creepy Algorithms?
Manual Override By Facebook
Still Available through
Query Variations
Sexist Algorithms?
Creepy Algorithms?
Malicious Algorithms?
Malicious Algorithms?
Racist Algorithms?
Sexist Algorithms?
Creepy Algorithms?
Negligent Algorithms?
Fraudulent Algorithms?
Malicious Algorithms?
Biased Algorithms!
YouTube: Relevance = “Most likely to capture attention” (ads)
Facebook: Relevance = “Most likely to capture attention” (ads)
Amazon: Relevance = “Satisfied Customer Purchases” (purchases)
Lucidworks: Relevance = “Whatever our customers want it to be…”
Why the bias?
So how can we help
our customers
avoid these pitfalls?
Search-Driven
Everything
Customer
Service
Customer
Insights
Fraud Surveillance
Research
Portal
Online Retail Digital Content
Significance of Feedback Loops
User
Searches
User
Sees
Results
User
takes an
action
Users’ actions
inform system
improvements
Southern Data Science
Signal Boosting
User
Searches
User
Sees
Results
User
takes an
action
Users’ actions
inform system
improvements
User    Query  Results
Alonzo  pizza  doc10, doc22, doc12, …
Elena   soup   doc84, doc2, doc17, …
Ming    pizza  doc10, doc22, doc12, …
…       …      …

User    Action    Document
Alonzo  click     doc22
Elena   click     doc17
Ming    click     doc12
Alonzo  purchase  doc22
Ming    click     doc22
Ming    purchase  doc22
Elena   click     doc2
…       …         …

Query  Document  Signal Boost
pizza  doc22     54,321
pizza  doc12     987
soup   doc17     1,234
soup   doc2      2,345
…      …
pizza ⌕
query: pizza
boost: doc22^54321
boost: doc12^987
ƒ(x) = Σ(click * click_weight * time_decay) +
Σ(purchase * purchase_weight * time_decay)
+ other_factors
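The aggregation above can be sketched in Python. This is a minimal illustration rather than the production implementation: the action weights, the 30-day half-life, and the function names are assumptions made for the example, and other_factors is left out.

```python
from collections import defaultdict

# Hypothetical weights and half-life; the slides do not give concrete values.
ACTION_WEIGHTS = {"click": 1.0, "purchase": 25.0}
HALF_LIFE_DAYS = 30.0


def time_decay(age_days, half_life_days=HALF_LIFE_DAYS):
    """Exponential decay: a signal loses half its weight every half-life."""
    return 0.5 ** (age_days / half_life_days)


def signal_boosts(signals):
    """Sum weighted, decayed signals into one boost per (query, doc) pair."""
    boosts = defaultdict(float)
    for query, doc, action, age_days in signals:
        boosts[(query, doc)] += ACTION_WEIGHTS[action] * time_decay(age_days)
    return dict(boosts)


def boost_params(query, boosts):
    """Render the boosts for one query as doc^weight terms, strongest first."""
    ranked = sorted(
        ((doc, weight) for (q, doc), weight in boosts.items() if q == query),
        key=lambda item: item[1],
        reverse=True,
    )
    return [f"{doc}^{weight:.2f}" for doc, weight in ranked]


# Each signal: (query, document, action, age of the signal in days)
signals = [
    ("pizza", "doc22", "click", 0),
    ("pizza", "doc22", "purchase", 0),
    ("pizza", "doc12", "click", 0),
    ("pizza", "doc22", "click", 30),
    ("soup", "doc17", "click", 0),
]
boosts = signal_boosts(signals)
print(boost_params("pizza", boosts))
```

Keeping the weights and the decay in one place makes it easy to tune how much a purchase outweighs a click, or how quickly old behavior stops counting.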
Search
ipad
Search
Search
ipad
• 200%+ increase in
click-through rates
• 91% lower TCO
• 50,000 fewer support
tickets
• Increased customer
satisfaction
Signal Boosting
• Benefits: dramatically improves relevance (increased conversions,
most popular documents / answers at the top)
• Risks:
• Reinforces current biases: Documents at the top already are more likely to be
clicked on / purchased / interacted with, and therefore diversity is harder to
achieve
• Solution: Learning to Rank: Learn relevance patterns and feature weights from
aggregate behavior instead of overfitting to specific documents
• Subject to Manipulation: Once users realize their behaviors (searches, clicks,
etc.) influence the ranking, they can manipulate the engine with fake actions
to boost or bury content through adversarial actions.
• Solutions:
• Session-filtering: limit to one action, per-type, per user. Further limit by IP
address, browser fingerprint, etc. if necessary
• Quantity vs. Quality Weighting: For users acting on lots of queries or documents,
reduce the weight of each action proportionate to the total actions. The
more actions taken per user, the less they count toward the aggregate.
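The two mitigations above can be sketched as follows. This is one possible reading, not a reference implementation: deduplicating on the full (user, query, document, action) tuple and weighting each action by 1 / (that user's total actions) are assumptions, and the sample users and documents are made up.

```python
from collections import Counter, defaultdict


def session_filter(signals):
    """Keep at most one signal per (user, query, doc, action) combination."""
    seen = set()
    kept = []
    for signal in signals:
        if signal not in seen:
            seen.add(signal)
            kept.append(signal)
    return kept


def quantity_weighted_counts(signals):
    """Weight each signal by 1 / (number of signals from the same user)."""
    per_user = Counter(user for user, _query, _doc, _action in signals)
    totals = defaultdict(float)
    for user, query, doc, _action in signals:
        totals[(query, doc)] += 1.0 / per_user[user]
    return dict(totals)


raw = [
    ("bot", "pizza", "doc99", "click"),
    ("bot", "pizza", "doc99", "click"),  # repeated action, dropped by the filter
    ("bot", "soup", "doc99", "click"),
    ("bot", "salad", "doc99", "click"),
    ("bot", "pasta", "doc99", "click"),
    ("alonzo", "pizza", "doc22", "click"),
]
filtered = session_filter(raw)
weighted = quantity_weighted_counts(filtered)
print(weighted[("pizza", "doc99")], weighted[("pizza", "doc22")])
```

In this sample the account that clicks the same document across many queries contributes only a fraction of a vote per query, while the single organic click keeps its full weight.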
Learning to Rank (LTR)
● It applies machine learning techniques to discover the
combination of features that provides the best ranking.
● It requires a labeled set of documents with relevancy scores for
a given set of queries
● Features used for ranking are usually more computationally
expensive than the ones used for matching
● It typically re-ranks a subset of the matched documents (e.g. top
1000)
# Run Searches
http://localhost:8983/solr/techproducts/select?q=ipod
# Supply User Relevancy Judgements
nano contrib/ltr/example/user_queries.txt
#Format: query | doc id | relevancy judgement | source
# Train and Upload Model
./train_and_upload_demo_model.py -c config.json
# Re-run Searches using Machine-learned Ranking Model
http://localhost:8984/solr/techproducts/browse?q=ipod
&rq={!ltr model=exampleModel reRankDocs=100 efi.user_query=$q}
Collaborative Filtering (Recommendations)
User
Searches
User
Sees
Results
User
takes an
action
Users’ actions
inform system
improvements
User    Query  Results
Alonzo  pizza  doc10, doc22, doc12, …
Elena   soup   doc84, doc2, doc17, …
Ming    pizza  doc10, doc22, doc12, …
…       …      …

User    Action    Document
Alonzo  click     doc22
Elena   click     doc17
Ming    click     doc12
Alonzo  purchase  doc22
Ming    click     doc22
Ming    purchase  doc12
Elena   click     doc2
…       …         …

User    Item   Weight
Alonzo  doc22  1.0
Alonzo  doc12  0.4
…       …      …
Ming    doc12  0.9
Ming    doc22  0.6
…       …      …
pizza ⌕
Matrix Factorization
Recommendations for Alonzo:
• doc22: “Pepperoni Pizza”
• doc12: “Cheese Pizza”
…
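The matrix factorization step can be illustrated with a small pure-Python sketch. It is only a toy under stated assumptions: the Alonzo and Ming weights come from the table above, Elena's rows are hypothetical, and the rank, learning rate, regularization, and epoch count are arbitrary choices. It also scores only items a user has not yet interacted with, which is a design choice of this sketch.

```python
import random

# Alonzo and Ming weights are from the slide; Elena's rows are hypothetical.
observed = {
    ("Alonzo", "doc22"): 1.0,
    ("Alonzo", "doc12"): 0.4,
    ("Ming", "doc12"): 0.9,
    ("Ming", "doc22"): 0.6,
    ("Elena", "doc17"): 1.0,
    ("Elena", "doc22"): 0.7,
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mse(user_f, item_f):
    """Mean squared error of the factor model on the observed weights."""
    errors = [(w - dot(user_f[u], item_f[i])) ** 2 for (u, i), w in observed.items()]
    return sum(errors) / len(errors)


def factorize(rank=2, epochs=2000, lr=0.05, reg=0.001, seed=0):
    """Learn user and item factor vectors with stochastic gradient descent."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in observed})
    items = sorted({i for _, i in observed})
    user_f = {u: [rng.uniform(0.1, 0.5) for _ in range(rank)] for u in users}
    item_f = {i: [rng.uniform(0.1, 0.5) for _ in range(rank)] for i in items}
    start_error = mse(user_f, item_f)
    for _ in range(epochs):
        for (u, i), w in observed.items():
            err = w - dot(user_f[u], item_f[i])
            for k in range(rank):
                pu, qi = user_f[u][k], item_f[i][k]
                user_f[u][k] += lr * (err * qi - reg * pu)
                item_f[i][k] += lr * (err * pu - reg * qi)
    return user_f, item_f, start_error


def recommend(user, user_f, item_f):
    """Score items the user has not interacted with, best first."""
    seen = {i for (u, i) in observed if u == user}
    scores = {i: dot(user_f[user], f) for i, f in item_f.items() if i not in seen}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


user_f, item_f, start_error = factorize()
final_error = mse(user_f, item_f)
print(recommend("Alonzo", user_f, item_f))
```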
Collaborative Filtering
• Benefits: crowd-sources related content discovery based on real user
interactions with no a priori understanding of the content required
• Risks:
• Reinforces biases: People interact with what they are recommended, so those
same items get recommended to the next person ad infinitum
• Solutions:
• Combine with Content-based Features: Multi-modal recommendations enable
mixing non-behavior-based matches and overcome the cold-start problem
• Only Count Explicit Actions: If content is on “autoplay”, don’t assume an
interaction is positive. Only count explicit clicks, likes, dislikes, etc.
• Inject Conceptual Diversity: Use techniques like concept clustering or the
Semantic Knowledge Graph to determine key conceptual differences between
content, and ensure results coming back represent diverse viewpoints and not
just identical ones.
• Subject to Manipulation: Same concerns as Signal Boosting
• Solutions: Same solutions as Signal Boosting (Session-filtering,
Quantity vs. Quality Weighting)
What is the Negative Space
between two words?
What’s in the Negative Space Between
the words “Jean Grey” and “In Love”?
Jean
Grey
In Love
Semantic Knowledge Graph
Content-based Recommendations
http://localhost:8983/solr/job-postings/skg
Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term is scored against its context. The more
commonly the term appears within its foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
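As a rough sketch, the z-score above translates directly into Python. The function name and the document counts are hypothetical, picked only to show a term that is over-represented in the foreground and one that is under-represented. The function returns the raw z-score and does not attempt any normalization into a bounded relatedness value.

```python
import math


def relatedness_z(count_fg, total_docs_fg, count_bg, total_docs_bg):
    """z-score of a term's foreground count against its background rate."""
    prob_bg = count_bg / total_docs_bg
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    return (count_fg - expected) / std_dev


# Hypothetical counts: 1,000 foreground docs drawn from a 1,000,000-doc corpus.
z_hive = relatedness_z(count_fg=300, total_docs_fg=1000,
                       count_bg=400, total_docs_bg=1_000_000)
z_teacher = relatedness_z(count_fg=2, total_docs_fg=1000,
                          count_bg=10_000, total_docs_bg=1_000_000)
z_neutral = relatedness_z(count_fg=10, total_docs_fg=1000,
                          count_bg=10_000, total_docs_bg=1_000_000)
print(z_hive, z_teacher, z_neutral)
```

A term that appears in the foreground at exactly its background rate scores zero, which matches the idea of ignoring terms that are equally likely to appear in the background corpus.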
{ "type":"keywords", "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
+
-
Foreground Query:
"Hadoop"
Knowledge
Graph
Techniques like the
Semantic Knowledge Graph
can be used to score
“diversity” across content,
which can aid in reducing
the bias of Signals and
Collaborative Filtering.
So, can we go back in time and fix our mistakes?
No, but we do have a wizard….
User
Searches
User
Sees
Results
User
takes an
action
Well, today, most of us run
A/B experiments to test hypotheses and
“limit” the unknown negative impact
to a subset of users
What if we could use the negative space
to view alternate futures…
…and then make only the specific choices
that will achieve the desired outcomes
What if we could
simulate users’ responses to
changes before having to expose
real users to those changes?
User
Searches
User
Sees
Results
User
takes an
action
Users’ actions
inform system
improvements
User    Query  Results
Alonzo  pizza  doc10, doc22, doc12, …
Elena   soup   doc84, doc2, doc17, …
Ming    pizza  doc10, doc22, doc12, …
…       …      …

User    Action    Document
Alonzo  click     doc22
Elena   click     doc17
Ming    click     doc10
Alonzo  purchase  doc22
Ming    click     doc22
Ming    purchase  doc22
Elena   click     doc2
…       …         …
We DO have historical user behavior,
but it’s biased to the current
algorithm...
The click and purchase
counts are all higher
for docs that are already
ranked higher, since
they’re seen more often…
What other data do we have available
that we’re not leveraging?
What we already know:
• What the user searched
• What the user interacted with (click,
purchase)
• Results returned to the user
What would we ideally like to know?
• Which documents are relevant (user liked)
• Which documents are irrelevant (user
didn’t like)
• What is the ideal ranking of documents?
Can we use the Negative Space to connect the dots?
How to infer relevance?
Rank Document ID
1 Doc1
2 Doc2
3 Doc3
4 Doc4
[Figure: a Click Graph and a Skip Graph, each linking the Query to Doc1, Doc2, and Doc3 with binary edge values (click graph: 0, 1, 1; skip graph: 1, 0, 0)]
From this click-skip graph, we
can generate a ground truth
data set mapping known
queries to an ideal ranking
of documents.
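One way to derive such judgments is sketched below. The rule that a document counts as skipped when it was not clicked but something ranked below it was, and the clicks / (clicks + skips) score, are assumptions of this sketch rather than the method confirmed by the slides. The impression data is made up.

```python
from collections import defaultdict


def click_skip_counts(impressions):
    """Count clicks and skips per (query, doc).

    A document counts as skipped when the user did not click it but clicked
    something ranked below it, which suggests they saw it and passed it over.
    """
    clicks = defaultdict(int)
    skips = defaultdict(int)
    for query, ranked_docs, clicked in impressions:
        clicked_positions = [pos for pos, doc in enumerate(ranked_docs) if doc in clicked]
        if not clicked_positions:
            continue  # no click: we cannot tell which results were examined
        last_click = max(clicked_positions)
        for doc in ranked_docs[: last_click + 1]:
            if doc in clicked:
                clicks[(query, doc)] += 1
            else:
                skips[(query, doc)] += 1
    return clicks, skips


def judgments(impressions):
    """Relevance estimate per (query, doc): clicks / (clicks + skips)."""
    clicks, skips = click_skip_counts(impressions)
    keys = set(clicks) | set(skips)
    return {k: clicks[k] / (clicks[k] + skips[k]) for k in keys}


def ideal_ranking(query, scores):
    """Order a query's documents by estimated relevance, best first."""
    docs = [(doc, s) for (q, doc), s in scores.items() if q == query]
    return [doc for doc, _ in sorted(docs, key=lambda d: d[1], reverse=True)]


# Each impression: (query, results in ranked order, set of clicked documents)
results = ["doc10", "doc22", "doc12"]
impressions = [
    ("pizza", results, {"doc22"}),
    ("pizza", results, {"doc12"}),
    ("pizza", results, {"doc22"}),
    ("pizza", results, {"doc10"}),
]
scores = judgments(impressions)
print(ideal_ranking("pizza", scores))
```

With counts this small the estimates are noisy, so a real data set would likely need a minimum number of observations per query-document pair before trusting a judgment.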
How to Measure Relevance?
A = Retrieved Documents, C = Relevant Documents, B = documents in both (retrieved and relevant)
Precision = B / A
Recall = B / C
Problem:
Suppose precision = 90% and recall = 100%, but the 10% of retrieved documents that are
irrelevant are ranked at the top of the results. Is that OK?
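A small sketch makes the problem concrete: set-based precision and recall give the same numbers whether the irrelevant document is ranked first or last. The document IDs are made up for the illustration.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall; ranking order is ignored."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)


relevant = {f"rel{i}" for i in range(9)}
irrelevant_first = ["junk"] + sorted(relevant)  # irrelevant doc ranked at the top
irrelevant_last = sorted(relevant) + ["junk"]   # irrelevant doc ranked at the bottom

print(precision_recall(irrelevant_first, relevant))
print(precision_recall(irrelevant_last, relevant))
```

Both rankings score identically, which is why a position-aware metric is needed.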
Discounted Cumulative Gain
Rank Relevancy
1 0.95
2 0.65
3 0.80
4 0.85
Rank Relevancy
1 0.95
2 0.65
3 0.80
4 0.85
Ranking
Ideal
Given
• Position is
considered in
quantifying
relevancy.
• Labeled dataset
is required.
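A minimal sketch of DCG and its normalized form (NDCG) follows, assuming the common rel / log2(rank + 1) discount; the slide does not say which variant it uses. The relevancy values are the ones listed by rank above.

```python
import math


def dcg(relevancies):
    """Discounted cumulative gain: rel / log2(rank + 1), summed over ranks 1..n."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevancies, start=1))


def ndcg(relevancies):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevancies, reverse=True))
    return dcg(relevancies) / ideal if ideal > 0 else 0.0


given = [0.95, 0.65, 0.80, 0.85]  # relevancy by rank, from the slide
print(round(dcg(given), 3), round(ndcg(given), 3))
```

Because the less relevant document sits at rank 2, the given ranking scores slightly below the ideal ordering, and NDCG comes out just under 1.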
User    Query  Results
Alonzo  pizza  doc10, doc22, doc12, …
Elena   soup   doc84, doc2, doc17, …
Ming    pizza  doc10, doc22, doc12, …
…       …      …

User    Action    Document
Alonzo  click     doc22
Elena   click     doc17
Ming    click     doc10
Alonzo  purchase  doc22
Ming    click     doc22
Ming    purchase  doc22
Elena   click     doc2
…       …         …
Relevance Backtesting Simulation
Did we cover our Agenda?
• Fraudulent AI
• Adversarial Machine Learning
• Cancer
• War
• Bikinis
• Brainwashing
• Alt-right
• White Supremacism
• Time Travel
• Avengers Endgame
Spoilers
• Negative Space
• Dark Data
• Pornography
• Global Warming
• Algorithmic Bias
• Diet & Exercise
• Self-crashing Cars
• Racism
• Sexism
Goals of this Talk
1. Help identify patterns for uncovering overlooked
data hidden in plain sight
2. Point out current failures and dangers of
overlooking this negative space.
3. Discuss applications to my field (information
retrieval) and how my company is working to
overcome some of these failures in our own
technology.
Trey Grainger
trey.grainger@lucidworks.com
@treygrainger
Thank you!
http://solrinaction.com
Other presentations:
http://www.treygrainger.com
Discount code: ctwdsc19
Book Signing
3:00 pm today!
(coffee break)
@ Registration Desk

Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Measuring Relevance in the Negative Space

  • 1. Measuring Relevance in the Negative Space. Trey Grainger, Chief Algorithms Officer, Lucidworks, @treygrainger
  • 2. Trey Grainger Chief Algorithms Officer • Previously: SVP of Engineering @ Lucidworks; Director of Engineering @ CareerBuilder • Georgia Tech – MBA, Management of Technology • Furman University – BA, Computer Science, Business, & Philosophy • Stanford University – Information Retrieval & Web Search Other fun projects: • Co-author of Solr in Action, plus numerous research publications • Advisor to Presearch, the decentralized search engine • Lucene / Solr contributor About Me
  • 3. Agenda • Fraudulent AI • Adversarial Machine Learning • Cancer • War • Bikinis • Brainwashing • Alt-right • White Supremacism • Time Travel • Avengers Endgame Spoilers • Negative Space • Dark Data • Pornography • Global Warming • Algorithmic Bias • Diet & Exercise • Self-crashing Cars • Racism • Sexism
  • 4. Who are we? 230 CUSTOMERS ACROSS THE FORTUNE 1000 400+ EMPLOYEES OFFICES IN San Francisco, CA (HQ) Raleigh-Durham, NC Cambridge, UK Bangalore, India Hong Kong The Search & AI Conference COMPANY BEHIND Employ about 40% of the active committers on the Solr project 40% Contribute over 70% of Solr's open source codebase 70% DEVELOP & SUPPORT Apache
  • 5. The standard for enterprise search. 90% of Fortune 500 uses Solr.
  • 7. Industry’s most powerful Intelligent Search & Discovery Platform.
  • 8. Let the most respected analysts in the world speak on our behalf Dassault Systèmes Mindbreeze Coveo Microsoft Attivio Expert System Smartlogic Sinequa IBM IHS Markit Funnelback Micro Focus COMPLETENESS OF VISION ABILITY TO EXECUTE CHALLENGERS LEADERS NICHE PLAYERS VISIONARIES Source: June 2018 Gartner Magic Quadrant report on Insight Engines. © Gartner, Inc.
  • 9. Goals of this Talk 1. Help identify patterns for uncovering overlooked data hidden in plain sight 2. Point out current failures and dangers of overlooking this negative space. 3. Discuss applications to my field (information retrieval) and how my company is working to overcome some of these failures in our own technology.
  • 12. Negative Space in Data Science • Definition: “The missing or hidden data that gives shape to the data you do have” • If you think of your data within a vector space, then it’s closely analogous to negative space in art (art is just usually projected onto two dimensions) • “Negative” is a polysemous word. It can mean “undesirable/bad” or it can mean “taken away/not there”. • This talk intentionally uses both senses to make the point that not leveraging missing or hidden data often leads to bad/undesirable outcomes.
  • 19. Data System Generated Human Generated Application Generated Content Index Facet, Topic & Cluster Query Rule Matching Natural Language Machine Learning Boosted Results Signals Search & Discovery Customer Analytics Digital Commerce
  • 21. 40% of the S&P 500 will be extinct in 10 years
  • 23. Filling in the Negative Space aka: connecting the dots, or traversing the knowledge graph
  • 25. Stars in the Sky Lights on a Map Mouse Brain with Dementia Jellyfish Larvae
  • 28. If we zoom out a little bit…
  • 29. And if we keep zooming out… We see a map of all lights in the world
  • 30. And similar patterns emerge in other contexts… Let’s explore airline flight patterns…
  • 36. Watson: “You appeared to [see a good deal] which was quite invisible to me” Sherlock: “Not invisible but unnoticed, Watson. You did not know where to look, and so you missed all that was important.” The Adventures of Sherlock Holmes, Adventure III: A Case of Identity, Sir Arthur Conan Doyle
  • 37. Head? Pipe? Coat Collar? Back of Hat? Hat? Smoke? Nose? Abstract Concept of Detective with Pipe Specific hypothesis from Experience (leveraging social cue that this is probably a well-known answer) Detective (Deerstalker) Hat! Final Answer + conceptual context
  • 38. Fighting Algorithmic Bias aka: slapping ourselves in the face for a bit
  • 40. Ok, Google… Is Agave Nectar good for you?
  • 42. So I bought a few…
  • 43. …and then one day I checked again… !
  • 45. Ok, so AI can definitely be wrong, but can it be malicious?
  • 46. Racist Algorithms? Sexist Algorithms? Creepy Algorithms? Negligent Algorithms? Fraudulent Algorithms? Malicious Algorithms?
  • 49. “Adversarial Patch”, Tom B. Brown et al., 2017.
  • 50. Racist Algorithms? Sexist Algorithms? Creepy Algorithms? Negligent Algorithms? Fraudulent Algorithms? Malicious Algorithms?
  • 52. “Adversarial Attacks on Medical Machine Learning”, Samuel G. Finlayson et al., 2019.
  • 64. Manual Override By Facebook Still Available through Query Variations
  • 70. Racist Algorithms? Sexist Algorithms? Creepy Algorithms? Negligent Algorithms? Fraudulent Algorithms? Malicious Algorithms?
  • 72. Why the bias? YouTube: Relevance = “Most likely to capture attention” (ads) Facebook: Relevance = “Most likely to capture attention” (ads) Amazon: Relevance = “Satisfied Customer Purchases” (purchases) Lucidworks: Relevance = “Whatever our customers want it to be…”
  • 73. So how can we help our customers avoid these pitfalls?
  • 75. Significance of Feedback Loops User Searches User Sees Results User takes an action Users’ actions inform system improvements Southern Data Science
  • 76. Signal Boosting
    Feedback loop: User Searches; User Sees Results; User takes an action; Users’ actions inform system improvements.
    Search log:
      User   | Query | Results
      Alonzo | pizza | doc10, doc22, doc12, …
      Elena  | soup  | doc84, doc2, doc17, …
      Ming   | pizza | doc10, doc22, doc12, …
      …      | …     | …
    Signals:
      User   | Action   | Document
      Alonzo | click    | doc22
      Elena  | click    | doc17
      Ming   | click    | doc12
      Alonzo | purchase | doc22
      Ming   | click    | doc22
      Ming   | purchase | doc22
      Elena  | click    | doc2
      …      | …        | …
    Aggregated boosts:
      Query | Document | Signal Boost
      pizza | doc22    | 54,321
      pizza | doc12    | 987
      soup  | doc17    | 1,234
      soup  | doc2     | 2,345
      …     | …        | …
    Boosted search for “pizza”: query: pizza, boost: doc22^54321, boost: doc12^987
    ƒ(x) = Σ(click * click_weight * time_decay) + Σ(purchase * purchase_weight * time_decay) + other_factors
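The boost formula on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the action weights, the 30-day half-life, the exponential form of `time_decay`, and the function names are all hypothetical, since the slide gives the shape of the formula but no concrete values, and `other_factors` is left out.

```python
from collections import defaultdict

# Hypothetical values: the slide names click_weight, purchase_weight and
# time_decay but does not say what they are.
ACTION_WEIGHTS = {"click": 1.0, "purchase": 5.0}
HALF_LIFE_DAYS = 30.0


def time_decay(age_days, half_life_days=HALF_LIFE_DAYS):
    """Assumed exponential decay: a signal halves in value every half-life."""
    return 0.5 ** (age_days / half_life_days)


def aggregate_boosts(signals):
    """signals: iterable of (query, doc_id, action, age_days) tuples.

    Returns {query: {doc_id: boost}}, summing weight * time_decay per signal.
    """
    boosts = defaultdict(lambda: defaultdict(float))
    for query, doc_id, action, age_days in signals:
        weight = ACTION_WEIGHTS.get(action, 0.0)
        boosts[query][doc_id] += weight * time_decay(age_days)
    return {query: dict(docs) for query, docs in boosts.items()}


def boost_clause(query, boosts, top_n=10):
    """Render the strongest boosts for a query as doc^boost terms."""
    ranked = sorted(boosts.get(query, {}).items(),
                    key=lambda item: item[1], reverse=True)
    return " ".join(f"{doc}^{score:.2f}" for doc, score in ranked[:top_n])


# The signal rows from the slide, all treated as happening today (age 0).
signals = [
    ("pizza", "doc22", "click", 0),     # Alonzo
    ("soup", "doc17", "click", 0),      # Elena
    ("pizza", "doc12", "click", 0),     # Ming
    ("pizza", "doc22", "purchase", 0),  # Alonzo
    ("pizza", "doc22", "click", 0),     # Ming
    ("pizza", "doc22", "purchase", 0),  # Ming
    ("soup", "doc2", "click", 0),       # Elena
]
boosts = aggregate_boosts(signals)
print(boost_clause("pizza", boosts))
```

With these assumed weights, doc22 ends up far ahead of doc12 for "pizza", which is exactly the reinforcement effect the talk goes on to warn about.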
  • 81. • 200%+ increase in click-through rates • 91% lower TCO • 50,000 fewer support tickets • Increased customer satisfaction
  • 82. Signal Boosting
    • Benefits: dramatically improves relevance (increased conversions, most popular documents / answers at the top)
    • Risks:
      • Reinforces current biases: documents already at the top are more likely to be clicked on / purchased / interacted with, so diversity is harder to achieve
        • Solution: Learning to Rank: learn relevance patterns and feature weights from aggregate behavior instead of overfitting to specific documents
      • Subject to manipulation: once users realize their behaviors (searches, clicks, etc.) influence the ranking, they can use fake actions to boost or bury content
        • Solutions:
          • Session-filtering: limit to one action, per type, per user. Further limit by IP address, browser fingerprint, etc. if necessary
          • Quantity vs. Quality Weighting: for users acting on lots of queries or documents, reduce the weight of each action in proportion to their total actions. The more actions a user takes, the less each one counts toward the aggregate.
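The two anti-manipulation ideas on this slide could look roughly like the sketch below. Both functions are hypothetical: the slide does not say whether "one action, per type, per user" is scoped per query and document (assumed here), nor what the exact down-weighting function is (assumed here to be 1 / the user's total number of actions).

```python
from collections import Counter, defaultdict


def session_filter(signals):
    """Keep at most one action of each type per user per (query, document).

    signals: iterable of (user, query, doc_id, action) tuples.
    """
    seen = set()
    kept = []
    for signal in signals:
        if signal not in seen:
            seen.add(signal)
            kept.append(signal)
    return kept


def weighted_totals(signals):
    """Down-weight prolific users: each action counts 1 / (actions by that user)."""
    actions_per_user = Counter(user for user, _, _, _ in signals)
    totals = defaultdict(float)
    for user, query, doc_id, _ in signals:
        totals[(query, doc_id)] += 1.0 / actions_per_user[user]
    return dict(totals)


# A made-up bot hammers doc99, while two ordinary users each click doc22 once.
signals = (
    [("bot", "pizza", "doc99", "click")] * 50
    + [("bot", query, "doc99", "click") for query in ("pasta", "soup", "salad")]
    + [("alonzo", "pizza", "doc22", "click"), ("ming", "pizza", "doc22", "click")]
)
filtered = session_filter(signals)
totals = weighted_totals(filtered)
print(len(signals), "raw signals ->", len(filtered), "after filtering")
print(totals[("pizza", "doc99")], totals[("pizza", "doc22")])
```

In this toy data the bot's 53 raw actions shrink to 4 after filtering, and each of those 4 counts for a quarter of an ordinary user's single click.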
  • 83. Learning to Rank (LTR) ● It applies machine learning techniques to discover the combination of features that provides the best ranking. ● It requires a labeled set of documents with relevancy scores for a given set of queries. ● Features used for ranking are usually more computationally expensive than the ones used for matching. ● It typically re-ranks a subset of the matched documents (e.g. top 1000).
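The last point, re-ranking only the top of the matched list, can be illustrated with a minimal sketch. It assumes a linear model whose weights have already been learned elsewhere; the `rerank` function, the feature values, and the weights are all made up for illustration and are not Solr's API.

```python
def rerank(matched, features, weights, rerank_docs=1000):
    """Re-score only the first rerank_docs matches with a linear model.

    matched:  doc ids in their original (cheap) match order
    features: {doc_id: [feature values]}, the expensive ranking features
    weights:  one learned weight per feature (assumed to be given)
    """
    head, tail = matched[:rerank_docs], matched[rerank_docs:]

    def model_score(doc_id):
        return sum(w * f for w, f in zip(weights, features[doc_id]))

    # Documents beyond the cutoff keep their original order.
    return sorted(head, key=model_score, reverse=True) + tail


matched = ["a", "b", "c", "d"]
features = {"a": [0.1, 0.0], "b": [0.9, 1.0], "c": [0.5, 0.0], "d": [1.0, 1.0]}
weights = [1.0, 2.0]
print(rerank(matched, features, weights, rerank_docs=3))
```

Note that "d" would score highest under the model but stays last, because it fell outside the re-rank window; that is the cost of limiting the expensive features to a subset.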
  • 85. # Supply User Relevancy Judgements
    nano contrib/ltr/example/user_queries.txt
    # Format: query | doc id | relevancy judgement | source
    # Train and Upload Model
    ./train_and_upload_demo_model.py -c config.json
  • 86. # Re-run Searches using Machine-learned Ranking Model
    http://localhost:8984/solr/techproducts/browse?q=ipod&rq={!ltr model=exampleModel reRankDocs=100 efi.user_query=$q}
  • 87. Collaborative Filtering (Recommendations)
    Feedback loop: User Searches; User Sees Results; User takes an action; Users’ actions inform system improvements.
    Search log:
      User   | Query | Results
      Alonzo | pizza | doc10, doc22, doc12, …
      Elena  | soup  | doc84, doc2, doc17, …
      Ming   | pizza | doc10, doc22, doc12, …
      …      | …     | …
    Signals:
      User   | Action   | Document
      Alonzo | click    | doc22
      Elena  | click    | doc17
      Ming   | click    | doc12
      Alonzo | purchase | doc22
      Ming   | click    | doc22
      Ming   | purchase | doc12
      Elena  | click    | doc2
      …      | …        | …
    User-item weights (input to Matrix Factorization):
      User   | Item  | Weight
      Alonzo | doc22 | 1.0
      Alonzo | doc12 | 0.4
      …      | …     | …
      Ming   | doc12 | 0.9
      Ming   | doc22 | 0.6
      …      | …     | …
    Recommendations for Alonzo: • doc22: “Pepperoni Pizza” • doc12: “Cheese Pizza” …
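As a rough illustration of the matrix factorization step, here is a tiny pure-Python sketch that fits the four user-item weights from the slide with stochastic gradient descent. The number of factors, learning rate, regularization, and epoch count are arbitrary assumptions, and a production system would use a dedicated library and would score items the user has not yet interacted with.

```python
import random


def predict(P, Q, user, item):
    """Predicted affinity: dot product of the user and item factor vectors."""
    return sum(pu * qi for pu, qi in zip(P[user], Q[item]))


def squared_error(P, Q, weights):
    return sum((w - predict(P, Q, u, i)) ** 2 for u, i, w in weights)


def factorize(weights, n_factors=2, epochs=500, lr=0.05, reg=0.01, seed=0):
    """Fit factor vectors so that predict(P, Q, user, item) approximates weight.

    weights: list of (user, item, weight) triples.
    Plain SGD with L2 regularization; all hyperparameters are illustrative.
    """
    rng = random.Random(seed)
    P = {u: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for u, _, _ in weights}
    Q = {i: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for _, i, _ in weights}
    for _ in range(epochs):
        for user, item, weight in weights:
            err = weight - predict(P, Q, user, item)
            for k in range(n_factors):
                pu, qi = P[user][k], Q[item][k]
                P[user][k] += lr * (err * qi - reg * pu)
                Q[item][k] += lr * (err * pu - reg * qi)
    return P, Q


# The user-item weights shown on the slide.
weights = [
    ("Alonzo", "doc22", 1.0),
    ("Alonzo", "doc12", 0.4),
    ("Ming", "doc12", 0.9),
    ("Ming", "doc22", 0.6),
]
P, Q = factorize(weights)
for item in ("doc22", "doc12"):
    print("Alonzo", item, round(predict(P, Q, "Alonzo", item), 2))
```

Sorting a user's predicted scores in descending order gives the recommendation list; for Alonzo, doc22 should come out ahead of doc12, matching the slide.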
  • 88. Collaborative Filtering
    • Benefits: crowd-sources related content discovery based on real user interactions, with no a priori understanding of the content required
    • Risks:
      • Reinforces biases: people interact with what they are recommended, so those same items get recommended to the next person ad infinitum
        • Solutions:
          • Combine with Content-based Features: multi-modal recommendations enable mixing in non-behavior-based matches and overcome the cold-start problem
          • Only Count Explicit Actions: if content is on “autoplay”, don’t assume an interaction is positive. Only count explicit clicks, likes, dislikes, etc.
          • Inject Conceptual Diversity: use techniques like concept clustering or the Semantic Knowledge Graph to determine key conceptual differences between content, and ensure results coming back represent diverse viewpoints and not just identical ones.
      • Subject to manipulation: same concerns as Signal Boosting
        • Solutions: same solutions as Signal Boosting (Session-filtering, Quantity vs. Quality Weighting)
  • 89. What is the Negative Space between two words?
  • 90. What’s in the Negative Space Between the words “Jean Grey” and “In Love”? Jean Grey In Love
  • 93. Scoring of Node Relationships (Edge Weights): Foreground vs. Background Analysis
    Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.
    z = (countFG(x) - totalDocsFG * probBG(x)) / sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
    Foreground Query: "Hadoop"
    { "type":"keywords", "values":[
      { "value":"hive", "relatedness":0.9773, "popularity":369 },
      { "value":"java", "relatedness":0.9236, "popularity":15653 },
      { "value":".net", "relatedness":0.5294, "popularity":17683 },
      { "value":"bee", "relatedness":0.0, "popularity":0 },
      { "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
      { "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
    We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus).
    Knowledge Graph
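The z-score on this slide translates directly into code. The sketch below is a literal transcription of that formula; the function name, the argument names, and the document counts in the example are made up. The relatedness values in the slide's JSON fall between -1 and 1, which suggests an additional normalization step that the slide does not show, so none is attempted here.

```python
import math


def foreground_z_score(count_fg, total_docs_fg, count_bg, total_docs_bg):
    """z-score of a term's foreground count against its background rate.

    count_fg:      foreground documents containing the term
    total_docs_fg: documents matching the foreground query
    count_bg:      background documents containing the term
    total_docs_bg: documents in the background corpus
    """
    prob_bg = count_bg / total_docs_bg
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1.0 - prob_bg))
    if std_dev == 0:
        return 0.0  # term absent from, or present in all of, the background
    return (count_fg - expected) / std_dev


# Made-up counts: 100 foreground docs, and a term found in 10% of the corpus.
print(foreground_z_score(50, 100, 1000, 10000))  # over-represented: positive
print(foreground_z_score(10, 100, 1000, 10000))  # as common as background: about 0
print(foreground_z_score(2, 100, 1000, 10000))   # under-represented: negative
```

Positive scores mark terms that occur in the foreground more often than the background rate predicts, scores near zero mark terms that are no more common than chance, and negative scores mark terms that the foreground context tends to exclude.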
  • 94. Techniques like the Semantic Knowledge Graph can be used to score “diversity” across content, which can aid in reducing the bias of Signals and Collaborative Filtering.
  • 95. So, can we go back in time and fix our mistakes?
  • 96. No, but we do have a wizard….
  • 97. User Searches; User Sees Results; User takes an action. Well, today, most of us run A/B experiments to test hypotheses and to “limit” the unknown negative impact to a subset of users
  • 98. What if we could use the negative space to view alternate futures… …and then make only the specific choices that will achieve the desired outcomes
  • 99. Imagine if we could simulate how users would interact with changes before having to expose real users to those changes.
  • 100. User Searches User Sees Results User takes an action Users’ actions inform system improvements User Query Results Alonzo pizza doc10, doc22, doc12, … Elena soup doc84, doc2, doc17, … Ming pizza doc10, doc22, doc12, … … … … User Action Document Alonzo click doc22 Elena click doc17 Ming click doc10 Alonzo purchase doc22 Ming click doc22 Ming purchase doc22 Elena click doc2 … … … We DO have historical user behavior, but it’s biased to the current algorithm... The click and purchase counts are all higher for docs that are already ranked higher, since they’re seen more often…
  • 101. User Searches User Sees Results User takes an action Users’ actions inform system improvements User Query Results Alonzo pizza doc10, doc22, doc12, … Elena soup doc84, doc2, doc17, … Ming pizza doc10, doc22, doc12, … … … … User Action Document Alonzo click doc22 Elena click doc17 Ming click doc10 Alonzo purchase doc22 Ming click doc22 Ming purchase doc22 Elena click doc2 … … … What other data do we have available that we’re not leveraging?
  • 102. User Query Results Alonzo pizza doc10, doc22, doc12, … Elena soup doc84, doc2, doc17, … Ming pizza doc10, doc22, doc12, … … … … User Action Document Alonzo click doc22 Elena click doc17 Ming click doc10 Alonzo purchase doc22 Ming click doc22 Ming purchase doc22 Elena click doc2 … … … What we already know: • What the user searched • What the user interacted with (click, purchase) • Results returned to the user What would we ideally like to know? • Which documents are relevant (user liked) • Which documents are irrelevant (user didn’t like) • What is the ideal ranking of documents? Can we use the Negative Space to connect the dots?
  • 104. How to infer relevance? Rank Document ID 1 Doc1 2 Doc2 3 Doc3 4 Doc4 Query Query Doc1 Doc2 Doc3 0 1 1 Query Doc1 Doc2 Doc3 1 0 0 Click Graph Skip Graph ?
  • 105. From this click-skip graph, we can generate a ground truth data set mapping known queries to an ideal ranking of documents.
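One way to read the click graph and skip graph from the previous slide: a document counts as clicked if the user clicked it, and as skipped if it was ranked above a clicked document but not clicked itself. Documents below the last click stay unknown, like Doc4 on the slide. The sketch below implements that reading; the function names and the clicks / (clicks + skips) relevance estimate are assumptions, not the speaker's stated method.

```python
from collections import defaultdict


def click_skip_counts(sessions):
    """sessions: iterable of (query, ranked_doc_ids, clicked_doc_ids).

    Returns (clicks, skips), each mapping (query, doc_id) to a count.
    """
    clicks = defaultdict(int)
    skips = defaultdict(int)
    for query, ranked, clicked in sessions:
        clicked = set(clicked)
        clicked_positions = [pos for pos, doc in enumerate(ranked) if doc in clicked]
        if not clicked_positions:
            continue  # no click at all: nothing can be inferred from this session
        last_click = max(clicked_positions)
        for doc in ranked[:last_click + 1]:
            if doc in clicked:
                clicks[(query, doc)] += 1
            else:
                skips[(query, doc)] += 1  # seen (ranked above a click) but passed over
    return dict(clicks), dict(skips)


def relevance_judgments(clicks, skips):
    """Assumed estimate: the share of views that ended in a click."""
    judgments = {}
    for key in set(clicks) | set(skips):
        c, s = clicks.get(key, 0), skips.get(key, 0)
        judgments[key] = c / (c + s)
    return judgments


# The situation from the slide: Doc2 and Doc3 clicked, Doc1 skipped, Doc4 unknown.
sessions = [("query", ["Doc1", "Doc2", "Doc3", "Doc4"], ["Doc2", "Doc3"])]
clicks, skips = click_skip_counts(sessions)
judgments = relevance_judgments(clicks, skips)
print(sorted(judgments.items()))
```

Sorting each query's documents by these judgments gives a candidate "ideal ranking" that can serve as ground truth for the metrics on the following slides.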
  • 106. How to Measure Relevance? Let A = retrieved documents, C = relevant documents, and B = the documents in both sets. Precision = B / A. Recall = B / C. Problem: suppose precision = 90% and recall = 100%, but the 10% of retrieved documents that are irrelevant are ranked at the top. Is that OK?
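The problem posed on this slide is easy to demonstrate: precision and recall are set-based, so they cannot tell a good ordering from a bad one. A minimal sketch, with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall; the order of `retrieved` is ignored."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


relevant = [f"rel{i}" for i in range(9)]
irrelevant_first = ["junk"] + relevant   # the one bad result is ranked at the top
irrelevant_last = relevant + ["junk"]    # the one bad result is ranked at the bottom

print(precision_recall(irrelevant_first, relevant))
print(precision_recall(irrelevant_last, relevant))
```

Both orderings produce the same precision and recall, even though the first one puts an irrelevant document in the most visible position.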
  • 107. Discounted Cumulative Gain Rank Relevancy 1 0.95 2 0.65 3 0.80 4 0.85 Rank Relevancy 1 0.95 2 0.65 3 0.80 4 0.85 Ranking Ideal Given • Position is considered in quantifying relevancy. • Labeled dataset is required.
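A small sketch of Discounted Cumulative Gain, using the relevancy values from the slide as the given ranking. It assumes the common linear-gain form, relevancy / log2(rank + 1); other variants exist (for example with an exponential gain). The ideal ranking is taken to be the same documents sorted by descending relevancy.

```python
import math


def dcg(relevancies):
    """Discounted cumulative gain: later ranks contribute less."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevancies, start=1))


def ndcg(relevancies):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevancies, reverse=True))
    return dcg(relevancies) / ideal if ideal > 0 else 0.0


given = [0.95, 0.65, 0.80, 0.85]       # relevancy at ranks 1 to 4, from the slide
ideal = sorted(given, reverse=True)    # best possible ordering of the same docs

print(round(dcg(given), 4), round(dcg(ideal), 4), round(ndcg(given), 4))
```

Unlike precision and recall, this score drops when a more relevant document sits below a less relevant one, which is why position-aware metrics like this need a labeled dataset.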
  • 108. User Query Results Alonzo pizza doc10, doc22, doc12, … Elena soup doc84, doc2, doc17, … Ming pizza doc10, doc22, doc12, … … … … User Action Document Alonzo click doc22 Elena click doc17 Ming click doc10 Alonzo purchase doc22 Ming click doc22 Ming purchase doc22 Elena click doc2 … … … Relevance Backtesting Simulation
  • 110. Did we cover our Agenda? • Fraudulent AI • Adversarial Machine Learning • Cancer • War • Bikinis • Brainwashing • Alt-right • White Supremacism • Time Travel • Avengers Endgame Spoilers • Negative Space • Dark Data • Pornography • Global Warming • Algorithmic Bias • Diet & Exercise • Self-crashing Cars • Racism • Sexism
  • 111. Goals of this Talk 1. Help identify patterns for uncovering overlooked data hidden in plain sight 2. Point out current failures and dangers of overlooking this negative space. 3. Discuss applications to my field (information retrieval) and how my company is working to overcome some of these failures in our own technology.
  • 113. Trey Grainger trey.grainger@lucidworks.com @treygrainger Thank you! http://solrinaction.com Other presentations: http://www.treygrainger.com Discount code: ctwdsc19 Book Signing 3:00 pm today! (coffee break) @ Registration Desk