1
The Manifold Path to Search Quality
Enterprise Search & Analytics Meetup
Mark David – Architect, Data Scientist
Avi Rappaport – Senior Search Quality Analyst
19 March 2015
2
“manifold”
• adjective
– having numerous different parts, elements, features, forms, etc.
- dictionary.com
3
Search Technologies: Who We Are
The leading independent IT services firm specializing in the design,
implementation, and management of enterprise search and big data
search solutions.
4
Solutions
Corporate Wide Search – “Google for the Enterprise.” A single, secure point of search for all
users and all content. Strategic initiative for corporate wide information distribution and search.
Data Warehouse Search – A Big Data search solution that enables interactive query and analytics
with extremely large data sets for business intelligence and fraud detection.
E-Commerce Search – Leverages machine learning and accuracy metrics to deliver a better
online user experience and maximize revenues from visitor search activity.
Search & Match – Increase recruiter productivity and fill rates in the staffing industry. Provides a
better search experience followed by automated candidate-to-job matching.
Search for Media & Publishing – Improve user search experience for publishers of large amounts
of content such as government organizations, research firms, and media publications.
Government Search – A solution focused on designing and developing search for government
information portals or archiving systems.
5
Search Technologies: Background
Offices: San Diego; London, UK; San Jose, CR; Cincinnati; Prague, CZ; Washington (HQ); Frankfurt, DE
• Founded 2005
• 150+ employees
• 600+ customers worldwide
• Deep enterprise search expertise
• Consistent revenue growth
• Consistent profitability
6
600+ Customers
7
Search Technologies: What We Do
• All aspects of search application implementation
– Content access and processing, search system architecture, configuration, deployment
– Accuracy analysis, metrics, engine scoring, relevancy ranking, query enhancement
– User interface, analytics, visualization
• Technology assets to support implementation
– Aspire high performance content processing
– Content Connectors (Document, Jive, SharePoint, Salesforce, Box.com, etc.)
• Engagement models
– Most projects start with an “assessment”
– Fully project-managed solutions, designed, delivered, and supported
– Experts for hire, supporting in-house teams or as a subcontractor
8
Search Engine and Big Data Expertise
Our Technology and Integration Partners
9
Technology Assets
[Diagram labels: Content sources, Connectors, Aspire Content Processing Pipelines, Staging Repository, Publishers, Indexes, Search Engine, Web Browser]
1. Aspire Framework
– High Performance Content Processing
– Ingests and processes content and publishes to a variety of indexes
for commercial and open source search engines
2. Aspire Data Connectors
– API level access to content repositories
3. Query Processing Language (QPL)
– Advanced query processing
Complements to commercial and open source search technologies
10
[Diagram labels: Engine, Data, Users]
11
Understand Your Data
• Data Analysis
– Access patterns & rates, sources, schemas, field typing,
duplicates, near-duplicates, term frequencies, etc.
• Content Processing
– Source connection, format conversion, sub-document
separation, field boundaries, multiple-source assembly, etc.
• Text Processing
– Character decoding, tag stripping, tokenization, sentence
boundaries, normalization, entity extraction, pattern
recognition, disambiguation, filtering, etc.
12
Understand Your Users
• Search Scope
– Interviews
– Log Analysis
– Scenarios
– Wireframes & mockups
• Search Quality
• Improvements
– Relevance
– Coverage
– UX
13
Understand Your Search Engine
• How does it score results?
• How accurate is it for the short head?
• How accurate is it for the long tail?
• When you change it to improve a particular type of query,
how do you know that the overall accuracy improved?
14
Regression Testing of Search
• Step 1: Gather a Set of Judgments
• If you already have lots of user data:
– Use click log analysis to gather sets of clearly good and clearly
bad results
– Ignore unclear tracks
• If user data is not yet available:
– Manual judgments
• End up with a set of queries with associated “good” and
“bad” documents
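One way to sketch the click-log route: aggregate impressions and clicks per (query, document) pair, then keep only the clearly good and clearly bad pairs. The thresholds, the minimum impression count, and the `(query, doc_id, clicked)` log layout are illustrative assumptions, not values from the talk.

```python
from collections import defaultdict

# Hypothetical thresholds; tune them against your own traffic.
MIN_IMPRESSIONS = 20
GOOD_CTR = 0.30
BAD_CTR = 0.02


def judgments_from_clicks(log_rows):
    """Turn click logs into good/bad judgments.

    log_rows: iterable of (query, doc_id, clicked) tuples, one per impression.
    Returns {query: {"good": set, "bad": set}}; unclear pairs are dropped.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc_id, was_clicked in log_rows:
        shown[(query, doc_id)] += 1
        if was_clicked:
            clicked[(query, doc_id)] += 1

    judgments = defaultdict(lambda: {"good": set(), "bad": set()})
    for (query, doc_id), impressions in shown.items():
        if impressions < MIN_IMPRESSIONS:
            continue  # too little evidence: treat as an unclear track
        ctr = clicked[(query, doc_id)] / impressions
        if ctr >= GOOD_CTR:
            judgments[query]["good"].add(doc_id)
        elif ctr <= BAD_CTR:
            judgments[query]["bad"].add(doc_id)
        # anything in between is unclear and ignored
    return dict(judgments)
```

Raw click-through rate is biased by result position, so a real analysis would likely correct for rank before calling a document "bad".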
15
Regression Testing of Search
• Step 2: Instrument the Search Results
• Periodically execute all those queries, and score the results
• How to score:
– Every good document adds a position-based amount
– Every bad document subtracts the same amount
– Unknown documents don’t affect the score (except by
occupying a position)
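The scoring rule above can be sketched as follows. The talk does not say which position weighting to use; the 1/rank weight here is an assumption chosen for simplicity, and the function names are hypothetical.

```python
def score_results(ranked_doc_ids, good, bad):
    """Score one ranked result list against judged documents.

    Good documents add a position-based amount, bad documents subtract
    the same amount, and unjudged documents contribute nothing (they
    still occupy a rank).
    """
    score = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        weight = 1.0 / rank  # assumed weighting; any decreasing function works
        if doc_id in good:
            score += weight
        elif doc_id in bad:
            score -= weight
    return score


def score_query_set(run_query, judgments):
    """Average score over all judged queries.

    run_query: callable mapping a query string to a ranked list of doc ids.
    judgments: {query: {"good": set, "bad": set}}.
    """
    scores = [
        score_results(run_query(q), j["good"], j["bad"])
        for q, j in judgments.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `score_query_set` before and after an engine change gives one number to compare, which is what makes it usable as a regression test.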
16
Understand the Data
17
Relevancy Improvements from Data
• Text Processing
– Typos
– Entity Extraction
– Breaks
– Parts of Speech
• Data Analysis
– TF-IDF
– Phrase Dictionary
– Boilerplate
18
To Correct or Not To Correct
• Should typos be “fixed”?
• This goes back to knowing your audience
• Example: Haircutz
• In document-to-document situations, generally yes.
19
Bigger Needles in the Haystack
• Entity Extraction: How big a chunk?
• Example: mdavid@searchtechnologies.com
– Is that 1, 2, 3, 4, or 5 tokens?
• Multi-indexing is a key component of accuracy
– Different people think differently, so the indexes need to have
different ways of representing the data.
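As a rough illustration of multi-indexing, the same email address can be emitted at several granularities so that different query styles all find it. The function name and the specific splitting rules are assumptions made for this sketch.

```python
import re


def email_token_variants(email):
    """Return several tokenizations of one email address.

    Each variant could feed a different index field.
    """
    local, _, domain = email.partition("@")
    return {
        "whole": [email],                                  # 1 token
        "local_and_domain": [local, domain],               # 2 tokens
        "split_on_punctuation": re.split(r"[@.]", email),  # finest split
    }


variants = email_token_variants("mdavid@searchtechnologies.com")
```

A query for the full address can then match the `whole` field, while a query for just the company name can match the finer splits.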
20
Breaker, Breaker
• Don’t match across boundaries
– Paragraph
– Sentence
– Phrase
• Whitespace does have meaning!
• Punctuation does have meaning!
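A minimal sketch of boundary-aware phrase matching: split the text at sentence and paragraph breaks first, then look for the phrase inside each piece only. The regex-based splitting is a deliberate simplification and an assumption of this example; a production pipeline would use a proper sentence-boundary detector.

```python
import re


def phrase_in_text(phrase, text):
    """True if the phrase occurs inside a single sentence of the text.

    Sentences are split naively on ., !, ? and on blank lines.
    """
    wanted = phrase.lower().split()
    for sentence in re.split(r"[.!?]+|\n\s*\n", text):
        tokens = re.findall(r"\w+", sentence.lower())
        # Slide a window of len(wanted) tokens across this sentence only.
        for i in range(len(tokens) - len(wanted) + 1):
            if tokens[i:i + len(wanted)] == wanted:
                return True
    return False
```

With this rule, "york times" does not match "He moved to York. Times were hard." even though the two tokens are adjacent once punctuation is thrown away.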
21
Parts is Parts
• Figuring out the part of speech (noun vs. verb vs. adjective)
would seem to clearly help
– We avoid matching on the incorrect version
• Study after study shows that it does not!
• Why not?
– Closely related (in English)
• Example: to go on a run
– Prevalence of noun phrases in the group of “important” terms
22
How Common are Terms (Not Tokens)?
• Term Frequency (not “Token Frequency”)
– Example: The West, West London, The Wild West
• Do your full text processing when you’re gathering statistics
– And adjust it and re-run it when the data changes
• Inverse Document Frequency
– In how many docs does this term occur?
– NOT: How many times does this term occur across all docs?
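The difference between the two counts is easy to see in code. This sketch assumes each document has already been through full text processing and is a list of terms (which may be multi-word). The `log(N / df)` form is one common IDF variant, not necessarily the one any given engine uses.

```python
import math
from collections import Counter


def document_frequency(docs_terms):
    """Number of documents each term occurs in.

    A term is counted once per document, however often it repeats inside it.
    """
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))
    return df


def collection_frequency(docs_terms):
    """Total occurrences across all documents: NOT what IDF uses."""
    cf = Counter()
    for terms in docs_terms:
        cf.update(terms)
    return cf


def idf(term, df, num_docs):
    """log(N / df) for a term that occurs in at least one document."""
    return math.log(num_docs / df[term])


docs = [
    ["the wild west", "west london", "the wild west"],
    ["west london"],
    ["the west"],
]
df = document_frequency(docs)
cf = collection_frequency(docs)
```

Here "the wild west" occurs twice in total but in only one document, so its document frequency is 1.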
23
Let Me Re-Phrase That
• Some general dictionaries are freely available
– Example: locations (geonames.org)
• Others can be derived
– Example: Company names from stock markets, business
registries, Wikipedia, etc.
• More useful are terms from your industry
– Can you think of lists that are available internally?
– Example: Job titles in a recruiting company
• Most useful are terms from your data
– Statistical generation of common 2-shingles and 3-shingles
– Query log analysis
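Statistical generation of shingles can be sketched as a simple count of word n-grams. The `min_count` cut-off and the function names are assumptions of this example; a real run would likely also filter out shingles made only of stop words.

```python
from collections import Counter


def shingles(tokens, n):
    """All contiguous n-token sequences (word n-grams) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def common_shingles(token_lists, sizes=(2, 3), min_count=2):
    """Count 2- and 3-shingles across token lists; keep the frequent ones.

    token_lists should already be split at sentence or phrase breaks so
    that no shingle spans a boundary. min_count is an arbitrary cut-off.
    """
    counts = Counter()
    for tokens in token_lists:
        for n in sizes:
            counts.update(shingles(tokens, n))
    return {s: c for s, c in counts.items() if c >= min_count}
```

The surviving shingles are candidates for a phrase dictionary, to be reviewed before use.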
24
Lorem ipsum…
• Boilerplate text recognition
• Pre-process:
– Simple text processing this time
– Split by paragraphs
– Calculate hash signatures for paragraphs
– Count occurrences
• Find the cliff
• Filter out early in the main pipeline
– Early steps must match the entire pre-processing pipeline
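The pre-processing pass above might look like the following sketch. The normalization (lowercase, collapsed whitespace), the choice of SHA-1, and the "largest ratio between neighbouring counts" rule for finding the cliff are all assumptions made for illustration; the talk does not specify them.

```python
import hashlib
from collections import Counter


def paragraph_signature(paragraph):
    """Hash of a lightly normalized paragraph (lowercased, whitespace collapsed)."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def count_paragraph_signatures(documents):
    """documents: iterable of raw text strings; paragraphs split on blank lines."""
    counts = Counter()
    for text in documents:
        for paragraph in text.split("\n\n"):
            if paragraph.strip():
                counts[paragraph_signature(paragraph)] += 1
    return counts


def find_cliff(counts):
    """Return the count just above the largest relative drop, or None.

    Signatures with a count at or above this value are treated as boilerplate.
    """
    ordered = sorted(counts.values(), reverse=True)
    best_ratio, threshold = 1.0, None
    for higher, lower in zip(ordered, ordered[1:]):
        if higher / lower > best_ratio:
            best_ratio, threshold = higher / lower, higher
    return threshold


def strip_boilerplate(text, counts, threshold):
    """Drop paragraphs whose signature count reaches the threshold."""
    if threshold is None:
        return text  # no cliff found: nothing is treated as boilerplate
    kept = [p for p in text.split("\n\n")
            if p.strip() and counts[paragraph_signature(p)] < threshold]
    return "\n\n".join(kept)
```

`strip_boilerplate` reuses `paragraph_signature`, so the filter in the main pipeline sees exactly the same normalization as the counting pass.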
25
Understand the Users
26
Search Quality
• Best possible results
– Given the searchable data
– For the primary users and their primary tasks
• Simple query term matching - relevance
• And beyond
– Enriched content
– Query enhancement
• Results presentation
– Clarity
– Context
27
Short Head & Long Tail
• Query Frequency
– Short Head
• A few frequent queries
– Short Middle
• Often up to 50% by traffic
– Long Tail
• Rare to unique queries
• Can be up to 75% of distinct queries
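To see where the short head ends in your own logs, a frequency count is enough. This sketch assumes one normalized query string per search request; the `head_share` default of 10% of traffic is an assumed parameter, and the function name is hypothetical.

```python
from collections import Counter


def traffic_profile(queries, head_share=0.10):
    """Split a query log into its short head and measure the long tail.

    Returns (head, singleton_fraction): the most frequent distinct queries
    that together cover the first `head_share` of traffic, and the
    fraction of distinct queries that were seen exactly once.
    """
    if not queries:
        return [], 0.0
    counts = Counter(queries)
    total = len(queries)
    head, covered = [], 0
    for query, n in counts.most_common():
        if covered >= head_share * total:
            break
        head.append(query)
        covered += n
    singletons = sum(1 for n in counts.values() if n == 1)
    return head, singletons / len(counts)
```

The `head` list is the natural starting set for manual relevance checks, since a handful of queries usually accounts for a large share of requests.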
28
But What Do They Really Want?
• Query log reports show what users think they’re looking for
– Domain research for more about why
• Behavior shows more about whether they’re finding it
– Session ending
• Frequent for zero matches
– CTR - click-through rate
• Results (with bounce rates)
– Query refinement
• Typing, facets
• Navigation via search
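The behavioral signals above reduce to a few ratios over the search log. The record layout here (`result_count`, `clicked`, `refined`) is hypothetical; real logs need sessionization before "refined" can be derived.

```python
def behavior_metrics(searches):
    """Compute basic behavior rates from a list of search records.

    searches: list of dicts with the assumed keys
    'result_count' (int), 'clicked' (bool), 'refined' (bool).
    """
    total = len(searches)
    if total == 0:
        return {"zero_match_rate": 0.0, "ctr": 0.0, "refinement_rate": 0.0}
    zero = sum(1 for s in searches if s["result_count"] == 0)
    clicks = sum(1 for s in searches if s["clicked"])
    refined = sum(1 for s in searches if s["refined"])
    return {
        "zero_match_rate": zero / total,   # searches that returned nothing
        "ctr": clicks / total,             # searches with at least one click
        "refinement_rate": refined / total,
    }
```

Tracking these per query, not only in aggregate, shows which specific queries are failing.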
29
You say “tomay-toe”
• Users' vocabulary is not content vocabulary
– Consistent problems from small to web-scale search
• Create synonyms
• Scalable automated disambiguation
– Data analysis
• Using dictionaries and co-occurrence
– Search log behavior analysis
• Query refinement and reformulation, click tracks
– Language ambiguity - even Netflix has a hard time past 85%
– Human domain expertise, editorial oversight
30
Scope (aka, this is not Google)
• User confusion
– Is this a location box?
– Is it Google?
• Design for clarity
– UI and graphic design
– Watch out for defaulting to subscope searches
• Improve content coverage
• Add Best Bets for internal and external locations
• Link to other search engines
• Federate search
31
Fix The Short Head Issues
32
Best Bang for the Buck
• Concentrate on the short head
– Top 10% by traffic
• Simple relevance test
– Perform query
– Evaluate results
• Are there any results?
• Are they the most useful available? (Domain expertise)
• Validate against user behavior
– Store judgments
– Easy fixes
– Re-test (easy to miss this)
33
Choosing, Not Typing
• Auto-suggest
– Curated search suggestions
• Best Bets
• Did you mean?
34
Context and Navigation
• Facets
• Results grouping / diversity
– Options for ambiguous queries
• Integrate with collaboration tools
– Allow user comments, reviews
35
Relevance and Ranking
• Best results patterns
– Part or serial number queries
• Tuned boosting
– Feedback on clicks and other signals
– Freshness
• De-duplication!
36
Relevance and Regression Testing
37
Search Technologies
Co
38
Reference Architecture
[Diagram labels: Content sources, Connectors, Aspire Content Processing Pipelines, Staging Repository, Big Data Framework, Big Data Array, Semantics, Text Mining, Quality Metrics, Indexes, QPL, Search Engine, Web Browser]

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
 

Relevancy and Search Quality Analysis - Search Technologies

  • 1. The Manifold Path to Search Quality
      Enterprise Search & Analytics Meetup
      Mark David – Architect, Data Scientist
      Avi Rappaport – Senior Search Quality Analyst
      19 March 2015
  • 2. “manifold”
      • adjective
          – having numerous different parts, elements, features, forms, etc.
      - dictionary.com
  • 3. Search Technologies: Who We Are
      The leading independent IT services firm specializing in the design, implementation, and management of enterprise search and big data search solutions.
  • 4. Solutions
      Corporate Wide Search – “Google for the Enterprise.” A single, secure point of search for all users and all content. Strategic initiative for corporate wide information distribution and search.
      Data Warehouse Search – A Big Data search solution that enables interactive query and analytics with extremely large data sets for business intelligence and fraud detection.
      E-Commerce Search – Leverages machine learning and accuracy metrics to deliver a better online user experience and maximize revenues from visitor search activity.
      Search & Match – Increase recruiter productivity and fill rates in the staffing industry. Provides a better search experience followed by automated candidate-to-job matching.
      Search for Media & Publishing – Improve user search experience for publishers of large amounts of content such as government organizations, research firms, and media publications.
      Government Search – A solution focused on design and development search for government information portals or archiving systems.
  • 5. Search Technologies: Background
      San Diego; London UK; San Jose, CR; Cincinnati; Prague, CZ; Washington (HQ); Frankfurt DE
      • Founded 2005
      • 150+ employees
      • 600+ customers worldwide
      • Deep enterprise search expertise
      • Consistent revenue growth
      • Consistent profitability
  • 7. Search Technologies: What We Do
      • All aspects of search application implementation
          – Content access and processing, search system architecture, configuration, deployment
          – Accuracy analysis, metrics, engine scoring, relevancy ranking, query enhancement
          – User interface, analytics, visualization
      • Technology assets to support implementation
          – Aspire high performance content processing
          – Content Connectors (Document, Jive, SharePoint, Salesforce, Box.com, etc.)
      • Engagement models
          – Most projects start with an “assessment”
          – Fully project-managed solutions, designed, delivered, and supported
          – Experts for hire, supporting in-house teams or as a subcontractor
  • 8. Search Engine and Big Data Expertise
      Our Technology and Integration Partners
  • 9. Technology Assets
      1. Aspire Framework
          – High Performance Content Processing
          – Ingests and processes content and publishes to a variety of indexes for commercial and open source search engines
      2. Aspire Data Connectors
          – API level access to content repositories
      3. Query Processing Language (QPL)
          – Advanced query processing
      Complements to commercial and open source search technologies
      [Diagram labels: Content sources, Connectors, Aspire Content Processing Pipelines, Indexes, Search Engine, Web Browser, Staging Repository, Publishers, QPL; numbered callouts 1, 2, 3]
  • 11. Understand Your Data
      • Data Analysis
          – Access patterns & rates, sources, schemas, field typing, duplicates, near-duplicates, term frequencies, etc.
      • Content Processing
          – Source connection, format conversion, sub-document separation, field boundaries, multiple-source assembly, etc.
      • Text Processing
          – Character decoding, tag stripping, tokenization, sentence boundaries, normalization, entity extraction, pattern recognition, disambiguation, filtering, etc.
  • 12. Understand Your Users
      • Search Scope
          – Interviews
          – Log Analysis
          – Scenarios
          – Wireframes & mockups
      • Search Quality
      • Improvements
          – Relevance
          – Coverage
          – UX
  • 13. Understand Your Search Engine
      • How does it score results?
      • How accurate is it for the short head?
      • How accurate is it for the long tail?
      • When you change it to improve a particular type of query, how do you know that the overall accuracy improved?
  • 14. Regression Testing of Search
      • Step 1: Gather a Set of Judgments
      • If you already have lots of user data:
          – Use click log analysis to gather sets of clearly good and clearly bad results
          – Ignore unclear tracks
      • If user data not yet available:
          – Manual judgments
      • End up with a set of queries with associated “good” and “bad” documents
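Step 1 can be sketched roughly as follows. This is a minimal illustration, not the presenters' method: the event layout `(query, doc_id, clicked)` and the thresholds `min_impressions`, `good_ctr`, and `bad_ctr` are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def judgments_from_clicks(
    events: Iterable[Tuple[str, str, bool]],
    min_impressions: int = 20,   # assumed: below this, the evidence is "unclear"
    good_ctr: float = 0.5,       # assumed threshold for "clearly good"
    bad_ctr: float = 0.02,       # assumed threshold for "clearly bad"
) -> Dict[str, Dict[str, Set[str]]]:
    """Return {query: {"good": {...}, "bad": {...}}} from impression/click events."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc_id, was_clicked in events:
        shown[(query, doc_id)] += 1
        if was_clicked:
            clicked[(query, doc_id)] += 1

    judgments: Dict[str, Dict[str, Set[str]]] = {}
    for (query, doc_id), impressions in shown.items():
        if impressions < min_impressions:
            continue  # too little data: ignore
        ctr = clicked[(query, doc_id)] / impressions
        if ctr >= good_ctr:
            label = "good"
        elif ctr <= bad_ctr:
            label = "bad"
        else:
            continue  # unclear track: ignore
        judgments.setdefault(query, {"good": set(), "bad": set()})[label].add(doc_id)
    return judgments
```

The sketch ignores position bias (results shown lower are clicked less regardless of quality), which a production analysis would likely need to correct for.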
  • 15. Regression Testing of Search
      • Step 2: Instrument the Search Results
      • Periodically execute all those queries, and score the results
      • How to score:
          – Every good document adds a position-based amount
          – Every bad document subtracts the same amount
          – Unknown documents don’t affect the score (except by occupying a position)
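The scoring rule above can be sketched in a few lines. The slide does not say what the "position-based amount" is, so the `1 / rank` weight below is an assumption; any decreasing function of rank would fit the description.

```python
from typing import List, Set


def position_weight(rank: int) -> float:
    """Weight for a 1-based rank. 1/rank is an assumed choice, not from the talk."""
    return 1.0 / rank


def score_results(results: List[str], good: Set[str], bad: Set[str]) -> float:
    """Score one ranked result list against the stored judgments for its query."""
    score = 0.0
    for rank, doc_id in enumerate(results, start=1):
        weight = position_weight(rank)
        if doc_id in good:
            score += weight
        elif doc_id in bad:
            score -= weight
        # Unknown documents add nothing, but they still occupy a rank,
        # which pushes judged documents into lower-weight positions.
    return score
```

Summing this score over all judged queries before and after an engine change gives a single number to compare, which is the regression signal the slide asks for.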
  • 17. Relevancy Improvements from Data
      • Text Processing
          – Typos
          – Entity Extraction
          – Breaks
          – Parts of Speech
      • Data Analysis
          – TF-IDF
          – Phrase Dictionary
          – Boilerplate
  • 18. To Correct or Not To Correct
      • Should typos be “fixed”?
      • This goes back to knowing your audience
      • Example: Haircutz
      • In document-to-document situations, generally yes.
  • 19. Bigger Needles in the Haystack
      • Entity Extraction: How big a chunk?
      • Example: mdavid@searchtechnologies.com
          – Is that 1, 2, 3, 4, or 5 tokens?
      • Multi-indexing is a key component of accuracy
          – Different people think differently, so the indexes need to have different ways of representing the data.
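As an illustration of multi-indexing, the sketch below produces several tokenizations of the same email address so that each could feed a different index field. The variant names (`whole`, `local_domain`, `pieces`) are hypothetical, and the slide does not say which splits the presenters actually use.

```python
import re
from typing import Dict, List


def email_variants(email: str) -> Dict[str, List[str]]:
    """Return alternative tokenizations of one email address (illustrative only)."""
    local, _, domain = email.partition("@")
    return {
        "whole": [email],                    # 1 token: exact-match lookups
        "local_domain": [local, domain],     # 2 tokens: user name, mail domain
        "pieces": re.split(r"[@.]", email),  # 3 tokens: split on "@" and "."
    }
```

For example, `email_variants("mdavid@searchtechnologies.com")["pieces"]` yields `["mdavid", "searchtechnologies", "com"]`, while the `whole` variant keeps the address intact for users who paste it in full.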
  • 20. Breaker, Breaker
      • Don’t match across boundaries
          – Paragraph
          – Sentence
          – Phrase
      • Whitespace does have meaning!
      • Punctuation does have meaning!
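A small sketch of the boundary rule for sentences: a phrase query matches only when all of its tokens sit inside one sentence. The regex-based sentence splitter is a deliberately naive assumption; a real pipeline would use a proper sentence-boundary detector.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumed): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def phrase_matches(text: str, phrase: str) -> bool:
    """True if the phrase occurs as consecutive tokens within a single sentence."""
    target = tokenize(phrase)
    if not target:
        return False
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                return True
    return False
```

With the text "He moved to New York. City officials objected.", the phrase "new york" matches, but "york city" does not, because the period separates the two tokens.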
  • 21. Parts is Parts
      • Figuring out the part of speech (noun vs. verb vs. adjective) would seem to clearly help
          – We avoid matching on the incorrect version
      • Study after study shows that it does not!
      • Why not?
          – Closely related (in English)
              • Example: to go on a run
          – Prevalence of noun phrases in the group of “important” terms
  • 22. How Common are Terms?
      • Term Frequency (not “Token Frequency”)
          – Example: The West, West London, The Wild West
      • Do your full text processing when you’re gathering statistics
          – And adjust it and re-run it when the data changes
      • Inverse Document Frequency
          – In how many docs does this term occur?
          – NOT: How many times does this term occur across all docs?
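The distinction between document frequency and total occurrence count can be made concrete with a short sketch. It assumes each document has already been through full text processing, so a multi-word term such as "the west" arrives as a single term. The `log(N / df)` form of IDF is one common formulation, used here as an assumption.

```python
import math
from collections import Counter
from typing import Dict, List


def document_frequency(docs: List[List[str]]) -> Counter:
    """In how many documents does each term occur? (This is what IDF uses.)"""
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # set(): count each term once per document
    return df


def collection_frequency(docs: List[List[str]]) -> Counter:
    """How many times does each term occur across all documents? (Not for IDF.)"""
    cf = Counter()
    for terms in docs:
        cf.update(terms)
    return cf


def idf_table(docs: List[List[str]]) -> Dict[str, float]:
    """IDF as log(N / df), one common variant (assumed)."""
    df = document_frequency(docs)
    return {term: math.log(len(docs) / count) for term, count in df.items()}
```

For `docs = [["the west", "the west", "london"], ["west london"], ["london"]]`, the term "the west" has a collection frequency of 2 but a document frequency of 1, and only the latter feeds the IDF.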
  • 23. Let Me Re-Phrase That
      • Some general dictionaries are freely available
          – Example: locations (geonames.org)
      • Others can be derived
          – Example: Company names from stock markets, business registries, Wikipedia, etc.
      • More useful are terms from your industry
          – Can you think of lists that are available internally?
          – Example: Job titles in a recruiting company
      • Most useful are terms from your data
          – Statistical generation of common 2-shingles and 3-shingles
          – Query log analysis
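Statistical generation of common 2-shingles and 3-shingles can be sketched as a plain frequency count over token windows. The `min_count` cutoff is an assumption; in practice the threshold (or a stronger association measure) would be tuned on the data.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def shingles(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All runs of n consecutive tokens; empty if the document is shorter than n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def common_shingles(
    docs: List[List[str]],
    sizes: Tuple[int, ...] = (2, 3),
    min_count: int = 2,  # assumed cutoff for "common"
) -> Dict[str, int]:
    """Count 2- and 3-shingles across tokenized docs and keep the frequent ones."""
    counts = Counter()
    for tokens in docs:
        for n in sizes:
            counts.update(shingles(tokens, n))
    return {" ".join(s): c for s, c in counts.items() if c >= min_count}
```

Run over tokenized job postings, for instance, this would surface candidates such as "software engineer" for the phrase dictionary; the same function applies unchanged to tokenized query logs.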
  • 24. Lorem ipsum…
      • Boilerplate text recognition
      • Pre-process:
          – Simple text processing this time
          – Split by paragraphs
          – Calculating hash signatures for paragraphs
          – Count occurrences
      • Find the cliff
      • Filter out early in the main pipeline
          – Early steps must match the entire pre-processing pipeline
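The boilerplate steps above can be sketched end to end. Several details are assumptions made for illustration: paragraphs are separated by blank lines, normalization is lowercasing plus whitespace collapsing, the signature is SHA-1, and the "cliff" is taken to be the largest drop between consecutive counts when sorted in descending order.

```python
import hashlib
from collections import Counter
from typing import List, Optional, Set


def normalize(paragraph: str) -> str:
    """Simple text processing (assumed): lowercase and collapse whitespace."""
    return " ".join(paragraph.lower().split())


def signature(paragraph: str) -> str:
    return hashlib.sha1(normalize(paragraph).encode("utf-8")).hexdigest()


def split_paragraphs(text: str) -> List[str]:
    return [p for p in text.split("\n\n") if p.strip()]


def count_signatures(docs: List[str]) -> Counter:
    """Count how often each paragraph signature occurs across the corpus."""
    counts = Counter()
    for text in docs:
        counts.update(signature(p) for p in split_paragraphs(text))
    return counts


def find_cliff(counts: Counter) -> Optional[int]:
    """Return the count just above the largest drop, or None if there is no drop."""
    ordered = sorted(counts.values(), reverse=True)
    if len(ordered) < 2:
        return None
    i = max(range(len(ordered) - 1), key=lambda k: ordered[k] - ordered[k + 1])
    return ordered[i] if ordered[i] > ordered[i + 1] else None


def strip_boilerplate(text: str, boilerplate: Set[str]) -> str:
    """Main-pipeline filter: uses the same split and signature as pre-processing."""
    kept = [p for p in split_paragraphs(text) if signature(p) not in boilerplate]
    return "\n\n".join(kept)
```

Because `strip_boilerplate` reuses `split_paragraphs` and `signature`, the filter sees exactly the same paragraph hashes that were counted, which is the point of the last bullet on the slide.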
  • 26. Search Quality
      • Best possible results
          – Given the searchable data
          – For the primary users and their primary tasks
      • Simple query term matching - relevance
      • And beyond
          – Enriched content
          – Query enhancement
      • Results presentation
          – Clarity
          – Context
  • 27. Short Head & Long Tail
      • Query Frequency
          – Short Head
              • A few frequent queries
          – Short Middle
              • Often to 50% by traffic
          – Long Tail
              • Rare to unique queries
              • Can be to 75% distinct
  • 28. But What Do They Really Want?
      • Query log reports show what users think they’re looking for
          – Domain research for more about why
      • Behavior shows more about whether they’re finding it
          – Session ending
              • Frequent for zero matches
          – CTR - click-through rate
              • Results (with bounce rates)
          – Query refinement
              • Typing, facets
              • Navigation via search
  • 29. You say “tomay-toe”
      • Users’ vocabulary is not content vocabulary
          – Consistent problems from small to web-scale search
      • Create synonyms
      • Scalable automated disambiguation
          – Data analysis
              • Using dictionaries and co-occurrence
          – Search log behavior analysis
              • Query refinement and reformulation, click tracks
          – Language ambiguity - even Netflix has a hard time past 85%
          – Human domain expertise, editorial oversight
  • 30. Scope (aka, this is not Google)
      • User confusion
          – Is this a location box?
          – Is it Google?
      • Design for clarity
          – UI and graphic design
          – Watch out for default to subscope searches
      • Improve content coverage
      • Add Best Bets for internal and external locations
      • Link to other search engines
      • Federate search
  • 31. Fix The Short Head Issues
  • 32. Best Bang for the Buck
      • Concentrate on the short head
          – Top 10% by traffic
      • Simple relevance test
          – Perform query
          – Evaluate results
              • Are there any results?
              • Are they the most useful available? (Domain expertise)
              • Validate against user behavior
          – Store judgments
          – Easy fixes
          – Re-test (easy to miss this)
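Picking out the short head from a query log can be sketched as follows. The sketch reads "top 10% by traffic" as the most frequent distinct queries that together account for that share of all searches; this interpretation, and the lowercase/strip normalization, are assumptions.

```python
from collections import Counter
from typing import Iterable, List


def short_head(query_log: Iterable[str], traffic_share: float = 0.10) -> List[str]:
    """Most frequent distinct queries that together cover `traffic_share` of traffic."""
    counts = Counter(q.strip().lower() for q in query_log)  # assumed normalization
    total = sum(counts.values())
    head: List[str] = []
    covered = 0
    for query, count in counts.most_common():
        if covered >= traffic_share * total:
            break
        head.append(query)
        covered += count
    return head
```

The returned list is the set of queries to run through the simple relevance test first; storing the judgments per query makes the later re-test a mechanical step.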
  • 33. Choosing, Not Typing
      • Auto-suggest
          – Curated search suggestions
      • Best Bets
      • Did you mean?
  • 34. Context and Navigation
      • Facets
      • Results grouping / diversity
          – options for ambiguous queries
      • Integrate with collaboration tools
          – Allow user comments, reviews
  • 35. Relevance and Ranking
      • Best results patterns
          – Part or serial number queries
      • Tuned boosting
          – Feedback on clicks and other signals
          – Freshness
      • de-duplication!
  • 38. Reference Architecture
      [Diagram labels: Content sources, Connectors, Indexes, Semantics, Text Mining, Quality Metrics, Aspire Content Processing Pipelines, Aspire (repeated), Big Data Framework, Big Data Array, Indexes, QPL, Search Engine, Web Browser, Staging Repository]