Slides from Enterprise Search & Analytics Meetup @ Cisco Systems - http://www.meetup.com/Enterprise-Search-and-Analytics-Meetup/events/220742081/
Relevancy and Search Quality Analysis - By Mark David and Avi Rappoport
The Manifold Path to Search Quality
To achieve accurate search results, we must come to an understanding of the three pillars involved.
1. Understand your data
2. Understand your customers’ intent
3. Understand your search engine
The first path passes through Data Analysis and Text Processing.
The second passes through Query Processing, Log Analysis, and Result Presentation.
Everything learned from those explorations feeds into the final path of Relevancy Ranking.
Search quality is focused on end users finding what they want -- technical relevance is sometimes irrelevant! Working on the short head (the most frequent queries) offers the best return on investment for improving the search experience: tuning the results to, for example, emphasize recent documents or de-emphasize archived ones, detecting near-duplicates, exposing diverse results for ambiguous queries, applying synonyms, and guiding search via best bets and auto-suggest. Long-tail analysis can reveal user intent by detecting patterns, discovering related terms, and identifying the most fruitful results from aggregated behavior. All of this feeds back into regression testing, which provides reliable metrics for evaluating changes.
By merging these insights, you can improve the quality of the search overall, in a scalable and maintainable fashion.
The Manifold Path to Search Quality
Enterprise Search & Analytics Meetup
Mark David – Architect, Data Scientist
Avi Rappoport – Senior Search Quality Analyst
19 March 2015
Search Technologies: Who We Are
The leading independent IT services firm specializing in the design,
implementation, and management of enterprise search and big data
search solutions.
Solutions
Corporate Wide Search – “Google for the Enterprise.” A single, secure point of search for all
users and all content. Strategic initiative for corporate wide information distribution and search.
Data Warehouse Search – A Big Data search solution that enables interactive query and analytics
with extremely large data sets for business intelligence and fraud detection.
E-Commerce Search – Leverages machine learning and accuracy metrics to deliver a better
online user experience and maximize revenues from visitor search activity.
Search & Match – Increases recruiter productivity and fill rates in the staffing industry. Provides a
better search experience followed by automated candidate-to-job matching.
Search for Media & Publishing – Improves the user search experience for publishers of large amounts
of content, such as government organizations, research firms, and media publications.
Government Search – A solution focused on design and development search for government
information portals or archiving systems.
Search Technologies: Background
San Diego
London UK
San Jose, CR
Cincinnati
Prague, CZ
Washington
(HQ)
Frankfurt DE
• Founded 2005
• 150+ employees
• 600+ customers worldwide
• Deep enterprise search expertise
• Consistent revenue growth
• Consistent profitability
Search Technologies: What We Do
• All aspects of search application implementation
– Content access and processing, search system architecture, configuration, deployment
– Accuracy analysis, metrics, engine scoring, relevancy ranking, query enhancement
– User interface, analytics, visualization
• Technology assets to support implementation
– Aspire high performance content processing
– Content Connectors (Document, Jive, SharePoint, Salesforce, Box.com, etc.)
• Engagement models
– Most projects start with an “assessment”
– Fully project-managed solutions, designed, delivered, and supported
– Experts for hire, supporting in-house teams or as a subcontractor
Search Engine and Big Data Expertise
Our Technology and Integration Partners
Understand Your Data
• Data Analysis
– Access patterns & rates, sources, schemas, field typing,
duplicates, near-duplicates, term frequencies, etc.
• Content Processing
– Source connection, format conversion, sub-document
separation, field boundaries, multiple-source assembly, etc.
• Text Processing
– Character decoding, tag stripping, tokenization, sentence
boundaries, normalization, entity extraction, pattern
recognition, disambiguation, filtering, etc.
Understand Your Search Engine
• How does it score results?
• How accurate is it for the short head?
• How accurate is it for the long tail?
• When you change it to improve a particular type of query,
how do you know that the overall accuracy improved?
Regression Testing of Search
• Step 1: Gather a Set of Judgments
• If you already have lots of user data:
– Use click log analysis to gather sets of clearly good and clearly
bad results
– Ignore unclear tracks
• If user data not yet available:
– Manual judgments
• End up with a set of queries with associated “good” and
“bad” documents
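The judgment-gathering step above can be sketched in Python. This is a minimal illustration under assumed conditions: a hypothetical click-log format of `(query, doc_id, clicked, dwell_seconds)` tuples, and made-up thresholds (a click with 30+ seconds of dwell counts as a good signal; fewer than two events is "unclear" and skipped):

```python
from collections import defaultdict

# Hypothetical click-log records: (query, doc_id, clicked, dwell_seconds)
log = [
    ("reset password", "kb-101", True, 95),
    ("reset password", "kb-101", True, 120),
    ("reset password", "ad-007", True, 2),
    ("reset password", "kb-233", False, 0),
]

def build_judgments(log, good_dwell=30, min_events=2):
    """Label documents 'good' or 'bad' per query; skip unclear tracks."""
    stats = defaultdict(lambda: [0, 0])  # (query, doc) -> [good_clicks, events]
    for query, doc, clicked, dwell in log:
        stats[(query, doc)][1] += 1
        if clicked and dwell >= good_dwell:
            stats[(query, doc)][0] += 1
    judgments = defaultdict(dict)
    for (query, doc), (good, total) in stats.items():
        if total < min_events:
            continue  # too little evidence -- ignore unclear tracks
        ratio = good / total
        if ratio >= 0.8:
            judgments[query][doc] = "good"
        elif ratio <= 0.2:
            judgments[query][doc] = "bad"
        # anything in between stays unjudged
    return dict(judgments)

judgments = build_judgments(log)
```

Here only `kb-101` earns a clear "good" label; the single-event clicks are discarded as unclear rather than guessed at.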
Regression Testing of Search
• Step 2: Instrument the Search Results
• Periodically execute all those queries, and score the results
• How to score:
– Every good document adds a position-based amount
– Every bad document subtracts the same amount
– Unknown documents don’t affect the score (except by
occupying a position)
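The scoring rule above might look like this in Python. The `1/position` weighting is an assumption for illustration (any decreasing position weight, such as the discount in DCG, would fit the same scheme):

```python
def score_results(results, judgments, depth=10):
    """Good docs add a position-weighted amount, bad docs subtract the
    same amount, and unknown docs contribute nothing -- except that
    they occupy a position a good doc could have held."""
    score = 0.0
    for pos, doc in enumerate(results[:depth], start=1):
        weight = 1.0 / pos  # assumed weighting: top positions matter most
        label = judgments.get(doc)
        if label == "good":
            score += weight
        elif label == "bad":
            score -= weight
    return score

judgments = {"kb-101": "good", "ad-007": "bad"}
score = score_results(["kb-101", "xx-999", "ad-007"], judgments)
```

Re-running all judged queries periodically and tracking this score over time is what turns the judgment set into a regression test.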
Relevancy Improvements from Data
• Text Processing
– Typos
– Entity Extraction
– Breaks
– Parts of Speech
• Data Analysis
– TF-IDF
– Phrase Dictionary
– Boilerplate
To Correct or Not To Correct
• Should typos be “fixed”?
• This goes back to knowing your audience
• Example: Haircutz
• In document-to-document situations, generally yes.
Bigger Needles in the Haystack
• Entity Extraction: How big a chunk?
• Example: mdavid@searchtechnologies.com
– Is that 1, 2, 3, 4, or 5 tokens?
• Multi-indexing is a key component of accuracy
– Different people think differently, so the indexes need to have
different ways of representing the data.
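A minimal sketch of multi-indexing the e-mail example: emit the address as one token, as its local part and domain, and as its finest-grained pieces, so that a search for any of those forms can match. The splitting rules here are illustrative, not a prescribed tokenizer:

```python
import re

def email_token_variants(text):
    """Index an e-mail address several ways so that different
    search habits all find it."""
    variants = {text}                         # 1 token: the whole address
    local, _, domain = text.partition("@")
    variants.add(local)                       # the user name alone
    variants.add(domain)                      # the domain alone
    variants.update(re.split(r"[.@]", text))  # finest split on '.' and '@'
    return variants

variants = email_token_variants("mdavid@searchtechnologies.com")
```

For `mdavid@searchtechnologies.com` this yields five indexed forms, from the full address down to `mdavid`, `searchtechnologies`, and `com`.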
Breaker, Breaker
• Don’t match across boundaries
– Paragraph
– Sentence
– Phrase
• Whitespace does have meaning!
• Punctuation does have meaning!
Parts is Parts
• Figuring out the part of speech (noun vs. verb vs. adjective)
would seem to clearly help
– We avoid matching on the incorrect version
• Study after study shows that it does not!
• Why not?
– Closely related (in English)
• Example: to go on a run
– Prevalence of noun phrases in the group of “important” terms
How Common Are Terms (Not Tokens)?
• Term Frequency (not “Token Frequency”)
– Example: The West, West London, The Wild West
• Do your full text processing when you’re gathering statistics
– And adjust it and re-run it when the data changes
• Inverse Document Frequency
– In how many docs does this term occur?
– NOT: How many times does this term occur across all docs?
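The distinction in the last bullet is easy to get wrong, so here is a tiny worked example (with a made-up three-document corpus). Document frequency counts *documents containing* the term, not total occurrences:

```python
import math

docs = [
    ["the", "west"],
    ["west", "london"],
    ["the", "wild", "west", "west"],
]

# Document frequency: in how many docs does "west" occur?
df = sum(1 for doc in docs if "west" in doc)                 # 3 of 3 docs

# NOT this: how many times it occurs across all docs
total_occurrences = sum(doc.count("west") for doc in docs)   # 4 occurrences

# Inverse document frequency (simplest form): a term that appears
# in every document gets zero weight
idf = math.log(len(docs) / df)
```

Counting occurrences (4) instead of documents (3) would quietly over-weight terms that repeat within a single document.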
Let Me Re-Phrase That
• Some general dictionaries are freely available
– Example: locations (geonames.org)
• Others can be derived
– Example: Company names from stock markets, business
registries, Wikipedia, etc.
• More useful are terms from your industry
– Can you think of lists that are available internally?
– Example: Job titles in a recruiting company
• Most useful are terms from your data
– Statistical generation of common 2-shingles and 3-shingles
– Query log analysis
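Statistical shingle generation reduces to counting word n-grams across the corpus; frequent shingles are phrase-dictionary candidates. A minimal sketch, with a toy two-document corpus standing in for real data:

```python
from collections import Counter

def shingles(tokens, n):
    """All runs of n consecutive tokens (word n-grams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = [
    "senior software engineer wanted".split(),
    "software engineer position open".split(),
]

counts = Counter()
for doc in corpus:
    counts.update(shingles(doc, 2))  # repeat with n=3 for 3-shingles

# "software engineer" appears in both docs -> phrase candidate
top_phrase, top_count = counts.most_common(1)[0]
```

In practice you would threshold the counts (and filter stopword-only shingles) before promoting anything to the phrase dictionary.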
Lorem ipsum…
• Boilerplate text recognition
• Pre-process:
– Simple text processing this time
– Split by paragraphs
– Calculate hash signatures for paragraphs
– Count occurrences
• Find the cliff
• Filter out early in the main pipeline
– Early steps must match the entire pre-processing pipeline
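The pre-processing pass above can be sketched as follows. The normalization (whitespace collapse, lowercasing) and SHA-1 signatures are illustrative choices; what matters is that the main pipeline applies the identical steps before filtering:

```python
import hashlib
from collections import Counter

def paragraph_signatures(docs):
    """Hash every paragraph and count how often each signature repeats."""
    counts = Counter()
    for text in docs:
        for para in text.split("\n\n"):
            # Simple text processing -- must match the main pipeline exactly
            norm = " ".join(para.split()).lower()
            if norm:
                counts[hashlib.sha1(norm.encode()).hexdigest()] += 1
    return counts

docs = [
    "Welcome to ACME.\n\nCopyright 2015 ACME Corp.",
    "Product specs here.\n\nCopyright 2015 ACME Corp.",
    "Contact us.\n\nCopyright 2015 ACME Corp.",
]
counts = paragraph_signatures(docs)
```

Sorting the counts reveals the "cliff": here the copyright paragraph's signature occurs 3 times while every real paragraph occurs once, so signatures above the cliff are filtered out early in the main pipeline.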
Search Quality
• Best possible results
– Given the searchable data
– For the primary users and their primary tasks
• Simple query term matching - relevance
• And beyond
– Enriched content
– Query enhancement
• Results presentation
– Clarity
– Context
Short Head & Long Tail
• Query Frequency
– Short Head
• A few frequent queries
– Short Middle
• Often up to 50% of traffic
– Long Tail
• Rare to unique queries
• Can be up to 75% distinct queries
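A head/tail breakdown is just a frequency table over the query log. This toy example uses an invented 100-query log; the proportions are illustrative, not data from a real system:

```python
from collections import Counter

# Hypothetical query log: 100 queries total
queries = ["login"] * 40 + ["pricing"] * 25 + ["reset password"] * 15 \
        + [f"rare query {i}" for i in range(20)]

freq = Counter(queries)
ranked = freq.most_common()
total = sum(freq.values())

head = ranked[:3]                                      # short head: 3 distinct queries...
head_share = sum(count for _, count in head) / total   # ...cover 80% of traffic

# Long tail: queries seen exactly once, as a share of distinct queries
tail_distinct = sum(1 for _, count in ranked if count == 1) / len(ranked)
```

Three distinct queries account for 80% of this toy log's traffic, while singleton queries make up about 87% of its distinct queries, which is why short-head tuning pays off fastest and tail analysis needs aggregation.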
But What Do They Really Want?
• Query log reports show what users think they’re looking for
– Domain research for more about why
• Behavior shows more about whether they’re finding it
– Session ending
• Frequent for zero matches
– CTR - click-through rate
• Results (with bounce rates)
– Query refinement
• Typing, facets
• Navigation via search
You say “tomay-toe”
• Users' vocabulary is not the content vocabulary
– Consistent problems from small to web-scale search
• Create synonyms
• Scalable automated disambiguation
– Data analysis
• Using dictionaries and co-occurrence
– Search log behavior analysis
• Query refinement and reformulation, click tracks
– Language ambiguity - even Netflix has a hard time past 85%
– Human domain expertise, editorial oversight
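One way to mine synonym candidates from search-log behavior is to look at query reformulations within a session: when users retype a query with one word swapped, the swapped pair is a candidate. This is a minimal sketch over an invented session log, and real systems would add thresholds and editorial review before activating anything:

```python
from collections import Counter

# Hypothetical session logs: consecutive queries within one session
sessions = [
    ["cv search", "resume search"],
    ["cv template", "resume template"],
    ["holiday pay", "vacation pay"],
]

pairs = Counter()
for session in sessions:
    for earlier, later in zip(session, session[1:]):
        a, b = earlier.split(), later.split()
        if len(a) != len(b):
            continue  # only count simple one-for-one reformulations
        for wa, wb in zip(a, b):
            if wa != wb:
                pairs[(wa, wb)] += 1

# ("cv", "resume") surfaces twice -> synonym candidate for human review
```

Pairs that recur across many sessions (here `cv` → `resume`) go to a domain expert for confirmation; this is where the human editorial oversight from the slide comes in.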
Scope (aka, this is not Google)
• User confusion
– Is this a location box?
– Is it Google?
• Design for clarity
– UI and graphic design
– Watch out for defaulting to sub-scope searches
• Improve content coverage
• Add Best Bets for internal and external locations
• Link to other search engines
• Federate search
Best Bang for the Buck
• Concentrate on the short head
– Top 10% by traffic
• Simple relevance test
– Perform query
– Evaluate results
• Are there any results?
• Are they the most useful available? (Domain expertise)
• Validate against user behavior
– Store judgments
– Easy fixes
– Re-test (easy to miss this)
Context and Navigation
• Facets
• Results grouping / diversity
– Options for ambiguous queries
• Integrate with collaboration tools
– Allow user comments, reviews
Relevance and Ranking
• Best results patterns
– Part or serial number queries
• Tuned boosting
– Feedback on clicks and other signals
– Freshness
• De-duplication!