See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
"I know it when I see it".
This phrase was coined by a Supreme Court Justice in reference to obscenity, but he might as well have been talking about relevancy and search engine results. Testing search engines is rarely a binary process of "it works, it doesn't work"; instead it draws on our human skills to design tests that capture the intangibles that make up a great search engine implementation! The behavior of a search engine changes as the data changes, so a search that returns one set of results today may return a different set tomorrow. Is that a bug? Or just a finely tuned search engine responding to changes in the data it searches? Search engine testing often focuses on the very first layer of functionality, "Do I get results?", without digging deeper into "Do I get great relevant results?".
Better Search Engine Testing - Eric Pugh
BETTER SEARCH ENGINE TESTING
STPCON 2011 | EPUGH@O19S.COM | @DEP4B
WHY AM I QUALIFIED TO BE UP HERE?
• President of OpenSource Connections
• Contributor to CruiseControl and Continuum CI projects
• Member of Apache Software Foundation
• Presenter at conferences (OSCON, ApacheCON, jTDS, ExpoQA, STPcon 2009!)
AGENDA
Why is Search Becoming More Important?
What is a Search Engine?
Techniques for Testing
Wrap Up
WHY IS SEARCH BECOMING MORE IMPORTANT?
INFORMATION IS EXPLODING
"information workers ... are each bombarded with 1.6 gigabytes of information on average every day through emails, reports, blogs, text messages, calls and more".
• http://online.wsj.com/article/SB124252211780027326.html
UNSTRUCTURED
• emails, spreadsheets, documents, presentations, images, databases
• 75% unstructured to 25% structured
MANAGING DATA IS EXPENSIVE
• 1 GB costs $0.20 to store
• 1 GB costs $3500 to manage
WHAT DOES $3500 BUY YOU?
• 69% of respondents felt 50% or less of data could be found online
• Knowledge workers spend 25% of their time engaged in search-related activities.
WHY NOT JUST USE GOOGLE
• We don't want 44 million results, we want 1
• We want "the" answer, not "an" answer
• We tolerate inefficiency in Internet search
• We are happy to "satisfice"
As John Allen Paulos puts it: "The Internet is the world's largest library. It's just that all the books are on the floor."
WHAT IS A SEARCH ENGINE?
CONTENT INDEXING
• Creating an index by crawling the content directories, databases, and other repositories using an automated process (either pushing or pulling changes)
• Create an index, which is a searchable key to a collection.
• In Enterprise Search, the indexing mechanism should be able to access company private data (with access privileges maintained)
• Control the indexing schedule: being able to index rapidly changing content quickly, other content more slowly.
• rather than having the bot look for the data.
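To make "a searchable key to a collection" concrete, here is a minimal sketch of a toy inverted index. It is only an illustration: the function names, the whitespace tokenizer, and the sample documents are assumptions made for this example, and a real engine such as Solr does far more (analysis, scoring, incremental updates).

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample collection.
docs = {
    1: "lamp shades for table lamps",
    2: "floor lamp with shade",
    3: "window shades",
}
index = build_index(docs)
print(search(index, "lamp shades"))
```

Even this toy version shows why content indexing needs its own tests: "shade" and "shades" are different keys here, so the results depend entirely on how the content was tokenized at index time.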
CONTENT INDEXING
• Indexing may also support
  • Metadata extraction
  • Auto-summarization, which is analysis of the collection and grouping of its content into categories or clusters.
• Metadata in turn becomes facets that can be used to tune the query to put emphasis on that category.
FORMATTING
FACETING
Faceted or "guided navigation" leverages metadata fields and values to provide users with visible options for narrowing or refining their query.
- Peter Morville, Search Patterns
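The idea of facets as "visible options" built from metadata can be sketched in a few lines. This is a hedged illustration only: the field names (`type`, `topic`), the sample results, and both helper functions are hypothetical and not taken from any particular search engine.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of a metadata field."""
    return Counter(doc[field] for doc in results if field in doc)


def refine(results, field, value):
    """Narrow a result list to documents whose field equals the chosen value."""
    return [doc for doc in results if doc.get(field) == value]


# Hypothetical result set with two metadata fields.
results = [
    {"id": 1, "type": "article", "topic": "sleep"},
    {"id": 2, "type": "forum post", "topic": "sleep"},
    {"id": 3, "type": "article", "topic": "feeding"},
]

print(facet_counts(results, "type"))
print([doc["id"] for doc in refine(results, "topic", "sleep")])
```

A useful test here is that the count displayed next to a facet value matches the number of results the user actually gets after clicking it.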
Search Stack
User Interface
Search Engine
Data
HOW DO WE TEST?
• Querying
• Formatting
• Content Indexing
• Performance
WHO SHOULD TEST?
CHALLENGES
• Competing business stakeholders:
  • Tester: When I search for "lamp shades", I used to see these documents, now I see a different set.
  • Business Owner: How do I know that the new search engine is better?
  • User: My pet feature "search within these results" works differently.
  • Marketing Guy: I want to control the results so the current marketing push for toilet paper brand X always shows up at the top.
CHALLENGES
• Stakeholders want a better search implementation, but perversely often want it to all work "the exact same way"! Getting agreement across all the stakeholders on the project vision, and agreeing on the metrics, is a challenge.
PERFECT SEARCH TESTER WOULD BE ALL OF
• Mathematician
• Librarian
• UX Expert
• Writer
• Programmer
• Business Analyst
• Systems Engineer
• Geographer!
• Psychologist
KNOWLEDGE TRANSFER
• If you don't have the perfect team already, bring in experts and do domain knowledge transfer.
• Learn the vocabulary of search to better communicate together
  • "auto complete" vs "auto suggest"
• Do "Search for Content Team" brownbag sessions!
QUERY TESTING
• Often called "relevancy testing"
TWO SCHOOLS OF THOUGHT
• "One True Answer"
• "I know it when I see it"
"ONE TRUE ANSWER"
• Absolute Truth / Matrix / Grid / TREC / Relevancy Assertions
• The correct answers for each search are known ahead of time
• Human judges often decide these correct answers, stored as Relevancy Assertions
• Can be labor intensive to set up
• A "Numerical Grade" is produced for comparison
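One way to picture stored relevancy assertions turning into a numerical grade is the sketch below. Everything in it is an assumption made for illustration: the `grade` function, the scoring rule (share of asserted documents found in the top `k`, averaged over queries), and the canned `fake_engine` standing in for a real search call.

```python
def grade(assertions, run_search, k=10):
    """Average, over all queries, the share of asserted docs found in the top k.

    assertions: dict mapping a query to the non-empty set of doc ids
                that human judges decided are correct for it.
    run_search: callable taking a query and returning a ranked list of doc ids.
    """
    scores = []
    for query, expected in assertions.items():
        top = set(run_search(query)[:k])
        scores.append(len(top & expected) / len(expected))
    return sum(scores) / len(scores)


# Hypothetical judgments and a canned engine used only for this example.
assertions = {
    "lamp shades": {"doc1", "doc4"},
    "floor lamp": {"doc2"},
}


def fake_engine(query):
    canned = {
        "lamp shades": ["doc1", "doc3", "doc2"],
        "floor lamp": ["doc2", "doc1"],
    }
    return canned.get(query, [])


print(grade(assertions, fake_engine))
```

Running the same assertions against engine v1 and engine v2 gives two grades that can be compared directly, which is the appeal of this school of thought; the cost is the human effort needed to build and maintain the judgments.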
PROBLEMS WITH THIS APPROACH
• Open to gaming. The TREC competition is swamped by "academic" search engine efforts that don't work in the real world.
• Need a well understood data set with generally accepted answers.
• Is it better to have an engine that gives modestly relevant results almost all the time, or an engine that gives really good answers sometimes, better on average than the other engine, but sometimes gives back complete garbage?
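The "steady versus spiky" question can be made visible by reporting more than one number per engine. The sketch below is an illustration with made-up per-query scores; the `summarize` helper and both score lists are hypothetical.

```python
from statistics import mean


def summarize(per_query_scores):
    """Return (average score, worst score) across a set of test queries."""
    return mean(per_query_scores), min(per_query_scores)


# Made-up per-query grades on a 0.0 to 1.0 scale.
steady = [0.6, 0.6, 0.6, 0.6, 0.6]  # modestly relevant almost all the time
spiky = [1.0, 1.0, 1.0, 0.5, 0.0]   # great sometimes, garbage sometimes

for name, scores in (("steady", steady), ("spiky", spiky)):
    average, worst = summarize(scores)
    print(name, "average:", round(average, 2), "worst:", worst)
```

In this example the spiky engine wins on average but loses on worst case, so a single averaged grade would hide exactly the trade-off the stakeholders need to decide on.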
A/B TESTING
Engine version 1 and version 2!
• Tracks explicit or implicit preferences between engines A/B
• Often dispenses with the notion of the "correct" answer
• Can be easier to set up, but some fear the best answers will be missed by both engines
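Tracking preferences between two engines can start as a simple tally. The sketch below assumes each tester (or each click session) yields one vote of "A", "B", or "tie"; that vote format, the `preference_share` function, and the sample votes are all hypothetical.

```python
from collections import Counter


def preference_share(votes):
    """Return the share of decided (non-tie) votes won by each engine."""
    counts = Counter(vote for vote in votes if vote in ("A", "B"))
    decided = counts["A"] + counts["B"]
    if decided == 0:
        return {"A": 0.0, "B": 0.0}
    return {"A": counts["A"] / decided, "B": counts["B"] / decided}


# Hypothetical votes collected from side-by-side comparisons.
votes = ["A", "B", "B", "tie", "B", "A", "B"]
print(preference_share(votes))
```

Note that nothing here knows what the "correct" answer is: if both engines miss the best document, the tally cannot reveal it, which is the fear mentioned above.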
RELEVANCY
• Do we have any defined relevancy metrics?
• Relevancy is like porn.....
I KNOW IT WHEN I SEE IT!
http://en.wikipedia.org/wiki/Les_Amants
BEYOND PRECISION AND RECALL: HOW ENGINES ARE
• Binary vs. Non-Binary Grading Systems
• Early TREC had binary judgements, only Yes/No on whether each doc was related to a test search
• More choices were later added
• A system can use letter grades (A, B, C, D and F) or numeric grades
• Another style asks testers to sort documents in their preferred order
CLASSIC MEASUREMENTS OF SEARCH RELEVANCY
• Recall: "Did I find all the documents I expected to get back? What percent?"
• Precision: "Did the system bring back other documents that weren't relevant? What percent were on target?"
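Both measurements reduce to set arithmetic once the expected documents are known. The following is a minimal sketch; the function name and the sample document ids are assumptions for illustration.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one query.

    precision: share of returned documents that were relevant.
    recall:    share of relevant documents that were returned.
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents came back, 3 were expected.
returned = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(returned, relevant)
print("precision:", precision, "recall:", round(recall, 3))
```

Here two of the four returned documents were on target (precision 0.5) and two of the three expected documents were found (recall about 0.667). Neither number says anything about the order in which the documents appeared.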
NEWER IDEAS
• Rank: The order of documents that were returned
• Generally a 1 in 20 match in the #1 spot is better than a 50% rate where all matches are on the second page.
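The slides do not name a specific rank-aware metric. One common choice, used here purely as an illustration, is reciprocal rank: 1 divided by the position of the first relevant result. The sketch below assumes 20 results at 10 per page, so "all matches on the second page" means positions 11 to 20.

```python
def reciprocal_rank(relevance_flags):
    """Return 1/position of the first relevant result, or 0.0 if there is none."""
    for position, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def hit_rate(relevance_flags):
    """Share of results that are relevant, ignoring their order."""
    return sum(relevance_flags) / len(relevance_flags)


# Assumed layout: 20 results, 10 per page.
top_hit = [True] + [False] * 19           # 1 in 20 relevant, in the #1 spot
second_page = [False] * 10 + [True] * 10  # 50% relevant, all on page two

print(hit_rate(top_hit), reciprocal_rank(top_hit))
print(hit_rate(second_page), round(reciprocal_rank(second_page), 3))
```

The order-blind hit rate prefers the second list (0.5 versus 0.05), while reciprocal rank prefers the first (1.0 versus roughly 0.091), matching the intuition that users rarely reach page two.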
INTERACTIVITY: WHAT NAVIGATORS OR VISUALIZATION WERE GIVEN
• Facets and sorting: Clickable filters and sort options
• Unsupervised Clustering: Related terms or phrases, or related searches
• Spelling and thesaurus suggestions
SUBJECT DISAMBIGUATION, SENTIMENT, CONFLICTING INFORMATION, CROWD HINTS
• kidney bean or kidney cell
• "best football team in the UK"
SOURCES OF VARIANCE, AKA "PROBLEMS"
Note: this is talking about comparing search engine A to search engine B. But I am thinking more in the context of search engine v1 to v2!
DIFFERENT GOALS
• Perfect/Human vs. Best vs. Acceptable vs. Better than X
• Constrained vs. Unconstrained Resources (time, CPU, storage)
SAMPLE SIZE
• Amount of Data
• Fixed set or growing over time
• Number of Testers (A/B or Relevancy Judgments)
• Number of Searches
VERTICAL VS. HORIZONTAL CONTENT
• One extreme: Specific demo may cover just one discipline, for example Medical Journals
• Other extreme: Internet covers vastly disparate domains
USERS
• Experienced vs. New Searcher
• Subject Expert vs. Novice
• Spelling, typing and computer proficiency
• Interface Medium (large visual display, small text display, audible, Braille, etc)
• Amount of Effort to understand Search
• Willingness to Iterate
• Searching for specific answer vs. General Exploration
TYPE OF SEARCHES
• Length / 1 or 2 words
• Full question
• Sample text
• Internet Boolean
• Advanced Boolean / Syntax / Proximity
• Wildcard, Regex, etc.
PUNCTUATION
• Chemical
• Source Code
• Units of Measure
• Literal vs. Search Operator
NOT EVEN GETTING INTO MULTI LINGUAL SEARCH
• How do I test in languages I don't understand?
GROK YOUR RESULTS
Persona 1: Going to be a mom
"Oh my God I'm actually Pregnant!"
Needy
Self Introduction
• What's next? What am I supposed to do? Guidance please!
• To interact with people going through the same thing.
Scenarios that typify - planned to get pregnant, but hasn't done any research
Catch phrases - Nervous but excited, giddy, Where do I start?
Tag lines - Wants to share, has a million questions
Likely to say - Guide me, help me get off to a good start
Narrative
Hi all, I'm very new to this but i couldn't help but share my excitement. I have just found out today that i am pregnant. It wasn't planned, me and my partner of a year and a half were going to wait until we had our own place and were married first but it looks like we have done it the other way round.
My only concern is that i don't really know how my boyfriend feels about it. I know we need to discuss the options but i have really already made up my mind about what i want to do. There is so much to consider, money, a decent place to live, being ready but i know i am ready and have been for a long time ( I get extremely broody when i see my friends kids)
Should i just tell him how i feel or go with how he feels because i don't want to lose him. He is a loving partner who would stand by me through anything i just don't want him to feel like i am tying him down!!! I suppose i am feeling very happy but also very confused at the same time!!!
http://forum.sofeminine.co.uk/forum/maternite1/__f468_maternite1-Oh-my-god-i-m-pregnant.html
Persona 2: New Mom
"Are my kids sick or is this condition normal? How do I..."
Demanding
• How do I ensure my baby is latching on correctly?
• What type of stroller should I buy? What brand of car seat is best?
Scenarios that Typify
Likely to say - Are my kids sick or is this condition normal?
Describes herself - wants to be a good mother, looking for expert advice, wants to get ideas from other moms
Narrative - could be working mom, could be stay at home mom
Questions likely to ask - sometimes wants to ask questions/get expert advice
This picture captures my life perfectly: an adult beverage sitting on a book about underpants.
Narrative
I have been hearing about women who claim that thier 2, 2 1/2 or 3 year old is not ready for the potty. They claim its a nightmare and are waiting for their children to come around.
Maybe I grew up in the twilight zone, but I had always assumed that potty training was something that is just done. Its done when:
a) The child in question can sleep through the night and stay dry.
b) The child in question can speak to you, in full sentences. like, "apple juice, please" or "wanna go to the park" or "momma I wanna hold you..."
c) The child in question knows they are soiled and can ask to be changed.
Barring any of those things, a child is ready to be placed on the potty. using the potty was never negotiable in my family. When we hit the above milestones my mother trained us. We just did it. If we complained she never put diapers on us, she just kept directing us back to the potty.
Her methods of redirection may be controversial (she told my brother that unless he was a big boy he would not get a happy meal. Boys who pooed on themselves got sad meals... lol!!! He straightened up and started using the potty at 2 1/2) but she was never abusive or anything she just DIDNT ASK US. it was time to potty and that was it.
The reasoning was that I used to drink from a bottle, and sleep with my mother, and such, now I don't. I also used to crap my pants, and that is no longer allowed after a certain point.
My question is this: why ask children if they are ready to use the potty, after they are clearly ready to use it (with language tools and bladder control)? Why is it treated like something that is negotiable or that the child has a choice of either coming around to it or not? I understand that children are sensitive and you have to follow their lead, at times. But allowing them to shit
Scenario 1
Find old answer: "I know I went through this before with my first child, but cannot recall the answer"
Preamble
Experienced mom has a déjà vu moment about a previous problematic experience with her first child. She has a partial recollection of a piece of information related to the answer she seeks but she needs help in pulling
Success Factors
• Speed of Comprehension
• Directness to destination
• Reduced:
  • Number of queries
  • Number of results
• Indirect Knowledge Transfer
Thinking aloud in the Family Room
After hours of frustration mother home alone has a partial epiphany as to her child's problem.
"Hhhm I know I had the same issue with Josh, but what the heck did I do?" (wwwaaaaaaaa wwwwaaaaaa . . . ggg wwwaaaaaaaa . . .)
Mother starts to type in query but suggest-as-you-type search box hints to her to be more specific.
"Josh had not started to cry non-stop for 3 hours when it finally dawned on me that he had not had a movement for 3 days . . . Let's try querying that . . . 'no poop' . . . Not likely . . . Uumm . . . 'constipation'? Oh, might help to specify who as well . . . 'baby' . . ."
Structured results quickly tip off the mother to the assorted aspects of constipation. She focuses in on one of the aspects and has total recollection of her previous experience.
"Very nice - lists out related concerns for constipation. Let's see: 'symptoms', 'cures', 'when to call the doctor', 'what other moms are saying', 'topic overview'. Ok - I'll take 'cures' Alex for 300 points and my personal sanity! Water . . . fruit juice . . . high-fiber baby foods - Ahhh prune juice . . . prune juice! Now why didn't I remember that!"
Scenario 2
Urgent Question: "It's 2am and I don't know who to ask?"
Preamble
Mother of twins finds herself panicked in the early morning hours with a new situation.
Success Factors
• Speed of Comprehension
• Directness to answer
Crying in the Kitchen
In the middle of the night, a mother of twins finds herself alone, overwhelmed, and in dire need of an answer.
"Crap! Who am I supposed to at this hour! Why is it nobody is open when I need them?!" (wwwaaaaaaaa wwwwaaaaaa . . . ggg wwwaaaaaaaa . . .)
Mother starts to type in a query but notices the suggest-as-you-type search box lets her narrow her question, boosting her confidence she is going to get the answer she needs.
"I don't have to read hundreds of pages on the internet . . . I just need a quick concise answer . . . at what temperature do I need to be worried . . ?! Please [BabyCenter] show me the answer . . !"
The mother zooms in on the specific answer she seeks. But then she notices collateral knowledge she takes note of for later reading.
"'102' . . . thank god! We're safe. Ahhh . . . that's helpful - other conditions to know about . . . That's thorough: 'What will the doctor do?' Interesting: 'If fever is a defense against infection, is it really a good idea to try to bring it down?' Let me bookmark this for later."
CONTENT INDEXING TESTING
• Leverages our normal testing skills. And typically what it really means is "Performance Testing".
• Lots of "integration" testing.
PERFORMANCE TESTING
LEVELS OF SCALING
• Scale High
  • There is a quickly hit point of diminishing returns!
• Scale Wide
  • The safety valve for lots of load.
• Scale Deep
  • Scaling Deep? You are doing some crazy stuff with huge indexes!!
SCALE WIDE (SLAVES)
• Too many inbound queries!
• Slaves poll master for changes
• Index and config files transferred
• ALL JAVA!
SCALE WIDE (SHARDING)
• Too large of an index to query
• Split index over multiple Search servers
  • A -> M: Server 1, N -> Z: Server 2
  • uniqueId.hash % numServers
• Relevancy typically balanced shards
• Request split across shards, results aggregated to single response
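The `uniqueId.hash % numServers` rule and the split-then-aggregate request flow can be sketched as follows. This is not Solr's implementation: the shards are modeled as plain dictionaries, the `(score, doc_id)` hit format is an assumption, and `zlib.crc32` is swapped in as a stable hash because Python's built-in `hash()` for strings varies between processes.

```python
import zlib


def shard_for(unique_id, num_servers):
    """Pick a shard from a stable hash of the document id, modulo server count."""
    return zlib.crc32(unique_id.encode("utf-8")) % num_servers


def distributed_search(shards, query, limit=10):
    """Send the query to every shard, then merge hits by descending score.

    Each shard is modeled as a dict mapping a query string to a list of
    (score, doc_id) tuples; a real shard would be a remote search server.
    """
    hits = []
    for shard in shards:
        hits.extend(shard.get(query, []))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:limit]


# Hypothetical two-shard setup with canned results.
shards = [
    {"lamp": [(0.9, "a"), (0.2, "c")]},
    {"lamp": [(0.5, "b")]},
]
print(shard_for("doc-42", len(shards)))
print(distributed_search(shards, "lamp"))
```

For testing, two properties are worth checking: the same id must always land on the same shard, and the merged response must be ordered as if it came from one index. The second only holds if scores are comparable across shards, which is one reason balanced shards matter.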
SCALE DEEP
• Combine both scaling wide to handle number of queries with sharding to handle size of indexes!
WRAP UP
OSC APPROACH TO SEARCH
Methodology: User Interface, Search Engine, Data
Concurrent Streams of Work
• Operationalize Solr. Iteration 2 Story: Deploy Solr into BabyCenter Test Environment
• Search Analysis. Iteration 2 Story: Integrate Solr into Community UI, A/B Testing
• Search Experience. Iteration 2 Story: Conceptual Model (Personas, etc) & Mockups