5. EVALUATING SEARCH IS HARD
What does a top-notch research team at Microsoft (and an energetic intern) do when faced with the complex problem of evaluating search satisfaction? Machine learning.
Fox, S., Karnawat, K., Mydland, M., Dumais, S.T., & White, T. (2005). Evaluating implicit measures to improve the search experience. ACM Transactions on Information Systems, 23(2): 147-168.
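A minimal sketch of the idea behind that line of work, assuming a hypothetical session schema (the features, labels, and data below are illustrative, not Fox et al.'s actual setup): learn to predict explicit satisfaction labels from implicit behavioral signals.

```python
# Sketch: learn a satisfaction metric from implicit measures.
# Features and labels are hypothetical, not Fox et al.'s schema.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per search session:
# [dwell_seconds, result_clicks, scroll_depth, reformulations]
X = np.array([
    [45.0, 2, 0.8, 0],
    [ 3.0, 0, 0.1, 3],
    [60.0, 1, 0.9, 1],
    [ 4.0, 4, 0.2, 4],
    [30.0, 1, 0.7, 0],
    [ 6.0, 0, 0.2, 2],
])
y = np.array([1, 0, 1, 0, 1, 0])  # explicit "satisfied?" survey labels

metric = LogisticRegression(max_iter=1000).fit(X, y)

# Score an unlabeled session with the learned metric.
print(metric.predict_proba([[20.0, 1, 0.5, 1]])[0][1])
```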
6. IF THE REAL WORLD IS TOO MESSY, FAKE IT.
Examples:
• Human labeled query-result pairs
• Crowd-sourced classification tasks
• Intercept surveys to match user outcomes with log traces (sketched below)
BUT THE REAL STUFF IS A LOT MORE FUN, AND HAS BROADER UTILITY. IT IS MESSY.
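For the intercept-survey example above, a hedged sketch of the matching step, with invented column names: join survey outcomes to log traces on a shared session ID, and the labeled traces become training data for implicit metrics.

```python
# Sketch: join intercept-survey outcomes to log traces by session id.
# Column names and data are hypothetical.
import pandas as pd

logs = pd.DataFrame({
    "session_id": ["s1", "s2", "s3"],
    "queries": [3, 1, 5],
    "clicks": [2, 0, 1],
})
survey = pd.DataFrame({
    "session_id": ["s1", "s3"],
    "satisfied": [True, False],
})

# Inner join keeps only sessions with a survey response;
# those labeled traces are the "faked" ground truth.
labeled = logs.merge(survey, on="session_id", how="inner")
print(labeled)
```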
10. THE SKI JUMP ILLUSTRATES THE “BOUNDS”
Users click more on the last position (or row). Why? Why, oh why?
People are making a locally rational decision between the last set of results and the Next button. The cost of Next is typically very high.
The ski jump has also been reported by SLI Systems, Jakob Nielsen, and Lou Rosenfeld.
13. LARGER IMAGES AT EBAY: 160PX -> 225PX (UP FROM 80PX -> 160PX IN 2011)
(Before and after screenshots.)
14. LARGER IMAGES: OBVIOUS WIN… NON-OBVIOUS MECHANISM.
(Funnel: Search → View Item → Buy / Watch / Bid, etc.)
Total searches, searches with clicks, and item views decreased. Refinement went up. The items that were clicked were much more likely to convert.
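A sketch of how those funnel metrics might be computed per experiment variant from an event log; the event names and columns here are assumptions, not eBay's schema.

```python
# Sketch: per-variant funnel metrics (searches, item views, buys).
# Event names and data are hypothetical.
import pandas as pd

events = pd.DataFrame({
    "variant":    ["control", "control", "larger", "larger", "larger"],
    "session_id": ["a", "a", "b", "b", "b"],
    "event":      ["search", "item_view", "search", "item_view", "buy"],
})

# Count events per variant, one column per funnel stage.
funnel = events.pivot_table(index="variant", columns="event",
                            values="session_id", aggfunc="count",
                            fill_value=0)
funnel["views_per_search"] = funnel["item_view"] / funnel["search"]
funnel["buys_per_view"] = funnel["buy"] / funnel["item_view"]
print(funnel)
```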
15. USERS SPENT 200MS MORE PER ITEM
(Chart: Time to First Click by Rank, Larger Images vs. Paired Controls. X-axis: Result Rank of First Click; Y-axis: Time.)
The time cost of visiting an item means it’s advantageous to spend more time evaluating results on the search page.
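A sketch of the computation behind that chart, under an assumed log schema with one row per search and its first click:

```python
# Sketch: median time to first click, grouped by rank and variant.
# Schema and numbers are hypothetical.
import pandas as pd

clicks = pd.DataFrame({
    "variant":           ["control", "larger", "control", "larger"],
    "first_click_rank":  [1, 1, 3, 3],
    "ms_to_first_click": [1800, 2100, 4200, 4900],
})

# One column per variant, indexed by rank: the curve in the chart.
curve = (clicks
         .groupby(["variant", "first_click_rank"])["ms_to_first_click"]
         .median()
         .unstack("variant"))
print(curve)
```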
16. MICRO-ECONOMICS + COGNITIVE MODEL
(Flow diagram: Page Loads → Find Results → loop of Locate Next Result → Evaluate Utility → Click?, ending in Engage or Refine.)
Orienting References:
Azzopardi, L. (2014). Modeling interaction with economic models of search. Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval.
Fu, W.-T., & Pirolli, P. (2007). SNIF-ACT: A cognitive model of user navigation on the World Wide Web. Human-Computer Interaction, 22(4): 355-412.
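A toy simulation in the spirit of these models (invented parameters and decision rule, not Azzopardi's or SNIF-ACT's actual formulation): the user evaluates results in rank order while utility beats the evaluation cost, and the expensive Next button makes marginal last-position results worth clicking, which reproduces the ski jump at the bottom of the page.

```python
# Toy sketch of the micro-economic decision loop.
# All costs and utilities are invented for illustration.
import random

EVAL_COST = 1.0    # cost of evaluating one more result
NEXT_COST = 8.0    # cost of paginating: Next is expensive
PAGE_SIZE = 10

def session(relevance):
    """Walk a result page; click when utility minus cost is positive."""
    for rank, utility in enumerate(relevance, start=1):
        if utility - EVAL_COST > 0:
            return ("click", rank)
    # End of page: the last result competes with the Next button,
    # so a marginal bottom result gets clicked anyway.
    if relevance[-1] > EVAL_COST - NEXT_COST:
        return ("click", PAGE_SIZE)
    return ("next", None)

random.seed(0)
pages = [[random.gauss(0, 1.5) for _ in range(PAGE_SIZE)]
         for _ in range(10000)]
clicks = [r for kind, r in map(session, pages) if kind == "click"]
for rank in range(1, PAGE_SIZE + 1):
    print(rank, clicks.count(rank))  # note the bump at the last rank
```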
17. EXPANDING LEARNING FROM A/B TESTS
This type of user interaction model supports learning from experiments over time by:
• Resolving seemingly incompatible results, e.g., conversion went up but searches with clicks went down
• Integrating results of multiple experiments for better hypotheses and less organizational memory loss
MODELS CAN ALSO PINPOINT WHERE COGNITION IS APPLIED (AND WHERE IT’S NOT)
18. MACHINE LEARNING WORKS BEST WHEN IT LEARNS FROM STRONG COGNITION
The process of assessing a result set, finding it insufficient, and generating the next refinement is one of the richest intellectual actions online at scale. Hence, we see very useful “related queries” across the web.
(Screenshot: Google’s related searches at the bottom of the SRP.)
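The simplest sketch of harvesting that cognition, with invented session data: count in-session query-to-query refinements and surface the most frequent successors as related queries.

```python
# Sketch: mine "related queries" from in-session refinements.
# Sessions are invented; real systems add time windows and filtering.
from collections import Counter, defaultdict

sessions = [
    ["ski jacket", "ski jacket waterproof", "patagonia ski jacket"],
    ["ski jacket", "ski jacket mens"],
    ["ski jacket", "ski jacket waterproof"],
]

# refinements[q] counts which queries users issued right after q.
refinements = defaultdict(Counter)
for queries in sessions:
    for q, q_next in zip(queries, queries[1:]):
        refinements[q][q_next] += 1

print(refinements["ski jacket"].most_common(2))
# [('ski jacket waterproof', 2), ('ski jacket mens', 1)]
```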
22. INTRODUCING CLICK SENSE
Mouse-down to mouse-up latency is an indicator of cognitive load. The core range is 50-120ms, based on an n=200 crowdsourcing study.
Political affinity questions (e.g., “I voted”) significantly predict up-down latency. Political enthusiasts had longer up-down latencies on political fact questions than non-enthusiasts, suggesting greater cognitive load.
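A sketch of the core Click Sense measurement, with invented timestamps: pair each mouse-down with the following mouse-up and compare the latency distributions across groups.

```python
# Sketch: compare mouse-down -> mouse-up latencies across groups.
# Timestamps (ms) are invented illustrations.
from statistics import median

def up_down_latencies(events):
    """Pair each mousedown with the following mouseup."""
    downs = [t for kind, t in events if kind == "down"]
    ups = [t for kind, t in events if kind == "up"]
    return [u - d for d, u in zip(downs, ups)]

enthusiasts = [("down", 0), ("up", 140), ("down", 500), ("up", 630)]
others      = [("down", 0), ("up",  70), ("down", 500), ("up", 590)]

print("enthusiasts:", median(up_down_latencies(enthusiasts)), "ms")
print("others:     ", median(up_down_latencies(others)), "ms")
```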
24. YOU GET OUT WHAT YOU PUT IN.
• Typically you don’t quite understand the problem.
• That’s why you’re doing machine learning, right?
• You can get better at that.
• You can at least be confident that you’re capturing:
  • Things that matter
  • The intelligence leaks
  • Rich data sources, be it human cognition or highly predictive elements of your ecosystem
25. THE SCORECARD IS IN:
Data quality trumps data volume, which trumps machine learning methods.
• Via Lukas Biewald, CEO @ CrowdFlower
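One hedged way to sanity-check that ordering on a problem of your own: hold the model fixed and trade label noise against training-set size. Entirely synthetic; the point is the shape of the experiment, not these specific numbers.

```python
# Sketch: data quality vs. data volume, model held fixed.
# Synthetic data; rerun on your own problem to see which hurts more.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_test, y_test = X[2000:], y[2000:]

def fit_score(X_tr, y_tr):
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model.score(X_test, y_test)

# Small but clean vs. large but noisy (20% of labels flipped).
rng = np.random.default_rng(0)
y_noisy = y[:2000] ^ (rng.random(2000) < 0.2).astype(int)
print("small, clean:", fit_score(X[:500], y[:500]))
print("large, noisy:", fit_score(X[:2000], y_noisy))
```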
26. EVEN WHEN BUSINESS GOALS AND USER GOALS ALIGN
Humans are boundedly rational, capacity-limited, and surrounded by a complex world.
Cognition is also alternately fast & parallel and slow & serial. It’s the fast part that will get you into trouble.
Example:
• Optimizing for sellability in e-commerce can jeopardize relevance
• Users’ expectations are informed by fast, parallel “information scent” impressions
28. WAYS TO LEARN
• Analytic Inquiry
  • Problem identification & severity estimation
  • Opportunity projection
  • Characterization
• A/B Testing
  • A/B
  • Small factorial
  • Parameter fitting
• Machine Learning
  • Data is code
  • Machine-learned metrics
29. WAYS NOT TO FORGET
• Big decks of stats & graphs
  • With much care & love, these can be required reading for an organization
• Reproducible research approaches (e.g., R Markdown)
• Continuous engagement with the user
  • Via UX research, inline feedback, dashboards & goals
• Build theoretical models
  • Integrate findings across experiments
• Build machine-learned models
  • Specify in code the target, the factors, and the subsequent computation
The team was formed in 2009. I joined in Q4, and we delivered like a rocket ship for a few years. The Better Search wins came from strong machine learning, strong evaluation, and an ever-deepening understanding of the marketplace.
On one hand, human intelligence is leaky… it seeps out into the world pervasively. It’s also highly varied, given the goals and bounded rationality of humans.
If you assume your users are rational agents, explaining their behavior becomes more tractable.