Many online systems, such as recommender systems or ad systems, are increasingly used in societally critical domains such as education, healthcare, finance and governance. A natural question is how effective they are, which is often measured using observational metrics. However, such metrics hide the cause-and-effect processes that connect these systems, people's behavior and outcomes. I will present a causal framework that allows us to tackle questions about the effects of algorithmic systems, and demonstrate its use by evaluating Amazon's recommender system and a major search engine. I will also discuss how such evaluations can lead to metrics for designing better systems.
3. Are these systems performing well?
How do these systems affect people's behavior?
How can we make these systems better?
Better for whom?
4. Evaluating systems = estimating causal effects
Three examples, two studies.
Question: How good is a recommender system?
Estimate what would have happened without the recommender system.
Question: Is a search engine performing well for all users?
Estimate how performance changes if a person had a different demographic, without changing anything else.
6. But metrics can be misleading
A performance metric can be high or low for a number of reasons:
◦ day of week, time of year
◦ selection effects
Different metrics may provide a different picture.
20. Example 3: Did a system change lead to better outcomes?
[Diagram: System A vs. System B]
21. Comparing old versus new system
Old (A): 50/1000 (5%)    New (B): 54/1000 (5.4%)
Is the new system better?
22. Looking at the change in CTR by income
               Old system (A)   New system (B)
Low-income     10/400 (2.5%)    4/200 (2%)
High-income    40/600 (6.6%)    50/800 (6.2%)
23. Simpson's paradox
Is Algorithm A better?
                                      Old system (A)   New system (B)
Conversion rate, low-income people    10/400 (2.5%)    4/200 (2%)
Conversion rate, high-income people   40/600 (6.6%)    50/800 (6.2%)
Total conversion rate                 50/1000 (5%)     54/1000 (5.4%)
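To make the reversal concrete, here is a minimal Python sketch that recomputes the rates from the table above (the numbers are the slide's; only the code is new):

```python
# Simpson's paradox with the numbers from the table above.
# Each entry is (conversions, impressions) per income group.
old = {"low": (10, 400), "high": (40, 600)}   # system A
new = {"low": (4, 200),  "high": (50, 800)}   # system B

def rate(conv, total):
    return conv / total

# Within each income group, A beats B.
for group in ("low", "high"):
    a, b = rate(*old[group]), rate(*new[group])
    print(f"{group}-income: A={a:.1%} vs B={b:.1%} -> A wins: {a > b}")

# Aggregated over groups, the direction flips, because B served far
# more high-income (high-converting) users than A did.
a_total = rate(sum(c for c, _ in old.values()), sum(n for _, n in old.values()))
b_total = rate(sum(c for c, _ in new.values()), sum(n for _, n in new.values()))
print(f"overall: A={a_total:.1%} vs B={b_total:.1%} -> A wins: {a_total > b_total}")
```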
24. Answer (as usual): maybe, maybe not.
High-income people might have better means to learn about the updated information.
Time of year may have an effect.
There could be other hidden causal variations.
26. Easy answer: run A/B tests
Great for testing focused hypotheses.
But they cannot help you find robust hypotheses.
Not possible in many scenarios for ethical or practical reasons.
27. Harder answer: approximate A/B tests offline
Gives all the benefits of A/B tests.
Can be run at scale.
But without an experiment, we have no data on what would happen if we changed the system.
How do we estimate something that we have never observed?
29. Evaluating systems = estimating causal effects
Two examples:
Question: How good is a recommender system?
Estimate what would have happened without the recommender system.
Question: Is a search engine performing well for all users?
Estimate how performance changes if a person had a different demographic, without changing anything else.
30. Study 1: Estimating the impact of a recommender system
Sharma, Hofman & Watts (2015, 2016)
33. Confounding: observed click-throughs may be due to correlated demand
[Causal diagram: demand for "The Road" drives visits to "The Road", which produce recommender click-throughs to "No Country for Old Men"; but demand for "No Country for Old Men" also drives those visits directly, confounding the observed click-through count.]
34. Observed activity is almost surely an overestimate of the causal effect
[Diagram: observed activity from the recommender = causal clicks + convenience clicks (page visits that would have happened anyway); the activity without the recommender is unobserved.]
35. Finding an auxiliary outcome: split the outcome into recommender (primary) and direct visits (auxiliary)
[Diagram: all visits to a recommended product = recommender visits + direct visits (search visits and direct browsing).]
Auxiliary outcome: direct visits serve as a proxy for unobserved demand.
36. 1a. Search for any product with a shock (a sudden spike) in its page visits.
[Diagram: page-visit time series exhibiting a sudden shock.]
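A minimal sketch of this shock-finding step, assuming daily direct-visit counts per product in a pandas DataFrame; the file name, column names, and threshold are illustrative, not from the talk:

```python
import pandas as pd

# Hypothetical input: one row per (product_id, date) with the count of
# direct (non-recommender) visits. Column names are assumptions.
visits = pd.read_csv("direct_visits.csv", parse_dates=["date"])

def find_shocks(df, ratio=5.0, window=7):
    """Flag days where a product's direct visits jump to at least
    `ratio` times its trailing `window`-day average -- a crude
    stand-in for the 'sudden shock' that supplies quasi-random
    variation in traffic."""
    df = df.sort_values("date")
    baseline = (df["direct_visits"]
                .rolling(window, min_periods=window).mean().shift(1))
    return df[df["direct_visits"] >= ratio * baseline]

shocks = (visits.groupby("product_id", group_keys=False)
                .apply(find_shocks))

# During a shock to a focal product, the causal click-through rate of a
# recommendation can then be estimated from recommender visits to the
# recommended product, provided its *direct* visits stay flat (the
# auxiliary-outcome check from Sharma, Hofman & Watts).
```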
47. Study 2: How does user satisfaction with a search engine vary across demographics?
Mehrotra, Anderson, Diaz, Sharma & Wallach (2017)
48. Tricky: straightforward optimization can lead to differential performance
The search engine uses a standard metric, time spent on the clicked result page, as an indicator of satisfaction.
Goal: estimate the difference in user satisfaction between two demographic groups.
Suppose older users issue more of the "retirement planning" queries.
[Illustration: of the users issuing this query, 80% are over 50 and 10% are under 30.]
49. 1. Overall metrics can hide differential satisfaction
Average user satisfaction for "retirement planning" may be high.
But:
Average satisfaction for younger users = 0.7
Average satisfaction for older users = 0.2
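A toy computation of how the aggregate can mask this gap; the two group satisfaction values are the slide's, while the traffic shares are purely assumed for illustration:

```python
# Per-group satisfaction from the slide; traffic shares are assumed.
satisfaction = {"younger": 0.7, "older": 0.2}
share        = {"younger": 0.8, "older": 0.2}  # hypothetical traffic mix

# Traffic-weighted average looks acceptable, yet one group is poorly served.
overall = sum(satisfaction[g] * share[g] for g in satisfaction)
print(f"overall satisfaction = {overall:.2f}")  # 0.60
print(f"group gap = {satisfaction['younger'] - satisfaction['older']:.2f}")  # 0.50
```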
50. 2. Query-level metrics can hide differential satisfaction
[Illustration: query streams for younger vs. older users; "retirement planning" is one query among many for younger users, but recurs repeatedly for older users.]
Same user satisfaction for "retirement planning" for both older and younger users = 0.7.
What if the average satisfaction for the other queries = 0.9?
Older users are still receiving more of the lower-quality results than younger users.
51. 3. More critically, even individual-level metrics can hide differential satisfaction
Reading time can differ across users even for the same webpage result and the same level of satisfaction.
[Plot: distributions of time spent on a webpage for younger vs. older users.]
52. How do we know whether some users are more satisfied than others?
53. Data: demographic characteristics of search engine users
Internal logs from Bing.com for two weeks
4M users | 32M impressions | 17M sessions
Demographics: age & gender
Age:
◦ post-Millennial: <18
◦ Millennial: 18-34
◦ Generation X: 35-54
◦ Baby Boomer: 55-74
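For concreteness, a small sketch of this age bucketing, assuming a pandas DataFrame with a numeric `age` column (the DataFrame and column name are illustrative):

```python
import pandas as pd

# Assumed: a DataFrame `users` with a numeric `age` column.
users = pd.DataFrame({"age": [15, 22, 40, 60]})
bins = [0, 18, 35, 55, 75]  # <18, 18-34, 35-54, 55-74
labels = ["post-Millennial", "Millennial", "Generation X", "Baby Boomer"]
users["generation"] = pd.cut(users["age"], bins=bins, labels=labels, right=False)
```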
55. Pitfalls with overall metrics
They conflate two separate effects:
◦ Natural demographic variation caused by the differing traits of the demographic groups, e.g.:
  ◦ different queries issued
  ◦ different information needs for the same query
  ◦ even at the same satisfaction, demographic A tends to click more than demographic B
◦ Systemic differences in user satisfaction due to the search engine
56. Constructing a causal model
[Causal graph over Demographics, Information Need, Query, Search Results, User Satisfaction, and the observed Metric.]
57. I. Context matching: selecting for activity with near-identical context
[The causal graph from the previous slide, with an added Context node that holds the query, information need, and search results fixed.]
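A minimal sketch of the matching idea under stated assumptions: impressions live in a pandas DataFrame whose column names (`query`, `top_result`, `age_group`, `dwell_time`) are illustrative, and "context" is approximated by the query plus the top result shown:

```python
import pandas as pd

# Hypothetical impression log; column names are assumptions.
logs = pd.read_csv("impressions.csv")

# Keep only contexts (here: identical query and top result) observed in
# more than one demographic group, then compare the metric within each
# matched context so demographics is the only thing that varies.
context_cols = ["query", "top_result"]
matched = (logs.groupby(context_cols)
               .filter(lambda g: g["age_group"].nunique() > 1))

per_context = (matched
               .groupby(context_cols + ["age_group"])["dwell_time"]
               .mean()
               .unstack("age_group"))
print(per_context.diff(axis=1).describe())  # age-wise gaps per context
```

Requiring near-identical contexts in every group is also why coverage drops sharply, as the next slide notes.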
59. Age-wise differences in metrics disappear
As a general auditing tool, this is robust.
But it has very low coverage across queries: did we control for too much?
60. II. Query-level pairwise model: estimating satisfaction directly by considering pairs of users
[The same causal graph as before.]
61. Estimating absolute satisfaction is non-trivial
• Instead, estimate relative satisfaction by considering pairs of users for the same query.
• Use a conservative proxy for pairwise satisfaction by only considering "big" differences in the observed metric for the same query.
• Fit a logistic regression model to estimate the probability of impression i being more satisfied than impression j.
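The slide's formula did not survive extraction; a standard pairwise logistic model of this form (a plausible reconstruction, not necessarily the talk's exact specification) can be sketched as:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in impression features (e.g., dwell time, clicks); X[i] is the
# feature vector of impression i. Data here is synthetic for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
pairs = rng.integers(0, 1000, size=(5000, 2))
# Toy labels: pair (i, j) is positive only on a "big" metric difference,
# mirroring the conservative proxy above.
y = (X[pairs[:, 0], 0] - X[pairs[:, 1], 0] > 0.5).astype(int)

# Pairwise logistic model: P(i more satisfied than j) = sigmoid(w . (x_i - x_j))
diffs = X[pairs[:, 0]] - X[pairs[:, 1]]
model = LogisticRegression().fit(diffs, y)
p_i_beats_j = model.predict_proba(diffs)[:, 1]
```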
62. Again, we see a small age-wise difference in satisfaction.
64. Conclusion II
Evaluation of ML systems requires careful analysis of inputs and outputs.
Observational metrics are usually biased:
◦ A big fraction of recommendation click-throughs may be due to convenience.
◦ Search engine metrics do not provide a clear picture of user satisfaction.
Causal models are essential for developing robust metrics.