Many online systems, such as recommender systems or ad systems, are increasingly used in societally critical domains such as education, healthcare, finance and governance. A natural question is how effective they are, which is often measured using observational metrics. However, such metrics hide the cause-and-effect processes that connect these systems, people's behavior and outcomes. I will present a causal framework that allows us to tackle questions about the effects of algorithmic systems, and demonstrate its use by evaluating Amazon's recommender system and a major search engine. I will also discuss how such evaluations can lead to metrics for designing better systems.
3. Are these systems performing well?
How do these systems affect people's behavior?
How can we make these systems better?
Better for whom?
4. Evaluating systems = estimating causal effects
Three examples, two studies.
Question: How good is a recommender system?
Estimate what would have happened without the recommender system.
Question: Is a search engine performing well for all users?
Estimate how performance changes if a person had a different demographic, without changing anything else.
6. But metrics can be misleading
A performance metric can be high or low for a number of reasons:
◦ day of week, time of year
◦ selection effects
Different metrics may provide a different picture.
20. Example 3: Did a system change lead to better outcomes?
[Diagram: System A vs. System B]
21. Comparing old versus new system
Old (A): 50/1000 (5%)    New (B): 54/1000 (5.4%)
Is the new system better?
22. Looking at the change in CTR by income
               Old system (A)   New system (B)
Low-income     10/400 (2.5%)    4/200 (2%)
High-income    40/600 (6.6%)    50/800 (6.2%)
23. Simpson's paradox
Is Algorithm A better?
                                      Old system (A)   New system (B)
Conversion rate, low-income people    10/400 (2.5%)    4/200 (2%)
Conversion rate, high-income people   40/600 (6.6%)    50/800 (6.2%)
Total conversion rate                 50/1000 (5%)     54/1000 (5.4%)
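To make the reversal concrete, here is a minimal Python sketch that recomputes the rates from the table above (the numbers are the slide's; only the code is new):

```python
# Simpson's paradox with the numbers from the table above.
# Each entry is (conversions, impressions) per income group.
old = {"low": (10, 400), "high": (40, 600)}   # system A
new = {"low": (4, 200),  "high": (50, 800)}   # system B

def rate(conv, total):
    return conv / total

# Within each income group, A beats B.
for group in ("low", "high"):
    a, b = rate(*old[group]), rate(*new[group])
    print(f"{group}-income: A={a:.1%} vs B={b:.1%} -> A wins: {a > b}")

# Aggregated over groups, the direction flips, because B served far
# more high-income (high-converting) users than A did.
a_total = rate(sum(c for c, _ in old.values()), sum(n for _, n in old.values()))
b_total = rate(sum(c for c, _ in new.values()), sum(n for _, n in new.values()))
print(f"overall: A={a_total:.1%} vs B={b_total:.1%} -> A wins: {a_total > b_total}")
```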
24. Answer (as usual): maybe, maybe not.
High-income people might have better means to learn about the updated information.
Time of year may have an effect.
There could be other hidden causal variations.
26. Easy answer: run A/B tests
Great for testing focused hypotheses.
But they cannot help you find robust hypotheses.
Not possible in many scenarios for ethical or practical reasons.
27. Harder answer: approximate A/B tests offline
Gives all the benefits of A/B tests.
Can be run at scale.
But without an experiment, we have no data on what would happen if we changed the system.
How do we estimate something that we have never observed?
29. Evaluating systems = estimating causal effects
Two examples:
Question: How good is a recommender system?
Estimate what would have happened without the recommender system.
Question: Is a search engine performing well for all users?
Estimate how performance changes if a person had a different demographic, without changing anything else.
30. Study 1: Estimating the impact of a recommender system
Sharma, Hofman & Watts (2015, 2016)
33. Confounding: observed click-throughs may be due to correlated demand
[Causal diagram: demand for "The Road" drives visits to "The Road", which produce recommender click-throughs to "No Country for Old Men"; but demand for "No Country for Old Men" also drives those visits directly, confounding the observed click-through count.]
34. Observed activity is almost surely an overestimate of the causal effect
[Diagram: observed activity from the recommender = causal clicks + convenience clicks (page visits that would have happened anyway); the activity without the recommender is unobserved.]
35. Finding an auxiliary outcome: split the outcome into recommender (primary) and direct visits (auxiliary)
[Diagram: all visits to a recommended product = recommender visits + direct visits (search visits and direct browsing).]
Auxiliary outcome: direct visits serve as a proxy for unobserved demand.
36. 1a. Search for any product with a shock (a sudden spike) in its page visits.
[Diagram: page-visit time series exhibiting a sudden shock.]
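A minimal sketch of this shock-finding step, assuming daily direct-visit counts per product in a pandas DataFrame; the file name, column names, and threshold are illustrative, not from the talk:

```python
import pandas as pd

# Hypothetical input: one row per (product_id, date) with the count of
# direct (non-recommender) visits. Column names are assumptions.
visits = pd.read_csv("direct_visits.csv", parse_dates=["date"])

def find_shocks(df, ratio=5.0, window=7):
    """Flag days where a product's direct visits jump to at least
    `ratio` times its trailing `window`-day average -- a crude
    stand-in for the 'sudden shock' that supplies quasi-random
    variation in traffic."""
    df = df.sort_values("date")
    baseline = (df["direct_visits"]
                .rolling(window, min_periods=window).mean().shift(1))
    return df[df["direct_visits"] >= ratio * baseline]

shocks = (visits.groupby("product_id", group_keys=False)
                .apply(find_shocks))

# During a shock to a focal product, the causal click-through rate of a
# recommendation can then be estimated from recommender visits to the
# recommended product, provided its *direct* visits stay flat (the
# auxiliary-outcome check from Sharma, Hofman & Watts).
```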
47. Study 2: How does user satisfaction with a search engine vary across demographics?
Mehrotra, Anderson, Diaz, Sharma & Wallach (2017)
48. Tricky: straightforward optimization can lead to differential performance
The search engine uses a standard metric, time spent on the clicked result page, as an indicator of satisfaction.
Goal: estimate the difference in user satisfaction between two demographic groups.
Suppose older users issue more of the "retirement planning" queries.
[Illustration: of the users issuing this query, 80% are over 50 and 10% are under 30.]
49. 1. Overall metrics can hide differential satisfaction
Average user satisfaction for "retirement planning" may be high.
But:
Average satisfaction for younger users = 0.7
Average satisfaction for older users = 0.2
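A toy computation of how the aggregate can mask this gap; the two group satisfaction values are the slide's, while the traffic shares are purely assumed for illustration:

```python
# Per-group satisfaction from the slide; traffic shares are assumed.
satisfaction = {"younger": 0.7, "older": 0.2}
share        = {"younger": 0.8, "older": 0.2}  # hypothetical traffic mix

# Traffic-weighted average looks acceptable, yet one group is poorly served.
overall = sum(satisfaction[g] * share[g] for g in satisfaction)
print(f"overall satisfaction = {overall:.2f}")  # 0.60
print(f"group gap = {satisfaction['younger'] - satisfaction['older']:.2f}")  # 0.50
```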
50. 2. Query-level metrics can hide differential satisfaction
[Illustration: query streams for younger vs. older users; "retirement planning" is one query among many for younger users, but recurs repeatedly for older users.]
Same user satisfaction for "retirement planning" for both older and younger users = 0.7.
What if the average satisfaction for the other queries = 0.9?
Older users are still receiving more of the lower-quality results than younger users.
51. 3. More critically, even individual-level metrics can hide differential satisfaction
Reading time can differ across users even for the same webpage result and the same level of satisfaction.
[Plot: distributions of time spent on a webpage for younger vs. older users.]
52. How do we know whether some users are more satisfied than others?
53. Data: demographic characteristics of search engine users
Internal logs from Bing.com for two weeks
4M users | 32M impressions | 17M sessions
Demographics: age & gender
Age:
◦ post-Millennial: <18
◦ Millennial: 18-34
◦ Generation X: 35-54
◦ Baby Boomer: 55-74
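For concreteness, a small sketch of this age bucketing, assuming a pandas DataFrame with a numeric `age` column (the DataFrame and column name are illustrative):

```python
import pandas as pd

# Assumed: a DataFrame `users` with a numeric `age` column.
users = pd.DataFrame({"age": [15, 22, 40, 60]})
bins = [0, 18, 35, 55, 75]  # <18, 18-34, 35-54, 55-74
labels = ["post-Millennial", "Millennial", "Generation X", "Baby Boomer"]
users["generation"] = pd.cut(users["age"], bins=bins, labels=labels, right=False)
```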
55. Pitfalls with overall metrics
They conflate two separate effects:
◦ Natural demographic variation caused by the differing traits of the demographic groups, e.g.:
  ◦ different queries issued
  ◦ different information needs for the same query
  ◦ even at the same satisfaction, demographic A tends to click more than demographic B
◦ Systemic differences in user satisfaction due to the search engine
56. Constructing a causal model
[Causal graph over Demographics, Information Need, Query, Search Results, User Satisfaction, and the observed Metric.]
57. I. Context matching: selecting for activity with near-identical context
[The causal graph from the previous slide, with an added Context node that holds the query, information need, and search results fixed.]
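A minimal sketch of the matching idea under stated assumptions: impressions live in a pandas DataFrame whose column names (`query`, `top_result`, `age_group`, `dwell_time`) are illustrative, and "context" is approximated by the query plus the top result shown:

```python
import pandas as pd

# Hypothetical impression log; column names are assumptions.
logs = pd.read_csv("impressions.csv")

# Keep only contexts (here: identical query and top result) observed in
# more than one demographic group, then compare the metric within each
# matched context so demographics is the only thing that varies.
context_cols = ["query", "top_result"]
matched = (logs.groupby(context_cols)
               .filter(lambda g: g["age_group"].nunique() > 1))

per_context = (matched
               .groupby(context_cols + ["age_group"])["dwell_time"]
               .mean()
               .unstack("age_group"))
print(per_context.diff(axis=1).describe())  # age-wise gaps per context
```

Requiring near-identical contexts in every group is also why coverage drops sharply, as the next slide notes.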
59. Age-wise differences in metrics disappear
As a general auditing tool, this is robust.
But it has very low coverage across queries: did we control for too much?
60. II. Query-level pairwise model: estimating satisfaction directly by considering pairs of users
[The same causal graph as before.]
61. Estimating absolute satisfaction is non-trivial
• Instead, estimate relative satisfaction by considering pairs of users for the same query.
• Use a conservative proxy for pairwise satisfaction by only considering "big" differences in the observed metric for the same query.
• Fit a logistic regression model to estimate the probability of impression i being more satisfied than impression j.
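The slide's formula did not survive extraction; a standard pairwise logistic model of this form (a plausible reconstruction, not necessarily the talk's exact specification) can be sketched as:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in impression features (e.g., dwell time, clicks); X[i] is the
# feature vector of impression i. Data here is synthetic for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
pairs = rng.integers(0, 1000, size=(5000, 2))
# Toy labels: pair (i, j) is positive only on a "big" metric difference,
# mirroring the conservative proxy above.
y = (X[pairs[:, 0], 0] - X[pairs[:, 1], 0] > 0.5).astype(int)

# Pairwise logistic model: P(i more satisfied than j) = sigmoid(w . (x_i - x_j))
diffs = X[pairs[:, 0]] - X[pairs[:, 1]]
model = LogisticRegression().fit(diffs, y)
p_i_beats_j = model.predict_proba(diffs)[:, 1]
```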
62. Again, we see a small age-wise difference in satisfaction.
64. Conclusion II
Evaluation of ML systems requires careful analysis of inputs and outputs.
Observational metrics are usually biased:
◦ A big fraction of recommendation click-throughs may be due to convenience.
◦ Search engine metrics do not provide a clear picture of user satisfaction.
Causal models are essential for developing robust metrics.