SlideShare a Scribd company logo
1 of 65
Measuring effectiveness of
machine learning systems
AMIT SHARMA, MICROSOFT RESEARCH INDIA
www.amitsharma.in
@amt_shrma
Are these systems performing well?
How do these systems affect people’s
behavior?
How can we make these systems better?
Better for whom?
Evaluating systemsEstimating
causal effects
Three examples.
Two studies:
Question: How good is a recommender system?
Estimate what would have happened without the
recommender system.
Question: Is a search engine performing well for all users?
Estimate how performance changes if a person had a
different demographic, without changing anything else.
Easiest answer: Look at metrics
But metrics can be misleading
Performance metric can be high or low for a number of
reasons
--day of week, time of year
--selection effects
Different metrics may provide a different picture.
Example 1: Increasing activity
on Xbox
7
From data to prediction
8
Use these correlations to make a predictive model.
Future Activity ->
f(number of friends, logins in past month)

From data to “actionable insights”
9
Would increasing the number of friends increase
people’s activity on our system?
Maybe, may be not (!)
Different explanations are possible
10
Example 2: Search Ads
11
Are search ads really that effective?
12
But search results point to the same
website
13
Without reasoning about causality, may
overestimate effectiveness of ads
14
Okay, search ads have an explicit intent.
Display ads should be fine?
15
Estimating the impact of ads
16
People anyways buy more toys in
December
17
18
19
Example 3: Did a system change
lead to better outcomes?
20
System A System B
Comparing old versus new
system
21
Old (A) New (B)
50/1000 (5%) 54/1000 (5.4%)
New system is better?
Looking at change in CTR by
income
22
Old System (A) New System (B)
10/400 (2.5%) 4/200 (2%)
Old System (A) New System (B)
40/600 (6.6%) 50/800 (6.2%)
Low-income
High-income
The Simpson’s paradox
Is Algorithm A better?
23
Old system (A) New system (B)
Conversion rate for
Low-income people
10/400 (2.5%) 4/200 (2%)
Conversion rate for
High-income people
40/600 (6.6%) 50/800 (6.2%)
Total Conversion
rate
50/1000 (5%) 54/1000 (5.4%)
Answer (as usual): May be, may be not.
High-income people might have better means to know about the updated
information.
Time of year may have an effect.
There could be other hidden causal variations.
24
25
Easy answer: Do A/B tests
Great for testing focused hypotheses.
But cannot help you find robust hypotheses.
Not possible in many scenarios for ethical or practical
reasons.
Harder answer: Approximate
A/B tests offline
Gives all the benefits of A/B tests
Can be run at scale.
But, without an experiment, we have no data on
what would happen if we change the system.
How do we estimate something that we have never
observed?
Causal inference to the rescue…
28
Evaluating systemsEstimating
causal effects
Two examples:
Question: How good is a recommender system?
Estimate what would have happened without the
recommender system.
Question: Is a search engine performing well for all users?
Estimate how performance changes if a person had a
different demographic, without changing anything else.
Study 1: Estimating the impact
of a recommender system
Sharma-Hofman-Watts (2015,2016)
Example: Estimating the causal impact of
Amazon’s recommender system
31
How much activity comes from
the recommendation system?
32
Confounding: Observed click-throughs
may be due to correlated demand
33
Demand for
The Road
Visits to The
Road
Rec. visits to
No Country
for Old Men
Demand for
No Country for
Old Men
Observed activity is almost surely
an overestimate of the causal
effect
34
Causal
Convenience
OBSERVED ACTIVITY
FROM RECOMMENDER
All page
visits
?
ACTIVITY WITHOUT
RECOMMENDER
Finding auxiliary outcome: Split outcome into
recommender (primary) and direct visits (auxiliary)
35
All visits to a
recommended product
Recommender
visits
Direct visits
Search visits
Direct
browsing
Auxiliary outcome: Proxy for
unobserved demand
? ?
1a. Search for any product
with a shock to page visits
36
1b. Filtering out invalid natural
experiments
37
The “split-door” criterion
38
Criterion: 𝑿 ∐ 𝒀 𝑫
Demand for
focal product
(UX)
Visits to focal
product (X)
Rec. visits
(YR)
Direct visits
(YD)
Demand for
rec. product
(UY)
More formally, why does it work?
39
Unobserved
variables (UX)
Cause
(X)
Outcome (YR)
Auxiliary
Outcome
(YD)
Unobserved
variables (UY)
Data from Amazon.com, using
Bing toolbar
•
•
•
Out of which 20 K products have at least 10 visits on any one day
Constructed sequence of visits
for each user
41
Implementing the split-door criterion
42
𝑡 = 15 days
Using the split-door criterion, obtain 23,000 natural
experiments for over 12,000 products.
43
44
Observational click-through rate
overestimates causal effect
45
Generalization? Distribution of products
with a natural experiment identical to
overall distribution
46
Study 2: How does user
satisfaction for a search engine
vary across demographics?
Mehrotra-Anderson-Diaz-Sharma-Wallach (2017)
Tricky: straightforward
optimization can lead to
differential performance
Search engine uses a standard metric: time spent on clicked result page as an
indicator of satisfaction.
Goal: estimate difference in user satisfaction between these two demographic
groups.
Suppose older users issue more of “retirement planning” queries
Age: >50 years
80% users 10% users
Age: <30 years
…
1. Overall metrics can hide
differential satisfaction
Average user satisfaction for “retirement planning” may be
high.
But,
Average satisfaction for younger users=0.7
Average satisfaction for older users=0.2
2. Query-level metrics can hide
differential satisfaction
<query>
<query>
<query>
<query>
<query>
<query>
retirement planning
<query>
<query>
retirement planning
retirement planning
<query>
retirement planning
…
Same user satisfaction for
“retirement planning” for both older and
younger users = 0.7
What if average satisfaction for
<query>=0.9?
Older users still receiving more of lower-
quality results than younger users.
Younger users
Older users
3. More critically, even individual-
level metrics can also hide
differential satisfaction
Reading time for the same webpage result for the
same user satisfaction
Time spent on a webpage
Younger Users
Older Users
How do we know whether some
users are more satisfied than
others?
Data: Demographic characteristics
of search engine users
Internal logs from Bing.com for two weeks
4 M users | 32 M impressions | 17 M sessions
Demographics: Age & Gender
Age:
◦ post-Millenial: <18
◦ Millenial: 18-34
◦ Generation X: 35-54
◦ Baby Boomer: 55 - 74
Overall metrics across Demographics
Four metrics:
Graded Utility (GU) Reformulation Rate (RR)
Successful Click Count (SCC) Page Click Count (PCC)
Pitfalls with Overall Metrics
Conflate two separate effects:
◦ natural demographic variation caused by the differing traits among the
different demographic groups e.g.
◦ Different queries issued
◦ Different information need for the same query
◦ Even for the same satisfaction, demographic A tends to click more than demographic B
◦ Systemic difference in user satisfaction due to the search engine
Constructing a causal model
Information
Need
Demographics
Metric
User
satisfaction
Query
Search
Results
I. Context Matching: selecting for
activity with near-identical context
Information
Need
Demographics
Metric
User
satisfaction
Query
Search
Results
Context
Information
Need
Demographics
Metric
User
satisfaction
Query
Search
Results
Context
For any two users from different demographics,
1. Same Query
2. Same Information Need:
1. Control for user intent: same final SAT click
2. Only consider navigational queries
3. Identical top-8 Search Results
1.2 M impressions, 19K unique queries, 617K users
Age-wise differences in metrics disappear
General auditing tool: robust
Very low coverage across queries: Did we control for too much?
II. Query-level pairwise model:
Estimating satisfaction directly by
considering pairs of users
Information
Need
Demographics
Metric
User
satisfaction
Query
Search
Results
Estimating absolute satisfaction is non-trivial
• Instead, Estimate relative satisfaction by considering pairs of users
for the same query
• Conservative proxy for pairwise satisfaction by only considering “big”
differences in observed metric for the same query
• Logistic regression model for estimating probability of impression i
being more satisfied than impression j:
Again, see a small age-wise difference in
satisfaction
Conclusion I: ML systems need
Grey Box analysis
Conclusion II
Evaluation of ML systems requires careful analysis
of inputs and outputs.
Observational metrics are usually biased:
◦ Big fraction of recommendation click-throughs may be
convenience.
◦ Search engine metrics do not provide a provide a clear
picture of user satisfaction.
Causal models essential for developing robust
metrics.
Thank you
Amit Sharma
@amt_shrma
http://www.amitsharma.in

More Related Content

What's hot

The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017Big Data Spain
 
IRJET- Predicting Review Ratings for Product Marketing
IRJET- Predicting Review Ratings for Product MarketingIRJET- Predicting Review Ratings for Product Marketing
IRJET- Predicting Review Ratings for Product MarketingIRJET Journal
 
Causal Inference, Reinforcement Learning, and Continuous Optimization
Causal Inference, Reinforcement Learning, and Continuous OptimizationCausal Inference, Reinforcement Learning, and Continuous Optimization
Causal Inference, Reinforcement Learning, and Continuous OptimizationScientificRevenue
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
 
Recsys2021_slides_sato
Recsys2021_slides_satoRecsys2021_slides_sato
Recsys2021_slides_satoMasahiro Sato
 
Strategies for Practical Active Learning, Robert Munro
Strategies for Practical Active Learning, Robert MunroStrategies for Practical Active Learning, Robert Munro
Strategies for Practical Active Learning, Robert MunroRobert Munro
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationSara Hooker
 
Essential econometrics for data scientists
Essential econometrics for data scientistsEssential econometrics for data scientists
Essential econometrics for data scientistsBenjamin Skrainka
 
RecSys Challenge 2016
RecSys Challenge 2016RecSys Challenge 2016
RecSys Challenge 2016Fabian Abel
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Biomarker Strategies
Biomarker StrategiesBiomarker Strategies
Biomarker StrategiesTom Plasterer
 
Filtering Instagram hashtags through crowdtagging and the HITS algorithm
Filtering Instagram hashtags through crowdtagging and the HITS algorithmFiltering Instagram hashtags through crowdtagging and the HITS algorithm
Filtering Instagram hashtags through crowdtagging and the HITS algorithmVenkat Projects
 
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...David Zibriczky
 
Tune up your data science process
Tune up your data science processTune up your data science process
Tune up your data science processBenjamin Skrainka
 
Interleaving - SIGIR 2016 presentation
Interleaving - SIGIR 2016 presentationInterleaving - SIGIR 2016 presentation
Interleaving - SIGIR 2016 presentationXin QIAN
 
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesIntroduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesQuestionPro
 
How to Enter the Data Analytics Industry?
How to Enter the Data Analytics Industry?How to Enter the Data Analytics Industry?
How to Enter the Data Analytics Industry?Ganes Kesari
 
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...Vasily Leksin
 
Scientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talkScientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talkScientificRevenue
 
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-finalICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-finalriedlc
 

What's hot (20)

The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017
 
IRJET- Predicting Review Ratings for Product Marketing
IRJET- Predicting Review Ratings for Product MarketingIRJET- Predicting Review Ratings for Product Marketing
IRJET- Predicting Review Ratings for Product Marketing
 
Causal Inference, Reinforcement Learning, and Continuous Optimization
Causal Inference, Reinforcement Learning, and Continuous OptimizationCausal Inference, Reinforcement Learning, and Continuous Optimization
Causal Inference, Reinforcement Learning, and Continuous Optimization
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
 
Recsys2021_slides_sato
Recsys2021_slides_satoRecsys2021_slides_sato
Recsys2021_slides_sato
 
Strategies for Practical Active Learning, Robert Munro
Strategies for Practical Active Learning, Robert MunroStrategies for Practical Active Learning, Robert Munro
Strategies for Practical Active Learning, Robert Munro
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 
Essential econometrics for data scientists
Essential econometrics for data scientistsEssential econometrics for data scientists
Essential econometrics for data scientists
 
RecSys Challenge 2016
RecSys Challenge 2016RecSys Challenge 2016
RecSys Challenge 2016
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Biomarker Strategies
Biomarker StrategiesBiomarker Strategies
Biomarker Strategies
 
Filtering Instagram hashtags through crowdtagging and the HITS algorithm
Filtering Instagram hashtags through crowdtagging and the HITS algorithmFiltering Instagram hashtags through crowdtagging and the HITS algorithm
Filtering Instagram hashtags through crowdtagging and the HITS algorithm
 
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
A Combination of Simple Models by Forward Predictor Selection for Job Recomme...
 
Tune up your data science process
Tune up your data science processTune up your data science process
Tune up your data science process
 
Interleaving - SIGIR 2016 presentation
Interleaving - SIGIR 2016 presentationInterleaving - SIGIR 2016 presentation
Interleaving - SIGIR 2016 presentation
 
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesIntroduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
 
How to Enter the Data Analytics Industry?
How to Enter the Data Analytics Industry?How to Enter the Data Analytics Industry?
How to Enter the Data Analytics Industry?
 
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on...
 
Scientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talkScientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talk
 
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-finalICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
ICIS Rating Scales for Collective IntelligenceIcis idea rating-v1.0-final
 

Similar to Measuring effectiveness of machine learning systems

Auditing search engines for differential satisfaction across demographics
Auditing search engines for differential satisfaction across demographicsAuditing search engines for differential satisfaction across demographics
Auditing search engines for differential satisfaction across demographicsAmit Sharma
 
Crowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesCrowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesJPINFOTECH JAYAPRAKASH
 
Evaluating Collaborative Filtering Recommender Systems
Evaluating Collaborative Filtering Recommender SystemsEvaluating Collaborative Filtering Recommender Systems
Evaluating Collaborative Filtering Recommender SystemsMegaVjohnson
 
Fuzzy Logic Based Recommender System
Fuzzy Logic Based Recommender SystemFuzzy Logic Based Recommender System
Fuzzy Logic Based Recommender SystemRSIS International
 
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...IEEEGLOBALSOFTTECHNOLOGIES
 
Crowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesCrowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesIEEEFINALYEARPROJECTS
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentationnirvdrum
 
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영Jin Young Kim
 
MAT 510 Effective Communication - snaptutorial.com
MAT 510 Effective Communication - snaptutorial.comMAT 510 Effective Communication - snaptutorial.com
MAT 510 Effective Communication - snaptutorial.comdonaldzs24
 
MAT 510 Exceptional Education - snaptutorial.com
MAT 510   Exceptional Education - snaptutorial.comMAT 510   Exceptional Education - snaptutorial.com
MAT 510 Exceptional Education - snaptutorial.comDavisMurphyB11
 
Behavioural Modelling Outcomes prediction using Casual Factors
Behavioural Modelling Outcomes prediction using Casual  FactorsBehavioural Modelling Outcomes prediction using Casual  Factors
Behavioural Modelling Outcomes prediction using Casual FactorsIJMER
 
Sweeny group think-ias2015
Sweeny group think-ias2015Sweeny group think-ias2015
Sweeny group think-ias2015Marianne Sweeny
 
Mat 510 Enhance teaching / snaptutorial.com
Mat 510 Enhance teaching / snaptutorial.comMat 510 Enhance teaching / snaptutorial.com
Mat 510 Enhance teaching / snaptutorial.comBaileya19
 
MAT 510 Great Stories /newtonhelp.com
MAT 510 Great Stories /newtonhelp.comMAT 510 Great Stories /newtonhelp.com
MAT 510 Great Stories /newtonhelp.combellflower184
 
Mat 510 Believe Possibilities / snaptutorial.com
Mat 510  Believe Possibilities / snaptutorial.comMat 510  Believe Possibilities / snaptutorial.com
Mat 510 Believe Possibilities / snaptutorial.comDavis29a
 
MAT 510 Inspiring Innovation/tutorialrank.com
 MAT 510 Inspiring Innovation/tutorialrank.com MAT 510 Inspiring Innovation/tutorialrank.com
MAT 510 Inspiring Innovation/tutorialrank.comjonhson139
 
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...Pieter Van Gorp
 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Roi Blanco
 
MAT 510 Effective Communication - tutorialrank.com
MAT 510  Effective Communication - tutorialrank.comMAT 510  Effective Communication - tutorialrank.com
MAT 510 Effective Communication - tutorialrank.comBartholomew46
 

Similar to Measuring effectiveness of machine learning systems (20)

Auditing search engines for differential satisfaction across demographics
Auditing search engines for differential satisfaction across demographicsAuditing search engines for differential satisfaction across demographics
Auditing search engines for differential satisfaction across demographics
 
Crowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesCrowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomes
 
Evaluating Collaborative Filtering Recommender Systems
Evaluating Collaborative Filtering Recommender SystemsEvaluating Collaborative Filtering Recommender Systems
Evaluating Collaborative Filtering Recommender Systems
 
Fuzzy Logic Based Recommender System
Fuzzy Logic Based Recommender SystemFuzzy Logic Based Recommender System
Fuzzy Logic Based Recommender System
 
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
JAVA 2013 IEEE DATAMINING PROJECT Crowdsourcing predictors of behavioral outc...
 
Crowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomesCrowdsourcing predictors of behavioral outcomes
Crowdsourcing predictors of behavioral outcomes
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentation
 
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
Fairness in Search & RecSys 네이버 검색 콜로키움 김진영
 
MAT 510 Effective Communication - snaptutorial.com
MAT 510 Effective Communication - snaptutorial.comMAT 510 Effective Communication - snaptutorial.com
MAT 510 Effective Communication - snaptutorial.com
 
MAT 510 Exceptional Education - snaptutorial.com
MAT 510   Exceptional Education - snaptutorial.comMAT 510   Exceptional Education - snaptutorial.com
MAT 510 Exceptional Education - snaptutorial.com
 
Behavioural Modelling Outcomes prediction using Casual Factors
Behavioural Modelling Outcomes prediction using Casual  FactorsBehavioural Modelling Outcomes prediction using Casual  Factors
Behavioural Modelling Outcomes prediction using Casual Factors
 
Sweeny group think-ias2015
Sweeny group think-ias2015Sweeny group think-ias2015
Sweeny group think-ias2015
 
Mat 510 Enhance teaching / snaptutorial.com
Mat 510 Enhance teaching / snaptutorial.comMat 510 Enhance teaching / snaptutorial.com
Mat 510 Enhance teaching / snaptutorial.com
 
MAT 510 Great Stories /newtonhelp.com
MAT 510 Great Stories /newtonhelp.comMAT 510 Great Stories /newtonhelp.com
MAT 510 Great Stories /newtonhelp.com
 
Mat 510 Believe Possibilities / snaptutorial.com
Mat 510  Believe Possibilities / snaptutorial.comMat 510  Believe Possibilities / snaptutorial.com
Mat 510 Believe Possibilities / snaptutorial.com
 
MAT 510 Inspiring Innovation/tutorialrank.com
 MAT 510 Inspiring Innovation/tutorialrank.com MAT 510 Inspiring Innovation/tutorialrank.com
MAT 510 Inspiring Innovation/tutorialrank.com
 
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
8-year Evaluation of GameBus: Status quo in Aiming for an Open Access Platfor...
 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement
 
MAT 510 Effective Communication - tutorialrank.com
MAT 510  Effective Communication - tutorialrank.comMAT 510  Effective Communication - tutorialrank.com
MAT 510 Effective Communication - tutorialrank.com
 
MBA
MBAMBA
MBA
 

More from Amit Sharma

Alleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal ModelsAlleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal ModelsAmit Sharma
 
Artificial Intelligence for Societal Impact
Artificial Intelligence for Societal ImpactArtificial Intelligence for Societal Impact
Artificial Intelligence for Societal ImpactAmit Sharma
 
Causal inference in data science
Causal inference in data scienceCausal inference in data science
Causal inference in data scienceAmit Sharma
 
Causal inference in online systems: Methods, pitfalls and best practices
Causal inference in online systems: Methods, pitfalls and best practicesCausal inference in online systems: Methods, pitfalls and best practices
Causal inference in online systems: Methods, pitfalls and best practicesAmit Sharma
 
Equivalence causal frameworks: SEMs, Graphical models and Potential Outcomes
Equivalence causal frameworks: SEMs, Graphical models and Potential OutcomesEquivalence causal frameworks: SEMs, Graphical models and Potential Outcomes
Equivalence causal frameworks: SEMs, Graphical models and Potential OutcomesAmit Sharma
 
Estimating influence of online activity feeds on people's actions
Estimating influence of online activity feeds on people's actionsEstimating influence of online activity feeds on people's actions
Estimating influence of online activity feeds on people's actionsAmit Sharma
 
From prediction to causation: Causal inference in online systems
From prediction to causation: Causal inference in online systemsFrom prediction to causation: Causal inference in online systems
From prediction to causation: Causal inference in online systemsAmit Sharma
 
Causal inference in practice: Here, there, causality is everywhere
Causal inference in practice: Here, there, causality is everywhereCausal inference in practice: Here, there, causality is everywhere
Causal inference in practice: Here, there, causality is everywhereAmit Sharma
 
The interplay of personal preference and social influence in sharing networks...
The interplay of personal preference and social influence in sharing networks...The interplay of personal preference and social influence in sharing networks...
The interplay of personal preference and social influence in sharing networks...Amit Sharma
 
The role of social connections in shaping our preferences
The role of social connections in shaping our preferencesThe role of social connections in shaping our preferences
The role of social connections in shaping our preferencesAmit Sharma
 
[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...
[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...
[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...Amit Sharma
 
RSWEB 2013: A research platform for social recommendation
RSWEB 2013: A research platform for social recommendationRSWEB 2013: A research platform for social recommendation
RSWEB 2013: A research platform for social recommendationAmit Sharma
 

More from Amit Sharma (12)

Alleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal ModelsAlleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal Models
 
Artificial Intelligence for Societal Impact
Artificial Intelligence for Societal ImpactArtificial Intelligence for Societal Impact
Artificial Intelligence for Societal Impact
 
Causal inference in data science
Causal inference in data scienceCausal inference in data science
Causal inference in data science
 
Causal inference in online systems: Methods, pitfalls and best practices
Causal inference in online systems: Methods, pitfalls and best practicesCausal inference in online systems: Methods, pitfalls and best practices
Causal inference in online systems: Methods, pitfalls and best practices
 
Equivalence causal frameworks: SEMs, Graphical models and Potential Outcomes
Equivalence causal frameworks: SEMs, Graphical models and Potential OutcomesEquivalence causal frameworks: SEMs, Graphical models and Potential Outcomes
Equivalence causal frameworks: SEMs, Graphical models and Potential Outcomes
 
Estimating influence of online activity feeds on people's actions
Estimating influence of online activity feeds on people's actionsEstimating influence of online activity feeds on people's actions
Estimating influence of online activity feeds on people's actions
 
From prediction to causation: Causal inference in online systems
From prediction to causation: Causal inference in online systemsFrom prediction to causation: Causal inference in online systems
From prediction to causation: Causal inference in online systems
 
Causal inference in practice: Here, there, causality is everywhere
Causal inference in practice: Here, there, causality is everywhereCausal inference in practice: Here, there, causality is everywhere
Causal inference in practice: Here, there, causality is everywhere
 
The interplay of personal preference and social influence in sharing networks...
The interplay of personal preference and social influence in sharing networks...The interplay of personal preference and social influence in sharing networks...
The interplay of personal preference and social influence in sharing networks...
 
The role of social connections in shaping our preferences
The role of social connections in shaping our preferencesThe role of social connections in shaping our preferences
The role of social connections in shaping our preferences
 
[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...
[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...
[RecSys '13]Pairwise Learning: Experiments with Community Recommendation on L...
 
RSWEB 2013: A research platform for social recommendation
RSWEB 2013: A research platform for social recommendationRSWEB 2013: A research platform for social recommendation
RSWEB 2013: A research platform for social recommendation
 

Recently uploaded

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Recently uploaded (20)

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Measuring effectiveness of machine learning systems

  • 1. Measuring effectiveness of machine learning systems AMIT SHARMA, MICROSOFT RESEARCH INDIA www.amitsharma.in @amt_shrma
  • 2.
  • 3. Are these systems performing well? How do these systems affect people’s behavior? How can we make these systems better? Better for whom?
  • 4. Evaluating systemsEstimating causal effects Three examples. Two studies: Question: How good is a recommender system? Estimate what would have happened without the recommender system. Question: Is a search engine performing well for all users? Estimate how performance changes if a person had a different demographic, without changing anything else.
  • 5. Easiest answer: Look at metrics
  • 6. But metrics can be misleading Performance metric can be high or low for a number of reasons --day of week, time of year --selection effects Different metrics may provide a different picture.
  • 7. Example 1: Increasing activity on Xbox 7
  • 8. From data to prediction 8 Use these correlations to make a predictive model. Future Activity -> f(number of friends, logins in past month) 
  • 9. From data to “actionable insights” 9 Would increasing the number of friends increase people’s activity on our system? Maybe, may be not (!)
  • 12. Are search ads really that effective? 12
  • 13. But search results point to the same website 13
  • 14. Without reasoning about causality, may overestimate effectiveness of ads 14
  • 15. Okay, search ads have an explicit intent. Display ads should be fine? 15
  • 17. People anyways buy more toys in December 17
  • 18. 18
  • 19. 19
  • 20. Example 3: Did a system change lead to better outcomes? 20 System A System B
  • 21. Comparing old versus new system 21 Old (A) New (B) 50/1000 (5%) 54/1000 (5.4%) New system is better?
  • 22. Looking at change in CTR by income 22 Old System (A) New System (B) 10/400 (2.5%) 4/200 (2%) Old System (A) New System (B) 40/600 (6.6%) 50/800 (6.2%) Low-income High-income
  • 23. The Simpson’s paradox Is Algorithm A better? 23 Old system (A) New system (B) Conversion rate for Low-income people 10/400 (2.5%) 4/200 (2%) Conversion rate for High-income people 40/600 (6.6%) 50/800 (6.2%) Total Conversion rate 50/1000 (5%) 54/1000 (5.4%)
  • 24. Answer (as usual): May be, may be not. High-income people might have better means to know about the updated information. Time of year may have an effect. There could be other hidden causal variations. 24
  • 25. 25
  • 26. Easy answer: Do A/B tests Great for testing focused hypotheses. But cannot help you find robust hypotheses. Not possible in many scenarios for ethical or practical reasons.
  • 27. Harder answer: Approximate A/B tests offline Gives all the benefits of A/B tests Can be run at scale. But, without an experiment, we have no data on what would happen if we change the system. How do we estimate something that we have never observed?
  • 28. Causal inference to the rescue… 28
  • 29. Evaluating systemsEstimating causal effects Two examples: Question: How good is a recommender system? Estimate what would have happened without the recommender system. Question: Is a search engine performing well for all users? Estimate how performance changes if a person had a different demographic, without changing anything else.
  • 30. Study 1: Estimating the impact of a recommender system Sharma-Hofman-Watts (2015,2016)
  • 31. Example: Estimating the causal impact of Amazon’s recommender system 31
  • 32. How much activity comes from the recommendation system? 32
  • 33. Confounding: Observed click-throughs may be due to correlated demand 33 Demand for The Road Visits to The Road Rec. visits to No Country for Old Men Demand for No Country for Old Men
  • 34. Observed activity is almost surely an overestimate of the causal effect 34 Causal Convenience OBSERVED ACTIVITY FROM RECOMMENDER All page visits ? ACTIVITY WITHOUT RECOMMENDER
  • 35. Finding auxiliary outcome: Split outcome into recommender (primary) and direct visits (auxiliary) 35 All visits to a recommended product Recommender visits Direct visits Search visits Direct browsing Auxiliary outcome: Proxy for unobserved demand
  • 36. ? ? 1a. Search for any product with a shock to page visits 36
  • 37. 1b. Filtering out invalid natural experiments 37
  • 38. The “split-door” criterion 38 Criterion: 𝑿 ∐ 𝒀 𝑫 Demand for focal product (UX) Visits to focal product (X) Rec. visits (YR) Direct visits (YD) Demand for rec. product (UY)
  • 39. More formally, why does it work? 39 Unobserved variables (UX) Cause (X) Outcome (YR) Auxiliary Outcome (YD) Unobserved variables (UY)
  • 40. Data from Amazon.com, using Bing toolbar • • • Out of which 20 K products have at least 10 visits on any one day
  • 41. Constructed sequence of visits for each user 41
  • 42. Implementing the split-door criterion 42 𝑡 = 15 days
  • 43. Using the split-door criterion, obtain 23,000 natural experiments for over 12,000 products. 43
  • 44. 44
  • 46. Generalization? Distribution of products with a natural experiment identical to overall distribution 46
  • 47. Study 2: How does user satisfaction for a search engine vary across demographics? Mehrotra-Anderson-Diaz-Sharma-Wallach (2017)
  • 48. Tricky: straightforward optimization can lead to differential performance Search engine uses a standard metric: time spent on clicked result page as an indicator of satisfaction. Goal: estimate difference in user satisfaction between these two demographic groups. Suppose older users issue more of “retirement planning” queries Age: >50 years 80% users 10% users Age: <30 years …
  • 49. 1. Overall metrics can hide differential satisfaction Average user satisfaction for “retirement planning” may be high. But, Average satisfaction for younger users=0.7 Average satisfaction for older users=0.2
  • 50. 2. Query-level metrics can hide differential satisfaction <query> <query> <query> <query> <query> <query> retirement planning <query> <query> retirement planning retirement planning <query> retirement planning … Same user satisfaction for “retirement planning” for both older and younger users = 0.7 What if average satisfaction for <query>=0.9? Older users still receiving more of lower- quality results than younger users. Younger users Older users
  • 51. 3. More critically, even individual- level metrics can also hide differential satisfaction Reading time for the same webpage result for the same user satisfaction Time spent on a webpage Younger Users Older Users
  • 52. How do we know whether some users are more satisfied than others?
  • 53. Data: Demographic characteristics of search engine users Internal logs from Bing.com for two weeks 4 M users | 32 M impressions | 17 M sessions Demographics: Age & Gender Age: ◦ post-Millenial: <18 ◦ Millenial: 18-34 ◦ Generation X: 35-54 ◦ Baby Boomer: 55 - 74
  • 54. Overall metrics across Demographics Four metrics: Graded Utility (GU) Reformulation Rate (RR) Successful Click Count (SCC) Page Click Count (PCC)
  • 55. Pitfalls with Overall Metrics Conflate two separate effects: ◦ natural demographic variation caused by the differing traits among the different demographic groups e.g. ◦ Different queries issued ◦ Different information need for the same query ◦ Even for the same satisfaction, demographic A tends to click more than demographic B ◦ Systemic difference in user satisfaction due to the search engine
  • 56. Constructing a causal model Information Need Demographics Metric User satisfaction Query Search Results
  • 57. I. Context Matching: selecting for activity with near-identical context Information Need Demographics Metric User satisfaction Query Search Results Context
  • 58. Information Need Demographics Metric User satisfaction Query Search Results Context For any two users from different demographics, 1. Same Query 2. Same Information Need: 1. Control for user intent: same final SAT click 2. Only consider navigational queries 3. Identical top-8 Search Results 1.2 M impressions, 19K unique queries, 617K users
  • 59. Age-wise differences in metrics disappear General auditing tool: robust Very low coverage across queries: Did we control for too much?
  • 60. II. Query-level pairwise model: Estimating satisfaction directly by considering pairs of users Information Need Demographics Metric User satisfaction Query Search Results
  • 61. Estimating absolute satisfaction is non-trivial • Instead, Estimate relative satisfaction by considering pairs of users for the same query • Conservative proxy for pairwise satisfaction by only considering “big” differences in observed metric for the same query • Logistic regression model for estimating probability of impression i being more satisfied than impression j:
  • 62. Again, see a small age-wise difference in satisfaction
  • 63. Conclusion I: ML systems need Grey Box analysis
  • 64. Conclusion II Evaluation of ML systems requires careful analysis of inputs and outputs. Observational metrics are usually biased: ◦ Big fraction of recommendation click-throughs may be convenience. ◦ Search engine metrics do not provide a provide a clear picture of user satisfaction. Causal models essential for developing robust metrics.