A-B Test Platform at EBay
From Statistics to Distributed Systems
Michael Lei
11/18/2019
The Statistics and Concepts
• Problem Structure in Statistics
• Formal Statistical Methodologies
• Hypothesis Testing: superiority test, non-inferiority test
• Two-sample T-Tests: 1-tail vs. 2-tail, paired vs. unpaired
• Power Analysis
• Is the sample set large enough?
• How MSZ (min sample size) relates to lift, std-dev and mean of population
• Sampling vs. Activation
• Activation is ultimately on impressions, where end users interact with computer
programs
Objective of an A-B Test
• Effect on a random variable
• Assume a treatment only shifts the mean of its distribution.
• Lift: change on the mean, measured as a ratio
• Find how a treatment affects a random variable.
Example metrics:
• CTR on results of a search query (search engine)
• CTR on an Ad impression (CPC ads)
• GMB of a user, conversion rate of sessions of a user (e-commerce)
Example treatments:
• New ranking model (search engine)
• New CTR model (ads)
• New item view page (e-commerce)
A random process that generates variable v:
p(x1, x2, …, ε) → v
Change a parameter x that we can control:
p(x1′, x2, …, ε) → v′
We want to see whether there is a meaningful difference
between v and v′.
Is the population mean of v′ different from that of v?
But we can’t directly measure the population mean.
Run the random process a few times to
generate two sample sets:
{v1, v2, v3, …} {v1′, v2′, v3′, …}
A naïve idea:
Compute lift = mean(v′) / mean(v) − 1 from the
two sample sets. Is lift = 0?
But the lift itself is a random variable!
Hence the naïve idea does not seem very smart or complete!
It turns out:
The lift roughly follows a known distribution.
CLT: the sum of a large number of IID variables
follows a normal distribution,
hence so does the sample mean of IID variables,
as long as the following are satisfied:
• “IID”
• “Large number” of samples
[Figure: probability density function; the shaded area is Prob(v > x).]
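To make the CLT claim concrete, here is a minimal simulation (not from the deck; the distributions and sample sizes are illustrative): even when the underlying variable is heavily skewed, the lift computed from sample means is approximately normally distributed.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lifts(n, true_lift=0.03, n_trials=10_000):
    """Repeatedly draw control/treated samples from skewed (exponential)
    populations and compute the lift of the sample means."""
    lifts = np.empty(n_trials)
    for i in range(n_trials):
        control = rng.exponential(scale=1.0, size=n)               # mean 1.0
        treated = rng.exponential(scale=1.0 + true_lift, size=n)   # mean 1.03
        lifts[i] = treated.mean() / control.mean() - 1
    return lifts

lifts = simulate_lifts(n=5_000)
# The histogram of `lifts` is close to normal and centered near the true
# lift, even though the individual samples are far from normal.
print(f"mean lift = {lifts.mean():.4f}, std = {lifts.std():.4f}")
```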
Formal Methodology
• Hypothesis Testing
• Form a hypothesis on random variables.
• Collect evidence from samples to reject the hypothesis.
• Two-sample T-Tests
• Test a hypothesis on the sample means or the lift of the sample means of our
variables.
• Reject the hypothesis with alpha (type I error).
• Power Analysis
• Is the sample set large enough?
Null Hypothesis on Lift
Let lift = treatedMean / controlMean - 1
• 𝑯𝟎 : lift = 0
This is usually referred to as a two-tailed test. It tests both superiority
(lift > 0) and inferiority (lift < 0). It is the H0 used by EBay EP.
• 𝑯𝟎 : lift <= 0
One-tail test for superiority.
• 𝑯𝟎 : lift <= -errorMargin
This is the non-inferiority test, where we want to reject the H0 that
treated is worse than control by the errorMargin.
Assumptions of T-Test
• The sample means of the two populations being compared should follow a normal
distribution
• Implied by the Central Limit Theorem:
the sum/mean of a large number of IID (independent and identically distributed)
variables tends to follow a Normal Distribution.
• The two populations being compared should have the same variance
• Implied by the assumption that treatment only changes mean of the
population.
• Variables are sampled independently
• This implies that the metric computation has to be consistent with the serving
time treatment assignment [session-scope metrics are not consistent with
assigning guids to treatments].
Independent T Test for Two Samples
ℋ0: no difference between two groups
𝛼 (Type I error): false positive rate,
Pr(Reject ℋ0 | ℋ0 is true)
𝛽 (Type II error): false negative rate,
Pr(Not reject ℋ0 | ℋ0 is false)
Power = 1 − 𝛽
Power = Pr(Reject ℋ0 | ℋ0 is false)
[Figure: when the dotted green line (the observed statistic) falls to the left
of the left orange critical line OR to the right of the right orange critical
line, the mean of Test is not the same as the mean of Control.]
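A minimal sketch of the test just described, using SciPy on synthetic data (α matches the 0.1 used by EP later in the deck):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=10.0, scale=2.0, size=50_000)
treated = rng.normal(loc=10.1, scale=2.0, size=50_000)  # small mean shift only

# Independent two-sample t-test. equal_var=True matches the equal-variance
# assumption above (the treatment shifts only the mean of the population).
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=True)

alpha = 0.1
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```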
P-Value
1-tail vs. 2-tail two-sample T-Test on sample data only
(assuming the sample size is large enough!)
Can we trust the P-Value?!
→ Can we trust our two-sample T-Test?!
→ Can we trust the assumptions made by our two-sample T-Test?!
Is our sample size really large enough?
→ Power Analysis!
Power
• Power is more than just the Type II error.
• It also evaluates whether the sample size is large enough to support the
assumptions made by the p-value calculation in the two-sample T-Test.
• Intuition: Repeatability/Stability of the t-test result.
Power vs. Sample Size and Lift
MSZ (Minimum Sample Size)
The minimum number of sampled variables required for the test to reach the
target statistical power (eg, β = 0.1).
In the MSZ formula (a standard form is given below):
• the population std-dev is approximated from a large historical data set,
• the population mean is approximated by the mean of the control sample,
• the lift term is a preconfigured minimum detectable lift (MDE).
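The slide's MSZ formula itself did not survive extraction; for reference, a standard fixed-horizon form (an assumption on my part, consistent with the annotations above) is:

```latex
\text{MSZ} \;\ge\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{(\mu \cdot \text{MDE})^{2}}
          \;=\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\text{CV}^{2}}{\text{MDE}^{2}}
```

which is why MSZ grows with the std-dev σ, shrinks with the mean μ, and shrinks quadratically with the minimum detectable lift.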
EBay’s Fixed Horizon A-B Test
Test planning:
• Estimate variable stats for all traffic.
• Collect historical test data to infer the MDE.
• Estimate test duration w. MDE and population stats.
Test start (trial period = 1 week):
• Estimate population variance w. test traffic.
• Re-estimate duration w. estimated variance and observed lift.
Test ends (within 3 months):
• Accept Ha w. type-I err = p, or
• Don’t accept Ha, or
• Test is underpowered.
A-B Test Design @EBay
• 2-tail test on H0: lift = 0, with α = 0.1 and β = 0.1.
• We reject H0 (accept Ha) with false positive rate = p_value when p_value < α
and traffic reaches MSZ.
• We fail to reject H0 (don’t accept Ha) with false negative rate = β when
p_value > α and traffic reaches MSZ.
• An A-B test goes thru a life cycle of states in the EP exp report (sketched in
code below):
• Grey: not statistically significant (p_value > α)
• White: statistically significant but has not reached power (has not reached its
MSZ)
• Green or Red: both statistically significant and has reached power. Green
indicates lift > 0, Red lift < 0.
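A hypothetical sketch of that report-state logic (function and parameter names are illustrative, not EP's actual API; the MSZ default is arbitrary):

```python
def ep_report_state(p_value: float, sample_size: int, lift: float,
                    alpha: float = 0.1, msz: int = 1_000_000) -> str:
    """Map a metric's test results to the EP report life-cycle states."""
    stats_sig = p_value < alpha
    powered = sample_size >= msz
    if not stats_sig:
        return "grey"                      # not statistically significant
    if not powered:
        return "white"                     # significant but underpowered
    return "green" if lift > 0 else "red"  # significant and powered
```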
Stats-sig and Power on EP Dashboard
The Non-inferiority 𝑯𝟎
𝑯𝟎 : lift <= -errorMargin
Applies to:
• “Guardrail metrics”
• ML model refreshes
• Code rewrites (eg, “Replat”)
It lowers the bar to reject 𝑯𝟎, mathematically:
• Lower p-value (halved, since the test is one-sided) with the same samples → higher stats-sig!
• Lower MSZ requirement on the sample set → higher POWER!
Changing 𝑯𝟎 only requires changes to the metric computation (see the sketch below).
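One way to run the non-inferiority test with SciPy; a sketch, assuming the error margin is defined multiplicatively on the control mean (shifting the control also rescales its variance slightly, which a sketch can ignore):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(10.0, 2.0, 50_000)
treated = rng.normal(10.0, 2.0, 50_000)   # eg, a code rewrite: no real change

margin = 0.01   # H0: lift <= -1%

# Shift the control down by the margin, then test one-sided:
# rejecting H0 means treated is not worse than control by more than margin.
res = stats.ttest_ind(treated, control * (1 - margin),
                      equal_var=True, alternative="greater")
print(f"p = {res.pvalue:.4f}")   # p < alpha => declare non-inferiority
```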
Paired Two-sample T-Test in Conv Model Eval
The same input record {Query, Result set, conversions} goes thru two different
treatment processes, generating two paired variables for comparison (a
paired-test sketch follows):
• Treatment A → {ResultList, conversions} → NDCG of convs
• Treatment B → {ResultList′, conversions} → NDCG′ of convs
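A minimal paired-test sketch with SciPy (the NDCG values are made up for illustration):

```python
import numpy as np
from scipy import stats

# Each input record is scored under both treatments, giving paired NDCG values.
ndcg_a = np.array([0.62, 0.41, 0.55, 0.70, 0.48])
ndcg_b = np.array([0.65, 0.43, 0.54, 0.74, 0.50])

res = stats.ttest_rel(ndcg_b, ndcg_a)   # paired two-sample t-test
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```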
Activation vs. Sampling
● Sampling
A random process to sample sets A for control and B for treatment from
the same population.
● Activation
A treatment is activated on the variable if some condition is satisfied.
Eg, a clinical trial may have the following qualification requirement:
● A person has certain disease, and
● is willing to take the drug (this is often enforced by overseeing both
treatment and control groups to take real and fake drugs
respectively).
Why does Activation stand out?
● Can it be implemented as part of the sampling process?
● A Clinical A-B Test on a drug for cancer
● YES. The sampling process can limit to the population with cancer!
● But is that the only condition for “activation”?
● What if half of the sampled patients throw away the drug!
● The observed lift may drop 50% compared to the expected.
● Add an administrative step to your test procedure:
● Monitor your patients taking the drug.
● This is your activation step.
A-B Test in Online Services
● User interacts with Online Services using a pattern of
“Request – Impression”.
● Sampling can be applied to different units:
users, sessions, req/imp, …
● But treatment is ultimately defined on what is fed to the user:
the Impression.
● Treating a request is only a necessary condition for treating an
impression.
Activation in Online Service A-B Test
● Activation
● Is not checking whether treatment is applied in request processing.
● Is checked on impressions.
● A refreshed ranking model
● Every request is processed differently by it than by the base model, but
only a small percentage of user impressions (SRPs) are actually
different and hence treated at all.
● Any change in the backend
● The experience service may decide not to use that data in the view model
for rendering the impression.
Test Boundaries in Online Services
● A component view
● An Online Application has frontend, backend, data store, etc.
● Each component is a self-contained subsystem and runs its own A-B
test.
● It is fine if you understand who your test subjects are
● If subjects are users, then regardless of where you treat the requests,
the boundary of your test is the interaction between the application
as a whole and the users.
The Software Engineering Part
I. Metric Computation
II. EP Serving Platform
III. Inside Distributed Application
Metrics
Metrics -- overview
• A metric is:
• A type of variable used in an A-B Test
• Represents the population of a variable.
• Eg:
total items bought per session of an EBay user.
click thru rate of search queries of an EBay user.
• Types of metrics: ratio (near binomial) vs. count, weighted, normalized
• Scopes: user, guid, session, event/impression
• Data Model of Metric Computation: impression, action, attribution
• Event filtering in Metric Computation
Types of Metrics
• Proportional
Ex, “CSS” is the ratio of search sessions that converted in a guid. These are often
binomial variables (CSS is, if search sessions are considered IID).
• Counts
Ex., “bid_count”, “bin_count” (they are not normalized).
• (normalized) weighted counts
Ex., “GSS” is the count of purchase weighted with price and normalized by
number of search sessions.
What difference do they make?
• CV = std_dev / mean
MSZ increases with CV^2 (see the sketch below).
• Proportionals tend to have the lowest CV.
Weighted counts have the highest CV.
Normalization helps lower the CV.
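A small sketch of the CV² effect on MSZ, using the standard fixed-horizon formula given earlier (the CV values here are illustrative assumptions, not EBay measurements):

```python
from scipy.stats import norm

def msz(cv: float, mde: float = 0.01, alpha: float = 0.1, beta: float = 0.1) -> float:
    """MSZ = 2 * (z_{1-a/2} + z_{1-b})^2 * CV^2 / MDE^2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
    return 2 * z**2 * cv**2 / mde**2

for name, cv in [("proportional", 1.0), ("normalized count", 2.0),
                 ("weighted count", 4.0)]:
    print(f"{name:>16}: CV={cv} -> MSZ ~ {msz(cv):,.0f}")
# Doubling CV quadruples the minimum sample size.
```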
Scopes of Metrics
• The scope, in which the metric is computed.
• User
• GUID (Long Session)
A long running session of a user as per user-agent and site.
• Session
A short session of a user as per user-agent and site, which terminates after 30
mins of inactivity.
• Impression
A single unit of user-site interaction. For example: an SRP.
Scopes of Metrics
| Scope ↓ Metric → | GSS | Share of Search Sessions w. BBOWAC | SRP_to_VI conversion |
|---|---|---|---|
| User | Ave GMB per session of a user | % of search sessions w. BBOWAC of a user | % of SRPs w. VI conversion of a user |
| Long Session (GUID) | Ave GMB per session of a GUID | % of search sessions w. BBOWAC of a GUID | % of SRPs w. VI conversion of a GUID |
| Session | GMB of a session | For a search session, =1 if it has a BBOWAC, =0 if not | % of SRPs w. VI conversion of a session |
| Impression | n/a | n/a | For an SRP, =1 if SRP is converted, =0 if not |
Statistics on Population (QU Definition)
Data: all SRPs on US site from 2019-04-01 to 2019-04-14
Noisy events filtered => slight decrease in CV and minimum sample size
Significant increase in treated_SRP_only => different attribution logic of purchase to SRP
Session scope: slight increase in CV and minimum sample size, but triples the actual sample size
Choice of Scopes
• Finer-granular scopes:
• Increase sample size.
In 2 weeks, a GUID has on average 3 sessions.
• But may have higher CV, thus higher MSZ.
We did observe higher CV in the Session scope of GSS and GMB than in the GUID scope.
• Consider the consistency between metric scope and serving time
scope.
• GUID is used for random assignment of treatment at serving time.
• Computing metrics in session scope → implies non-independent sampling of
sessions.
Computation of Metrics
Metrics are computed from the following data model of user behavior
logs:
GUID := seq(session)
session := seq([impression|action])
treat := ref(impression), ref(treatment)
attribution := [ref(session)|ref(impression)], ref(action)
Types of impressions: SRP, …
Types of actions: BBOWAC, purchase, SRP_2_VI, …
Computation of Metrics
• Metric: BI per session at GUID scope
• Consider a norm table:
guid*, action_id*, action_type, attributing_session_id, treatmentId*
• In SQL:
SELECT treatmentId, guid,
SUM(bi_action) / COUNT(DISTINCT attributing_session_id)
FROM norm_table
WHERE action_type = 'BI'
GROUP BY guid, treatmentId
• A metric is generally computed thru (see the pandas sketch below):
• Aggregating actions of a given kind,
• Normalizing the aggregate by the # of attribution scopes (eg, sessions or
impressions) that attribute to those user actions,
• Grouping by the variable scope (eg, GUID, session, impression, etc).
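A pandas rendering of that recipe; a sketch over a hypothetical norm table, not EBay's actual pipeline:

```python
import pandas as pd

# Hypothetical norm table: one row per attributed action.
actions = pd.DataFrame({
    "guid":                   ["g1", "g1", "g1", "g2"],
    "action_type":            ["BI", "BI", "VI", "BI"],
    "bi_action":              [1, 2, 0, 1],
    "attributing_session_id": ["s1", "s2", "s1", "s3"],
    "treatment_id":           ["T", "T", "T", "C"],
})

bi = actions[actions["action_type"] == "BI"]
agg = bi.groupby(["treatment_id", "guid"]).agg(      # group by variable scope
    bi_total=("bi_action", "sum"),                   # aggregate the actions
    sessions=("attributing_session_id", "nunique"),  # count attribution scopes
)
agg["bi_per_session"] = agg["bi_total"] / agg["sessions"]  # normalize
print(agg)
```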
Alternative Computations
• “Whole Session”
No filtering. All sessions of the GUID are considered treated if the GUID has one
impression treated.
Filtering data based on the attribution model leads to many alternatives:
• “Event Level”
Throw away all untreated impressions and actions that do not attribute to
treated impressions.
• “In Session”
Throw away all untreated sessions of a GUID.
A session is treated if at least one of its impressions is treated.
An Ad-hoc Analysis on Alternative
Computations
• Aggressive filtering substantially increases the level of stats-sig,
but only on A-B tests where we can perform proper activation.
8 Variants of each Metric
• Variants on scope: GUID, Session
• Variants on filtering impressions
• Treated definition: Ranking experiments use Treated = Best Match; QU
experiments use a specific treated definition.
Hypothesis Tests on Previous Experiments
[Chart: statistically significant vs. not significant results for the QU
experiment and the Ranking experiments.]
Data: 15 treatments from 10 experiments on US site
EP Serving Platform
Basic Process of Online A-B Test
● Select metrics.
● Design hypothesis on the metrics.
● Implement alternative behaviors/implementations in
application.
● Implement activation of the experiment in application.
● Create an experiment configuration in EP for the application
to use.
● Run exp and collect data
● Analyze metrics
Interface of EP Serving
● Assignment (sampling)
- Assign req to treatments
● Tracking
- Tracking msg format to specify
treatments activated on an
impression (xttag).
- General endpoint to receive
tracking message (Lighthouse)
Sampling Unit
● Sampling unit is the unit of random treatment assignment.
● The concept is highly related to variable scope
● Possible choices:
User, Long Session, Session, Request/Event
Considerations for choosing a Sampling Unit
● Consistent user experience: User, Long Session
● Availability at serving time:
Session is currently only a notion in offline data processing.
Long session is highly available at serving, more so than
logged-in user.
Orthogonality of Experiments/Treatments
● 2 treatments are orthogonal to each other if they can be
applied to the same sampled variable (assigned to the same
request/impression).
● Treatments of the same experiment are non-orthogonal to
each other.
● 2 experiments are orthogonal to each other iff every
treatment of exp1 is orthogonal to every treatment of exp2.
● Experiment Plane:
A group of experiments that are orthogonal to each other.
Conceptual Data Model of EP Serving
EP := ExpFlight+
ExpFlight := ExpConfig+, duration
ExpConfig := Treatment+
Treatment := EpFactor+
EpFactor := name, value
● Enforced by policy: There is only
one ExpFlight for an Exp Plane at
any given time.
● EpFactors are used by application
code:
● Gate alternative logics for
treatment
● Activate corresponding
treatment if request “qualifies”
Experiment Config
ExpConfig Data Model
Treatment Assignment (Sampling)
Data Structure:
ExpFlight := randNum, TreatmentState+
TreatmentState := hashBuckets, Treatment
For each ef in the array of ExpFlights:
  hashBucket = hash(ef.randNum, req.GUID)
  ts = getTreatmentStateByHashBucket(ef, hashBucket)
  req.expContext.addTreatment(ts.Treatment)
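A runnable Python rendering of the assignment loop; a sketch in which the hash function, bucket count, and object shapes are assumptions, not EP's actual implementation:

```python
import hashlib

N_BUCKETS = 100

def hash_bucket(rand_num: str, guid: str) -> int:
    """Deterministic bucket for a (flight randomization seed, GUID) pair."""
    digest = hashlib.md5(f"{rand_num}:{guid}".encode()).hexdigest()
    return int(digest, 16) % N_BUCKETS

def assign_treatments(req, exp_flights):
    """For each flight, hash the request's GUID into a bucket and add the
    matching treatment (if any) to the request's ExpContext."""
    for ef in exp_flights:
        bucket = hash_bucket(ef.rand_num, req.guid)
        ts = ef.treatment_state_for_bucket(bucket)  # maps bucket -> TreatmentState
        if ts is not None:
            req.exp_context.add_treatment(ts.treatment)
```

Hashing on (flight seed, GUID) keeps assignment deterministic per GUID within a flight, which is what makes the GUID a consistent sampling unit.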
Treatment Activation and Tracking
• Application activates EpFactors.
• EP Runtime Framework activates treatments accordingly.
• Notification of activated treatments on a request is thru the
tracking API.
Application sends to EP/Lighthouse:
“xttag=treatmentId1,treatmentId2,…”
A-B Test in Distributed
Application
How does it change A-B Test
● The 4 steps:
Assignment, treatment, activation and tracking are
distributed.
● Treatment logic itself can be distributed.
● Activation logic itself can be distributed.
A Distributed Application consists of multiple subsystems/
services that interact with each other thru message passing.
Live Exp in Distributed Application
Scatter & Gather
• Platform
• Framework
• Application
ExpContext
This data structure lives thru the whole lifecycle of handling a
request in a distributed application. It serves two purposes:
● Carries all necessary info for the application’s services to finalize what treatments
should be applied, and to apply them.
● Carries all necessary info to generate a tracking record to confirm to EP what
treatments are applied.
● ExpContext is generated by EP and sent to application at Step 1 in Figure 4. And
it is passed thru the RPC call graph with the scatter&gather pattern in the
distributed application in Step 2/3 in Figure 4.
ExpContext
Logical Data Structure
● ExpContext => treatment* // all assigned treatments on a request
● Treatment => factor*, activated=[true|false]
// each treatment has multiple factors
● Factor => (name, value) // each factor is a unique pair of {name, value}
● Within a distributed application, factors can be in different name spaces
corresponding to different services.
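A minimal sketch of that structure as Python dataclasses (field names follow the slide; the method is illustrative):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Factor:
    name: str    # may be namespaced per service within the application
    value: str

@dataclass
class Treatment:
    factors: list[Factor]
    activated: bool = False

@dataclass
class ExpContext:
    treatments: list[Treatment] = field(default_factory=list)

    def activated_treatments(self) -> list[Treatment]:
        """Treatments to report back to EP in the tracking message."""
        return [t for t in self.treatments if t.activated]
```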
ExpContext -- Application Programming Model
• EP Factors are parameters in the implementation of a distributed application:
• Flags directly used in code.
• Parameters referenced in meta-data that controls behaviors of code (aka
Configurations).
• ExpContext (factors) is part of the context of the whole call graph of services for a single
entry request.
• Implicit passing, as opposed to explicit passing thru the RPC call graph.
• Based on seeing different factors associated with a request:
• Implementation checks qualification (aka “activation”) of treatment.
• Implementation performs alternative logic on req processing.
• Qualification check →
• Activates/de-activates some ep_factors
ExpContext – Application Framework
• Implements scatter & gather of ExpContext in the application:
• Implicit passing
• Merging
• Splitting the set of ep_factors based on the name spaces of services.
• Implements activation of treatments based on activation of ep_factors.
• Implements the tracking message.
ExpContext
• EP Factor usage outside A-B Tests:
• Passed in from the entry request to override application behavior (for
testing/debugging, etc.)
Inside Cassini
Query Transformation: DSBE A-B Test
• Query Transformation
queries DSBE tables for hints.
• An A-B Test on a DSBE table
uses different versions of
records for treatment vs.
control.
• Query with version:
AND(DSBE_query, version:val)
where DSBE_query comes from the original cassini query, and version:val comes
from a cassini query parameter or from ep_factors in the expContext.
Query Transformation: DSBE A-B Test
Activation:
● Treatment and control (different DSBE versions) give different DSBE lookup
results
Alternative Logic of Treatment:
● When the above qualification is satisfied, the original query is transformed
differently between treatment and control, and hence is processed differently in
the Search Engine (in terms of retrieval, ranking, etc.).
[Figure: Cassini query-processing pipeline with EP factors flowing thru it.
Components: DSBE Rewrite (DSBE API), SIBE Rewrite (SIBE API), Ranking Profile
Compiler, Search Engine, Blenders. Messages passed along the way include:
{query, dsbe_usecase, epFactors}, {dsbe_lookup, activated_epFactors},
{transformed_query, epFactors, profile}, {complete_cassini_query},
{resultList, activated_epFactors}, {epFactors, postProcessingConfig}, and
{result list, postProcessingConfig, epFactors}.]
[Figure: frontend flow with User Agent, EP Serving, Controller, View Data
Model, View Render, and Tracking Service; messages include {req, epContext},
{data model, epContext w. activations}, {impression, trackingInfo}, and
Tracking_msg: {…, xtTags}.]
• Experience Service may be a more appropriate endpoint than Sa2S
as an A-B test entry point service.
• EpContext (contains epFactors) is at application scope, effectively shared
across services and subsystems.
• The condition for activation of a treatment is distributed across backend,
frontend and even the user agent.