AI/ML Infra Meetup | Perspective on Deep Learning Framework
1. A-B Test Platform at EBay
From Statistics to Distributed Systems
Michael Lei
11/18/2019
2. The Statistics and Concepts
• Problem Structure in Statistics
• Formal Statistical Methodologies
• Hypothesis Testing: Superiority test, Non-inf test
• Two-sample T-Tests: 1-tail vs. 2-tail, paired vs. unpaired
• Power Analysis
• Is the sample set large enough?
• How MSZ (min sample size) relates to lift, std-dev and mean of population
• Sampling vs. Activation
• Activation is ultimately on impressions, where end users interact with computer
programs
3. Objective of an A-B Test
• Effect on a random variable
• Assume a treatment only shifts the mean of its distribution.
• Lift: change in the mean, measured as a ratio
• Find how a treatment affects a random variable.
• Example metrics:
• CTR on results of a search query (metrics for search engine)
• CTR on an Ad impression (metrics for CPC ads)
• GMB of a user, conversion rate of sessions of a user (metrics for E-commerce)
• Example treatments:
• New ranking model (search engine)
• New CTR model (ads)
• New item view page (e-commerce)
4. A random process that generates variable v:
v = p(x1, x2, ..., ε)
Change a parameter x1 that we can control:
v' = p(x1', x2, ..., ε)
We want to see whether there is a meaningful difference
between v and v'.
Is the population mean of v' different from that of v?
But we can't directly measure the population mean.
5. Run the random process a few times to
generate two sample sets:
{v1, v2, v3, ...} and {v1', v2', v3', ...}
A naïve idea:
Compute lift = mean(v') / mean(v) - 1 from the means of the
two sample sets.
Is lift = 0?
But the lift itself is a random variable!
Hence the naïve idea is not complete.
6. It turns out:
The lift roughly follows a known distribution.
CLT: the sum of a large number of IID variables
follows a normal distribution.
Hence so does the sample mean of IID variables,
as long as the following are satisfied:
• "IID"
• "Large number" of samples
[Figure: probability density function; the shaded area is Prob(v > x)]
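The CLT claim above can be checked with a quick simulation (a sketch in Python, not from the deck): sample means of IID Uniform(0, 1) draws concentrate around the population mean with std-dev σ/√n.

```python
import random
import statistics

def sample_means(n_means, n, seed=7):
    """Each entry is the mean of n IID Uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.random() for _ in range(n))
            for _ in range(n_means)]

# Uniform(0, 1): mean = 0.5, std-dev = sqrt(1/12) ~ 0.2887.
# CLT says the sample mean over n = 100 draws is ~ Normal(0.5, 0.02887).
means = sample_means(n_means=2000, n=100)
print(statistics.fmean(means), statistics.stdev(means))
```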
7. Formal Methodology
• Hypothesis Testing
• Form a hypothesis on random variables.
• Collect evidence from samples to reject the hypothesis.
• Two-sample T-tests
• Test a hypothesis on the sample means or the lift of the sample means of our
variables.
• Reject the hypothesis with alpha (type I error).
• Power Analysis
• Is the sample set large enough?
8. Null Hypothesis on Lift
Let lift = treatedMean / controlMean - 1
• 𝑯𝟎 : lift = 0
This is usually referred to as two-tailed test. It tests both superiority
(lift > 0) and inferiority (lift < 0). And it is the H0 used by EBay EP.
• 𝑯𝟎 : lift <= 0
One-tail test for superiority.
• 𝑯𝟎 : lift <= -errorMargin
This is non-inferiority test, where we want to reject the H0 that
treated is worse than control by the errorMargin.
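A minimal sketch of how the two-tailed H0 can be tested in code, assuming samples are large enough that a normal approximation of the two-sample t-test is adequate (function names and data are illustrative, not EP's implementation):

```python
import random
import statistics
from statistics import NormalDist

def lift_z_test(control, treated):
    """Two-sided test of H0: lift = 0 (equivalently, equal means).
    Large-sample z-approximation of the two-sample t-test."""
    m_c, m_t = statistics.fmean(control), statistics.fmean(treated)
    se = (statistics.variance(control) / len(control)
          + statistics.variance(treated) / len(treated)) ** 0.5
    z = (m_t - m_c) / se
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return m_t / m_c - 1.0, p_two_sided

rng = random.Random(0)
control = [rng.gauss(10.0, 2.0) for _ in range(5000)]
treated = [rng.gauss(10.2, 2.0) for _ in range(5000)]  # true lift = 2%
lift, p_value = lift_z_test(control, treated)
```

For the one-tailed superiority H0 (lift <= 0), the p-value would instead be 1 - Φ(z); the non-inferiority H0 shifts the null boundary by the errorMargin before computing z.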
9. Assumptions of T-Test
• Mean of the two populations being compared should follow a normal
distribution
• Implied by the assumption of the Central Limit Theorem:
the sum/mean of a large number of IID (independent and identically distributed)
variables tends to follow a Normal Distribution.
• The two populations being compared should have the same variance
• Implied by the assumption that treatment only changes mean of the
population.
• Variables are sampled independently
• This implies that the metric computation has to be consistent with the serving-
time treatment assignment [session-scope metrics are not consistent with
assigning GUIDs to treatments].
10. Independent T Test for Two Samples
ℋ0: no difference between two groups
𝛼 (Type I error) - false positive of ℋ1
Pr(Reject ℋ0 | ℋ0 is true)
𝛽 (Type II error) - false negative
Pr(Not reject ℋ0 | ℋ0 is false)
Power = 1 − 𝛽
Power = Pr(reject H0 | H0 is false)
[Figure: sampling distributions of Control and Test with orange critical-value
lines. When the dotted green line (the observed statistic) is to the left of the
left orange line OR to the right of the right orange line, the mean of Test is
not the same as the mean of Control.]
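Power can be sketched numerically with a normal approximation (an illustrative formula, not the platform's exact computation): for a two-sided test at level α with n samples per group, common std-dev σ, and true mean difference δ.

```python
from statistics import NormalDist

def approx_power(delta, sigma, n, alpha=0.1):
    """Power ~ Phi(|delta| / se - z_{1-alpha/2}) with se = sigma * sqrt(2/n).
    Two-sided, two-sample test; the far tail is ignored."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = sigma * (2.0 / n) ** 0.5
    return NormalDist().cdf(abs(delta) / se - z_crit)
```

When δ = 0 the formula returns α/2, the one-sided false-positive rate; power then grows toward 1 as n or δ grows.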
11. P-Value
1-tail vs. 2-tail two-sample T-Test on Sample Data only.
(assuming the sample size is large enough !)
12. Can we trust P-Value ?!
Can we trust our two-sample T-Test ?!
Can we trust the assumption made by our two-sample T-Test ?!
Is our sample size really large enough?
Power Analysis!
13. Power
• More than just type II error.
• It also evaluates whether the sample size is large enough to confirm
the assumption made by p-value calculation in two-sample T-Test.
• Intuition: Repeatability/Stability of the t-test result.
14. Power vs. Sample Size and Lift
MSZ (Minimum Sample Size)
Minimum number of sampled variables required to reach statistical
power (eg, β = 0.1) for the test.
Inputs to the MSZ computation:
• Population std-dev: use a large historical data set to approximate.
• Population mean: use the mean of the control sample to approximate.
• Lift: a preconfigured minimum detectable lift.
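The relation between MSZ, lift, std-dev, and mean can be sketched with the standard normal-approximation sample-size formula (a textbook formula; EP's exact computation may differ):

```python
import math
from statistics import NormalDist

def min_sample_size(mean, std_dev, min_lift, alpha=0.1, beta=0.1):
    """Per-group MSZ to detect `min_lift` on a metric with the given
    population mean and std-dev:
        n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / (lift * mean))^2
    Note MSZ grows with (std_dev / mean)^2, i.e., with CV^2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(1.0 - beta)
    delta = min_lift * mean          # absolute mean shift to detect
    return math.ceil(2.0 * (z * std_dev / delta) ** 2)

n_2pct = min_sample_size(mean=10.0, std_dev=2.0, min_lift=0.02)
n_1pct = min_sample_size(mean=10.0, std_dev=2.0, min_lift=0.01)
```

Halving the detectable lift roughly quadruples the required sample size, which is why small-lift experiments run for weeks.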
15. EBay's Fixed Horizon A-B Test
Test planning:
• Estimate variable stats for all traffic.
• Collect historical test data to infer MDE.
• Estimate test duration with MDE and population stats.
Test start (trial period = 1 week):
• Estimate population variance with test traffic.
• Re-estimate duration with the estimated variance and observed lift.
Test ends (after up to 3 months):
- Accept Ha w. type-I err = p
- Don't accept Ha
- Test is underpowered
16. A-B Test Design @EBay
• 2-tail test on H0: lift=0, select 𝞪=0.1 and 𝝱=0.1.
• We reject H0 (accept Ha) with false positive rate = p_value when p_value < 𝞪
and traffic reaches msz.
• We fail to reject H0 (don’t accept Ha) with false negative rate = 𝝱 when
p_value > 𝞪 and traffic reaches msz.
• An A-B test goes thru a life cycle of different states in EP exp report
• Grey: not statistically significant (p_value > 𝞪)
• White: statistically significant but has not reached power (has not reached its
MSZ)
• Green or Red: both statistically significant and has reached power. Green
indicates lift > 0, Red indicates lift < 0.
18. The Non-inferiority 𝑯𝟎
𝑯𝟎 : lift <= -errorMargin
Applies to:
• “Guardrail metrics”
• ML model refreshes
• Code rewrites (eg, “Replat”)
It lowers the bar to reject 𝑯𝟎, mathematically:
• Lower p-value (half) with the same samples => higher stats-sig!
• Lower MSZ requirement on the sample set => higher power!
Changes to 𝑯𝟎 only require changes to metric computation.
19. Paired Two-sample T-Test in Conv Model Eval
[Figure: the same input record {Query, Result set, conversions} is run thru
Treatment A and Treatment B, producing {ResultList, conversions} and
{ResultList', conversions}. Computing NDCG of conversions on each side gives
NDCG and NDCG'. The same record, thru different treatment processes, generates
the two variables for comparison.]
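A sketch of the paired comparison in Python (illustrative; a large-sample z-approximation stands in for the t-distribution): each input record yields one NDCG value per treatment, and the test runs on the per-record differences.

```python
import random
from statistics import NormalDist, fmean, stdev

def paired_z_test(ndcg_a, ndcg_b):
    """Paired test on per-record NDCG differences (B minus A).
    Returns (z, two-sided p-value)."""
    diffs = [b - a for a, b in zip(ndcg_a, ndcg_b)]
    se = stdev(diffs) / len(diffs) ** 0.5
    z = fmean(diffs) / se
    return z, 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Illustrative data: per-query quality varies a lot, but B is a hair better.
rng = random.Random(1)
base = [rng.random() for _ in range(2000)]
ndcg_a = [q + rng.gauss(0, 0.02) for q in base]
ndcg_b = [q + 0.01 + rng.gauss(0, 0.02) for q in base]
z, p_value = paired_z_test(ndcg_a, ndcg_b)
```

Pairing removes the between-query variance (the `base` term), so the same small lift reaches significance with far fewer samples than an unpaired test on the same data.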
20. Activation vs. Sampling
● Sampling
A random process to sample sets A for control and B for treatment from
the same population.
● Activation
A treatment is activated on the variable if some condition is satisfied.
Eg, a clinical trial may have the following qualification requirements:
● A person has a certain disease, and
● is willing to take the drug (this is often enforced by supervising both the
treatment and control groups as they take the real and placebo drugs,
respectively).
21. Why does Activation stand out?
● Can it be implemented as part of the sampling process?
● A clinical A-B Test on a drug for cancer
● YES. The sampling process can limit itself to the population with cancer!
● But is that the only condition for "activation"?
● What if half of the sampled patients throw away the drug?
● The observed lift may go down 50% compared to the expected.
● Add an administrative step to your test procedure:
● Monitor your patients taking the drug.
● This is your activation step.
22. A-B Test in Online Services
● User interacts with Online Services using a pattern of
“Request – Impression”.
● Sampling can be applied to different units:
users, sessions, req/imp, …
● But treatment is ultimately defined on what is fed to the user: the
Impression.
● Treating a request is only a necessary condition for treating an
impression.
23. Activation in Online Service A-B Test
● Activation
● Is not checking whether treatment is applied in request processing.
● Is checked on impressions.
● A refreshed ranking model
● Every request is processed differently by it than the base model. But
only a small percentage of user impressions (SRPs) are actually
different and hence treated at all.
● Any change in the backend
● Experience service may decide not to use that data in the view model
for rendering impression.
24. Test Boundaries in Online Services
● A component view
● An Online Application has frontend, backend, data store, etc.
● Each component is a self-contained subsystem and runs its own A-B
test.
● It is fine if you understand who your test subjects are
● If subjects are users, then regardless of where you treat the requests,
the boundary of your test is the interaction between the application as a
whole and its users.
25. The Software Engineering Part
I. Metric Computation
II. EP Serving Platform
III. Inside a Distributed Application
27. Metrics -- overview
• A metric is:
• A type of variable used in an A-B Test
• Represents the population of a variable.
• Eg:
total items bought per session of an EBay user.
click thru rate of search queries of an EBay user.
• Types of metrics: ratio (near binomial) vs. count, weighted, normalized
• Scopes: user, guid, session, event/impression
• Data Model of Metric Computation: impression, action, attribution
• Event filtering in Metric Computation
28. Types of Metrics
• Proportional
Ex, “CSS” is the ratio of search sessions that converted in a guid. They are often
binomial variables (CSS is binomial if search sessions are considered IID).
• Counts
Ex., “bid_count”, “bin_count” (they are not normalized).
• (normalized) weighted counts
Ex., “GSS” is the count of purchase weighted with price and normalized by
number of search sessions.
29. What difference do they make?
• CV = std_dev / mean
MSZ increases with CV^2
• Proportionals tend to have the lowest CV.
Weighted counts have the highest CV.
Normalization helps lower the CV.
30. Scopes of Metrics
• The scope, in which the metric is computed.
• User
• GUID (Long Session)
A long running session of a user as per user-agent and site.
• Session
A short session of a user as per user-agent and site, which terminates after 30
mins of inactivity.
• Impression
A single unit of user-site interaction. For example: an SRP.
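The 30-minute session rule can be sketched as follows (a simplified sessionizer; real processing would also key on user-agent and site):

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(event_times):
    """Split one user's (per user-agent, per site) events into sessions:
    a gap longer than 30 minutes starts a new session."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] <= SESSION_TIMEOUT:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions
```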
31. Scopes of Metrics

Scope | GSS | Shares of Search Sessions w. BBOWAC | SRP_to_VI conversion
User | Ave GMB per session of a user | % of search sessions w. BBOWAC of a user | % of SRPs w. VI conversion of a user
Long Session (GUID) | Ave GMB per session of a GUID | % of search sessions w. BBOWAC of a GUID | % of SRPs w. VI conversion of a GUID
Session | GMB of a session | For a search session, =1 if it has a BBOWAC, =0 if not | % of SRPs w. VI conversion of a session
Impression | n/a | n/a | For an SRP, =1 if it is converted, =0 if not
32. Statistics on Population (QU Definition)
Data: all SRPs on US site from 2019-04-01 to 2019-04-14
Noisy events filtered => slight decrease on CV and minimum sample size
Significant increase in treated_SRP_only => different attribution logic of purchase to SRP
Session scope: slight increase on CV and minimum sample size, but triple actual sample size
33. Choice of Scopes
• Finer-granular scopes:
• Increase sample size.
In 2 weeks, a GUID has on average 3 sessions.
• But may have a higher CV, thus a higher MSZ.
We did observe higher CV in the Session scope of GSS and GMB than in the GUID scope.
• Consider the consistency between metric scope and serving time
scope.
• GUID is used for random assignment of treatment at serving time.
• Computing metrics in session scope implies non-independent sampling of
sessions.
34. Computation of Metrics
Metrics are computed from the following data model of user behavior
logs:
GUID := seq(session)
session := seq([impression|action])
treat := ref(impression), ref(treatment)
attribution := [ref(session)|ref(impression)], ref(action)
Types of impressions: SRP, …
Types of actions: BBOWAC, purchase, SRP_2_VI, …
35. Computation of Metrics
• Metric: BI per session at GUID scope
• Consider a norm table:
guid*, action_id*, action_type, attributing_session_id, treatmentId*
• select treatmentId, guid, sum(bi_action) / count(distinct attributing_session_id)
from norm_table
where action_type = 'BI'
group by treatmentId, guid
• A metric is generally computed thru:
• Aggregating a kind of action,
• Normalizing the aggregate by the # of attribution scopes (eg, session or
impression) that attribute to those user actions,
• Grouping by the variable scope (eg, GUID, session, impression, etc).
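The same aggregate/normalize/group-by recipe, sketched in plain Python over rows shaped like the norm table above (the per-action `value` column is an assumption for illustration):

```python
from collections import defaultdict

def bi_per_session_by_guid(rows):
    """Metric 'BI per session at GUID scope'.
    rows: (guid, action_id, action_type, attributing_session_id,
           treatment_id, value) tuples."""
    totals = defaultdict(float)          # aggregate the BI actions
    sessions = defaultdict(set)          # attribution scopes to normalize by
    for guid, _aid, action_type, session_id, treatment_id, value in rows:
        if action_type != "BI":          # event filtering
            continue
        key = (treatment_id, guid)       # group by the variable scope
        totals[key] += value
        sessions[key].add(session_id)
    return {k: totals[k] / len(sessions[k]) for k in totals}

rows = [
    ("g1", "a1", "BI", "s1", "t1", 2.0),
    ("g1", "a2", "BI", "s2", "t1", 1.0),
    ("g1", "a3", "VIEW", "s1", "t1", 1.0),   # filtered out
    ("g2", "a4", "BI", "s3", "t1", 4.0),
]
metric = bi_per_session_by_guid(rows)
```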
36. Alternative Computations
Filtering data based on the attribution model leads to many alternatives:
• "Whole Session"
No filtering. All sessions of the GUID are considered treated if the GUID has one
impression treated.
• "Event Level"
Throw away all untreated impressions and actions that do not attribute to
treated impressions.
• "In Session"
Throw away all untreated sessions of a GUID.
A session is treated if at least one of its impressions is treated.
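A sketch of the "In Session" variant (illustrative schema: each session is a list of impression dicts carrying a boolean `treated` flag):

```python
def in_session_filter(guid_sessions):
    """Keep only a GUID's treated sessions; a session is treated if at
    least one of its impressions is treated."""
    return {
        session_id: impressions
        for session_id, impressions in guid_sessions.items()
        if any(imp["treated"] for imp in impressions)
    }

guid_sessions = {
    "s1": [{"treated": False}, {"treated": True}],
    "s2": [{"treated": False}],
}
kept = in_session_filter(guid_sessions)
```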
37. An Ad-hoc Analysis on Alternative Computations
• Aggressive filtering substantially increases the level of stats-sig,
but only on A-B tests where we can perform proper activation.
38. 8 Variants of each Metric
• Variants on scope: GUID, Session
• Variants on filtering impressions
• Ranking experiments: Treated = Best Match
• QU experiments: specific treated definition
39. Hypothesis Tests on Previous Experiments
Data: 15 treatments from 10 experiments on US site
[Chart: statistically significant vs. not significant results, for QU and
Ranking experiments]
41. Basic Process of Online A-B Test
● Select metrics.
● Design hypothesis on the metrics.
● Implement alternative behaviors/implementations in
application.
● Implement activation of the experiment in application.
● Create an experiment configuration in EP for the application
to use.
● Run the experiment and collect data.
● Analyze the metrics.
42. Interface of EP Serving
● Assignment (sampling)
- Assign req to treatments
● Tracking
- Tracking msg format to specify
treatments activated on an
impression (xttag).
- General endpoint to receive
tracking message (Lighthouse)
43. Sampling Unit
● Sampling unit is the unit of random treatment assignment.
● The concept is highly related to variable scope
● Possible choices:
User, Long Session, Session, Request/Event
44. Considerations for choosing a Sampling Unit
● Consistent user experience: User, Long Session
● Availability at serving time:
Session is currently only a notion in offline data processing.
Long session is highly available at serving, more so than
logged-in user.
45. Orthogonality of Experiments/Treatments
● 2 treatments are orthogonal to each other if they can be
applied to the same sampled variable (assigned to the same
request/impression).
● Treatments of the same experiment are non-orthogonal to
each other.
● 2 experiments are orthogonal to each other iff every
treatment of exp1 is orthogonal to every treatment of exp2.
● Experiment Plane:
A group of experiments that are orthogonal to each other.
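The definitions imply an invariant that can be checked in code: a single request may carry at most one treatment per experiment. A small sketch (names are illustrative):

```python
from itertools import combinations

def non_orthogonal_pairs(assigned_treatments, treatment_to_experiment):
    """Pairs of treatments assigned to the same request that belong to
    the same experiment (and hence are non-orthogonal)."""
    return [
        (a, b)
        for a, b in combinations(assigned_treatments, 2)
        if treatment_to_experiment[a] == treatment_to_experiment[b]
    ]

t2e = {"t1": "exp1", "t2": "exp1", "t3": "exp2"}
```

An empty result means the assignment respects the experiment plane; any pair returned flags a violation.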
46. Conceptual Data Model of EP Serving
EP := ExpFlight+
ExpFlight := ExpConfig+,
duration
ExpConfig := Treatment+
Treatment := EpFactor+
EpFactor := name,value
● Enforced by policy: There is only
one ExpFlight for an Exp Plane at
any given time.
● EpFactors are used by application
code:
● Gate alternative logics for
treatment
● Activate corresponding
treatment if request “qualifies”
48. Treatment Assignment (Sampling)
Data Structure:
ExpFlight := randNum, TreatmentState+
TreatmentState := hashBuckets, Treatment
for each ef in the array of ExpFlights:
    hashBucket = hash(ef.randNum, req.GUID)
    ts = getTreatmentStateByHashBucket(ef, hashBucket)
    req.expContext.addTreatment(ts.Treatment)
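The assignment loop above can be sketched concretely (md5 and the dict shapes are illustrative stand-ins for EP's actual hash function and data structures):

```python
import hashlib

def hash_bucket(rand_num, guid, num_buckets=100):
    """Deterministic bucket for a GUID within one ExpFlight: hash the
    flight's random seed together with the GUID."""
    digest = hashlib.md5(f"{rand_num}:{guid}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def assign_treatments(exp_flights, guid):
    """For each flight, map the GUID's bucket to a treatment (None means
    the bucket is outside the flight's sampled range)."""
    assigned = []
    for flight in exp_flights:
        bucket = hash_bucket(flight["randNum"], guid)
        treatment = flight["bucketToTreatment"].get(bucket)
        if treatment is not None:
            assigned.append(treatment)
    return assigned

flight = {
    "randNum": 42,
    # buckets 0-49 -> control, 50-99 -> treated (a 50/50 split)
    "bucketToTreatment": {b: "control" if b < 50 else "treated"
                          for b in range(100)},
}
```

Hashing the GUID with a per-flight random seed keeps a GUID's assignment stable within a flight while decorrelating assignments across flights, which is what makes planes of experiments orthogonal.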
49. Treatment Activation and Tracking
• Application activates EpFactors.
• EP Runtime Framework activates treatments accordingly.
• Notification of activated treatments on a request is thru the
tracking API.
Application sends to EP/Lighthouse:
“xttag=treatmentId1,treatmentId2,…”
51. How does it change the A-B Test?
A Distributed Application consists of multiple subsystems/services
that interact with each other thru message passing.
● The 4 steps (assignment, treatment, activation, and tracking)
are distributed.
● Treatment logic by itself can be distributed.
● Activation logic by itself can be distributed.
53. ExpContext
This data structure lives thru the whole lifecycle of handling a
request in a distributed application. It serves two purposes:
● Carries all necessary info for the application's services to finalize what
treatments should be applied, and to apply them.
● Carries all necessary info to generate a tracking record confirming to EP what
treatments were applied.
ExpContext is generated by EP and sent to the application at Step 1 in Figure 4,
and it is passed thru the RPC call graph with the scatter & gather pattern in the
distributed application in Steps 2/3 in Figure 4.
54. ExpContext
Logical Data Structure
● ExpContext => treatment* // all assigned treatments on a request
● Treatment => factor*, activated=[true|false]
// each treatment has multiple factors
● Factor => (name, value) // each factor is a unique pair of {name, value}
● Within a distributed application, factors can be in different name spaces
corresponding to different services.
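The logical model maps naturally onto dataclasses; this sketch also shows how the activated treatments roll up into the xttag tracking message from slide 49 (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Factor:
    name: str       # may be namespaced per service within the application
    value: str

@dataclass
class Treatment:
    treatment_id: str
    factors: tuple
    activated: bool = False

@dataclass
class ExpContext:
    """All treatments assigned to a request; travels with the whole
    RPC call graph."""
    treatments: list = field(default_factory=list)

    def activate(self, treatment_id):
        for t in self.treatments:
            if t.treatment_id == treatment_id:
                t.activated = True

    def xttag(self):
        """Tracking tag confirming activated treatments to EP."""
        return "xttag=" + ",".join(
            t.treatment_id for t in self.treatments if t.activated)

ctx = ExpContext([
    Treatment("t1", (Factor("search.rankModel", "v2"),)),
    Treatment("t2", (Factor("ads.ctrModel", "new"),)),
])
ctx.activate("t1")
```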
55. ExpContext -- Application Programming Model
EP Factors are parameters in implementation of a distributed application
• Flags directly used in code.
• Parameters referenced in meta-data that controls behaviors of code (aka
Configurations).
ExpContext (factors) is part of the context of whole call graph of services for a single
entry request.
• Implicit passing as opposed to explicit passing thru the RPC call graph.
Based on seeing different factors associated with a request:
• Implementation checks qualification (aka “activation”) of treatment.
• Implementation performs alternative logic on req processing.
Qualification check
• Activates/de-activates some ep_factors
56. ExpContext – Application Framework
Implements scatter&gather of ExpContext in application.
• Implicit passing
• Merging
• Splitting set of ep_factors based on name spaces of services.
Implements activation of treatments based on activation of ep_factors.
Implements the tracking message.
57. ExpContext
EP Factor usage outside A-B Tests
• Passed in from entry request to overwrite application behavior (for
testing/debugging, etc)
59. Query Transformation: DSBE A-B Test
• Query Transformation queries DSBE tables for hints.
• An A-B Test on a DSBE table uses different versions of records for
treatment vs. control.
• Query with version: AND(DSBE_query, version:val)
- DSBE_query comes from the original cassini query.
- val comes from a cassini query parameter, or from ep_factors in
expContext.
60. Query Transformation: DSBE A-B Test
Activation:
● Treatment and control (different DSBE versions) give different DSBE lookup
results
Alternative Logic of Treatment:
● When the above qualification is satisfied, original query is transformed
differently between treatment and control, and hence is processed differently in
Search Engine (in terms of retrieval and ranking, etc).
63. [Figure: request flow. The User Agent sends a req/action to the Controller;
EP Serving provides {epContext}; the Controller passes {req, epContext} to the
View Data Model, which returns {data model, epContext w. activations} to the
View Render; the rendered {impression, trackingInfo} goes back to the User
Agent; a tracking_msg {…, xtTags} is sent to the Tracking Service.]
• Experience Service may be a more appropriate endpoint than Sa2S
as an A-B test entry point service.
• EpContext (contains epFactors) is at application scope, effectively
shared across services and subsystems.
• The condition for activation of a treatment is distributed across
backend, frontend, and even the user agent.
activations}