Randomized experiments are the gold standard for understanding and quantifying causal relationships. This talk is divided into two parts, corresponding to before and after the experiment is run. In the first part, we discuss how to design and implement online experiments using PlanOut, an open-source toolkit for advanced online experimentation used at Facebook. We show how to implement basic “A/B tests,” within-subjects designs, and more sophisticated experiments. In the second part, we cover methods to estimate causal quantities of interest and construct appropriate confidence intervals, with particular attention to scalable methods suitable for “big data,” including working with weighted data and clustered bootstrapping.
2. WHO AM I?
• Core Data Science Team at Facebook
• PhD from NYU in Information Systems
• Four academic papers employing online field experiments
• Teach and consult on experimental design at
Facebook
http://seanjtaylor.com
http://github.com/seanjtaylor
http://facebook.com/seanjtaylor
@seanjtaylor
3. I ASSUME YOU KNOW
• Why causality matters
• A little bit of Python and R
• Basic statistics + linear regression
4. SIMPLEST POSSIBLE
EXPERIMENT
user_id  version  spent
123      B        $10
596      A        $0
456      A        $4
991      B        $9
def get_version(user_id):
    # Even user_ids get 'A', odd user_ids get 'B' (matches the table above).
    if user_id % 2:
        return 'B'
    else:
        return 'A'
> t.test(c(0, 4), c(10, 9))

        Welch Two Sample t-test

data:  c(0, 4) and c(10, 9)
t = -3.638, df = 1.1245, p-value = 0.1487
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -27.74338  12.74338
sample estimates:
mean of x mean of y
      2.0       9.5
6. COMMON PROBLEMS
• Type I errors from measuring too many effects
• Type II and M errors from lack of power
• Repeated use of the same population (“pollution”)
• Type I errors from violation of the i.i.d. assumption
• Composing many changes into one experiment
7. POWER
OR: THE ONLY WAY TO TRULY FAIL AT AN EXPERIMENT
OR: THE SIZE OF YOUR CONFIDENCE INTERVALS
8. ERRORS
• Type I: Thinking your metric changed when it didn’t. We usually bound this at 1 or 5%.
• Type II: Thinking your metric didn’t change when it did. You can control this through better planning.
9. HOW TO MAKE TYPE I ERRORS
[Figure: a grid of metrics ($ Spent, Time Spent, Survey Satisfaction) crossed with subgroups (Male <25, Female <25, Male >=25, Female >=25)]
• Measure a ton of metrics.
• Find a subgroup it works on.
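The danger of scanning many metrics or subgroups can be seen in a quick simulation. This is a sketch using a rough two-sample z-test; the 20 "metrics" are independent draws with no true effect, and all names here are made up for illustration:

```python
import random

random.seed(0)

def looks_significant(n=100):
    """Compare two groups drawn from the SAME distribution (true effect = 0)
    with a rough two-sample z-test at the 5% level."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mean_a) ** 2 for v in a) / (n - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (n - 1)
    z = (mean_a - mean_b) / (var_a / n + var_b / n) ** 0.5
    return abs(z) > 1.96

# Chance of at least one "significant" result across 20 null metrics:
trials = 200
hits = sum(any(looks_significant() for _ in range(20)) for _ in range(trials))
print(hits / trials)  # roughly 1 - 0.95**20, i.e. around 0.64
```

Each individual test has a 5% false-positive rate, but scanning 20 metrics (or subgroups) makes a spurious "win" more likely than not.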
10. AVOID TYPE II ERRORS WITH
POWER
1. Use enough subjects in your experiment.
2. Test a reasonably strong treatment.
Remember: you care about the difference.
11. POWER ANALYSIS
The first step in designing an experiment is to determine how much data you’ll need to learn the answer to your question.
Process:
• Set the smallest effect size you’d like to detect.
• Simulate your experiment 200 times at various sample sizes.
• Count the number of simulated experiments where you correctly reject the null of effect = 0.
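The process above can be sketched with a small stdlib-only simulation. The smallest-interesting effect (0.2 standard deviations), the sample sizes, and the rough z-test are all illustrative assumptions:

```python
import random

random.seed(1)

def rejects_null(n, effect=0.2):
    """One simulated experiment: n control and n treated subjects,
    true lift = `effect` standard deviations; two-sample z-test."""
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(effect, 1) for _ in range(n)]
    mean_c, mean_t = sum(control) / n, sum(treated) / n
    var_c = sum((v - mean_c) ** 2 for v in control) / (n - 1)
    var_t = sum((v - mean_t) ** 2 for v in treated) / (n - 1)
    z = (mean_t - mean_c) / (var_c / n + var_t / n) ** 0.5
    return abs(z) > 1.96

# Power = fraction of 200 simulated experiments that reject the null.
power = {}
for n in (50, 200, 800):
    power[n] = sum(rejects_null(n) for _ in range(200)) / 200
    print(n, power[n])
```

Pick the smallest n whose estimated power clears your target (commonly 80%).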
12. TYPE M ERRORS
• Magnitude error: reporting an effect size that is too large.
• Happens when your experiment is underpowered AND you only report the significant results.
14. PLANOUT: KEY IDEAS
• An experiment is just a pseudo-random mapping from (user, context) → parameters, and is serializable.
• Persistent randomizations are implemented through hash functions; salts make experiments orthogonal.
• Always log exposures (parameter assignments) to improve precision and provide a randomization check.
• Namespaces make it possible to run sequential experiments on new blocks of users.
https://facebook.github.io/planout/
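The hash-and-salt idea can be illustrated in a few lines. This is only a sketch of the concept, not PlanOut's actual hashing scheme, and `assign` is a hypothetical helper:

```python
import hashlib

def assign(unit, salt, choices):
    # Hash (salt, unit) and index into choices: deterministic, so the same
    # unit always maps to the same arm, yet roughly uniform across units.
    digest = hashlib.sha1(f"{salt}.{unit}".encode()).hexdigest()
    return choices[int(digest, 16) % len(choices)]

# Same unit + same salt -> same assignment, every time:
print(assign(212, "button_text", ["Buy now!", "Buy later!"]))
# A different salt gives an (approximately) independent assignment,
# which is what makes experiments orthogonal:
print(assign(212, "button_color", ["blue", "orange"]))
```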
15. A/B TESTING IN PLANOUT
from planout.ops.random import *
from planout.experiment import SimpleExperiment

class ButtonCopyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # `params` is always the first argument.
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )

# Later in your production code:
from myexperiments import ButtonCopyExperiment

e = ButtonCopyExperiment(user_id=212)
print(e.get('button_text'))

# Even later:
e = ButtonCopyExperiment(user_id=212)
e.log_event('purchase', {'amount': 9.43})
17. ADVANCED DESIGN 1:
FACTORIAL DESIGN
• Can use conditional logic as well as other random assignment operators: RandomInteger, RandomFloat, WeightedChoice, Sample.
class FactorialExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )
        params.button_color = UniformChoice(
            choices=["blue", "orange"],
            unit=user_id
        )
18. ADVANCED DESIGN 2:
INCREMENTAL CHANGES
## We're going to try two different button redesigns.
class FirstExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # ... set some params

class SecondExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # ... set some params differently

class ButtonNamespace(SimpleNamespace):
    def setup(self):
        self.name = 'button_experiment_sequence'
        self.primary_unit = 'user_id'
        self.num_segments = 1000

    def setup_experiments(self):
        # Allocate and deallocate experiments here.
        # First gets 100 out of 1000 segments.
        self.add_experiment('first', FirstExperiment, 100)
        self.add_experiment('second', SecondExperiment, 100)
19. ADVANCED DESIGN 3:
WITHIN-SUBJECTS
Previous experiments persistently assigned the same treatment to a user, but the unit of analysis can be more complex:

class DiscountExperiment(SimpleExperiment):
    def assign(self, params, user_id, item_id):
        params.discount = BernoulliTrial(p=0.1, unit=[user_id, item_id])
        if params.discount:
            params.discount_amount = RandomInteger(
                min=5,
                max=15,
                unit=user_id
            )
        else:
            params.discount_amount = 0

e = DiscountExperiment(user_id=212, item_id=2)
print(e.get('discount_amount'))
21. THE IDEAL DATA SET
Subject/User  Gender  Age  Button Size  Button Text  Spent  Bounce
Erin          F       22   Large        Buy Now!     $20    0
Ashley        F       29   Large        Buy Later!   $4     0
Gary          M       34   Small        Buy Now!     $0     1
Leo           M       18   Large        Buy Now!     $0     1
Ed            M       46   Small        Buy Later!   $9     0
Sam           M       25   Small        Buy Now!     $5     0

Each row is an independent observation; Gender and Age are pre-experiment covariates; Button Size and Button Text are randomly assigned; Spent and Bounce are metrics.
22. SIMPLEST CASE: OLS
> summary(lm(spent ~ button.size, data = df))

Call:
lm(formula = spent ~ button.size, data = df)

Residuals:
    1     2     3     4     5     6
 10.0  -0.5  -4.5 -10.0   4.5   0.5

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            10.000      5.489   1.822    0.143
factor(button.size)s   -5.500      6.722  -0.818    0.459

Residual standard error: 7.762 on 4 degrees of freedom
Multiple R-squared: 0.1434,  Adjusted R-squared: -0.07079
F-statistic: 0.6694 on 1 and 4 DF,  p-value: 0.4592
23. DATA REDUCTION
Subject  Xi  Di  Yi
Evan     M   0   1
Ashley   F   0   1
Greg     M   1   0
Leena    F   1   0
Ema      F   0   0
Seamus   M   1   1

reduces to counts of cases:

X  D  Y  Cases
M  0  1  1
M  1  1  1
F  0  1  1
F  1  1  0
M  0  0  0
M  1  0  1
F  0  0  1
F  1  0  1

N rows shrink to (# treatments) × (# groups) × (# outcomes) cells.
25. FACTORIAL DESIGNS
• Identify two types of effects: marginal effects and interactions. Need to fix one group as the baseline.

> coeftest(lm(spent ~ button.size * button.text, data = df))

t test of coefficients:

                           Estimate Std. Error t value  Pr(>|t|)
(Intercept)                 6.79643    0.62998 10.7884 < 2.2e-16 ***
button.sizes               -2.43253    0.86673 -2.8066  0.006064 **
button.textn                2.11611    0.86673  2.4415  0.016458 *
button.sizes:button.textn  -2.57660    1.27584 -2.0195  0.046219 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
26. USING COVARIATES TO GAIN
PRECISION
• With simple random assignment, using covariates is not necessary.
• However, you can improve the precision of ATE estimates if covariates explain a lot of variation in the potential outcomes.
• Covariates can be added to a linear model; the SEs should get smaller if they are helpful.
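A small stdlib simulation illustrates the precision gain. The data-generating process is made up (the covariate x is assumed to drive most of the outcome), and residualizing y on x is used as a stand-in for adding x to the regression:

```python
import random, statistics

random.seed(2)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]      # pre-experiment covariate
d = [random.random() < 0.5 for _ in range(n)]   # random assignment
y = [2.0 * xi + 0.3 * di + random.gauss(0, 0.5) # x explains most variation
     for xi, di in zip(x, d)]

def ate_se(outcome):
    """Standard error of the difference-in-means ATE estimate."""
    t = [o for o, di in zip(outcome, d) if di]
    c = [o for o, di in zip(outcome, d) if not di]
    return (statistics.variance(t) / len(t) + statistics.variance(c) / len(c)) ** 0.5

# Residualize y on x (equivalent in spirit to adding x to the linear model):
mx, my = statistics.mean(x), statistics.mean(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
resid = [yi - slope * xi for xi, yi in zip(x, y)]

print(ate_se(y), ate_se(resid))  # the adjusted SE is several times smaller
```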
27. NON-IID DATA
• Repeated observations of the same user are not independent.
• Ditto if you’re experimenting on certain items only.
• If you ignore dependent data, the true confidence intervals are larger than you think.

Subject/User  Item   Button Size  Spent
Erin          Shirt  Large        $20
Erin          Socks  Large        $4
Erin          Pants  Large        $0
Leo           Shirt  Large        $0
Ed            Shirt  Small        $9
Ed            Socks  Small        $5
28. THE BOOTSTRAP
[Diagram: from all your data, generate random sub-samples R1, R2, …, R500; compute statistics or estimate model parameters s1, s2, …, s500; this gives a distribution over the statistic of interest (e.g. the treatment effect), shown as a histogram of Statistic vs. Count.]
• Take the mean as the point estimate.
• CIs = the 2.5% and 97.5% quantiles.
• SEs = the standard deviation.
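A minimal sketch of the procedure in stdlib Python, bootstrapping a difference-in-means treatment effect. The data here are toy numbers (true effect of 1.0, assumed for illustration):

```python
import random

random.seed(3)
# Toy i.i.d. data: (treated?, spend); true treatment effect is 1.0.
data = ([(1, random.gauss(10, 3)) for _ in range(300)] +
        [(0, random.gauss(9, 3)) for _ in range(300)])

def ate(sample):
    """Difference in mean spend between treated and control rows."""
    t = [y for d, y in sample if d]
    c = [y for d, y in sample if not d]
    return sum(t) / len(t) - sum(c) / len(c)

# 500 resamples with replacement, each the size of the original data.
reps = sorted(ate(random.choices(data, k=len(data))) for _ in range(500))

estimate = sum(reps) / len(reps)   # take the mean
ci = (reps[12], reps[487])         # ~2.5% and 97.5% quantiles
se = (sum((r - estimate) ** 2 for r in reps) / (len(reps) - 1)) ** 0.5
print(estimate, ci, se)
```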
31. DATA REDUCTION WITH
DEPENDENT DATA
Subject  Di  Yij
Evan     1   1
Evan     1   0
Ashley   0   1
Ashley   0   1
Ashley   0   1
Greg     1   0
Leena    1   0
Leena    1   1
Ema      0   0
Seamus   1   1

[Diagram: create bootstrap replicates R1, R2, R3 by resampling subjects; reduce each replicate (r1, r2, r3) as if its rows were i.i.d.; compute statistics (s1, s2, s3) on the reduced data.]
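A sketch of the clustered version: resample whole users, with all of their rows, rather than individual rows, so within-user dependence is preserved in every replicate. The data are made up, and treatment is assigned at the user level:

```python
import random

random.seed(4)
# Each user contributes 1-5 dependent rows of (treated?, spend).
rows_by_user = {}
for uid in range(100):
    d = uid % 2  # treatment assigned at the user level
    rows_by_user[uid] = [(d, random.gauss(10 + d, 3))
                         for _ in range(random.randint(1, 5))]
users = list(rows_by_user)

def ate(rows):
    """Difference in mean spend between treated and control rows."""
    t = [y for d, y in rows if d]
    c = [y for d, y in rows if not d]
    return sum(t) / len(t) - sum(c) / len(c)

# Resample users with replacement; keep every row of each sampled user.
reps = sorted(
    ate([row for u in random.choices(users, k=len(users))
         for row in rows_by_user[u]])
    for _ in range(500)
)
print(reps[12], reps[487])  # clustered ~95% CI for the treatment effect
```

Resampling rows independently here would understate the uncertainty; resampling clusters keeps the confidence interval honest.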
32. THANKS!
HERE ARE SOME RESOURCES:
• Me: http://seanjtaylor.com
• These slides: http://www.slideshare.net/seanjtaylor/implementing-and-analyzing-online-experiments
• Full Featured Tutorial:
http://eytan.github.io/www-15-tutorial/
• “Field Experiments” by Gerber and Green