This document summarizes Stripe's approach to fraud detection using machine learning. Stripe engineers features that exploit the "pseudorandom" behavior of fraudsters, trains customized random forests to detect fraudulent transactions, and evaluates models with counterfactual offline evaluation, in which randomizing a small fraction of model decisions allows estimating a candidate model's performance without live experiments. This approach helps Stripe balance fraud prevention with merchant experience.
2. About Stripe
- Feature generation: fraudsters are “pseudorandom”
- Model training: “customized” random forests
- Model evaluation: counterfactual offline evaluation
3. What is Stripe?
“Full stack” for e-commerce:
- credit cards
- Checkout (APMs: Alipay, Bitcoin)
- fraud (beta)
- etc.
Fraud comes in two kinds: merchant fraud and transaction fraud
6. What features detect this kind of regularity?
Distribution of letter/digit/period/domain frequencies + measures of distributional difference
Log-likelihood ratio: good at low counts

             Digit     No digit
Sample (p)        9           1
Overall (q) 200,000     200,000

Difference in log-likelihood from a single model for the matrix vs. a model for each row
7. Example 2: Distribution of user agents
- Transform so that it’s less “conditional”
- Get rid of the distribution entirely: (# distinct user agents) / (# distinct IPs)
@jvns @kelleyrivoire
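A minimal sketch of the second feature, assuming events arrive as (user_agent, ip) pairs for a single card or account (the function name and input shape are illustrative):

```python
def ua_ip_ratio(events):
    """(# distinct user agents) / (# distinct IPs) seen for one entity.
    A fraudster cycling fabricated user agents across a handful of IPs
    pushes this ratio up, without needing the full distribution."""
    user_agents = {ua for ua, _ in events}
    ips = {ip for _, ip in events}
    return len(user_agents) / len(ips) if ips else 0.0
```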
9. At each node, pick a feature X and a value v
Splitting on X < v should minimize I(L) + I(R) - I(D)
I: “Impurity”
“PLANET”
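A toy version of the per-node split search, assuming I is the size-weighted Gini impurity (I(D) is fixed at a given node, so minimizing I(L) + I(R) - I(D) reduces to minimizing I(L) + I(R)):

```python
def gini(labels):
    """Gini impurity of a collection of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def impurity(labels):
    """Size-weighted impurity I(S) = |S| * gini(S)."""
    return len(labels) * gini(labels)

def best_split(xs, ys):
    """Scan candidate thresholds v for one feature X and return the v
    minimizing I(L) + I(R), where L = {y : x < v} and R = {y : x >= v}."""
    best_v, best_score = None, float("inf")
    for v in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < v]
        right = [y for x, y in zip(xs, ys) if x >= v]
        score = impurity(left) + impurity(right)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score
```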
10. Trained trees in Python with scikit, but...
- Our ETL pipeline runs on Hadoop and writes Parquet to HDFS
- Treatment of categorical variables is suboptimal (“x[1] <= 0.500”)
- No customization (impurity: “gini” or “entropy”)
11. “Brushfire” @avibryant @daniellesucher
- Implemented in Scala (Scalding)
- Distributed learning approach modeled on Google’s PLANET paper
- Native support for ordered/ordinal/categorical vars
- Highly customizable/modular (e.g., splitting function)
12. Customization
We don’t necessarily want to maximize impurity drop with each split
X: 1  2  3  4
Y: 0 10 80 95
We have a “split budget” (after enough splits/tree levels we’ll run out of data)
13. We want to choose splits so we improve the ROC curve in the region of interest (even at the expense of total AUC)
[ROC plot: we want improvement in the low-FPR region of interest; we don’t care about improvement elsewhere]
14. scikit (left) vs. brushfire (right)
Fixed FPR: +7 percentage points in recall in region of interest
15. Brushfire to be open-sourced in the next month (talk this weekend at PNW Scala)
16. Counterfactual offline evaluation
Li, Chen, Kleban, Gupta: “Counterfactual Estimation and Optimization of Click Metrics for Search Engines”
17. Every conversion results in some benefit b; every chargeback results in some cost c
Margin = 30%, product costs $10
- Conversion: $10 - $7 (CGS) = $3
- Chargeback: -$7 (CGS) - $15 (fee) = -$22
The relative sizes of b and c determine tolerance for false positives and false negatives.
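The slide's arithmetic, written out (the dollar figures are the slide's example; CGS is the cost of goods sold):

```python
MARGIN = 0.30
PRICE = 10.00
CGS = PRICE * (1 - MARGIN)       # $7 cost of goods sold
CHARGEBACK_FEE = 15.00

b = PRICE - CGS                  # benefit of a conversion: the $3 margin
c = CGS + CHARGEBACK_FEE         # cost of a chargeback: $7 + $15 = $22
```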
18. Train a model on charge history @ryw90
Historical total payoff: 3b – c

#  Outcome     Payoff
1  Conversion  b
2  Conversion  b
3  Chargeback  -c
4  Conversion  b
19. Evaluate it on charge history
Historical total payoff: 3b – c
Payoff with model: 2b

#  Outcome     Payoff      Class  New outcome       Payoff
1  Conversion  b           Good   Conversion (TN)   b
2  Conversion  b           Good   Conversion (TN)   b
3  Disputed    -c          Fraud  Blocked (TP)      0
4  Conversion  b           Fraud  Blocked (FP)      0

The model is an improvement when 2b > 3b – c, i.e., c – b > 0
20. Model evaluation possible because of charge log without interventions
Intervention better than no intervention if (odds of fraud) x (c/b) x (recall/fpr) > 1
What happens with the next model-building iteration?
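Plugging illustrative numbers into the inequality (the fraud rate and operating point below are assumptions, not figures from the talk; b and c are the earlier example's $3 and $22):

```python
odds_of_fraud = 0.01 / 0.99      # assumed 1% fraud rate
c_over_b = 22.0 / 3.0            # chargeback cost vs. conversion benefit
recall, fpr = 0.60, 0.02         # hypothetical model operating point

score = odds_of_fraud * c_over_b * (recall / fpr)
blocking_helps = score > 1       # intervening beats not intervening
```

Even with fraud at only 1% of charges, the large c/b ratio and a decent recall/FPR ratio push the product well above 1, so blocking is worthwhile.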
21. Where does the new training data come from?

#  Outcome     Payoff
1  Conversion  b
2  Conversion  b
3  Blocked     0
4  Blocked     0

New model: would a blocked charge have been a “good” conversion or a chargeback?
An A/B test would be complex/time-consuming
23.
#  Score  Original action  P(Block)  Randomized action  Outcome     Payoff
1   5     Allow            0.05      Allow              Conversion  b
2  20     Allow            0.10      Allow              Conversion  b
3  10     Allow            0.07      Block              N/A         0
4  50     Block            0.50      Allow              Chargeback  -c
5  65     Block            0.90      Allow              Conversion  b

Log scores/probabilities/actions
Evaluate performance of the model on events where original action == randomized action
24. ...but weight by inverse of expected probability

#  Score  Original action  P(Allow)  P(Block)  Randomized action  Outcome     Payoff
1   5     Allow            0.95      0.05      Allow              Conversion  b
2  20     Allow            0.90      0.10      Allow              Conversion  b

Average payoff: [(1/0.95)b + (1/0.9)b] / [(1/0.95) + (1/0.9)] = b

Intuition: if the action has probability p and we see it in the log, there were ~1/p total such events
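The weighting can be sketched as a small estimator; `matched` is a hypothetical list of (payoff, probability-of-the-logged-action) pairs for events where the evaluated model's action agrees with the randomized one:

```python
def ips_average_payoff(matched):
    """Inverse-propensity-weighted average payoff: each logged event
    stands in for ~1/p events like it, so weight its payoff by 1/p
    and normalize by the total weight."""
    weights = [1.0 / p for _, p in matched]
    total = sum(w * payoff for (payoff, _), w in zip(matched, weights))
    return total / sum(weights)
```

With b = 3, the slide's two rows give back exactly b; the next slide's three rows (payoffs b, -c, b at probabilities 0.9, 0.5, 0.1) give roughly 0.85b - 0.15c.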
25. Similarly for the candidate model...

#  Score  Old model  P(Allow)  P(Block)  Randomized action  Outcome     Payoff  New model
2  20     Allow      0.90      0.10      Allow              Conversion  b       Allow
4  50     Block      0.50      0.50      Allow              Chargeback  -c      Allow
5  65     Block      0.10      0.90      Allow              Conversion  b       Allow

Average payoff: [(1/0.9)b + (1/0.5)(-c) + (1/0.1)b] / [(1/0.9) + (1/0.5) + (1/0.1)] ≈ 0.85b - 0.15c
26. Compute the expected payoff offline (arbitrarily many “experiments”)
- Need more data as incumbent model/policy and candidate diverge
- Propensity function controls the “exploitation”–“exploration” tradeoff
- Keep merchant experience good (adds bias)
28. Fraudsters generate randomness in non-random ways (LLR good at low counts)
We can improve our random forest performance by biasing the training (get lift where you need it)
Randomizing actions in production makes counterfactual evaluation easier (and faster)
29. Thanks
mlm@stripe.com @mlmanapat
Machine learning at Stripe:
Avi Bryant @avibryant
Chris Wu @chriswu_
Dan Frank @danielhfrank
Danielle Sucher @daniellesucher
Julia Evans @jvns
Kelley Rivoire @kelleyrivoire
Ryan Wang @ryw90