Presenter: Michael Manapat, Stripe
Stripe processes billions of dollars in payments a year on behalf of tens of thousands of businesses, using machine learning to detect and stop fraudulent transactions and fraudulent merchants. Our modeling workflow involves the typical “data science” tools: R and IPython for exploratory analysis, Hadoop for batch data processing, and scikit-learn for model building. However, Stripe’s production backend is written in Ruby and uses MongoDB as its data store, and this has introduced difficulties for both model training and production scoring. In this talk, I’ll describe the various choices we’ve made to bridge “main land” and “data land” and how, in the process, our model development process has gone from terrible to “ok.”
3. Fraud
(1) Fraudster buys card "dumps"
(2) Fraudster makes purchases with cards
(3) Fraudster sells item and pockets cash
(4) Cardholder disputes charge
4. Fraud detection with ML
Static features
• country of card-issuing bank
• anonymous IP
• charge amount
Aggregate features
• number of charges from card in past hour
Binary classifier: will the charge get disputed?
• If we think so, block the charge
• Otherwise, allow it
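The block/allow decision above can be sketched as a binary classifier over static and aggregate features. This is a toy illustration only: the feature set, training data, and 0.5 threshold are assumptions, not Stripe's actual model.

```python
# Toy sketch of the charge-scoring decision: a binary classifier over
# static and aggregate features, blocking when the predicted dispute
# probability crosses a threshold. All names and data are invented.
from sklearn.linear_model import LogisticRegression

# Rows: [charge_amount, is_anonymous_ip, charges_from_card_past_hour]
X = [
    [20.0, 0, 1],
    [15.0, 0, 2],
    [500.0, 1, 9],
    [750.0, 1, 12],
]
y = [0, 0, 1, 1]  # 1 = charge was later disputed

model = LogisticRegression().fit(X, y)

def decide(features, threshold=0.5):
    """Block the charge if the predicted dispute probability is high."""
    p_dispute = model.predict_proba([features])[0][1]
    return "block" if p_dispute >= threshold else "allow"

print(decide([600.0, 1, 10]))  # high-risk-looking charge
print(decide([25.0, 0, 1]))    # low-risk-looking charge
```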
5. First Iteration
Data/Training:
• Mongo → (oplog: charge/object stream) → Postgres: originally for analytics; both raw and denormalized data
• Export to CSV: copy data to 'data.csv' csv;
• Python / Hadoop: aggregate features produced with Scalding; static features either read from "denormalized" tables or computed in Python
Production:
• Ruby charge service → Ruby scoring service, backed by Redis
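In this iteration the production aggregate features lived in Ruby code writing to Redis. As a rough illustration of the kind of windowed aggregate involved ("number of charges from card in past hour"), here is a minimal in-memory sketch in Python; the class and field names are invented, not Stripe's code.

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Count events per key within a trailing time window (illustrative)."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps, oldest first

    def record(self, card_id, now):
        self.events[card_id].append(now)

    def count(self, card_id, now):
        q = self.events[card_id]
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)

counter = SlidingWindowCounter()
counter.record("card_123", now=0)
counter.record("card_123", now=100)
counter.record("card_123", now=4000)
print(counter.count("card_123", now=4100))  # → 1 (only the t=4000 charge is in the hour)
```

A production store like Redis would implement the same idea with sorted sets keyed by timestamp rather than in-process memory.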
6. First Iteration
Benefits
• ...?
Issues
• Janky
• Limited in features we could express (Redis as prod store)
• Features defined in three places
• Data is mutable
           Training                                   Production
Static     Postgres and Python                        Ruby (directly from Mongo objects)
Aggregate  Scalding / Hadoop (raw data in Postgres)   Ruby / Redis
7. Second Iteration
Data/Training:
• Mongo → HDFS: snapshotted and denormalized records available in Parquet, recomputed nightly
• All feature generation in Scalding (both static and dynamic)
• Parse ODMs to create Thrift automatically
Production:
• Ruby charge service → Ruby scoring service, backed by Redis
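In this iteration every feature comes out of one batch job over the nightly snapshots. Stripe's job is Scalding over Parquet; the sketch below mimics its shape in plain Python (records and field names are made up): static features map over each record, and aggregate features come from a group-by/reduce.

```python
from collections import defaultdict

# Stand-ins for denormalized snapshot records (illustrative data).
charges = [
    {"card": "c1", "amount": 20.0, "bank_country": "US"},
    {"card": "c1", "amount": 500.0, "bank_country": "US"},
    {"card": "c2", "amount": 15.0, "bank_country": "DE"},
]

def static_features(charge):
    # Per-record map step: features readable off a single record.
    return {"amount": charge["amount"], "bank_country": charge["bank_country"]}

def aggregate_features(records):
    # Group-by/reduce step: the shape of the Scalding aggregation.
    counts = defaultdict(int)
    for r in records:
        counts[r["card"]] += 1
    return counts

counts = aggregate_features(charges)
rows = [
    {**static_features(c), "card_charge_count": counts[c["card"]]}
    for c in charges
]
```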
8. Second Iteration
Benefits
• Less manual work
• Features defined in just two places (Scalding and Ruby)
Issues
• Features defined twice
• Still limited in what features we can express
• Data is mutable
• Entire history of Stripe is ETLed into Parquet nightly
           Training                          Production
Static     Scalding (from Parquet on HDFS)   Ruby (directly from Mongo objects)
Aggregate  Scalding (from Parquet on HDFS)   Ruby / Redis
9. Third Iteration (Today)
Data/Training:
• Ruby charge service emits a stream of immutable events (static features)
• Events land on HDFS; a Scala aggregation job and training data generation run over them
Production:
• Online aggregation service (lambda architecture) feeds the Ruby scoring service
10. Third Iteration
Benefits
• Aggregate features are defined just once (in Scala)
• More expressive features (aggregates from monoids and semigroups)
• Data is immutable
Issues
• Event generation logic is written in Ruby: if new fields are added to an event (new static features), a backfill needs to be run
           Training                              Production
Static     Events on HDFS (generated in Ruby)    Events (generated in Ruby)
Aggregate  Aggregator on events / Summingbird    Aggregator on events / Summingbird
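The "aggregates from monoids and semigroups" point is what makes the lambda architecture above work: partial aggregates from the batch layer and the online layer can be merged associatively into one result. A minimal Python sketch; the (count, sum, max) aggregate is illustrative, not Stripe's actual feature set.

```python
# A monoid-shaped aggregate: an identity element plus an associative merge.
EMPTY = (0, 0.0, float("-inf"))  # (count, sum, max)

def merge(a, b):
    """Associative merge of two (count, sum, max) partial aggregates."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]))

def aggregate(amounts):
    out = EMPTY
    for x in amounts:
        out = merge(out, (1, x, x))  # lift each value into the monoid
    return out

batch = aggregate([20.0, 500.0])  # e.g. computed offline on HDFS
online = aggregate([750.0])       # e.g. computed by the streaming layer
combined = merge(batch, online)   # layers combine without recomputation
print(combined)  # → (3, 1270.0, 750.0)
```

Because merge is associative (and EMPTY is an identity), partial results can be computed anywhere, in any grouping, and still combine to the same answer; this is the property Summingbird-style systems rely on.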
11. Observations
Divergence between the production and data environments leads to errors and maintenance burden.
Feature generation for training and for production should share the same code.
It is better, for both modeling and infrastructure, to generate immutable events in production.
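The second observation, sharing feature-generation code between training and production, can be as simple as routing both paths through one function. A hypothetical sketch (all names invented):

```python
def features(charge, hourly_count):
    """Single feature definition used by BOTH training and scoring."""
    return [charge["amount"], 1 if charge["anonymous_ip"] else 0, hourly_count]

# Training path: applied to historical events plus precomputed aggregates.
training_rows = [
    features(c, n)
    for c, n in [
        ({"amount": 20.0, "anonymous_ip": False}, 1),
        ({"amount": 500.0, "anonymous_ip": True}, 9),
    ]
]

# Production path: the very same function applied to a live charge.
live_row = features({"amount": 750.0, "anonymous_ip": True}, 12)
```

With one definition, a feature cannot silently drift between the model that was trained and the model that is serving.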
12. Thanks
mlm@stripe.com | @mlmanapat
Contributors:
Alyssa Frazee, Avi Bryant, Ben Kirwin, Colin Marc, Dan Frank, Danielle
Sucher, Franklin Hu, Jason Pellerin, Jeff Balogh, Jeff Sarnat, Jocelyn
Ross, Julia Evans, Kelley Rivoire, Roban Kramer, Ryan Wang, Thomas
Switzer