Presenter: Michael Manapat, Stripe
Stripe processes billions of dollars in payments a year on behalf of tens of thousands of businesses, using machine learning to detect and stop fraudulent transactions and fraudulent merchants. Our modeling workflow involves the typical “data science” tools: R and IPython for exploratory analysis, Hadoop for batch data processing, and scikit-learn for model building. However, Stripe’s production backend is written in Ruby and uses MongoDB as its data store, and this has introduced difficulties for both model training and production scoring. In this talk, I’ll describe the various choices we’ve made to bridge “main land” and “data land” and how, in the process, our model development process has gone from terrible to “ok.”
3. Fraud
(1) Fraudster buys card "dumps"
(2) Fraudster makes purchases with cards
(3) Fraudster sells item and pockets cash
(4) Cardholder disputes charge
4. Fraud detection with ML
Static features
• country of card-issuing bank
• anonymous IP
• charge amount
Aggregate features
• number of charges from card in past hour
Binary classifier: will the charge get disputed?
• If we think so, block the charge
• Otherwise, allow it
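The block/allow decision above can be sketched as a binary classifier over static and aggregate features. This is a toy illustration only: the feature set, training data, and 0.5 threshold are assumptions, not Stripe's actual model.

```python
# Toy sketch of the charge-scoring decision: a binary classifier over
# static and aggregate features, blocking when the predicted dispute
# probability crosses a threshold. All names and data are invented.
from sklearn.linear_model import LogisticRegression

# Rows: [charge_amount, is_anonymous_ip, charges_from_card_past_hour]
X = [
    [20.0, 0, 1],
    [15.0, 0, 2],
    [500.0, 1, 9],
    [750.0, 1, 12],
]
y = [0, 0, 1, 1]  # 1 = charge was later disputed

model = LogisticRegression().fit(X, y)

def decide(features, threshold=0.5):
    """Block the charge if the predicted dispute probability is high."""
    p_dispute = model.predict_proba([features])[0][1]
    return "block" if p_dispute >= threshold else "allow"

print(decide([600.0, 1, 10]))  # high-risk-looking charge
print(decide([25.0, 0, 1]))    # low-risk-looking charge
```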
5. First Iteration
Data/Training:
• Mongo → (oplog: charge/object stream) → Postgres: originally for analytics; both raw and denormalized data
• Export to CSV: copy data to 'data.csv' csv;
• Python / Hadoop: aggregate features produced with Scalding; static features either read from "denormalized" tables or computed in Python
Production:
• Ruby charge service → Ruby scoring service, backed by Redis
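In this iteration the production aggregate features lived in Ruby code writing to Redis. As a rough illustration of the kind of windowed aggregate involved ("number of charges from card in past hour"), here is a minimal in-memory sketch in Python; the class and field names are invented, not Stripe's code.

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Count events per key within a trailing time window (illustrative)."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps, oldest first

    def record(self, card_id, now):
        self.events[card_id].append(now)

    def count(self, card_id, now):
        q = self.events[card_id]
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)

counter = SlidingWindowCounter()
counter.record("card_123", now=0)
counter.record("card_123", now=100)
counter.record("card_123", now=4000)
print(counter.count("card_123", now=4100))  # → 1 (only the t=4000 charge is in the hour)
```

A production store like Redis would implement the same idea with sorted sets keyed by timestamp rather than in-process memory.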
6. First Iteration
Benefits
• ...?
Issues
• Janky
• Limited in features we could express (Redis as prod store)
• Features defined in three places
• Data is mutable
           Training                                   Production
Static     Postgres and Python                        Ruby (directly from Mongo objects)
Aggregate  Scalding / Hadoop (raw data in Postgres)   Ruby / Redis
7. Second Iteration
Data/Training:
• Mongo → HDFS: snapshotted and denormalized records available in Parquet, recomputed nightly
• All feature generation in Scalding (both static and dynamic)
• Parse ODMs to create Thrift automatically
Production:
• Ruby charge service → Ruby scoring service, backed by Redis
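In this iteration every feature comes out of one batch job over the nightly snapshots. Stripe's job is Scalding over Parquet; the sketch below mimics its shape in plain Python (records and field names are made up): static features map over each record, and aggregate features come from a group-by/reduce.

```python
from collections import defaultdict

# Stand-ins for denormalized snapshot records (illustrative data).
charges = [
    {"card": "c1", "amount": 20.0, "bank_country": "US"},
    {"card": "c1", "amount": 500.0, "bank_country": "US"},
    {"card": "c2", "amount": 15.0, "bank_country": "DE"},
]

def static_features(charge):
    # Per-record map step: features readable off a single record.
    return {"amount": charge["amount"], "bank_country": charge["bank_country"]}

def aggregate_features(records):
    # Group-by/reduce step: the shape of the Scalding aggregation.
    counts = defaultdict(int)
    for r in records:
        counts[r["card"]] += 1
    return counts

counts = aggregate_features(charges)
rows = [
    {**static_features(c), "card_charge_count": counts[c["card"]]}
    for c in charges
]
```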
8. Second Iteration
Benefits
• Less manual work
• Features defined in just two places (Scalding and Ruby)
Issues
• Features defined twice
• Still limited in what features we can express
• Data is mutable
• Entire history of Stripe is ETLed into Parquet nightly
           Training                          Production
Static     Scalding (from Parquet on HDFS)   Ruby (directly from Mongo objects)
Aggregate  Scalding (from Parquet on HDFS)   Ruby / Redis
9. Third Iteration (Today)
Data/Training:
• Ruby charge service emits a stream of immutable events (static features)
• Events land on HDFS; a Scala aggregation job and training data generation run over them
Production:
• Online aggregation service (lambda architecture) feeds the Ruby scoring service
10. Third Iteration
Benefits
• Aggregate features are defined just once (in Scala)
• More expressive features (aggregates from monoids and semigroups)
• Data is immutable
Issues
• Event generation logic is written in Ruby: if new fields are added to an event (new static features), a backfill needs to be run
           Training                              Production
Static     Events on HDFS (generated in Ruby)    Events (generated in Ruby)
Aggregate  Aggregator on events / Summingbird    Aggregator on events / Summingbird
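The "aggregates from monoids and semigroups" point is what makes the lambda architecture above work: partial aggregates from the batch layer and the online layer can be merged associatively into one result. A minimal Python sketch; the (count, sum, max) aggregate is illustrative, not Stripe's actual feature set.

```python
# A monoid-shaped aggregate: an identity element plus an associative merge.
EMPTY = (0, 0.0, float("-inf"))  # (count, sum, max)

def merge(a, b):
    """Associative merge of two (count, sum, max) partial aggregates."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]))

def aggregate(amounts):
    out = EMPTY
    for x in amounts:
        out = merge(out, (1, x, x))  # lift each value into the monoid
    return out

batch = aggregate([20.0, 500.0])  # e.g. computed offline on HDFS
online = aggregate([750.0])       # e.g. computed by the streaming layer
combined = merge(batch, online)   # layers combine without recomputation
print(combined)  # → (3, 1270.0, 750.0)
```

Because merge is associative (and EMPTY is an identity), partial results can be computed anywhere, in any grouping, and still combine to the same answer; this is the property Summingbird-style systems rely on.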
11. Observations
Divergence between the production and data environments leads to errors and maintenance burden.
Feature generation for training and for production should share the same code.
It is better, for both modeling and infrastructure, to generate immutable events in production.
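The second observation, sharing feature-generation code between training and production, can be as simple as routing both paths through one function. A hypothetical sketch (all names invented):

```python
def features(charge, hourly_count):
    """Single feature definition used by BOTH training and scoring."""
    return [charge["amount"], 1 if charge["anonymous_ip"] else 0, hourly_count]

# Training path: applied to historical events plus precomputed aggregates.
training_rows = [
    features(c, n)
    for c, n in [
        ({"amount": 20.0, "anonymous_ip": False}, 1),
        ({"amount": 500.0, "anonymous_ip": True}, 9),
    ]
]

# Production path: the very same function applied to a live charge.
live_row = features({"amount": 750.0, "anonymous_ip": True}, 12)
```

With one definition, a feature cannot silently drift between the model that was trained and the model that is serving.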
12. Thanks
mlm@stripe.com | @mlmanapat
Contributors:
Alyssa Frazee, Avi Bryant, Ben Kirwin, Colin Marc, Dan Frank, Danielle
Sucher, Franklin Hu, Jason Pellerin, Jeff Balogh, Jeff Sarnat, Jocelyn
Ross, Julia Evans, Kelley Rivoire, Roban Kramer, Ryan Wang, Thomas
Switzer