Machine learning with a data-unfriendly stack
Michael Manapat
@mlmanapat
Full stack for online commerce
Fraud
(1) Fraudster buys card "dumps"
(2) Fraudster makes purchases with cards
(3) Fraudster sells item and pockets cash
(4) Cardholder disputes charge
Fraud detection with ML
Static features
country of card-issuing bank
anonymous IP
charge amount
Aggregate features
number of charges from card in past hour
Binary classifier
Will the charge get disputed?
If we think so, then block the charge
Otherwise, allow it
Charge/Object Stream
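
The features and decision rule on this slide can be sketched in a few lines of Python. This is only an illustration, not Stripe's implementation: the `Charge` fields, the `HourlyChargeCounter` class, and the 0.5 threshold are assumptions made for the example, and the counter assumes charges arrive in timestamp order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

WINDOW_SECONDS = 3600  # "past hour" from the slide


@dataclass(frozen=True)
class Charge:  # hypothetical record shape
    card_id: str
    amount_cents: int
    issuer_country: str
    anonymous_ip: bool
    timestamp: float  # seconds since epoch


class HourlyChargeCounter:
    """Counts earlier charges per card over a sliding one-hour window."""

    def __init__(self):
        self._seen = defaultdict(deque)  # card_id -> timestamps, oldest first

    def count_and_record(self, charge):
        times = self._seen[charge.card_id]
        # Drop timestamps that have fallen out of the window.
        while times and times[0] <= charge.timestamp - WINDOW_SECONDS:
            times.popleft()
        prior = len(times)  # charges seen before this one
        times.append(charge.timestamp)
        return prior


def features(charge, counter):
    """Static features plus one aggregate feature for a single charge."""
    return {
        "issuer_country": charge.issuer_country,
        "anonymous_ip": charge.anonymous_ip,
        "amount_cents": charge.amount_cents,
        "charges_from_card_past_hour": counter.count_and_record(charge),
    }


def decide(dispute_probability, threshold=0.5):
    """Block when the classifier thinks a dispute is likely (threshold is an assumption)."""
    return "block" if dispute_probability >= threshold else "allow"
```

The classifier itself is left out; `decide` only shows how a predicted dispute probability turns into a block or allow decision.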
First Iteration
Mongo
oplog
Postgres
Originally for analytics
Both raw and denormalized data
copy data to 'data.csv' csv;
Python / Hadoop
Produce aggregate features with Scalding
Static features either read from "denormalized" tables or computed in Python
Ruby scoring service
Ruby charge service
Redis
Production Data/Training
First Iteration
Benefits
• ...?
Issues
• Janky
• Limited in features we could express (Redis as prod store)
• Features defined in three places
• Data is mutable
            Training                                    Production
Static      Postgres and Python                         Ruby (directly from Mongo objects)
Aggregate   Scalding / Hadoop (raw data in Postgres)    Ruby / Redis
Second Iteration
Mongo HDFS
Snapshotted and denormalized records available in Parquet
Recomputed nightly
Snapshots
Scalding
All feature generation in Scalding (both static and dynamic)
Ruby scoring service
Parse ODMs to create Thrift automatically
Ruby charge service
Redis
Production Data/Training
Second Iteration
Benefits
• Less manual work
• Features defined in just two places (Scalding and Ruby)
Issues
• Features defined twice
• Still limited in what features we can express
• Data is mutable
• Entire history of Stripe is ETLed into Parquet nightly
            Training                            Production
Static      Scalding (from Parquet on HDFS)     Ruby (directly from Mongo objects)
Aggregate   Scalding (from Parquet on HDFS)     Ruby / Redis
Third Iteration (Today)
Ruby charge service
HDFS
Stream of immutable events (static features)
Online aggregation service (lambda architecture)
Training data generation
Ruby scoring service
Production Data/Training
Scala aggregation job
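
One way to picture training data generation from a stream of immutable events is to replay the events in time order, so that each charge only sees what was known when it happened. The sketch below assumes hypothetical `ChargeEvent` and `DisputeEvent` shapes and a made-up feature (`prior_charges_on_card`); none of these names come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChargeEvent:  # hypothetical event shape
    charge_id: str
    card_id: str
    amount_cents: int
    created_at: float


@dataclass(frozen=True)
class DisputeEvent:  # hypothetical event shape
    charge_id: str
    created_at: float


def training_rows(events):
    """Replay events in time order and emit one labeled row per charge.

    Features use only events that precede the charge; the label comes
    from any dispute that arrives later in the log.
    """
    ordered = sorted(events, key=lambda e: e.created_at)
    disputed = {e.charge_id for e in ordered if isinstance(e, DisputeEvent)}
    prior_charges = {}  # card_id -> number of charges replayed so far
    rows = []
    for event in ordered:
        if not isinstance(event, ChargeEvent):
            continue
        seen = prior_charges.get(event.card_id, 0)
        rows.append({
            "charge_id": event.charge_id,
            "amount_cents": event.amount_cents,
            "prior_charges_on_card": seen,
            "disputed": event.charge_id in disputed,
        })
        prior_charges[event.card_id] = seen + 1
    return rows
```

Because events are never updated in place, replaying the same log always yields the same rows, which is harder to guarantee when training reads from mutable records.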
Third Iteration
Benefits
• Aggregate features are defined just once (in Scala)
• More expressive features (aggregates from monoids and semigroups)
• Data is immutable
Issues
• Event generation logic written in Ruby; if new fields are added to event (new static features), need to run a backfill
            Training                              Production
Static      Events on HDFS (generated in Ruby)    Events (generated in Ruby)
Aggregate   Aggregator on events / Summingbird    Aggregator on events / Summingbird
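
The point about aggregates built from monoids and semigroups can be illustrated with a small Python stand-in. The talk's stack uses Scala and Summingbird; the `Monoid` class, the three example aggregates, and `merge_views` below are hypothetical names for this sketch and are not the Summingbird or Algebird API. The key property is associativity: a partial aggregate computed in batch and one computed online can be combined with the same `plus` and give the same answer as aggregating everything at once.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable


@dataclass(frozen=True)
class Monoid:
    """An identity value plus an associative combine function."""
    zero: Any
    plus: Callable[[Any, Any], Any]


def _max_or_none(a, b):
    # max is only a semigroup (no natural identity); None stands in for "empty".
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


COUNT = Monoid(0, lambda a, b: a + b)                # number of charges
MAX_AMOUNT = Monoid(None, _max_or_none)              # largest charge amount
COUNTRIES = Monoid(frozenset(), lambda a, b: a | b)  # distinct countries seen


def fold(monoid, values):
    """Aggregate a sequence of values with the given monoid."""
    return reduce(monoid.plus, values, monoid.zero)


def merge_views(monoid, batch_view, realtime_view):
    """Combine a batch partial aggregate with an online one."""
    return monoid.plus(batch_view, realtime_view)
```

For example, `merge_views(MAX_AMOUNT, fold(MAX_AMOUNT, [500, 1200]), fold(MAX_AMOUNT, [4500]))` equals `fold(MAX_AMOUNT, [500, 1200, 4500])`, which is what lets one definition serve both training and production.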
Observations
Divergence between production and data environments leads to errors and maintenance burden
Feature generation for training and in production should share the same code
Better for both modeling and infrastructure to generate immutable events in production
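
As a rough illustration of the second observation, a single feature function can feed both the training set and the production scorer, so the two cannot drift apart. Everything here is assumed for the example: the event is a plain dict with made-up keys, and `model` is any callable that maps a feature dict to a dispute probability.

```python
def static_features(event):
    """The one definition of static features, used by both paths below."""
    return {
        "issuer_country": event["issuer_country"],
        "anonymous_ip": int(event["anonymous_ip"]),
        "amount_cents": event["amount_cents"],
    }


def training_examples(events, disputed_charge_ids):
    """Offline path: (features, label) pairs built from logged events."""
    return [
        (static_features(e), e["charge_id"] in disputed_charge_ids)
        for e in events
    ]


def score(event, model):
    """Online path: the same features, passed to a model callable."""
    return model(static_features(event))
```

With this shape, adding a feature means changing `static_features` once rather than editing separate training and production definitions.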
Thanks
mlm@stripe.com | @mlmanapat
Contributors: Alyssa Frazee, Avi Bryant, Ben Kirwin, Colin Marc, Dan Frank, Danielle Sucher, Franklin Hu, Jason Pellerin, Jeff Balogh, Jeff Sarnat, Jocelyn Ross, Julia Evans, Kelley Rivoire, Roban Kramer, Ryan Wang, Thomas Switzer
