Pachyderm: Reproducible and Compliant Data Science
Nick Harvey - Lead Developer Advocate
Pachyderm Inc.
Nick@pachyderm.com
@nicksharvey
As Data Scientists...
We
“Big” Data
Pachyderm.com
Production
ML/AI Model Training
Data Input
Transforms
Data Ingestion
Data Cleaning
Feature Engineering
Model Selection, Parameter Search
Feature Transforms
Production
Model Testing
Model Export & Optimization
ML/AI Model Inference or Prediction
Post Processing
To Reach Its Full Potential Machine Learning Needs
1. Data to have the same production practices as code
2. Empowered developers, not restricted
3. Organization-wide confidence
Obstacles that prevent Effective Data Science
Data Divergence
Data sets change constantly. Teams can’t make decisions from their data if they don’t know what version was used.
Tooling Constraints
Infra often restricts the tooling options available to data scientists.
Not Reproducible
Data teams can’t reproduce results because they can’t track every version of data and code throughout the system.
For data science to be successful, outputs need to be reproducible
Manage data with the same production practices as code
Developers need to be empowered with choice, not restricted
Version control for Data
Containerized data pipelines
Be able to instantly reconstruct any past output/decision
Data Lineage
General Fusion uses Pachyderm to Power Commercial Fusion Research
“The true tipping point in our decision to use Pachyderm was its version control features for managing our data.”
- Jonathan Fraser, Engineer at General Fusion
General Fusion collects large sets of complex data from thousands of sensors. Managing, scaling, and processing that data is a challenge.
Criteria
1. A data science platform that could scale and adapt with their growth.
2. Augment existing experimental and analysis workflows.
3. Seamless collaboration with external scientific partners.
Business Outcome
1. Data versioning - Pachyderm enables data science teams to develop reproducible and distributed data workflows without interfering with each other's analysis.
2. Data provenance - Every data transformation is tracked, allowing any result to be 100 percent reproducible and verifiable.
Pachyderm provides reproducibility through Data Versioning
Identify and revert “bad” data changes
Version model binaries and parameters along with the data used to train them
Reproduce specific processes using historical state(s) of data
Commit ID: a5bcc61...1812
Commit ID: 7afad96...680e
Commit ID: b85ea63...e4d4
Commit ID: 7585b4e...0cc5
Commit ID: af4cf48...8840
person.png
stopsign.png
road.png
boat.png
bike.png
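The idea of commit IDs attached to data can be illustrated with a small Python sketch. This is a toy, in-memory model written for this note, not Pachyderm's implementation or API: the class name ToyRepo, its methods, and the file contents are all hypothetical. It only shows how immutable snapshots let you read a historical state and revert a "bad" change.

```python
import hashlib
import json


class ToyRepo:
    """Toy in-memory versioned data repo (illustrative only, not Pachyderm's API)."""

    def __init__(self):
        self.commits = {}  # commit ID -> {"parent": ID or None, "files": {path: bytes}}
        self.head = None

    def _snapshot(self, files):
        # The commit ID is a hash over the parent ID and the hash of every file,
        # so any change to the data yields a different ID.
        manifest = {
            "parent": self.head,
            "files": {p: hashlib.sha256(d).hexdigest() for p, d in files.items()},
        }
        commit_id = hashlib.sha256(
            json.dumps(manifest, sort_keys=True).encode()
        ).hexdigest()
        self.commits[commit_id] = {"parent": self.head, "files": files}
        self.head = commit_id
        return commit_id

    def commit(self, changes):
        """Record a new snapshot; a value of None removes that path."""
        files = dict(self.commits[self.head]["files"]) if self.head else {}
        for path, data in changes.items():
            if data is None:
                files.pop(path, None)
            else:
                files[path] = data
        return self._snapshot(files)

    def read(self, commit_id, path):
        """Read a file exactly as it was at the given commit."""
        return self.commits[commit_id]["files"][path]

    def revert_to(self, commit_id):
        """Create a new commit whose contents match an earlier commit."""
        return self._snapshot(dict(self.commits[commit_id]["files"]))


repo = ToyRepo()
c1 = repo.commit({"person.png": b"good pixels"})
c2 = repo.commit({"person.png": b"corrupted"})  # a "bad" data change
c3 = repo.revert_to(c1)                          # history is kept, head is fixed
```

Because old snapshots are never overwritten, the bad commit c2 stays readable for auditing even after the revert.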
Pachyderm provides workflow management through Containerized Analyses
Use any languages and frameworks in pipelines
Port your workflows to any infrastructure
Easily transition from local dev to production deploy
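A containerized pipeline stage is usually described by a small JSON spec that names a container image, the command to run inside it, and the input data repo. The sketch below builds such a spec in Python. The repo name, image name, and script path are hypothetical, and the field layout (pipeline.name, transform.image, transform.cmd, input.pfs) is an assumption that should be checked against the pipeline specification for your Pachyderm version.

```python
import json


def pipeline_spec(name, image, cmd, input_repo, glob="/*"):
    """Build a pipeline spec dict; all argument values are supplied by the caller."""
    return {
        "pipeline": {"name": name},
        # Any language or framework works, as long as it is packaged in the image.
        "transform": {"image": image, "cmd": cmd},
        # The glob pattern controls how the input repo is split into units of work.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


# Hypothetical training stage that reads from a "training_data" repo.
spec = pipeline_spec(
    name="train",
    image="example/train:latest",
    cmd=["python3", "/code/train.py"],
    input_repo="training_data",
)
print(json.dumps(spec, indent=2))
```

Since the stage is defined by its image and command only, the same spec can in principle be used on a laptop cluster and in production.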
Pachyderm provides workflow management through Data Pipelines
ETL Pipeline ML pipeline CI/CD Application
Pachyderm
Versioned Training Data
Pre-Processing
Model Export
Versioned Pre-Processed Data
Training
Versioned Model
Coming Soon
github.com/kubeflow/examples
Pachyderm provides audit trails via Data Provenance
Track every version of data and code that produced a result
Maintain compliance and reproducibility
Manage relationship between historical data states
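Provenance tracking can be pictured as each result carrying references to the exact data and code versions that produced it. The following Python sketch is a toy model under that assumption, not Pachyderm's data model: the Artifact type, the provenance function, and the commit and code-version labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A versioned piece of data plus the code version and inputs that produced it."""
    name: str
    commit: str
    code_version: str = ""
    inputs: tuple = ()


def provenance(artifact):
    """Return every upstream (name, commit) pair that contributed to an artifact."""
    seen = []
    stack = list(artifact.inputs)
    while stack:
        node = stack.pop()
        entry = (node.name, node.commit)
        if entry not in seen:
            seen.append(entry)
            stack.extend(node.inputs)
    return seen


# Hypothetical three-stage lineage: raw data -> pre-processed data -> model.
raw = Artifact("training_data", "commit-1")
clean = Artifact("preprocessed", "commit-2", code_version="preprocess:v1", inputs=(raw,))
model = Artifact("model", "commit-3", code_version="train:v3", inputs=(clean,))

lineage = provenance(model)
```

Walking the inputs backwards from the model yields the audit trail: which pre-processed data, and in turn which raw data, the model was built from.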
Pachyderm Stack Diagram
Data Provenance In Action
Being able to pinpoint exactly what data is being used is hard enough for most companies. Tack on the requirement of having to edit/remove a specific piece of data without disruption, and that seems next to impossible.
General Data Protection Regulation
GDPR Example - Before
What happens when User “Jared” opts out?
● File a ticket
● Entire audit of pipeline
● Removal of Jared’s data
● Models need to be re-trained and tested.
● Audit to ensure Jared is not part of the future
● Etc.
Time consuming manual process
Users Database
Model Training
Model Deployed
“Black Box Problem”
GDPR Example - With Pachyderm
What happens when User “Jared” opts out
Users Database
Model Training
Model Deployed
Jian Yang - commit: 9fa0a4...74f
Gaven Belson - commit: 8593ef...4d7
Jared Dunn - commit: 60fae8...7d0
“Pachctl delete-file jared.info”
Pachyderm maintains a complete audit, enabling you to add/edit/remove data with just one command and zero disruption.
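Why one deletion can be enough is easier to see with a dependency graph: if the system knows which stages consume which data, removing a record identifies everything downstream that must be recomputed. The Python sketch below is a toy illustration of that idea only. The stage names and the DOWNSTREAM mapping are hypothetical, and it does not reproduce how Pachyderm schedules work.

```python
from collections import deque

# Hypothetical pipeline graph: each key's output feeds the listed stages.
DOWNSTREAM = {
    "users": ["pre_processing"],
    "pre_processing": ["training"],
    "training": ["model_export"],
    "model_export": [],
}


def stages_to_rerun(changed, downstream=DOWNSTREAM):
    """Return every stage downstream of a changed repo, in breadth-first order."""
    order = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Removing Jared's record changes the "users" repo.
affected = stages_to_rerun("users")
```

In the "Before" scenario this list has to be reconstructed by hand through tickets and audits; with tracked lineage it falls out of the graph.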
GDPR Request
Met
Pachyderm in 60-seconds
Pachyderm lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance.
github.com/pachyderm
Thank you

End-to-End Machine learning pipelines for Python driven organizations - Nick Harvey
