End-to-End Machine learning pipelines for Python driven organizations - Nick Harvey

Pachyderm
Reproducible and Compliant Data Science
Nick Harvey - Lead Developer Advocate
Pachyderm Inc.
Nick@pachyderm.com
@nicksharvey

As Data Scientists...
We “Big” Data
Pachyderm.com

Production
ML/AI Model Training
ML/AI Model Inference or Prediction

Production
ML/AI Model Training
Data Input
Transforms
Data Ingestion
Data Cleaning
Feature Engineering
Model Selection, Parameter Search
Feature Transforms
Production
Model Testing
Model Export & Optimization
ML/AI Model Inference or Prediction
Post Processing
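The stages in the diagram above can be pictured as functions applied one after another. The following is only a minimal sketch under that assumption: the stage functions, the sample values, and the `run_pipeline` helper are all hypothetical and cover just the first three stages.

```python
from functools import reduce


def ingest(_):
    # Hypothetical raw input: strings as they might arrive from a sensor or file.
    return [" 3.0", "4.5 ", "", "n/a", "6.0"]


def clean(rows):
    # Keep only the rows that parse as numbers.
    values = []
    for row in rows:
        try:
            values.append(float(row.strip()))
        except ValueError:
            pass
    return values


def engineer_features(values):
    # Example feature transform: centre the values on their mean.
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def run_pipeline(stages, data=None):
    # Feed the output of each stage into the next one.
    return reduce(lambda acc, stage: stage(acc), stages, data)


features = run_pipeline([ingest, clean, engineer_features])
print(features)
```

Because each stage only sees the previous stage's output, any stage can be swapped or re-run on its own, which is what makes it practical to track each intermediate result separately.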

To Reach Its Full Potential
Machine Learning Needs
1. Data to have the same production practices as code
2. Empowered developers, not restricted
3. Organization-wide confidence

Obstacles that prevent Effective Data Science
Data Divergence
Data sets change constantly. Teams can’t make decisions from their data if they don’t know what version was used.
Tooling Constraints
Infra often restricts the tooling options available to data scientists.
Not Reproducible
Data teams can’t reproduce results because they can’t track every version of data and code throughout the system.

For data science to be successful, outputs need to be reproducible
Manage data with the same production practices as code: Version control for Data
Developers need to be empowered with choice, not restricted: Containerized data pipelines
Be able to instantly reconstruct any past output/decision: Data Lineage

General Fusion uses Pachyderm to Power Commercial Fusion Research
“The true tipping point in our decision to use Pachyderm was its version control features for managing our data.”
- Jonathan Fraser, Engineer at General Fusion
General Fusion collects large sets of complex data from thousands of sensors. Managing, scaling, and processing that data is a challenge.
Criteria
1. A data science platform that could scale and adapt with their growth.
2. Augment existing experimental and analysis workflows.
3. Seamless collaboration with external scientific partners.
Business Outcome
1. Data versioning - Pachyderm enables data science teams to develop reproducible and distributed data workflows without interfering with each other's analysis.
2. Data provenance - Every data transformation is tracked, allowing any result to be 100 percent reproducible and verifiable.

Pachyderm provides reproducibility through
Data Versioning
Identify and revert “bad” data changes
Version model binaries and parameters along with the data used to train them
Reproduce specific processes using historical state(s) of data
Commit ID: a5bcc61...1812
Commit ID: 7afad96...680e
Commit ID: b85ea63...e4d4
Commit ID: 7585b4e...0cc5
Commit ID: af4cf48...8840
person.png
stopsign.png
road.png
boat.png
bike.png
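The idea of commit IDs over data can be illustrated with a toy, in-memory repository. This is a sketch of the general technique (content-addressed snapshots with parent links), not Pachyderm's implementation; `ToyRepo`, its methods, and the file contents are hypothetical, and only the file names come from the slide.

```python
import hashlib
import json


class ToyRepo:
    """Toy, in-memory stand-in for a versioned data repository."""

    def __init__(self):
        # commit ID -> {"parent": ID or None, "files": {path: bytes}}
        self.commits = {}
        self.head = None

    def commit(self, changes):
        """Apply {path: bytes, or None to delete} on top of head; return the new commit ID."""
        files = dict(self.commits[self.head]["files"]) if self.head else {}
        for path, data in changes.items():
            if data is None:
                files.pop(path, None)
            else:
                files[path] = data
        # The ID is a hash of the parent ID plus the hash of every file.
        manifest = {
            "parent": self.head,
            "files": {p: hashlib.sha256(d).hexdigest() for p, d in sorted(files.items())},
        }
        commit_id = hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).hexdigest()
        self.commits[commit_id] = {"parent": self.head, "files": files}
        self.head = commit_id
        return commit_id

    def files_at(self, commit_id):
        """Return the full file set as it existed at a given commit."""
        return dict(self.commits[commit_id]["files"])


repo = ToyRepo()
c1 = repo.commit({"person.png": b"person pixels"})
c2 = repo.commit({"stopsign.png": b"stop sign pixels"})
c3 = repo.commit({"person.png": b"corrupted"})  # a "bad" data change
# Revert by re-committing the historical contents of the file.
c4 = repo.commit({"person.png": repo.files_at(c2)["person.png"]})
```

Every historical state stays addressable by its ID, so a training run that records the commit it read from can be repeated later against exactly the same data.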

Pachyderm provides workflow management through
Containerized Analyses
Use any languages and frameworks in pipelines
Port your workflows to any infrastructure
Easily transition from local dev to production deploy

Pachyderm provides workflow management through
Data Pipelines
Use any languages and frameworks in pipelines
Port your workflows to any infrastructure
Easily transition from local dev to production deploy
ETL Pipeline, ML pipeline, CI/CD, Application
Pachyderm
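A containerized pipeline stage is typically described declaratively: a name, a container image with a command, and an input data repository. The sketch below builds such a description in Python and prints it as JSON. The field names follow the Pachyderm pipeline specification as far as I recall it and have changed between releases, so treat them as an assumption and check the documentation for your version. The pipeline name, image, script path, and repo name are hypothetical.

```python
import json


def make_pipeline_spec(name, image, cmd, input_repo, glob="/*"):
    """Build a pipeline description as a plain dict (field names are assumed, see above)."""
    return {
        "pipeline": {"name": name},
        # Any language or framework works, as long as it is packaged in the image.
        "transform": {"image": image, "cmd": cmd},
        # The glob pattern controls how the input repo is split into units of work.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


spec = make_pipeline_spec(
    name="preprocess",                      # hypothetical pipeline name
    image="example/preprocess:latest",      # hypothetical container image
    cmd=["python3", "/code/preprocess.py"], # hypothetical script inside the image
    input_repo="raw-data",                  # hypothetical input repository
)
print(json.dumps(spec, indent=2))
```

Keeping the stage definition as data rather than code is what allows the same pipeline to move from a local cluster to production without rewriting the analysis itself.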

Versioned Training Data -> Pre-Processing -> Versioned Pre-Processed Data -> Training -> Versioned Model -> Model Export
Coming Soon
github.com/kubeflow/examples

Pachyderm provides audit trails via
Data Provenance
Track every version of data and code that produced a result
Maintain compliance and reproducibility
Manage relationships between historical data states
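Provenance tracking amounts to recording, for every result, which data versions and which code version produced it, and then walking those records backwards. A minimal sketch of that bookkeeping follows; the `Artifact` class, the repo names, the commit IDs, and the code version labels are all hypothetical and are not Pachyderm's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str                # repo or output name
    commit: str              # data version that holds this artifact
    code_version: str = ""   # code that produced it (empty for raw inputs)
    inputs: tuple = ()       # the artifacts it was computed from


def provenance(artifact):
    """Return every upstream (name, commit) pair, nearest ancestors first."""
    seen = []
    for parent in artifact.inputs:
        if (parent.name, parent.commit) not in seen:
            seen.append((parent.name, parent.commit))
        for item in provenance(parent):
            if item not in seen:
                seen.append(item)
    return seen


raw = Artifact("raw-data", "c1")
clean = Artifact("clean-data", "c2", code_version="preprocess:v1", inputs=(raw,))
model = Artifact("model", "c3", code_version="train:v3", inputs=(clean,))

for name, commit in provenance(model):
    print(name, commit)
```

With records like these, the question "which data and code produced this model?" becomes a lookup rather than an investigation.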

Pachyderm Stack Diagram

Data Provenance In Action
General Data Protection Regulation
Being able to pinpoint exactly what data is being used is hard enough for most companies. Tack on the requirement of having to edit/remove a specific piece of data without disruption, and that seems next to impossible.

GDPR Example - Before
What happens when User “Jared” opts out?
Users Database -> Model Training -> Model Deployed
“Black Box Problem”
Time-consuming manual process:
● File a ticket
● Entire audit of pipeline
● Removal of Jared’s data
● Models need to be re-trained and tested
● Audit to ensure Jared is not part of the future
● Etc.

GDPR Example - With Pachyderm
What happens when User “Jared” opts out?
Users Database -> Model Training -> Model Deployed
Jian Yang, commit: 9fa0a4...74f
Gaven Belson, commit: 8593ef...4d7
Jared Dunn, commit: 60fae8...7d0
“pachctl delete-file jared.info”
Pachyderm maintains a complete audit, enabling you to add/edit/remove data with just one command and zero disruption.
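Why a single delete can be enough is easier to see with a small model of lineage: if every derived output records the inputs it came from, removing one input file identifies exactly which outputs are stale and need recomputing. The sketch below illustrates that idea only. The slides do not show how Pachyderm does this internally, and apart from `jared.info` every file and output name here is hypothetical.

```python
def downstream_of(path, lineage):
    """Return every output that depends on `path`, directly or transitively."""
    affected = set()
    frontier = [path]
    while frontier:
        current = frontier.pop()
        for output, inputs in lineage.items():
            if current in inputs and output not in affected:
                affected.add(output)
                frontier.append(output)
    return affected


def delete_file(path, files, lineage):
    """Remove one input file and report which outputs must be recomputed."""
    remaining = files - {path}
    stale = downstream_of(path, lineage)
    return remaining, stale


files = {"jian.info", "gaven.info", "jared.info"}
# Each output maps to the set of inputs it was computed from.
lineage = {
    "features/jian": {"jian.info"},
    "features/gaven": {"gaven.info"},
    "features/jared": {"jared.info"},
    "model": {"features/jian", "features/gaven", "features/jared"},
}

remaining, stale = delete_file("jared.info", files, lineage)
print(sorted(remaining))
print(sorted(stale))
```

Only the outputs that actually touched the deleted file are flagged, so the other users' features can be reused while the model is retrained without Jared's data.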

GDPR Example - With Pachyderm
GDPR Request Met

Pachyderm in 60 seconds
Pachyderm lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance.
github.com/pachyderm
