End-to-End Machine learning pipelines for Python driven organizations - Nick Harvey

The recent advances in machine learning and artificial intelligence are amazing! Yet, in order to have real value within a company, data scientists must be able to get their models off of their laptops and deployed within a company’s data pipelines and infrastructure. In this session, I'll demonstrate how one-off experiments can be transformed into scalable ML pipelines with minimal effort.

  1. Pachyderm: Reproducible and Compliant Data Science. Nick Harvey, Lead Developer Advocate, Pachyderm Inc. Nick@pachyderm.com, @nicksharvey
  2. As Data Scientists... We Pachyderm.com
  3. As Data Scientists... We “Big” Data
  4. As Data Scientists... We
  5. Production: ML/AI Model Training; ML/AI Model Inference or Prediction.
  6. Production: ML/AI Model Training; Data Input Transforms; Data Ingestion; Data Cleaning; Feature Engineering; Model Selection, Parameter Search; Feature Transforms; Production Model Testing; Model Export & Optimization; ML/AI Model Inference or Prediction; Post Processing.
  7. To Reach Its Full Potential, Machine Learning Needs: (1) data to have the same production practices as code; (2) empowered developers, not restricted ones; (3) organization-wide confidence.
  8. Obstacles that Prevent Effective Data Science. Data Divergence: Data sets change constantly. Teams can’t make decisions from their data if they don’t know what version was used. Tooling Constraints: Infra often restricts the tooling options available to data scientists. Not Reproducible: Data teams can’t reproduce results because they can’t track every version of data and code throughout the system.
  9. For data science to be successful, outputs need to be reproducible. Manage data with the same production practices as code. Developers need to be empowered with choice, not restricted. Version control for data. Containerized data pipelines. Be able to instantly reconstruct any past output/decision. Data lineage.
  10. General Fusion uses Pachyderm to Power Commercial Fusion Research. “The true tipping point in our decision to use Pachyderm was its version control features for managing our data.” - Jonathan Fraser, Engineer at General Fusion. General Fusion collects large sets of complex data from thousands of sensors. Managing, scaling, and processing that data is a challenge. Criteria: (1) A data science platform that could scale and adapt with their growth. (2) Augment existing experimental and analysis workflows. (3) Seamless collaboration with external scientific partners. Business Outcome: (1) Data versioning: Pachyderm enables data science teams to develop reproducible and distributed data workflows without interfering with each other's analysis. (2) Data provenance: Every data transformation is tracked, allowing any result to be 100 percent reproducible and verifiable.
  11. Pachyderm provides reproducibility through Data Versioning. Identify and revert “bad” data changes. Version model binaries and parameters along with the data used to train them. Reproduce specific processes using historical state(s) of data. Commit IDs: a5bcc61...1812, 7afad96...680e, b85ea63...e4d4, 7585b4e...0cc5, af4cf48...8840. Files: person.png, stopsign.png, road.png, boat.png, bike.png.
  12. Pachyderm provides workflow management through Containerized Analyses. Use any languages and frameworks in pipelines. Port your workflows to any infrastructure. Easily transition from local dev to production deploy.
  13. Pachyderm provides workflow management through Data Pipelines. Use any languages and frameworks in pipelines. Port your workflows to any infrastructure. Easily transition from local dev to production deploy. ETL Pipeline ML pipeline CI/CD Application Pachyderm
  14. Versioned Training Data; Pre-Processing; Model Export; Versioned Pre-Processed Data; Training; Versioned Model. Coming Soon: github.com/kubeflow/examples
  15. Pachyderm provides audit trails via Data Provenance. Track every version of data and code that produced a result. Maintain compliance and reproducibility. Manage relationship between historical data states.
  16. Pachyderm Stack Diagram
  17. Data Provenance In Action. Being able to pinpoint exactly what data is being used is hard enough for most companies. Tack on the requirement of having to edit/remove a specific piece of data without disruption, and that seems next to impossible. General Data Protection Regulation.
  18. GDPR Example - Before. What happens when user “Jared” opts out? Model Training; Users Database; Model Deployed. “Black Box Problem”. Time-consuming manual process: ● File a ticket ● Entire audit of pipeline ● Removal of Jared’s data ● Models need to be re-trained and tested ● Audit to ensure Jared is not part of the future ● Etc.
  19. GDPR Example - With Pachyderm. What happens when user “Jared” opts out? Model Training; Users Database; Model Deployed. Jian Yang commit: 9fa0a4...74f; Gaven Belson commit: 8593ef...4d7; Jared Dunn commit: 60fae8...7d0. “pachctl delete-file jared.info”. Pachyderm maintains a complete audit, enabling you to add/edit/remove data with just one command and zero disruption.
  20. GDPR Example - With Pachyderm. What happens when user “Jared” opts out? Model Training; Users Database; Model Deployed. Jian Yang commit: 9fa0a4...74f; Gaven Belson commit: 8593ef...4d7; Jared Dunn commit: 60fae8...7d0. Pachyderm maintains a complete audit, enabling you to add/edit/remove data with just one command and zero disruption. GDPR Request Met.
  21. Pachyderm in 60 seconds. Pachyderm lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance. Pachyderm.com, github.com/pachyderm
  22. Thank you
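
The data versioning, provenance, and GDPR opt-out ideas from the slides can be sketched in plain Python. This is a toy, in-memory model under stated assumptions, not Pachyderm's API: `ToyRepo`, `put_file`, `delete_file`, `files_at`, and `train` are hypothetical names invented for illustration, the file contents are placeholders, and commit IDs here are simply SHA-256 hashes over the parent ID and the file contents.

```python
import hashlib
import json


class ToyRepo:
    """In-memory stand-in for a versioned data repo (hypothetical, not Pachyderm)."""

    def __init__(self):
        # commit id -> {"parent": commit id or None, "files": {path: bytes}}
        self.commits = {}
        self.head = None

    def _commit(self, files):
        # The commit ID is derived from the parent ID and a hash of every file,
        # so any change to the data yields a new, distinct commit.
        manifest = {
            "parent": self.head,
            "files": {p: hashlib.sha256(d).hexdigest() for p, d in sorted(files.items())},
        }
        commit_id = hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).hexdigest()
        self.commits[commit_id] = {"parent": self.head, "files": dict(files)}
        self.head = commit_id
        return commit_id

    def put_file(self, path, data):
        files = dict(self.commits[self.head]["files"]) if self.head else {}
        files[path] = data
        return self._commit(files)

    def delete_file(self, path):
        files = dict(self.commits[self.head]["files"])
        del files[path]
        return self._commit(files)

    def files_at(self, commit_id):
        return self.commits[commit_id]["files"]


def train(repo, commit_id):
    """Hypothetical 'training' step that records which input commit it used."""
    files = repo.files_at(commit_id)
    return {"trained_on": sorted(files), "input_commit": commit_id}


repo = ToyRepo()
repo.put_file("jian.info", b"placeholder record")
c2 = repo.put_file("jared.info", b"placeholder record")
model_v1 = train(repo, c2)        # provenance: model_v1 points back at commit c2

c3 = repo.delete_file("jared.info")
model_v2 = train(repo, c3)        # retrained on the commit without jared.info

print(model_v1["trained_on"], model_v1["input_commit"][:7])
print(model_v2["trained_on"], model_v2["input_commit"][:7])
```

Because each model records its `input_commit`, it is possible to tell which models were trained on data that included `jared.info` and which were not. Note that this toy keeps the old commit in `repo.commits`, so it illustrates lineage and retraining on a new data version only, not erasure of data from history.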
