SlideShare a Scribd company logo
1 of 43
Download to read offline
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Ambarish Joshi - Data Scientist Aditya Padhye - Data Engineer
Agile Data Science on Greenplum
Using Airflow
“Agile software development refers to a group of software
development methodologies based on iterative
development, where requirements and solutions evolve
through collaboration between self-organizing cross-
functional teams.”
Data Science Phases
Discovery Phase Operationalization (O16n)
Phase
Data Science Phases
Discovery
Phase
Data exploration &
cleaning
Feature engineering
Model Building
Model Evaluation
Data Science Phases - Agility
Discovery
Phase
Rapid Iteration
and
Experimentation
Data exploration &
cleaning
Feature engineering
Model Building
Model Evaluation
Data Science Phases - Agility
Discovery
Phase
Rapid Iteration
and
Experimentation
Data exploration &
cleaning
Feature engineering
Model Building
Model Evaluation
Greenplum Database
Segment Host
Segment
Segment
Segment Host
Segment
Segment
Standby
Master
…
Master
Host
SQL
Interconnect
Segment Host
Node1
Segment Host
Node2
Segment Host
Node3
Segment Host
NodeN
● MPP database based on
Postgres
● In database analytics
● Parallel architecture
Jupyter Notebooks
Data Science Phases
Discovery Phase Operationalization (O16n)
Phase
Data Pipelines
Testing
Monitoring
● APIs to consume model
output
Data Science Phases
Operationalization
(O16n) Phase
Data Science Phases - Agility
Operationalization
(O16n) Phase
Automated manageable pipelines
Testing with CI
Monitoring to react to Failures
Madlib Flow Talk by Frank and
Jarrod
Data Pipelines
Testing
Monitoring
● APIs to consume model
output
Data Science Phases - Agility
Operationalization
(O16n) Phase
Automated manageable pipelines
Testing with CI
Monitoring to react to Failures
Madlib Flow Talk by Frank and
Jarrod
Data Pipelines
Testing
Monitoring
● APIs to consume model
output
Airflow
● Apache Project spun
out of Airbnb
● “Airflow is a platform to
programmatically
author, schedule and
monitor workflows.”
Data Science Use-Case
● The Data
○ Time-series trajectories with
latitude and longitude of
location.
○ Subset of trajectories are labeled as
walk / not walk
● Our Model
○ Build Classification model using
labelled data to identify if new
unlabeled trajectories are walk or
not walk
`
Example trajectories
Example data
Example trajectory data
Example label data
We have mode labels
of walk and not walk
only for subset of
incoming daily
trajectories
Discovery phase → Operationalization phase
After every model iteration we check if the model is viable
● Check the quantitative metrics of the model like AUC, ROC curve,
accuracy etc
● Check the qualitative results of the model and if it make sense to a
subject matter expert
Once we are convinced that the model is both quantitatively
and qualitatively viable we can move to the
Operationalization phase
Discovery phase → Operationalization phase
Example of code from the discovery phase which is converted into a task
script
Architecture overview
Modular idempotent tasks Connect tasks to create
automated workflows
Model refitting
workflow
output
output
New data
inference workflow
TBD: Expose
model results
using an API
Operationalization Phase
Model
Evaluation
Model
Building
Feature
Engineering
Data
Exploration
Iterative
Discover
y
Phase
output
ML Model + Data
pre-processing
Discovery Phase
Pipelines
Testing
Monitoring
APIs to consume model
output
Data Science Phases - Agility
Operationalization
(O16n) Phase
Automated manageable Pipelines
Testing with CI/CD
Monitoring to React to Failures
Madlib Flow Talk by Frank and
Jarrod
End to End Directed Acyclic Graph (DAG)
Fetch
Clean
Transform
Feature Engineering
Extract labelled data for
model creation/refitting
Inference for
unlabelled data
Data Prep and Feature Engineering
Fetch
Clean
Transform
Feature Engineering
Extract labelled data for
model creation/refitting
Inference for
unlabelled data
Data Prep and Feature Engineering
Fetch
Clean
Transform
Feature Engineering
Extract labelled data for
model creation/refitting
Inference for
unlabelled data
Model Scoring DAG
Fetch
Clean
Transform
Feature Engineering
Extract labelled data for
model creation/refitting
Inference for
unlabelled data
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Data Prep and Feature Engineering -
Demo
● This DAG has a single task for model
training
● In this task we split the data into train
and test samples, train the model,
evaluate the model and capture the
accuracy, auc and model tables.
● We want all of the above to run at the
same time
Model Training DAG
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Model Training DAG - Demo
Model Scoring DAG
Fetch
Clean
Transform
Feature Engineering
Extract labelled data for
model creation/refitting
Inference for
unlabelled data
● The unlabeled data which is extracted from
the features table is scored in this DAG
● We first check if any model has been built
● If there is a model so we score the data
(inference)
Model Scoring DAG
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Model Scoring DAG - Demo
● Daily we get some more labeled data, once we have accumulated
enough labeled data we can retrain the model for better accuracy
● We have scheduled model re-training monthly
Model Re-Training
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Model Re-Training - Demo
Pipelines
Testing
Monitoring
APIs to consume model
output
Data Science Phases - Agility
Operationalization
(O16n) Phase
Automated manageable Pipelines
Testing with CI/CD
Monitoring to React to Failures
Madlib Flow Talk by Frank and
Jarrod
Testing with CI/CD
● Testing Data Pipelines is hard
● Test Coverage (Test Tasks vs Test DAGs)
● Testing as part of the CI/CD
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Testing with CI/CD - Demo
Pipelines
Testing
Monitoring
APIs to consume model
output
Data Science Phases - Agility
Operationalization (O16n)
Phase
Automated manageable Pipelines
Testing with CI/CD
Monitoring to React to Failures
Madlib Flow Talk by Frank and
Jarrod
● Monitoring and error fixing is big part of responsive data pipelines
● Ability to quickly identify what is failing, why it is failing and fixing
it with minimum lead time is crucial
● In this demo we will showcase an error fixing case
Monitoring and Error Fixing
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Monitoring and Error Fixing - Demo
Pipelines
Testing
Monitoring
APIs to consume model
output
Data Science Phases - Agility
Operationalization
(O16n) Phase
Automated manageable Pipelines
Testing with CI/CD
Monitoring to React to Failures
Madlib Flow Talk by Frank and
Jarrod
Greenplum and Jupyter notebooks provides a set of tools
to do Agile Data Science during discovery phase
Greenplum along with Airflow and Circle CI is very
effective to do Agile Data Science during the
operationalization phase
Conclusion
Questions
“We partner to help you
compete, grow, and transform.”
Demo Flow
1) 4/12 - 5/5 -> pre-demo
2) Insert row -> pre-demo
3) 5/5 - 5/10 -> data prep and feature engg
4) Model training (show geolife.models_metadata table before and after running)
5) Model scoring 5/10 - 5/12 (show geolife.walk_prediction_results and count increase
after each DAG run)
6) Model re-training (show geolife.models_metadata again)
7) Monitoring & Error Fixing 5/13 - 5/14 (uncomment line and rerun)
8) Testing - TDB

More Related Content

What's hot

Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...confluent
 
A Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerMongoDB
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registryconfluent
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka StreamsGuozhang Wang
 
MongoDB WiredTiger Internals: Journey To Transactions
MongoDB WiredTiger Internals: Journey To TransactionsMongoDB WiredTiger Internals: Journey To Transactions
MongoDB WiredTiger Internals: Journey To TransactionsMydbops
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David AndersonVerverica
 
Stream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobStream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobDatabricks
 
Best Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQLBest Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQLEDB
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드confluent
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinDataStax Academy
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemDatabricks
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
non-strict functions, bottom and scala by-name parameters
non-strict functions, bottom and scala by-name parametersnon-strict functions, bottom and scala by-name parameters
non-strict functions, bottom and scala by-name parametersPhilip Schwarz
 

What's hot (20)

Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
 
A Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTiger
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registry
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
MongoDB WiredTiger Internals: Journey To Transactions
MongoDB WiredTiger Internals: Journey To TransactionsMongoDB WiredTiger Internals: Journey To Transactions
MongoDB WiredTiger Internals: Journey To Transactions
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David Anderson
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Stream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobStream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the Job
 
Best Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQLBest Practices in Security with PostgreSQL
Best Practices in Security with PostgreSQL
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Spark tuning
Spark tuningSpark tuning
Spark tuning
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
non-strict functions, bottom and scala by-name parameters
non-strict functions, bottom and scala by-name parametersnon-strict functions, bottom and scala by-name parameters
non-strict functions, bottom and scala by-name parameters
 

Similar to Agile Data Science on Greenplum Using Airflow - Greenplum Summit 2019

Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Provectus
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Mikhail Rozhkov
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowDatabricks
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)Neotys_Partner
 
StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?mabl
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
EPAM ML/AI Accelerator - ODAHU
EPAM ML/AI Accelerator - ODAHUEPAM ML/AI Accelerator - ODAHU
EPAM ML/AI Accelerator - ODAHUDmitrii Suslov
 
" Performance testing for Automation QA - why and how " by Andrey Kovalenko f...
" Performance testing for Automation QA - why and how " by Andrey Kovalenko f..." Performance testing for Automation QA - why and how " by Andrey Kovalenko f...
" Performance testing for Automation QA - why and how " by Andrey Kovalenko f...Lohika_Odessa_TechTalks
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOpsCarl W. Handlin
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...All Things Open
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)dtz001
 
Innovate Better Through Machine data Analytics
Innovate Better Through Machine data AnalyticsInnovate Better Through Machine data Analytics
Innovate Better Through Machine data AnalyticsHal Rottenberg
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfJim Dowling
 

Similar to Agile Data Science on Greenplum Using Airflow - Greenplum Summit 2019 (20)

Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflow
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
 
StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
EPAM ML/AI Accelerator - ODAHU
EPAM ML/AI Accelerator - ODAHUEPAM ML/AI Accelerator - ODAHU
EPAM ML/AI Accelerator - ODAHU
 
" Performance testing for Automation QA - why and how " by Andrey Kovalenko f...
" Performance testing for Automation QA - why and how " by Andrey Kovalenko f..." Performance testing for Automation QA - why and how " by Andrey Kovalenko f...
" Performance testing for Automation QA - why and how " by Andrey Kovalenko f...
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOps
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
 
Innovate Better Through Machine data Analytics
Innovate Better Through Machine data AnalyticsInnovate Better Through Machine data Analytics
Innovate Better Through Machine data Analytics
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
 

More from VMware Tanzu

What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItVMware Tanzu
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023VMware Tanzu
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleVMware Tanzu
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023VMware Tanzu
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductVMware Tanzu
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready AppsVMware Tanzu
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And BeyondVMware Tanzu
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfVMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023VMware Tanzu
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptxVMware Tanzu
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchVMware Tanzu
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishVMware Tanzu
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVMware Tanzu
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - FrenchVMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023VMware Tanzu
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootVMware Tanzu
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerVMware Tanzu
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeVMware Tanzu
 
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsSpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsVMware Tanzu
 

More from VMware Tanzu (20)

What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About It
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at Scale
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a Product
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And Beyond
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptx
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - French
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - English
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - English
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - French
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software Engineer
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs Practice
 
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsSpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
 

Recently uploaded

Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 

Recently uploaded (20)

Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 

Agile Data Science on Greenplum Using Airflow - Greenplum Summit 2019

  • 1.
  • 2. © Copyright 2019 Pivotal Software, Inc. All rights Reserved. Ambarish Joshi - Data Scientist Aditya Padhye - Data Engineer Agile Data Science on Greenplum Using Airflow
  • 3. “Agile software development refers to a group of software development methodologies based on iterative development, where requirements and solutions evolve through collaboration between self-organizing cross- functional teams.”
  • 4. Data Science Phases Discovery Phase Operationalization (O16n) Phase
  • 5. Data Science Phases Discovery Phase Data exploration & cleaning Feature engineering Model Building Model Evaluation
  • 6. Data Science Phases - Agility Discovery Phase Rapid Iteration and Experimentation Data exploration & cleaning Feature engineering Model Building Model Evaluation
  • 7. Data Science Phases - Agility Discovery Phase Rapid Iteration and Experimentation Data exploration & cleaning Feature engineering Model Building Model Evaluation
  • 8. Greenplum Database Segment Host Segment Segment Segment Host Segment Segment Standby Master … Master Host SQL Interconnect Segment Host Node1 Segment Host Node2 Segment Host Node3 Segment Host NodeN ● MPP database based on Postgres ● In database analytics ● Parallel architecture
  • 10. Data Science Phases Discovery Phase Operationalization (O16n) Phase
  • 11. Data Pipelines Testing Monitoring ● APIs to consume model output Data Science Phases Operationalization (O16n) Phase
  • 12. Data Science Phases - Agility Operationalization (O16n) Phase Automated manageable pipelines Testing with CI Monitoring to react to Failures Madlib Flow Talk by Frank and Jarrod Data Pipelines Testing Monitoring ● APIs to consume model output
  • 13. Data Science Phases - Agility Operationalization (O16n) Phase Automated manageable pipelines Testing with CI Monitoring to react to Failures Madlib Flow Talk by Frank and Jarrod Data Pipelines Testing Monitoring ● APIs to consume model output
  • 14. Airflow ● Apache Project spun out of Airbnb ● “Airflow is a platform to programmatically author, schedule and monitor workflows.”
  • 15. Data Science Use-Case ● The Data ○ Time-series trajectories with latitude and longitude of location. ○ Subset of trajectories are labeled as walk / not walk ● Our Model ○ Build Classification model using labelled data to identify if new unlabeled trajectories are walk or not walk ` Example trajectories
  • 16. Example data Example trajectory data Example label data We have mode labels of walk and not walk only for subset of incoming daily trajectories
  • 17. Discovery phase → Operationalization phase After every model iteration we check if the model is viable ● Check the quantitative metrics of the model like AUC, ROC curve, accuracy etc ● Check the qualitative results of the model and if it make sense to a subject matter expert Once we are convinced that the model is both quantitatively and qualitatively viable we can move to the Operationalization phase
  • 18. Discovery phase → Operationalization phase Example of code from the discovery phase which is converted into a task script
  • 19. Architecture overview Modular idempotent tasks Connect tasks to create automated workflows Model refitting workflow output output New data inference workflow TBD: Expose model results using an API Operationalization Phase Model Evaluation Model Building Feature Engineering Data Exploration Iterative Discover y Phase output ML Model + Data pre-processing Discovery Phase
  • 20. Pipelines Testing Monitoring APIs to consume model output Data Science Phases - Agility Operationalization (O16n) Phase Automated manageable Pipelines Testing with CI/CD Monitoring to React to Failures Madlib Flow Talk by Frank and Jarrod
  • 21. End to End Directed Acyclic Graph (DAG) Fetch Clean Transform Feature Engineering Extract labelled data for model creation/refitting Inference for unlabelled data
  • 22. Data Prep and Feature Engineering Fetch Clean Transform Feature Engineering Extract labelled data for model creation/refitting Inference for unlabelled data
  • 23. Data Prep and Feature Engineering Fetch Clean Transform Feature Engineering Extract labelled data for model creation/refitting Inference for unlabelled data
  • 24. Model Scoring DAG Fetch Clean Transform Feature Engineering Extract labelled data for model creation/refitting Inference for unlabelled data
  • 25. © Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved. Data Prep and Feature Engineering - Demo
  • 26. ● This DAG has a single task for model training ● In this task we split the data into train and test samples, train the model, evaluate the model and capture the accuracy, auc and model tables. ● We want all of the above to run at the same time Model Training DAG
  • 27. © Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved. Model Training DAG - Demo
  • 28. Model Scoring DAG Fetch Clean Transform Feature Engineering Extract labelled data for model creation/refitting Inference for unlabelled data
  • 29. ● The unlabeled data which is extracted from the features table is scored in this DAG ● We first check if any model has been built ● If there is a model so we score the data (inference) Model Scoring DAG
  • 30. © Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved. Model Scoring DAG - Demo
  • 31. ● Daily we get some more labeled data, once we have accumulated enough labeled data we can retrain the model for better accuracy ● We have scheduled model re-training monthly Model Re-Training
  • 32. © Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved. Model Re-Training - Demo
  • 33. Pipelines Testing Monitoring APIs to consume model output Data Science Phases - Agility Operationalization (O16n) Phase Automated manageable Pipelines Testing with CI/CD Monitoring to React to Failures Madlib Flow Talk by Frank and Jarrod
  • 34. Testing with CI/CD ● Testing Data Pipelines is hard ● Test Coverage (Test Tasks vs Test DAGs) ● Testing as part of the CI/CD
  • 35. © Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved. Testing with CI/CD - Demo
  • 36. Pipelines Testing Monitoring APIs to consume model output Data Science Phases - Agility Operationalization (O16n) Phase Automated manageable Pipelines Testing with CI/CD Monitoring to React to Failures Madlib Flow Talk by Frank and Jarrod
  • 37. ● Monitoring and error fixing is big part of responsive data pipelines ● Ability to quickly identify what is failing, why it is failing and fixing it with minimum lead time is crucial ● In this demo we will showcase an error fixing case Monitoring and Error Fixing
  • 38. © Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved. Monitoring and Error Fixing - Demo
  • 39. Pipelines Testing Monitoring APIs to consume model output Data Science Phases - Agility Operationalization (O16n) Phase Automated manageable Pipelines Testing with CI/CD Monitoring to React to Failures Madlib Flow Talk by Frank and Jarrod
  • 40. Greenplum and Jupyter notebooks provides a set of tools to do Agile Data Science during discovery phase Greenplum along with Airflow and Circle CI is very effective to do Agile Data Science during the operationalization phase Conclusion
  • 42. “We partner to help you compete, grow, and transform.”
  • 43. Demo Flow 1) 4/12 - 5/5 -> pre-demo 2) Insert row -> pre-demo 3) 5/5 - 5/10 -> data prep and feature engg 4) Model training (show geolife.models_metadata table before and after running) 5) Model scoring 5/10 - 5/12 (show geolife.walk_prediction_results and count increase after each DAG run) 6) Model re-training (show geolife.models_metadata again) 7) Monitoring & Error Fixing 5/13 - 5/14 (uncomment line and rerun) 8) Testing - TDB