2. Reusable components
Runs on Hadoop
CDH5 now; Pivotal, Spark coming…
Runs on Cloud
EC2 now; Azure, Google coming…
Data pipelines & Predictive services
GraphLab Data Pipeline
Beyond batch & stream processing
Predictive applications require real-time service
Deployed directly from data pipeline
GraphLab Predictive Service
Monitor from GraphLab Canvas
3. Sample Data Pipeline
A Simple Recommender System
Train Model → Recommend → Persist
• Source: Raw data from CSV
• Tasks: Train Model, Produce Recommendations, Persist
• Destination: Write to Database
7. Typical Challenges to Production
• Refactor code to remove magic numbers, file paths, support dynamic config
• Rewrite entire prototype in ‘production’ language
• Build / integrate workflow support tools
• Build / integrate monitoring & management tools
GraphLab Create provides a better way …
8. Sample Data Pipeline
[Pipeline diagram: TRAIN → RECOMMEND → PERSIST, with the csv, model, and users bindings passed between tasks and results persisted to disk]
import graphlab as gl

def train_model(task):
    # read the raw CSV into an SFrame and train a recommender
    csv = task.params['csv']
    data = gl.SFrame.read_csv(csv)
    model = gl.recommender.create(data)
    # expose the model and the parsed data as task outputs
    task.outputs['model'] = model
    task.outputs['users'] = data
Code can be Python functions or file(s)
13. Executing Data Pipelines
job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='cdh5-prod')
• One way to create Jobs (with task bindings)
14. Executing Data Pipelines
job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='cdh5-prod')
• One way to create Jobs (with task bindings)
• One way to monitor Jobs
15. Executing Data Pipelines
job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='ec2-prod')
• One way to create Jobs (with task bindings)
• One way to monitor Jobs
• Run on Hadoop, EC2, or locally without changing code
16. Executing Data Pipelines
job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='cdh5-prod')
• One way to create Jobs (with task bindings)
• One way to monitor Jobs
• Run on Hadoop, EC2, or locally without changing code
• Recall previous Jobs and Tasks, maintain workbench
Thank you all for being here. I’m Rajat Arya, an engineer at GraphLab, and I am excited to share with you our vision for how to use GraphLab Create in production.
In production, our vision starts with Data Pipelines and extends to Predictive Services.
Data Pipelines are all about making it easy to go from prototype to production, building data products with reusable components that can run in Hadoop or EC2.
Predictive Services is about taking your data pipeline and deploying it to a real-time service that has low latency and a RESTful API over your model.
Today I will talk about Data Pipelines, which are available now. Predictive Services will be generally available in GraphLab Create 1.0, but are in limited beta now. If you want to find out more come and talk with me at the booth.
So the best way to start talking about data pipelines is to start with an example ...
Here is an example pipeline – a simple recommender system
It has three tasks:
train_model turns raw data into a trained model, with parameters; Recommend generates recommendations for users based on the model and user data; Persist stores the recommendations to a database.
How would you start to build this type of system today?
…
We heard these challenges over and over again from customers, and felt this pain ourselves, and this led us to develop GraphLab Data Pipelines.
We believe GraphLab Create provides a better way …
I want to briefly introduce how easy it is to define the sample pipeline from the previous slide as a series of Tasks in the Data Pipelines Framework…
Let’s see what it takes to define the train task in this pipeline…
This code is a user-defined Python function, with arbitrary Python code in it.
This task takes a CSV as input, parses it into an SFrame, and uses the SFrame to train a model.
Then it sets both the model and the SFrame as outputs.
This task shows how the code to be run can be defined as a Python function or as a set of Python files in a larger software project.
The Recommend task is again a Python function, but this time it takes both the model and the users from the Train task as inputs, and uses them to generate recommendations, which it then sets as its output.
This task shows how dependencies between tasks are logically defined, by name, allowing for great flexibility when composing pipelines.
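For reference, a minimal sketch of what the Recommend task’s code might look like follows; the task.inputs accessor, the output name 'recommendations', and the 'user_id' column are assumptions that mirror the task.params / task.outputs pattern shown for train_model.

def recommend(task):
    # inputs bound by name to the Train task's outputs
    model = task.inputs['model']
    users = task.inputs['users']
    # top-10 recommendations for each user in the input data
    recs = model.recommend(users=users['user_id'], k=10)
    task.outputs['recommendations'] = recs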
And finally the Persist task: here the code takes the recommendations from the Recommend task, imports a MySQL connector Python package, and uses a helper function defined elsewhere to save the recommendations to the MySQL database.
This task shows how the framework enables task portability, by automatically installing and configuring required packages prior to execution. This is a huge help in production systems because you no longer have to worry if the package you are using is available on the cluster you want to run the task on.
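To make that concrete, here is a minimal sketch of such a Persist task; the mysql.connector usage, connection settings, table name, and column names are all hypothetical stand-ins for the helper that is actually defined elsewhere.

import mysql.connector

def persist(task):
    recs = task.inputs['recommendations']
    # hypothetical connection settings and schema
    conn = mysql.connector.connect(host='db-host', user='app',
                                   password='secret', database='recsdb')
    cursor = conn.cursor()
    for row in recs:  # iterating an SFrame yields one dict per row
        cursor.execute(
            "INSERT INTO user_recommendations (user_id, item_id, score) "
            "VALUES (%s, %s, %s)",
            (row['user_id'], row['item_id'], float(row['score'])))
    conn.commit()
    conn.close()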
Tasks are more than just easy to define… they enable rapid iteration and experimentation.
Let’s say you have a ‘rock star’ intern who claims that she can improve the model that is trained. Of course you don’t trust her, but you want to see if she is right…
So with GraphLab Data Pipelines you simply clone the Train task, let’s call it INTERN TRAIN instead, and as long as it still takes a CSV and produces both the model and the users, it can seamlessly get plugged into the pipeline.
This shows how tasks are modular and reusable, allowing for A/B testing, clean divisions of responsibility, and rapid iterations.
Oh, and after experimenting with the intern’s results, it turns out she was right, so you ditch the original Train task altogether.
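A sketch of how that swap might look: intern_train, the factorization recommender, and its regularization value are all hypothetical; the only thing that matters is that it keeps the same csv parameter and the same model and users outputs.

import graphlab as gl

def intern_train(task):
    data = gl.SFrame.read_csv(task.params['csv'])
    # same interface as train_model, different modeling choices inside
    model = gl.recommender.factorization_recommender.create(
        data, regularization=1e-4)
    task.outputs['model'] = model
    task.outputs['users'] = data

# plug it into the pipeline in place of the original Train task
job = gl.deploy.job.create([intern_train, recommend, persist],
                           environment='cdh5-prod')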
When thinking about executing pipelines, the first thing that comes to mind is that there are many environments to support.
I want to execute locally for debugging, then in an EC2 test environment, then an EC2 production environment, maybe my dev Hadoop cluster, then my test Hadoop cluster, and finally my production Hadoop cluster. I don’t want to have to learn and remember all the different ways to get my jobs running in each of these environments.
With GraphLab Create there is one way to create jobs – using deploy.job.create – specifying the tasks to run and the environment for them to run in.
Once run, this API returns a Job object, which provides a single, consistent way to monitor the execution of these jobs.
No more needing to remember how to query the YARN Resource Manager to get status, or remembering how to query the status on EC2 – one consistent API to monitor and manage the job.
To change where this job runs, all you need to do is switch the environment parameter, no code changes are necessary.
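Concretely, switching targets would look something like this; the environment names are assumed to have been set up beforehand, and the status call is an assumption about the Job object’s monitoring interface rather than a confirmed method name.

# same task list, different targets - only the environment changes
local_job = gl.deploy.job.create([train, recommend, persist],
                                 environment='local')
ec2_job = gl.deploy.job.create([train, recommend, persist],
                               environment='ec2-prod')
hadoop_job = gl.deploy.job.create([train, recommend, persist],
                                  environment='cdh5-prod')

# one consistent way to check on any of them (assumed accessor)
print(hadoop_job.get_status())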
And finally, having one way to launch and monitor jobs locally, in ec2, or hadoop wouldn’t be valuable if you couldn’t close your laptop and go for lunch.
So with GraphLab Create your Jobs and Tasks are managed, so you can close your laptop, exit python, and come back later to pick up where you left off.
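For example, after restarting Python you might pick things back up along these lines; the gl.deploy.jobs registry, the name-based lookup, and the get_status call are assumptions about the session bookkeeping, not confirmed API.

import graphlab as gl

# previously created jobs are remembered by the workbench (assumed registry)
print(gl.deploy.jobs)

# look up an earlier job by name and check how it finished (assumed API)
job = gl.deploy.jobs['sample-recommender-pipeline']
print(job.get_status())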
Now that we’ve covered how easy it is to define tasks and compose them into jobs – let’s see this in action with a quick demo.