2. Reusable components
Runs on Hadoop
CDH5 now; Pivotal, Spark coming…
Runs on Cloud
EC2 now; Azure, Google coming…
Data pipelines & Predictive services
GraphLab Data Pipeline
Beyond batch & stream processing
Predictive applications require real-time service
Deployed directly from data pipeline
GraphLab Predictive Service
Monitor from GraphLab Canvas
3. Sample Data Pipeline
A Simple Recommender System
Train Model → Recommend → Persist
• Source: Raw data from CSV
• Tasks: Train Model, Produce Recommendations, Persist
• Destination: Write to Database
7. Typical Challenges to Production
• Refactor code to remove magic numbers, file paths, support dynamic config
• Rewrite entire prototype in ‘production’ language
• Build / integrate workflow support tools
• Build / integrate monitoring & management tools
GraphLab Create provides a better way …
8. Sample Data Pipeline
[Pipeline diagram: TRAIN → RECOMMEND → PERSIST, with the csv, model, and users bindings passed between tasks and results persisted to disk]
import graphlab as gl

def train_model(task):
    # read the raw CSV into an SFrame and train a recommender
    csv = task.params['csv']
    data = gl.SFrame.read_csv(csv)
    model = gl.recommender.create(data)
    # expose the model and the parsed data as task outputs
    task.outputs['model'] = model
    task.outputs['users'] = data
Code can be Python functions or file(s)
13. Executing Data Pipelines
job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='cdh5-prod')
• One way to create Jobs (with task bindings)
14. Executing Data Pipelines
job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='cdh5-prod')
• One way to create Jobs (with task bindings)
• One way to monitor Jobs
15. Executing Data Pipelines
job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='ec2-prod')
• One way to create Jobs (with task bindings)
• One way to monitor Jobs
• Run on Hadoop, EC2, or locally without changing code
16. Executing Data Pipelines
job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='cdh5-prod')
• One way to create Jobs (with task bindings)
• One way to monitor Jobs
• Run on Hadoop, EC2, or locally without changing code
• Recall previous Jobs and Tasks, maintain workbench
Thank you all for being here. I’m Rajat Arya, an engineer at GraphLab, and I am excited to share with you our vision for how to use GraphLab Create in production.
In production, our vision starts with Data Pipelines and extends to Predictive Services.
Data Pipelines are all about making it easy to go from prototype to production, building data products with reusable components that can run in Hadoop or EC2.
Predictive Services is about taking your data pipeline and deploying it to a real-time service that has low latency and a RESTful API over your model.
Today I will talk about Data Pipelines, which are available now. Predictive Services will be generally available in GraphLab Create 1.0, but are in limited beta now. If you want to find out more come and talk with me at the booth.
So the best way to start talking about data pipelines is to start with an example ...
Here is an example pipeline – a simple recommender system
It has three tasks:
train_model turns raw data into a trained model, with parameters; Recommend generates recommendations for users based on the model and user data; Persist stores the recommendations to a database.
How would you start to build this type of system today?
…
We heard these challenges over and over again from customers, and felt this pain ourselves, and this led us to develop GraphLab Data Pipelines.
We believe GraphLab Create provides a better way …
I want to briefly introduce how easy it is to define the sample pipeline from the previous slide as a series of Tasks in the Data Pipelines Framework…
Let’s see what it takes to define the train task in this pipeline…
This code is a user-defined Python function, with arbitrary Python code in it.
This task takes a CSV as input, parses it into an SFrame, and uses the SFrame to train a model.
Then it sets both the model and the SFrame as outputs.
This task shows how the code to be run can be defined as a Python function or as a set of Python files in a larger software project.
The Recommend task is again a Python function, but this time it takes both the model and the users from the Train task as inputs, and uses them to generate recommendations, which it then sets as its output.
This task shows how dependencies between tasks are logically defined, by name, allowing for great flexibility when composing pipelines.
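For reference, a minimal sketch of what the Recommend task’s code might look like follows; the task.inputs accessor, the output name 'recommendations', and the 'user_id' column are assumptions that mirror the task.params / task.outputs pattern shown for train_model.

def recommend(task):
    # inputs bound by name to the Train task's outputs
    model = task.inputs['model']
    users = task.inputs['users']
    # top-10 recommendations for each user in the input data
    recs = model.recommend(users=users['user_id'], k=10)
    task.outputs['recommendations'] = recs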
And finally the Persist task: here the code takes the recommendations from the Recommend task, imports a MySQL connector Python package, and uses a helper function defined elsewhere to save the recommendations to the MySQL database.
This task shows how the framework enables task portability, by automatically installing and configuring required packages prior to execution. This is a huge help in production systems because you no longer have to worry if the package you are using is available on the cluster you want to run the task on.
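To make that concrete, here is a minimal sketch of such a Persist task; the mysql.connector usage, connection settings, table name, and column names are all hypothetical stand-ins for the helper that is actually defined elsewhere.

import mysql.connector

def persist(task):
    recs = task.inputs['recommendations']
    # hypothetical connection settings and schema
    conn = mysql.connector.connect(host='db-host', user='app',
                                   password='secret', database='recsdb')
    cursor = conn.cursor()
    for row in recs:  # iterating an SFrame yields one dict per row
        cursor.execute(
            "INSERT INTO user_recommendations (user_id, item_id, score) "
            "VALUES (%s, %s, %s)",
            (row['user_id'], row['item_id'], float(row['score'])))
    conn.commit()
    conn.close()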
Tasks are more than just easy to define… they enable rapid iteration and experimentation.
Let’s say you have a ‘rock star’ intern who claims that she can improve the model that is trained. Of course you don’t trust her, but you want to see if she is right…
So with GraphLab Data Pipelines you simply clone the Train task, let’s call it INTERN TRAIN instead, and as long as it still takes a CSV and produces both the model and the users, it can seamlessly get plugged into the pipeline.
This shows how tasks are modular and reusable, allowing for A/B testing, clean divisions of responsibility, and rapid iterations.
Oh, and after experimenting with the intern’s results, it turns out she was right, so you ditch the original Train task altogether.
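A sketch of how that swap might look: intern_train, the factorization recommender, and its regularization value are all hypothetical; the only thing that matters is that it keeps the same csv parameter and the same model and users outputs.

import graphlab as gl

def intern_train(task):
    data = gl.SFrame.read_csv(task.params['csv'])
    # same interface as train_model, different modeling choices inside
    model = gl.recommender.factorization_recommender.create(
        data, regularization=1e-4)
    task.outputs['model'] = model
    task.outputs['users'] = data

# plug it into the pipeline in place of the original Train task
job = gl.deploy.job.create([intern_train, recommend, persist],
                           environment='cdh5-prod')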
When thinking about executing pipelines, the first thing that comes to mind is that there are many environments to support.
I want to execute locally for debugging, then in an EC2 test environment, then an EC2 production environment, maybe my dev Hadoop cluster, then my test Hadoop cluster, and finally my production Hadoop cluster. I don’t want to have to learn and remember all the different ways to get my jobs running in each of these environments.
With GraphLab Create there is one way to create jobs – using deploy.job.create – specifying the tasks to run and the environment for them to run in.
Once run, this API returns a Job object, which provides a single, consistent way to monitor the execution of these jobs.
No more needing to remember how to query the YARN Resource Manager to get status, or remembering how to query the status on EC2 – one consistent API to monitor and manage the job.
To change where this job runs, all you need to do is switch the environment parameter, no code changes are necessary.
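Concretely, switching targets would look something like this; the environment names are assumed to have been set up beforehand, and the status call is an assumption about the Job object’s monitoring interface rather than a confirmed method name.

# same task list, different targets - only the environment changes
local_job = gl.deploy.job.create([train, recommend, persist],
                                 environment='local')
ec2_job = gl.deploy.job.create([train, recommend, persist],
                               environment='ec2-prod')
hadoop_job = gl.deploy.job.create([train, recommend, persist],
                                  environment='cdh5-prod')

# one consistent way to check on any of them (assumed accessor)
print(hadoop_job.get_status())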
And finally, having one way to launch and monitor jobs locally, in ec2, or hadoop wouldn’t be valuable if you couldn’t close your laptop and go for lunch.
So with GraphLab Create your Jobs and Tasks are managed, so you can close your laptop, exit python, and come back later to pick up where you left off.
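For example, after restarting Python you might pick things back up along these lines; the gl.deploy.jobs registry, the name-based lookup, and the get_status call are assumptions about the session bookkeeping, not confirmed API.

import graphlab as gl

# previously created jobs are remembered by the workbench (assumed registry)
print(gl.deploy.jobs)

# look up an earlier job by name and check how it finished (assumed API)
job = gl.deploy.jobs['sample-recommender-pipeline']
print(job.get_status())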
Now that we’ve covered how easy it is to define tasks and compose them into jobs – let’s see this in action with a quick demo.