3. Data-driven, predictive data apps are making our world amazing…
Historical data
Sensor & interaction data
Real-time predictions & decisions
Recommenders
Industrial Apps
Forecasters
Social
Human Sensing
Fraud & Anomaly Detection
Sentiment Analysis & other Text Apps
Personalized Medicine
4. Action 1: simple predictive app
Aren’t you tired of those spam txt messages???
7. Building a predictive app
Was using 217 business rules, hoping the world doesn’t change
Have an inspiring idea to reinvent their business
Key pains:
Hiring Talent
35% shortfall in data-savvy workers needed to make sense of big data by 2018 [McKinsey 2011]
Noisy Space of Tools
“Data scientists use a variety of tools, across different programming languages… require a lot of context-switching… affects productivity and impedes reproducibility.”
- Ben Lorica, “Data Analysis: Just one component of the Data Science workflow”
8. Crossing the Big Data Chasm
speed of iteration
scale of data
Get a Hadoop cluster!?!?
single machine memory
production data
9. Crossing the Big Data Chasm
speed of iteration
scale of data
single machine memory
production data
big data chasm
10. Crossing the Big Data Chasm
speed of iteration
scale of data
single machine memory
production data
GraphLab Create™:
Unleashing data science
from inspiration to production
11. Data scientist: inspiration to production
Analyze big data on one machine
graphs, tables, text, images
in Python
doesn’t have to fit in memory
Distribute in production
with same code
on EC2, YARN, …
Use my laptop
Variety of data
Not toy data scales
Language I love
Iterate quickly
Prototype Monitor Production
data pipeline predictive service
GraphLab Canvas: Monitor & visualize
from prototype to production
GraphLab Create
12. What folks are saying about GraphLab Create
“The ease of use and scalable performance, which is not limited by the memory of the
machine, are allowing us to innovate and advance at an astonishing pace.”
- Andrew Bruce, Senior Director, Data Science, Zillow
“GraphLab Create provides us with an end-to-end efficient framework … both tabular and
graph data generated by the activity of our users.”
- Baldo Faieta, Social Computing Lead, Adobe Systems
“…during my time as Zynga's lead architect for big data, I found my way to GraphLab. I was
astounded at the dramatic savings, on the order of 500x…”
- Mohan Reddy, Chief Architect, The Hive LLC.
18. Productive on one machine
Python
Tables, graphs, text & images
Scale beyond memory
Integrated visualization
Productive at scale
Deploy, monitor, improve on the Cloud/Cluster
Integrates with Hadoop/EC2
Deploy data pipelines & prediction services
Built for Machine Learning in Production
Scalable, robust algorithms
End-to-end support
Get to value fast
Powered by GraphLab Engine
Fastest analytics system
GraphLab Create
19. Getting started is as easy as…
1. pip install graphlab-create
2. be creative with your data
Hands-on training tomorrow
Online: Learning ML in Practice with GraphLab Notebook
http://graphlab.com/learn
ML tutorials, tips, tricks
End-to-end GraphLab Notebooks
Quick how-tos
25. Scalable data structures ease those scaling pains
Pain | SFrame/SGraph
Running out of memory | Graceful degradation; optimized out-of-core computation
Integrating data | Unified tables, graphs, text, images
Missing values | Missing value support from get-go
Strong/weak types | Strong types only when strength helps
26. SFrame: Scalable tabular data
Never run out of memory
Sharded, compressed, out-of-core, columnar
Arbitrary lambda transformations, joins,… from Python
Group-by aggregate: 10M rows/s on your desktop
27. SFrame: Scalable tabular data
Never run out of memory
Sharded, compressed, out-of-core, columnar
Arbitrary lambda transformations, joins,… from Python
Group-by aggregate: 10M rows/s on your desktop
Same Python code
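The out-of-core group-by idea on this slide can be sketched in plain Python: stream rows in fixed-size chunks and keep only per-key running aggregates in memory. This is a minimal illustration, not the SFrame API or its implementation; the function names (`iter_chunks`, `groupby_mean`) and the toy CSV are hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def groupby_mean(rows, key, value, chunk_size=2):
    """Streaming group-by mean: only per-key (sum, count) stays in memory."""
    totals = defaultdict(lambda: [0.0, 0])
    for chunk in iter_chunks(rows, chunk_size):
        for row in chunk:
            acc = totals[row[key]]
            acc[0] += float(row[value])
            acc[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


# Toy stand-in for a file too large to load at once (hypothetical data).
TOY_CSV = "user,rating\na,4\nb,2\na,2\nb,5\nc,3\n"
reader = csv.DictReader(io.StringIO(TOY_CSV))
means = groupby_mean(reader, key="user", value="rating")
print(means)
```

Memory use here grows with the number of distinct keys rather than the number of rows, which is the property that lets an aggregate run over data larger than RAM.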
28. SGraph: Scalable graph data
Easily and efficiently express entire pipelines
PageRank at 10M edges/s on your desktop
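For readers unfamiliar with the PageRank workload quoted here, a minimal power-iteration sketch in plain Python follows. It is not SGraph code and says nothing about how GraphLab Create implements PageRank; the function name, the damping factor of 0.85, and the toy edge list are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph: every other node links to "hub".
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
ranks = pagerank(toy_edges)
most_central = max(ranks, key=ranks.get)
print(most_central, ranks)
```

Each iteration touches every edge once, so an edges-per-second figure translates directly into time per iteration for a graph of a given size.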
29. Action 3: data science on a terabyte of data on my laptop
32. Most ML toolkits don’t focus on the real challenges
Tools out there | Real needs
Bag of algorithms | Task-oriented, e.g., recommender
Brittle | Robust to data problems
Lots of parameters to tune | Automatic; tuning is a bonus
“State-of-the-art” methods in “research” mode | State-of-the-art accuracy, performance (& methods)
GraphLab Create: Robust ML & graph analytics
state-of-the-art scaling and accuracy
focused on solving tasks, automatically
39. Reusable components
Runs on Hadoop
CDH5 now; Pivotal, Spark coming…
Runs on Cloud
EC2 now; Azure, Google coming…
Data pipelines & predictive services
GraphLab
Data Pipeline
Beyond batch & stream processing
Predictive applications
require real-time service
Deployed directly from
data pipeline
GraphLab
Predictive Service
Monitor from GraphLab Canvas
43. GraphLab Create is fast, really fast
Tables: Recommender, 3.5B ratings, 1 hour on your desktop
Graphs: PageRank at 10M edges/s on your desktop
Text: LDA at 1.4M tokens/s on your desktop
Most importantly, it’s fast enough for production now!
47. Logistic regression benchmark
Orders of magnitude faster
KDD Cup data: predict student performance on math problems based on interactions with a tutoring system
8.4M data points, 20M features, 2.4GB compressed
[Bar chart: time in seconds on desktop, axis from 0 to 70,000; bars for Scikit Learn and GraphLab Create]
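As a reminder of what a logistic regression benchmark is timing, here is a minimal sketch of the model fitted by batch gradient descent in plain Python. It is not the solver used by either library in the chart; the function names, learning rate, epoch count, and toy data are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)


# Hypothetical toy data: one feature, label is 1 when the feature is large.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
w_fit, b_fit = train_logistic(X_toy, y_toy)
preds = [predict(w_fit, b_fit, x) for x in X_toy]
print(preds)
```

Every epoch costs one pass over all data points and features, which is why training time at the scale quoted on this slide depends so heavily on the system underneath.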
51. GraphLab Create Roadmap
March 2014 July 21st 2014 October 2014
Scalable data structures
Tables, graphs, text
Robust ML algorithms
GraphLab Canvas
Data pipelines
New ML algorithms
More data types
Predictive services
Monitoring in production
100+ companies participated in beta program
Already used in production
Extremely positive feedback
Every feature since March in response to customer requests
Please keep them coming!
52. Commitment to open-source
• We have been committed to open-source for 6 years
- PowerGraph, GraphChi,…
- Our focus now is on GraphLab Create
• We are inspired by companies like MongoDB & Elasticsearch
- Open-source core
- Provide value-add tools, such as monitoring & management
Our users can be successful by just
using the open-source version
Who cares whether you are using SGD, ALS, L-BFGS, …?
You want performance, simplicity, accuracy
These slides are missing the punchline. So what if we can do "PageRank in 22 seconds"? Let's turn this around and say how fast these systems can find central users (with a footnote stating the algorithm). Let's make a case for real-time response speeds.