Spark Hearts GraphLab Create

PK
COPYRIGHT © 2015 PROKARMA INC. Copyrights, trademarks, and registered trademarks for all technology described in this document are owned by the respective companies.

OVERVIEW
SERVICES
•  Strategy
•  Technology Verticals
•  Industry Verticals
•  Platform Solutions
PRESENCE
•  21 offices, including 11 Global Delivery Centers in the US, India, Argentina & Peru
CUSTOMERS
•  150+ customers
•  40 Fortune 1000 companies
PEOPLE
•  Doing business since 2000
•  Global staff of 2000+ employees

What should we talk about?
problem formulation -> tool chain construction
data science with notebooks
the data set
Demo: Spark + GraphLab Create
RDD -> SFrame
ML on PySpark
introduction to boosted trees in GraphLab Create
Demo: Spark + GraphLab Create
PySpark ML with GraphLab Create

pipeline
Moving down the data pipeline from raw to results
How best to move quickly through the pipeline to:
1.  Show value of work
2.  Communicate results
3.  Move models into production
…NOTEBOOKS
Use lots of different data sources
ETL
Clean up your data
Run Sophisticated Analytics
Discover Insights!
Integrate Results into Pipelines
Communicate Results and Value
Based on Diagram Source: Paco Nathan

“Models are not right or wrong; they're always wrong. They're always approximations. The question you have to ask is whether a model tells you more information than you would have had otherwise. If it does, it's skillful.” - Gavin Schmidt’s excellent TED Talk
Data science with notebooks allows data science teams to move quickly from exploration > transformation > modeling > visualization > export to pipeline
Many startups have been created to fill the need for a one-product solution with collaboration and containerization
In the meantime:
PySpark + IPython + GraphLab Create

example data set: transportation safety
Fatality Analysis Reporting System
National Highway Traffic Safety Administration
●  Publicly available
●  Historical time-series
   Currently available: 1975 – 2013
●  Raw, Rich, Relevant
●  Time-series, geo location
●  Human recorded events transcribed into annual databases (dbf, SQL, SAS)
●  Measurable outcomes for modelling (updated on yearly basis)

data science showcase
DATA SCIENCE – DATA PROTOTYPING
•  Proof of concept demonstration for customer concerned with Transportation Safety
•  Application of science method towards diverse data sets
•  Evolving real-world data sets for advanced analytics workshops and training sessions
•  Visual and conceptual presentation of scientific approach to computational analysis
[Figure: map titled "Fatalities in Fatal Accidents from 1975−2012"; y-axis: lat (25 to 50); x-axis ticks from −120 to −80; color scale: deaths (0 to 160000)]
ü  Scien/ﬁc Method
ü  Predic/ve Modeling
ü  Hidden Insights
PK
+

notes about my setup
Hardware:
•  MacBook Pro (late 2012)
•  ~36 GB free disk space
•  8GB RAM
•  2 cores
•  Not exactly a blazingly fast, top of the line machine….
Software:
•  Spark 1.1.0 for Hadoop 2.4
•  GraphLab Create 1.2.1
•  Hadoop 2.4
•  Scala 2.10.4
•  Python 2.7.9 on Anaconda 1.9.1

working with glc in pyspark
One more step…
gl.get_spark_integration_jar_path()
$SPARK_HOME/bin/spark-submit --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client my_awesome_code.py
Or…
$SPARK_HOME/bin/pyspark --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client
Works with --master local OR --master yarn-client
Then…it works :)
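The launch commands on this slide can also be assembled from Python. What follows is a minimal sketch, not part of the original deck: the helper name build_spark_command and the /opt/spark install location are hypothetical, and the jar path is the slide's placeholder. In a real session the path would come from gl.get_spark_integration_jar_path(), which is left as a comment here so the sketch runs without GraphLab Create installed.

```python
import os


def build_spark_command(jar_path, script=None, master="yarn-client", spark_home=None):
    """Return the argument list for spark-submit (when a script is given)
    or for the interactive pyspark shell (when it is not)."""
    # Fall back to the SPARK_HOME environment variable if no location is passed.
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    launcher = "spark-submit" if script else "pyspark"
    cmd = [
        os.path.join(spark_home, "bin", launcher),
        "--driver-class-path", jar_path,  # makes the integration jar visible to the driver
        "--master", master,               # "local" or "yarn-client", as on the slide
    ]
    if script:
        cmd.append(script)
    return cmd


# In a real session the jar path would come from GraphLab Create:
#   import graphlab as gl
#   jar_path = gl.get_spark_integration_jar_path()
# The placeholder path from the slide is used here instead.
jar_path = "/path/to/graphlab-create-spark-integration.jar"

# "/opt/spark" is a hypothetical install location, used only for illustration.
submit_cmd = build_spark_command(jar_path, script="my_awesome_code.py", spark_home="/opt/spark")
shell_cmd = build_spark_command(jar_path, master="local", spark_home="/opt/spark")

print(" ".join(submit_cmd))
print(" ".join(shell_cmd))
```

Either list can be handed to subprocess.call to start the job or the shell. Building the command as a list rather than a single string avoids shell-quoting problems when a path contains spaces.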

Demo time!
Graphic courtesy Dato

ML in PySpark…even better now
GraphLab Create
   Recommender system
      Factorization-based methods
      Neighborhood-based item similarity
      Popularity-based methods
   Classification
      Deep learning
      Boosted trees
      Support vector machine
      Logistic regression
   Regression
      Boosted trees
      Deep learning
      Linear regression
   Text analysis
      Topic modeling (LDA)
      Featurization utilities: bm25, tf_idf, remove stop words, etc.
   Image analysis
      Deep learning
      Image load, resize, mean computation
   Clustering – K-means
   Graph analytics
   Nearest neighbors
   Vowpal Wabbit wrapper
MLlib + GraphX
   Classification and regression
      linear models (SVMs, logistic regression, linear regression)
      naive Bayes
      decision trees
      ensembles of trees (Random Forests and Gradient-Boosted Trees)
   Collaborative filtering
      alternating least squares (ALS)
   Clustering
      k-means
   Dimensionality reduction
      singular value decomposition (SVD)
      principal component analysis (PCA)
   Optimization (developer)
      stochastic gradient descent
      limited-memory BFGS (L-BFGS)
   Graph analytics

Gradient Boosted Trees with GraphLab Create
Source: Dato, ICCV 2009 Tutorial
[Figure: an ensemble of decision trees, tree t1 … tree tT; each tree routes an input v through split nodes down to leaf nodes]
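The idea behind the tree ensemble on this slide can be illustrated in a few lines of plain Python. This is a toy sketch under stated assumptions, not GraphLab Create's implementation: it uses squared-error loss, a single numeric feature, and one-split trees (stumps). Every function name and the small data set are made up for illustration. Each new stump is fit to the residuals left by the trees before it, and the final prediction is the sum over trees t1 … tT.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree: pick the threshold that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put at least one point on each side
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    """Route x through the single split node to one of the two leaf values."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted_stumps(xs, ys, num_trees=50, learning_rate=0.3):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the shrunken contributions of every tree in the ensemble."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Made-up step-shaped data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_boosted_stumps(xs, ys)
print(predict(model, 2.0))
print(predict(model, 5.0))
```

The learning rate shrinks each tree's contribution, so many small corrections accumulate instead of one tree fitting everything at once. Production toolkits add deeper trees, multiple features, regularization, and other loss functions on top of this loop.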

http://goo.gl/forms/y8LYl53hje
?

DATA SCIENCE – DATA PROTOTYPING
Amanda Casari
acasari@prokarma.com
humor from xkcd
