2. {brief} intro to Spark
Step 1: Create a Resilient Distributed Dataset (RDD)
– Can contain arbitrary Java or Python objects
Step 2: Perform parallel operations
– Transformations define a new dataset (DAG)
– Actions kick off a job on the cluster
Source: Apache Spark
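A minimal PySpark sketch of the two steps above, assuming a fresh local session (names are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "rdd-demo")

    # Step 1: create an RDD from a Python collection
    rdd = sc.parallelize([1, 2, 3, 4, 5])

    # Step 2: transformations are lazy and only extend the DAG...
    squared = rdd.map(lambda x: x * x)

    # ...while an action kicks off an actual job on the cluster
    print(squared.collect())  # [1, 4, 9, 16, 25]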
3. ML in GraphLab Create
• Recommender systems
– Factorization-based methods
– Neighborhood-based item similarity
– Popularity-based methods
• Classification
– Deep learning
– Boosted trees
– Support vector machines
– Logistic regression
• Regression
– Boosted trees
– Deep learning
– Linear regression
• Text analysis
– Topic modeling (LDA)
– Featurization utilities: BM25, TF-IDF, stop-word removal, etc.
• Image analysis
– Deep learning
– Image load, resize, mean computation
• Clustering – k-means
• Graph analytics
• Nearest neighbors
• Vowpal Wabbit wrapper
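Most of these toolkits are effectively one-liners in GraphLab Create; a hedged sketch using the recommender toolkit (the toy SFrame and its column names are made up):

    import graphlab as gl

    # toy interaction data; column names are illustrative
    sf = gl.SFrame({'user_id': ['a', 'a', 'b', 'c'],
                    'item_id': ['x', 'y', 'y', 'z']})

    # GLC chooses a sensible default recommender for the data
    model = gl.recommender.create(sf, user_id='user_id', item_id='item_id')
    recs = model.recommend(users=['a'], k=2)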
MLlib + GraphX on Apache Spark
• Classification and regression
– Linear models (SVMs, logistic regression, linear regression)
– Naive Bayes
– Decision trees
– Ensembles of trees (random forests and gradient-boosted trees)
– Isotonic regression
• Collaborative filtering
– Alternating least squares (ALS)
• Clustering
– k-means
– Gaussian mixture
– Power iteration clustering (PIC)
– Latent Dirichlet allocation (LDA)
– Streaming k-means
• Dimensionality reduction
– Singular value decomposition (SVD)
– Principal component analysis (PCA)
• Optimization (developer)
– Stochastic gradient descent
– Limited-memory BFGS (L-BFGS)
• Graph analytics
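The MLlib side looks similar; a minimal sketch of k-means on an RDD, reusing sc from the earlier sketch (Spark 1.3-era mllib API, toy data):

    from pyspark.mllib.clustering import KMeans

    # toy 2-D points as an RDD of feature vectors
    points = sc.parallelize([[0.0, 0.0], [0.1, 0.1],
                             [9.0, 9.0], [9.1, 9.1]])

    # train k-means with k=2
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.predict([0.05, 0.05]))  # cluster id for a new point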
ML in PySpark > even better +
+ scikit-learn
+ pandas
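One reason the combination is "even better": a Spark DataFrame can be handed straight to pandas and scikit-learn once the data fits on the driver. A sketch, assuming an existing Spark DataFrame df with illustrative columns x1, x2, label:

    from sklearn.linear_model import LogisticRegression

    # toPandas() collects the whole DataFrame to the driver,
    # so this is only sensible for data that fits in local memory
    pdf = df.toPandas()
    clf = LogisticRegression().fit(pdf[['x1', 'x2']], pdf['label'])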
4. Concur customers (all data, January 2015)
• Concur Mobile: ~3.5M users
• TripIt: ~11M users
• Travel: ~18,750 clients
• Expense: ~16,500 clients
5. Build Data Science Products to Help Deliver the Perfect Trip through Personalization {and Intelligent Automation}
6. Use Case: Identifying Customer Behavior
• Can unsupervised learning help us learn more about our user behavior… without additional instrumentation?
• Business Problem: Automate the process of expense filing by predicting for a user:
1. the correct currency of a transaction
2. the correct reimbursement currency for a customer
• Data Set: Expense report transactions (AKA everyone's favorite part of business travel!)
7. A typical sprint planning...
Data Science Product Manager: “Just see how things go!”
8. A typical sprint demo...
Data Science Product Manager: “So it’s ready for production????”
9. problem formulation > tool-chain construction
? – How best to move quickly from raw data through a pipeline to:
1. Show the value of the work
2. Communicate results
3. Move models into the production pipeline
…NOTEBOOKS > Jupyter + PySpark + GraphLab Create
Workflow: use lots of different data sources > ETL > clean up your data > run sophisticated analytics > discover insights! > integrate results into pipelines > communicate results and value
Based on diagram source: Paco Nathan
10. QUICK demo: explore > readyForProd
{0. Initial ETL job with Spark}
1. Import data to SparkContext
2. Convert to DataFrame/RDD for more exploration
3. Convert Spark DataFrame to GraphLab Create SFrame
4. Exploration in GraphLab Create
5. Case: Clustering users based on transaction history
6. Additional steps to improve performance using GLC toolkits
7. Save model for deployment to production
8. Convert SFrame back to RDD
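A hedged end-to-end sketch of steps 1-8 in one place (Spark 1.3-era APIs; the from_rdd/to_rdd bridge follows the GraphLab Create Spark integration docs, whose exact signatures vary by GLC version; the file path and data are made up, and feature engineering for clustering is elided):

    import graphlab as gl
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[4]", "expense-demo")
    sqlContext = SQLContext(sc)

    # 1-2. import data and convert to a DataFrame for exploration
    df = sqlContext.jsonFile("expense_transactions.json")  # hypothetical file

    # 3. bridge Spark -> GraphLab Create
    sf = gl.SFrame.from_rdd(df.rdd, sc)

    # 4-5. explore, then cluster users on transaction history
    # (kmeans expects numeric feature columns)
    kmeans = gl.kmeans.create(sf, num_clusters=10)

    # 7. save the model for the production pipeline
    kmeans.save("kmeans_transactions.model")

    # 8. bridge back to Spark for downstream jobs
    rdd_out = sf.to_rdd(sc)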
14. my local setup
• Hardware:
– MacBook Pro (mid 2014)
– ~420 GB free disk space
– 16 GB RAM
– 2.5 GHz, 4 cores
…{Not exactly an enterprise AWS cluster}
• Software:
– Spark 1.2.2 / 1.3.1 / 1.4.0 for Hadoop 2.4 (using Spark 1.3.1 for the demo)
– GraphLab Create (internal build)
– Hadoop 2.5.2
– Scala 2.11.7
– Python 2.7.10 on Anaconda 2.3.0
15. working with glc & spark
• Get these things working together first!
– Hadoop 2.4+, YARN {if using Spark built on Hadoop}
– Spark 1.1+ {integrate with YARN if you plan to use yarn-client}
• Increased spark.driver.memory to 3g (ref here)
• Added spark.driver.extraClassPath to spark-defaults.conf (ref here); see the sketch after this list
• Install GraphLab Create in a clean Python 2.7+ environment
• Follow the instructions on how to point Spark at your GraphLab Create JAR
• If your JAVA_HOME is different from your HADOOP_JAVA_HOME, set GRAPHLAB_JAVA_HOME
• Helpful hints on working with IPython + Spark (including symlink info) here
• It's not just you, but it might be your configuration…
– Spark 1.3.1+ with the IPython kernel in yarn-client mode may have issues? {we will stick with local mode for the demo}
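A hedged sketch of the relevant conf/spark-defaults.conf entries (the JAR path is illustrative; check the GraphLab Create docs for the actual artifact name):

    # conf/spark-defaults.conf
    # driver memory must be set before the driver JVM launches,
    # which is why it lives here rather than in SparkConf at runtime
    spark.driver.memory          3g
    # hypothetical path to the GraphLab Create integration JAR
    spark.driver.extraClassPath  /path/to/graphlab-create-spark-integration.jar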