Scalable Machine Learning:
Single Machine to Distributed
Yucheng Low
Chief Architect
What is ML scalability?
Is this scalability?
[Chart: runtimes of Algorithm Implementation X (1600s, 800s, 400s, 200s) compared with the Best Single Machine Implementation (300s)]
True Scalability
How long does it take to get to a
predetermined accuracy?
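The runtimes on the "Is this scalability?" slide can be read two ways: as speedup relative to Algorithm Implementation X itself, or as speedup relative to the best single-machine implementation. A minimal sketch of the arithmetic follows. The machine counts 1, 2, 4, 8 are an assumption for illustration; the slide lists only the runtimes.

```python
# Runtimes in seconds, taken from the slide. The machine counts used as keys
# are hypothetical: the slide does not say how many machines each run used.
runtimes_x = {1: 1600.0, 2: 800.0, 4: 400.0, 8: 200.0}
best_single_machine = 300.0


def speedups(runtimes, baseline):
    """Speedup of each configuration relative to a fixed baseline runtime."""
    return {n: baseline / t for n, t in runtimes.items()}


# Speedup measured against the same implementation on one machine.
self_relative = speedups(runtimes_x, runtimes_x[1])
# Speedup measured against the best single-machine implementation.
true_relative = speedups(runtimes_x, best_single_machine)

for n in sorted(runtimes_x):
    print(n, self_relative[n], true_relative[n])
```

Measured against itself, Implementation X looks like it scales perfectly. Measured against the 300s baseline, it only pulls ahead at the 200s run.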
Not About:
How well you can implement Algorithm X
Understand the tradeoffs between different algorithms.
It is not about
Scaling Up vs. Scaling Out
It is about
Scaling Up and Scaling Out
Going as fast as you can,
on any hardware
• Assume bounded resources
• Optimize for data scalability
The Dato Way
• Scales excellently
• Requires fewer machines to solve a problem in the same runtime as other systems
Single Machine Scalability: Storage Hierarchy
Throughput    Capacity
~1-10 GB/s    0.1 TB
~1 GB/s       1 TB
~0.1 GB/s     10 TB
Random access
is very slow!
Good External Memory
Datastructures For ML
SFrame:
Scalable Tabular Data Manipulation
SGraph:
Scalable Graph Manipulation
Data is usually rows…
user movie rating
But data engineering is typically column transformations…
Feature engineering is columnar
Normalize the rating feature:
sf['rating'] = sf['rating'] / sf['rating'].sum()
Create a new feature:
sf['rating-squared'] = sf['rating'].apply(lambda rating: rating * rating)
Create a new dataset with 2 of the features:
sf2 = sf[['rating', 'rating-squared']]
Columns: user, movie, rating, rating squared
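To see why these operations favor a columnar layout, here is a small sketch using plain Python lists as columns. It is not the SFrame implementation; the `columns` dictionary and its sample values are hypothetical. Each step reads or writes only the columns it names.

```python
# A toy column store: every column is its own list (hypothetical sample data).
columns = {
    "user": ["alice", "bob", "carol"],
    "movie": ["m1", "m2", "m1"],
    "rating": [4.0, 5.0, 1.0],
}

# Normalize: reads and rewrites only the "rating" column.
total = sum(columns["rating"])
columns["rating"] = [r / total for r in columns["rating"]]

# New feature: derived from one column, stored as one new column.
columns["rating-squared"] = [r * r for r in columns["rating"]]

# Selecting a subset of columns builds a new table that shares the column
# data instead of copying rows.
selected = {name: columns[name] for name in ("rating", "rating-squared")}
```

In a row layout, each of these steps would have to touch every field of every row.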
SFrame
• Rich Datatypes
• Strong schema types: int, double, string, image, ...
• Weak schema types: list, dictionary (Can contain arbitrary JSON)
• Columnar Architecture
• Easy feature engineering + Vectorized feature operations.
• Lazy evaluation
• Statistics + sketches
• Type aware compression
Scalable Out-Of-Core Table Representation
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
160MB
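As an illustration of what "type aware compression" can mean for an integer column, the sketch below delta-encodes a non-decreasing column and stores each gap as a variable-length integer. This is a swapped-in textbook technique chosen for illustration, not the codec SFrame uses; all function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative int in 7-bit groups, low-order group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_sorted_ints(values):
    """Delta-encode a non-decreasing, non-negative int column as varints."""
    out = bytearray()
    prev = 0
    for v in values:
        out += encode_varint(v - prev)
        prev = v
    return bytes(out)


def decompress_sorted_ints(data):
    """Invert compress_sorted_ints."""
    values, prev, shift, cur = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += cur
            values.append(prev)
            shift, cur = 0, 0
    return values


column = [1000, 1000, 1003, 1010, 1500]
packed = compress_sorted_ints(column)
print(len(packed), "bytes instead of", 8 * len(column))
```

Knowing the column holds integers lets the encoder exploit small gaps, which a generic byte-level compressor cannot assume.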
Out of Core Machine Learning
Rethink all ML
Algorithms
Random Access Sequential Only
Sampling? Sort/Shuffle
Understanding the
Statistical/convergence impacts of ML
algorithm variations.
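One way to read "Sampling? Sort/Shuffle": when only sequential access is cheap, a shuffle can be expressed as a sort on random keys, and a sort can be done in bounded-memory chunks followed by a merge. The sketch below keeps the sorted runs in memory for brevity; an out-of-core version would write each run to disk. The function name and parameters are hypothetical.

```python
import heapq
import random


def external_shuffle(rows, chunk_size, seed=0):
    """Shuffle a stream by sorting on random keys, one bounded chunk at a time."""
    rng = random.Random(seed)
    runs, chunk = [], []
    for row in rows:  # single sequential pass over the input
        chunk.append((rng.random(), row))
        if len(chunk) == chunk_size:
            chunk.sort(key=lambda kv: kv[0])
            runs.append(chunk)  # a real system would spill this run to disk
            chunk = []
    if chunk:
        chunk.sort(key=lambda kv: kv[0])
        runs.append(chunk)
    # Merging sorted runs only ever reads each run sequentially.
    for _, row in heapq.merge(*runs, key=lambda kv: kv[0]):
        yield row


shuffled = list(external_shuffle(range(10), chunk_size=3))
print(shuffled)
```

Every step reads and writes data in order, so the access pattern stays sequential even though the result is a random permutation.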
Single Machine Scaling
[Chart: Runtime (axis 0 to 2500) for GraphLab-Create (1 Node), MLlib 1.3 (5 Node), MLlib 1.3 (1 Node), Scikit-Learn]
Dataset Source: LIBLinear binary classification datasets.
KDD Cup data: 8.4M data points, 20M features, 2.4GB compressed.
Task: Predict student performance on math problems based on interactions with a tutoring system
Single Machine Scaling
[Chart: Runtime (axis 0 to 900) for GraphLab-Create (1 Node), BIDMach (1 GPU Node)]
Criteo Kaggle: Click Prediction
46M rows
34M sparse coefficients
Not a Compute Bound Task
Graphs encode the relationships between:
People, Facts, Products, Interests, Ideas
Domains: Social Media, Advertising, Science, Web
•Big: trillions of vertices and edges and rich metadata
•Facebook (10/2012): 1B users, 144B friendships
•Twitter (2011): 15B follower edges
SGraph
1. Immutable disk-backed graph representation.
(Append only)
2. Vertex / Edge Attributes.
3. Optimized for bulk access, not fine-grained queries.
Get neighborhood of [5 Million Vertices]
Get neighborhood of 1 vertex
Standard Graph Representations
Edge List: Easy to Insert, Difficult to Query
src dest
1 102
132 10
48 999
129 192
998 23
392 124
New edges: 102 103, 349 13
Sparse Matrix / Sorted Edge List: Fast to Query, Difficult to Insert (random writes)
src dest
1 10
1 99
1 102
2 5
2 10
2 120
New edge: 1 105
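The tradeoff between the two representations can be sketched with the edges from the slide. Appending to an unsorted edge list is trivial but a neighbor query must scan everything; a sorted edge list answers the query with a binary search but an insert has to shift data. The helper names are hypothetical.

```python
import bisect

# Edges as (src, dest) pairs, copied from the slide.
edge_list = [(1, 102), (132, 10), (48, 999), (129, 192), (998, 23), (392, 124)]
sorted_edges = [(1, 10), (1, 99), (1, 102), (2, 5), (2, 10), (2, 120)]


def neighbors_scan(edges, src):
    """Unsorted edge list: every query scans all edges."""
    return [d for s, d in edges if s == src]


def neighbors_sorted(edges, src):
    """Sorted edge list: binary search for the run of edges starting at src."""
    lo = bisect.bisect_left(edges, (src,))
    hi = bisect.bisect_left(edges, (src + 1,))
    return [d for _, d in edges[lo:hi]]


# Insert into the edge list: a cheap append at the end.
edge_list.append((102, 103))
# Insert into the sorted list: lands in the middle, shifting later entries
# (on disk this becomes a random write).
bisect.insort(sorted_edges, (1, 105))

print(neighbors_scan(edge_list, 102))
print(neighbors_sorted(sorted_edges, 1))
```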
SGraph Layout
Vertices partitioned into p = 4 SFrames
Edges partitioned into p^2 = 16 SFrames
Vertex SFrames: 1, 2, 3, 4
__id Address ZipCode
Alice … 98105
Bob … 98102
__id Address ZipCode
John … 98105
Jack … 98102
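The p and p^2 layout can be sketched as follows: each vertex is assigned to one of p groups, and an edge goes to the group indexed by the pair (group of source, group of destination). The slides do not say how vertices are assigned to groups; hashing the id with CRC32 is an assumption made here for a deterministic example, and groups are numbered from 0 rather than 1.

```python
import zlib

P = 4  # number of vertex groups, as on the slide


def vertex_partition(vertex_id, p=P):
    """Assign a vertex to one of p groups (hash choice is an assumption)."""
    return zlib.crc32(str(vertex_id).encode("utf-8")) % p


def edge_partition(src, dst, p=P):
    """An edge lives in the group named by its endpoints' vertex groups."""
    return (vertex_partition(src, p), vertex_partition(dst, p))


# Edges from the slide's example edge table.
edges = [("Alice", "Bob"), ("Bob", "Charlie"), ("Charlie", "Alice")]

edge_groups = {}
for src, dst in edges:
    edge_groups.setdefault(edge_partition(src, dst), []).append((src, dst))

for group, members in sorted(edge_groups.items()):
    print(group, members)
```

With this layout, processing edge group (i, j) needs only vertex groups i and j in memory, never the whole vertex table.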
SGraph Layout
Vertex SFrames: 1, 2, 3, 4
__id Address ZipCode
John … 98105
Jack … 98102
Edge SFrames:
(1,1) (1,2) (1,3) (1,4)
(2,1) (2,2) (2,3) (2,4)
(3,1) (3,2) (3,3) (3,4)
(4,1) (4,2) (4,3) (4,4)
__src_id __dst_id Message
Alice Bob “hello”
Bob Charlie “world”
Charlie Alice “moof”
Common Crawl Graph
3.5 billion Nodes and 128 billion Edges
Largest available public Graph.
2 TB raw, 200GB compressed
Compression factor 10:1
12.5 bits per edge
Benefit From SFrame Compression Methods
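The figures on this slide are consistent with each other, which a few lines of arithmetic confirm. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), which the slide does not state.

```python
# Figures from the slide; decimal units are an assumption.
edges = 128e9            # 128 billion edges
compressed_bytes = 200e9  # 200 GB
raw_bytes = 2e12          # 2 TB

bits_per_edge = compressed_bytes * 8 / edges
raw_bytes_per_edge = raw_bytes / edges
compression_factor = raw_bytes / compressed_bytes

print(bits_per_edge)        # compressed bits per edge
print(raw_bytes_per_edge)   # uncompressed bytes per edge
print(compression_factor)
```

The raw size works out to a little under 16 bytes per edge, roughly what a pair of 8-byte vertex ids would occupy.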
Common Crawl Graph
1x r3.8xlarge using 1x SSD.
3.5 billion Nodes and 128 billion Edges
PageRank: 9 min per iteration.
Connected Components: ~ 1 hr.
There isn’t any general purpose library out there capable of this.
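PageRank suits a disk-backed edge store because each iteration is one sequential sweep over the edges. The sketch below is a textbook formulation in plain Python, not the SGraph implementation; it keeps per-vertex state in memory and assumes every vertex has at least one outgoing edge.

```python
def pagerank(num_vertices, edges, iterations=20, damping=0.85):
    """Textbook PageRank; each iteration is one sequential pass over edges.

    Assumes every vertex has an outgoing edge (no dangling vertices).
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        incoming = [0.0] * num_vertices
        for src, dst in edges:  # bulk, in-order access; no random edge lookups
            incoming[dst] += rank[src] / out_degree[src]
        rank = [(1.0 - damping) / num_vertices + damping * x for x in incoming]
    return rank


# Tiny hypothetical graph: 0 -> 1, 1 -> 2, 2 -> 0, 0 -> 2.
ranks = pagerank(3, [(0, 1), (1, 2), (2, 0), (0, 2)])
print(ranks)
```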
SFrame & SGraph
BSD License
(August)
Distributed
Train on bigger
datasets
Train Faster
Speedup Relative to Best Single
Machine Implementation
Extending Single Machine to Distributed
Data (X, Y) on one disk: time for 1 pass = 100s
Data (X, Y) split across parallel disks: time for 1 pass = 50s
Good External Memory Datastructures For ML Still Help
Distributed Optimization
Newton, LBFGS, FISTA, etc.
Repeat:
1. Parallel sweep over data (X, Y). Make sure this is embarrassingly parallel.
2. Synchronize parameters. Talk quickly.
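The sweep-then-synchronize loop can be sketched with the simplest possible optimizer. Plain gradient descent on a one-parameter least-squares model is swapped in here for Newton, LBFGS, or FISTA to keep the example short; the shards stand in for machines and the `sum` over partial gradients stands in for the parameter synchronization. All names and data are hypothetical.

```python
def partial_gradient(w, shard):
    """Sum of squared-error gradients for y ~ w * x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)


def distributed_gd(shards, steps=200, lr=0.01):
    n = sum(len(s) for s in shards)
    w = 0.0
    for _ in range(steps):
        # Parallel sweep: each "machine" reads only its own shard, and the
        # shards share nothing, so this step is embarrassingly parallel.
        partials = [partial_gradient(w, s) for s in shards]
        # Synchronize: one number per machine is combined (an all-reduce
        # in a real cluster), then every machine applies the same update.
        w -= lr * sum(partials) / n
    return w


# Hypothetical data generated by y = 3 * x, split across three shards.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)], [(4.0, 12.0)]]
w = distributed_gd(shards)
print(w)
```

Only the gradient crosses machine boundaries, so communication per step is proportional to the number of parameters, not the number of rows.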
Distributed Optimization
1. Data (X, Y) begins on HDFS
2. Every machine takes part of the data to local disk/SSD
3. Inter-machine communication by fast supercomputer-style primitives
Criteo Terabyte Click Logs
Click Prediction Task:
Whether visitor clicked on a link or not.
Criteo Terabyte Click Prediction
4.4 Billion Rows
13 Features
½ TB of data
[Chart: Runtime (axis 0 to 4000) vs. #Machines (axis 0 to 16); labeled points: 3630s and 225s]
Distributed Graphs
Graph Partitioning: Minimizing Communication
Vertex-Cut: Placing edges on machines, and letting vertices span machines
Communication is linear in the number of machines each vertex spans
Graph Partitioning
Communication Minimization
Tradeoff: time to compute a partition vs. quality of partition
Since large natural graphs are difficult to partition anyway…
How good a partition quality can we get while doing almost no work at all?
Random Partitioning
Randomly assign edges to machines (Machine 1, Machine 2, Machine 3)
But this is probably the worst partition you can construct. Can we do better?
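Random edge placement takes almost no work, and its cost can be measured as the average number of machines each vertex spans. A minimal sketch follows; the function names and the star-shaped test graph are hypothetical.

```python
import random


def random_vertex_cut(edges, num_machines, seed=0):
    """Place each edge on a random machine; return machines spanned per vertex."""
    rng = random.Random(seed)
    spans = {}
    for src, dst in edges:
        machine = rng.randrange(num_machines)
        spans.setdefault(src, set()).add(machine)
        spans.setdefault(dst, set()).add(machine)
    return spans


def replication_factor(spans):
    """Average number of machines a vertex spans (1.0 means no replication)."""
    return sum(len(machines) for machines in spans.values()) / len(spans)


# Hypothetical star graph: vertex 0 is connected to 100 leaves.
star = [(0, leaf) for leaf in range(1, 101)]
spans = random_vertex_cut(star, num_machines=3)
print(len(spans[0]), replication_factor(spans))
```

Low-degree vertices stay on one machine, while a high-degree vertex is spread across nearly all of them. Communication grows with that span, which is why random placement is a poor partition.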
SGraph Partitioning
Edge SFrames grouped into blocks:
(1,1) (1,2) (2,1) (2,2)
(3,1) (3,2) (4,1) (4,2)
(1,3) (1,4) (2,3) (2,4)
(3,3) (3,4) (4,3) (4,4)
Slides from a couple
of years ago
Distributed Graphs
New Graph Partitioning Ideas
Mixed in-core out-of-core
computation
Common Crawl Graph
[Chart: Runtime (axis 0 to 600) vs. #Machines (axis 0 to 16)]
16 Machines, (c3.8xlarge, 512 vCPUs)
45 sec per iteration
3B edges per second
3.5 billion Nodes and 128 billion Edges
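The throughput figure follows from the edge count and the per-iteration time, and it can be compared with the single-machine PageRank figure of 9 minutes per iteration given earlier. The comparison is only rough: the single-machine run used an r3.8xlarge and the cluster used c3.8xlarge instances.

```python
# Figures from the slides.
edges = 128e9
single_machine_seconds = 9 * 60   # PageRank, 1x r3.8xlarge
cluster_seconds = 45              # PageRank, 16x c3.8xlarge
machines = 16

cluster_edges_per_second = edges / cluster_seconds
single_edges_per_second = edges / single_machine_seconds
speedup = single_machine_seconds / cluster_seconds
efficiency = speedup / machines

print(cluster_edges_per_second)  # about 2.8e9, the slide rounds to 3B
print(single_edges_per_second)
print(speedup, efficiency)
```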
In search of Performance
Understand memory access patterns of algorithms:
Single Machine and Distributed
Sequential? Random?
Optimize data structures for access patterns
It is not merely about speed, or scaling
Doing more with what you already have
Excess Slides
Our Tools Are Easy To Use
import graphlab as gl
train_data = gl.SFrame.read_csv(traindata_path)
train_data['1grams'] = gl.text_analytics.count_ngrams(train_data['text'], 1)
train_data['2grams'] = gl.text_analytics.count_ngrams(train_data['text'], 2)
cls = gl.classifier.create(train_data, target='sentiment')
5 line sentiment analysis
But
You have preexisting code in Numpy, Scipy, Scikit-learn
Automatic Numpy Scaling
Automatic in-memory, type aware compression using SFrame type
compression technology.
import graphlab.numpy
Scalable numpy activation successful
Scales all numeric numpy arrays to datasets much larger than
memory. Works with scipy, sklearn.
Demo
Scikit-Learn SGDLinearClassifier, Airline Delay Dataset
[Chart: Runtime (s) (axis 0 to 4000) vs. Millions of Rows (axis 0 to 400); series: Numpy, Graphlab + numpy]
Caveats apply
- Sequential Access highly preferred.
- Scales most memory bound sklearn algorithms by at least 2x,
some by more.
Deep Learning Throughput GPU
[Chart: Images per Second (axis 0 to 30000) for H2O (4 node), H2O (16 node), H2O (63 node), GraphLab Create GPU]
Dataset Source: MNIST 60K examples, 784 dimensions
Source(s): H2O Deep Learning Benchmarks using a 4 layer architecture.

More Related Content

What's hot

Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariKarissa Rae McKelvey
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With PythonSarah Guido
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MDDonald Miner
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and developmentWes McKinney
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013MLconf
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data scienceAndy Petrella
 
Introduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonIntroduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonJen Stirrup
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark Summit
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Spark Summit
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDatabricks
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonWes McKinney
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2OSri Ambati
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...Sri Ambati
 

What's hot (20)

Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MD
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data science
 
Introduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonIntroduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and Python
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas Dinsmore
 
Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in Python
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2O
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 

Similar to Making Machine Learning Scale: Single Machine and Distributed

SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processinghuguk
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Matej Misik
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
G-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge ProcessingG-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge ProcessingPradeep Kumar
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Graph processing
Graph processingGraph processing
Graph processingyeahjs
 
Scalable Learning in Computer Vision
Scalable Learning in Computer VisionScalable Learning in Computer Vision
Scalable Learning in Computer Visionbutest
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentationehtshamelahi
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data eraBill GU
 
Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561Shobhit Saxena
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuDatabricks
 

Similar to Making Machine Learning Scale: Single Machine and Distributed (20)

SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processing
 
Malstone KDD 2010
Malstone KDD 2010Malstone KDD 2010
Malstone KDD 2010
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
G-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge ProcessingG-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge Processing
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Graph processing
Graph processingGraph processing
Graph processing
 
Scalable Learning in Computer Vision
Scalable Learning in Computer VisionScalable Learning in Computer Vision
Scalable Learning in Computer Vision
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentation
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data era
 
Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
 

More from Turi, Inc.

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing VideoTuri, Inc.
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission RiskTuri, Inc.
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Turi, Inc.
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Turi, Inc.
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Turi, Inc.
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Turi, Inc.
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataTuri, Inc.
 
Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsTuri, Inc.
 
Text Analysis with Machine Learning
Text Analysis with Machine LearningText Analysis with Machine Learning
Text Analysis with Machine LearningTuri, Inc.
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesTuri, Inc.
 
Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinTuri, Inc.
 
Scalable data structures for data science
Scalable data structures for data scienceScalable data structures for data science
Scalable data structures for data scienceTuri, Inc.
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Turi, Inc.
 
Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender SystemsTuri, Inc.
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringTuri, Inc.
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with DatoTuri, Inc.
 

More from Turi, Inc. (20)

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing Video
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission Risk
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets



Making Machine Learning Scale: Single Machine and Distributed

  • 1. Scalable Machine Learning: Single Machine to Distributed. Yucheng Low, Chief Architect.
  • 2. What is ML scalability?
  • 3. Is this scalability? [Chart: Algorithm Implementation X at 1600s, 800s, 400s, 200s; Best Single Machine Implementation at 300s.]
  • 4. True Scalability: How long does it take to get to a predetermined accuracy? Not about how well you can implement Algorithm X. Understand the tradeoffs between different algorithms.
  • 5. It is not about Scaling Up Scaling Out
  • 6. It is about Scaling Up Scaling Out Going as fast as you can, on any hardware
  • 7. The Dato Way: assume bounded resources; optimize for data scalability. Scales excellently; requires fewer machines to solve in the same runtime as other systems.
  • 8. Single Machine Scalability: Storage Hierarchy. Capacity and throughput: 0.1 TB at ~1-10 GB/s; 1 TB at ~1 GB/s; 10 TB at ~0.1 GB/s. Random access is very slow! Good External Memory Datastructures For ML.
  • 9. SFrame: Scalable Tabular Data Manipulation. SGraph: Scalable Graph Manipulation. [Figure labels: User, Com., Title, Body, User, Disc.]
  • 10. Data is usually rows… (user, movie, rating). But data engineering is typically column transformations…
  • 11. Feature engineering is columnar.
        Normalize a feature:
            sf['rating'] = sf['rating'] / sf['rating'].sum()
        Create a new feature:
            sf['rating-squared'] = sf['rating'].apply(lambda rating: rating*rating)
        Create a new dataset with 2 of the features:
            sf2 = sf[['rating', 'rating-squared']]
        [Table columns: user, movie, rating, rating squared]
  • 12. SFrame: Scalable Out-Of-Core Table Representation. Rich Datatypes: strong schema types (int, double, string, image, ...); weak schema types (list, dictionary; can contain arbitrary JSON). Columnar Architecture: easy feature engineering + vectorized feature operations; lazy evaluation; statistics + sketches; type aware compression. Netflix Dataset, 99M rows, 3 columns, ints: 1.4GB raw; 289MB gzip compressed; 160MB.
  • 13. Out of Core Machine Learning: Rethink all ML algorithms. Random Access → Sequential Only. Sampling? → Sort/Shuffle. Understand the statistical/convergence impacts of ML algorithm variations.
  • 14. Single Machine Scaling. [Bar chart: Runtime for GraphLab-Create (1 Node), MLlib 1.3 (5 Node), MLlib 1.3 (1 Node), Scikit-Learn.] Dataset Source: LIBLinear binary classification datasets. KDD Cup data: 8.4M data points, 20M features, 2.4GB compressed. Task: Predict student performance on math problems based on interactions with a tutoring system.
  • 15. Single Machine Scaling. [Bar chart: Runtime for GraphLab-Create (1 Node) and BIDMach (1 GPU Node).] Criteo Kaggle: Click Prediction. 46M rows, 34M sparse coefficients. Not a Compute Bound Task.
  • 16. Graphs encode the relationships between: People, Facts, Products, Interests, Ideas. Domains: Social Media, Advertising, Science, Web. Big: trillions of vertices and edges and rich metadata. Facebook (10/2012): 1B users, 144B friendships. Twitter (2011): 15B follower edges.
  • 17. SGraph: 1. Immutable disk-backed graph representation (append only). 2. Vertex / Edge Attributes. 3. Optimized for bulk access, not fine-grained queries: "Get neighborhood of [5 Million Vertices]" rather than "Get neighborhood of 1 vertex".
  • 18. Standard Graph Representations. Edge List (src, dest): (1, 102), (132, 10), (48, 999), (129, 192), (998, 23), (392, 124). Easy to insert, difficult to query. Sparse Matrix / Sorted Edge List (src, dest): (1, 10), (1, 99), (1, 102), (2, 5), (2, 10), (2, 120). Fast to query, difficult to insert (random writes).
  • 19. SGraph Layout: Vertices partitioned into p = 4 Vertex SFrames (1, 2, 3, 4). Example vertex SFrame (__id, Address, ZipCode): (Alice, …, 98105), (Bob, …, 98102).
  • 20. SGraph Layout: Edges partitioned into p^2 = 16 Edge SFrames, (1,1) through (4,4), alongside Vertex SFrames 1 to 4. Example vertex SFrame (__id, Address, ZipCode): (John, …, 98105), (Jack, …, 98102). Example edge SFrame (__src_id, __dst_id, Message): (Alice, Bob, "hello"), (Bob, Charlie, "world"), (Charlie, Alice, "moof").
  • 21. SGraph Layout: Vertex SFrames 1 to 4 and Edge SFrames (1,1) through (4,4), with the same example vertex and edge tables as the previous slide.
  • 23. Common Crawl Graph: 3.5 billion nodes and 128 billion edges. Largest available public graph. 2 TB → 200GB: compression factor 10:1, 12.5 bits per edge. Benefits from SFrame compression methods.
  • 25. Common Crawl Graph on 1x r3.8xlarge, using 1x SSD. 3.5 billion nodes and 128 billion edges. PageRank: 9 min per iteration. Connected Components: ~1 hr. There isn't any general purpose library out there capable of this.
  • 26. SFrame & SGraph BSD License (August)
  • 28. Train on bigger datasets. Train faster. Speedup relative to best single machine implementation.
  • 29. Extending Single Machine to Distributed. [Diagram: data X, Y.] Time for 1 pass = 100s.
  • 30. Extending Single Machine to Distributed. [Diagram: data X, Y split across parallel disks.] Time for 1 pass = 50s. Good External Memory Datastructures For ML still help.
  • 31. Distributed Optimization: Newton, LBFGS, FISTA, etc. Repeat: parallel sweep over data (X, Y), then synchronize parameters. Make sure the sweep is embarrassingly parallel. Talk quickly when synchronizing.
  • 32. Distributed Optimization: 1. Data begins on HDFS. 2. Every machine takes part of the data to local disk/SSD. 3. Inter-machine communication by fast supercomputer-style primitives.
  • 33. Criteo Terabyte Click Logs. Click Prediction Task: whether a visitor clicked on a link or not.
  • 34. Criteo Terabyte Click Prediction: 4.4 billion rows, 13 features, ½ TB of data. [Chart: Runtime vs. #Machines (0 to 16); labeled points: 3630s and 225s.]
  • 36. Graph Partitioning → Minimizing Communication. Communication is linear in the number of machines each vertex spans. Vertex-Cut: placing edges on machines, and letting vertices span machines.
  • 37. Graph Partitioning: Communication Minimization. [Axes: time to compute a partition vs. quality of partition.]
  • 38. Graph Partitioning: Since large natural graphs are difficult to partition anyway… How good a partition quality can we get while doing almost no work at all? [Axes: time to compute a partition vs. quality of partition.]
  • 39. Random Partitioning: Randomly assign edges to machines (Machine 1, Machine 2, Machine 3). But it is probably the worst partition you can construct. Can we do better?
  • 41. Slides from a couple of years ago
  • 42. Distributed Graphs: New graph partitioning ideas. Mixed in-core / out-of-core computation.
  • 43. Common Crawl Graph: 3.5 billion nodes and 128 billion edges. [Chart: Runtime vs. #Machines (0 to 16).] 16 machines (c3.8xlarge, 512 vCPUs): 45 sec per iteration, 3B edges per second.
  • 44. In search of Performance: Understand memory access patterns of algorithms, single machine and distributed. Sequential? Random? Optimize datastructures for access patterns.
  • 45. It is not merely about speed, or scaling. Doing more with what you already have.
  • 48. Our Tools Are Easy To Use: 5 line sentiment analysis.
        import graphlab as gl
        train_data = gl.SFrame.read_csv(traindata_path)
        train_data['1grams'] = gl.text_analytics.count_ngrams(train_data['text'], 1)
        train_data['2grams'] = gl.text_analytics.count_ngrams(train_data['text'], 2)
        cls = gl.classifier.create(train_data, target='sentiment')
      But you have preexisting code in Numpy, Scipy, Scikit-learn.
  • 49. Automatic Numpy Scaling: Automatic in-memory, type aware compression using SFrame type compression technology. Running "import graphlab.numpy" prints "Scalable numpy activation successful". Scales all numeric numpy arrays to datasets much larger than memory. Works with scipy, sklearn. Demo.
  • 50. Scikit Learn SGDLinearClassifier on the Airline Delay Dataset. [Chart: Runtime (s) vs. Millions of Rows; series: Numpy, Graphlab + numpy.]
  • 51. Automatic Numpy Scaling: Automatic in-memory, type aware compression using SFrame type compression technology. Running "import graphlab.numpy" prints "Scalable numpy activation successful". Scales all numeric numpy arrays to datasets much larger than memory. Works with scipy, sklearn. Demo. Caveats apply: sequential access highly preferred; scales most memory bound sklearn algorithms by at least 2x, some by more.
  • 52. Deep Learning Throughput (GPU). [Bar chart: Images per Second for H2O (4 node), H2O (16 node), H2O (63 node), GraphLab Create GPU.] Dataset Source: MNIST, 60K examples, 784 dimensions. Source(s): H2O Deep Learning Benchmarks using a 4 layer architecture.
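
Slide 18 contrasts an unsorted edge list (easy to insert, difficult to query) with a sorted edge list (fast to query, difficult to insert). The following is a minimal in-memory sketch of that tradeoff in plain Python, not SGraph's implementation; the function names and the sample edges are illustrative assumptions.

```python
from bisect import bisect_left, bisect_right

# Illustrative (src, dest) pairs, loosely based on the values on slide 18.
edges = [(1, 102), (132, 10), (48, 999), (129, 192),
         (998, 23), (392, 124), (1, 10), (1, 99)]

def neighbors_unsorted(edge_list, src):
    """Unsorted edge list: appending is cheap, but a query scans every edge."""
    return [d for s, d in edge_list if s == src]

# Sorting once groups all edges of a source together, so a query becomes
# two binary searches. The price is that inserting a new edge in the
# middle forces data to move (random writes when the list lives on disk).
sorted_edges = sorted(edges)
sources = [s for s, _ in sorted_edges]

def neighbors_sorted(src):
    """Sorted edge list: locate the run of edges for src by binary search."""
    lo = bisect_left(sources, src)
    hi = bisect_right(sources, src)
    return [d for _, d in sorted_edges[lo:hi]]

print(neighbors_unsorted(edges, 1))  # scan order: [102, 10, 99]
print(neighbors_sorted(1))           # sorted order: [10, 99, 102]
```

Both functions return the same neighbor set; only the cost profile differs, which is the point the slide makes.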
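
Slides 19 to 21 describe vertices split into p = 4 SFrames and edges split into p^2 = 16 SFrames indexed by (source partition, destination partition). A rough sketch of that bucketing is shown below. The slides do not say how a vertex is assigned to a partition, so the CRC32-based `vertex_partition` function is purely an assumption, as are the other names.

```python
import zlib
from collections import defaultdict

P = 4  # number of vertex partitions, as on slide 19

def vertex_partition(vertex_id, p=P):
    """Hypothetical assignment rule: stable hash of the id, mapped to 1..p."""
    return zlib.crc32(str(vertex_id).encode("utf-8")) % p + 1

def partition_graph(vertices, edges, p=P):
    """Bucket vertices into p blocks and edges into up to p*p blocks."""
    vertex_blocks = defaultdict(list)
    edge_blocks = defaultdict(list)
    for v in vertices:
        vertex_blocks[vertex_partition(v["__id"], p)].append(v)
    for e in edges:
        key = (vertex_partition(e["__src_id"], p),
               vertex_partition(e["__dst_id"], p))
        edge_blocks[key].append(e)
    return vertex_blocks, edge_blocks

vertices = [{"__id": name} for name in ("Alice", "Bob", "Charlie")]
edges = [
    {"__src_id": "Alice", "__dst_id": "Bob", "Message": "hello"},
    {"__src_id": "Bob", "__dst_id": "Charlie", "Message": "world"},
    {"__src_id": "Charlie", "__dst_id": "Alice", "Message": "moof"},
]
vertex_blocks, edge_blocks = partition_graph(vertices, edges)
for (i, j), block in sorted(edge_blocks.items()):
    print("edge block", (i, j), "needs vertex blocks", i, "and", j, ":", block)
```

The useful property of this layout is that processing edge block (i, j) only requires vertex blocks i and j to be loaded, which suits bulk, mostly sequential access.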
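
The Common Crawl figures on slides 23 and 43 can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes); the variable names are illustrative.

```python
EDGES = 128 * 10**9             # 128 billion edges
RAW_BYTES = 2 * 10**12          # 2 TB uncompressed
COMPRESSED_BYTES = 200 * 10**9  # 200 GB compressed
SECONDS_PER_ITERATION = 45      # 16 machines, slide 43

compression_factor = RAW_BYTES / COMPRESSED_BYTES  # 10.0, i.e. 10:1
bits_per_edge = COMPRESSED_BYTES * 8 / EDGES       # 12.5
raw_bytes_per_edge = RAW_BYTES / EDGES             # 15.625
edges_per_second = EDGES / SECONDS_PER_ITERATION   # about 2.84e9

print(compression_factor, bits_per_edge, raw_bytes_per_edge)
print(round(edges_per_second / 1e9, 2), "billion edges per second")
```

The raw size works out to roughly 16 bytes per edge, which is consistent with two 8-byte vertex ids per edge, and 128 billion edges in 45 seconds is about 2.8 billion edges per second, which the slide rounds to 3B.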
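
Slide 31 describes distributed optimization as alternating a parallel sweep over the data with a parameter synchronization. The sketch below simulates that loop on one machine with plain gradient descent on a least-squares objective. Gradient descent is swapped in here for simplicity in place of the Newton, LBFGS, or FISTA methods the slide names, and the shard layout, function names, and toy data are all assumptions.

```python
def local_gradient(shard, w):
    """One sequential sweep over a shard: summed gradient of 0.5*(w.x - y)^2."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad

def all_reduce_sum(vectors):
    """Stand-in for the synchronization step: element-wise sum across workers."""
    return [sum(column) for column in zip(*vectors)]

def distributed_gradient_descent(shards, dim, steps=200, lr=0.5):
    n = sum(len(shard) for shard in shards)
    w = [0.0] * dim
    for _ in range(steps):
        # Parallel sweep: each shard's gradient depends only on that shard
        # and the current w, so the sweeps are embarrassingly parallel.
        # (They run one after another here, in a single process.)
        grads = [local_gradient(shard, w) for shard in shards]
        # Synchronize: only dim numbers per worker cross the network.
        total = all_reduce_sum(grads)
        w = [wi - lr * g / n for wi, g in zip(w, total)]
    return w

# Toy data generated from y = 2*x0 + 3*x1, split across two "machines".
shards = [
    [((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0)],
    [((1.0, 1.0), 5.0), ((2.0, 1.0), 7.0)],
]
w = distributed_gradient_descent(shards, dim=2)
print([round(wi, 4) for wi in w])  # close to [2.0, 3.0]
```

The per-step communication is a single parameter-sized vector per worker, independent of the number of rows, which is why keeping the synchronization fast matters more as machines are added.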
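
Slides 36 and 39 state that, under a vertex-cut, communication is linear in the number of machines each vertex spans, and that random edge placement needs almost no work to compute. A small sketch of random vertex-cut placement and the resulting average span (often called the replication factor) follows; the function names, the seed, and the star-shaped test graph are illustrative assumptions.

```python
import random
from collections import defaultdict

def random_vertex_cut(edges, num_machines, seed=0):
    """Assign every edge to a uniformly random machine (vertex-cut)."""
    rng = random.Random(seed)
    placement = {}
    spans = defaultdict(set)  # vertex -> set of machines holding one of its edges
    for src, dst in edges:
        machine = rng.randrange(num_machines)
        placement[(src, dst)] = machine
        spans[src].add(machine)
        spans[dst].add(machine)
    return placement, spans

def replication_factor(spans):
    """Average number of machines a vertex spans; lower means less communication."""
    return sum(len(machines) for machines in spans.values()) / len(spans)

# A star graph: one high-degree hub connected to 1000 leaves.
star = [(0, leaf) for leaf in range(1, 1001)]
placement, spans = random_vertex_cut(star, num_machines=3)
print("hub spans", len(spans[0]), "machines")
print("replication factor:", replication_factor(spans))
```

Each leaf has a single edge and therefore lives on exactly one machine, while the hub is spread across machines; the replication factor is the quantity that better partitioning heuristics try to reduce.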