SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Machine Learning at the Limit
John Canny*^
* Computer Science Division
University of California, Berkeley
^ Yahoo Research Labs
@Alpine Labs, March, 2015

Outline
Scaling Inward:
• What are the performance limits for machine learning on
single machines, and how close can we get to them?
(Single-node Roofline Design)
Scaling Out:
• What happens when we try to scale up from optimized
single nodes to clusters?
(Roofline Design for Clusters)

My Other Job(s)
Yahoo [Chen, Pavlov, Canny, KDD 2009]*
Ebay [Chen, Canny, SIGIR 2011]**
Quantcast 2011-2013
Microsoft 2014
Yahoo 2015
* Best application paper prize
** Best paper honorable mention

Data Scientist’s Workflow
Digging Around
in Data
Hypothesize
Model
Customize
Large Scale
Exploitation
Evaluate
Interpret
Sandbox
Production

Why Build a New ML Toolkit?
• Performance: GPU performance pulling away from other
platforms for *sparse* and dense data.
Minibatch + SGD methods dominant on Big Data,…
• Customizability: Great value in customizing models (loss
functions, constraints,…)
• Explore/Deploy: Explore fast, run the same code in
prototype and production. Be able to run on clusters.

Desiderata
• Performance:
• Roofline Design (single machine and cluster)
• General Matrix Library with full CPU/GPU acceleration
• Customizability:
• Modular Learner Architecture (reusable components)
• Likelihood “Mixins”
• Explore/Deploy:
• Interactive, Scriptable, Graphical
• JVM based (Scala) w/ optimal cluster primitives

Roofline Design (Williams, Waterman, Patterson, 2009)
• Roofline design establishes fundamental performance
limits for a computational kernel.
Operational Intensity (flops/byte)
Throughput(gflops)
1
10
100
1000
0.01 0.1 1 10 100
GPU ALU throughput
CPU ALU throughput
1000

A Tale of Two Architectures
Intel CPU NVIDIA GPU
Memory Controller
L3 Cache
Core
ALU
Core
ALU
Core
ALU
Core
ALU
L2 Cache
ALU ALU ALU ALU ALU ALU

CPU vs GPU Memory Hierarchy
Intel 8 core Sandy Bridge CPU NVIDIA GK110 GPU
8 MB L3 Cache
1.5 MB L2 Cache
4 MB register file (!)4kB registers:
1 MB Shared Mem
2 MB L2 Cache
512K L1 Cache
1 MB Constant Mem
1 TB/s1 TB/s
40 TB/s
13 TB/s
5 TB/s
10s GB Main Memory 4 GB Main Memory
20 GB/s
500 GB/s
500 GB/s
200 GB/s

Roofline Design – Matrix kernels
• Dense matrix multiply
• Sparse matrix multiply
Operational Intensity (flops/byte)
Throughput(gflops)
1
10
100
1000
0.01 0.1 1 10 100
GPU ALU throughput
CPU ALU throughput
1000

DataSource
(JBOD disks)
Learner
Model
Optimizer
Mixins
Model
Optimizer
Mixins
GPU 1 thread 1
GPU 2 thread 2
:
:
CPU host code
data
blocks
DataSource
(Memory)
DataSource
HDFS over
network
Zhao+Canny
SIAM DM 13, KDD 13, BIGLearn 13
A Rooflined Machine Learning Toolkit
Compressed disk streaming at
~ 0.1-2 GB/s  100 HDFS nodes
30 Gflops to
2 Teraflops per GPU

Matrix + Machine Learning Layers
Written in the beautiful Scala language:
• Interpreter with JIT, scriptable.
• Open syntax +,-,*, ,, etc, math looks like math.
• Java VM + Java codebase – runs on Hadoop, Yarn, Spark.
• Hardware acceleration in C/C++ native code (CPU/GPU).
• Easy parallelism: Actors, parallel collections.
• Memory management (sort of ).
• Pre-built for multiple Platforms (Windows, MacOS, Linux).
Experience similar to Matlab, R, SciPy

Benchmarks
Recent benchmarks on some representative tasks:
• Text Classification on Reuters news data (0.5 GB)
• Click prediction on the Kaggle Criteo dataset (12 GB)
• Clustering of handwritten digit images (MNIST) (25 GB)
• Collaborative filtering on the Netflix prize dataset (4 GB)
• Topic modeling (LDA) on a NY times collection (0.5 GB)
• Random Forests on a UCI Year Prediction dataset (0.2 GB)
• Pagerank on two social network graphs at 12GB and 48GB

Benchmarks
Systems (single node)
• BIDMach
• VW (Vowpal Wabbit) from Yahoo/Microsoft
• Scikit-Learn
• LibLinear
Cluster Systems
• Spark v1.1 and v1.2
• Graphlab (academic version)
• Yahoo’s LDA cluster

Benchmarks: Single-Machine Systems
System Algorithm Dataset Dim Time
(s)
Cost
($)
Energy
(KJ)
BIDMach Logistic Reg. RCV1 103 14 0.002 3
Vowpal
Wabbit
Logistic Reg. RCV1 103 130 0.02 30
LibLinear Logistic Reg. RCV1 103 250 0.04 60
Scikit-Learn Logistic Reg. RCV1 103 576 0.08 120
RCV1: Text Classification, 103 topics (0.5GB).
Algorithms were tuned to achieve similar accuracy.

Benchmarks: Cluster Systems
System A/B Algorithm Dataset Dim Time
(s)
Cost
($)
Energy
(KJ)
Spark-72
BIDMach
Logistic Reg. RCV1 1
103
30
14
0.07
0.002
120
3
Spark-64
BIDMach
RandomForest YearPred 1 280
320
0.48
0.05
480
60
Spark-128
BIDMach
Logistic Reg. Criteo 1 400
81
1.40
0.01
2500
16
Spark-XX = System with XX cores
BIDMach ran on one node with GTX-680 GPU

Benchmarks: Cluster Systems
System A/B Algorithm Dataset Dim Time
(s)
Cost
($)
Energy
(KJ)
Spark-384
BIDMach
K-Means MNIST 4096 1100
735
9.00
0.12
22k
140
GraphLab-576
BIDMach
Matrix
Factorization
Netflix 100 376
90
16
0.015
10k
20
Yahoo-1000
BIDMach
LDA (Gibbs) NYtimes 1024 220k
300k
40k
60
4E10
6E7
Spark-XX or GraphLab-XX = System with XX cores
Yahoo-1000 had 1000 nodes

BIDMach at Scale
Latent Dirichlet Allocation
BIDMach outperforms cluster systems on this problem, and
has run up to 10 TB on one node.
Convergence on 1TB data

Benchmark Summary
• BIDMach is at least 10x faster than other single-machine
systems for comparable accuracy.
• For Random Forests or single-class regression, BIDMach on
a GPU node is comparable with 8-16 worker clusters.
• For multi-class regression, factor models, clustering etc.,
GPU-assisted BIDMach is comparable to 100-1000-worker
clusters. Larger problems correlate with larger values in this
range.

Cost and Energy Use
• BIDMach had a 10x-1000x cost advantage over the other
systems. The ratio was higher for larger-scale problems.
• Energy savings were similar to the cost savings, at 10x-
1000x.

In the Wild (Examples from Industry)
• Multilabel regression problem (summer intern project):
• Existing tool (single-machine) took ~ 1 week to build a model.
• BIDMach on a GPU node takes 1 hour (120x speedup)
• Iteration and feature engineering gave +15% accuracy.
• Auction simulation problem (cluster job):
• Existing tool simulates auction variations on log data.
• On NVIDIA 3.0 devices (64 registers/thread) we achieve a 70x
speedup over a reference implementation in Scala
• On NVIDIA 3.5 devices (256 registers/thread) we can move
auction state entirely into register storage and gain a 400x
speedup.

In the Wild (Examples from Industry)
• Classification (cluster job):
• Cluster job (logistic regression) took 8 hours.
• BIDMach version takes < 1 hour on a single node.
• SVMs for image classification (single machine)
• Large multi-label classification took 1 week with LibSVM.
• BIDMach version (SGD-based SVM) took 90 seconds.

Performance Revisited
• BIDMach had a 10x-1000x cost advantage over the other
systems. The ratio was higher for larger-scale problems.
• Energy savings were similar to the cost savings, at 10x-
1000x.
But why??
• We only expect about 10x
from GPU acceleration?

The Exponential Growth of O(1)
Once upon a Time…
a = b + c // single instruction
y = A[i] // single instruction
Now five orders
of magnitude

The Exponential Growth of O(1)
By slight abuse of notation, let O(1) denote the ratio of times
for a main memory access vs an ALU operation.
Then O(1) is not only not “small,” but its growing exponentially
with time.
A lot of non-rooflined code is too “memory-hungry” – you can
often do (much) better with custom, memory-aware kernels.
Log (Memaccess/ALUop)
Time (Years)

Performance Creep
• Code that isnt explicitly rooflined seems to be creeping
further and further from its roofline limit.
• With Intel’s profiling tools we can check the throughput of a variety of
running programs – often 10s–100s of Mflops – memory bound.
• Need well-designed kernels (probably hardware-aware
code) to get to the limits.
• Coding for performance:
Avoid:
Large hash tables
Large binary trees
Linked data structures
Use:
Sorting
Memory B-trees
Packed data structures

BIDMach MLAlgorithms
1. Regression (logistic, linear)
2. Support Vector Machines
3. k-Means Clustering
4. Topic Modeling - Latent Dirichlet Allocation
5. Collaborative Filtering
6. NMF – Non-Negative Matrix Factorization
7. Factorization Machines
8. Random Forests
9. Multi-layer neural networks
10. IPTW (Causal Estimation)
11. ICA
= Likely the fastest implementation available

CoDesign: Gibbs Sampling
• Gibbs sampling is a very general method for inference in
machine learning, but typically slow.
• But it’s a particularly good match
for SIMT hardware (GPUs), not
difficult to get 10-100x speedups.
• But many popular models have slow convergence:
• Latent Dirichlet Allocation
• 10 passes of online Variational Bayes ≈
• 1000 passes for collapsed Gibbs sampling

Consequences
LDA graphical model:
• Topic samples Z are discrete and sparse: very few samples
for a specific topic in a particular document.
• This gives a high-variance random walk through parameter
space before  converges.
Documents

Gibbs Parameter Estimation
Gibbs sampling in ML often applied to distributions of the
form
P(D,Z,)
Where D is the data,  are the parameters of the model, and
X are other hidden states we would like to integrate over.
e.g for LDA,  are the ,  parameters and Z are
assignments of words to topics.
• Gibbs sampling most often simply samples over both, which
is slow.
• But we really want to optimize over , and marginalize
(integrate) over X.

State-Augmented Marginal Estimation (SAME)
– Doucet et al. 2002
The marginal parameter distribution is 𝑃 Θ = 𝑃 𝑋, Θ 𝑑𝑋
We can sharpen it to 𝑃 𝑘
Θ as follows
𝑃 𝑘 Θ = … 𝑃 𝑋, Θ
𝑍𝑌𝑋
𝑃 𝑌, Θ … 𝑃 𝑍, Θ 𝑑𝑋𝑑𝑌 … 𝑑𝑍
𝑃 𝑘
Θ has the same peaks as 𝑃 Θ and corresponds to a
Gibbs distribution cooled to T = 1/k.
We modify a standard, blocked Gibbs sampler to take k
samples instead of 1 for X, and then recompute Θ from
these samples.

SAME Sampling
In the language of graphical models:
Run independent simulations with tied parameters Θ
Θ

SAME sampling as cooling
What cooling does:
Likelihood P() in model parameter space (peaks are good
models)

Making SAME sampling fast
The approach in general:
• Wherever you would take 1 sample before, take k
independent samples and average for parameter estimation.
Instead of taking distinct samples take counts of
anonymous samples – complexity independent of k
• Exact for LDA, HMM, Factorization Machines,…
• Corresponds to factored joint approximation for other
models.
Varying k allows for annealing optimization over .

Results on LDA
• We use k=100, make  10 passes over the dataset
(effectively taking 1000 total samples).
• Our SAME Gibbs sampler is within a factor of 3 of online
Variational Bayes. More than 100x faster than standard
Gibbs samplers.
• The model was more accurate than any other LDA method
we tested:

SAME sampling in general
• SAME sampling is an alternative to other message-passing
methods (VMP etc.)
• But rather than using bounds on posterior distributions it:
• Samples from the distribution of “uninteresting” variables
• Sharpens the posteriors on parameter variables.

Current Work
Inference on general graphical
models with Gibbs sampling.
Larger graphical models arise
in many applications but are
discarded for performance
reasons.
Existing tools (JAGS, Stan, Infer.net) are far from the roofline
for this problem.

Inference for Bayesian Networks
• (Main memory) Rooflined Approach: Sampling distribution
can be computed with an SpMM – 1-2GF on CPU and 5-
10GF on GPU.
• Experiment on MOOC learning concept graph with ~300
nodes and ~350 edges, 3000 samples

Getting to Warp Speed
The model for this problem
(CPTs for all nodes) fits
comfortably in GPU SHMEM.
The state of thousands of
particles fits comfortably in
GPU register memory.
It should be possible to gain another 10x to 100x in
performance with this design, and approach ALU limits.
4 MB registers
1 MB Shared Mem

Challenges with Scaling MB algorithms
Can we combine node acceleration with clustering?
Its harder than it seems:
Dataset
Minibatch
updates
Single-node Minibatch Learner
Dataset partitions across nodes
Cluster-based Learner
Reduce (Allreduce)

Allreduce
Every node j has data Uj (a local copy of a global model).
We want to compute an aggregate (e.g. sum) of all data
vectors and distribute it to all nodes, i.e. to synchronize the
model across nodes

Why Allreduce?
• Speed: Minibatch SGD and related algorithms are the
fastest way to solve many of these problems (deep
learning, CF, topic modeling,…). But dataset sizes (several
TB) make single machine training slow (> 1 day).
• The more updates/sec the faster the method converges,
but:
• In practice minibatch dof ~ model dof is near-optimal.
• That means the ideal B/W for model updates is at least the
rate of data consumption (several Gb/s).
• In other words, we want an Allreduce that consumes data
at roughly the full NIC bandwidth.

Why Allreduce?
• Scale: Some models wont fit on one machine (e.g.
Pagerank).
• Other models will fit but minibatch updates only touch a
small fraction of features (Click Prediction). Updating only
those is much faster than a full Allreduce.
• Sparse Allreduce: Each node j pushes updates to a
subset of features Uj and pulls a subset Vj from a
distributed model.

Why Allreduce?
• Worker driven (emergency room model) vs. aggregator-
driven (doctor’s surgery)?
• “Excuse me, worker task, the aggregator will consume
your data now”

Lower Bounds for total message B/W.
• Intuitively, every node needs to send D values
somewhere, and receive D values for the final aggregate.
• The data from each node needs to be caught somewhere
if allreduce happens in the same network, and the final
aggregate message need to originate from somewhere.
• This suggests a naïve lower bound of 2D values in or out
of each node, or 2DN for an N-node network.
• In fact more careful analysis shows the lower bound is
2D(N-1) for total bandwidth, and ~ 2D/B for latency:
Patarasuk et al. Bandwidth optimal all-reduce algorithms for clusters of
workstations

Why not Parameter servers?
• Potentially B/W optimal, but latency limited by net B/W of
servers:
Clients
servers
This network has at least 2D(N-1)/PB latency with P
servers, suboptimal by (N-1)/P.

A Roofline For networks
• Optimal operating point
Packet size (kBytes)
Throughput(MB/s)
1
10
100
1000
1 10 100 1000 10000
Network capacity
100000
10000

Allreduce (example: Click Prediction)
• To run the most efficient solver (distributed SGD), we need to
synchronize models across the network.
• An Allreduce (reduce+broadcast) operation both
synchronizes and combines the updates from each node.
• For Logistic Regression, SVM and many other problems, we
need a Sparse Allreduce.

What about MapReduce?
The all-to-all Reduce implementations in Hadoop, Spark,
Powergraph etc. having scaling problems because of message
overhead. Smaller messages  lower throughput.
During reduce, N nodes exchange
O(N2) messages of diminishing size.

Solutions (Dense data)
• For dense data, there are optimal (and fast in practice)
methods for Allreduce on clusters.
• They are implemented in MPI, but not always used by
default.
• In a recent Baidu paper (Wu et al.*), the authors achieved
near-linear speedup on an Infiniband cluster using MPI all-to-
all primitives.
* “Deep Image: Scaling up Image Recognition” Arxiv 1501.02876v2

Kylix: A Sparse AllReduce
for Commodity Clusters
Huasha Zhao
John Canny
Department of Electrical Engineering
and Computer Sciences
University of California, Berkeley

Power-Law Data
Term-Document Graph

Power-Law Data
𝑭 = 𝒓−𝜶
F: frequency
r: rank
α: power law exponent

Sparse AllReduce
• Allreduce is a general primitive for distributed graph
mining and machine learning.
• Reduce, e.g. 𝑣 = 𝑣𝑖
𝑚
𝑖=1 or other commutative and associative aggregate
operator such as mean, max etc.
• And redistributed back across nodes.
• Fast Allreduce makes the cluster behave like a single giant node.
• 𝑣𝑖 is sparse (power law) for big data problems, and each
cluster node may not require all of the sum v but only a
sparse subset.
• Kylix is an efficient Sparse Allreduce primitive for
computing on commodity clusters.

• Loss function
• The derivative of loss, which defines the SGD update, is a
scaled copy of Xi – same sparse non-zeros as Xi
Mini-batch Machine Learning
…Features
Examples
Xi Xi+1

Mini-batch Machine Learning
…Features
Examples
Xi Xi+1
Pull requests:
non-zeros of
next iteration
Push requests:
non-zeros of
current iteration

Kylix – Nested Heterogeneous Butterfly
• Data split to 6 machines

• Each machine is responsible for reducing 1/6 of the indices.

• Scatter-Reduce along dimension 1
• data split by 3

• Scatter-Reduce along dimension 2
• Reduce for shorter range, but denser data
grey scale indicates density

• Allgather in a reverse order
• Redistribute reduced values
grey scale indicates density

• Heterogeneous degree allows tailoring message size for each
layer in order to achieve optimal message size (larger than
the efficient message floor).

• Heterogeneous degree allows tailoring message size for each
layer – achieve optimal message size.
• Nesting allows indices to be collapsed down the network, so
that communication is greatly reduced in the lower layers –
works particularly well for power data.

Kylix - Summary
• Each network node specifies a set of input indices to pull
and a set output indices to push.
• Configuration step
• Build the routing table, once for static graphs, multiple times for
dynamic graphs
• Partitioning
• Union/Collapsing indices
• Mapping
• Reduction step
• ConfigReduce
• Can be performed in a single down-pass

Tuning Degrees of the Network
• To compute the optimal degree for that layer, we need to know the
amount of data per node. That will be the length of the partition
range at that node times the density.
• Given this data size, we find the
largest degree d such that P/d is
larger then the message floor.
• Degree / diameter trade-off
• degree^diameter = const.

Data Volumes at Each Layer
• Total communication across all layers a small constant larger than
the top layer, which is close to optimal.
• Communication volume across layers has a characteristic Kylix
shape.

Experiments (PageRank)
• Twitter Followers’ Graph
• 1.4 billion edges and 40 million vertices
• Yahoo Web Graph
• 6.7 billion edges and 1.4 billion vertices
• EC2 cluster compute node (cc2.8xlarge)
90-node Yahoo M45 64-node EC2 64-node EC2

Summary
• Roofline design exposes fundamental limits for ML tasks,
and provides guidance to reaching them.
• Roofline design gives BIDMach one to several orders of
magnitude advantage over other tools – most of which ware
well below roofline limits.
• Rooflining is a powerful tool for network primitives and
allows cluster computing to approach linear speedups
(strong scaling).

Software (version 1.0 just released)
Code: github.com/BIDData/BIDMach
Wiki: http://bid2.berkeley.edu/bid-data-project/overview/
BSD open source libs and dependencies.
In this release:
• Random Forests
• Double-precision GPU matrices
• Ipython/IScala Notebook
• Simple DNNs
Wrapper for Berkeley’s Caffe coming soon)

Thanks
Sponsors:
Collaborators:

SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Similar to SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny (20)

More from Chester Chen

More from Chester Chen (20)

Recently uploaded

Recently uploaded (20)

SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny