Machine Learning on Big Data
Lessons Learned from Google Projects

Max Lin
Software Engineer | Google Research

Massively Parallel Computing | Harvard CS 264
Guest Lecture | March 29th, 2011

Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems

“Machine Learning is a study
of computer algorithms that
improve automatically
through experience.”

Training: Input X → Output Y

• “The quick brown fox jumped over the lazy dog.” → English
• “To err is human, but to really foul things up you need a computer.” → English
• “No hay mal que por bien no venga.” → Spanish
• “La tercera es la vencida.” → Spanish

Model f(x)

Testing: f(x’) = y’

• “To be or not to be -- that is the question” → ?
• “La fe mueve montañas.” → ?

Linear Classifier

Input: “The quick brown fox jumped over the lazy dog.”

Features: [ ‘a’, ... ‘aardvark’, ... ‘dog’, ... ‘the’, ... ‘montañas’, ... ]
x = [ 0, ... 0, ... 1, ... 1, ... 0, ... ]
w = [ 0.1, ... 132, ... 150, ... 200, ... -153, ... ]

$$f(x) = w \cdot x = \sum_{p=1}^{P} w_p x_p$$
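
A minimal sketch of the linear classifier above, assuming a toy five-word vocabulary (the real vocabulary has P entries) and the five weights shown on the slide. The names `VOCAB`, `W`, `featurize`, and `score` are illustrative and not from the lecture.

```python
# Hypothetical toy vocabulary; the slide's real vocabulary has P entries.
VOCAB = ["a", "aardvark", "dog", "the", "montañas"]
# Weights copied from the slide's w vector, one per vocabulary word.
W = [0.1, 132.0, 150.0, 200.0, -153.0]


def featurize(text, vocab=VOCAB):
    """Return a 0/1 vector x with x[p] = 1 if vocab[p] occurs in text."""
    words = {token.strip(".,").lower() for token in text.split()}
    return [1 if word in words else 0 for word in vocab]


def score(x, w=W):
    """f(x) = w . x = sum over p of w_p * x_p."""
    return sum(w_p * x_p for w_p, x_p in zip(w, x))


x = featurize("The quick brown fox jumped over the lazy dog.")
print(x)         # [0, 0, 1, 1, 0]
print(score(x))  # 350.0
```

Only ‘dog’ and ‘the’ are present, so the score is the sum of their two weights.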

Training Data
Input X: N × P matrix (N examples, one per row; P features, one per column)
Output Y: one label per example

Typical machine learning
data at Google

N: 100 billion / 1 billion
P: 1 billion / 10 million
(mean / median)

http://www.flickr.com/photos/mr_t_in_dc/5469563053

Classifier Training

• Training: Given {(x, y)} and f, minimize the following objective function

$$\arg\min_{w} \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)$$

Use Newton’s method?

$$w^{t+1} \leftarrow w^{t} - H(w^{t})^{-1} \nabla J(w^{t})$$
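
One way to read the question mark, sketched as back-of-the-envelope arithmetic: the Hessian H is a P × P matrix. The figures below assume it is stored densely as 8-byte floats, which is an assumption rather than something the slides state, and use the mean and median P quoted for typical Google data. The helper name `dense_hessian_bytes` is illustrative.

```python
def dense_hessian_bytes(num_features, bytes_per_entry=8):
    """Bytes for a dense P x P Hessian (8-byte entries are an assumption)."""
    return num_features * num_features * bytes_per_entry


for label, p in [("median P", 10_000_000), ("mean P", 1_000_000_000)]:
    terabytes = dense_hessian_bytes(p) / 1e12
    print(f"{label}: {terabytes:,.0f} TB")
# median P: 800 TB
# mean P: 8,000,000 TB
```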

http://www.flickr.com/photos/visitfinland/5424369765/

Scaling Up

• Why big data?
• Parallelize machine learning algorithms
  • Embarrassingly parallel
  • Parallelize sub-routines
  • Distributed learning

Subsampling

Big Data: Shard 1, Shard 2, Shard 3, ... Shard M
Reduce N: Shard 1 → Machine → Model

Why not Small Data?

[Banko and Brill, 2001]

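Parallelize Estimates
• Naive Bayes Classifier

$$\arg\min_{w} \; - \prod_{i=1}^{N} \prod_{p=1}^{P} P(x^i_p \mid y_i; w) \, P(y_i; w)$$

• Maximum Likelihood Estimates

$$w_{\text{the} \mid \text{EN}} = \frac{\sum_{i=1}^{N} 1_{\text{EN},\text{the}}(x^i)}{\sum_{i=1}^{N} 1_{\text{EN}}(x^i)}$$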
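
The maximum likelihood estimate above is a ratio of two counts, which a few lines can illustrate. This sketch assumes whitespace tokenization and reuses the four training sentences from the introduction; `mle_weight` and `DOCS` are illustrative names.

```python
def mle_weight(word, label, documents):
    """Fraction of documents with `label` that contain `word`."""
    in_class = [text for text, y in documents if y == label]
    with_word = [text for text in in_class if word in text.lower().split()]
    return len(with_word) / len(in_class)


DOCS = [
    ("The quick brown fox jumped over the lazy dog.", "EN"),
    ("To err is human, but to really foul things up you need a computer.", "EN"),
    ("No hay mal que por bien no venga.", "ES"),
    ("La tercera es la vencida.", "ES"),
]
print(mle_weight("the", "EN", DOCS))  # 0.5
print(mle_weight("la", "ES", DOCS))   # 0.5
```

Because the estimate needs only counts, it has a closed form and no iterative optimization is required.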

Word Counting

Map
X: “The quick brown fox ...”
Y: EN
→ (‘the|EN’, 1), (‘quick|EN’, 1), (‘brown|EN’, 1)

Reduce
C(‘the’|EN) = SUM of values = 3

Word Counting

Big Data: Shard 1, Shard 2, Shard 3, ... Shard M
Map: Mapper 1, Mapper 2, Mapper 3, ... Mapper M → (‘the’ | EN, 1) (‘fox’ | EN, 1) ... (‘montañas’ | ES, 1)
Reduce: Reducer → Tally counts and update w
Model
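
The map and reduce steps for word counting can be simulated in a single process. This is a sketch only: it uses no real MapReduce API, the two shards are hypothetical, and `map_document` and `reduce_counts` are illustrative names.

```python
from collections import defaultdict


def map_document(text, label):
    """Mapper: emit ('word|LABEL', 1) for every token in one document."""
    for token in text.lower().split():
        yield (f"{token.strip('.,')}|{label}", 1)


def reduce_counts(pairs):
    """Reducer: tally the values for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Hypothetical shards; each holds (text, label) records.
shards = [
    [("The quick brown fox jumped over the lazy dog.", "EN")],
    [("La fe mueve montañas.", "ES")],
]
pairs = [pair for shard in shards for text, label in shard
         for pair in map_document(text, label)]
counts = reduce_counts(pairs)
print(counts["the|EN"])       # 2
print(counts["montañas|ES"])  # 1
```

In a real deployment each shard would be handled by its own mapper, and the framework would group pairs by key before they reach the reducer.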

Parallelize Optimization
• Maximum Entropy Classifiers

$$\arg\min_{w} \prod_{i=1}^{N} \frac{\exp\left(\sum_{p=1}^{P} w_p x^i_p\right)^{y_i}}{1 + \exp\left(\sum_{p=1}^{P} w_p x^i_p\right)}$$

• Good: J(w) is concave
• Bad: no closed-form solution like NB
• Ugly: Large N

Gradient Descent

http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf

Gradient Descent
• w is initialized as zero
• for t in 1 to T
  • Calculate gradients $\nabla J(w)$
  • $w^{t+1} \leftarrow w^{t} - \eta \nabla J(w)$

$$\nabla J(w) = \sum_{i=1}^{N} \nabla P(w, x_i, y_i)$$
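
A minimal sketch of the loop above. The per-example term is assumed here to be the negative log-likelihood of a binary logistic model, whose gradient is (sigmoid(w · x) − y) · x; the slides do not fix this choice. The step count, learning rate, and data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(w, data):
    """Sum of per-example gradients (sigmoid(w . x) - y) * x over the data."""
    grad = [0.0] * len(w)
    for x, y in data:
        error = sigmoid(sum(wp * xp for wp, xp in zip(w, x))) - y
        for p, xp in enumerate(x):
            grad[p] += error * xp
    return grad


def gradient_descent(data, num_features, steps=100, eta=0.5):
    w = [0.0] * num_features                            # w is initialized as zero
    for _ in range(steps):                              # for t in 1 to T
        grad = gradient(w, data)                        # calculate gradients
        w = [wp - eta * gp for wp, gp in zip(w, grad)]  # w <- w - eta * grad
    return w


# Hypothetical data: feature 0 marks label 1, feature 1 marks label 0.
DATA = [([1, 0], 1), ([1, 0], 1), ([0, 1], 0), ([0, 1], 0)]
w = gradient_descent(DATA, num_features=2)
print(w[0] > 0, w[1] < 0)  # True True
```

Every step touches all N examples, which is the cost the next slides spread across machines.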

Distribute Gradient
• for t in 1 to T
  • Calculate gradients in parallel
  • $w^{t+1} \leftarrow w^{t} - \eta \nabla J(w)$
• Training CPU: O(TPN) to O(TPN / M)


Distribute Gradient

Big Data
Map: Machine 1, Machine 2, Machine 3, ... Machine M → (dummy key, partial gradient sum)
Reduce: Sum and update w
Repeat M/R until convergence → Model
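
The map/reduce loop for the distributed gradient can be sketched in one process. Threads stand in for the M machines here, which is purely illustrative since Python threads do not give real CPU parallelism. The logistic per-example gradient, the shards, and the names `partial_gradient` and `train` are assumptions.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_gradient(w, shard):
    """Map step: gradient sum over one shard (logistic loss is an assumption)."""
    grad = [0.0] * len(w)
    for x, y in shard:
        z = sum(wp * xp for wp, xp in zip(w, x))
        error = 1.0 / (1.0 + math.exp(-z)) - y
        for p, xp in enumerate(x):
            grad[p] += error * xp
    return grad


def train(shards, num_features, steps=50, eta=0.5):
    w = [0.0] * num_features
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for _ in range(steps):  # repeat map/reduce
            partials = pool.map(lambda s: partial_gradient(w, s), shards)
            total = [sum(col) for col in zip(*partials)]     # reduce: sum
            w = [wp - eta * gp for wp, gp in zip(w, total)]  # update w
    return w


# Hypothetical shards, one per "machine".
SHARDS = [[([1, 0], 1)], [([0, 1], 0)]]
w_dist = train(SHARDS, num_features=2)
print(w_dist[0] > 0, w_dist[1] < 0)  # True True
```

Each iteration ships one partial gradient per machine to the reducer and the updated w back out, so network traffic grows with the number of iterations T.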

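Parallelize Subroutines
• Support Vector Machines

$$\arg\min_{w, b, \zeta} \; \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{n} \zeta_i$$
$$\text{s.t. } 1 - y_i (w \cdot \phi(x_i) + b) \le \zeta_i, \quad \zeta_i \ge 0$$

• Solve the dual problem

$$\arg\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - \alpha^T 1$$
$$\text{s.t. } 0 \le \alpha \le C, \quad y^T \alpha = 0$$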

The computational
cost for the Primal-
Dual Interior Point
Method is O(n^3) in
time and O(n^2) in
memory

http://www.flickr.com/photos/sea-turtle/198445204/

Parallel SVM [Chang et al, 2007]

• Parallel, row-wise incomplete Cholesky Factorization for Q
• Parallel interior point method
• Time O(n^3) becomes O(n^2 / M)
• Memory O(n^2) becomes O(n √N / M)
• Parallel Support Vector Machines (psvm) http://code.google.com/p/psvm/
• Implemented in MPI

Parallel ICF
• Distribute Q by row into M machines
  Machine 1: row 1, row 2
  Machine 2: row 3, row 4
  Machine 3: row 5, row 6
  ...
• For each dimension n < √N
  • Send local pivots to master
  • Master selects the largest local pivot and broadcasts the global pivot to workers
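
For orientation, a serial sketch of pivoted incomplete Cholesky factorization. It is not the psvm implementation: everything runs on one machine, so the "largest pivot" step is a plain max over the remaining diagonal instead of a master/worker exchange, and `rank` plays the role of the √N cutoff. The matrix `Q` is a hypothetical example.

```python
import math


def incomplete_cholesky(Q, rank):
    """Return H (n x rank) with H H^T approximating the symmetric PSD matrix Q."""
    n = len(Q)
    H = [[0.0] * rank for _ in range(n)]
    diag = [Q[i][i] for i in range(n)]  # remaining diagonal: pivot candidates
    for k in range(rank):
        pivot = max(range(n), key=lambda i: diag[i])  # largest pivot wins
        root = math.sqrt(diag[pivot])
        for i in range(n):  # in the parallel version, rows are spread over machines
            dot = sum(H[i][j] * H[pivot][j] for j in range(k))
            H[i][k] = (Q[i][pivot] - dot) / root
            diag[i] -= H[i][k] ** 2
    return H


# Hypothetical 3 x 3 positive definite matrix.
Q = [[4.0, 2.0, 0.0], [2.0, 5.0, 3.0], [0.0, 3.0, 6.0]]
H = incomplete_cholesky(Q, rank=3)
approx = [[sum(H[i][k] * H[j][k] for k in range(3)) for j in range(3)]
          for i in range(3)]
print(all(abs(approx[i][j] - Q[i][j]) < 1e-9
          for i in range(3) for j in range(3)))  # True
```

With `rank` equal to n the factorization is exact up to rounding; stopping early gives the low-rank approximation that saves memory.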

Majority Vote

Big Data → Model 1, Model 2, Model 3, Model 4

• Train individual classifiers independently
• Predict by taking majority votes
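
Prediction by majority vote takes only a few lines. The three models below are hypothetical stand-ins for classifiers trained on separate shards; each one simply maps a text to a label.

```python
from collections import Counter


def majority_vote(models, x):
    """Predict with every model and return the most common label."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical stand-in models, one per shard.
models = [
    lambda text: "ES" if "la" in text.lower().split() else "EN",
    lambda text: "ES" if "ñ" in text else "EN",
    lambda text: "EN",
]
print(majority_vote(models, "La fe mueve montañas."))  # ES
print(majority_vote(models, "The quick brown fox"))    # EN
```

Training needs no communication between machines, but every model has to be evaluated at prediction time.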

Parameter Mixture [Mann et al, 2009]

Big Data
Map: (dummy key, w1) (dummy key, w2) ...
Reduce: Average w
Model

Much less network usage than distributed gradient descent:
O(MN) vs. O(MNT)
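
A sketch of the parameter mixture steps: train one model per shard independently, then average the weight vectors once. A uniform average is assumed, and `train_fn` is a hypothetical placeholder for whatever single-machine trainer is used.

```python
def average_weights(weight_vectors):
    """Reduce step: element-wise average of the per-shard weight vectors."""
    m = len(weight_vectors)
    return [sum(column) / m for column in zip(*weight_vectors)]


def parameter_mixture(shards, train_fn):
    """Map step trains one model per shard; a single reduce averages them."""
    return average_weights([train_fn(shard) for shard in shards])


# Hypothetical trainer: it just returns the shard's stored vector.
shards_pm = [[1.0, 2.0], [3.0, 4.0]]
print(parameter_mixture(shards_pm, train_fn=lambda shard: shard))  # [2.0, 3.0]
```

The weights cross the network once, after training, instead of once per iteration.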

http://www.flickr.com/photos/annamatic3000/127945652/

Iterative Param Mixture [McDonald et al., 2010]

Big Data
Reduce after each epoch: Average w
Model
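
A sketch of iterative parameter mixing: after every epoch the per-shard weights are averaged and the average is handed back to all shards as the next starting point. A perceptron with labels +1 and −1 is assumed as the base learner; the shards and function names are illustrative.

```python
def perceptron_epoch(w, shard):
    """One pass of perceptron updates over a shard, starting from w."""
    w = list(w)
    for x, y in shard:  # y is +1 or -1
        if y * sum(wp * xp for wp, xp in zip(w, x)) <= 0:
            w = [wp + y * xp for wp, xp in zip(w, x)]
    return w


def iterative_parameter_mixture(shards, num_features, epochs=5):
    w = [0.0] * num_features
    for _ in range(epochs):
        # Map: every shard starts the epoch from the same averaged w.
        local = [perceptron_epoch(w, shard) for shard in shards]
        # Reduce after each epoch: average w and send it back out.
        w = [sum(col) / len(local) for col in zip(*local)]
    return w


# Hypothetical shards with one example each.
ipm_shards = [[([1, 0], 1)], [([0, 1], -1)]]
print(iterative_parameter_mixture(ipm_shards, num_features=2))  # [0.5, -0.5]
```

This sits between the two earlier schemes: more communication than a single final average, less than exchanging gradients at every step.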

Scalable

http://www.flickr.com/photos/mr_t_in_dc/5469563053

Parallel

http://www.flickr.com/photos/aloshbennett/3209564747/

Accuracy
http://www.flickr.com/photos/wanderlinse/4367261825/

http://www.flickr.com/photos/imagelink/4006753760/

Binary
Classification
http://www.flickr.com/photos/brenderous/4532934181/

Automatic
Feature
Discovery

http://www.flickr.com/photos/mararie/2340572508/

Fast
Response

http://www.flickr.com/photos/prunejuice/3687192643/

Memory is the new
hard disk.

http://www.flickr.com/photos/jepoirrier/840415676/

Algorithm +
Infrastructure

http://www.flickr.com/photos/neubie/854242030/

Design for
Multicores
http://www.flickr.com/photos/geektechnique/2344029370/

Multi-shard Combiner

[Chandra et al., 2010]

Parallelize ML
Algorithms

• Distributed learning

Parallel Accuracy

Fast
Response

Google APIs
• Prediction API
  • machine learning service on the cloud
  • http://code.google.com/apis/predict
• BigQuery
  • interactive analysis of massive data on the cloud
  • http://code.google.com/apis/bigquery
