Machine Learning on Big Data
1. Machine Learning on Big Data
Lessons Learned from Google Projects
Max Lin
Software Engineer | Google Research
Massively Parallel Computing | Harvard CS 264
Guest Lecture | March 29th, 2011
2. Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
4. “Machine Learning is a study
of computer algorithms that
improve automatically
through experience.”
31. Training (Input X, Output Y)
• “The quick brown fox jumped over the lazy dog.” → English
• “To err is human, but to really foul things up you need a computer.” → English
• “No hay mal que por bien no venga.” → Spanish
• “La tercera es la vencida.” → Spanish
Model f(x)
Testing: f(x’) = y’
• “To be or not to be -- that is the question” → ?
• “La fe mueve montañas.” → ?
59. Linear Classifier
The quick brown fox jumped over the lazy dog.
  ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...
x [ 0, ... 0, ... 1, ... 1, ... 0, ... ]
w [ 0.1, ... 132, ... 150, ... 200, ... -153, ... ]
$f(x) = w \cdot x = \sum_{p=1}^{P} w_p \, x_p$
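The dot product on this slide can be sketched as follows. This is a minimal illustration, assuming a toy five-word vocabulary and using the slide's example weights at face value. The helper names `featurize` and `score` are hypothetical, and reading a positive score as "English" is an assumption, not something the slide states.

```python
# Toy vocabulary and weights copied from the slide; a real model has millions of entries.
VOCAB = ["a", "aardvark", "dog", "the", "montañas"]
WEIGHTS = [0.1, 132.0, 150.0, 200.0, -153.0]


def featurize(sentence, vocab=VOCAB):
    """Binary bag-of-words vector: x_p = 1 if the p-th vocabulary word occurs."""
    words = {token.strip(".,!?").lower() for token in sentence.split()}
    return [1 if term in words else 0 for term in vocab]


def score(x, w=WEIGHTS):
    """f(x) = w . x = sum over p of w_p * x_p."""
    return sum(w_p * x_p for w_p, x_p in zip(w, x))


x = featurize("The quick brown fox jumped over the lazy dog.")
print(x, score(x))
```

Only 'dog' and 'the' fire for the English sentence, so the score is the sum of those two weights; a sentence containing only 'montañas' from this vocabulary would pick up the negative weight instead.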
60. Training Data
Input X: N rows (examples) × P columns (features)
Output Y: N rows (one label per example)
61. Typical machine learning data at Google
N: 100 billion / 1 billion
P: 1 billion / 10 million
(mean / median)
http://www.flickr.com/photos/mr_t_in_dc/5469563053
62. Classifier Training
• Training: Given {(x, y)} and f, minimize the following objective function
$\arg\min_{w} \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)$
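To make the objective concrete, here is a small sketch that evaluates it on two examples. The slide leaves L and R abstract, so the logistic loss and the L2 regularizer used here (and the value of `lam`) are assumptions chosen for illustration only.

```python
import math


def f(x, w):
    """Linear model f(x; w) = w . x."""
    return sum(w_p * x_p for w_p, x_p in zip(w, x))


def log_loss(y, s):
    """Logistic loss for a label y in {0, 1} and a raw score s (assumed choice of L)."""
    return math.log(1.0 + math.exp(s)) - y * s


def l2(w, lam=0.1):
    """L2 regularizer R(w) = lam * ||w||^2 (assumed choice of R)."""
    return lam * sum(w_p * w_p for w_p in w)


def objective(data, w):
    """Sum of per-example losses plus the regularizer."""
    return sum(log_loss(y, f(x, w)) for x, y in data) + l2(w)


data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
print(objective(data, [0.0, 0.0]))   # every score is 0 at w = 0
print(objective(data, [1.0, -1.0]))  # a w that fits both examples scores lower
```

Training is then a search over w for the smallest value of `objective`, which is what the following slides scale up.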
63. Use Newton’s method?
$w^{t+1} \leftarrow w^{t} - H(w^{t})^{-1} \nabla J(w^{t})$
http://www.flickr.com/photos/visitfinland/5424369765/
64. Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
80. Parallelize Estimates
• Naive Bayes Classifier
$\arg\min_{w} \; -\prod_{i=1}^{N} \prod_{p=1}^{P} P(x_p^i \mid y_i; w) \, P(y_i; w)$
• Maximum Likelihood Estimates
$w_{\text{the}|EN} = \frac{\sum_{i=1}^{N} 1_{EN,\text{the}}(x^i)}{\sum_{i=1}^{N} 1_{EN}(x^i)}$
90. Word Counting
Map
X: “The quick brown fox ...”, Y: EN
→ (‘the|EN’, 1), (‘quick|EN’, 1), (‘brown|EN’, 1)
Reduce
[ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ]
→ C(‘the’|EN) = SUM of values = 3
$w_{\text{the}|EN} = \frac{C(\text{the}|EN)}{C(EN)}$
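The map and reduce steps above can be sketched in a single process as follows. This is not Google's MapReduce API: `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical names, and the in-memory grouping stands in for the shuffle phase. Two details are assumptions on my part: each distinct word is emitted once per document so the count matches the indicator in the maximum likelihood estimate, and an extra per-document label key is emitted to obtain C(EN).

```python
from collections import defaultdict


def map_fn(text, label):
    """Emit ('word|label', 1) per distinct word, plus (label, 1) per document."""
    words = {token.strip(".,!?") for token in text.lower().split()}
    for word in sorted(w for w in words if w):
        yield (f"{word}|{label}", 1)
    yield (label, 1)  # assumed extra key so that C(label) can be counted too


def reduce_fn(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def map_reduce(records):
    """Single-process stand-in for the shuffle: group mapped values by key."""
    grouped = defaultdict(list)
    for text, label in records:
        for key, value in map_fn(text, label):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


records = [
    ("The quick brown fox", "EN"),
    ("The lazy dog", "EN"),
    ("To err is human, but the computer", "EN"),
    ("La fe mueve montañas", "ES"),
]
counts = map_reduce(records)
w_the_en = counts["the|EN"] / counts["EN"]
print(counts["the|EN"], counts["EN"], w_the_en)
```

Because each key is reduced independently, the counting parallelizes across as many reducers as there are distinct keys.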
103. Parallelize Optimization
• Maximum Entropy Classifiers
$\arg\min_{w} \prod_{i=1}^{N} \frac{\exp\left(\sum_{p=1}^{P} w_p x_p^i\right)^{y_i}}{1 + \exp\left(\sum_{p=1}^{P} w_p x_p^i\right)}$
• Good: J(w) is concave
• Bad: no closed-form solution like NB
• Ugly: Large N
109. Gradient Descent
• w is initialized as zero
• for t in 1 to T
  • Calculate gradients $\nabla J(w)$
  • $w^{t+1} \leftarrow w^{t} - \eta \nabla J(w)$
$J(w) = \sum_{i=1}^{N} P(w, x_i, y_i)$
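The loop on this slide can be sketched in plain Python. As an assumption, J(w) is taken to be the summed logistic loss (the negative log-likelihood of a binary maximum entropy model), and the step size, iteration count, and toy data are illustrative values rather than anything from the talk.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def gradient(w, data):
    """Gradient of the summed logistic loss: sum_i (sigmoid(w . x_i) - y_i) * x_i."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = sigmoid(sum(w_p * x_p for w_p, x_p in zip(w, x))) - y
        for p, x_p in enumerate(x):
            grad[p] += err * x_p
    return grad


def gradient_descent(data, num_features, steps=100, eta=0.5):
    w = [0.0] * num_features                             # w is initialized as zero
    for _ in range(steps):                               # for t in 1 to T
        g = gradient(w, data)                            # calculate gradients
        w = [w_p - eta * g_p for w_p, g_p in zip(w, g)]  # w <- w - eta * gradient
    return w


data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]
w = gradient_descent(data, num_features=2)
print(w)
```

Each call to `gradient` touches every example, which is the O(PN) per-iteration cost that motivates distributing the sum.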
111. Distribute Gradient
• w is initialized as zero
• for t in 1 to T
  • Calculate gradients in parallel
  • $w^{t+1} \leftarrow w^{t} - \eta \nabla J(w)$
• Training CPU: O(TPN) to O(TPN / M)
117. Distribute Gradient
Big Data: Shard 1 | Shard 2 | Shard 3 | ... | Shard M
Map: Machine 1 | Machine 2 | Machine 3 | ... | Machine M, each emitting (dummy key, partial gradient sum)
Reduce: Sum and Update w
Repeat M/R until convergence → Model
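The diagram can be simulated in one process as below. This is a sketch under stated assumptions: the shards are plain Python lists, the "machines" run sequentially, the loss is the logistic loss, and `partial_gradient`, `reduce_and_update`, and `train` are hypothetical names. A fixed number of rounds stands in for a convergence test.

```python
import math


def partial_gradient(w, shard):
    """Map step: gradient sum of the logistic loss over one shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        s = sum(w_p * x_p for w_p, x_p in zip(w, x))
        err = 1.0 / (1.0 + math.exp(-s)) - y
        for p, x_p in enumerate(x):
            grad[p] += err * x_p
    return ("dummy", grad)


def reduce_and_update(w, mapped, eta):
    """Reduce step: sum the partial gradients, then take one descent step."""
    total = [sum(parts) for parts in zip(*(g for _, g in mapped))]
    return [w_p - eta * g_p for w_p, g_p in zip(w, total)]


def train(shards, num_features, rounds=100, eta=0.5):
    w = [0.0] * num_features
    for _ in range(rounds):  # one MapReduce job per iteration
        mapped = [partial_gradient(w, shard) for shard in shards]
        w = reduce_and_update(w, mapped, eta)
    return w


data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 2.0], 0)]
shards = [data[0:2], data[2:4]]            # M = 2 shards
w_sharded = train(shards, num_features=2)
w_single = train([data], num_features=2)   # same data on one "machine"
print(w_sharded, w_single)
```

Because the gradient is a sum over examples, the sharded run matches the single-machine run up to floating-point rounding; the price is one round of communication per iteration.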
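119. Parallelize Subroutines
• Support Vector Machines
$\arg\min_{w, b, \zeta} \; \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{n} \zeta_i$
$\text{s.t.} \quad 1 - y_i (w \cdot \phi(x_i) + b) \le \zeta_i, \quad \zeta_i \ge 0$
• Solve the dual problem
$\arg\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - \alpha^T \mathbf{1}$
$\text{s.t.} \quad 0 \le \alpha \le C, \quad y^T \alpha = 0$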
120. The computational cost for the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory
http://www.flickr.com/photos/sea-turtle/198445204/
124. Parallel SVM [Chang et al, 2007]
• Parallel, row-wise incomplete Cholesky Factorization for Q
• Parallel interior point method
• Time O(n^3) becomes O(n^2 / M)
• Memory O(n^2) becomes O(n √N / M)
• Parallel Support Vector Machines (psvm) http://code.google.com/p/psvm/
• Implemented in MPI
125. Parallel ICF
• Distribute Q by row into M machines
  Machine 1: row 1, row 2
  Machine 2: row 3, row 4
  Machine 3: row 5, row 6
  ...
• For each dimension n < √N
  • Send local pivots to master
  • Master selects the largest local pivot and broadcasts it as the global pivot to the workers
139. Parameter Mixture [Mann et al, 2009]
Big Data: Shard 1 | Shard 2 | Shard 3 | ... | Shard M
Map: Machine 1 | Machine 2 | Machine 3 | ... | Machine M, emitting (dummy key, w1), (dummy key, w2), ...
Reduce: Average w → Model
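A single-process sketch of the idea follows: each shard trains its own weight vector with no communication, and one reduce step averages them. The local trainer (batch gradient descent on the logistic loss), the uniform average, and the names `train_local` and `average` are assumptions for illustration; the cited paper's exact mixture weights are not reproduced here.

```python
import math


def train_local(shard, num_features, steps=50, eta=0.5):
    """Map step: fit a logistic-regression weight vector on one shard only."""
    w = [0.0] * num_features
    for _ in range(steps):
        grad = [0.0] * num_features
        for x, y in shard:
            s = sum(w_p * x_p for w_p, x_p in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-s)) - y
            for p, x_p in enumerate(x):
                grad[p] += err * x_p
        w = [w_p - eta * g_p for w_p, g_p in zip(w, grad)]
    return ("dummy", w)


def average(mapped):
    """Reduce step: uniform average of the per-shard weight vectors."""
    vectors = [w for _, w in mapped]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


shards = [
    [([1.0, 0.0], 1), ([0.0, 1.0], 0)],
    [([1.0, 1.0], 1), ([0.0, 2.0], 0)],
]
mapped = [train_local(shard, num_features=2) for shard in shards]  # no communication
w = average(mapped)                                                # single reduce
print(w)
```

Only the final weight vectors cross the network, once, instead of a gradient per shard on every iteration.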
140. Much less network usage than distributed gradient descent
O(MN) vs. O(MNT)
http://www.flickr.com/photos/annamatic3000/127945652/
149. Iterative Param Mixture [McDonald et al., 2010]
Big Data: Shard 1 | Shard 2 | Shard 3 | ... | Shard M
Map: Machine 1 | Machine 2 | Machine 3 | ... | Machine M, emitting (dummy key, w1), (dummy key, w2), ...
Reduce after each epoch: Average w → Model
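The difference from plain parameter mixture is that the average is fed back as the starting point of the next epoch. The sketch below simulates this in one process. The per-example logistic-loss update, the uniform average, the epoch count, and the function names are assumptions for illustration and may differ from the learner used in the cited paper.

```python
import math


def one_epoch(w, shard, eta=0.5):
    """Map step: one pass of per-example updates on a shard, starting from w."""
    w = list(w)
    for x, y in shard:
        s = sum(w_p * x_p for w_p, x_p in zip(w, x))
        err = 1.0 / (1.0 + math.exp(-s)) - y
        w = [w_p - eta * err * x_p for w_p, x_p in zip(w, x)]
    return ("dummy", w)


def average(mapped):
    """Reduce step: uniform average of the per-shard weight vectors."""
    vectors = [w for _, w in mapped]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def iterative_parameter_mixture(shards, num_features, epochs=20):
    w = [0.0] * num_features
    for _ in range(epochs):
        mapped = [one_epoch(w, shard) for shard in shards]  # shards run independently
        w = average(mapped)                                 # mix after each epoch
    return w


shards = [
    [([1.0, 0.0], 1), ([0.0, 1.0], 0)],
    [([1.0, 1.0], 1), ([0.0, 2.0], 0)],
]
w = iterative_parameter_mixture(shards, num_features=2)
print(w)
```

Communication happens once per epoch rather than once per gradient step, which sits between the two earlier schemes in network cost.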
151. Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
177. Google APIs
• Prediction API
  • machine learning service on the cloud
  • http://code.google.com/apis/predict
• BigQuery
  • interactive analysis of massive data on the cloud
  • http://code.google.com/apis/bigquery
Sub-sampling provides inferior performance
Parameter mixture improves, but is not as good as using all data
Iterative parameter mixture is as good as using all data
Distributed algorithms return better classifiers more quickly

billions of instances, millions of features
within reasonable resources

guarantee, state-of-the-art

easy to use - adoption setup
easy to maintain - reliable, production, work with batch systems

new features every day
feedback change

iterative, fast retrieval for data and model / parameters

fault-tolerant pieces: MapReduce (scalable), multi-cores, GFS for data