What you can do with a tall-and-skinny QR factorization in Hadoop: Principal components and large regressions
1. What you can do with a Tall-and-Skinny QR Factorization on Hadoop: Large regressions, Principal Components
Slides bit.ly/16LS8Vk
@dgleich
Code github.com/dgleich/mrtsqr
dgleich@purdue.edu
DAVID F. GLEICH
ASSISTANT PROFESSOR
COMPUTER SCIENCE
PURDUE UNIVERSITY
2. Why you should stay …
you like advanced machine learning techniques
you want to understand how to compute the
singular values and vectors of a huge matrix
(that’s tall and skinny)
you want to learn about large-scale regression,
and principal components from a matrix
perspective
3. What I’m going to assume you know
MapReduce
Python
Some simple matrix manipulation
4. Tall-and-Skinny matrices (m ≫ n)
Many rows (like a billion), a few columns (under 10,000).
[Figure: the matrix A; sample images from the tinyimages collection]
Used in
regression and general linear models with many samples
block iterative methods
panel factorizations
approximate kernel k-means
big-data SVD/PCA
5. If you have tons of small records, then there is probably a tall-and-skinny matrix somewhere.
6. Tall-and-skinny matrices are common in BigData
A : m x n, m ≫ n
Key is an arbitrary row-id.
Value is the 1 x n array for a row.
Each submatrix Ai is the input to a map task.
[Figure: A split into row blocks A1, A2, A3, A4]
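To make that layout concrete, here is a tiny sketch (mine, not the talk's) of what the record stream for a tall-and-skinny matrix looks like: each key is an arbitrary row id, and each value is the 1 x n row.

import numpy

A = numpy.arange(12.0).reshape(6, 2)   # a toy tall-and-skinny matrix
records = [(i, list(row)) for i, row in enumerate(A)]
# records[0] == (0, [0.0, 1.0]) -- (row-id, the 1 x n array for that row)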
7. PCA of 80,000,000 images
[Figure: the matrix A is 80,000,000 images by 1000 pixels. Plot: fraction of variance versus number of principal components (x-axis 20–100). Image panel: the first 16 columns of V displayed as images; caption: "The 16 most important principal component basis functions (by rows)."]
Constantine & Gleich, MapReduce 2011.
8. Regression with 80,000,000 images
The goal was to approximate how much red there was in a picture from the grayscale pixels only. We get a measure of how much “redness” each pixel contributes to the whole.
From the paper: we predicted the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if $r_i$ is the sum of the red components in all pixels of image $i$, and $G_{i,j}$ is the gray value of the $j$th pixel in image $i$, then we wanted to find $\min_s \sum_i \bigl( r_i - \sum_j G_{i,j} s_j \bigr)^2$. There is no particular importance to this regression problem; we use it merely as a demonstration.
The coefficients $s_j$ are displayed as an image at the right. They reveal regions of the image that are not as important in determining the overall red component of an image. The color scale varies from light blue (strongly negative) through 0 to red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers.
We also solved a principal component problem to find a principal component basis for each image. Let G be the matrix of $G_{i,j}$’s from the regression and let $u_i$ be the mean of the $i$th …
[Figure: G is 80,000,000 images by 1000 pixels; the coefficients $s_j$ shown as an image.]
10. QR Factorization and the Gram-Schmidt process
Consider a set of vectors v1 to
vn. Set u1 to be v1.
Create a new vector u2 by
removing any “component” of
u1 from v2.
Create a new vector u3 by
removing any “component” of
u1 and u2 from v3.
…
[Figure: “Gram-Schmidt process”, from Wikipedia]
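As a concrete illustration, here is a minimal numpy sketch of the process just described: classical Gram-Schmidt on the columns of V, assuming the vectors are linearly independent (illustration only, not the talk's code):

import numpy

def gram_schmidt(V):
    # columns of V are v1..vn; returns orthonormal columns u1..un
    U = numpy.zeros_like(V, dtype=float)
    for j in range(V.shape[1]):
        u = V[:, j].astype(float)
        for k in range(j):
            u = u - (U[:, k] @ V[:, j]) * U[:, k]   # remove the u_k "component"
        U[:, j] = u / numpy.linalg.norm(u)
    return U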
11. QR Factorization and the Gram-Schmidt process
$v_1 = a_1 u_1$
$v_2 = b_1 u_1 + b_2 u_2$
$v_3 = c_1 u_1 + c_2 u_2 + c_3 u_3$

$\begin{bmatrix} v_1 & v_2 & v_3 & \cdots \end{bmatrix}
= \begin{bmatrix} u_1 & u_2 & u_3 & \cdots \end{bmatrix}
\begin{bmatrix} a_1 & b_1 & c_1 & \cdots \\ 0 & b_2 & c_2 & \cdots \\ 0 & 0 & c_3 & \cdots \\ & & & \ddots \end{bmatrix}$
12. QR Factorization and the Gram-Schmidt process
$v_1 = a_1 u_1$
$v_2 = b_1 u_1 + b_2 u_2$
$v_3 = c_1 u_1 + c_2 u_2 + c_3 u_3$

For this problem: V = UR. All vectors in U are at right angles, i.e. they are decoupled.
What it’s usually written as by others: A = QR.
13. QR Factorization and the Gram-Schmidt process
$v_1 = a_1 u_1$
$v_2 = b_1 u_1 + b_2 u_2$
$v_3 = c_1 u_1 + c_2 u_2 + c_3 u_3$

[Figure: A = QR drawn as blocks, with R upper triangular.] All vectors in Q are at right angles, i.e. they are decoupled.
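In numpy this factorization is one call; a small sketch (on a random test matrix) checking that A = QR and that the columns of Q really are decoupled:

import numpy

A = numpy.random.randn(1000, 5)                     # tall and skinny
Q, R = numpy.linalg.qr(A)                           # A = QR, R upper triangular
print(numpy.allclose(A, Q @ R))                     # True
print(numpy.linalg.norm(Q.T @ Q - numpy.eye(5)))    # ~1e-15: right angles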
14. PCA of 80,000,000 images
[Figure: the pipeline. A (80,000,000 images by 1000 pixels, rows shifted to zero mean) → TSQR → R → SVD → V (the principal components, first 16 columns shown as images) and the top 100 singular values. The TSQR runs in MapReduce; the small SVD is post-processing.]
Constantine & Gleich, MapReduce 2010.
15. Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute colsum(A): 161 sec.
Time to compute R in qr(A): 387 sec.
16. The rest of the talk: Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        # QR factorize the collected rows and keep only the R factor
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        # buffer rows; compress whenever we exceed blocksize * n rows
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            # random keys spread the rows of R across reducers
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
17. Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010)

Algorithm
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows

[Figure: Mapper 1 (Serial TSQR) folds A1–A4 through successive local QRs (Q2 R2, Q3 R3, Q4 R4) and emits R4; Mapper 2 does the same for A5–A8 and emits R8; Reducer 1 (Serial TSQR) factors the stacked R4 and R8 and emits the final Q, R.]
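The map/reduce tree in the figure can be simulated serially; a minimal sketch assuming an in-memory numpy matrix (unlike the Hadoop version, which streams rows):

import numpy

def tsqr_R(A, nchunks=4):
    # "map": factor each row chunk, keep only the small R factor
    Rs = [numpy.linalg.qr(c, mode='r') for c in numpy.array_split(A, nchunks)]
    # "reduce": stack the R factors and factor the stack into the final R
    return numpy.linalg.qr(numpy.vstack(Rs), mode='r')

A = numpy.random.randn(10000, 50)
print(numpy.allclose(numpy.abs(tsqr_R(A)),
                     numpy.abs(numpy.linalg.qr(A, mode='r'))))  # True: R is unique up to row signs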
18. The rest of the talk: Full TSQR code in hadoopy
(Same code as on slide 16.)
19. Too many maps cause too much data to one reducer!
Each image is 5k.
Each HDFS block has 12,800 images.
6,250 total blocks.
Each map outputs a 1000-by-1000 matrix.
One reducer gets a 6.25M-by-1000 matrix (50GB).
20. [Figure: the two-iteration job. Iteration 1: Mappers 1-1 through 1-4 run Serial TSQR on the row blocks of A and emit R1–R4; after a shuffle, Reducers 1-1 through 1-3 run Serial TSQR on what they receive and emit R2,1–R2,3. Iteration 2: an identity map and another shuffle send everything to Reducer 2-1, whose Serial TSQR emits the final R.]
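The two-iteration scheme generalizes to any number of levels; a serial sketch (mine) with a hypothetical fanin parameter that caps how many R factors any one "reducer" stacks:

import numpy

def tsqr_R_iterated(chunks, fanin=2):
    # iteration 1 (map): factor each data chunk down to a small R
    Rs = [numpy.linalg.qr(c, mode='r') for c in chunks]
    # later iterations (reduce): each task stacks and factors `fanin` R's,
    # so no single task ever sees more than fanin * n rows
    while len(Rs) > 1:
        Rs = [numpy.linalg.qr(numpy.vstack(Rs[i:i + fanin]), mode='r')
              for i in range(0, len(Rs), fanin)]
    return Rs[0]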
21. Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute colsum(A): 161 sec.
Time to compute R in qr(A): 387 sec.
22. Hadoop streaming isn’t always slow!
Synthetic data test on a 100,000,000-by-500 matrix (~500GB).
Codes implemented in MapReduce streaming.
Matrix stored as TypedBytes lists of doubles.
Python frameworks use Numpy+ATLAS matrix.
Custom C++ TypedBytes reader/writer with ATLAS matrix.

           Iter 1 (secs.)   Iter 2 (secs.)   Overall (secs.)
Dumbo      960              217              1177
Hadoopy    612              118              730
C++        350              37               387
Java       436              66               502
23. Use multiple iterations for problems with many columns

Cols.   Iters.   Split (MB)   Maps   Secs.
50      1        64           8000   388
–       –        256          2000   184
–       –        512          1000   149
–       2        64           8000   425
–       –        256          2000   220
–       –        512          1000   191
1000    1        512          1000   666
–       2        64           6000   590
–       –        256          2000   432
–       –        512          1000   337

Increasing split size improves performance (accounts for Hadoop data movement).
Increasing iterations helps for problems with many columns.
(1000 columns with a 64MB split size overloaded the single reducer.)
24. More about how to compute a regression

$\min \|Ax - b\|^2 = \min \sum_i \Bigl( \sum_j A_{ij} x_j - b_i \Bigr)^2$

[Figure: Mapper 1 (Serial TSQR) factors the row blocks A1–A4 while carrying the right-hand side b along; each local step A2 → Q2 R2 also transforms the corresponding piece of b, $b_2 = Q_2^T b_1$.]
25. TSQR code in hadoopy for regressions

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        […]  # as on slide 16; this variant also needs self.rhs = []

    def compress(self):
        # full QR now: we need Q to transform the right-hand side too
        Q, R = numpy.linalg.qr(numpy.array(self.data), 'full')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])
        self.rhs = list(numpy.dot(Q.T, numpy.array(self.rhs)))

    def collect(self, key, valuerhs):
        # each value carries a matrix row and its right-hand side entry
        self.data.append(valuerhs[0])
        self.rhs.append(valuerhs[1])
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for i, row in enumerate(self.data):
            key = random.randint(0, 2000000000)
            yield key, (row, self.rhs[i])

    def mapper(self, key, value):
        # unpack (helper elided on the slide) decodes the (row, rhs) pair
        self.collect(key, unpack(value))

    def reducer(self, key, values):
        # the mapper already unpacks, so pass values through unchanged
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
26. More about how to compute a regression

$\min \|Ax - b\|^2 = \min \|QRx - b\|^2$
Orthogonal or “right angle” matrices don’t change vector magnitude, so
$= \min \|Q^T Q R x - Q^T b\|^2 = \min \|Rx - Q^T b\|^2$

This is a tiny linear system!

def compute_x(output):
    R, y = load_from_hdfs(output)
    x = numpy.linalg.solve(R, y)
    write_output(x, output + '-x')

[Figure: QR of A for regression, with b transformed to $Q^T b$.]
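Putting the pieces together, a serial sketch (mine) of the whole regression scheme: local QRs with $Q^T b$ carried along, then the tiny solve at the end; it assumes in-memory chunks rather than HDFS records:

import numpy

def tsqr_regression(chunks, bs):
    Rys = []
    for A_i, b_i in zip(chunks, bs):
        Q, R = numpy.linalg.qr(A_i)            # local QR of the chunk
        Rys.append((R, Q.T @ b_i))             # carry Q_i^T b_i along with R_i
    Rstack = numpy.vstack([R for R, _ in Rys])
    ystack = numpy.concatenate([y for _, y in Rys])
    Q2, R2 = numpy.linalg.qr(Rstack)           # final QR of the stacked factors
    return numpy.linalg.solve(R2, Q2.T @ ystack)   # the tiny linear system

A = numpy.random.randn(10000, 20)
b = numpy.random.randn(10000)
x = tsqr_regression(numpy.array_split(A, 8), numpy.array_split(b, 8))
print(numpy.allclose(x, numpy.linalg.lstsq(A, b, rcond=None)[0]))  # True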
27. We do a similar step for the PCA and compute the 1000-by-1000 SVD on one machine.
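The "similar step": since A = QR gives $A^T A = R^T R$, the small n-by-n SVD of R has the same singular values and right singular vectors V as A. A minimal sketch, assuming the R from TSQR already fits on one machine:

import numpy

def pca_from_R(R):
    # R is the small n-by-n factor from TSQR of the zero-mean data
    _, svals, Vt = numpy.linalg.svd(R)
    return svals, Vt.T    # singular values and principal components V of A

With 1000 columns this is just a 1000-by-1000 SVD.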
29. What about the matrix Q?

We want Q to be numerically orthogonal: $\|Q^T Q - I\|$ should be small. A condition number measures problem sensitivity.
Prior work: AR⁻¹ (Constantine & Gleich, MapReduce 2011).
Newer methods: AR⁻¹ + iterative refinement, and Direct TSQR (Benson, Gleich, Demmel, submitted).
Prior methods all failed without any warning.

[Figure: $\|Q^T Q - I\|$ versus condition number (10⁵ to 10²⁰) for AR⁻¹, AR⁻¹ with iterative refinement, and Direct TSQR.]
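A small numerical experiment in the spirit of the figure (my sketch, not the paper's code): form Q explicitly as AR⁻¹ on test matrices of growing condition number and watch $\|Q^T Q - I\|$ grow, while Householder QR stays orthogonal:

import numpy

n = 50
for cond in [1e5, 1e10, 1e15]:
    # build a 2000-by-n test matrix with the chosen condition number
    U = numpy.linalg.qr(numpy.random.randn(2000, n))[0]
    V = numpy.linalg.qr(numpy.random.randn(n, n))[0]
    s = numpy.logspace(0, -numpy.log10(cond), n)
    A = (U * s) @ V.T
    Q1 = A @ numpy.linalg.inv(numpy.linalg.qr(A, mode='r'))   # Q = A R^{-1}
    Q2 = numpy.linalg.qr(A)[0]                                # Householder QR
    print(cond,
          numpy.linalg.norm(Q1.T @ Q1 - numpy.eye(n)),        # grows with cond(A)
          numpy.linalg.norm(Q2.T @ Q2 - numpy.eye(n)))        # stays near machine eps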
30. Taking care of business by keeping track of Q

1. Output local Q and R in separate files.
2. Collect R on one node, compute Qs for each piece.
3. Distribute the pieces of Q*1 and form the true Q.

[Figure: Mappers 1–4 factor A1–A4 into Q1–Q4 (Q output) and R1–R4 (R output); Task 2 stacks R1–R4 and factors the stack into the blocks Q11–Q41 and the final R; a last map stage multiplies each Qi by its block Qi1 to form the true Q.]
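A serial sketch of those three steps (assuming every chunk has at least n rows, so each local R is n-by-n):

import numpy

def direct_tsqr(chunks):
    n = chunks[0].shape[1]
    # step 1: local QRs; keep both the Q and the R pieces
    QRs = [numpy.linalg.qr(c) for c in chunks]
    # step 2: collect the R's on one "node" and factor the stack
    Q2, R = numpy.linalg.qr(numpy.vstack([R_i for _, R_i in QRs]))
    # step 3: distribute the n-by-n blocks of Q2 and form the true Q
    Q = numpy.vstack([Q_i @ Q2[i * n:(i + 1) * n]
                      for i, (Q_i, _) in enumerate(QRs)])
    return Q, R

A = numpy.random.randn(10000, 50)
Q, R = direct_tsqr(numpy.array_split(A, 8))
print(numpy.allclose(Q @ R, A),
      numpy.linalg.norm(Q.T @ Q - numpy.eye(50)))   # True, ~machine eps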
32. Future work … more columns!
With ~3000 columns, one 64MB chunk is a local
QR computation.
Could “iterate in blocks of 3000” columns to
continue … maybe “efficient” for 10,000 columns
Need different ideas for 100,000 columns
(randomized methods?)
I think this took 30 minutes using our slowest codes. Our fastest codes should take it down to about 3-4 minutes. You’ll probably wait longer to get your job scheduled.