Whats Right and Wrong with Apache Mahout
- 2. ©MapR Technologies 2013 - Confidential
What is Mahout?
“Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.
Components
– math library
– clustering
– classification
– decompositions
– recommendations
- 5.
What is Right and Wrong with Mahout?
Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff
All the stuff that
isn’t there
- 7.
Mahout Math
Goals:
– basic linear algebra
– statistical sampling
– good clustering
– decent speed
– extensibility
– especially for sparse data
But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature
- 8.
Matrices and Vectors
At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix
Highly composable API
Important ideas:
– view*, assign and aggregate
– iteration
m.viewDiagonal().assign(v)
- 9.
Assign
Matrices:
Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);
Vectors:
Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);
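The assign(other, f) form is the workhorse of this API: it combines each element of a vector with the corresponding element of another vector, in place. A minimal plain-Java sketch of that semantics, using java.util.function.DoubleBinaryOperator in place of Mahout's DoubleDoubleFunction so it runs without Mahout on the classpath:

```java
import java.util.function.DoubleBinaryOperator;

// Sketch of Vector.assign(Vector other, DoubleDoubleFunction f):
// v[i] = f(v[i], other[i]) for every index, mutating v in place.
public class AssignSketch {
    static void assign(double[] v, double[] other, DoubleBinaryOperator f) {
        for (int i = 0; i < v.length; i++) {
            v[i] = f.applyAsDouble(v[i], other[i]);
        }
    }
}
```

In Mahout itself, f would typically come from the Functions class (e.g. Functions.PLUS for elementwise addition).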
- 10.
Views
Matrices:
Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vectors:
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();
Vector viewPart(int offset, int length);
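What makes views composable is that they write through to the backing storage, so m.viewDiagonal().assign(v) mutates m itself. A plain-Java sketch of that behavior, with arrays standing in for the Mahout types so it runs without Mahout on the classpath:

```java
// Sketch of m.viewDiagonal().assign(v): copy v onto the main
// diagonal of m, mutating m in place, as a write-through view would.
public class DiagonalViewSketch {
    static void assignDiagonal(double[][] m, double[] v) {
        int n = Math.min(Math.min(m.length, m[0].length), v.length);
        for (int i = 0; i < n; i++) {
            m[i][i] = v[i];
        }
    }
}
```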
- 13.
Examples
The trace of a matrix:
m.viewDiagonal().zSum()
Random projection:
m.times(new DenseMatrix(1000, 3).assign(new Normal()))
Low rank random matrix
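Spelled out in plain Java (no Mahout dependency), the first two examples reduce to summing the diagonal and multiplying by a matrix of Gaussian samples:

```java
import java.util.Random;

// Trace: what m.viewDiagonal().zSum() computes.
// Random projection: what m.times(new DenseMatrix(d, k).assign(new Normal()))
// computes: multiply by a d x k matrix of standard-normal samples.
public class MathExamples {
    static double trace(double[][] m) {
        double sum = 0;
        int n = Math.min(m.length, m[0].length);
        for (int i = 0; i < n; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    // Project the rows of m from d dimensions down to k.
    static double[][] randomProjection(double[][] m, int k, long seed) {
        Random rand = new Random(seed);
        int d = m[0].length;
        double[][] gauss = new double[d][k];
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < k; j++) {
                gauss[i][j] = rand.nextGaussian();
            }
        }
        double[][] out = new double[m.length][k];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < k; j++) {
                for (int l = 0; l < d; l++) {
                    out[i][j] += m[i][l] * gauss[l][j];
                }
            }
        }
        return out;
    }
}
```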
- 15.
Examples of Recommendations
Customers buying books (Linden et al.)
Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.; Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s (Veoh)
Visibility in a map UI (new Google Maps)
- 17.
Recommendation Basics
History as matrix:
(t1, t3) cooccur 2 times,
(t1, t4) once,
(t2, t4) once,
(t3, t4) once
t1 t2 t3 t4
u1 1 0 1 0
u2 1 0 1 1
u3 0 1 0 1
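Those counts are just the off-diagonal entries of AᵀA, where A is the 3×4 user-by-item history matrix above. A small plain-Java check, assuming binary history entries:

```java
// Cooccurrence counts from a binary user-by-item history matrix A:
// c[i][j] = number of users who did both item i and item j,
// i.e. the (i, j) entry of A'A.
public class Cooccurrence {
    static int[][] cooccur(int[][] a) {
        int items = a[0].length;
        int[][] c = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int[] user : a) {
                    c[i][j] += user[i] * user[j];
                }
            }
        }
        return c;
    }
}
```

With the matrix on this slide, c[0][2] is 2 for (t1, t3), and c[0][3], c[1][3], c[2][3] are each 1.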
- 18.
A Quick Simplification
Users who do h also do r:
r = Aᵀ(Ah)   (user-centric recommendations)
r = (AᵀA)h   (item-centric recommendations)
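The equivalence is just associativity of matrix multiplication: Aᵀ(Ah) and (AᵀA)h give the same recommendation vector, but evaluating left to right works user by user, while precomputing AᵀA works item by item. A plain-Java sketch that checks this on the history matrix from the previous slide:

```java
// Check that A'(Ah) == (A'A)h: same result, two evaluation orders.
public class RecommendOrdering {
    static double[] matVec(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) {
                r[i] += m[i][j] * v[j];
            }
        }
        return r;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    static double[][] matMat(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < b.length; k++) {
                for (int j = 0; j < b[0].length; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    // User-centric: r = A'(Ah).
    static double[] userCentric(double[][] a, double[] h) {
        return matVec(transpose(a), matVec(a, h));
    }

    // Item-centric: r = (A'A)h, where A'A can be precomputed offline.
    static double[] itemCentric(double[][] a, double[] h) {
        return matVec(matMat(transpose(a), a), h);
    }
}
```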
- 23.
Parallel Speedup?
[Plot: time per point (μs) versus number of threads (1–16), comparing the threaded and non-threaded versions against perfect scaling]
- 26.
Low Rank Matrix
Or should we see it differently?
Are these scaled up versions of all the same column?
1 2 5
2 4 10
10 20 50
20 40 100
- 27.
Low Rank Matrix
Matrix multiplication is designed to make this easy
We can see weighted column patterns, or weighted row patterns
All the same mathematically
[1; 2; 10; 20] × [1 2 5]
(column pattern, or weights) × (weights, or row pattern)
- 28.
Low Rank Matrix
What about here?
This is like before, but there is one exceptional value
1 2 5
2 4 10
10 100 50
20 40 100
- 29.
Low Rank Matrix
OK … add in a simple fixer-upper
[1; 2; 10; 20] × [1 2 5] + [0; 0; 10; 0] × [0 8 0]
(column pattern × weights) + (which row × exception pattern)
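To see that the fixer-upper really works, sum the two outer products and compare with the matrix from the previous slide: the exceptional 100 is 10 × 2 + 10 × 8. A plain-Java sketch:

```java
// Reconstruct the slide-28 matrix as a sum of two rank-1 terms:
// outer(column pattern, weights) + outer(which-row, exception pattern).
public class RankOneFix {
    static double[][] outer(double[] col, double[] row) {
        double[][] m = new double[col.length][row.length];
        for (int i = 0; i < col.length; i++) {
            for (int j = 0; j < row.length; j++) {
                m[i][j] = col[i] * row[j];
            }
        }
        return m;
    }

    static double[][] add(double[][] a, double[][] b) {
        double[][] c = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
        return c;
    }
}
```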
- 33.
Mahout Classifiers
Naïve Bayes
– high quality implementation
– uses idiosyncratic input format
– … but it is naïve
SGD
– sequential, not parallel
– auto-tuning has foibles
– learning rate annealing has issues
– definitely not state of the art compared to Vowpal Wabbit
Random forest
– scaling limits due to decomposition strategy
– yet another input format
– no deployment strategy
- 35.
What Mahout Isn’t
Mahout isn’t R, isn’t SAS
It doesn’t aim to do everything
It aims to scale a few problems of practical interest
The stuff that isn’t there is a feature, not a defect
- 36.
Contact:
– tdunning@maprtech.com
– @ted_dunning
– @apachemahout
– user-subscribe@mahout.apache.org
Slides and such
http://www.slideshare.net/tdunning
Hash tags: #mapr #apachemahout