Whats Right and Wrong with Apache Mahout
- 2. ©MapR Technologies 2013 - Confidential
What is Mahout?
“Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.
Components
– math library
– clustering
– classification
– decompositions
– recommendations
- 5.
What is Right and Wrong with Mahout?
Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff
All the stuff that
isn’t there
- 7.
Mahout Math
Goals:
– basic linear algebra
– statistical sampling
– good clustering
– decent speed
– extensibility
– especially for sparse data
But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature
- 8.
Matrices and Vectors
At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix
Highly composable API
Important ideas:
– view*, assign and aggregate
– iteration
m.viewDiagonal().assign(v)
- 9.
Assign
Matrices:
Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);
Vectors:
Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);
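The assign(other, f) form is the workhorse of this API: it combines each element of a vector with the corresponding element of another vector, in place. A minimal plain-Java sketch of that semantics, using java.util.function.DoubleBinaryOperator in place of Mahout's DoubleDoubleFunction so it runs without Mahout on the classpath:

```java
import java.util.function.DoubleBinaryOperator;

// Sketch of Vector.assign(Vector other, DoubleDoubleFunction f):
// v[i] = f(v[i], other[i]) for every index, mutating v in place.
public class AssignSketch {
    static void assign(double[] v, double[] other, DoubleBinaryOperator f) {
        for (int i = 0; i < v.length; i++) {
            v[i] = f.applyAsDouble(v[i], other[i]);
        }
    }
}
```

In Mahout itself, f would typically come from the Functions class (e.g. Functions.PLUS for elementwise addition).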
- 10.
Views
Matrices:
Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vectors:
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();
Vector viewPart(int offset, int length);
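What makes views composable is that they write through to the backing storage, so m.viewDiagonal().assign(v) mutates m itself. A plain-Java sketch of that behavior, with arrays standing in for the Mahout types so it runs without Mahout on the classpath:

```java
// Sketch of m.viewDiagonal().assign(v): copy v onto the main
// diagonal of m, mutating m in place, as a write-through view would.
public class DiagonalViewSketch {
    static void assignDiagonal(double[][] m, double[] v) {
        int n = Math.min(Math.min(m.length, m[0].length), v.length);
        for (int i = 0; i < n; i++) {
            m[i][i] = v[i];
        }
    }
}
```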
- 13.
Examples
The trace of a matrix:
m.viewDiagonal().zSum()
Random projection:
m.times(new DenseMatrix(1000, 3).assign(new Normal()))
Low rank random matrix
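Spelled out in plain Java (no Mahout dependency), the first two examples reduce to summing the diagonal and multiplying by a matrix of Gaussian samples:

```java
import java.util.Random;

// Trace: what m.viewDiagonal().zSum() computes.
// Random projection: what m.times(new DenseMatrix(d, k).assign(new Normal()))
// computes: multiply by a d x k matrix of standard-normal samples.
public class MathExamples {
    static double trace(double[][] m) {
        double sum = 0;
        int n = Math.min(m.length, m[0].length);
        for (int i = 0; i < n; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    // Project the rows of m from d dimensions down to k.
    static double[][] randomProjection(double[][] m, int k, long seed) {
        Random rand = new Random(seed);
        int d = m[0].length;
        double[][] gauss = new double[d][k];
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < k; j++) {
                gauss[i][j] = rand.nextGaussian();
            }
        }
        double[][] out = new double[m.length][k];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < k; j++) {
                for (int l = 0; l < d; l++) {
                    out[i][j] += m[i][l] * gauss[l][j];
                }
            }
        }
        return out;
    }
}
```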
- 15.
Examples of Recommendations
Customers buying books (Linden et al.)
Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.; Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s (Veoh)
Visibility in a map UI (new Google Maps)
- 17.
Recommendation Basics
History as matrix:
(t1, t3) cooccur 2 times,
(t1, t4) once,
(t2, t4) once,
(t3, t4) once
t1 t2 t3 t4
u1 1 0 1 0
u2 1 0 1 1
u3 0 1 0 1
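Those counts are just the off-diagonal entries of AᵀA, where A is the 3×4 user-by-item history matrix above. A small plain-Java check, assuming binary history entries:

```java
// Cooccurrence counts from a binary user-by-item history matrix A:
// c[i][j] = number of users who did both item i and item j,
// i.e. the (i, j) entry of A'A.
public class Cooccurrence {
    static int[][] cooccur(int[][] a) {
        int items = a[0].length;
        int[][] c = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int[] user : a) {
                    c[i][j] += user[i] * user[j];
                }
            }
        }
        return c;
    }
}
```

With the matrix on this slide, c[0][2] is 2 for (t1, t3), and c[0][3], c[1][3], c[2][3] are each 1.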
- 18.
A Quick Simplification
Users who do h also do r:
r = Aᵀ(Ah)   (user-centric recommendations)
r = (AᵀA)h   (item-centric recommendations)
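The equivalence is just associativity of matrix multiplication: Aᵀ(Ah) and (AᵀA)h give the same recommendation vector, but evaluating left to right works user by user, while precomputing AᵀA works item by item. A plain-Java sketch that checks this on the history matrix from the previous slide:

```java
// Check that A'(Ah) == (A'A)h: same result, two evaluation orders.
public class RecommendOrdering {
    static double[] matVec(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) {
                r[i] += m[i][j] * v[j];
            }
        }
        return r;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    static double[][] matMat(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < b.length; k++) {
                for (int j = 0; j < b[0].length; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    // User-centric: r = A'(Ah).
    static double[] userCentric(double[][] a, double[] h) {
        return matVec(transpose(a), matVec(a, h));
    }

    // Item-centric: r = (A'A)h, where A'A can be precomputed offline.
    static double[] itemCentric(double[][] a, double[] h) {
        return matVec(matMat(transpose(a), a), h);
    }
}
```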
- 23.
Parallel Speedup?
[Plot: time per point (μs) versus number of threads (1–16), comparing the threaded and non-threaded versions against perfect scaling]
- 26.
Low Rank Matrix
Or should we see it differently?
Are these scaled up versions of all the same column?
1 2 5
2 4 10
10 20 50
20 40 100
- 27.
Low Rank Matrix
Matrix multiplication is designed to make this easy
We can see weighted column patterns, or weighted row patterns
All the same mathematically
[1; 2; 10; 20] × [1 2 5]
(column pattern, or weights) × (weights, or row pattern)
- 28.
Low Rank Matrix
What about here?
This is like before, but there is one exceptional value
1 2 5
2 4 10
10 100 50
20 40 100
- 29.
Low Rank Matrix
OK … add in a simple fixer-upper
[1; 2; 10; 20] × [1 2 5] + [0; 0; 10; 0] × [0 8 0]
(column pattern × weights) + (which row × exception pattern)
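To see that the fixer-upper really works, sum the two outer products and compare with the matrix from the previous slide: the exceptional 100 is 10 × 2 + 10 × 8. A plain-Java sketch:

```java
// Reconstruct the slide-28 matrix as a sum of two rank-1 terms:
// outer(column pattern, weights) + outer(which-row, exception pattern).
public class RankOneFix {
    static double[][] outer(double[] col, double[] row) {
        double[][] m = new double[col.length][row.length];
        for (int i = 0; i < col.length; i++) {
            for (int j = 0; j < row.length; j++) {
                m[i][j] = col[i] * row[j];
            }
        }
        return m;
    }

    static double[][] add(double[][] a, double[][] b) {
        double[][] c = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
        return c;
    }
}
```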
- 33.
Mahout Classifiers
Naïve Bayes
– high quality implementation
– uses idiosyncratic input format
– … but it is naïve
SGD
– sequential, not parallel
– auto-tuning has foibles
– learning rate annealing has issues
– definitely not state of the art compared to Vowpal Wabbit
Random forest
– scaling limits due to decomposition strategy
– yet another input format
– no deployment strategy
- 35.
What Mahout Isn’t
Mahout isn’t R, isn’t SAS
It doesn’t aim to do everything
It aims to scale a few problems of practical interest
The stuff that isn’t there is a feature, not a defect
- 36.
Contact:
– tdunning@maprtech.com
– @ted_dunning
– @apachemahout
– user-subscribe@mahout.apache.org
Slides and such
http://www.slideshare.net/tdunning
Hash tags: #mapr #apachemahout