Andrew recently joined Lucidworks to head up their Advisory practice, and is a Committer and PMC member on the Apache Mahout project.
Abstract
Apache Mahout: Distributed Matrix Math for Machine Learning
Machine learning and statistics tools like R and Scikit-learn are declarative, flexible, and extensible, but they scale poorly. “Big Data” tools such as Apache Spark, Apache Flink, and H2O distribute well, but have rudimentary functionality for machine learning and are not easily extensible. In this talk we present Apache Mahout, which provides a Scala-based, R-like DSL for doing linear algebra on distributed systems, letting practitioners quickly implement algorithms on distributed matrices. We will highlight new features in version 0.13 including the hybrid CPU/GPU-optimized engine, and a new framework for user-contributed methods and algorithms similar to R’s CRAN.
We will cover some history of Mahout and introduce the R-like Scala DSL. We will give an overview of how Mahout operates on matrices distributed across multiple computers, and how it takes advantage of GPUs on each computer in a cluster to create a hybrid distributed/GPU-accelerated environment. We will then demonstrate the kinds of normally complex or unfeasible problems users can easily solve with Mahout; show an integration that allows Mahout to leverage the visualization packages of projects such as R, Python, and D3; and finally explain how to develop algorithms and submit them to the Mahout project for other users to use.
2. About Me
• Senior Director of Data Science at Lucidworks (Apache Solr/Lucene, Fusion search tools)
• Formerly Chief Data Scientist, Technical Lead of Data Science Practice at Accenture
• Committer and PMC Member, Apache Mahout
• On Twitter @akm
• Email at akm@apache.org, andrew.musselman@lucidworks.com
• Adversarial Learning podcast with @joelgrus at http://adversariallearning.com
3. Apache Mahout
Recent Trends in 0.12/0.13
• Simplify and improve performance of distributed matrix-math programming
• Provide flexible computation options for software and hardware
• Enable easier and quicker new algorithm development
• Allow polyglot programming and plotting in notebooks via Apache Zeppelin
4. Introduction to Apache Mahout
Apache Mahout is an environment for creating scalable, performant machine-learning applications
Apache Mahout provides:
• Mathematically expressive Scala DSL
• A collection of pre-canned math and statistics algorithms
• Interchangeable distributed engines
• Interchangeable native solvers (JVM, CPU, GPU, CUDA, or custom)
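The "mathematically expressive Scala DSL" can be sketched with a few in-core operations. This is a minimal, illustrative example; the `scalabindings` imports and the `dense`/`dvec` helpers follow Mahout's math-scala module, and the matrix values are made up for demonstration:

```scala
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// In-core matrices and vectors with R-like operators
val A = dense((1, 2), (3, 4))   // 2 x 2 dense matrix
val x = dvec(1, 1)              // dense vector

val y = A %*% x                 // matrix-vector product
val C = A.t %*% A               // transpose-times-self
val S = A + A                   // element-wise addition
```

The same operators carry over to distributed matrices, which is what lets practitioners prototype in-core and scale out without rewriting their math.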
5. Feature Highlights in Recent Releases
• v 0.13.1, Soon — CUDA Solvers, Apache Spark 2.1/Scala 2.11 support
• New web site platform, May 2017 — Moved from the ASF CMS system to Markdown and Jekyll; allows documentation pull requests to be merged in and published automatically
• v 0.13.0, Apr 2017 — GPU/CPU Solvers, algorithm framework
• v 0.12.2, Nov 2016 — Apache Zeppelin integration for notebooks and visualization
• v 0.12.0, Apr 2016 — Apache Flink backend support
• New Mahout book, Feb 2016 — ‘Apache Mahout: Beyond MapReduce’ by Dmitriy Lyubimov and Andrew Palumbo
• v 0.10.0, Apr 2015 — Mahout-Samsara vector-math DSL, MapReduce jobs soft-deprecated, Spark backend support
6. Topic Overview
• Mahout-Samsara: Declarative, R-like, domain-specific language (DSL) for matrix math
• Backend-agnostic programming
• Apache Zeppelin notebooks
• Algorithm development framework (modeled after scikit-learn)
• Solve on available CPU cores, single or multiple GPUs, or in the JVM
• Next steps, and how to get involved
9. Mahout-Samsara
• Mahout-Samsara is an easy-to-use domain-specific language (DSL) for large-scale machine learning on distributed systems like Apache Spark and Flink
• Uses Scala as programming/scripting environment
• Algebraic expression optimizer for distributed linear algebra
• Provides a translation layer to distributed engines
• Support for Spark RDDs and Flink DataSets
• System-agnostic, R-like DSL; actual formula from (d)spca:
val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
10. Mahout-Samsara
Example of an algebraic optimization
• Computation of A’A:
val C = A.t %*% A
• Naïve execution
• 1st pass: transpose A (requires repartitioning of A)
• 2nd pass: multiply the result with A (expensive, potentially requires repartitioning again)
• Logical optimization
• The optimizer rewrites the plan to use a logical operator for Transpose-Times-Self matrix multiplication
• Single pass: multiply partitioned rows by themselves as transposed columns
• Mahout-Samsara computes C = A’A via a row-outer-product formulation, executing in a single pass over row-partitioned A
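From user code, the rewrite is invisible: the distributed expression looks identical to the in-core one. A sketch, assuming an implicit DistributedContext is in scope (e.g. from Mahout's Spark bindings); the small matrix and partition count are illustrative:

```scala
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// A small distributed row matrix (DRM); drmParallelize requires an
// implicit DistributedContext from a backend such as the Spark bindings
val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)

// This only builds a logical plan; at optimization time the expression
// A.t %*% A is rewritten into a single-pass Transpose-Times-Self
// physical operator, so A is never physically transposed or repartitioned
val drmAtA = drmA.t %*% drmA

val inCoreAtA = drmAtA.collect   // triggers optimization and execution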
11.–14. Mahout-Samsara
[Diagram slides stepping through the single-pass row-outer-product computation of C = A’A over row-partitioned A]
18. Apache Zeppelin Notebooks
• Notebooks for polyglot programming with all types of data
• Plot with R and Python using data computed by other tools in the same notebook
• Share variables between interpreters
• For more: https://zeppelin.apache.org
• Mahout interpreter for Zeppelin released June 2016
• Post by Trevor Grant on how to use it at https://rawkintrevo.org/2016/05/19/visualizing-apache-mahout-in-r-via-apache-zeppelin-incubating
• https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/mahout-in-zeppelin/
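A Zeppelin note can compute with Mahout in one paragraph and hand the result to an R or Python paragraph for plotting via Zeppelin's resource pool. A sketch of the Mahout-side paragraph; the interpreter binding (written as %sparkMahout in the note) and the `z` ZeppelinContext object follow the linked tutorial and may differ across Zeppelin versions:

```scala
// Contents of a Zeppelin paragraph bound to the Mahout-Spark interpreter
val drmA = drmParallelize(dense((1.0, 2.0), (3.0, 4.0)))
val ata = (drmA.t %*% drmA).collect

// Share the result through Zeppelin's resource pool so an R or Python
// paragraph in the same note can read it back and plot it
z.put("ata", ata.toString)
```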
25. Algorithm Development Framework
• Patterned after R and Python (scikit-learn) APIs
• A Fitter populates a Model, which contains the parameter estimates, fit statistics, and a summary, and has a predict() method
• https://rawkintrevo.org/2017/05/02/introducing-pre-canned-algorithms-apache-mahout
• https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/contributing-algos
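The Fitter/Model pattern can be sketched with the framework's ordinary least squares regressor. Class and method names below follow the 0.13 framework as described in the linked posts (OrdinaryLeastSquares, fit, summary, predict); treat the exact signatures as illustrative:

```scala
import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
import org.apache.mahout.math.drm.DrmLike

// drmX (features) and drmY (targets) are distributed row matrices
def fitAndPredict(drmX: DrmLike[Int], drmY: DrmLike[Int]): Unit = {
  // The fitter's fit() populates a model object
  val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY)

  println(model.summary)           // parameter estimates and fit statistics
  val yHat = model.predict(drmX)   // predictions as a new DRM
}
```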
27. Solve on CPU, GPU, or JVM
Current architecture with native CPU and GPU support and unreleased jCUDA bindings
28. Solve on CPU, GPU, or JVM
Initial benchmarking on latest release
29. Solve on CPU, GPU, or JVM
Initial benchmarking on latest release
• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000, density = 0.2, with 5 runs
Mahout JVM sparse multiplication time: 1501 ms
Mahout jCUDA sparse multiplication time: 49 ms
30x speedup
• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000, density = 0.02, with 5 runs
Mahout JVM sparse multiplication time: 34 ms
Mahout jCUDA sparse multiplication time: 4 ms
8.5x speedup
• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000, density = 0.002, with 5 runs
Mahout JVM sparse multiplication time: 1 ms
Mahout jCUDA sparse multiplication time: 1 ms
No speedup
30. Solve on CPU, GPU, or JVM
• jCUDA work is still in a branch; it will be merged to master in the next couple of months
• Currently the modes of compute are JVM, CPU (using all available cores), and single GPU
• Multi-GPU support is the next priority
• Currently multiplication takes place in different solvers based on matrix shape (banding, triangularity, etc.)
• Directing the location of data and compute based on shape and density is another priority
• Watch this space for other speedups
Next steps
32. How to Use Mahout and Get Involved
Web: https://mahout.apache.org
Source code, PRs welcome: https://github.com/apache/mahout
Mailing lists: https://mahout.apache.org/community/mailing-lists.html
Download, install, embed: https://mahout.apache.org/downloads.html