Andrew recently joined Lucidworks to head up their Advisory practice, and is a Committer and PMC member on the Apache Mahout project.
Abstract
Apache Mahout: Distributed Matrix Math for Machine Learning
Machine learning and statistics tools like R and Scikit-learn are declarative, flexible, and extensible, but they scale poorly. “Big Data” tools such as Apache Spark, Apache Flink, and H2O distribute well, but have rudimentary functionality for machine learning and are not easily extensible. In this talk we present Apache Mahout, which provides a Scala-based, R-like DSL for doing linear algebra on distributed systems, letting practitioners quickly implement algorithms on distributed matrices. We will highlight new features in version 0.13 including the hybrid CPU/GPU-optimized engine, and a new framework for user-contributed methods and algorithms similar to R’s CRAN.
We will cover some history of Mahout and introduce the R-like Scala DSL. We will give an overview of how Mahout operates on matrices distributed across multiple computers, and how it takes advantage of GPUs on each computer in a cluster to create a hybrid distributed/GPU-accelerated environment. We will then demonstrate the kinds of normally complex or unfeasible problems users can easily solve with Mahout; show an integration that allows Mahout to leverage the visualization packages of projects such as R, Python, and D3; and finally explain how to develop algorithms and submit them to the Mahout project for other users to use.
2. About Me
• Senior Director of Data Science at Lucidworks (Apache Solr/Lucene, Fusion search tools)
• Formerly Chief Data Scientist, Technical Lead of Data Science Practice at Accenture
• Committer and PMC Member, Apache Mahout
• On Twitter @akm
• Email at akm@apache.org, andrew.musselman@lucidworks.com
• Adversarial Learning podcast with @joelgrus at http://adversariallearning.com
3. Apache Mahout
Recent Trends in 0.12/0.13
• Simplify and improve performance of distributed matrix-math programming
• Provide flexible computation options for software and hardware
• Enable easier and quicker new algorithm development
• Allow polyglot programming and plotting in notebooks via Apache Zeppelin
4. Introduction to Apache Mahout
Apache Mahout is an environment for creating scalable, performant machine-learning applications
Apache Mahout provides:
• Mathematically expressive Scala DSL
• A collection of pre-canned math and statistics algorithms
• Interchangeable distributed engines
• Interchangeable native solvers (JVM, CPU, GPU, CUDA, or custom)
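The "mathematically expressive Scala DSL" can be sketched with a few in-core operations. This is a minimal, illustrative example; the `scalabindings` imports and the `dense`/`dvec` helpers follow Mahout's math-scala module, and the matrix values are made up for demonstration:

```scala
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// In-core matrices and vectors with R-like operators
val A = dense((1, 2), (3, 4))   // 2 x 2 dense matrix
val x = dvec(1, 1)              // dense vector

val y = A %*% x                 // matrix-vector product
val C = A.t %*% A               // transpose-times-self
val S = A + A                   // element-wise addition
```

The same operators carry over to distributed matrices, which is what lets practitioners prototype in-core and scale out without rewriting their math.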
5. Feature Highlights in Recent Releases
• v 0.13.1, Soon — CUDA Solvers, Apache Spark 2.1/Scala 2.11 support
• New web site platform, May 2017 — Moved from the ASF CMS system to Markdown and Jekyll; allows documentation pull requests to be merged in and published automatically
• v 0.13.0, Apr 2017 — GPU/CPU Solvers, algorithm framework
• v 0.12.2, Nov 2016 — Apache Zeppelin integration for notebooks and visualization
• v 0.12.0, Apr 2016 — Apache Flink backend support
• New Mahout book, Feb 2016 — ‘Apache Mahout: Beyond MapReduce’ by Dmitriy Lyubimov and Andrew Palumbo
• v 0.10.0, Apr 2015 — Mahout-Samsara vector-math DSL, MapReduce jobs soft-deprecated, Spark backend support
6. Topic Overview
• Mahout-Samsara: Declarative, R-like, domain-specific language (DSL) for matrix math
• Backend-agnostic programming
• Apache Zeppelin notebooks
• Algorithm development framework (modeled after scikit-learn)
• Solve on available CPU cores, single or multiple GPUs, or in the JVM
• Next steps, and how to get involved
9. Mahout-Samsara
• Mahout-Samsara is an easy-to-use domain-specific language (DSL) for large-scale machine learning on distributed systems like Apache Spark and Flink
• Uses Scala as programming/scripting environment
• Algebraic expression optimizer for distributed linear algebra
• Provides a translation layer to distributed engines
• Support for Spark RDDs and Flink DataSets
• System-agnostic, R-like DSL; actual formula from (d)spca:
val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
10. Mahout-Samsara
Example of an algebraic optimization
• Computation of A’A:
val C = A.t %*% A
• Naïve execution
• 1st pass: transpose A (requires repartitioning of A)
• 2nd pass: multiply the result with A (expensive, potentially requires repartitioning again)
• Logical optimization
• The optimizer rewrites the plan to use a logical operator for Transpose-Times-Self matrix multiplication
• Single pass: multiply partitioned rows by themselves as transposed columns
• Mahout-Samsara computes C = A’A via a row-outer-product formulation, executing in a single pass over row-partitioned A
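From user code, the rewrite is invisible: the distributed expression looks identical to the in-core one. A sketch, assuming an implicit DistributedContext is in scope (e.g. from Mahout's Spark bindings); the small matrix and partition count are illustrative:

```scala
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// A small distributed row matrix (DRM); drmParallelize requires an
// implicit DistributedContext from a backend such as the Spark bindings
val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)

// This only builds a logical plan; at optimization time the expression
// A.t %*% A is rewritten into a single-pass Transpose-Times-Self
// physical operator, so A is never physically transposed or repartitioned
val drmAtA = drmA.t %*% drmA

val inCoreAtA = drmAtA.collect   // triggers optimization and execution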
11.–14. Mahout-Samsara
[Diagram slides stepping through the single-pass row-outer-product computation of C = A’A over row-partitioned A]
18. Apache Zeppelin Notebooks
• Notebooks for polyglot programming with all types of data
• Plot with R and Python using data computed by other tools in the same notebook
• Share variables between interpreters
• For more: https://zeppelin.apache.org
• Mahout interpreter for Zeppelin released June 2016
• Post by Trevor Grant on how to use it at https://rawkintrevo.org/2016/05/19/visualizing-apache-mahout-in-r-via-apache-zeppelin-incubating
• https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/mahout-in-zeppelin/
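A Zeppelin note can compute with Mahout in one paragraph and hand the result to an R or Python paragraph for plotting via Zeppelin's resource pool. A sketch of the Mahout-side paragraph; the interpreter binding (written as %sparkMahout in the note) and the `z` ZeppelinContext object follow the linked tutorial and may differ across Zeppelin versions:

```scala
// Contents of a Zeppelin paragraph bound to the Mahout-Spark interpreter
val drmA = drmParallelize(dense((1.0, 2.0), (3.0, 4.0)))
val ata = (drmA.t %*% drmA).collect

// Share the result through Zeppelin's resource pool so an R or Python
// paragraph in the same note can read it back and plot it
z.put("ata", ata.toString)
```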
25. Algorithm Development Framework
• Patterned after R and Python (scikit-learn) APIs
• A Fitter populates a Model, which contains the parameter estimates, fit statistics, and a summary, and has a predict() method
• https://rawkintrevo.org/2017/05/02/introducing-pre-canned-algorithms-apache-mahout
• https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/contributing-algos
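The Fitter/Model pattern can be sketched with the framework's ordinary least squares regressor. Class and method names below follow the 0.13 framework as described in the linked posts (OrdinaryLeastSquares, fit, summary, predict); treat the exact signatures as illustrative:

```scala
import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
import org.apache.mahout.math.drm.DrmLike

// drmX (features) and drmY (targets) are distributed row matrices
def fitAndPredict(drmX: DrmLike[Int], drmY: DrmLike[Int]): Unit = {
  // The fitter's fit() populates a model object
  val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY)

  println(model.summary)           // parameter estimates and fit statistics
  val yHat = model.predict(drmX)   // predictions as a new DRM
}
```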
27. Solve on CPU, GPU, or JVM
Current architecture with native CPU and GPU support and unreleased jCUDA bindings
28. Solve on CPU, GPU, or JVM
Initial benchmarking on latest release
29. Solve on CPU, GPU, or JVM
Initial benchmarking on latest release
• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000, density = 0.2, with 5 runs
Mahout JVM sparse multiplication time: 1501 ms
Mahout jCUDA sparse multiplication time: 49 ms
30x speedup
• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000, density = 0.02, with 5 runs
Mahout JVM sparse multiplication time: 34 ms
Mahout jCUDA sparse multiplication time: 4 ms
8.5x speedup
• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000, density = 0.002, with 5 runs
Mahout JVM sparse multiplication time: 1 ms
Mahout jCUDA sparse multiplication time: 1 ms
No speedup
30. Solve on CPU, GPU, or JVM
• jCUDA work is still in a branch; it will be merged to master in the next couple of months
• Currently the modes of compute are JVM, CPU (using all available cores), and single GPU
• Multi-GPU support is the next priority
• Currently multiplication takes place in different solvers based on matrix shape (banding, triangularity, etc.)
• Directing the location of data and compute based on shape and density is another priority
• Watch this space for other speedups
Next steps
32. How to Use Mahout and Get Involved
Web: https://mahout.apache.org
Source code, PRs welcome: https://github.com/apache/mahout
Mailing lists: https://mahout.apache.org/community/mailing-lists.html
Download, install, embed: https://mahout.apache.org/downloads.html