This deck is part of an introductory course on Big Data tools for Artificial Intelligence. These final slides survey some of the most important AI and machine learning layers built on top of Big Data platforms.
Artificial Intelligence Layer: Mahout, MLLib, and other projects
1. Artificial Intelligence Layer
Mahout, MLLib, & other projects
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
2. So far...
➢ Core technologies like a distributed file system (DFS) and MapReduce (MR) (e.g., Hadoop)
➢ ETL for transforming data (e.g., Pig)
➢ Alternative core/ETL technology (e.g., Spark)
➢ Now we can build AI tools from scratch
3. Can I save some work with existing code?
4. Actually, we can...
➢ Write some UDF wrappers for Weka in Pig/Spark
➢ Use connectors to R and Python
➢ Parallelize the execution of multiple non-distributed algorithms
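The last option, running several non-distributed algorithms side by side, can be sketched in plain Scala with Futures. This is only an illustration under stated assumptions: `trainLocal` is a hypothetical stand-in for any single-machine learner (for example, a Weka call), and the names `ParallelRuns` and `evaluateAll` are not taken from the course material.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelRuns {
  // Hypothetical stand-in for a single-machine learner: it evaluates a
  // threshold classifier and returns its accuracy on the given data.
  def trainLocal(threshold: Double, data: Seq[(Double, Boolean)]): Double = {
    val correct = data.count { case (x, label) => (x > threshold) == label }
    correct.toDouble / data.size
  }

  // Launch one independent, non-distributed job per parameter value
  // and wait until all of them have finished.
  def evaluateAll(thresholds: Seq[Double],
                  data: Seq[(Double, Boolean)]): Map[Double, Double] = {
    val jobs = thresholds.map(t => Future(t -> trainLocal(t, data)))
    Await.result(Future.sequence(jobs), 1.minute).toMap
  }
}
```

Each job still has to fit in the memory of a single machine; only the set of jobs runs in parallel.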
5. But we can do better!
➢ The previous approaches are still problematic if problem instances are very big
➢ They are not really parallel algorithms
➢ Use parallel algorithms to tackle big problems:
○ Apache Mahout
○ Apache Spark
6. Apache Mahout
➢ Collection of parallel AI & ML algorithms
➢ Originally MapReduce algorithms, now moving to Spark
➢ Latest major release: Mahout 0.9 (February 2014)
http://mahout.apache.org/
8. Apache Mahout: Algorithms
➢ Dimensionality reduction:
○ Singular Value Decomposition (parallel)
○ PCA (parallel)
○ Lanczos decomposition (parallel)
○ QR decomposition (parallel)
➢ Text algorithms:
○ TF-IDF (parallel)
9. Mahout from shell
➢ Just type mahout in the shell
➢ A list of the available algorithms will be printed
➢ Typing mahout algorithm_name will print the help for that specific algorithm
➢ Executing distributed algorithms requires Hadoop and a DFS
10. Mahout example: Item-based Collaborative filtering
➢ mahout recommenditembased:
○ --input: file with user_id item_id rows that represent purchases
○ --output: where mahout should store the results
○ --usersFile: the users we should compute recommendations for
○ --itemsFile: the items we are allowed to recommend
○ -b (--booleanData): true (in our case, binary data)
○ --similarityClassname: SIMILARITY_LOGLIKELIHOOD or SIMILARITY_TANIMOTO_COEFFICIENT (in our case, binary data)
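To make the Tanimoto option more concrete, here is a minimal single-machine sketch of the coefficient on binary purchase data. It is not Mahout's implementation: the object name, the helper functions, and the assumption that rows are tab-separated `user_id item_id` pairs are all illustrative.

```scala
object TanimotoSketch {
  // Parse "user_id<TAB>item_id" rows into item -> set of users who bought it.
  def usersByItem(rows: Seq[String]): Map[String, Set[String]] =
    rows.map(_.split("\t"))
      .collect { case Array(user, item) => item -> user }
      .groupBy(_._1)
      .map { case (item, pairs) => item -> pairs.map(_._2).toSet }

  // Tanimoto (Jaccard) coefficient of two buyer sets: |A ∩ B| / |A ∪ B|.
  def tanimoto(a: Set[String], b: Set[String]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }
}
```

For the hypothetical rows `u1 i1`, `u2 i1`, `u2 i2`, `u3 i2`, items `i1` and `i2` share one buyer out of three distinct buyers, so their similarity is 1/3.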
11. Mahout example: Item-based Collaborative filtering
➢ Execute:
mahout recommenditembased \
  --input data/purchases_mahout.tsv \
  --output mahout_cf \
  --usersFile data/users_mahout.tsv \
  --itemsFile data/valid_products_mahout.tsv \
  --booleanData \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD
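Once the job finishes, the recommendations have to be read back from the output directory. The sketch below parses one result line, assuming the format is `userId<TAB>[itemId:score,itemId:score,...]`; that format is an assumption that should be checked against the actual files in mahout_cf, and `MahoutOutputSketch` and `parseLine` are hypothetical names.

```scala
object MahoutOutputSketch {
  // Parse one line of the assumed form "userId<TAB>[itemId:score,...]"
  // into the user id and its list of (item id, score) recommendations.
  def parseLine(line: String): (String, Seq[(String, Double)]) =
    line.split("\t", 2) match {
      case Array(user, items) =>
        val recs = items.trim.stripPrefix("[").stripSuffix("]")
          .split(",")
          .filter(_.nonEmpty) // an empty list "[]" yields no recommendations
          .map { pair =>
            pair.split(":") match {
              case Array(item, score) => item -> score.toDouble
              case _ => throw new IllegalArgumentException(s"Bad pair: $pair")
            }
          }
        user -> recs.toSeq
      case _ => throw new IllegalArgumentException(s"Bad line: $line")
    }
}
```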
12. MLLib
➢ Machine learning library inside Spark
➢ Completely distributed
➢ It is bundled with Spark!
13. MLLib: Algorithms
➢ Classification & Regression:
○ Support Vector Machines
○ Logistic Regression
○ Linear Regression
○ Random Forests
➢ Clustering:
○ K-means
14. MLLib: Algorithms
➢ Dimensionality reduction:
○ Singular Value Decomposition
○ PCA
➢ Collaborative filtering:
○ ALS (alternating least squares) matrix factorization recommender
15. MLLib: K-Means example
➢ Let us apply K-means on the iris data set
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val numClusters = 3
val numIteration = 20
// Read the CSV file and split every line into its fields (assumes the file has no header row)
val data_iris = sc.textFile("hdfs:///user/sanguix/data/iris.csv").map(l => l.split(",", -1))
// Keep the four numeric features of each row as a dense vector and cache the RDD
val parsedData = data_iris.map(r => Vectors.dense(Array(r(0).toDouble,
  r(1).toDouble, r(2).toDouble, r(3).toDouble))).cache()
// Train K-means with 3 clusters and at most 20 iterations
val clusters = KMeans.train(parsedData, numClusters, numIteration)
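To see what one K-means iteration does, here is a single-machine sketch of the assignment step, the update step, and the within-cluster sum of squared errors. It is not MLLib's distributed implementation; `KMeansSketch` and its function names are hypothetical and serve only as illustration.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the center closest to p.
  def nearest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // Update step: each center becomes the mean of the points assigned to it.
  def update(centers: Seq[Point], points: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => nearest(centers, p))
    centers.indices.map { i =>
      groups.get(i) match {
        case Some(ps) =>
          Array.tabulate(centers(i).length)(d => ps.map(_(d)).sum / ps.size)
        case None => centers(i) // an empty cluster keeps its old center
      }
    }
  }

  // Within-cluster sum of squared errors, the quantity K-means tries to reduce.
  def cost(centers: Seq[Point], points: Seq[Point]): Double =
    points.map(p => sqDist(centers(nearest(centers, p)), p)).sum
}
```

Repeating `update` until the centers stop moving (or a maximum number of iterations is reached) gives the basic algorithm; the cost never increases from one iteration to the next.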
16. Other projects: GraphX
➢ Spark built-in library for graphs
➢ Algorithms:
○ PageRank
○ (Strongly) connected components
○ Label propagation
○ Other basic graph operations
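As an illustration of the first algorithm in the list, the following is a minimal single-machine sketch of the textbook PageRank iteration. It is not GraphX's implementation: the object name, the damping value of 0.85, and the assumption that every vertex has at least one outgoing edge (otherwise rank mass is lost) are choices made for this example.

```scala
object PageRankSketch {
  // edges: (source, destination) pairs.
  // Returns vertex -> rank after the given number of iterations.
  def pageRank(edges: Seq[(Int, Int)],
               iterations: Int,
               damping: Double = 0.85): Map[Int, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (s, es) => s -> es.size }
    // Start from a uniform distribution over the vertices.
    var ranks = vertices.map(v => v -> 1.0 / vertices.size).toMap
    for (_ <- 1 to iterations) {
      // Each vertex sends rank / outDegree along every outgoing edge.
      val contribs = edges
        .map { case (s, d) => d -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (d, cs) => d -> cs.map(_._2).sum }
      ranks = vertices.map { v =>
        v -> ((1 - damping) / vertices.size + damping * contribs.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }
}
```

On a directed cycle of three vertices every vertex keeps rank 1/3, which is a quick sanity check for the update rule.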
17. Other projects: Giraph
➢ Graph framework over Hadoop
➢ Specialized for building algorithms for graphs
➢ Latest major release: Giraph 1.1.0 (Nov. 2014)
http://giraph.apache.org/
18. Other projects: GraphLab
➢ Distributed framework for machine learning
➢ Originally created at Carnegie Mellon
➢ Algorithms:
○ Collaborative filtering
○ Text analysis
○ PageRank
○ Deep learning
➢ Latest release: GraphLab 2.2 (July 2013)
https://github.com/graphlab-code/graphlab
19. Other projects: H2O
➢ ML library on top of Hadoop/Spark
➢ Algorithms:
○ Random Forests
○ Generalized Linear Model
○ Deep learning
○ K-Means
➢ Latest release: H2O 2.8.4.4 (February 2015)
https://github.com/h2oai/h2o-dev
20. Other projects: Apache Flink
➢ Large scale data processing engine in Java/Scala
➢ In-memory collections
➢ Latest release: Flink 0.8.0 (January 2015)
http://flink.apache.org/
21. Extra information
➢ Mahout in Action. Sean Owen et al. Ed. Manning Publications (2011)
➢ Apache Mahout Cookbook. Piero Giacomelli. Ed. Packt Publishing (2013)
➢ StackOverflow