1. Apache Mahout – Tutorial (2015)
Cataldo Musto, Ph.D.
Course on Intelligent Information Access and Natural Language Processing
Università degli Studi di Bari – Dipartimento di Informatica – A.A. 2014/2015
07/01/2015 1
2. Outline
• What is Mahout?
– Overview
• How to use Mahout?
– Hands-on session
2
5. What is (a) Mahout?
• Mahout is a Java library
– Implementing Machine Learning techniques
• Clustering
• Classification
• Recommendation
• Frequent ItemSet (removed)
7
8. What can we do?
• Currently Mahout supports mainly three use
cases:
– Recommendation - takes users' behavior and tries
to find items users might like.
– Clustering - takes e.g. text documents and groups
them into sets of topically related documents.
– Classification - learns from existing categorized
documents what documents of a specific category
look like, and is able to assign unlabelled
documents to the (hopefully) correct category.
8
9. Why Mahout?
• Mahout is not the only ML framework:
– Weka (http://www.cs.waikato.ac.nz/ml/weka/)
– R (http://www.r-project.org/)
• Why do we prefer Mahout?
– Apache License
– Good Community
– Good Documentation
– Scalable
• Based on Hadoop (not mandatory!)
12
13. Why do we need a scalable framework?
Big Data!
13
14. When do we need a scalable framework?
(e.g. Recommendation Task)
Over 100M
user-preference connections
14
http://mahout.apache.org/users/recommender/recommender-first-timer-faq.html
21. Algorithms
• Classification
– Logistic Regression
– Bayes (only on Hadoop, from release 0.9)
– Random Forests (only on Hadoop, from release 0.9)
– Hidden Markov Models
– Perceptrons
21
22. Algorithms
• Other
– Text processing
• Creation of sparse vectors from text
– Dimensionality Reduction techniques
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
– (and much more.)
22
34. Recommendation
• Mahout implements a Collaborative Filtering
framework
– Uses historical data (ratings, clicks, and purchases) to
provide recommendations
• User-based: recommend items by finding similar users. This is
often harder to scale because of the dynamic nature of users;
• Item-based: calculate similarity between items and make
recommendations. Items usually don't change much, so this often
can be computed offline;
– Popularized by Amazon and others
• Matrix factorization-based: split the original user-item matrix into
smaller matrices in order to analyze item rating patterns and learn
latent factors that explain users’ behavior and item
characteristics.
– Popularized by the Netflix Prize
34
36. Recommendation Workflow
Inceptive Idea:
A Java/J2EE application
invokes a Mahout
Recommender whose
DataModel is based on a set
of User Preferences that are
built on top of a
physical DataStore
36
41. Recommendation in Mahout
• Input: raw data (user preferences)
• Output: preferences estimation
• Step 1
– Mapping raw data into a Mahout-compliant DataModel
• Step 2
– Tuning recommender components
• Similarity measure, neighborhood, etc.
• Step 3
– Computing rating estimations
• Step 4
– Evaluating recommendation
41
42. Recommendation Components
• Mahout’s key abstractions are implemented through
Java interfaces:
– DataModel interface
• Methods for mapping raw data to a Mahout-compliant form
– UserSimilarity interface
• Methods to calculate the degree of correlation between two users
– ItemSimilarity interface
• Methods to calculate the degree of correlation between two items
– UserNeighborhood interface
• Methods to define the concept of ‘neighborhood’
– Recommender interface
• Methods to implement the recommendation step itself
42
43. Recommendation Components
• Mahout’s key abstractions are implemented
through Java interfaces:
– example: DataModel interface
• Each class that maps raw data to a Mahout-compliant
form is an implementation of the generic interface
• e.g. MySQLJDBCDataModel feeds a DataModel from a
MySQL database
• (and so on)
43
44. Components: DataModel
• A DataModel is the interface for drawing information about user
preferences.
• Which sources can it draw from?
– Database
• MySQLJDBCDataModel, PostgreSQLDataModel
• NoSQL databases supported: MongoDBDataModel, CassandraDataModel
– External Files
• FileDataModel
– Generic (preferences fed directly through Java code)
• GenericDataModel
(They are all implementations of the DataModel interface)
44
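As a minimal sketch of the FileDataModel source: it expects a plain CSV with one preference per line in the form userID,itemID,value. The class name and the data below are made up for illustration; the file name intro.csv follows the later exercises.

```java
import java.io.File;
import java.io.PrintWriter;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

class FileDataModelSketch {
    public static void main(String[] args) throws Exception {
        // FileDataModel expects one preference per line: userID,itemID,value
        File csv = new File("intro.csv");        // hypothetical file
        try (PrintWriter out = new PrintWriter(csv)) {
            out.println("1,101,3.0");
            out.println("1,102,4.5");
            out.println("2,101,2.0");
        }
        DataModel model = new FileDataModel(csv);
        System.out.println(model.getNumUsers() + " users, "
                + model.getNumItems() + " items");
    }
}
```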
47. Components: DataModel
• Regardless of the source, they all share a
common implementation.
• Basic object: Preference
– Preference is a triple (user,item,score)
– Stored in UserPreferenceArray
47
48. Components: DataModel
• Basic object: Preference
– Preference is a triple (user,item,score)
– Stored in UserPreferenceArray
• Two implementations
– GenericUserPreferenceArray
• Stores numerical preference values.
– BooleanUserPreferenceArray
• Skips numerical preference values (presence only).
48
49. Components: UserSimilarity
• UserSimilarity defines a notion of similarity between two
Users.
– (respectively) ItemSimilarity defines a notion of similarity
between two Items.
• Which definitions of similarity are available?
– Pearson Correlation
– Spearman Correlation
– Euclidean Distance
– Tanimoto Coefficient
– LogLikelihood Similarity
– Already implemented!
49
56. Components: UserNeighborhood
• Which definitions of neighborhood are available?
– Nearest N users
• The first N users with the highest similarity are labeled as ‘neighbors’
– Thresholds
• Users whose similarity is above a threshold are labeled as ‘neighbors’
– Already implemented!
56
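The two neighborhood definitions above can be sketched as follows. This is a made-up example (the class name, the preference data, N = 2, and the 0.5 threshold are all assumptions for illustration):

```java
import java.io.File;
import java.io.PrintWriter;
import java.util.Arrays;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class NeighborhoodSketch {
    public static void main(String[] args) throws Exception {
        // Write a tiny made-up preference file (userID,itemID,value)
        File csv = new File("neigh.csv");
        try (PrintWriter out = new PrintWriter(csv)) {
            out.println("1,101,5.0"); out.println("1,102,3.0");
            out.println("2,101,4.5"); out.println("2,102,2.5");
            out.println("3,101,1.0"); out.println("3,102,5.0");
        }
        DataModel model = new FileDataModel(csv);
        UserSimilarity similarity = new EuclideanDistanceSimilarity(model);
        // Nearest-N: the N users with the highest similarity are 'neighbors'
        UserNeighborhood nearest = new NearestNUserNeighborhood(2, similarity, model);
        // Threshold: every user whose similarity is above 0.5 is a 'neighbor'
        UserNeighborhood threshold = new ThresholdUserNeighborhood(0.5, similarity, model);
        System.out.println("Nearest-2 neighbors of user 1: "
                + Arrays.toString(nearest.getUserNeighborhood(1)));
        System.out.println("Threshold neighbors of user 1: "
                + Arrays.toString(threshold.getUserNeighborhood(1)));
    }
}
```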
57. Components: Recommender
• Given a DataModel, a definition of similarity between
users (items) and a definition of neighborhood, a
recommender produces as output an estimation of
relevance for each unseen item
• Which recommendation algorithms are implemented?
– User-based CF
– Item-based CF
– SVD-based CF
(and much more…)
57
58. Recap
• Many implementations of a CF-based recommender!
– Different recommendation algorithms
– Different neighborhood definitions
– Different similarity definitions
• Evaluating the different implementations is
very time-consuming
– The strength of Mahout is that it saves
time when evaluating the different combinations of the
parameters!
– Standard interface for the evaluation of a Recommender
System
58
59. Evaluation
• Mahout provides classes for the evaluation of
a recommender system
– Prediction-based measures
• MAE (Mean Absolute Error)
• RMSE (Root Mean Square Error)
– IR-based measures
• Precision, Recall, F1-measure, F1@n
• NDCG (ranking measure)
59
60. Evaluation
• Prediction-based Measures
– Class: AverageAbsoluteDifferenceEvaluator
– Method: evaluate()
– Parameters:
• Recommender implementation
• DataModel implementation
• TrainingSet size (e.g. 70%)
• % of the data to use in the evaluation (smaller % for
fast prototyping)
60
61. Evaluation
• IR-based Measures
– Class: GenericRecommenderIRStatsEvaluator
– Method: evaluate()
– Parameters:
• Recommender implementation
• DataModel implementation
• Relevance Threshold (mean+standard deviation)
• % of the data to use in the evaluation (smaller % for
fast prototyping)
61
63. Download Mahout
• Download
– The latest Mahout release is 0.9
– Available at:
http://archive.apache.org/dist/mahout/0.9/mahout-distribution-0.9.zip
– Extract all the libraries and include them in a new
NetBeans (Eclipse) project
• Requirement: Java 1.6.x or greater.
• Hadoop is not mandatory!
63
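Alternatively, for Maven-based projects, the 0.9 release is also available from Maven Central instead of adding the jars by hand (a sketch; mahout-core is the artifact used by the recommendation exercises):

```xml
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.9</version>
</dependency>
```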
65. Exercise 1
• Create a Preference object
• Set preferences through some simple Java call
• Print some statistics about preferences
– How many preferences? On which items?
– Whether a user has expressed a preference for a
certain item.
– Which one is the item with the highest score?
65
66. Hints
• Hints about objects to be used:
– Preference
• Methods: setUserID(), setItemID(), setValue();
– GenericUserPreferenceArray
• Dimension = number of preferences to be defined;
• Methods: getIDs(),
sortByValueReversed(), hasPrefWithItemID(id);
66
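A possible sketch of Exercise 1 using these hints. The class name, IDs, and scores are made up; one user with two preferences:

```java
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

class Exercise1Sketch {
    // Builds a PreferenceArray holding two (user, item, score) triples for user 1
    static PreferenceArray buildPrefs() {
        PreferenceArray prefs = new GenericUserPreferenceArray(2);
        prefs.setUserID(0, 1L);          // sets the user ID for the whole array
        prefs.setItemID(0, 101L);
        prefs.setValue(0, 3.0f);
        prefs.setItemID(1, 102L);
        prefs.setValue(1, 4.5f);
        return prefs;
    }

    public static void main(String[] args) {
        PreferenceArray prefs = buildPrefs();
        System.out.println("How many preferences? " + prefs.length());
        System.out.println("Preference on item 101? " + prefs.hasPrefWithItemID(101L));
        prefs.sortByValueReversed();     // descending by score
        Preference top = prefs.get(0);
        System.out.println("Highest-scored item: " + top.getItemID());
    }
}
```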
69. Exercise 2
• Create a DataModel
• Feed the DataModel through some simple
Java calls
• Print some statistics about data (how many
users, how many items, maximum ratings,
etc.)
69
70. Exercise 2
• Hints about objects to be used:
– FastByIDMap
• PreferenceArray stores the preferences of a single user
• Where are the preferences of all the users stored?
– A HashMap? No.
– Mahout introduces data structures optimized for
recommendation tasks
– HashMaps are replaced by FastByIDMap
– DataModel
• Methods: getNumItems(),
getNumUsers(), getMaxPreference()
• General statistics about the model.
70
71. Exercise 2: data model
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
class CreateGenericDataModel {
  private CreateGenericDataModel() {
  }
  public static void main(String[] args) {
    // FastByIDMap maps each user ID to that user's preferences
    FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();
    PreferenceArray prefsForUser1 = new GenericUserPreferenceArray(10);
    prefsForUser1.setUserID(0, 1L);
    prefsForUser1.setItemID(0, 101L);
    prefsForUser1.setValue(0, 3.0f);
    prefsForUser1.setItemID(1, 102L);
    prefsForUser1.setValue(1, 4.5f);
    preferences.put(1L, prefsForUser1);
    DataModel model = new GenericDataModel(preferences);
    System.out.println(model);
  }
}
71
72. Exercise 3
• Create a DataModel
• Feed the DataModel through a CSV file
• Calculate similarities between users
– CSV file should contain enough data!
72
73. Exercise 3
• Hints about objects to be used:
– FileDataModel
• Argument: new File with the path of the CSV
– PearsonCorrelationSimilarity,
TanimotoCoefficientSimilarity, etc.
• Argument: the model
73
74. Exercise 3: similarity
import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
class Example3_Similarity {
  public static void main(String[] args) throws Exception {
    // Instantiate the DataModel (FileDataModel)
    DataModel model = new FileDataModel(new File("intro.csv"));
    // Similarity definitions
    UserSimilarity pearson = new PearsonCorrelationSimilarity(model);
    UserSimilarity euclidean = new EuclideanDistanceSimilarity(model);
    // Output
    System.out.println("Pearson: " + pearson.userSimilarity(1, 2));
    System.out.println("Euclidean: " + euclidean.userSimilarity(1, 2));
    System.out.println("Pearson: " + pearson.userSimilarity(1, 3));
    System.out.println("Euclidean: " + euclidean.userSimilarity(1, 3));
  }
}
77
78. Exercise 4
• Create a DataModel
• Feed the DataModel through a CSV file
• Calculate similarities between users
– CSV file should contain enough data!
• Generate neighborhoods
• Generate recommendations
– Compare different combinations of parameters!
80
81. Exercise 4
• Hints about objects to be used:
– NearestNUserNeighborhood
– GenericUserBasedRecommender
• Parameters:
– data model already shown
– Neighborhood
» Class: NearestNUserNeighborhood(n,similarity,model)
» Class: ThresholdUserNeighborhood(thr,similarity,model)
– similarity measure already shown
81
82. Exercise 4: First Recommender
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.*;
class RecommenderIntro {
  private RecommenderIntro() {
  }
  public static void main(String[] args) throws Exception {
    // FileDataModel
    DataModel model = new FileDataModel(new File("intro.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // 2 neighbours
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(
        model, neighborhood, similarity);
    // Top-1 recommendation for User 1
    List<RecommendedItem> recommendations = recommender.recommend(1, 1);
    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}
85
86. • Download the GroupLens dataset (100k)
– Its format is already Mahout compliant
– http://files.grouplens.org/datasets/movielens/ml-100k.zip
• Preparatory Exercise: repeat exercise 3 (similarity
calculations) with a bigger dataset
• Next: now we can run the recommendation
framework against a state-of-the-art dataset
Exercise 5: MovieLens Recommender
86
87. Exercise 5: MovieLens Recommender
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.*;
class RecommenderIntro {
  private RecommenderIntro() {
  }
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ua.base"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(
        model, neighborhood, similarity);
    List<RecommendedItem> recommendations = recommender.recommend(1, 20);
    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}
87
88. Exercise 5: MovieLens Recommender
• Same program, different parameters:
recommender.recommend(10, 50) produces the top-50
recommendations for user 10
• We can play with parameters!
88
89. Exercise 5: MovieLens Recommender
• Analyze Recommender behavior with different
combinations of parameters
– Do the recommendations change with a different
similarity measure?
– Do the recommendations change with different
neighborhood sizes?
– Which one is the best one?
• …. Let’s go to the next exercise!
89
90. Exercise 6: Recommender Evaluation
• Evaluate different CF recommender
configurations on MovieLens data
• Metrics: RMSE, MAE
• Hints: useful classes
– Implementations of the RecommenderEvaluator
interface
• AverageAbsoluteDifferenceRecommenderEvaluator
• RMSRecommenderEvaluator
91
92. Exercise 6: Recommender Evaluation
• Further Hints:
– Use RandomUtils.useTestSeed() to ensure
consistency among different evaluation runs
– Invoke the evaluate() method
• Parameters
– RecommenderBuilder: recommender instance (as in previous
exercises)
– DataModelBuilder: specific criterion for training
– Split Training-Test: double value (e.g. 0.7 for 70%)
– Amount of data to use in the evaluation: double value (e.g. 1.0
for 100%)
92
93. Example 6: evaluation
import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.common.RandomUtils;
class EvaluatorIntro {
  private EvaluatorIntro() {
  }
  public static void main(String[] args) throws Exception {
    // Ensures consistency between different evaluation runs
    RandomUtils.useTestSeed();
    DataModel model = new FileDataModel(new File("ua.base"));
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    // Build the same recommender for testing that we did last time:
    RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
        // The recommendation engine under evaluation
        return new GenericUserBasedRecommender(model, neighborhood, similarity);
      }
    };
    // 70% training, whole dataset used in the evaluation
    double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
    System.out.println(score);
  }
}
96
97. Exercise 6: evaluation
// We can add more measures: RMSRecommenderEvaluator computes the RMSE
class EvaluatorIntro {
  private EvaluatorIntro() {
  }
  public static void main(String[] args) throws Exception {
    RandomUtils.useTestSeed();
    DataModel model = new FileDataModel(new File("ua.base"));
    RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    RecommenderEvaluator rmseEvaluator = new RMSRecommenderEvaluator();
    // Build the same recommender for testing that we did last time:
    RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
        return new GenericUserBasedRecommender(model, neighborhood, similarity);
      }
    };
    double mae = maeEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
    double rmse = rmseEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
    System.out.println(mae);
    System.out.println(rmse);
  }
}
98
99. Exercise 7: item-based recommender
• Mahout provides Java classes for building an
item-based recommender system
– Amazon-like
– Recommendations are based on similarities
among items (generally pre-computed offline)
– Evaluate it with the MovieLens dataset!
99
100. Exercise 7: item-based recommender
class IREvaluatorIntro {
  private IREvaluatorIntro() {
  }
  public static void main(String[] args) throws Exception {
    RandomUtils.useTestSeed();
    DataModel model = new FileDataModel(new File("ua.base"));
    RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    RecommenderEvaluator rmseEvaluator = new RMSRecommenderEvaluator();
    // Build the same recommender for testing that we did last time:
    RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel model) throws TasteException {
        // ItemSimilarity: PearsonCorrelationSimilarity implements it as well
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // No neighborhood definition for item-based recommenders
        return new GenericItemBasedRecommender(model, similarity);
      }
    };
    double mae = maeEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
    double rmse = rmseEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
    System.out.println(mae);
    System.out.println(rmse);
  }
}
102
103. Exercise 8: MF-based recommender
• Mahout provides Java classes for building a CF
recommender system based on state-of-the-art
matrix factorization techniques
– Class: SVDRecommender
• Parameters: DataModel, Factorizer
• Factorizer: a factorization algorithm
– Alternating Least Squares (ALSWRFactorizer)
– SVD++ (SVDPlusPlusFactorizer)
– Stochastic Gradient Descent (ParallelSGDFactorizer)… etc
– Several parameters to tune!
– Evaluate it with the MovieLens dataset!
103
104. class IREvaluatorIntro {
  private IREvaluatorIntro() {
  }
  public static void main(String[] args) throws Exception {
    RandomUtils.useTestSeed();
    DataModel model = new FileDataModel(new File("ua.base"));
    RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    RecommenderEvaluator rmseEvaluator = new RMSEEvaluator();
    // Build the same recommender for testing that we did last time:
    RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel model) throws TasteException {
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 10, 0.065, 60);
        return new SVDRecommender(model, factorizer);
      }
    };
    double mae = maeEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
    double rmse = rmseEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
    System.out.println(mae);
    System.out.println(rmse);
  }
}
104
Exercise 8: MF-based recommender
105. class IREvaluatorIntro {
  private IREvaluatorIntro() {
  }
  public static void main(String[] args) throws Exception {
    RandomUtils.useTestSeed();
    DataModel model = new FileDataModel(new File("ua.base"));
    RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    RecommenderEvaluator rmseEvaluator = new RMSEEvaluator();
    // Build the same recommender for testing that we did last time:
    RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel model) throws TasteException {
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 10, 0.065, 60);
        return new SVDRecommender(model, factorizer);
      }
    };
    double mae = maeEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
    double rmse = rmseEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
    System.out.println(mae);
    System.out.println(rmse);
  }
}
105
Exercise 8: MF-based recommender
Hyperparameters: latent factors,
lambda, iterations
106. Mahout Strengths
• Fast prototyping and evaluation
– To evaluate a different configuration of the same
algorithm, we just need to update a parameter and
run again.
– Examples:
• Different neighborhood sizes
• Different similarity measures, etc.
106
107. 5 minutes to look for the
best configuration
107
108. • Evaluation of CF algorithms through IR
measures
• Metrics: Precision, Recall
Exercise 9: Recommender Evaluation
108
109. • Evaluation of CF algorithms through IR
measures
• Metrics: Precision, Recall
• Hints: useful classes
– GenericRecommenderIRStatsEvaluator
– evaluate() method
• Same parameters as exercises 6 and 7
Exercise 9: Recommender Evaluation
109
110. class IREvaluatorIntro {
private IREvaluatorIntro() {
}
public static void main(String[] args) throws Exception {
RandomUtils.useTestSeed();
DataModel model = new FileDataModel(new File("ua.base"));
RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
// Build the same recommender for testing that we did last time:
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
@Override
public Recommender buildRecommender(DataModel model) throws TasteException {
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
return new GenericUserBasedRecommender(model, neighborhood, similarity);
}
};
IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 5,
GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
System.out.println(stats.getF1());
}
}
Exercise 9: IR-based evaluation
Precision@5, Recall@5, etc.
110
112. class IREvaluatorIntro {
private IREvaluatorIntro() {
}
public static void main(String[] args) throws Exception {
RandomUtils.useTestSeed();
DataModel model = new FileDataModel(new File("ua.base"));
RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
// Build the same recommender for testing that we did last time:
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
@Override
public Recommender buildRecommender(DataModel model) throws TasteException {
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(500, similarity, model);
return new GenericUserBasedRecommender(model, neighborhood, similarity);
}
};
IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 5,
GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
System.out.println(stats.getF1());
}
}
Exercise 9: IR-based evaluation
Set Neighborhood to 500
112
113. class IREvaluatorIntro {
private IREvaluatorIntro() {
}
public static void main(String[] args) throws Exception {
RandomUtils.useTestSeed();
DataModel model = new FileDataModel(new File("ua.base"));
RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
// Build the same recommender for testing that we did last time:
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
@Override
public Recommender buildRecommender(DataModel model) throws TasteException {
UserSimilarity similarity = new EuclideanDistanceSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(500, similarity, model);
return new GenericUserBasedRecommender(model, neighborhood, similarity);
}
};
IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 5,
GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
System.out.println(stats.getF1());
}
}
Exercise 9: IR-based evaluation
Set Euclidean Distance
113
114. • Write a class that automatically runs the
evaluation with different parameters
– e.g. iterate over neighborhood sizes taken from an
array of values
– Print the best score and the corresponding configuration
Exercise 10: Recommender Evaluation
114
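The exercise above can be sketched as follows, reusing the Mahout classes and the ua.base file from the previous exercises and assuming the same imports (the array of candidate sizes is an arbitrary choice; with MAE, lower is better):

```java
// Sketch: grid search over neighborhood sizes (assumes Mahout on the classpath
// and ua.base in the working directory, as in the previous exercises).
class NeighborhoodGridSearch {
  public static void main(String[] args) throws Exception {
    RandomUtils.useTestSeed();
    DataModel model = new FileDataModel(new File("ua.base"));
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    int[] sizes = {10, 50, 100, 200, 500};   // arbitrary candidate values
    int bestSize = -1;
    double bestScore = Double.MAX_VALUE;     // MAE: lower is better
    for (final int n : sizes) {
      RecommenderBuilder builder = new RecommenderBuilder() {
        @Override
        public Recommender buildRecommender(DataModel model) throws TasteException {
          UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
          UserNeighborhood neighborhood = new NearestNUserNeighborhood(n, similarity, model);
          return new GenericUserBasedRecommender(model, neighborhood, similarity);
        }
      };
      double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);
      System.out.println("n=" + n + " -> MAE=" + score);
      if (score < bestScore) { bestScore = score; bestSize = n; }
    }
    System.out.println("Best: n=" + bestSize + " (MAE=" + bestScore + ")");
  }
}
```

The same loop can be extended to iterate over similarity measures as well.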
115. • Find the best configuration for several
datasets
– Download datasets from
http://mahout.apache.org/users/basics/collections.html
– Write classes to transform input data into a
Mahout-compliant format
– Extend exercise 10!
Exercise 11: more datasets!
115
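As a starting point for the transformation step, a minimal sketch (the tab-separated input layout and the class name are assumptions; adapt them to the actual dataset) that rewrites `userID<TAB>itemID<TAB>rating` lines into the comma-separated `userID,itemID,rating` layout that FileDataModel expects:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch: convert a tab-separated ratings file into the comma-separated
// userID,itemID,rating layout expected by Mahout's FileDataModel.
public class ToMahoutFormat {
    public static List<String> convert(List<String> tsvLines) {
        List<String> out = new ArrayList<>();
        for (String line : tsvLines) {
            String[] f = line.split("\t");
            if (f.length < 3) continue;              // skip malformed lines
            out.add(f[0] + "," + f[1] + "," + f[2]); // drop extra columns (e.g. timestamps)
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Input and output file names are passed as arguments.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), convert(lines));
    }
}
```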
117. Do you want more?
• Recommendation
– Deployment of a Mahout-based Web Recommender
– Integration with Hadoop/Spark
– Integration of content-based information
– Custom similarities, custom recommenders,
re-scoring functions
• Content-based Recommender Systems
through Classification Algorithms!
117
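As a taste of the last point, a hedged sketch of a re-scoring function implementing Mahout's IDRescorer interface (the boost factor and the idea of a hand-picked set of item IDs are made-up examples):

```java
// Sketch: a custom IDRescorer that boosts a hand-picked set of item IDs
// and filters nothing. Assumes org.apache.mahout.cf.taste.recommender.IDRescorer
// and java.util.Set are imported.
class BoostRescorer implements IDRescorer {
  private final Set<Long> boosted;

  BoostRescorer(Set<Long> boosted) {
    this.boosted = boosted;
  }

  @Override
  public double rescore(long id, double originalScore) {
    // Boost preferred items; leave everything else untouched.
    return boosted.contains(id) ? originalScore * 1.5 : originalScore;
  }

  @Override
  public boolean isFiltered(long id) {
    return false; // never exclude an item outright
  }
}
```

A rescorer like this can be passed to `recommender.recommend(userID, howMany, rescorer)` to re-rank the candidate items.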