Spark MLlib - Training Material

• Experience
Vpon Data Engineer
TWM, Keywear, Nielsen
• Bryan’s notes for data analysis
http://bryannotes.blogspot.tw
• Spark.TW
• Linikedin
https://tw.linkedin.com/pub/bryan-yang/7b/763/a79
ABOUT ME

Agenda
• Introduction to Machine Learning
• Why MLlib
• Basic Statistic
• K-means
• Logistic Regression
• ALS
• Demo

Introduction to Machine
Learning

Definition
• Machine learning is a study of computer algorithms
that improve automatically through experience.

Terminology
• Observation
- The basic unit of data.
• Features
- Features is that can be used to describe each observation
in a quantitative manner.
• Feature vector
- is an n-dimensional vector of numerical features that
represents some object.
• Training/ Testing/ Evaluation set
- Set of data to discover potential predictive relationships

Learning(Training)
• Features:
1. Color: Red/ Green
2. Type: Fruit
3. Shape: nearly Circle
etc…

Learning(Training)
ID Color Type Shape
is
apple (
Label)
1 Red Fruit Cirle Y
2 Red Fruit Cirle Y
3 Black Logo
nearly
Cirle
N
4 Blue N/A Cirle N

Categories of Machine Learning
• Classification: predict class from observations.
• Clustering: group observations into meaningful
groups.
• Regression: predict value from observations.

Cluster the observations with no Labels
https://en.wikipedia.org/wiki/Cluster_analysis

Cut the observations
http://stats.stackexchange.com/questions/159957/how-to-do-one-vs-one-classification-for-logistic-regression

Find a model to describe the
observations
https://commons.wikimedia.org/wiki/File:Linear_regression.svg

Machine Learning
Pipelines
http://www.nltk.org/book/ch06.html

What is MLlib
• MLlib is an Apache Spark component focusing on
machine learning:
• MLlib is Spark’s core ML library
• Developed by MLbase team in AMPLab
• 80+ contributions from various organization
• Support Scala, Python, and Java APIs

Algorithms in MLlib
• Statistics: Description, correlation
• Clustering: k-means
• Collaborative filtering: ALS
• Classification: SVMs, naive Bayes, decision tree.
• Regression: linear regression, logistic regression
• Dimensionality: SVD, PCA

Why Mllib
• Scalability
• Performance
• user-friendly documentation and APIs
• Cost of maintenance

Data Type
• Dense vector
• Sparse vector
• Labeled point

Dense & Sparse
• Raw Data:
ID
A
B C D E F
1 1 0 0 0 0 3
2 0 1 0 1 0 2
3 1 1 1 0 1 1

Dense vs Sparse
• Training Set:
- number of example: 12 million
- number of features: 500
- sparsity: 10%
• Not only save storage, but also received a 4x speed
up
Dense Sparse
Storge 47GB 7GB
Time 240s 58s

Labeled Point
• Dummy variable (1,0)
• Categorical variable (0, 1, 2, …)
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

Descriptive Statistics
• Supported function:
- count
- max
- min
- mean
- variance
…
• Supported data types
- Dense
- Sparse
- Labeled Point

Example
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
import numpy as np
## example data（2 x 2 matrix at least）
data= np.array([[1.0,2.0,3.0,4.0,5.0],[1.0,2.0,3.0,4.0,5.0]])
## to RDD
distData = sc.parallelize(data)
## Compute Statistic Value
summary = Statistics.colStats(distData)
print "Duration Statistics:"
print " Mean: {}".format(round(summary.mean()[0],3))
print " St. deviation: {}".format(round(sqrt(summary.variance()[0]),3))
print " Max value: {}".format(round(summary.max()[0],3))
print " Min value: {}".format(round(summary.min()[0],3))
print " Total value count: {}".format(summary.count())
print " Number of non-zero values: {}".format(summary.numNonzeros()[0])

Quiz 1
• Which statement is wrong purpose of descriptive statistic？
1. Find the extreme value of the data.
2. Understanding the data properties preliminary.
3. Describe the full properties of the data.

Correlation
https://commons.wikimedia.org/wiki/File:Linear_Correlation_Examples

Example
• Data Set
from pyspark.mllib.stat import Statistics
## 選擇統計方法
corrType='pearson'
## 建立測試資料及
group1 = sc.parallelize([1,3,4,3,7,5,6,8])
group2 = sc.parallelize([1,2,3,4,5,6,7,8])
corr = Statistics.corr(group1,group2, corrType)
print corr
# 0.87
A B C D E F G H
Case1 1 3 4 3 7 5 6 8
Case2 1 2 3 4 5 6 7 8

Shape is important
http://2012books.lardbucket.org/books/beginning-psychology/

Quiz 2
• Which situation below can not use pearson
correlation?
1. When two variables have no correlation
2. When two variables have correlation of indices
3. When count of data rows > 30
4. When variables are normal distribution.

K-means
• K-means clustering aims to partition n observations
into k clusters in which each observation belongs to
the cluster with the nearest mean, serving as a
prototype of the cluster.

Summary
https://en.wikipedia.org/wiki/K-means_clustering

K-Means: Python
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
# Load and parse the data
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
runs=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

• Which data is not suitable for K-means analysis?
Quiz 3

Classification Model
Logistic Regression

linear regression
https://commons.wikimedia.org/wiki/File:Linear_regression.svg

When outcome is only 1/0
http://www.slideshare.net/21_venkat/logistic-regression-17406472

Hypotheses function
• hypotheses:
• sigmoid function:

• Maximum Likelihood estimation
Cost Function
if y = 0if y = 1 regularized

Sample Code
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
def parsePoint(line):
values = [float(x) for x in line.split(' ')]
return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() /
float(parsedData.count())
print("Training Error = " + str(trainErr))

Quiz 4
• What situation is not suitable for logistic regression?
1. data with multiple label point
2. data with no label
3. data with more the 1,000 features
4. data with dummy features

Recommendation Model
Alternating Least Squares

Definition
• Collaborative Filtering(CF) is a subset of algorithms
that exploit other users and items along with their
ratings(selection, purchase information could be
also used) and target user history to recommend an
item that target user does not have ratings for.
• Fundamental assumption behind this approach is
that other users preference over the items could be
used recommending an item to the user who did not
see the item or purchase before.

Definition
• CF differs itself from content-based methods in the
sense that user or the item itself does not play a
role in recommeniation but rather how(rating) and
which users(user) rated a particular item.
(Preference of users is shared through a group of
users who selected, purchased or rate items
similarly)

ALS: Python
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

Quiz 5
• Search for other recommendation algorithms, point
out the difference with ALS.
- Item Base
- User Base
- Content Base

Reference
• Machine Learning Library (MLlib) Guide
http://spark.apache.org/docs/1.4.1/mllib-guide.html
• MLlib: Spark's Machine Learning Library
http://www.slideshare.net/jeykottalam/mllib
• Recent Developments in Spark MLlib and Beyond
http://www.slideshare.net/Hadoop_Summit
• Introduction to Machine Learning
http://www.slideshare.net/rahuldausa/introduction-to-machine-
learning

Spark MLlib - Training Material

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (17)

Similar to Spark MLlib - Training Material

Similar to Spark MLlib - Training Material (20)

More from Bryan Yang

More from Bryan Yang (10)

Recently uploaded

Recently uploaded (20)

Spark MLlib - Training Material