SlideShare a Scribd company logo
1 of 64
Spark MLlib
• Experience
Vpon Data Engineer
TWM, Keywear, Nielsen
• Bryan’s notes for data analysis
http://bryannotes.blogspot.tw
• Spark.TW
• Linikedin
https://tw.linkedin.com/pub/bryan-yang/7b/763/a79
ABOUT ME
Agenda
• Introduction to Machine Learning
• Why MLlib
• Basic Statistic
• K-means
• Logistic Regression
• ALS
• Demo
Introduction to Machine
Learning
Definition
• Machine learning is a study of computer algorithms
that improve automatically through experience.
Terminology
• Observation
- The basic unit of data.
• Features
- Features is that can be used to describe each observation
in a quantitative manner.
• Feature vector
- is an n-dimensional vector of numerical features that
represents some object.
• Training/ Testing/ Evaluation set
- Set of data to discover potential predictive relationships
Learning(Training)
• Features:
1. Color: Red/ Green
2. Type: Fruit
3. Shape: nearly Circle
etc…
Learning(Training)
ID Color Type Shape
is
apple (
Label)
1 Red Fruit Cirle Y
2 Red Fruit Cirle Y
3 Black Logo
nearly
Cirle
N
4 Blue N/A Cirle N
Categories of Machine Learning
• Classification: predict class from observations.
• Clustering: group observations into meaningful
groups.
• Regression: predict value from observations.
Cluster the observations with no Labels
https://en.wikipedia.org/wiki/Cluster_analysis
Cut the observations
http://stats.stackexchange.com/questions/159957/how-to-do-one-vs-one-classification-for-logistic-regression
Find a model to describe the
observations
https://commons.wikimedia.org/wiki/File:Linear_regression.svg
Machine Learning
Pipelines
http://www.nltk.org/book/ch06.html
What is MLlib
What is MLlib
• MLlib is an Apache Spark component focusing on
machine learning:
• MLlib is Spark’s core ML library
• Developed by MLbase team in AMPLab
• 80+ contributions from various organization
• Support Scala, Python, and Java APIs
Algorithms in MLlib
• Statistics: Description, correlation
• Clustering: k-means
• Collaborative filtering: ALS
• Classification: SVMs, naive Bayes, decision tree.
• Regression: linear regression, logistic regression
• Dimensionality: SVD, PCA
Why Mllib
• Scalability
• Performance
• user-friendly documentation and APIs
• Cost of maintenance
Performance
Data Type
• Dense vector
• Sparse vector
• Labeled point
Dense & Sparse
• Raw Data:
ID
A
B C D E F
1 1 0 0 0 0 3
2 0 1 0 1 0 2
3 1 1 1 0 1 1
Dense vs Sparse
• Training Set:
- number of example: 12 million
- number of features: 500
- sparsity: 10%
• Not only save storage, but also received a 4x speed
up
Dense Sparse
Storge 47GB 7GB
Time 240s 58s
Labeled Point
• Dummy variable (1,0)
• Categorical variable (0, 1, 2, …)
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
DEMO
Descriptive Statistics
• Supported function:
- count
- max
- min
- mean
- variance
…
• Supported data types
- Dense
- Sparse
- Labeled Point
Example
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
import numpy as np
## example data(2 x 2 matrix at least)
data= np.array([[1.0,2.0,3.0,4.0,5.0],[1.0,2.0,3.0,4.0,5.0]])
## to RDD
distData = sc.parallelize(data)
## Compute Statistic Value
summary = Statistics.colStats(distData)
print "Duration Statistics:"
print " Mean: {}".format(round(summary.mean()[0],3))
print " St. deviation: {}".format(round(sqrt(summary.variance()[0]),3))
print " Max value: {}".format(round(summary.max()[0],3))
print " Min value: {}".format(round(summary.min()[0],3))
print " Total value count: {}".format(summary.count())
print " Number of non-zero values: {}".format(summary.numNonzeros()[0])
DEMO
Quiz 1
• Which statement is wrong purpose of descriptive statistic?
1. Find the extreme value of the data.
2. Understanding the data properties preliminary.
3. Describe the full properties of the data.
Correlation
https://commons.wikimedia.org/wiki/File:Linear_Correlation_Examples
Example
• Data Set
from pyspark.mllib.stat import Statistics
## 選擇統計方法
corrType='pearson'
## 建立測試資料及
group1 = sc.parallelize([1,3,4,3,7,5,6,8])
group2 = sc.parallelize([1,2,3,4,5,6,7,8])
corr = Statistics.corr(group1,group2, corrType)
print corr
# 0.87
A B C D E F G H
Case1 1 3 4 3 7 5 6 8
Case2 1 2 3 4 5 6 7 8
Shape is important
http://2012books.lardbucket.org/books/beginning-psychology/
DEMO
Quiz 2
• Which situation below can not use pearson
correlation?
1. When two variables have no correlation
2. When two variables have correlation of indices
3. When count of data rows > 30
4. When variables are normal distribution.
Clustering Model
K-Means
K-means
• K-means clustering aims to partition n observations
into k clusters in which each observation belongs to
the cluster with the nearest mean, serving as a
prototype of the cluster.
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Summary
https://en.wikipedia.org/wiki/K-means_clustering
K-Means: Python
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
# Load and parse the data
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
runs=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
DEMO
• Which data is not suitable for K-means analysis?
Quiz 3
Classification Model
Logistic Regression
linear regression
https://commons.wikimedia.org/wiki/File:Linear_regression.svg
When outcome is only 1/0
http://www.slideshare.net/21_venkat/logistic-regression-17406472
Hypotheses function
• hypotheses:
• sigmoid function:
• Maximum Likelihood estimation
Cost Function
if y = 0if y = 1 regularized
Sample Code
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
# Load and parse the data
def parsePoint(line):
values = [float(x) for x in line.split(' ')]
return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() /
float(parsedData.count())
print("Training Error = " + str(trainErr))
Quiz 4
• What situation is not suitable for logistic regression?
1. data with multiple label point
2. data with no label
3. data with more the 1,000 features
4. data with dummy features
DEMO
Recommendation Model
Alternating Least Squares
Definition
• Collaborative Filtering(CF) is a subset of algorithms
that exploit other users and items along with their
ratings(selection, purchase information could be
also used) and target user history to recommend an
item that target user does not have ratings for.
• Fundamental assumption behind this approach is
that other users preference over the items could be
used recommending an item to the user who did not
see the item or purchase before.
Definition
• CF differs itself from content-based methods in the
sense that user or the item itself does not play a
role in recommeniation but rather how(rating) and
which users(user) rated a particular item.
(Preference of users is shared through a group of
users who selected, purchased or rate items
similarly)
Recommendation
Recommendation
Recommendation
Recommendation
ALS: Python
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
DEMO
Quiz 5
• Search for other recommendation algorithms, point
out the difference with ALS.
- Item Base
- User Base
- Content Base
Reference
• Machine Learning Library (MLlib) Guide
http://spark.apache.org/docs/1.4.1/mllib-guide.html
• MLlib: Spark's Machine Learning Library
http://www.slideshare.net/jeykottalam/mllib
• Recent Developments in Spark MLlib and Beyond
http://www.slideshare.net/Hadoop_Summit
• Introduction to Machine Learning
http://www.slideshare.net/rahuldausa/introduction-to-machine-
learning

More Related Content

What's hot

Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Base de données graphe et Neo4j
Base de données graphe et Neo4jBase de données graphe et Neo4j
Base de données graphe et Neo4jBoris Guarisma
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewMario Cartia
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Jay Patel
 
Elastic Search
Elastic SearchElastic Search
Elastic SearchNavule Rao
 
Deep Learning for Graphs
Deep Learning for GraphsDeep Learning for Graphs
Deep Learning for GraphsDeepLearningBlr
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesAmazon Web Services
 
Few shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningFew shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningﺁﺻﻒ ﻋﻠﯽ ﻣﯿﺮ
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfAnya Bida
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyKishore Gopalakrishna
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented DatabasesFabio Fumarola
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 

What's hot (20)

Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Base de données graphe et Neo4j
Base de données graphe et Neo4jBase de données graphe et Neo4j
Base de données graphe et Neo4j
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
Deep Learning for Graphs
Deep Learning for GraphsDeep Learning for Graphs
Deep Learning for Graphs
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
 
Few shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningFew shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learning
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 

Viewers also liked

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for TrainingBryan Yang
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regressionnszakir
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSpark Summit
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Choosing Regression Models
Choosing Regression ModelsChoosing Regression Models
Choosing Regression ModelsStephen Senn
 
Build your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by stepBuild your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by stepBryan Yang
 
Skiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strcturesSkiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strctureszukun
 
Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015Bryan Yang
 
Data Scientist's Daily Life
Data Scientist's Daily LifeData Scientist's Daily Life
Data Scientist's Daily LifeBryan Yang
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingKeira Zhou
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
DSP 資料科學計畫簡介
DSP 資料科學計畫簡介DSP 資料科學計畫簡介
DSP 資料科學計畫簡介codefortomorrow
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibIMC Institute
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 

Viewers also liked (17)

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 
02 newton-raphson
02 newton-raphson02 newton-raphson
02 newton-raphson
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Choosing Regression Models
Choosing Regression ModelsChoosing Regression Models
Choosing Regression Models
 
Build your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by stepBuild your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by step
 
Skiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strcturesSkiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strctures
 
Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015
 
Data Scientist's Daily Life
Data Scientist's Daily LifeData Scientist's Daily Life
Data Scientist's Daily Life
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark Streaming
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
DSP 資料科學計畫簡介
DSP 資料科學計畫簡介DSP 資料科學計畫簡介
DSP 資料科學計畫簡介
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 

Similar to Spark MLlib - Training Material

Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningmy6305874
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark MLAhmet Bulut
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation SystemsRobin Reni
 
Python-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdfPython-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdfssuser598883
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxSandeep Singh
 
Python for Data Analysis.pdf
Python for Data Analysis.pdfPython for Data Analysis.pdf
Python for Data Analysis.pdfJulioRecaldeLara1
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxtangadhurai
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsYalçın Yenigün
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineMichael Gerke
 
background.pptx
background.pptxbackground.pptx
background.pptxKabileshCm
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for EveryoneAly Abdelkareem
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxRuby Shrestha
 

Similar to Spark MLlib - Training Material (20)

Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark ML
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation Systems
 
Python-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdfPython-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdf
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
Python for Data Analysis.pdf
Python for Data Analysis.pdfPython for Data Analysis.pdf
Python for Data Analysis.pdf
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
 
background.pptx
background.pptxbackground.pptx
background.pptx
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for Everyone
 
Ember
EmberEmber
Ember
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptx
 

More from Bryan Yang

敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法Bryan Yang
 
Data pipeline essential
Data pipeline essentialData pipeline essential
Data pipeline essentialBryan Yang
 
資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥Bryan Yang
 
Data pipeline 101
Data pipeline 101Data pipeline 101
Data pipeline 101Bryan Yang
 
Building a data driven business
Building a data driven businessBuilding a data driven business
Building a data driven businessBryan Yang
 
產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例Bryan Yang
 
Serverless ETL
Serverless ETLServerless ETL
Serverless ETLBryan Yang
 
敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法Bryan Yang
 
Introduction to docker
Introduction to dockerIntroduction to docker
Introduction to dockerBryan Yang
 

More from Bryan Yang (10)

敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法
 
Data pipeline essential
Data pipeline essentialData pipeline essential
Data pipeline essential
 
Docker 101
Docker 101Docker 101
Docker 101
 
資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥
 
Data pipeline 101
Data pipeline 101Data pipeline 101
Data pipeline 101
 
Building a data driven business
Building a data driven businessBuilding a data driven business
Building a data driven business
 
產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例
 
Serverless ETL
Serverless ETLServerless ETL
Serverless ETL
 
敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法
 
Introduction to docker
Introduction to dockerIntroduction to docker
Introduction to docker
 

Recently uploaded

Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制vexqp
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATIONLakpaYanziSherpa
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样wsppdmt
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 

Recently uploaded (20)

Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 

Spark MLlib - Training Material

  • 2. • Experience Vpon Data Engineer TWM, Keywear, Nielsen • Bryan’s notes for data analysis http://bryannotes.blogspot.tw • Spark.TW • Linikedin https://tw.linkedin.com/pub/bryan-yang/7b/763/a79 ABOUT ME
  • 3. Agenda • Introduction to Machine Learning • Why MLlib • Basic Statistic • K-means • Logistic Regression • ALS • Demo
  • 5. Definition • Machine learning is a study of computer algorithms that improve automatically through experience.
  • 6. Terminology • Observation - The basic unit of data. • Features - Features is that can be used to describe each observation in a quantitative manner. • Feature vector - is an n-dimensional vector of numerical features that represents some object. • Training/ Testing/ Evaluation set - Set of data to discover potential predictive relationships
  • 7. Learning(Training) • Features: 1. Color: Red/ Green 2. Type: Fruit 3. Shape: nearly Circle etc…
  • 8. Learning(Training) ID Color Type Shape is apple ( Label) 1 Red Fruit Cirle Y 2 Red Fruit Cirle Y 3 Black Logo nearly Cirle N 4 Blue N/A Cirle N
  • 9. Categories of Machine Learning • Classification: predict class from observations. • Clustering: group observations into meaningful groups. • Regression: predict value from observations.
  • 10. Cluster the observations with no Labels https://en.wikipedia.org/wiki/Cluster_analysis
  • 12. Find a model to describe the observations https://commons.wikimedia.org/wiki/File:Linear_regression.svg
  • 15. What is MLlib • MLlib is an Apache Spark component focusing on machine learning: • MLlib is Spark’s core ML library • Developed by MLbase team in AMPLab • 80+ contributions from various organization • Support Scala, Python, and Java APIs
  • 16. Algorithms in MLlib • Statistics: Description, correlation • Clustering: k-means • Collaborative filtering: ALS • Classification: SVMs, naive Bayes, decision tree. • Regression: linear regression, logistic regression • Dimensionality: SVD, PCA
  • 17. Why Mllib • Scalability • Performance • user-friendly documentation and APIs • Cost of maintenance
  • 19. Data Type • Dense vector • Sparse vector • Labeled point
  • 20. Dense & Sparse • Raw Data: ID A B C D E F 1 1 0 0 0 0 3 2 0 1 0 1 0 2 3 1 1 1 0 1 1
  • 21. Dense vs Sparse • Training Set: - number of example: 12 million - number of features: 500 - sparsity: 10% • Not only save storage, but also received a 4x speed up Dense Sparse Storge 47GB 7GB Time 240s 58s
  • 22. Labeled Point • Dummy variable (1,0) • Categorical variable (0, 1, 2, …) from pyspark.mllib.linalg import SparseVector from pyspark.mllib.regression import LabeledPoint # Create a labeled point with a positive label and a dense feature vector. pos = LabeledPoint(1.0, [1.0, 0.0, 3.0]) # Create a labeled point with a negative label and a sparse feature vector. neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
  • 23. DEMO
  • 24. Descriptive Statistics • Supported function: - count - max - min - mean - variance … • Supported data types - Dense - Sparse - Labeled Point
  • 25. Example from pyspark.mllib.stat import Statistics from pyspark.mllib.linalg import Vectors import numpy as np ## example data(2 x 2 matrix at least) data= np.array([[1.0,2.0,3.0,4.0,5.0],[1.0,2.0,3.0,4.0,5.0]]) ## to RDD distData = sc.parallelize(data) ## Compute Statistic Value summary = Statistics.colStats(distData) print "Duration Statistics:" print " Mean: {}".format(round(summary.mean()[0],3)) print " St. deviation: {}".format(round(sqrt(summary.variance()[0]),3)) print " Max value: {}".format(round(summary.max()[0],3)) print " Min value: {}".format(round(summary.min()[0],3)) print " Total value count: {}".format(summary.count()) print " Number of non-zero values: {}".format(summary.numNonzeros()[0])
  • 26. DEMO
  • 27. Quiz 1 • Which statement is wrong purpose of descriptive statistic? 1. Find the extreme value of the data. 2. Understanding the data properties preliminary. 3. Describe the full properties of the data.
  • 29. Example • Data Set from pyspark.mllib.stat import Statistics ## 選擇統計方法 corrType='pearson' ## 建立測試資料及 group1 = sc.parallelize([1,3,4,3,7,5,6,8]) group2 = sc.parallelize([1,2,3,4,5,6,7,8]) corr = Statistics.corr(group1,group2, corrType) print corr # 0.87 A B C D E F G H Case1 1 3 4 3 7 5 6 8 Case2 1 2 3 4 5 6 7 8
  • 31. DEMO
  • 32. Quiz 2 • Which situation below can not use pearson correlation? 1. When two variables have no correlation 2. When two variables have correlation of indices 3. When count of data rows > 30 4. When variables are normal distribution.
  • 34. K-means • K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
  • 42. K-Means: Python from pyspark.mllib.clustering import KMeans, KMeansModel from numpy import array from math import sqrt # Load and parse the data data = sc.textFile("data/mllib/kmeans_data.txt") parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])) # Build the model (cluster the data) clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random") # Evaluate clustering by computing Within Set Sum of Squared Errors def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)])) WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y) print("Within Set Sum of Squared Error = " + str(WSSSE))
  • 43. DEMO
  • 44. • Which data is not suitable for K-means analysis? Quiz 3
  • 47. When outcome is only 1/0 http://www.slideshare.net/21_venkat/logistic-regression-17406472
  • 49. • Maximum Likelihood estimation Cost Function if y = 0if y = 1 regularized
  • 50. Sample Code from pyspark.mllib.classification import LogisticRegressionWithSGD from pyspark.mllib.regression import LabeledPoint from numpy import array # Load and parse the data def parsePoint(line): values = [float(x) for x in line.split(' ')] return LabeledPoint(values[0], values[1:]) data = sc.textFile("data/mllib/sample_svm_data.txt") parsedData = data.map(parsePoint) # Build the model model = LogisticRegressionWithSGD.train(parsedData) # Evaluating the model on training data labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features))) trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count()) print("Training Error = " + str(trainErr))
  • 51. Quiz 4 • What situation is not suitable for logistic regression? 1. data with multiple label point 2. data with no label 3. data with more the 1,000 features 4. data with dummy features
  • 52. DEMO
  • 54. Definition • Collaborative Filtering(CF) is a subset of algorithms that exploit other users and items along with their ratings(selection, purchase information could be also used) and target user history to recommend an item that target user does not have ratings for. • Fundamental assumption behind this approach is that other users preference over the items could be used recommending an item to the user who did not see the item or purchase before.
  • 55. Definition • CF differs itself from content-based methods in the sense that user or the item itself does not play a role in recommeniation but rather how(rating) and which users(user) rated a particular item. (Preference of users is shared through a group of users who selected, purchased or rate items similarly)
  • 60.
  • 61. ALS: Python from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating # Load and parse the data data = sc.textFile("data/mllib/als/test.data") ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) # Build the recommendation model using Alternating Least Squares rank = 10 numIterations = 20 model = ALS.train(ratings, rank, numIterations) # Evaluate the model on training data testdata = ratings.map(lambda p: (p[0], p[1])) predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2])) ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions) MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean() print("Mean Squared Error = " + str(MSE))
  • 62. DEMO
  • 63. Quiz 5 • Search for other recommendation algorithms, point out the difference with ALS. - Item Base - User Base - Content Base
  • 64. Reference • Machine Learning Library (MLlib) Guide http://spark.apache.org/docs/1.4.1/mllib-guide.html • MLlib: Spark's Machine Learning Library http://www.slideshare.net/jeykottalam/mllib • Recent Developments in Spark MLlib and Beyond http://www.slideshare.net/Hadoop_Summit • Introduction to Machine Learning http://www.slideshare.net/rahuldausa/introduction-to-machine- learning