Introduction to Machine
Learning in Spark
Machine learning at Scale
https://github.com/phatak-dev/introduction_to_ml_with_spark
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consults on Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Introduction to Machine learning
● Machine learning in Big data
● Understanding mathematics
● Vector manipulation in Scala/Spark
● Implementing ML in Spark RDD
● Using MLLib
● ML beyond MLLib
3 D’s of big data processing
● Data Scientist
Models a simple world view from chaotic, complex
real-world data
● Data Engineer
Implements the simple model on a complex toolset and data
● Data Artist
Explains complex results in simple visualizations
Introduction to Machine Learning
What is machine learning?
A computer program is said to learn from
experience E with respect to some task T and
some performance measure P, if its
performance on T, as measured by P, improves
with experience E - Tom Mitchell (1998)
How do humans learn?
● We repeat the same task (T) over and over again to
gain experience (E)
● Doing the same task over and over again is known as
practice
● With practice and experience, we get better at that task
● Once we reach a certain level, we have learnt it
● Everything in life is learnt
Learning example - Musical instrument
● We practice the same song on the instrument each day,
again and again
● We will be pretty bad to start with, but with practice we
improve
● Our teacher measures our performance to see
whether we have made progress
● Once the teacher approves, we are a player
Why is it hard to learn?
● Practicing the same task over and over is often boring and
frustrating
● Without proper measurement, we may be stuck at the same
level without making much progress
● Making progress is pretty easy initially, but it gets
harder and harder
● Sometimes natural talent also dictates a cap on how good
we can be
How do machines learn?
● You start with an assumption about the solution to a given
problem
● Make the machine go over the data and verify the assumption
● Run the program over and over again on the same data,
updating the assumption along the way
● Measure the improvement after each round and adjust
accordingly
● Once the assumption stops changing, stop
running
Brute force machine learning
Problem : Find the relation between the x, y and z values, given
the data set
x y z
1 2 5
2 3 6
2 4 7
What would a human approach to this be?
ML approach
● Let’s assume that x, y, z are connected as below
z = c0 + c1x + c2y, where c0, c1, c2 are the factors which we
need to learn
● Let’s start with the following values
c0 = 0 c1 = 1 c2 = 1
● Find out the error for each row
● Adjust the values of c0, c1, c2 based on the error
● Repeat
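The loop above can be sketched in a few lines of plain Python (an illustration only, independent of the Scala files named on the slides). The learning rate of 0.05, the 5000 rounds and the helper names `predict`, `errors`, `cost` and `step` are assumptions made for this sketch.

```python
# Toy data from the slide: one (x, y, z) tuple per row
rows = [(1.0, 2.0, 5.0), (2.0, 3.0, 6.0), (2.0, 4.0, 7.0)]

def predict(c, x, y):
    """Assumed model: z = c0 + c1*x + c2*y."""
    return c[0] + c[1] * x + c[2] * y

def errors(c):
    """Prediction minus actual z, one value per row."""
    return [predict(c, x, y) - z for x, y, z in rows]

def cost(c):
    """Sum of squared errors divided by 2m."""
    return sum(e * e for e in errors(c)) / (2 * len(rows))

def step(c, alpha):
    """Adjust each coefficient against its average error contribution."""
    errs = errors(c)
    m = len(rows)
    g0 = sum(errs) / m
    g1 = sum(e * x for e, (x, _, _) in zip(errs, rows)) / m
    g2 = sum(e * y for e, (_, y, _) in zip(errs, rows)) / m
    return [c[0] - alpha * g0, c[1] - alpha * g1, c[2] - alpha * g2]

start = [0.0, 1.0, 1.0]          # c0 = 0, c1 = 1, c2 = 1 as on the slide
initial_errors = errors(start)   # [-2.0, -1.0, -1.0]
initial_cost = cost(start)       # (4 + 1 + 1) / 6 = 1.0

c = start
for _ in range(5000):            # "Repeat"
    c = step(c, alpha=0.05)      # assumed learning rate
final_cost = cost(c)             # much smaller than initial_cost
```

For this data set the three rows are fitted exactly by c0 = 3, c1 = 0, c2 = 1, so the cost can in principle be driven all the way to zero.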
ML history
● Sub-branch of AI
● Modelling the brain exactly is impossible because the real
world is not an ideal world
● Most of the time we don’t need all the complexity of the real
world, e.g. the ideal gas equation from thermodynamics
● An approximation of the real world is more than enough to
solve real problems
ML and Data
● The quality of an ML model is determined by
○ Learning algorithm
○ Amount of data it has seen
● With more data, a simple algorithm can do better than a
sophisticated algorithm with less data
● So for a long time ML focused on coming up with better
algorithms, as the amount of data was limited
● Then we landed on big data
Machine learning in Big data
ML and Big data
● The ability to learn on a large corpus of data is a real boon for
ML
● Even simplistic ML models shine when they are trained
on huge amounts of data
● With big data toolsets, a wide variety of ML applications
have started to emerge beyond the specialised academic
ones
● Big data is democratising ML for the general public
ML vs Big data
Machine Learning | Big data
Optimized for iterative computations | Optimized for single pass computations
Maintains state between stages | Stateless
CPU intensive | Data intensive
Vector/Matrix based or multiple rows/cols at a time | Single column/row at a time
Machine learning in Hadoop
● In Hadoop M/R, each iteration of ML translates into a
single M/R job
● Each of these jobs needs to store data in HDFS, which
creates a lot of overhead
● Keeping state across jobs is not directly supported in
M/R
● Constant fight between quality of results and performance
Machine learning in Spark
● Spark is the first general purpose big data processing
engine built for ML from day one
● The initial design of Spark was driven by ML
optimization
○ Caching - For running over the data multiple times
○ Accumulator - To keep state across multiple
iterations in memory
○ Good support for CPU intensive tasks with laziness
● One of the examples in the first version of Spark was an ML example
First ML Example in Spark
Understanding Mathematics behind
ML
Major types of ML
● Supervised learning
Training data contains both the input vector and the desired
output. This is also called labeled data
Ex : Linear Regression, Logistic Regression
● Unsupervised learning
Training data sets without labels.
Ex : K-means clustering
Process of Supervised Learning
Training Set -> Learning Algorithm -> Hypothesis/model
New data -> Hypothesis/model -> Result (Prediction/Classification)
ML Terminology
● Training Set
Set of data used to teach the machine. In supervised learning
both the input vector and the output are available.
● Learning algorithm
Algorithm which consumes the training set to infer the relation
between the input vectors and the known output
labels
● Model
Learnt function of the input parameters
Linear regression
● A learning algorithm which learns the relation between a
dependent variable y and one or more explanatory
variables x that are connected by a linear relation.
● Given multi-dimensional data, the hypothesis for
linear regression looks as below
h(x) = c0 + c1x1 + c2x2 + ….
● x1, x2 are the explanatory variables and y is the
dependent variable. h(x) is the hypothesis.
● The goal of linear regression is to learn c0, c1, c2
Linear Regression Example
● The price of a house depends on the size of the house
● What will the price of a house be if its size is
1200?
Size of the house (sq ft) | Price (in Rs)
1000 | 5 lakh
2000 | 15 lakh
800 | 4 lakh
Linear regression for housing data
● The price of a house depends on the size of the house
● So in linear regression
y -> price of the house
x -> size of the house
● Then the model we have to learn is
Price_of_the_house = c0 + c1 * size_of_the_house
● We assume the relationship is linear
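With a single feature, c0 and c1 can also be computed directly by ordinary least squares, which gives a quick answer to the 1200 sq ft question. This closed-form fit is a shortcut swapped in for illustration, not the gradient descent method the talk builds up to. It is written in plain Python, and the function name `fit_line` is hypothetical.

```python
# Housing data from the slide: size in sq ft, price in lakh Rs
sizes = [1000.0, 2000.0, 800.0]
prices = [5.0, 15.0, 4.0]

def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1 * x for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx                 # slope: price change per sq ft
    c0 = mean_y - c1 * mean_x      # intercept
    return c0, c1

c0, c1 = fit_line(sizes, prices)
predicted_price = c0 + c1 * 1200.0   # roughly 7.37 lakh for 1200 sq ft
```

With only three rows the fitted line is a rough guide at best; the point is the shape of the model, not the number.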
Curve fitting
Choosing values
● Our model is
h(x) = c0 + c1x1 + c2x2 + ….
● How do we choose c0, c1, c2?
● We want to choose c0, c1, c2 such that the predicted y
values are close to the ones in the training set
● But how do we know they are close to y?
● The cost function tells us whether they are close or not.
Cost function
● Cost function for linear regression is
J(c) = sum( (h(x) - y)^2 ) / 2m - known as squared error
● J(c) is a function of c0, c1, c2 because the cost changes
with the choice of c0, c1, c2
● m is the number of rows in the training data; the sum runs over all of them
● The goal of the learning algorithm is to learn a model which
minimizes this error
● How to minimize J(c)?
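As a small worked example of the cost function, the plain-Python sketch below evaluates J(c) for two candidate parameter sets. The toy data and the names `hypothesis` and `cost` are assumptions made for illustration.

```python
def hypothesis(c, x):
    """h(x) = c0 + c1 * x for a single feature."""
    return c[0] + c[1] * x

def cost(c, xs, ys):
    """J(c) = sum((h(x) - y)^2) / (2m), where m is the number of rows."""
    m = len(xs)
    return sum((hypothesis(c, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Hypothetical toy data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

cost_perfect = cost([1.0, 2.0], xs, ys)   # 0.0: the line passes through every point
cost_off = cost([0.0, 2.0], xs, ys)       # 0.5: every prediction is off by 1
```

The closer the predictions are to the training values, the smaller J(c) becomes, which is what makes it usable as the quantity to minimize.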
Curve of cost function
Minimising cost function
● Start with some random values of c0, c1
● Calculate the cost
● Keep on changing c0, c1 till you find the minimum cost
● Gradient descent is one of the algorithms for finding a
minimum of a mathematical function
● It is commonly used as an optimizer for convex functions
● The name comes from the use of the gradient of the
function to decide which way to walk
Gradient descent
● The way to update c values in gradient descent
c(j) = c(j) - alpha * derivative of J(c) with respect to c(j)
● alpha is known as the learning rate
● For linear regression
○ cost function - J(c) = sum( (h(x) - y)^2 ) / 2m
○ Derivative is
■ for c0 : sum( h(x) - y ) / m
■ for c1, c2 .. : sum( (h(x) - y) * x ) / m
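Written out in full, the derivatives follow from the chain rule: the factor 2 produced by differentiating the square cancels the 1/2 in the cost. The superscript (i) for the i-th training row is notation introduced here, not taken from the slides.

```latex
J(c) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h(x^{(i)}) - y^{(i)}\bigr)^2,
\qquad h(x) = c_0 + c_1 x_1 + c_2 x_2 + \dots

\frac{\partial J}{\partial c_0}
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h(x^{(i)}) - y^{(i)}\bigr)

\frac{\partial J}{\partial c_j}
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)},
  \qquad j \ge 1
```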
Understanding Gradient Descent
Linear regression algorithm
● Start with a training set with x1, x2, x3 .. and y
● Start with parameters c0, c1, c2 with random values
● Start with a learning rate alpha
● Then repeat the following update
c0 = c0 - alpha * sum( h(x) - y ) / m
c1 = c1 - alpha * sum( (h(x) - y) * x ) / m
● Repeat this process till it converges
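A minimal plain-Python sketch of this algorithm for a single feature is shown below. The data, the learning rate of 0.1, the fixed count of 2000 iterations (used in place of a convergence check) and the zero starting point (instead of random values) are all assumptions chosen to keep the run repeatable.

```python
def gradient_descent(xs, ys, alpha, iterations):
    """Batch gradient descent for h(x) = c0 + c1 * x."""
    m = len(xs)
    c0, c1 = 0.0, 0.0                      # assumed starting point
    for _ in range(iterations):
        errs = [c0 + c1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errs) / m
        grad1 = sum(e * x for e, x in zip(errs, xs)) / m
        # Update both parameters from the same set of errors
        c0, c1 = c0 - alpha * grad0, c1 - alpha * grad1
    return c0, c1

# Hypothetical data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

c0, c1 = gradient_descent(xs, ys, alpha=0.1, iterations=2000)
# c0 ends up very close to 1 and c1 very close to 2
```

Both parameters are updated together from errors computed with the old values; updating c0 first and then reusing it to compute the error for c1 would be a different, subtly wrong procedure.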
Vectors in ML
● All information in machine learning is represented using
vectors of numbers which represent features
● Collections of vectors are represented as matrices
● All the calculations in machine learning are expressed
using vector manipulation
● So understanding vector manipulation is very important
● We will see how Scala and Spark represent
vectors
Vector manipulation in Scala/Spark
Breeze - Vector library for Scala
● Breeze is a library for numerical processing in Scala
● Option to use Fortran-based native numeric libraries through
netlib-java
● Spark uses breeze internally for vector manipulation in
the MLLib library
● Many data structures of MLLib are modeled around
breeze data structures
● https://github.com/scalanlp/breeze
Representing data in breeze vector
● Two types of vectors
○ Dense vectors
○ Sparse vectors
● All vectors in breeze are column vectors. Take the
transpose to get a row vector
● Normally row vectors are used to represent the data
and column vectors to represent the weights or
coefficients of the learning algorithm
● VectorExample.scala
Vector/Matrix manipulation
● Multiplying a vector by a constant
○ Multiply using netlib
○ Multiply elementwise using a constant vector
MultiplyVectorByConstant.scala
● Multiplying a matrix by a constant
○ value
○ vector
MultiplyMatrixByConstant.scala
Dot product
● Dot product of two vectors A = [a1, a2, a3 ..] and B =
[b1, b2, b3 ..] is
A.B = a1*b1 + a2*b2 + a3*b3 + ..
● Two ways to implement in breeze
○ A.dot(B)
○ Transpose(A) * B
● DotProductExample.scala
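The definition can be checked with a few lines of plain Python (lists stand in for breeze vectors here; the function name `dot` and the sample values are assumptions). The second half shows why the dot product matters for this talk: the linear regression hypothesis is a dot product once a constant 1 is added to the features.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(ai * bi for ai, bi in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
a_dot_b = dot(a, b)               # 1*4 + 2*5 + 3*6 = 32.0

# h(x) = c0*1 + c1*x1 + c2*x2 written as a dot product
coefficients = [0.0, 1.0, 1.0]    # c0, c1, c2
features = [1.0, 1.0, 2.0]        # leading 1.0 pairs with c0
h = dot(coefficients, features)   # 0 + 1 + 2 = 3.0
```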
In-place computations
● Creating a breeze vector/matrix is a costly operation
● So if each transformation in our computation creates a
new vector, it may hurt our performance
● We can use in-place operations, which update
existing vectors rather than creating new vectors
● When we do many vector manipulations, in-place is
preferred for better performance
● Ex : InPlaceExample.scala
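The difference between the two styles can be illustrated with plain Python lists (this is not breeze code, and the function names are hypothetical): one function allocates a new vector for the result, the other overwrites the existing one.

```python
def scaled_copy(v, k):
    """Return a new list; the input is left untouched."""
    return [k * vi for vi in v]

def scale_in_place(v, k):
    """Overwrite the elements of v; no new list is allocated."""
    for i in range(len(v)):
        v[i] *= k

weights = [1.0, 2.0, 3.0]
doubled = scaled_copy(weights, 2.0)      # allocates a second list
is_new_object = doubled is not weights   # True

same_buffer = weights                    # second reference to the same list
scale_in_place(weights, 2.0)             # weights is now [2.0, 4.0, 6.0]
```

The trade-off is the usual one: in-place updates avoid allocations inside tight loops, but every other reference to the vector (`same_buffer` above) sees the change too.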
Representing data as RDD[Vector]
● In Spark ML, each row is represented using a vector
● Representing a row as a vector allows us to easily
manipulate it using the vector manipulations
discussed earlier
● We broadcast vectors for efficiency
● We can manipulate a partition at a time by representing
it as a matrix
● Manipulating a whole partition at once can give good performance
● RDDVectorExample.scala
Implementing Linear Regression in
Spark
Implementing LR in Spark
● Represent the data as DataPoint which has
○ x - feature vector
○ y - value to be predicted
● Use an accumulator to keep track of the cost
● Use reduce to aggregate the gradient across multiple rows
● Use gradient descent to update the parameters
● Ex : LinearRegressionExample.scala
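The map-then-reduce shape of this computation can be sketched without Spark. In the plain-Python version below, a list stands in for the RDD and `functools.reduce` for the RDD reduce; the field layout of `DataPoint`, the helper names and the learning rate are assumptions, and cost tracking with an accumulator is left out.

```python
from collections import namedtuple
from functools import reduce

# x: feature vector with a leading 1.0 for the intercept, y: value to predict
DataPoint = namedtuple("DataPoint", ["x", "y"])

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def point_gradient(weights, p):
    """Per-row contribution to the gradient: (h(x) - y) * x."""
    err = dot(weights, p.x) - p.y
    return [err * xi for xi in p.x]

def add_vectors(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def gradient(weights, points):
    """'map' each row to its gradient, then 'reduce' by vector addition."""
    total = reduce(add_vectors, (point_gradient(weights, p) for p in points))
    return [g / len(points) for g in total]

# Hypothetical rows on the line y = 1 + 2x
points = [
    DataPoint([1.0, 0.0], 1.0),
    DataPoint([1.0, 1.0], 3.0),
    DataPoint([1.0, 2.0], 5.0),
]

weights = [0.0, 0.0]
grad = gradient(weights, points)     # [-3.0, -13/3]
alpha = 0.1                          # assumed learning rate
weights = [w - alpha * g for w, g in zip(weights, grad)]   # one descent step
```

Because vector addition is associative and commutative, the per-row gradients can be summed in any order, which is what lets the reduce step run across partitions.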
Stochastic gradient descent
● Use a mini batch in each step rather than the complete dataset
● We can achieve this using sampling
● Mini batches help to speed up gradient descent at
multiple levels
● By default the mini batch fraction is 1.0, i.e. the complete dataset
● Ex : LRWithSGD.scala
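A minimal plain-Python sketch of the mini batch idea follows. Each step samples a fixed number of rows, whereas a fraction-based sampler would pick a share of the dataset; that choice, the data, the seed, the learning rate and the iteration count are assumptions made for illustration.

```python
import random

def minibatch_sgd(points, alpha, iterations, batch_size, seed):
    """Gradient descent where each step uses a random sample of rows."""
    rng = random.Random(seed)          # fixed seed keeps the run repeatable
    c0, c1 = 0.0, 0.0
    for _ in range(iterations):
        batch = rng.sample(points, batch_size)
        errs = [c0 + c1 * x - y for x, y in batch]
        grad0 = sum(errs) / batch_size
        grad1 = sum(e * x for e, (x, _) in zip(errs, batch)) / batch_size
        c0, c1 = c0 - alpha * grad0, c1 - alpha * grad1
    return c0, c1

# Hypothetical noise-free rows (x, y) on the line y = 1 + 2x
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

c0, c1 = minibatch_sgd(points, alpha=0.05, iterations=5000, batch_size=2, seed=42)
# c0 ends up close to 1 and c1 close to 2
```

Each step touches only two rows instead of all five, so it is cheaper, at the price of a noisier path towards the minimum. On noisy real data the parameters would keep jittering around the minimum rather than settle exactly.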
Using MLLib
MLLib
● Machine learning library shipped with the standard
distribution of Spark
● Supports popular machine learning algorithms like
Linear regression, Logistic regression, decision trees
etc. out of the box
● New algorithms are added in every release
● Supports multiple optimization techniques: SGD,
LBFGS
Linear Regression in MLLib
● org.apache.spark.mllib.linalg.DenseVector wraps
breeze vector for MLLib library
● Use LabeledPoint to represent the data of a given row
● Built-in support for Linear regression with SGD
● Ex : LinearRegressionInMLLib.scala
LR for housing data
● We are going to predict the housing price from the house size
● Size and price are on different scales
● We need to bring both of them to the same scale
● We use StandardScaler to scale an RDD[Vector]
● The scaled data will be used for LinearRegression
● Ex : LRForHousingData.scala
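What standard scaling does to the housing numbers can be shown in plain Python. This sketch both centers each column and divides by its sample standard deviation, which is one possible configuration and an assumption here; the function name `standardize` is hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Shift to zero mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)          # sample (n - 1) standard deviation
    return [(v - mu) / sigma for v in values]

# Housing data from the earlier slide
sizes = [1000.0, 2000.0, 800.0]    # sq ft
prices = [5.0, 15.0, 4.0]          # lakh Rs

scaled_sizes = standardize(sizes)
scaled_prices = standardize(prices)
```

After scaling, both columns have mean 0 and standard deviation 1, so a single learning rate suits both the size and the price direction during gradient descent.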
Machine learning beyond MLLib
● MLLib Pipelines API
● MLLib feature framework
● Sparkling water
http://h2o.ai/product/sparkling-water/
● SparkR
● http://prediction.io/
References
● https://www.coursera.org/learn/machine-learning
● https://github.com/zinniasystems/spark-ml-class
● https://github.com/scalanlp/breeze

 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 

Recently uploaded

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 

Recently uploaded (20)

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 

Introduction to Machine Learning with Spark

  • 1. Introduction to Machine Learning in Spark Machine learning at Scale https://github.com/phatak-dev/introduction_to_ml_with_spark
  • 2. ● Madhukara Phatak ● Big data consultant and trainer at datamantra.io ● Consult in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Introduction to Machine learning ● Machine learning in Big data ● Understanding mathematics ● Vector manipulation in Scala/Spark ● Implementing ML in Spark RDD ● Using MLLib ● ML beyond MLLib
  • 4. 3 D’s of big data processing ● Data Scientist Models simple world view from chaotic complex real world data ● Data Engineer Implements simple model on complex toolset and data ● Data Artist Explains complex results in simple visualizations
  • 6. What is machine learning? A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E - Tom Mitchell (1998)
  • 7. How do humans learn? ● We repeat the same task (T) over and over again to gain experience (E) ● Doing the same task over and over again is known as practice ● With practice and experience, we get better at that task ● Once we reach a certain level, we have learnt ● Everything in life is learnt
  • 8. Learning example - Musical instrument ● We practice the instrument each day, playing the same song again and again ● We will be pretty bad to start with, but with practice we improve ● Our teacher measures our performance to see whether we have made progress ● Once your teacher approves, you are a player
  • 9. Why is it hard to learn? ● Practicing the same task over and over is often boring and frustrating ● Without proper measurement, we may be stuck at the same level without making much progress ● Making progress is pretty easy initially, but it gets harder and harder ● Sometimes natural talent also dictates a cap on how good we can be
  • 10. How do machines learn? ● You start with an assumption about the solution for a given problem ● Make the machine go over the data and verify the assumption ● Run the program over and over again on the same data, updating the assumption along the way ● Measure improvement after each round and adjust accordingly ● Once the assumption stops changing, stop running
  • 11. Brute force machine learning Problem: find the relation between x, y and z values, given the data set (x, y, z) = (1, 2, 5), (2, 3, 6), (2, 4, 7). What is the human approach to this?
  • 12. ML approach ● Let’s assume that x, y, z are connected as z = c0 + c1*x + c2*y, where c0, c1, c2 are the factors we need to learn ● Let’s start with the values c0 = 0, c1 = 1, c2 = 1 ● Find the error for each row ● Adjust the values of c0, c1, c2 based on the error ● Repeat
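The "find the error for each row" step can be sketched in plain Scala. This is only an illustration under stated assumptions: the object and method names (`BruteForceErrors`, `predict`, `errors`) are hypothetical, and the error is taken as prediction minus actual z.

```scala
object BruteForceErrors {
  // Rows of (x, y, z) from the example data set on the previous slide.
  val rows: Seq[(Double, Double, Double)] =
    Seq((1.0, 2.0, 5.0), (2.0, 3.0, 6.0), (2.0, 4.0, 7.0))

  // Hypothesis z = c0 + c1*x + c2*y.
  def predict(c0: Double, c1: Double, c2: Double)(x: Double, y: Double): Double =
    c0 + c1 * x + c2 * y

  // Error (prediction minus actual z) for every row.
  def errors(c0: Double, c1: Double, c2: Double): Seq[Double] =
    rows.map { case (x, y, z) => predict(c0, c1, c2)(x, y) - z }

  def main(args: Array[String]): Unit =
    // Starting guess from the slide: c0 = 0, c1 = 1, c2 = 1.
    println(errors(0.0, 1.0, 1.0))
}
```

With the starting guess every prediction is too low, so the factors need adjusting. One set of values that happens to fit all three rows exactly is c0 = 3, c1 = 0, c2 = 1.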
  • 13. ML history ● Sub branch of AI ● Impossible to model brain because it's not an ideal world ● Most of the time we don’t need all complexity of real world ex : ideal gas equation from thermodynamics ● Approximation of real world is more than enough to solve real problems
  • 14. ML and Data ● The quality of an ML model is determined by ○ the learning algorithm ○ the amount of data it has seen ● With more data, a simple algorithm can do better than a sophisticated algorithm with less data ● For a long time ML focused on coming up with better algorithms, as the amount of data was limited ● Then we landed on big data
  • 16. ML and Big data ● The ability to learn on a large corpus of data is a real boon for ML ● Even simplistic ML models shine when they are trained on huge amounts of data ● With big data toolsets, a wide variety of ML applications have started to emerge beyond academic-specific ones ● Big data is democratising ML for the general public
  • 17. ML vs Big data ● Machine learning: optimized for iterative computations; Big data: optimized for single-pass computations ● Machine learning: maintains state between stages; Big data: stateless ● Machine learning: CPU intensive; Big data: data intensive ● Machine learning: vector/matrix based, or multiple rows/columns at a time; Big data: a single column/row at a time
  • 18. Machine learning in Hadoop ● In Hadoop M/R, each iteration of ML translates into a single M/R job ● Each of these jobs needs to store data in HDFS, which creates a lot of overhead ● Keeping state across jobs is not directly available in M/R ● Constant fight between quality of result and performance
  • 19. Machine learning in Spark ● Spark is the first general purpose big data processing engine built for ML from day one ● The initial design of Spark was driven by ML optimization ○ Caching - for running on the same data multiple times ○ Accumulator - to keep state across multiple iterations in memory ○ Good support for CPU intensive tasks with laziness ● One of the examples in the first version of Spark was an ML example
  • 20. First ML Example in Spark
  • 22. Major types of ML ● Supervised learning Training data contains both the input vector and the desired output. It is also called labeled data. Ex : Linear Regression, Logistic Regression ● Unsupervised learning Training data sets without labels. Ex : K-means clustering
  • 23. Process of Supervised Learning Training Set -> Learning Algorithm -> Hypothesis/model; New data -> Hypothesis/model -> Result (Prediction/Classification)
  • 24. ML Terminology ● Training Set Set of data used to teach the machine. In supervised learning both the input vector and the output are available. ● Learning algorithm Algorithm which consumes the training set to infer the relation between input vectors and the known output labels ● Model Learnt function of the input parameters
  • 25. Linear regression ● A learning algorithm that learns the relation between a dependent variable y and one or more explanatory variables x which are connected by a linear relation ● Given multi-dimensional data, the hypothesis for linear regression looks as below h(x) = c0 + c1x1 + c2x2 + …. ● x1, x2 are the explanatory variables and y is the dependent variable. h(x) is the hypothesis. ● The goal of linear regression is to learn c0, c1, c2
  • 26. Linear Regression Example ● Price of the house is dependent on size of the house ● What will be the price of the house if the size of the house is 1200? Size of the house (sft) / Price (in Rs): 1000 / 5 lakh; 2000 / 15 lakh; 800 / 4 lakh
  • 27. Linear regression for housing data ● Price of the house is dependent on size of the house ● So in linear regression y -> price of the house x -> size of the house ● Then the model we have to learn is Price_of_the_house = c0 + c1 * size_of_the_house ● We assume the relationship is linear
  • 29. Choosing values ● Our model is h(x) = c0 + c1x1 + c2x2 + …. ● How to choose c0, c1, c2? ● We want to choose c0, c1, c2 so that h(x) gives values close to the y values in the training set ● But how do we know they are close to y? ● The cost function tells us how close they are.
  • 30. Cost function ● Cost function for linear regression is J(c) = sum((h(x) - y)^2) / (2m), known as the squared error, where the sum runs over all rows of the training data ● J(c) is a function of c0, c1, c2 because the cost changes with different values of c0, c1, c2 ● m is the number of rows in the training data ● The goal of the learning algorithm is to learn a model which minimizes this error ● How to minimize J(c)?
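The squared error cost can be written directly in plain Scala. The following is a minimal sketch assuming a single explanatory variable and training rows given as (x, y) pairs; the names `SquaredErrorCost` and `cost` are hypothetical.

```scala
object SquaredErrorCost {
  // J(c) = sum((h(x) - y)^2) / (2m) with h(x) = c0 + c1 * x.
  def cost(c0: Double, c1: Double, data: Seq[(Double, Double)]): Double = {
    val m = data.size
    val sumOfSquares = data.map { case (x, y) =>
      val diff = (c0 + c1 * x) - y
      diff * diff
    }.sum
    sumOfSquares / (2.0 * m)
  }
}
```

For data generated by y = 1 + 2x, the call `cost(1.0, 2.0, data)` is zero, and any other choice of c0, c1 gives a positive cost.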
  • 31. Curve of cost function
  • 32. Minimising cost function ● Start with some random values of c0, c1 ● Calculate the cost ● Keep changing c0, c1 until you find the minimum cost ● Gradient descent is one of the algorithms for finding a minimum of a mathematical function ● Also known as a convex optimizer ● The name comes from the use of the gradient of the function to decide which way to walk
  • 33. Gradient descent ● The way to update c values in gradient descent is c(j) = c(j) - alpha * derivative of J(c) with respect to c(j) ● alpha is known as the learning rate ● For linear regression ○ cost function: J(c) = sum((h(x) - y)^2) / (2m) ○ Derivative: ■ for c0: sum(h(x) - y) / m ■ for c1, c2, ..: sum((h(x) - y) * x) / m
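For reference, the derivatives follow from the cost function by the chain rule. The row index (i) and feature index j are notation assumed here; the slides do not use them.

```latex
J(c) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2,
\qquad h(x) = c_0 + c_1 x_1 + c_2 x_2 + \dots

\frac{\partial J}{\partial c_0}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)

\frac{\partial J}{\partial c_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)},
  \qquad j \ge 1
```

The factor 2 from differentiating the square cancels the 2 in the denominator, which is why the cost is divided by 2m rather than m.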
  • 35. Linear regression algorithm ● Start with a training set with x1, x2, x3 .. and y ● Start with parameters c0, c1, c2 .. set to random values ● Start with a learning rate alpha ● Then repeat the following update c0 = c0 - alpha * sum(h(x) - y) / m c1 = c1 - alpha * sum((h(x) - y) * x) / m ● Repeat this process until it converges
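The update loop can be sketched in plain Scala for one explanatory variable. This is an illustration, not the code from the talk's repository: the names are hypothetical, the parameters start at 0 instead of random values so the run is reproducible, and a fixed iteration count stands in for a convergence check.

```scala
object SimpleLinearRegression {
  // Fits y = c0 + c1 * x with batch gradient descent.
  def fit(data: Seq[(Double, Double)], alpha: Double, iterations: Int): (Double, Double) = {
    val m = data.size.toDouble
    var c0 = 0.0 // assumption: start at 0 rather than a random value
    var c1 = 0.0
    for (_ <- 1 to iterations) {
      // Pair each residual h(x) - y with its x, using the current c0 and c1.
      val residuals = data.map { case (x, y) => (c0 + c1 * x - y, x) }
      val grad0 = residuals.map(_._1).sum / m
      val grad1 = residuals.map { case (r, x) => r * x }.sum / m
      // Both parameters are updated from gradients computed on the old values.
      c0 -= alpha * grad0
      c1 -= alpha * grad1
    }
    (c0, c1)
  }
}
```

On data generated by y = 1 + 2x, a call such as `fit(data, 0.1, 5000)` should approach c0 = 1 and c1 = 2. A learning rate that is too large makes the updates diverge instead.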
  • 36. Vectors in ML ● All information in machine learning is represented using vectors of numbers which represent features ● Collections of vectors are represented as matrices ● All the calculations in machine learning are expressed using vector manipulation ● So understanding vector manipulation is very important ● We will look at how Scala and Spark represent vectors
  • 37. Vector manipulation in Scala/Spark
  • 38. Breeze - Vector library for Scala ● Breeze is a library for numerical processing in Scala ● Option to use a Fortran based native numeric library through netlib-java ● Spark uses Breeze internally for vector manipulation in the MLLib library ● Many data structures of MLLib are modeled around Breeze data structures ● https://github.com/scalanlp/breeze
  • 39. Representing data in breeze vector ● Two types of vectors ○ Dense vectors ○ Sparse vectors ● All vectors in breeze are column vectors. Take transpose for row vector ● Normally row vectors are used to represent the data and column vectors for representing weights or coefficients of learning algorithm ● VectorExample.scala
  • 40. Vector/Matrix manipulation ● Multiplying a vector by a constant ○ Multiply using netlib ○ Multiply elementwise using a constant vector MultiplyVectorByConstant.scala ● Multiplying a matrix by a constant ○ value ○ vector MultiplyMatrixByConstant.scala
  • 41. Dot product ● Dot product of two vectors A = [a1, a2, a3 ..] and B = [b1, b2, b3 ..] is A.B = a1*b1 + a2*b2 + a3*b3 + .. ● Two ways to implement in breeze ○ A.dot(B) ○ Transpose(A) * B ● DotProductExample.scala
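To make the definition concrete, here is a dependency-free sketch of the dot product on plain Scala arrays. It is not the Breeze implementation referenced on the slide; the names `DotProduct` and `dot` are hypothetical.

```scala
object DotProduct {
  // A.B = a1*b1 + a2*b2 + a3*b3 + ..
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (ai, bi) => ai * bi }.sum
  }
}
```

For example, `DotProduct.dot(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))` evaluates 1*4 + 2*5 + 3*6.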
  • 42. In-place computations ● Creating a breeze vector/matrix is a costly operation ● So if each transformation in our computation creates a new vector, it may hurt performance ● We can use in-place operations, which update existing vectors rather than creating new ones ● When we do many vector manipulations, in-place is preferred for better performance ● Ex : InPlaceExample.scala
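The difference between allocating and in-place operations can be shown on plain Scala arrays. This sketch only illustrates the idea and does not use Breeze's own in-place operators; the names are hypothetical.

```scala
object InPlaceScaling {
  // Allocates and returns a new array on every call; the input is untouched.
  def scaled(v: Array[Double], k: Double): Array[Double] = v.map(_ * k)

  // Overwrites the existing array, so no new array is allocated.
  def scaleInPlace(v: Array[Double], k: Double): Unit = {
    var i = 0
    while (i < v.length) {
      v(i) *= k
      i += 1
    }
  }
}
```

The in-place version trades safety for speed: any other code holding a reference to the same array sees the changed values.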
  • 43. Representing data as RDD[Vector] ● In Spark ML, each row is represented using a vector ● Representing a row as a vector allows us to easily manipulate it using the vector manipulations discussed earlier ● We broadcast vectors for efficiency ● We can manipulate a partition at a time by representing it as a matrix ● Manipulating a partition at a time can give good performance ● RDDVectorExample.scala
  • 45. Implementing LR in Spark ● Represent the data as DataPoint which has ○ x - feature vector ○ y - value to be predicted ● Use an accumulator to keep track of the cost ● Use reduce to aggregate the gradient across multiple rows ● Use gradient descent to learn the parameters ● Ex : LinearRegressionExample.scala
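The map-then-reduce shape of this approach can be sketched without a cluster. The code below uses an ordinary Scala `Seq` as a stand-in for an RDD and omits the accumulator for the cost; it is not the repository's LinearRegressionExample.scala. Only `DataPoint`, `x` and `y` come from the slide; every other name is hypothetical, and the intercept is assumed to be modelled as a constant first feature of 1.0.

```scala
object LocalLinearRegression {
  // x: feature vector, y: value to be predicted.
  final case class DataPoint(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Adds two vectors element by element.
  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.indices.map(i => a(i) + b(i)).toArray

  // One gradient descent step: map each row to its gradient, reduce by summing.
  def step(points: Seq[DataPoint], w: Array[Double], alpha: Double): Array[Double] = {
    val gradient = points
      .map { p =>
        val err = dot(w, p.x) - p.y
        p.x.map(_ * err)
      }
      .reduce(add)
    w.indices.map(i => w(i) - alpha * gradient(i) / points.size).toArray
  }

  // Repeats the step a fixed number of times, starting from all-zero weights.
  def train(points: Seq[DataPoint], alpha: Double, iterations: Int): Array[Double] = {
    val dims = points.head.x.length
    (1 to iterations).foldLeft(Array.fill(dims)(0.0))((w, _) => step(points, w, alpha))
  }
}
```

Because the per-row gradients are combined with an associative sum, the same `map` and `reduce` calls could in principle run on partitions in parallel, which is what makes this formulation a natural fit for an RDD.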
  • 46. Stochastic gradient descent ● Use a mini batch for each step rather than the complete dataset ● We can achieve this using sampling ● Mini batches help to speed up gradient descent at multiple levels ● By default the batch size is 1.0, i.e. the complete dataset ● Ex : LRWithSGD.scala
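Drawing a mini batch by sampling can be sketched in plain Scala. This is an assumption about how the sampling might look, not the code in LRWithSGD.scala: each row is kept independently with probability `fraction`, and the names `MiniBatch` and `sample` are hypothetical.

```scala
import scala.util.Random

object MiniBatch {
  // Keeps each row independently with probability `fraction`.
  // A fraction of 1.0 or more returns the complete dataset.
  def sample[A](rows: Seq[A], fraction: Double, rng: Random): Seq[A] =
    if (fraction >= 1.0) rows
    else rows.filter(_ => rng.nextDouble() < fraction)
}
```

Each gradient descent step would then compute the gradient on `MiniBatch.sample(rows, fraction, rng)` instead of on all rows. The batch size varies from step to step, since the fraction is only an expected proportion.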
  • 48. MLLib ● Machine learning library shipped with the standard distribution of Spark ● Supports popular machine learning algorithms like linear regression, logistic regression, decision trees etc. out of the box ● New algorithms are added in every release ● Supports multiple optimization techniques such as SGD and LBFGS
  • 49. Linear Regression in MLLib ● org.apache.spark.mllib.linalg.DenseVector wraps breeze vector for MLLib library ● Use LabeledPoint to represent the data of a given row ● Built in support for Linear regression with SGD ● Ex :LinearRegressionInMLLib.scala
  • 50. LR for housing data ● We are going to predict housing price using house size ● Size and price are on different scales ● Need to bring both of them to the same scale ● We use StandardScaler to scale RDD[Vector] ● The scaled data will be used for LinearRegression ● Ex : LRForHousingData.scala
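What scaling to a common scale means can be shown on a single column in plain Scala. This sketch subtracts the mean and divides by the sample standard deviation (an assumption; MLLib's StandardScaler lets you configure centering and scaling separately). The names `Standardize` and `standardize` are hypothetical.

```scala
object Standardize {
  // Rescales a column to zero mean and unit sample standard deviation.
  def standardize(values: Array[Double]): Array[Double] = {
    require(values.length >= 2, "need at least two values")
    val n = values.length
    val mean = values.sum / n
    // Assumption: sample variance, dividing by n - 1.
    val variance = values.map(v => (v - mean) * (v - mean)).sum / (n - 1)
    val std = math.sqrt(variance)
    require(std > 0.0, "values must not all be equal")
    values.map(v => (v - mean) / std)
  }
}
```

Applied separately to the house sizes and the prices, this puts both columns on comparable scales, which helps gradient descent work with a single learning rate.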
  • 51. Machine learning beyond MLLib ● MLLib Pipelines API ● MLLib feature framework ● Sparkling water http://h2o.ai/product/sparkling-water/ ● SparkR ● http://prediction.io/