A brief introduction to machine learning techniques applied in data science: definitions and applications of machine learning algorithms, and classification and regression techniques.
1. Machine Learning for Data Science
A Brief Introduction
By
Vaibhav Kumar
Assistant Professor
DIT University, Dehradun
Email: Vaibhav.kumar@dituniversity.edu.in, vaibhav05cse@gmail.com
GitHub: https://github.com/vaibhav05cse/
Vaibhav Kumar@DIT University
2. Contents
• Introduction to Data Science
• Applications of Data Science
• Foundations of Data Science
• Machine Learning
• Supervised Learning
• Classification
• Logistic Regression
• Decision Tree
• Random Forest
• K-Nearest Neighbor
• Support Vector Machine
• Regression
• Simple Linear Regression
• Multiple Linear Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Unsupervised Learning
• Cluster Analysis
• Principal Component Analysis
3. Introduction to Data Science
• Data science is a multi-disciplinary field which uses scientific
methods, processes, algorithms and systems to extract knowledge
and insights from structured and unstructured data [1].
• It is a blend of computer science, mathematics, and business/domain
expertise.
4. Need for Data Science
• The volume of data is growing at an ever-increasing rate [2].
• Finding insights in such huge amounts of data requires powerful analytics
techniques.
• Data science has the capacity to meet this requirement.
5. What Can Data Science Do?
• It unifies statistics, data analysis, machine learning and their related
methods in order to understand and analyze actual phenomena with data
[3].
• It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, computer science, and information
science [4].
• It does predictive analytics to predict the likelihood of a particular event
occurring in the future.
• It does prescriptive analytics to find the best course of action for a given
situation.
• It employs machine learning techniques to discover patterns from the data.
7. Foundations of Data Science
• Statistics: Descriptive, Inferential.
• Linear Algebra: Matrices, Planes, Vectors, etc.
• Computer Science: Algorithms, Graph Theory, Data Structures, DBMS, etc.
• Machine Learning: Supervised, Unsupervised, Reinforcement.
• Business Analytics: Predictive, Prescriptive, Descriptive, Decision.
• Programming: R/Python, SQL, NoSQL.
8. Machine Learning
• Machine learning is a subfield of computer science which focuses on
developing computer algorithms that learn from examples and improve
their performance on a task.
• Machine learning algorithms use training data – a set of past observations.
• There are three broad categories of machine learning:
Supervised Learning: learns from labeled examples.
Unsupervised Learning: learns from unlabeled examples.
Reinforcement Learning: learns from an environment through feedback.
• It is used to build predictive analytics models which allow researchers and
data scientists to make predictions about the future based on past and
current data.
9. Supervised Learning
• It is a category of machine learning algorithms. As the name indicates, it is
supervised by the presence of the output in the training data.
• It learns from labeled data – input for which the output is known.
• It builds a mathematical model from a set of data that contains both the
inputs and the desired outputs.
• A supervised learning algorithm analyzes the training data and produces
an inferred function, which can be used for mapping new examples.
• Generally, supervised learning problems are classified into classification
and regression problems.
10. Classification
• Classification in machine learning is a supervised learning problem where
the output variable is a category, such as “yes”/“no” or “disease”/“no
disease”.
• In this problem, the dependent variable is categorical whose category
is predicted based on several independent variables.
• A classification model attempts to draw some conclusion from
observed values.
• Given one or more inputs, a classification model will try to predict the
value of one or more outcomes.
• There are a number of classification models.
11. Classification through Machine Learning Algorithms
The following machine learning algorithms are popular for classification
problems:
• Logistic Regression
• Decision Tree
• Random Forest
• K-Nearest Neighbor
• Support Vector Machine
12. Logistic Regression
• This regression model is used when the dependent variable is
categorical.
• In the binary case, the model estimates the probability that an observation
belongs to one of two categories.
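A minimal sketch of logistic regression using scikit-learn (the library choice and the toy data are illustrative assumptions, not part of the slides):

```python
# Logistic regression: binary classification on a tiny, linearly
# separable toy dataset with a single feature.
from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]  # single input feature
y = [0, 0, 0, 1, 1, 1]                          # binary category

model = LogisticRegression()
model.fit(X, y)
print(model.predict([[1.2], [3.8]]))       # class labels for new inputs
print(model.predict_proba([[3.8]])[0, 1])  # estimated probability of class 1
```

Besides the hard class label, `predict_proba` exposes the estimated probability, which is often what makes logistic regression useful in practice.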
13. Decision Tree
• A decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node holds a class label.
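The flowchart-like structure above can be sketched with scikit-learn's `DecisionTreeClassifier`; the buys/does-not-buy dataset here is invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Features: [age, income in thousands]; label: 1 = "buys", 0 = "does not buy"
X = [[22, 20], [25, 25], [47, 60], [52, 75], [46, 55], [56, 80]]
y = [0, 0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)  # learns attribute tests at the internal nodes
print(tree.predict([[24, 22], [50, 70]]))
```

`sklearn.tree.export_text(tree)` can print the learned tests and leaf labels as an indented flowchart, which matches the description above.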
14. Random Forest
• Random forests or random decision forests are an ensemble learning
method that consists of a large number of decision trees.
• Each individual tree in the random forest produces a class prediction, and
the class with the most votes becomes the model’s prediction.
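The voting scheme can be sketched as follows (scikit-learn and the toy data are assumptions for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical [age, income] features with a buys/does-not-buy label
X = [[20, 15], [22, 20], [25, 25], [27, 30],
     [47, 60], [52, 75], [46, 55], [56, 80]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# 10 trees, each fit on a bootstrap sample; the ensemble prediction
# is the class with the most votes across the trees.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)
print(forest.predict([[24, 22], [50, 70]]))
```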
15. K-Nearest Neighbor
• In k-NN classification, the output is a class membership of a new
observation.
• An object is classified by a plurality vote of its neighbors, with the
object being assigned to the class most common among its k nearest
neighbors.
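The plurality vote described above can be sketched in plain Python (the function name and toy data are invented for illustration):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by plurality vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    dists = [math.dist(x, x_new) for x in X_train]
    # Indices of the k closest training points
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]
    # Most common class label among those neighbors wins the vote
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X_train, y_train, [1.5, 1.5]))  # "A"
print(knn_predict(X_train, y_train, [8.5, 8.5]))  # "B"
```

An odd k is usually chosen for two-class problems so the vote cannot tie.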
16. Support Vector Machine
• In Support Vector Machine (SVM), we plot each data item as a point
in n-dimensional space (where n is the number of features you have)
with the value of each feature being the value of a particular
coordinate.
• Then, we perform classification by finding the hyperplane that
differentiates the two classes well.
• To identify the hyperplane, we try to maximize the margin – the distance
between the hyperplane and the boundary elements (support vectors) of
the separated classes.
• A variety of kernel functions is used, depending on whether the
observations are linearly separable or non-linearly separable.
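A sketch with scikit-learn's `SVC` (the toy points are invented; a linear kernel suffices for them, while `kernel="rbf"` would handle non-linearly separable data):

```python
from sklearn.svm import SVC

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[2, 2], [8, 7]]))
print(clf.support_vectors_)  # the boundary points that define the margin
```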
18. Regression
• Regression in machine learning is a supervised learning problem where
the output variable is a real or continuous value, such as “salary” or
“weight”.
• Many different models can be used; the simplest is linear regression.
• It tries to fit the data with the best hyperplane through the points.
• Various techniques are used for regression analysis, such as Linear
Regression, Decision Tree Regression, Random Forest Regression, etc.
19. Simple Linear Regression
• Simple linear regression allows us to summarize and study relationships
between two continuous variables where,
• One variable, denoted by x, is regarded as the predictor, explanatory,
or independent variable.
• The other variable, denoted by y, is regarded as the response, outcome,
or dependent variable.
• Mathematically, it is expressed as:
y = b0 + b1*x + e, where:
• b0 and b1 are the regression beta coefficients (parameters):
• b0 is the intercept of the regression line, i.e. the predicted value of y when x = 0.
• b1 is the slope of the regression line.
• e is the error term.
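The coefficients b0 and b1 have a closed-form least-squares solution; a plain-Python sketch (the toy data is invented and lies exactly on y = 2 + 3x):

```python
def simple_linear_regression(xs, ys):
    """Fit y = b0 + b1*x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b1 = covariance(x, y) / variance(x)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    b0 = mean_y - b1 * mean_x  # the line passes through (mean_x, mean_y)
    return b0, b1

xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]  # exactly y = 2 + 3x, so e = 0
b0, b1 = simple_linear_regression(xs, ys)
print(b0, b1)  # 2.0 3.0
```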
21. Multiple Linear Regression
• Multiple linear regression is used to explain the relationship between one
continuous dependent variable and two or more independent variables.
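With two or more predictors the same least-squares idea applies; a sketch using NumPy (the data is invented and generated exactly from y = 1 + 2*x1 + 3*x2):

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [1, 2], [3, 2], [2, 3]], dtype=float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]  # noise-free toy targets

# Prepend a column of ones so the intercept b0 is estimated alongside b1, b2
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # recovers [b0, b1, b2] = [1, 2, 3] on this noise-free data
```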
22. Support Vector Regression
• In the case of regression, where a continuous value is to be produced as
output, a non-linear function is learned by a linear learning machine that
maps the input into a high-dimensional, kernel-induced feature space.
• The capacity of the system is controlled by parameters that do not
depend on the dimensionality of the feature space.
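A sketch with scikit-learn's `SVR` (the data and parameter values are illustrative assumptions):

```python
from sklearn.svm import SVR

X = [[1], [2], [3], [4], [5]]
y = [1.2, 1.9, 3.1, 4.0, 5.1]  # roughly y = x

# epsilon defines an insensitive tube around the fit; C controls capacity
reg = SVR(kernel="linear", C=10, epsilon=0.1)
reg.fit(X, y)
print(reg.predict([[2.5], [4.5]]))  # both close to the y = x line
```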
24. Decision Tree Regression
• A core algorithm for building decision trees is ID3.
• For regression, ID3 uses the method of Standard Deviation Reduction.
• Standard deviation reduction is the decrease in the standard deviation of
the target after a dataset is split on an attribute.
• Constructing the tree is about finding the attribute that returns the
highest standard deviation reduction.
• The dataset is divided based on the values of the selected attribute. This
process is run recursively on the non-leaf branches until all data is
processed.
• When more than one instance reaches a leaf node, their average is taken
as the final value for the target.
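The standard deviation reduction described above can be computed directly; a plain-Python sketch with hypothetical target values:

```python
import statistics

def std_dev_reduction(values, groups):
    """SDR = std(parent) minus the size-weighted std of each child group."""
    n = len(values)
    parent_sd = statistics.pstdev(values)
    weighted_child_sd = sum(len(g) / n * statistics.pstdev(g) for g in groups)
    return parent_sd - weighted_child_sd

target = [10, 12, 11, 30, 32, 31]           # target values at a node
good_split = [[10, 12, 11], [30, 32, 31]]   # separates low from high values
poor_split = [[10, 30, 11], [12, 32, 31]]   # mixes them together

print(std_dev_reduction(target, good_split))  # large reduction
print(std_dev_reduction(target, poor_split))  # small reduction
```

The tree-building step picks the attribute whose split yields the largest reduction, exactly as described above.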
25. Random Forest Regression
• The random forest model is an ensemble learning method in which
multiple decision trees are used to generate an output.
• As seen in decision tree regression, a single decision tree generates its
output as the average of the values at its leaf nodes.
• In the random forest model, the output is generated by taking the mean
of the outputs of all the decision trees in the ensemble.
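A sketch with scikit-learn's `RandomForestRegressor` (the toy data is invented and lies roughly on y = x):

```python
from sklearn.ensemble import RandomForestRegressor

X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1, 6.9, 8.0]

# The ensemble's prediction is the mean of the individual trees' outputs
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y)
print(reg.predict([[4.5]]))
```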
26. Unsupervised Learning
• Unsupervised learning is performed on unlabeled data – no output labels
(categories) are given in the data.
• Here, the task of the machine is to group unsorted information according
to similarities, patterns, and differences, without any prior training on
labeled data.
• Two of the main methods used in unsupervised learning are:
• Principal Component Analysis, and
• Cluster Analysis.
27. Cluster Analysis
• Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense) to each other than to those in other groups
(clusters).
• Cluster analysis can be achieved by various algorithms that differ
significantly in their understanding of what constitutes a cluster and
how to efficiently find them.
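A sketch of one such algorithm, k-means, using scikit-learn (two well-separated groups of invented points):

```python
from sklearn.cluster import KMeans

X = [[1, 1], [1.5, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

# Group the unlabeled points into 2 clusters by similarity (distance)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)  # first three points share one label, last three the other
```

Note that no labels are supplied; the cluster assignments emerge purely from the distances between points.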
28. Principal Component Analysis
• Principal component analysis is a method of extracting important
variables from a large set of variables available in a data set.
• It extracts a low-dimensional set of features from a high-dimensional
data set, with the aim of capturing as much information (variance) as
possible.
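A sketch with scikit-learn's `PCA` on synthetic correlated data (invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Three features that are all nearly linear functions of one hidden
# factor t, so almost all variance lies along a single direction.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t,
                     2 * t + 0.05 * rng.normal(size=100),
                     -t + 0.05 * rng.normal(size=100)])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # project onto the top 2 principal components
print(X_reduced.shape)            # (100, 2)
print(pca.explained_variance_ratio_)
```

Here the first component should capture nearly all of the variance, which is what makes the reduction from three dimensions to two nearly lossless.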
29. References
1. Dhar, V. (2013). "Data science and prediction". Communications of the
ACM. 56 (12): 64–73.
2. Seth Familian (2016), “Context: What’s Big Data? Big in Growth too”,
slideshare.net.
3. Hayashi, Chikio (1 January 1998). "What is Data Science? Fundamental
Concepts and a Heuristic Example". In Hayashi, Chikio; Yajima, Keiji; Bock,
Hans-Hermann; Ohsumi, Noboru; Tanaka, Yutaka; Baba, Yasumasa (eds.).
Data Science, Classification, and Related Methods. Studies in
Classification, Data Analysis, and Knowledge Organization. Springer
Japan. pp. 40–51.
4. Stewart Tansley; Kristin Michele Tolle (2009). The Fourth Paradigm:
Data-intensive Scientific Discovery. Microsoft Research. ISBN 978-0-9825442-0-4.