A brief introduction to machine learning techniques applied in data science: definitions and applications of machine learning algorithms, and classification and regression techniques.
1. Machine Learning for Data Science
A Brief Introduction
By
Vaibhav Kumar
Assistant Professor
DIT University, Dehradun
Email: Vaibhav.kumar@dituniversity.edu.in, vaibhav05cse@gmail.com
GitHub: https://github.com/vaibhav05cse/
Vaibhav Kumar@DIT University
2. Contents
• Introduction to Data Science
• Applications of Data Science
• Foundations of Data Science
• Machine Learning
• Supervised Learning
• Classification
• Logistic Regression
• Decision Tree
• Random Forest
• K-Nearest Neighbor
• Support Vector Machine
• Regression
• Simple Linear Regression
• Multiple Linear Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Unsupervised Learning
• Cluster Analysis
• Principal Component Analysis
3. Introduction to Data Science
• Data science is a multi-disciplinary field which uses scientific
methods, processes, algorithms and systems to extract knowledge
and insights from structured and unstructured data [1].
• It is a blend of computer science, mathematics, and business/domain
expertise.
4. Need for Data Science
• The volume of data is growing at an ever-increasing rate [2].
• Finding insights in such huge amounts of data requires powerful analytics
techniques.
• Data science has the capacity to meet this requirement.
5. What Can Data Science Do?
• It unifies statistics, data analysis, machine learning and their related
methods in order to understand and analyze actual phenomena with data
[3].
• It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, computer science, and information
science [4].
• It does predictive analytics to predict the likelihood of a particular event
occurring in the future.
• It does prescriptive analytics to find the best course of action for a given
situation.
• It employs machine learning techniques to discover patterns from the data.
7. Foundations of Data Science
• Statistics: Descriptive, Inferential.
• Linear Algebra: Matrices, Planes, Vectors, etc.
• Computer Science: Algorithms, Graph Theory, Data Structures, DBMS, etc.
• Machine Learning: Supervised, Unsupervised, Reinforcement.
• Business Analytics: Predictive, Prescriptive, Descriptive, Decision.
• Programming: R/Python, SQL, NoSQL.
8. Machine Learning
• Machine learning is a subfield of computer science which focuses on
developing computer algorithms that learn from examples and improve
their performance on a task.
• Machine learning algorithms use training data – a set of past observations.
• There are three broad categories of machine learning:
Supervised Learning: learns from labeled examples.
Unsupervised Learning: learns from unlabeled examples.
Reinforcement Learning: learns from an environment through feedback.
• It is used to build predictive analytics models which allow researchers and
data scientists to make predictions about the future based on past and
current data.
9. Supervised Learning
• It is a category of machine learning algorithms. As the name indicates, it is
supervised by the presence of the output in the training data.
• It learns from labeled data – input for which the output is known.
• It builds a mathematical model from a set of data that contains both the
inputs and the desired outputs.
• A supervised learning algorithm analyzes the training data and produces
an inferred function, which can be used for mapping new examples.
• Generally, supervised learning problems are classified into classification
and regression problems.
10. Classification
• Classification in machine learning is a supervised learning problem where
the output variable is a category, such as “yes”/“no” or “disease”/“no
disease”.
• In this problem, the dependent variable is categorical whose category
is predicted based on several independent variables.
• A classification model attempts to draw some conclusion from
observed values.
• Given one or more inputs, a classification model will try to predict the
value of one or more outcomes.
• There are a number of classification models.
11. Classification through Machine Learning Algorithms
The following machine learning algorithms are popular for classification
problems:
• Logistic Regression
• Decision Tree
• Random Forest
• K-Nearest Neighbor
• Support Vector Machine
12. Logistic Regression
• This regression model is used when the dependent variable is
categorical.
• In the binary case, the model estimates the probability that an observation
belongs to one of two categories.
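A minimal sketch of logistic regression using scikit-learn (the library choice and the toy data are illustrative assumptions, not part of the slides):

```python
# Logistic regression: binary classification on a tiny, linearly
# separable toy dataset with a single feature.
from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]  # single input feature
y = [0, 0, 0, 1, 1, 1]                          # binary category

model = LogisticRegression()
model.fit(X, y)
print(model.predict([[1.2], [3.8]]))       # class labels for new inputs
print(model.predict_proba([[3.8]])[0, 1])  # estimated probability of class 1
```

Besides the hard class label, `predict_proba` exposes the estimated probability, which is often what makes logistic regression useful in practice.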
13. Decision Tree
• A decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node holds a class label.
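The flowchart-like structure above can be sketched with scikit-learn's `DecisionTreeClassifier`; the buys/does-not-buy dataset here is invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Features: [age, income in thousands]; label: 1 = "buys", 0 = "does not buy"
X = [[22, 20], [25, 25], [47, 60], [52, 75], [46, 55], [56, 80]]
y = [0, 0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)  # learns attribute tests at the internal nodes
print(tree.predict([[24, 22], [50, 70]]))
```

`sklearn.tree.export_text(tree)` can print the learned tests and leaf labels as an indented flowchart, which matches the description above.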
14. Random Forest
• Random forests or random decision forests are an ensemble learning
method that consists of a large number of decision trees.
• Each individual tree in the random forest produces a class prediction, and
the class with the most votes becomes the model’s prediction.
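The voting scheme can be sketched as follows (scikit-learn and the toy data are assumptions for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical [age, income] features with a buys/does-not-buy label
X = [[20, 15], [22, 20], [25, 25], [27, 30],
     [47, 60], [52, 75], [46, 55], [56, 80]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# 10 trees, each fit on a bootstrap sample; the ensemble prediction
# is the class with the most votes across the trees.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)
print(forest.predict([[24, 22], [50, 70]]))
```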
15. K-Nearest Neighbor
• In k-NN classification, the output is a class membership of a new
observation.
• An object is classified by a plurality vote of its neighbors, with the
object being assigned to the class most common among its k nearest
neighbors.
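The plurality vote described above can be sketched in plain Python (the function name and toy data are invented for illustration):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by plurality vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    dists = [math.dist(x, x_new) for x in X_train]
    # Indices of the k closest training points
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]
    # Most common class label among those neighbors wins the vote
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X_train, y_train, [1.5, 1.5]))  # "A"
print(knn_predict(X_train, y_train, [8.5, 8.5]))  # "B"
```

An odd k is usually chosen for two-class problems so the vote cannot tie.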
16. Support Vector Machine
• In Support Vector Machine (SVM), we plot each data item as a point
in n-dimensional space (where n is the number of features you have)
with the value of each feature being the value of a particular
coordinate.
• Then, we perform classification by finding the hyperplane that
differentiates the two classes well.
• To identify the hyperplane, we try to maximize the margin – the distance
between the hyperplane and the boundary elements (support vectors) of
the separated classes.
• A variety of kernel functions is used, depending on whether the
observations are linearly separable or non-linearly separable.
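A sketch with scikit-learn's `SVC` (the toy points are invented; a linear kernel suffices for them, while `kernel="rbf"` would handle non-linearly separable data):

```python
from sklearn.svm import SVC

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[2, 2], [8, 7]]))
print(clf.support_vectors_)  # the boundary points that define the margin
```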
18. Regression
• Regression in machine learning is a supervised learning problem where
the output variable is a real or continuous value, such as “salary” or
“weight”.
• Many different models can be used; the simplest is linear regression.
• It tries to fit the data with the best hyperplane through the points.
• Various techniques are used for regression analysis, such as Linear
Regression, Decision Tree Regression, Random Forest Regression, etc.
19. Simple Linear Regression
• Simple linear regression allows us to summarize and study relationships
between two continuous variables where,
• One variable, denoted by x, is regarded as the predictor, explanatory,
or independent variable.
• The other variable, denoted by y, is regarded as the response, outcome,
or dependent variable.
• Mathematically, it is expressed as:
y = b0 + b1*x + e, where:
• b0 and b1 are the regression beta coefficients (parameters):
• b0 is the intercept of the regression line, i.e. the predicted value of y when x = 0.
• b1 is the slope of the regression line.
• e is the error term.
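The coefficients b0 and b1 have a closed-form least-squares solution; a plain-Python sketch (the toy data is invented and lies exactly on y = 2 + 3x):

```python
def simple_linear_regression(xs, ys):
    """Fit y = b0 + b1*x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b1 = covariance(x, y) / variance(x)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    b0 = mean_y - b1 * mean_x  # the line passes through (mean_x, mean_y)
    return b0, b1

xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]  # exactly y = 2 + 3x, so e = 0
b0, b1 = simple_linear_regression(xs, ys)
print(b0, b1)  # 2.0 3.0
```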
21. Multiple Linear Regression
• Multiple linear regression is used to explain the relationship between one
continuous dependent variable and two or more independent variables.
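With two or more predictors the same least-squares idea applies; a sketch using NumPy (the data is invented and generated exactly from y = 1 + 2*x1 + 3*x2):

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [1, 2], [3, 2], [2, 3]], dtype=float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]  # noise-free toy targets

# Prepend a column of ones so the intercept b0 is estimated alongside b1, b2
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # recovers [b0, b1, b2] = [1, 2, 3] on this noise-free data
```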
22. Support Vector Regression
• In the case of regression, where a continuous value is to be produced as
output, a non-linear function is learned by a linear learning machine that
maps the input into a high-dimensional, kernel-induced feature space.
• The capacity of the system is controlled by parameters that do not
depend on the dimensionality of the feature space.
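A sketch with scikit-learn's `SVR` (the data and parameter values are illustrative assumptions):

```python
from sklearn.svm import SVR

X = [[1], [2], [3], [4], [5]]
y = [1.2, 1.9, 3.1, 4.0, 5.1]  # roughly y = x

# epsilon defines an insensitive tube around the fit; C controls capacity
reg = SVR(kernel="linear", C=10, epsilon=0.1)
reg.fit(X, y)
print(reg.predict([[2.5], [4.5]]))  # both close to the y = x line
```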
24. Decision Tree Regression
• A core algorithm for building decision trees is ID3.
• For regression, ID3 uses the method of Standard Deviation Reduction.
• Standard deviation reduction is the decrease in the standard deviation of
the target after a dataset is split on an attribute.
• Constructing the tree is about finding the attribute that returns the
highest standard deviation reduction.
• The dataset is divided based on the values of the selected attribute. This
process is run recursively on the non-leaf branches until all data is
processed.
• When more than one instance reaches a leaf node, their average is taken
as the final value for the target.
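The standard deviation reduction described above can be computed directly; a plain-Python sketch with hypothetical target values:

```python
import statistics

def std_dev_reduction(values, groups):
    """SDR = std(parent) minus the size-weighted std of each child group."""
    n = len(values)
    parent_sd = statistics.pstdev(values)
    weighted_child_sd = sum(len(g) / n * statistics.pstdev(g) for g in groups)
    return parent_sd - weighted_child_sd

target = [10, 12, 11, 30, 32, 31]           # target values at a node
good_split = [[10, 12, 11], [30, 32, 31]]   # separates low from high values
poor_split = [[10, 30, 11], [12, 32, 31]]   # mixes them together

print(std_dev_reduction(target, good_split))  # large reduction
print(std_dev_reduction(target, poor_split))  # small reduction
```

The tree-building step picks the attribute whose split yields the largest reduction, exactly as described above.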
25. Random Forest Regression
• The random forest model is an ensemble learning method in which
multiple decision trees are used to generate an output.
• As seen in decision tree regression, a single decision tree generates its
output as the average of the values at its leaf nodes.
• In the random forest model, the output is generated by taking the mean
of the outputs of all the decision trees in the ensemble.
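A sketch with scikit-learn's `RandomForestRegressor` (the toy data is invented and lies roughly on y = x):

```python
from sklearn.ensemble import RandomForestRegressor

X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1, 6.9, 8.0]

# The ensemble's prediction is the mean of the individual trees' outputs
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y)
print(reg.predict([[4.5]]))
```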
26. Unsupervised Learning
• Unsupervised learning is performed on unlabeled data – no output labels
(categories) are given in the data.
• Here, the task of the machine is to group unsorted information according
to similarities, patterns, and differences, without any prior training on
labeled data.
• Two of the main methods used in unsupervised learning are:
• Principal Component Analysis, and
• Cluster Analysis.
27. Cluster Analysis
• Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense) to each other than to those in other groups
(clusters).
• Cluster analysis can be achieved by various algorithms that differ
significantly in their understanding of what constitutes a cluster and
how to efficiently find them.
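A sketch of one such algorithm, k-means, using scikit-learn (two well-separated groups of invented points):

```python
from sklearn.cluster import KMeans

X = [[1, 1], [1.5, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

# Group the unlabeled points into 2 clusters by similarity (distance)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)  # first three points share one label, last three the other
```

Note that no labels are supplied; the cluster assignments emerge purely from the distances between points.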
28. Principal Component Analysis
• Principal component analysis is a method of extracting important
variables from a large set of variables available in a data set.
• It extracts a low-dimensional set of features from a high-dimensional
data set, with the aim of capturing as much information (variance) as
possible.
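A sketch with scikit-learn's `PCA` on synthetic correlated data (invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Three features that are all nearly linear functions of one hidden
# factor t, so almost all variance lies along a single direction.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t,
                     2 * t + 0.05 * rng.normal(size=100),
                     -t + 0.05 * rng.normal(size=100)])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # project onto the top 2 principal components
print(X_reduced.shape)            # (100, 2)
print(pca.explained_variance_ratio_)
```

Here the first component should capture nearly all of the variance, which is what makes the reduction from three dimensions to two nearly lossless.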
29. References
1. Dhar, V. (2013). "Data science and prediction". Communications of the
ACM. 56 (12): 64–73.
2. Seth Familian (2016), “Context: What’s Big Data? Big in Growth too”,
slideshare.net.
3. Hayashi, Chikio (1 January 1998). "What is Data Science? Fundamental
Concepts and a Heuristic Example". In Hayashi, Chikio; Yajima, Keiji; Bock,
Hans-Hermann; Ohsumi, Noboru; Tanaka, Yutaka; Baba, Yasumasa (eds.).
Data Science, Classification, and Related Methods. Studies in
Classification, Data Analysis, and Knowledge Organization. Springer
Japan. pp. 40–51.
4. Stewart Tansley; Kristin Michele Tolle (2009). The Fourth Paradigm:
Data-intensive Scientific Discovery. Microsoft Research. ISBN 978-0-9825442-0-4.