What is Data?
What is machine learning?
Learning system model
Training and testing
Performance
Applications
Types of Machine learning
Learning techniques
Linear Regression – Step By Step
In simpler terms, a machine “learns” by looking for patterns in large amounts of data; when it finds one, it adjusts its program to reflect the “truth” of what it found. The more data the machine is exposed to, the “smarter” it gets, and once it has seen enough patterns it begins to make predictions. Unlike humans, however, machines cannot generalize knowledge or transfer learning from one application to another.
In the 20th century, computer programmers had to get their machines to do things by tapping out lines of code specifying exactly what needed to be done. Machine learning shifts some of that work away from humans, letting the computer figure things out for itself.
Why do we use train and test sets?
Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem.
The training dataset is used to prepare a model, to train it.
We pretend the test dataset is new data whose output values are withheld from the algorithm. We gather predictions from the trained model on the inputs of the test dataset and compare them to the withheld output values of the test set.
Comparing the predictions and existing outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.
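The train/test procedure described above can be sketched in a few lines of Python. This is a minimal illustration using an invented toy dataset and a trivial rule as the "model"; the function names are mine, not from any particular library.

```python
import random

def train_test_split(data, test_ratio=0.3, seed=42):
    """Shuffle a copy of the data and split it into train and test portions."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Toy labeled dataset: (feature, label) pairs, invented for illustration.
dataset = [(i, i % 2) for i in range(20)]
train, test = train_test_split(dataset)

# A trivial "model" that happens to be the true rule for this toy data.
def model(x):
    return x % 2

# Score the model on the held-out test set: this is the performance estimate.
accuracy = sum(model(x) == y for x, y in test) / len(test)
print(accuracy)
```

In practice one would train a real model on the training portion rather than hard-coding the rule, but the shape of the procedure is the same: fit on train, predict on test, compare.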
When we evaluate an algorithm, we are in fact evaluating all steps in the procedure, including how the training data was prepared (e.g. scaling), the choice of algorithm (e.g. kNN), and how the chosen algorithm was configured (e.g. k=3).
The performance measure calculated on the predictions is an estimate of the skill of the whole procedure.
We generalize the performance measure from:
“the skill of the procedure on the test set”
to
“the skill of the procedure on unseen data”.
This is quite a leap and requires that:
The procedure is sufficiently robust that the estimate of skill is close to what we actually expect on unseen data.
The choice of performance measure accurately captures what we are interested in measuring in predictions on unseen data.
The choice of data preparation is well understood and repeatable on new data, and reversible if predictions need to be returned to their original scale or related to the original input values.
The choice of algorithm makes sense for its intended use and operational environment (e.g. complexity or chosen programming language).
A lot rides on the estimated skill of the whole procedure on the test set.
In fact, using the train/test method to estimate the skill of the procedure on unseen data often has high variance (unless we have a great deal of data to split). This means that when the evaluation is repeated, it gives different results, often very different results.
The outcome is that we may be quite uncertain about how well the procedure actually performs on unseen data and how one procedure compares to another.
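This variance is easy to demonstrate: repeating the split with different random shuffles gives a different score each time. The sketch below uses invented data and a deliberately simple majority-class "model" so that only the split changes between runs.

```python
import random

def split_score(data, seed):
    """One train/test split followed by a majority-class 'model'."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.7)
    train, test = shuffled[:cut], shuffled[cut:]
    # Predict the most common training label for every test point.
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(label == majority for _, label in test) / len(test)

# Imbalanced toy data: 12 examples of class 0, 8 of class 1.
data = [(i, 0 if i < 12 else 1) for i in range(20)]
scores = [split_score(data, seed) for seed in range(10)]
print(min(scores), max(scores))  # the estimate typically varies run to run
```

With a small dataset like this, the spread between the lowest and highest score is usually substantial, which is exactly the uncertainty described above.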
Supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning each example is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data), and the supervised learning algorithm uses what it learned from the training data (the set of training examples) to produce the correct outcome.
For instance, suppose you are given a basket filled with different kinds of fruits. The first step is to train the machine on all the different fruits, one by one, like this:
If the shape of the object is rounded with a depression at the top, and its color is red, then it will be labelled as Apple.
If the shape of the object is a long curving cylinder and its color is green-yellow, then it will be labelled as Banana.
Now suppose that, after training, you give the machine a new fruit from the basket (say a banana) and ask it to identify it.
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.
Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
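The fruit example above can be turned into a tiny classification sketch. Here a nearest-neighbour rule stands in for the learning algorithm; the features (roundness, redness on a 0-1 scale) and the training examples are invented for illustration.

```python
# Labeled training data for the fruit example: ((roundness, redness), label).
training_data = [
    ((0.9, 0.9), "Apple"),   # round, red
    ((0.95, 0.8), "Apple"),
    ((0.1, 0.1), "Banana"),  # long, green-yellow
    ((0.2, 0.15), "Banana"),
]

def classify(features):
    """Label a new fruit by its closest training example (1-nearest-neighbour)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training_data, key=lambda pair: dist(pair[0], features))
    return nearest[1]

print(classify((0.15, 0.2)))  # → Banana
```

A new fruit that is long and green-yellow lands near the banana examples, so the model labels it Banana, which is exactly the supervised workflow: learn from labeled examples, then predict on new data.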
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences, without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training answers are given to the machine. The machine is therefore left to find the hidden structure in the unlabeled data by itself.
For instance, suppose the machine is given images containing both dogs and cats, none of which it has ever seen before.
The machine has no idea of the features of dogs and cats, so it cannot label the images as “dogs” and “cats”. But it can categorize them according to their similarities, patterns, and differences: it can easily split the pictures into two parts, one part containing all the pictures with dogs and the other all the pictures with cats, without having learned anything beforehand (no training data or examples).
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
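Clustering is easy to sketch in miniature. Below is a one-dimensional k-means (k=2) on invented numbers that form two obvious groups; it assumes both clusters stay non-empty, which holds for this toy data but would need guarding in general.

```python
# Minimal 1-D k-means sketch (k=2): group unlabeled numbers by similarity.
def kmeans_1d(points, iters=10):
    centers = [min(points), max(points)]  # simple initialisation
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[], []]
        for p in points:
            idx = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[idx].append(p)
        # Move each center to the mean of its cluster (assumed non-empty).
        centers = [sum(c) / len(c) for c in clusters]
    return clusters, centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]  # two obvious groups
clusters, centers = kmeans_1d(points)
print(clusters)
```

No label was ever supplied: the grouping emerges purely from the similarity of the values, which is the essence of clustering.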
The basic difference between Supervised and Unsupervised learning is that Supervised Learning datasets have an output label associated with each tuple while Unsupervised Learning datasets do not.
The most basic disadvantage of any Supervised Learning algorithm is that the dataset has to be hand-labeled, either by a Machine Learning Engineer or a Data Scientist. This is a very costly process, especially when dealing with large volumes of data. The most basic disadvantage of any Unsupervised Learning is that its range of applications is limited.
To counter these disadvantages, the concept of Semi-Supervised Learning was introduced. In this type of learning, the algorithm is trained on a combination of labeled and unlabeled data. Typically, this combination contains a very small amount of labeled data and a very large amount of unlabeled data. The basic procedure is that the programmer first clusters similar data using an unsupervised learning algorithm and then uses the existing labeled data to label the rest of the unlabeled data.
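The procedure just described can be sketched with a simple pseudo-labeling step: a handful of labeled points propagate their labels to nearby unlabeled points. All of the data and labels below are invented for illustration; proximity here is just absolute distance on a single feature.

```python
# A very small labeled set and a larger unlabeled set (toy 1-D features).
labeled = [(1.0, "A"), (9.0, "B")]
unlabeled = [0.8, 1.3, 8.5, 9.4, 9.1]

def nearest_label(x, examples):
    """Give x the label of the closest labeled example."""
    return min(examples, key=lambda pair: abs(pair[0] - x))[1]

# Propagate labels from the labeled set to the unlabeled points,
# then combine both into one larger training set.
pseudo_labeled = [(x, nearest_label(x, unlabeled and labeled)) for x in unlabeled]
full_training_set = labeled + pseudo_labeled
print(full_training_set)
```

The result is a training set far larger than the hand-labeled portion, which is the whole point: most of the labeling cost is replaced by structure discovered in the unlabeled data.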
One may imagine the three types of learning algorithms as follows: supervised learning, where a student is under the supervision of a teacher at both home and school; unsupervised learning, where a student has to figure out a concept by himself; and semi-supervised learning, where a teacher teaches a few concepts in class and gives homework questions based on similar concepts.
Reinforcement Learning is a type of Machine Learning paradigm in which a learning algorithm is trained not on a preset dataset but through a feedback system of rewards. These algorithms are touted as the future of Machine Learning because they can reduce the cost of collecting and cleaning data.