1
The World of Loss Function
- Part I -
2018. 8.
김 홍 배
2
3
4
1. Classification with NN
2. Linear Classification : Support Vector Machine
* Non-linear SVM ?
3. Logistic Regression :
Binary Classifier, Cross entropy,
Information Theory
4. What is “Maximum A Posterior Estimator” ?
5. Kullback-Leibler(KL) divergence
6. Softmax Regression : Multi-class Classifier
7. Focal Loss
8. Discriminative Feature Learning
9. Learning by Association
Part 1
5
Classification with NN
2D Example
6
Let’s imagine a simple case
To classify the given classes,
we only need to define a straight line
Human
Dog
Classification with NN
7
f = Σ_{i=0}^{2} wᵢ Xᵢ
+1
-1
Human
Dog
At Decision boundary
f(x)= 0
slope
offset
Single cell is enough for a simple case !
Classification with NN
8
X1 : No. of straight lines in the image
X2 : Black pixel ratio (%)
Decision boundary : w1/w2 ≈ 12, w0/w2 ≈ −60
e.g., get a computer to classify an input image as Chinese or Japanese
Classification with NN
9
Human
Monkey
Let’s imagine a more complex case
Can’t classify with a straight line → needs a more complex boundary
Classification with NN
10
We need many cells and layers generally
We can create extra features that allow more
complex decision boundaries
Classification with NN
• Select a network architecture
• Randomly initialize weights
• Observe features “x” with reference “y”
• Push “x” through the NN → output is “ŷ”
• Calculate the error : (y − ŷ)², e.g. the least-squared error
• While error is too large
– Calculate errors and backpropagate error signals
– Adjust weights
• Evaluate performance using the test set
Network Training by Backpropagation
y
ŷ
11
• How should we update the weights to improve?
→ To minimize the error (loss) function J = (y − ŷ)², the gradient descent algorithm is generally used
Network Training by Backpropagation
[Figure : gradient descent on the loss surface J(w) over w, from a start point to the final point]
Sensitivity w.r.t. the cost function :
wₙ₊₁ = wₙ − η ∂J(wₙ)/∂wₙ
12
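As a minimal illustration of the update rule above (my own sketch, not from the slides; NumPy is assumed and the toy data, learning rate eta and step count are hypothetical), the snippet applies wₙ₊₁ = wₙ − η ∂J/∂wₙ to the squared error of a single linear cell.

```python
import numpy as np

# Toy data: a single linear cell y_hat = w . x (bias folded in as x[0] = 1)
x = np.array([1.0, 2.0, 3.0])      # input features, x[0] acts as the bias input
y = 1.0                             # reference output
w = np.zeros(3)                     # initial weights
eta = 0.01                          # learning rate

for n in range(200):
    y_hat = w @ x                   # push x through the cell
    grad = -2.0 * (y - y_hat) * x   # dJ/dw for J = (y - y_hat)^2
    w = w - eta * grad              # w_{n+1} = w_n - eta * dJ/dw_n

print("final loss:", (y - w @ x) ** 2)
```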
13
LINEAR CLASSIFIER :
SUPPORT VECTOR MACHINES
14
The main idea of the SVMs may be summed up as follows:
• “Given training samples, the SVM constructs a hyperplane
as the decision surface in such a way that the margin of separation
between positive and negative examples is maximized.”
Introduction
15
Linearly Separable Patterns
SVM is a binary learning machine.
• Binary classification is the task of separating classes
in feature space.
16
Which of the linear separators is optimal?
Linearly Separable Patterns
17
• The optimal decision boundary is the one that maximizes
the margin ρ
Optimal Decision Boundary
18
[Figure : hyperplane through P₀ with normal vector w; vectors x₀ = OP₀ and x = OP]
Define vectors x₀ = OP₀ and x = OP,
where P is an arbitrary point on the hyperplane.
The condition for P to be on the plane is that the vector x − x₀ is
perpendicular to w :
w · (x − x₀) = 0, or w · x + b = 0
Equation of a Hyperplane
19
Understanding the basics
g(x) is a linear function :
• A hyperplane in the feature space
• Normal vector of the hyperplane : n = w / ‖w‖
20
[Figure : maximum-margin classifier with margin and safe zone; classes +1 and −1, normal vector n]
Understanding the basics
• The linear discriminant function (classifier) with the
maximum margin is the best
• The margin is defined as the width
that the boundary could be
increased by before hitting a
data point
• Why is it the best?
→ Robust to outliers and thus
strong generalization ability
21
[Figure : labeled data points (+1, −1) with separating hyperplane and normal vector n]
Understanding the basics
• Given a set of data points (xᵢ, yᵢ), i = 1, 2, ···, n
𝑓𝑜𝑟 𝑦𝑖 = +1, 𝑊𝑥𝑖 + 𝑏 > 0
𝑓𝑜𝑟 𝑦𝑖 = −1, 𝑊𝑥𝑖 + 𝑏 < 0
• With a scale transformation on
both 𝑊 𝑎𝑛𝑑 𝑏,
𝑓𝑜𝑟 𝑦𝑖 = +1, 𝑊𝑥𝑖 + 𝑏 > +1
𝑓𝑜𝑟 𝑦𝑖 = −1, 𝑊𝑥𝑖 + 𝑏 < −1
22
[Figure : support vectors x⁺, x⁻ lying on the planes Wx⁺ + b = +1 and Wx⁻ + b = −1; margin measured along the normal n]
Understanding the basics
• At the extreme points (support vectors) x⁺, x⁻ :
Wx⁺ + b = +1, Wx⁻ + b = −1
• The margin width is :
M = (x⁺ − x⁻) ∙ n = 2 / ‖w‖
Maximize 2 / ‖w‖
such that
for yᵢ = +1, Wxᵢ + b ≥ +1
for yᵢ = −1, Wxᵢ + b ≤ −1
Or minimize ½ ‖w‖² subject to
𝑦𝑖(𝑊𝑥𝑖 + 𝑏) ≥ 1
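As a small illustration of the margin formulation above (my own sketch, not part of the deck), the snippet checks the constraints yᵢ(w·xᵢ + b) ≥ 1 and evaluates the margin width 2/‖w‖ for a hand-picked hyperplane on hypothetical 2-D data; a real SVM would instead obtain w, b by minimizing ½‖w‖² under these constraints.

```python
import numpy as np

# Hypothetical 2-D data: two points of class +1 and two of class -1
X = np.array([[2.0, 2.0], [3.0, 3.0],    # y = +1
              [0.0, 0.0], [1.0, 0.0]])   # y = -1
y = np.array([1, 1, -1, -1])

# A candidate separating hyperplane w.x + b = 0 (chosen by hand here)
w = np.array([1.0, 1.0])
b = -3.0

functional_margins = y * (X @ w + b)          # y_i (w.x_i + b), should all be >= 1
print("constraints y_i(w.x_i + b) >= 1 :", functional_margins)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```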
23
The Optimization Problem
Introduce Lagrange multipliers αᵢ ≥ 0.
• That is, the Lagrange function
L(w, b, α) = ½‖w‖² − Σᵢ αᵢ [ yᵢ(Wxᵢ + b) − 1 ]
is to be minimized with respect to w and b, i.e. ∂L/∂w = 0 and ∂L/∂b = 0.
• The simplest way to separate two groups of data is with a straight line (1
dimension), flat plane (2 dimensions) or an N-dimensional hyperplane.
• However, there are situations where a nonlinear region can separate the
groups more efficiently.
• The kernel function transforms the data into a higher-dimensional feature
space to make it possible to perform the linear separation.
Non-Linear SVM(Support Vector Machines)
Kernel trick :
map from the input space to a feature space to simplify the classification task.
A non-linear SVM classifier using the RBF (radial basis function) kernel is
adopted.
Non-Linear SVM(Support Vector Machines)
Inner product in the feature space (a measure of similarity)
Key Idea of Kernel Methods
K(𝑥𝑖, 𝑥𝑗)
K(𝑥𝑖, 𝑥𝑗) = Φ(𝑥𝑖)· Φ(𝑥𝑗)
Normal Condition (cluster bound) :
exp{ −[(x₁ − c₁)² + (x₂ − c₂)²] / 2σ² } ≥ Threshold, with 0 < Threshold << 1
which is equivalent to
(x₁ − c₁)² + (x₂ − c₂)² ≤ r², i.e. K₁ + K₂ ≤ r²
[Figure : circle of radius r centred at (c₁, c₂) in the (x₁, x₂) plane, with K₁, K₂ the squared distances along each axis]
Key Idea of Kernel Methods
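A minimal sketch (NumPy assumed; the centre, σ and threshold values are hypothetical) of the RBF kernel K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / 2σ²) playing the role of Φ(xᵢ)·Φ(xⱼ), together with the cluster-bound check from the slide above.

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)) = Phi(xi) . Phi(xj)."""
    d2 = np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# "Normal condition": a point x lies inside the cluster of centre c
# if K(x, c) >= threshold, which is equivalent to ||x - c||^2 <= r^2.
c = np.array([0.0, 0.0])                      # assumed cluster centre (c1, c2)
threshold = 0.1                               # 0 < threshold << 1 (assumed value)
sigma = 1.0
r2 = -2.0 * sigma ** 2 * np.log(threshold)    # equivalent radius squared

for x in [np.array([0.5, 0.5]), np.array([3.0, 3.0])]:
    k = rbf_kernel(x, c, sigma)
    print(x, "K =", round(float(k), 3), "inside cluster:", k >= threshold,
          "(r^2 =", round(float(r2), 3), ")")
```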
RBFN architecture
[Figure : RBFN architecture — input layer x₁ … xₙ (no weights), hidden layer of RBFs, output-layer weights W₁ … W_M summed (Σ) into f(x)]
Each of n components of
the input vector x feeds
forward to m basis
functions whose outputs
are linearly combined with
weights w (i.e. a dot product of w with the basis-function outputs) into the network
output f(x).
The output layer performs this simple weighted sum.
If the RBFN is used for regression then this output is fine.
However, if pattern classification is required, then a hard-
limiter or sigmoid function could be placed on the output
neurons to give 0/1 output values
Input data set ∶ 𝑋 = { 𝑥1 𝑥2 … 𝑥 𝑁}
[Figure : RBF detector (Σ over the basis functions); two example queries score 0.9 and 0.2]
Radial Basis Function Detector
Architecture for Anomaly detection
- Anomaly or unusual event detection
[Figure : detector built on the normal data (Σ over the basis functions); queries similar to the normal data score high (0.9), unusual or abnormal queries score low (0.1)]
Radial Basis Function Detector
Architecture for Anomaly detection
- Classification Problem
[Figure : one RBF detector (Σ) per class — Category 1 and Category 2]
Radial Basis Function Detector
Architecture for Anomaly detection
 For Gaussian basis functions :
s(x_p) = w₀ + Σ_{i=1}^{M} wᵢ exp( − Σ_{j=1}^{n} (x_{pj} − c_{ij})² / (2σ_{ij}²) )
 Assume the variance σ across each dimension is equal :
s(x_p) = w₀ + Σ_{i=1}^{M} wᵢ exp( − (1 / 2σᵢ²) Σ_{j=1}^{n} (x_{pj} − c_{ij})² )
Architecture for Anomaly detection
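A sketch of the Gaussian-basis RBFN output s(x) from the formula above (my own illustration, NumPy assumed); the centres, widths and weights are placeholder values here — in practice they come from K-means, the K-nearest-neighbor rule and least squares, as the following slides describe.

```python
import numpy as np

def rbfn_output(x, centers, sigmas, weights, w0):
    """s(x) = w0 + sum_i w_i * exp(-||x - c_i||^2 / (2 sigma_i^2))."""
    d2 = np.sum((centers - x) ** 2, axis=1)      # squared distance to each centre c_i
    phi = np.exp(-d2 / (2.0 * sigmas ** 2))      # Gaussian basis activations
    return w0 + weights @ phi

# Placeholder (untrained) parameters
centers = np.array([[0.0, 0.0], [2.0, 2.0]])     # c_i
sigmas  = np.array([1.0, 1.0])                   # sigma_i
weights = np.array([1.0, 1.0])                   # w_i
w0 = 0.0

print(rbfn_output(np.array([0.1, 0.1]), centers, sigmas, weights, w0))  # near a centre -> high
print(rbfn_output(np.array([5.0, 5.0]), centers, sigmas, weights, w0))  # far from both -> near w0
```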
• Design decision
• number of hidden neurons
• max of neurons = number of input patterns
• more neurons – more complex, smaller tolerance
• Parameters to be learnt
• centers
• radii
• A hidden neuron is more sensitive to data points near its center.
This sensitivity may be tuned by adjusting the radius.
• smaller radius → fits the training data better (risk of overfitting)
• larger radius → less sensitivity, less overfitting, a network of
smaller size, faster execution
• weights between hidden and output layers
Architecture for Anomaly detection
The question now is:
How to train the RBF network?
In other words, how to find:
 The number and the parameters of hidden units (the basis functions)
using unlabeled data (unsupervised learning).
 K-Mean Clustering Algorithm
 The weights between the hidden layer and the output layer.
 Recursive Least-Squares Estimation Algorithm
RBFN Learning
[Figure : RBFN learning pipeline — inputs x_p → K-means finds the centres cᵢ → K-nearest neighbor sets the widths σᵢ → the basis functions give the design matrix A → linear regression gives the weights w]
RBFN Learning
 Use the K-mean algorithm to find ci
RBFN Learning
K-mean Algorithm
step1: K initial clusters are chosen randomly from the samples
to form K groups.
step2: Each new sample is added to the group whose mean is
the closest to this sample.
step3: Adjust the mean of the group to take account of the new
points.
step4: Repeat step2 until the distance between the old means
and the new means of all clusters is smaller than a
predefined tolerance.
Outcome: There are K clusters, with means representing
the centroid of each cluster.
Advantages: (1) A fast and simple algorithm.
(2) Reduces the effects of noisy samples.
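A compact version of steps 1–4 above (my sketch; the tolerance, iteration cap and toy data are assumptions, NumPy assumed).

```python
import numpy as np

def k_means(X, K, tol=1e-4, max_iter=100, seed=0):
    """Minimal K-means: returns K cluster centres (means) for data X."""
    rng = np.random.default_rng(seed)
    # step 1: K initial centres chosen randomly from the samples
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # step 2: assign each sample to the group whose mean is closest
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 3: adjust the mean of each group
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        # step 4: stop when the means move less than the tolerance
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
print(k_means(X, K=2))   # roughly (0, 0) and (5, 5)
```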
 Use the K-nearest-neighbor rule to find the function width σ :
σᵢ = ( (1/K) Σ_{k=1}^{K} ‖cᵢ − c_{ik}‖² )^{1/2}, where c_{ik} is the k-th nearest neighbor of cᵢ
 The objective is to cover the training points so that a
smooth fit of the training samples can be achieved
 RBF learning by gradient descent
 Let e(x_p) = d(x_p) − s(x_p) and φᵢ(x_p) = exp( − (1 / 2σᵢ²) Σ_{j=1}^{n} (x_{pj} − c_{ij})² ),
and define the batch error E = ½ Σ_{p=1}^{N} e(x_p)², N : No. of samples in the batch
 Compute the gradients ∂E/∂wᵢ, ∂E/∂c_{ij} and ∂E/∂σ_{ij} and apply gradient descent
we have the following update equations
 RBF learning by gradient descent
Logistic regression
Need to classify the output (y) as either 0 or 1 → binary classification.
A linear equation may not be a good fit for a classification problem.
Logistic regression uses the sigmoid function to model the hypothesis.
[Figure : a linear function vs. the sigmoid function of z, bounded between 0 and 1]
Want 0 ≤ h_θ(x) ≤ 1 for an arbitrary value of x
Logistic Regression Model
h_θ(x) = σ(θᵀx)
Binary Classification
42
Linear regression : J(w) = Σ_k (y_k − ŷ_k)²
Logistic regression : σ(z) = 1 / (1 + e^{−z}), σ(θᵀx) = 1 / (1 + e^{−θᵀx}), σ′(z) = σ(z)(1 − σ(z))
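A small sketch of the sigmoid, its derivative σ′(z) = σ(z)(1 − σ(z)) and the hypothesis h_θ(x) = σ(θᵀx) from the slide above; the parameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))         # approx [0.007, 0.5, 0.993]
print(sigmoid_prime(z))   # maximum slope 0.25 at z = 0

# Hypothesis of the logistic regression model: h_theta(x) = sigmoid(theta . x)
theta = np.array([0.5, -1.0])        # assumed parameters
x = np.array([2.0, 1.0])
print(sigmoid(theta @ x))            # estimated P(y = 1 | x; theta)
```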
Logistic regression
Binary Classification
Satisfy the condition that the probabilities of y, given x and parameterized by θ, sum to 1 :
P(y = 0 | x; θ) + P(y = 1 | x; θ) = 1
h_θ(x) : estimate of the probability that y = 1 for a given x,
with model parameter θ
h_θ(x) = P(y = 1 | x; θ)
Loss function of Logistic Regression (take the negative logarithm) :
J(θ) = −log P(y = 1 | x; θ) = −log h_θ(x)
44
Logistic regression
Binary Classification
Maximum (log) likelihood estimator (MLE)
θ* = argmax_θ log P(y = 1 | x; θ) = argmin_θ J(θ)
Why take the negative logarithm?
[Figure : −log P(y = 1 | x; θ) plotted against P(y = 1 | x; θ)]
To make J(θ) → 0 as P(y = 1 | x; θ) → 1
and J(θ) → ∞ as P(y = 1 | x; θ) → 0
Likelihood function
Maximize Likelihood = Minimize Loss
Likelihood : estimate unknown parameters
based on known outcomes : L(θ | y) = P(y | θ)
• Maximum Likelihood Approach for Binary classification
Data set {x⁽ⁱ⁾, y⁽ⁱ⁾}, where y⁽ⁱ⁾ ∈ {0, 1} and x⁽ⁱ⁾, i = 1, .., m
Since y⁽ⁱ⁾ is binary we can use the Bernoulli distribution
• Likelihood function associated with m observations
Generalization of Binary classification
L(θ) = P(y | x; θ) = Π_{i=1}^{m} P(yᵢ | xᵢ; θ) = Π_{i=1}^{m} h_θ(x⁽ⁱ⁾)^{y⁽ⁱ⁾} (1 − h_θ(x⁽ⁱ⁾))^{1−y⁽ⁱ⁾}
Logistic regression
45
For a single observation case,
P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 − h_θ(x)
P(y = 0 | x; θ) + P(y = 1 | x; θ) = 1
Same as previous page
46
• By taking the negative logarithm we get the
Cross-entropy Error Function
J(θ) = −(1/m) log P(y | x; θ) = −(1/m) Σ_{i=1}^{m} [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
Generalization of Binary classification
Logistic regression
Likelihood
Maximum likelihood estimator (MLE)
θ* = argmax_θ log P(y | x; θ) = argmin_θ J(θ)
Loss function is defined as
For the y⁽ⁱ⁾ = 1 case : J(θ) = −log h_θ(x⁽ⁱ⁾) → as h_θ(x⁽ⁱ⁾) approaches 1, J(θ) becomes 0
For the y⁽ⁱ⁾ = 0 case : J(θ) = −log(1 − h_θ(x⁽ⁱ⁾)) → as h_θ(x⁽ⁱ⁾) approaches 0, J(θ) becomes 0
[Figure : J(θ) versus h_θ(x⁽ⁱ⁾) for each case, and the sigmoid function of z]
47
Generalization of Binary classification
Logistic regression
Gradient (of the convex cost function) :
h_θ(x) = σ(θᵀx)
∂J(θ)/∂θⱼ = (1/m) Σᵢ ( h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾
Alternative Cost function for Binary classification
Logistic regression
48
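A minimal sketch (my own, NumPy assumed; the toy data and learning rate are hypothetical) of the cross-entropy loss and its gradient (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾, trained by plain gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    """J(theta) = -(1/m) sum[ y log h + (1 - y) log(1 - h) ], h = sigmoid(X theta)."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """dJ/dtheta = (1/m) X^T (h - y)."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

# Toy binary data: first column of X is the bias term
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)

for _ in range(2000):                 # plain gradient descent on the convex loss
    theta -= 0.5 * gradient(theta, X, y)

print("theta:", theta, "loss:", cross_entropy(theta, X, y))
```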
Regularized Logistic Regression
Adds numerical damping to prevent overshoot or over-fitting
J(θ) = fidelity term + regularization term, e.g. + (λ / 2σ_w²) Σᵢ θᵢ² … but why this form?
Logistic regression
49
50
Regularized Logistic Regression
Logistic regression
Until now, we’ve been talking about the Maximum Likelihood Estimator :
θ* = argmax_θ log P(y | x; θ) = argmin_θ J(θ)
Now assume that a prior distribution over the parameters exists.
Then we can apply Bayes’ Rule :
P(θ | x, y) = P(y | x, θ) · P(θ) / P(y | x)
• Posterior distribution over model parameters : P(θ | x, y)
• Data likelihood for specific parameters (could be modeled with a Deep Network!) : P(y | x, θ)
• Prior distribution over parameters (describes our prior knowledge and/or our desires for the model) : P(θ)
• Bayesian evidence : P(y | x) = ∫ P(y | x, θ) P(θ) dθ
A powerful method for model selection!
As a rule this integral is intractable :(
(You can never integrate it exactly)
51
Regularized Logistic Regression
Logistic regression
The core idea of the Maximum a Posteriori Estimator :
Maximum a posteriori estimator
J_MAP(θ) = −log P(θ | x, y) = −log P(y | x, θ) − log P(θ) + log P(y)
         = J_MLE(θ) + (1 / 2σ_w²) Σᵢ θᵢ² + const
θ*_MAP = argmax_θ ( log P(y | x, θ) + log P(θ) ) = argmin_θ J_MAP(θ)
Loss function of the posterior distribution over model parameters,
assuming a Gaussian prior for the weights
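A sketch contrasting J_MLE with J_MAP = J_MLE + (1/2σ_w²) Σᵢ θᵢ² under a Gaussian prior (my own illustration; the toy data and σ_w value are hypothetical and the constant term is dropped).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def j_mle(theta, X, y):
    """Negative log-likelihood (cross-entropy) of logistic regression."""
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def j_map(theta, X, y, sigma_w=1.0):
    """J_MAP = J_MLE + (1 / 2 sigma_w^2) * sum_i theta_i^2  (Gaussian prior, up to a constant)."""
    return j_mle(theta, X, y) + np.sum(theta ** 2) / (2.0 * sigma_w ** 2)

# The prior term penalizes large weights: similar data fit, very different penalties
X = np.array([[1.0, 0.0], [1.0, 2.0]])
y = np.array([0.0, 1.0])
for theta in [np.array([0.0, 0.5]), np.array([0.0, 5.0])]:
    print(theta, "J_MLE =", round(float(j_mle(theta, X, y)), 3),
          "J_MAP =", round(float(j_map(theta, X, y, sigma_w=1.0)), 3))
```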
Kullback-Leibler (KL) divergence
• A measure of the difference between two
probability distributions P(x) and Q(x) :
D(P‖Q) ≡ ∫ P(x) log( P(x) / Q(x) ) dx
We can measure the
difference according
to an objective and
numerical value.
Note: KL divergence is not a metric.
D(P‖Q) ≠ D(Q‖P)
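A discrete-distribution sketch of D(P‖Q) = Σ P log(P/Q) and its asymmetry (my own example; the two distributions are made up for illustration, natural logs used).

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

# Two hypothetical discrete distributions over the same 3 outcomes
P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]

print("D(P||Q) =", kl_divergence(P, Q))
print("D(Q||P) =", kl_divergence(Q, P))   # different value: KL is not a metric
print("D(P||P) =", kl_divergence(P, P))   # 0 when the distributions match
```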
Kullback-Leibler (KL) divergence
KL divergence = Cross Entropy H(P, Q) − Entropy H(P)
Minimize KL divergence
• Random events are drawn from the real (true)
distribution, giving an observed data set.
Using the observed
data, we want to
estimate the true
distribution with a trial
distribution, by minimizing the KL
divergence between them.
The smaller the KL divergence, the better the estimate.
Minimize KL divergence
• KL divergence between the two distributions :
D(P_true ‖ Q_θ) = ∫ P_true(x) log P_true(x) dx − ∫ P_true(x) log Q_θ(x) dx
The first term is a constant, independent
of the parameter θ.
To minimize the KL divergence, we have only
to maximize ∫ P_true(x) log Q_θ(x) dx with respect
to the parameter θ.
Likelihood and KL divergence
• The second term is approximated by the sample
mean over the data set :
∫ P_true(x) log Q_θ(x) dx ≈ (1/N) Σ_{i=1}^{N} log Q_θ(xᵢ)  ← the (average) log likelihood
They are the same :
• Minimizing the KL divergence
• Maximizing the likelihood
Softmax Regression
• Softmax Regression ( or multinomial logistic regression) is a
classification method that generalizes logistic regression to
multiclass problems. (i.e. with more than two possible discrete
outcomes.)
• Used to predict the probabilities of the different possible
outcomes of a categorically distributed dependent variable, given
a set of independent variables (which may be real-valued, binary-
valued, categorical-valued, etc.).
generalized logistic regression
to multiclass problems
58
• Used in classification problems in which the response variable y can take on any one of
k values, y ∈ {1, 2, …, k}.
• To derive a General Linear Model for multinomial data :
 we begin by expressing the multinomial as an exponential family distribution.
 then compute the multinomial logistic loss (−log likelihood)
h_θ(x⁽ⁱ⁾) = [ p(y⁽ⁱ⁾ = 1 | x⁽ⁱ⁾; θ), ⋯ , p(y⁽ⁱ⁾ = k | x⁽ⁱ⁾; θ) ]ᵀ
         = ( 1 / Σ_{j=1}^{k} exp(θⱼᵀ x⁽ⁱ⁾) ) · [ exp(θ₁ᵀ x⁽ⁱ⁾), ⋯ , exp(θ_kᵀ x⁽ⁱ⁾) ]ᵀ
Softmax Regression
59
60
Softmax Regression
Logits (scores) : 2.0, 1.0, −1.0, −3.0 → Probabilities (approx.) : 0.7, 0.2, 0.05, 0.01
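A sketch of the softmax mapping from logits to probabilities for the scores above (my own snippet; note that the slide's printed probabilities are rounded — the exact softmax values are roughly 0.70, 0.26, 0.035, 0.005).

```python
import numpy as np

def softmax(logits):
    """p_j = exp(z_j) / sum_k exp(z_k), with the max subtracted for numerical stability."""
    z = np.asarray(logits, float)
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0, -3.0])   # scores from the slide
probs = softmax(logits)
print(probs, probs.sum())                    # approx [0.70 0.26 0.035 0.005], sums to 1
```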
• Remember that for logistic regression, we had :
J(θ) = −(1/m) Σ_{i=1}^{m} [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
which can be written similarly as a sum over the two class labels :
J(θ) = −(1/m) Σ_{i=1}^{m} Σ_{j=0}^{1} 1{y⁽ⁱ⁾ = j} log P(y⁽ⁱ⁾ = j | x⁽ⁱ⁾; θ)
Softmax Regression
61
Cross Entropy ! Binary Classification, m events or dataset
• The softmax cost function is similar, except that we now sum
over the k different possible values of the class label :
J(θ) = −(1/m) Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y⁽ⁱ⁾ = j} log( exp(θⱼᵀx⁽ⁱ⁾) / Σ_{l=1}^{k} exp(θ_lᵀx⁽ⁱ⁾) )
• Gradient :
∇_{θⱼ} J(θ) = −(1/m) Σ_{i=1}^{m} x⁽ⁱ⁾ ( 1{y⁽ⁱ⁾ = j} − P(y⁽ⁱ⁾ = j | x⁽ⁱ⁾; θ) )
(logistic : k = 2 ; softmax : general k)
Softmax Regression
62
Cross Entropy ! K Category Classification, m events or dataset
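A minimal sketch of the softmax cost summed over the k class labels and its gradient, fitted by gradient descent on a hypothetical 3-class toy set (NumPy assumed; all data and step sizes are assumptions).

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def softmax_cost_and_grad(Theta, X, y, k):
    """J = -(1/m) sum_i log p(y_i | x_i); grad_j = -(1/m) sum_i x_i (1{y_i=j} - p_ij)."""
    m = X.shape[0]
    P = softmax(X @ Theta)                       # (m, k) class probabilities
    Y = np.eye(k)[y]                             # one-hot targets, 1{y_i = j}
    J = -np.mean(np.log(P[np.arange(m), y]))     # cross entropy over the k classes
    grad = -X.T @ (Y - P) / m                    # (n, k) gradient w.r.t. Theta
    return J, grad

# Hypothetical 3-class problem with 2 features (first feature is a bias of 1)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 1, 2])                          # class labels in {0, ..., k-1}
Theta = np.zeros((2, 3))                         # one parameter column per class

for _ in range(500):
    J, grad = softmax_cost_and_grad(Theta, X, y, k=3)
    Theta -= 0.5 * grad
print("final cost:", J)
```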
63
Probability vs. Likelihood
• Probability : the chance that y will occur, based on given parameters θ : P(y | θ)
Known parameters θ : μ = 32 g, σ = 2.5
[Figure : normal density with μ = 32 g, σ = 2.5 over weights from 24 g to 40 g]
P(y = 32 g | θ) = 0.2 and P(y = 34 g | θ) = 0.15
(P : y is the variable, θ is fixed)
64
• Likelihood : plausibility of the parameters θ of a statistical model, based on the given observed data y : L(θ | y)
 Find the best-fitting model
Probability vs. Likelihood
[Figure : two candidate normal densities over 24 g – 40 g, each evaluated at the observation y = 34 g]
The likelihood of the parameters given that the sample weighs 34 g :
Assumed parameters θ : μ = 32 g, σ = 2.5 → L(θ | weighs 34 g) = 0.12
Assumed parameters θ : μ = 34 g, σ = 2.5 → L(θ | weighs 34 g) = 0.2
True data distribution : μ = 34 g, σ = 2.5
65
Fitting normal distribution: ML
L(μ, σ² | x₁,…,x_I) = Π_{i=1}^{I} Norm(xᵢ ; μ, σ²)
Assumed model : normal distribution
Given dataset : x₁, …, x_I
66
Fitting normal distribution: ML
Plotted surface of likelihoods
as a function of possible
parameter values
ML Solution is at peak
67
Information Theory
Information
• A quantitative measure of information
• The most unexpected events give the maximum information
• Average uncertainty of a random variable
Relation between information and its probability :
• Information is inversely proportional to its probability of occurrence
• Information is a continuous function of its probability
• The total information of two or more independent messages is the sum of
the individual information
I(x) = log( 1 / P(x) ) = −log P(x)
I(x) : Information, P(x) : Probability
→ Increase entropy → Loss
68
Let the observations be X = {x₁, …, x_m} with probabilities P(X) = {p₁, …, p_m}
• A total of N observations occur
• x₁ occurs N·p₁ times, and so on
• A single occurrence of x₁ conveys information −log p₁
• N·p₁ occurrences convey information −N·p₁·log p₁
• Total information = −N Σ_{i=1}^{m} pᵢ log pᵢ
• Averaged information = −Σ_{i=1}^{m} pᵢ log pᵢ ≡ H(x)
Marginal Entropy
H(x) : Marginal Entropy
Cross Entropy Loss = −Σ_{i=1}^{m} pᵢ log p̂ᵢ
pᵢ : given (target), usually 1 ; p̂ᵢ : estimated, usually less than 1
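A sketch of the marginal entropy H = −Σ pᵢ log pᵢ and the cross-entropy loss −Σ pᵢ log p̂ᵢ with a one-hot target (my own example distributions; natural logs used).

```python
import numpy as np

def entropy(p):
    """Marginal entropy H(X) = -sum_i p_i log p_i (averaged information)."""
    p = np.asarray(p, float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, p_hat):
    """Cross entropy = -sum_i p_i log p_hat_i (p: given targets, p_hat: estimates)."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    return -np.sum(p * np.log(p_hat))

p = [0.25, 0.25, 0.25, 0.25]            # uniform distribution: maximum uncertainty
print("H(uniform) =", entropy(p))        # = log 4, about 1.386 nats

target = [0.0, 1.0, 0.0]                 # given labels p_i (one-hot, the "usually 1" case)
estimate = [0.1, 0.8, 0.1]               # estimated probabilities p_hat_i
print("cross entropy =", cross_entropy(target, estimate))   # = -log 0.8, about 0.223
```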
69
Joint Entropy
Venn diagram for definition of entropies
70
Joint Entropy
Let two observations be X = {x₁, …, x_m} and Y = {y₁, …, y_n}
• Can be a reference and a query for an anomaly detection problem
• They should form a complete probability scheme, i.e. the sum over all
possible combinations of joint observations of X and Y should be 1 :
Σ_{i=1}^{m} Σ_{j=1}^{n} p(xᵢ, yⱼ) = 1
• The entropy is calculated in the same way as the marginal entropy
• The information delivered when one pair (xᵢ, yⱼ) occurs once is −log p(xᵢ, yⱼ)
• The number of times this can happen is N_ij out of a total of N
• The information for the N_ij occurrences of this particular combination is −N_ij log p(xᵢ, yⱼ)
• The total information over all combinations of i and j is −Σ_{i=1}^{m} Σ_{j=1}^{n} N_ij log p(xᵢ, yⱼ)
71
Averaged information : Joint Entropy H(X, Y)
H(X, Y) = −(1/N) Σ_{i=1}^{m} Σ_{j=1}^{n} N_ij log p(xᵢ, yⱼ), with N_ij = p(xᵢ, yⱼ)·N
H(X, Y) = −Σ_{i=1}^{m} Σ_{j=1}^{n} p(xᵢ, yⱼ) · log p(xᵢ, yⱼ)
Joint Entropy H(X, Y)
72
Conditional Entropy H(X|Y)
73
Conditional Entropy H(X|Y)
• Bayes’ theorem : p(xᵢ, yⱼ) = p(xᵢ)·p(yⱼ|xᵢ) = p(yⱼ)·p(xᵢ|yⱼ)
• For a particular observed yⱼ, it can only come from one of X = {x₁, …, x_m} :
Σ_{i=1}^{m} p(xᵢ | yⱼ) = 1
• Similarly : Σ_{j=1}^{n} p(yⱼ | xᵢ) = 1
74
Conditional Entropy H(X|Y)
• From the joint entropy H(X, Y) = −Σ_{i=1}^{m} Σ_{j=1}^{n} p(xᵢ, yⱼ) · log p(xᵢ, yⱼ)
• For a specific yⱼ :
H(X|yⱼ) = −Σ_{i=1}^{m} p(xᵢ|yⱼ) log p(xᵢ|yⱼ)
• The average conditional entropy takes all such entropies over all yⱼ
• No. of times H(X|yⱼ) occurs = no. of times yⱼ occurs = N_{yⱼ}
• H(X|Y) = (1/N) Σ_{j=1}^{n} N_{yⱼ} H(X|yⱼ) = Σ_{j=1}^{n} p(yⱼ) H(X|yⱼ)
H(X|Y) = −Σ_{i=1}^{m} Σ_{j=1}^{n} p(yⱼ)·p(xᵢ|yⱼ)·log p(xᵢ|yⱼ) = −Σ_{i=1}^{m} Σ_{j=1}^{n} p(xᵢ, yⱼ)·log p(xᵢ|yⱼ)
Similarly H(Y|X) = −Σ_{i=1}^{m} Σ_{j=1}^{n} p(xᵢ, yⱼ)·log p(yⱼ|xᵢ)
75
Relation among Entropies
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X|Y) = H(X,Y) − H(Y)
KL Divergence
H(X|Y) = −Σ_{i=1}^{m} Σ_{j=1}^{n} p(xᵢ, yⱼ)·log p(xᵢ|yⱼ)
       = Σ_{i=1}^{m} Σ_{j=1}^{n} p(xᵢ, yⱼ)·log( p(yⱼ) / p(xᵢ, yⱼ) )   (by Bayes’ theorem)
76
Mutual Information I(X;Y)
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = Σ_{x,y} p(x, y) log( p(x, y) / (p(x)·p(y)) )
• The reduction in uncertainty of one random variable due to knowing about
another → Information Gain
• The amount of information one random variable contains about another
• Measure of independence
I(X;Y) = 0 : the two variables are independent
I(X;Y) grows according to ...
- the degree of dependence
- the entropy of the variables
77
Mutual Information I(X;Y)
Symmetrical Uncertainty Measure @ Remaining Useful Lifetime Prediction
• Not all variables measured from the critical component are useful for
predicting the RUL.
• The assumption is that variables that have non-random relationship
carry information about the system behaviour.
• The goal is to group the variables, which have non-random relationship.
• To do so, a method based on mutual information has been applied on
the dataset for feature selection:
0 ≤ SU(X, Y) = 2·I(X;Y) / ( H(X) + H(Y) ) ≤ 1
78
I(X;Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X,Y)
       = Σ_x p(x) log( 1/p(x) ) + Σ_y p(y) log( 1/p(y) ) + Σ_{x,y} p(x, y) log p(x, y)
       = Σ_{x,y} p(x, y) log( 1/p(x) ) + Σ_{x,y} p(x, y) log( 1/p(y) ) + Σ_{x,y} p(x, y) log p(x, y)
       = Σ_{x,y} p(x, y) log( p(x, y) / (p(x)·p(y)) )
Mutual Information I(X;Y)
Derivation of Mutual Information (i)
79
Mutual Information I(X;Y)
Derivation of Mutual Information (ii)
• xᵢ occurs with probability p(xᵢ) : a priori entropy of xᵢ
• The initial uncertainty of xᵢ is −log p(xᵢ)
• Reduction in uncertainty of one random variable, xᵢ, due to
knowing about another, yⱼ → Information Gain
• The final uncertainty of xᵢ is −log p(xᵢ|yⱼ) : a posteriori entropy of xᵢ
• Information gain = net reduction in the uncertainties
• I(xᵢ ; yⱼ) = initial uncertainty of xᵢ − final uncertainty of xᵢ
= −log p(xᵢ) + log p(xᵢ|yⱼ) = log( p(xᵢ|yⱼ) / p(xᵢ) )
80
Mutual Information I(X;Y)
Derivation of Mutual Information (ii)
• I(X;Y) : averaging I(xᵢ ; yⱼ) over all values of i and j
I(X;Y) = Σ_{i=1}^{m} Σ_{j=1}^{n} p(xᵢ, yⱼ)·log( p(xᵢ|yⱼ) / p(xᵢ) )
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
       = H(X) + H(Y) − H(X,Y)
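A sketch verifying the two forms of I(X;Y) above on a hypothetical joint distribution, plus the symmetrical uncertainty SU(X,Y) = 2·I(X;Y) / (H(X) + H(Y)) from the RUL slide (my own example; natural logs used).

```python
import numpy as np

P_xy = np.array([[0.3, 0.1],      # hypothetical joint distribution p(x_i, y_j)
                 [0.1, 0.5]])
P_x = P_xy.sum(axis=1)            # marginal p(x)
P_y = P_xy.sum(axis=0)            # marginal p(y)

H = lambda p: -np.sum(p * np.log(p))          # entropy of a (possibly 2-D) distribution
H_x, H_y, H_xy = H(P_x), H(P_y), H(P_xy)

# Two equivalent forms of the mutual information
I_from_entropies = H_x + H_y - H_xy
I_direct = np.sum(P_xy * np.log(P_xy / np.outer(P_x, P_y)))
print(I_from_entropies, I_direct)             # identical values

# Symmetrical uncertainty used for feature selection (0 = independent, 1 = fully dependent)
SU = 2.0 * I_from_entropies / (H_x + H_y)
print("SU(X,Y) =", SU)
```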