1. The document discusses various machine learning classification algorithms including neural networks, support vector machines, logistic regression, and radial basis function networks.
2. It provides examples of using straight lines and complex boundaries to classify data with neural networks. Maximum margin hyperplanes are used for support vector machine classification.
3. Logistic regression is described as useful for binary classification, using a sigmoid function and a cross-entropy loss. Radial basis function networks perform nonlinear classification by combining radial basis functions, and non-linear SVMs achieve it through the kernel trick.
4. Part 1
1. Classification with NN
2. Linear Classification: Support Vector Machine
   * Non-linear SVM?
3. Logistic Regression: Binary Classifier, Cross-entropy, Information Theory
4. What is a "Maximum A Posteriori Estimator"?
5. Kullback-Leibler (KL) divergence
6. Softmax Regression: Multi-class Classifier
7. Focal Loss
8. Discriminative Feature Learning
9. Learning by Association
8. Classification with NN
Example: get a computer to classify an input image as Chinese or Japanese text.
[Figure: 2-D scatter plot of the two features, X1 = number of straight lines in the image and X2 = black-pixel ratio (%), with a linear decision boundary where w1/w2 ≈ 12 and w0/w2 ≈ -60.]
9. Classification with NN
Let's imagine a more complex case, e.g. classifying "Human" vs. "Monkey".
[Figure: the two classes cannot be separated by a straight line; a more complex decision boundary is needed.]
10. Classification with NN
In general we need many cells and layers. By creating extra features, the network can form more complex decision boundaries.
11. Network Training by Backpropagation
• Select a network architecture
• Randomly initialize the weights
• Observe features "x" with reference labels "y"
• Push "x" through the NN; the output is "ŷ"
• Calculate the error, e.g. the least-squared error (y - ŷ)²
• While the error is too large:
  – Calculate the errors and backpropagate the error signals
  – Adjust the weights
• Evaluate performance using the test set
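As a rough illustration of the loop above, here is a minimal numpy sketch of training a one-hidden-layer network with squared error and backpropagation. The layer sizes, learning rate, and toy data are hypothetical, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 features, binary reference labels y
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Select a (tiny) architecture and randomly initialize the weights
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
eta = 0.5

for epoch in range(2000):
    # Forward pass: push x through the NN, output is y_hat
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    error = np.mean((y - y_hat) ** 2)            # least-squared error

    # Backpropagate the error signals and adjust the weights
    d_out = (y_hat - y) * y_hat * (1 - y_hat)    # error signal at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)         # error signal at the hidden layer
    W2 -= eta * h.T @ d_out / len(X); b2 -= eta * d_out.mean(axis=0)
    W1 -= eta * X.T @ d_hid / len(X); b1 -= eta * d_hid.mean(axis=0)
```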
12. Network Training by Backpropagation
• How should we update the weights to improve? To minimize the error or loss function, J = (y - ŷ)², the gradient descent algorithm is generally used:
  w_{n+1} = w_n − η ∂J(w_n)/∂w_n
where η is the learning rate and ∂J/∂w is the sensitivity of the cost function with respect to the weight.
[Figure: J(w) plotted against w, showing the descent from a start point w1 through w2 down to the final point wf at the minimum.]
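A minimal numeric sketch of this update rule on a toy quadratic loss; the loss, starting point, and learning rate are made up for illustration.

```python
import numpy as np

def J(w):        # toy quadratic loss with its minimum at w = 3
    return (w - 3.0) ** 2

def dJ_dw(w):    # its gradient
    return 2.0 * (w - 3.0)

w, eta = -5.0, 0.1            # start point and learning rate
for n in range(100):
    w = w - eta * dJ_dw(w)    # w_{n+1} = w_n - eta * dJ(w_n)/dw_n
print(w, J(w))                # w approaches 3, J(w) approaches 0
```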
14. Introduction
The main idea of the SVM may be summed up as follows:
• "Given a set of training samples, the SVM constructs a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized."
15. Linearly Separable Patterns
The SVM is a binary learning machine.
• Binary classification is the task of separating classes in feature space.
16. Linearly Separable Patterns
Which of the linear separators is optimal?
17. Optimal Decision Boundary
• The optimal decision boundary is the one that maximizes the margin ρ.
18. Equation of a Hyperplane
[Figure: normal vector w, points P0 and P on the plane, and the vector x − x0 lying in the plane.]
Define the vectors x0 = OP0 and x = OP, where P is an arbitrary point on the hyperplane.
A condition for P to be on the plane is that the vector x − x0 is perpendicular to w:
  w · (x − x0) = 0, or w · x + b = 0, where b = −w · x0.
20. Understanding the basics
[Figure: two classes of points (+1 and −1), the separating hyperplane with normal vector w, and the margin ("safe zone") around it.]
• The linear discriminant function (classifier) with the maximum margin is the best.
• The margin is defined as the width by which the boundary could be increased before hitting a data point.
• Why is it the best? It is robust to outliers and thus has strong generalization ability.
21. Understanding the basics
[Figure: the two classes (+1 and −1 points) with the separating hyperplane and its normal vector w.]
• Given a set of data points (xᵢ, yᵢ), i = 1, 2, ···, n:
  for yᵢ = +1, w·xᵢ + b > 0
  for yᵢ = −1, w·xᵢ + b < 0
• With a scale transformation on both w and b:
  for yᵢ = +1, w·xᵢ + b ≥ +1
  for yᵢ = −1, w·xᵢ + b ≤ −1
23. The Optimization Problem
Introduce Lagrange multipliers αᵢ ≥ 0.
• That is, the Lagrange function
  L(w, b, α) = ½‖w‖² − Σᵢ αᵢ [ yᵢ(w·xᵢ + b) − 1 ]
is to be minimized with respect to w and b.
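In practice the maximum-margin classifier described above is available off the shelf. A minimal scikit-learn sketch follows; the toy data and the large value of C (used to approximate a hard margin) are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters labeled -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)),
               rng.normal(+2, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard-margin SVM
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # w and b of the maximum-margin hyperplane
print(clf.support_vectors_)         # the support vectors that define the margin
```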
24. Non-Linear SVM (Support Vector Machines): the kernel trick
• The simplest way to separate two groups of data is with a straight line (1 dimension), a flat plane (2 dimensions), or an N-dimensional hyperplane.
• However, there are situations where a nonlinear region can separate the groups more efficiently.
• The kernel function transforms the data into a higher-dimensional feature space in which the linear separation becomes possible.
25. Non-Linear SVM (Support Vector Machines)
Map from the input space to a feature space to simplify the classification task.
A non-linear SVM classifier using the RBF (radial basis function) kernel is adopted.
The kernel is the inner product in the feature space (a measure of similarity).
26. Key Idea of Kernel Methods
The kernel computes the feature-space inner product directly from the inputs:
  K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ)
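A minimal scikit-learn sketch of the non-linear SVM with the RBF kernel mentioned above; the ring-shaped data and the hyperparameters (gamma, C) are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical ring-shaped data that no straight line can separate
rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.concatenate([rng.normal(1.0, 0.1, 100), rng.normal(3.0, 0.1, 100)])
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = np.array([0] * 100 + [1] * 100)

# RBF kernel: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print(clf.score(X, y))   # close to 1.0: the kernel trick makes the classes separable
```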
28. RBFN architecture
[Figure: input layer x1 ... xn (no weights), a hidden layer of M radial basis functions, and an output layer that sums the RBF outputs with weights W1 ... WM to produce f(x).]
Each of the n components of the input vector x feeds forward to the M basis functions, whose outputs are linearly combined with the weights w (a dot product with w) into the network output f(x).
The output layer performs a simple weighted sum. If the RBFN is used for regression, this output is fine. However, if pattern classification is required, a hard limiter or sigmoid function can be placed on the output neurons to give 0/1 output values.
Input data set: X = { x1, x2, ..., xN }
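A minimal numpy sketch of the forward pass described above, using Gaussian basis functions; the centers, widths, and weights are placeholders rather than values from the slides.

```python
import numpy as np

def rbfn_forward(x, centers, sigmas, w, w0):
    """RBFN forward pass: f(x) = w0 + sum_i w_i * exp(-||x - c_i||^2 / (2 sigma_i^2))."""
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigmas ** 2))
    return w0 + w @ phi

# Illustrative parameters: M = 3 basis functions over n = 2 inputs
centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
sigmas = np.array([0.5, 0.5, 0.5])
w, w0 = np.array([0.8, -0.3, 0.5]), 0.1

print(rbfn_forward(np.array([0.2, 0.1]), centers, sigmas, w, w0))
```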
30. Architecture for Anomaly detection
Radial Basis Function Detector: anomaly or unusual-event detection.
[Figure: queries close to the normal data produce a high detector output (e.g. 0.9), while unusual or anomalous queries produce a low output (e.g. 0.1).]
31. Architecture for Anomaly detection
Radial Basis Function Detector: classification problem.
[Figure: two output sums, one per category, so the same architecture can assign a query to Category 1 or Category 2.]
32. Architecture for Anomaly detection
For Gaussian basis functions the network output is
  s(x_p) = w_0 + Σ_{i=1}^{M} w_i · exp( − Σ_{j=1}^{n} (x_pj − c_ij)² / (2 σ_ij²) )
Assume the variances across the dimensions are equal:
  s(x_p) = w_0 + Σ_{i=1}^{M} w_i · exp( − (1 / (2 σ_i²)) Σ_{j=1}^{n} (x_pj − c_ij)² )
33. Architecture for Anomaly detection
• Design decisions
  • number of hidden neurons
    • maximum number of neurons = number of input patterns
    • more neurons: more complex model, smaller tolerance
• Parameters to be learnt
  • centers
  • radii
    • A hidden neuron is more sensitive to data points near its center; this sensitivity may be tuned by adjusting the radius.
    • smaller radius: fits the training data better (overfitting)
    • larger radius: less sensitivity, less overfitting, smaller network, faster execution
  • weights between the hidden and output layers
34. RBFN Learning
The question now is: how to train the RBF network? In other words, how to find:
• The number and the parameters of the hidden units (the basis functions), using unlabeled data (unsupervised learning) → K-means clustering algorithm
• The weights between the hidden layer and the output layer → recursive least-squares estimation algorithm
36. RBFN Learning
Use the K-means algorithm to find the centers cᵢ.
37. K-means Algorithm
Step 1: K initial cluster centers are chosen randomly from the samples to form K groups.
Step 2: Each new sample is added to the group whose mean is the closest to this sample.
Step 3: Adjust the mean of the group to take account of the new point.
Step 4: Repeat Steps 2 and 3 until the distance between the old means and the new means of all clusters is smaller than a predefined tolerance.
38. Outcome: there are K clusters, with the means representing the centroid of each cluster.
Advantages: (1) a fast and simple algorithm; (2) it reduces the effect of noisy samples.
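A minimal numpy sketch of the four steps listed on the previous slide (a batch version); the toy data, K, and tolerance are illustrative.

```python
import numpy as np

def k_means(X, K, tol=1e-4, seed=0):
    """Toy batch K-means: returns K cluster means usable as RBF centers c_i."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K initial centers randomly from the samples
    centers = X[rng.choice(len(X), size=K, replace=False)]
    while True:
        # Step 2: assign each sample to the group with the closest mean
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: adjust the mean of each group
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                                for k in range(K)])
        # Step 4: stop when the means move less than the predefined tolerance
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, labels
        centers = new_centers

X = np.random.default_rng(1).normal(size=(200, 2))
centers, labels = k_means(X, K=3)
```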
39. Use the K-nearest-neighbor rule to find the function width σᵢ:
  σᵢ = ( (1/K) Σ_{k=1}^{K} ‖c_k − c_i‖² )^{1/2},  where c_k is the k-th nearest neighbor of cᵢ.
The objective is to cover the training points so that a smooth fit of the training samples can be achieved.
40. RBF learning by gradient descent
Let φᵢ(x_p) = exp( − Σ_{j=1}^{n} (x_pj − c_ij)² / (2 σ_ij²) ) and e(x_p) = d(x_p) − s(x_p),
and define the batch error
  E = ½ Σ_{p=1}^{N} e(x_p)²,  where N is the number of samples in the batch.
Then we have the partial derivatives ∂E/∂wᵢ, ∂E/∂c_ij, and ∂E/∂σ_ij, and apply gradient descent to each.
41. RBF learning by gradient descent
We then have the following update equations: one gradient-descent step for each of wᵢ, c_ij, and σ_ij.
42. Logistic regression: Binary Classification
We need to classify the output (y) as either 0 or 1 (binary classification).
A linear equation may not be a good fit for a classification problem; logistic regression uses the sigmoid function for the hypothesis.
[Figure: a linear function versus the sigmoid function of z, which saturates at 0 and 1.]
We want 0 ≤ h_θ(x) ≤ 1 for an arbitrary value of x.
Logistic Regression Model: h_θ(x) = σ(θᵀx), with
  σ(θᵀx) = 1 / (1 + e^(−θᵀx)),  σ(z) = 1 / (1 + e^(−z)),  σ′(z) = σ(z)(1 − σ(z))
(For comparison, linear regression uses the loss J(w) = Σ_k (y_k − ŷ_k)².)
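A small numpy sketch of the hypothesis above; θ and x are arbitrary placeholders, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    """Logistic hypothesis h_theta(x) = sigma(theta^T x); the output is always in (0, 1)."""
    return sigmoid(theta @ x)

theta = np.array([-6.0, 0.05, 1.0])   # illustrative parameters
x = np.array([1.0, 80.0, 2.0])        # illustrative feature vector (with bias term 1)
print(h_theta(theta, x))              # estimated probability that y = 1

# Derivative identity used later in the gradient: sigma'(z) = sigma(z) * (1 - sigma(z))
z = 0.3
print(sigmoid(z) * (1 - sigmoid(z)))
```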
43. Logistic regression: Binary Classification
The probabilities of y given x, parameterized by θ, satisfy
  P(y = 0 | x; θ) + P(y = 1 | x; θ) = 1.
h_θ(x) is the estimate of the probability that y = 1 for a given x, with model parameter θ:
  h_θ(x) = P(y = 1 | x; θ)
Loss function of logistic regression (take the negative logarithm):
  J(θ) = −log P(y = 1 | x; θ) = −log h_θ(x)
44. Logistic regression: Binary Classification
Maximum (log-)likelihood estimator (MLE):
  θ* = argmax_θ log P(y = 1 | x; θ) = argmin_θ { J(θ) }
Why take the negative logarithm?
  J(θ) → 0 as P(y = 1 | x; θ) → 1, and J(θ) → ∞ as P(y = 1 | x; θ) → 0.
[Figure: plot of −log P(y = 1 | x; θ) against P(y = 1 | x; θ).]
Maximize likelihood = minimize loss.
Likelihood: estimate unknown parameters based on known outcomes, L(θ | y) = P(y | θ).
45. Logistic regression: Generalization of Binary classification
• Maximum likelihood approach for binary classification.
Data set {x⁽ⁱ⁾, y⁽ⁱ⁾}, i = 1, ..., m, where y⁽ⁱ⁾ ∈ {0, 1}. Since y⁽ⁱ⁾ is binary we can use the Bernoulli distribution.
For a single observation (same as the previous page):
  P(y = 1 | x; θ) = h_θ(x),  P(y = 0 | x; θ) = 1 − h_θ(x),  P(y = 0 | x; θ) + P(y = 1 | x; θ) = 1
• Likelihood function associated with m observations:
  L(θ) = P(y | x; θ) = Π_{i=1}^{m} P(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ) = Π_{i=1}^{m} h_θ(x⁽ⁱ⁾)^{y⁽ⁱ⁾} (1 − h_θ(x⁽ⁱ⁾))^{1−y⁽ⁱ⁾}
46. Logistic regression: Generalization of Binary classification
• By taking the negative logarithm of the likelihood we get the cross-entropy error function:
  J(θ) = −(1/m) log P(y | x; θ) = −(1/m) Σ_{i=1}^{m} [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
Maximum likelihood estimator (MLE):
  θ* = argmax_θ log P(y | x; θ) = argmin_θ { J(θ) }
47. Logistic regression: Generalization of Binary classification
The per-example loss is defined as:
For the y⁽ⁱ⁾ = 1 case: J(θ) = −log h_θ(x⁽ⁱ⁾). As h_θ(x⁽ⁱ⁾) approaches 1, J(θ) goes to 0.
For the y⁽ⁱ⁾ = 0 case: J(θ) = −log(1 − h_θ(x⁽ⁱ⁾)). As h_θ(x⁽ⁱ⁾) approaches 0, J(θ) goes to 0.
[Figure: J(θ) plotted against h_θ(x⁽ⁱ⁾) for each case, together with the sigmoid function.]
48. Logistic regression: Alternative Cost Function for Binary Classification
With h_θ(x) = σ(θᵀx), the cross-entropy cost J(θ) is convex, and its gradient is
  ∂J(θ)/∂θ_j = (1/m) Σ_i ( h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) x_j⁽ⁱ⁾
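Putting the hypothesis, cross-entropy loss, and gradient together, here is a minimal numpy sketch of logistic-regression training by gradient descent; the toy data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: m examples, a bias column plus 2 features
rng = np.random.default_rng(0)
m = 200
X = np.c_[np.ones(m), rng.normal(size=(m, 2))]
y = (X[:, 1] - X[:, 2] > 0).astype(float)

theta, eta = np.zeros(3), 0.5
for _ in range(1000):
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)        # h_theta(x), clipped to avoid log(0)
    J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))    # cross-entropy loss
    grad = X.T @ (h - y) / m                                 # (1/m) sum_i (h - y) x_j
    theta -= eta * grad                                      # gradient-descent update
print(theta, J)
```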
49. Logistic regression: Regularized Logistic Regression
Adds numerical damping to prevent overshoot or over-fitting:
  J_reg(θ) = J(θ) + (λ / (2σ_w²)) Σ_i θ_i²
where J(θ) is the fidelity term and the added sum is the regularization term. Where does this extra term come from?
50. Logistic regression: Regularized Logistic Regression
Until now, we've been talking about the maximum likelihood estimator:
  θ* = argmax_θ log P(y | x; θ) = argmin_θ { J(θ) }
Now assume that a prior distribution over the parameters exists. Then we can apply Bayes' rule:
  P(θ | x, y) = P(y | x, θ) · P(θ) / P(y | x)
• P(θ | x, y): posterior distribution over model parameters.
• P(y | x, θ): data likelihood for specific parameters (could be modeled with a deep network!).
• P(θ): prior distribution over parameters (describes our prior knowledge and/or our desires for the model).
• P(y | x): the Bayesian evidence, a powerful method for model selection, but as a rule this integral is intractable (you can practically never integrate it).
51. Logistic regression: Regularized Logistic Regression
The core idea of the maximum a posteriori (MAP) estimator:
  J_MAP(θ) = −log P(θ | x, y) = −log P(y | x, θ) − log P(θ) + log P(y)
           = J_MLE(θ) + (1 / (2σ_w²)) Σ_i θ_i² + const
  θ*_MAP = argmax_θ ( log P(y | x, θ) + log P(θ) ) = argmin_θ { J_MAP(θ) }
This is the loss function of the posterior distribution over model parameters, assuming a Gaussian prior for the weights; the regularization term on the previous slide is exactly this −log P(θ) contribution.
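A small numpy sketch showing how the MAP view adds an L2 penalty to the MLE loss. Here lam stands in for the 1/(2σ_w²) factor; all data and values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def j_mle(theta, X, y):
    """Cross-entropy (negative log-likelihood) loss, averaged over m examples."""
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def j_map(theta, X, y, lam):
    """MAP loss = MLE loss + Gaussian-prior penalty lam * sum_i theta_i^2 (constant dropped)."""
    return j_mle(theta, X, y) + lam * np.sum(theta ** 2)

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]
y = (X[:, 1] > 0).astype(float)
theta = rng.normal(size=3)
print(j_mle(theta, X, y), j_map(theta, X, y, lam=0.1))
```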
52. Kullback-Leibler (KL) divergence
• A measure of the difference between two probability distributions P(x) and Q(x). We can measure the difference as an objective, numerical value:
  D(P ‖ Q) ≡ ∫ P(x) log( P(x) / Q(x) ) dx
Note: the KL divergence is not a metric, since D(P ‖ Q) ≠ D(Q ‖ P).
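A minimal numpy sketch of the discrete form of this definition; the two example distributions are made up.

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])    # "true" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])    # trial distribution (illustrative)
print(kl_divergence(p, q))        # >= 0
print(kl_divergence(q, p))        # generally different: KL is not symmetric
```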
55. Minimize KL divergence
• Random events (the data set) are drawn from the real (true) distribution. Using the observed data, we want to estimate the true distribution with a trial distribution by minimizing the divergence between them.
The smaller the KL divergence, the better the estimate.
56. Minimize KL divergence
• The KL divergence between the two distributions can be split as
  D(P ‖ Q_θ) = ∫ P(x) log P(x) dx − ∫ P(x) log Q_θ(x) dx
The first term is a constant, independent of the parameter θ. To minimize the KL divergence, we only have to maximize the second term with respect to the parameter θ.
57. Likelihood and KL divergence
• The second term is approximated by the sample mean over the data set:
  ∫ P(x) log Q_θ(x) dx ≈ (1/N) Σ_{n=1}^{N} log Q_θ(x_n)
which is (up to the factor 1/N) the log-likelihood. So the two are the same:
• Minimizing the KL divergence ⇔ Maximizing the likelihood
58. Softmax Regression
• Softmax regression (or multinomial logistic regression) is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes.
• It is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).
59. Softmax Regression
• Used in classification problems in which the response variable y can take on any one of k values, y ∈ {1, 2, ..., k}.
• To derive the generalized linear model for multinomial data, we begin by expressing the multinomial as an exponential-family distribution and then compute the multinomial logistic loss (the negative log-likelihood).
The hypothesis outputs the vector of class probabilities:
  h_θ(x⁽ⁱ⁾) = [ p(y⁽ⁱ⁾ = 1 | x⁽ⁱ⁾; θ), ..., p(y⁽ⁱ⁾ = k | x⁽ⁱ⁾; θ) ]ᵀ
            = (1 / Σ_{j=1}^{k} e^{θ_jᵀ x⁽ⁱ⁾}) · [ e^{θ_1ᵀ x⁽ⁱ⁾}, ..., e^{θ_kᵀ x⁽ⁱ⁾} ]ᵀ
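A small numpy sketch of this hypothesis; the parameters and input are placeholders, and the maximum is subtracted before exponentiating purely for numerical stability.

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """h_theta(x): vector of p(y = c | x; theta) for c = 1..k, given Theta of shape (k, n)."""
    z = Theta @ x
    z = z - z.max()              # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

Theta = np.array([[ 1.0, -0.5],  # illustrative parameters for k = 3 classes, n = 2 features
                  [ 0.2,  0.8],
                  [-1.0,  0.3]])
x = np.array([0.5, 1.5])
p = softmax_hypothesis(Theta, x)
print(p, p.sum())                # probabilities over the k classes; they sum to 1
```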
61. Softmax Regression: Binary Classification
• Remember that for logistic regression, over the m events in the data set, we had the cross-entropy cost
  J(θ) = −(1/m) Σ_{i=1}^{m} [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
which can be written similarly as a sum over the two class labels:
  J(θ) = −(1/m) Σ_{i=1}^{m} Σ_{c=0}^{1} 1{y⁽ⁱ⁾ = c} log P(y⁽ⁱ⁾ = c | x⁽ⁱ⁾; θ)
62. Softmax Regression: K-Category Classification
• The softmax cost function (cross-entropy over the m events in the data set) is similar, except that we now sum over the k different possible values of the class label:
  J(θ) = −(1/m) Σ_{i=1}^{m} Σ_{c=1}^{k} 1{y⁽ⁱ⁾ = c} log P(y⁽ⁱ⁾ = c | x⁽ⁱ⁾; θ)
• Gradient: the logistic and softmax cases have analogous forms; for softmax,
  ∇_{θ_c} J(θ) = −(1/m) Σ_{i=1}^{m} x⁽ⁱ⁾ ( 1{y⁽ⁱ⁾ = c} − P(y⁽ⁱ⁾ = c | x⁽ⁱ⁾; θ) )
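A compact numpy sketch of this k-class cross-entropy cost for a batch of m examples; the data, label encoding, and parameter shapes below are assumptions made for illustration.

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def softmax_cost(Theta, X, y, k):
    """J(theta) = -(1/m) sum_i log P(y_i | x_i; theta), with a one-hot indicator over k classes."""
    m = len(X)
    P = softmax(X @ Theta.T)        # shape (m, k): class probabilities per example
    one_hot = np.eye(k)[y]          # 1{y_i = c}
    return -np.sum(one_hot * np.log(P)) / m

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))       # m = 100 examples, 4 features (illustrative)
y = rng.integers(0, 3, size=100)    # labels in {0, 1, 2}, i.e. k = 3
Theta = rng.normal(size=(3, 4))
print(softmax_cost(Theta, X, y, k=3))
```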
63. Probability vs. Likelihood
• Probability: the chance that y will occur, based on given (fixed) parameters θ: P(y | θ), with y variable and θ fixed.
Known parameters θ: μ = 32 g, σ = 2.5.
[Figure: Gaussian density over weights from 24 g to 40 g centered at μ = 32 g, with P(y = 32 g | θ) = 0.2 and P(y = 34 g | θ) = 0.15 marked.]
64. Probability vs. Likelihood
• Likelihood: a function of the parameters θ of a statistical model, based on the given observed data y: L(θ | y). It is used to find the best-fitting model.
[Figure: the observation y = 34 g evaluated under two candidate Gaussians over the range 24 g to 40 g.]
For assumed parameters θ: μ = 32 g, σ = 2.5, the likelihood of observing 34 g is L(θ | y = 34 g) = 0.12.
For assumed parameters θ: μ = 34 g, σ = 2.5, the likelihood is 0.2, which is higher; the true data distribution is μ = 34 g, σ = 2.5.
66. Fitting a normal distribution: ML
[Figure: plotted surface of likelihoods as a function of the possible parameter values; the ML solution is at the peak.]
67. Information Theory
Information
• A quantitative measure of information: the most unexpected events give the maximum information.
• Entropy is the average uncertainty of a random variable.
Relation between information and its probability:
• Information is inversely proportional to its probability of occurrence.
• Information is a continuous function of its probability.
• The total information of two or more independent messages is the sum of the individual pieces of information.
  I(x) = log( 1 / P(x) ) = −log P(x),  where I(x) is the information and P(x) the probability.
68. Marginal Entropy
Consider observations X = {x_1, ..., x_m} with probabilities P(X) = {p_1, ..., p_m}.
• A total of N observations occur.
• x_1 occurs N·p_1 times, and so on.
• A single occurrence of x_1 conveys information −log p_1.
• N·p_1 occurrences convey information −N·p_1·log p_1.
• Total information = −N Σ_{i=1}^{m} p_i log p_i
• Averaged information = −Σ_{i=1}^{m} p_i log p_i ≡ H(X), the marginal entropy.
Cross-entropy loss = −Σ_{i=1}^{m} p_i log p̂_i, where p_i is the given (reference) probability, usually 1, and p̂_i is the estimated probability, usually less than 1.
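A small numpy sketch of these two quantities; the distributions are illustrative, with p_hat playing the role of the estimated p̂_i.

```python
import numpy as np

def entropy(p):
    """Marginal entropy H(X) = -sum_i p_i * log(p_i)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, p_hat):
    """Cross-entropy loss = -sum_i p_i * log(p_hat_i)."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    return -np.sum(p * np.log(p_hat))

p = np.array([0.0, 1.0, 0.0])        # reference distribution (one-hot label)
p_hat = np.array([0.1, 0.7, 0.2])    # estimated distribution from a classifier
print(cross_entropy(p, p_hat))        # equals -log(0.7)
print(entropy(np.array([0.25, 0.25, 0.25, 0.25])))  # log(4): maximum uncertainty
```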
70. Joint Entropy
Consider two observations X = {x_1, ..., x_m} and Y = {y_1, ..., y_n}.
• They can be a reference and a query in an anomaly-detection problem.
• They should form a complete probability scheme, i.e. the probabilities of all possible combinations of joint observations of X and Y should sum to 1:
  Σ_{i=1}^{m} Σ_{j=1}^{n} p(x_i, y_j) = 1
• The entropy is calculated in the same way as the marginal entropy:
• The information delivered when one pair (x_i, y_j) occurs once is −log p(x_i, y_j).
• The number of times this can happen is N_ij out of a total of N.
• The information for the N_ij occurrences of this particular combination is −N_ij log p(x_i, y_j).
• The total information over all combinations of i and j is −Σ_{i=1}^{m} Σ_{j=1}^{n} N_ij log p(x_i, y_j).
• Dividing by N (with N_ij/N ≈ p(x_i, y_j)) gives the joint entropy H(X, Y) = −Σ_{i=1}^{m} Σ_{j=1}^{n} p(x_i, y_j) log p(x_i, y_j).
76. Mutual Information I(X;Y)
  I(X;Y) = Σ_{x,y} p(x, y) log [ p(x, y) / ( p(x) p(y) ) ]
         = H(X) − H(X|Y) = H(Y) − H(Y|X)
• The reduction in uncertainty of one random variable due to knowing about another: the information gain.
• The amount of information one random variable contains about another.
• A measure of independence: I(X;Y) = 0 when the two variables are independent, and I(X;Y) grows according to
  - the degree of dependence
  - the entropy of the variables
77. Mutual Information I(X;Y)
Symmetrical uncertainty measure, applied to remaining-useful-lifetime (RUL) prediction:
• Not all variables measured from the critical component are useful for predicting the RUL.
• The assumption is that variables that have a non-random relationship carry information about the system behaviour.
• The goal is to group the variables which have a non-random relationship.
• To do so, a method based on mutual information has been applied to the dataset for feature selection:
  0 ≤ SU(X, Y) = 2 · I(X;Y) / ( H(X) + H(Y) ) ≤ 1
79. Mutual Information I(X;Y)
Derivation of Mutual Information (ii)
• x_i occurs with probability p(x_i): the a-priori entropy of x_i.
• The initial uncertainty of x_i is −log p(x_i).
• The reduction in uncertainty of one random variable, x_i, due to knowing about another, y_j, is the information gain.
• The final uncertainty of x_i is −log p(x_i | y_j): the a-posteriori entropy of x_i.
• Information gain = the net reduction in the uncertainties:
  I(x_i ; y_j) = (initial uncertainty of x_i) − (final uncertainty of x_i)
             = −log p(x_i) + log p(x_i | y_j) = log [ p(x_i | y_j) / p(x_i) ]
80. Mutual Information I(X;Y)
Derivation of Mutual Information (ii)
• I(X;Y) is obtained by averaging I(x_i ; y_j) over all values of i and j:
  I(X;Y) = Σ_{i=1}^{m} Σ_{j=1}^{n} p(x_i, y_j) · log [ p(x_i | y_j) / p(x_i) ]
  I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X,Y)
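A minimal numpy sketch computing I(X;Y) from a joint probability table, using the identity I(X;Y) = H(X) + H(Y) − H(X,Y); the joint table is illustrative, and the last line also evaluates the symmetrical uncertainty from the earlier slide.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                       # ignore zero-probability cells
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a discrete joint distribution p_xy[i, j]."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# Illustrative joint distribution over a 2x2 grid (entries sum to 1)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
i_xy = mutual_information(p_xy)
su = 2 * i_xy / (entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)))  # symmetrical uncertainty
print(i_xy, su)
```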