Avoid Overfitting with Regularization

By
Ahmed Fawzy Gad
Information Technology (IT) Department
Faculty of Computers and Information (FCI)
Menoufia University
Egypt
ahmed.fawzy@ci.menofia.edu.eg
14-Jan-2018

Have you ever created a machine learning model that is perfect for the training samples but gives very bad predictions
for unseen samples? Have you ever wondered why this happens? This article explains overfitting, which is one of the reasons
for poor predictions on unseen samples. A regularization technique based on regression is also presented in simple steps
to make clear how to avoid overfitting.
The focus of machine learning (ML) is to train an algorithm with training data in order to create a model that is able to
make correct predictions for unseen data (test data). To create a classifier, for example, a human expert starts by
collecting the data required to train the ML algorithm. The human is responsible for finding the best types of features to
represent each class, features that are capable of discriminating between the different classes. Such features are used to
train the ML algorithm. Suppose we are to build an ML model that classifies images as containing cats or not using the
following training data.
The first question we have to answer is “what are the best features to use?”. This is a critical question in ML because the
better the features used, the better the predictions the trained ML model makes, and vice versa. Let us try to visualize such
images and extract some features that are representative of cats. Some of the representative features may be the existence
of two dark eye pupils and two ears with a diagonal direction. Assume that we extracted such features, somehow, from
the above training images and created a trained ML model. Such a model can work with a wide range of cat images
because the features used exist in most cats. We can test the model using some unseen data such as the following.
Assume that the classification accuracy on the test data is x%.
One may want to increase the classification accuracy. The first thing to think of is using more features than the two
used previously, because the more discriminative features we use, the better the accuracy. By inspecting the
training data again, we can find more features, such as the overall image color, as all training cat samples are white, and
the eye iris color, as the training data has a yellow iris color. The feature vector will have the 4 features shown below.
They will be used to retrain the ML model.
Feature | Dark Eye Pupils | Diagonal Ears | White Cat Color | Yellow Eye Irises
After creating the trained model, the next step is to test it. The expected result after using the new feature vector is that
the classification accuracy will drop below x%. But why? The cause of the accuracy drop is the use of features that
exist in the training data but do not exist generally in all cat images. All training images used have a white image color
and yellow eye irises, so the model treats these properties as if they generalize to all cats. In the testing data, some cats
have a black or yellow color rather than the white seen in training, and some cats do not have yellow irises.

Our case, in which the features used are powerful for the training samples but very poor for the testing samples, is known
as overfitting. The model is trained with some features that are exclusive to the training data and do not exist in the
testing data.
The goal of the previous discussion is to make the idea of overfitting simple through a high-level example. To get into the
details, it is preferable to work with a simpler example. That is why the rest of the discussion will be based on a
regression example.
Understand Regularization based on a Regression Example
Assume we want to create a regression model that fits the data shown below. We can use polynomial regression.
The simplest model that we can start with is the linear model with a first-degree polynomial equation:
$$y_1 = f_1(x) = \Theta_1 x + \Theta_0$$
where $\Theta_0$ and $\Theta_1$ are the model parameters and $x$ is the only feature used.
The plot of the previous model is shown below:
Based on a loss function such as the one shown below, we can conclude that the model is not fitting the data well.
$$L = \frac{\sum_{i=0}^{N} |f_1(x_i) - d_i|}{N}$$
where $f_1(x_i)$ is the predicted output for sample $i$ and $d_i$ is the desired output for the same sample.
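As a minimal sketch of how this loss can be evaluated, the snippet below computes the average absolute error of a first-degree model. The data points and parameter values are hypothetical and chosen only for illustration, and the sum is divided by the number of samples.

```python
import numpy as np

def mean_absolute_loss(predictions, targets):
    """Average absolute difference between predictions f(x_i) and desired outputs d_i."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean(np.abs(predictions - targets))

# Hypothetical samples and parameters (not taken from the article's data set).
x = np.array([0.0, 1.0, 2.0, 3.0])
d = np.array([1.0, 2.0, 5.0, 10.0])   # desired outputs
theta0, theta1 = 0.0, 3.0             # parameters of y1 = theta1 * x + theta0

y1 = theta1 * x + theta0              # predictions of the linear model
loss = mean_absolute_loss(y1, d)
print(loss)                           # every prediction is off by exactly 1, so this prints 1.0
```

A loss that stays well above zero for every choice of the two parameters is the signal that the linear model is too simple for the data.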
The model is too simple and many of its predictions are not accurate. For this reason, we should create a more
complex model that can fit the data well, which we can do by increasing the degree of the equation from one to two.
It will be as follows:
$$y_2 = f_2(x) = \Theta_2 x^2 + \Theta_1 x + \Theta_0$$
By raising the same feature $x$ to the power 2 ($x^2$), we created a new feature, so we capture not only the
linear properties of the data but also some non-linear properties. The graph of the new model will be as follows:

The graph shows that the second-degree polynomial fits the data better than the first-degree one. But the quadratic
equation still does not fit some of the data samples well. This is why we can create a more complex model of the third
degree with the following equation:
$$y_3 = f_3(x) = \Theta_3 x^3 + \Theta_2 x^2 + \Theta_1 x + \Theta_0$$
The graph will be as follows:
Note that the model fits the data better after adding a new feature that captures the third-degree properties of the
data. To fit the data better still, we can increase the degree of the equation to the fourth degree, as in the
following equation:
$$y_4 = f_4(x) = \Theta_4 x^4 + \Theta_3 x^3 + \Theta_2 x^2 + \Theta_1 x + \Theta_0$$
The graph will be as follows:
It seems that the higher the degree of the polynomial equation, the better it fits the data. But there are some important
questions to be answered. If increasing the degree of the polynomial equation by adding new features enhances the
results, why not use a very high degree such as the 100th degree? What is the best degree to use for a problem?
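The trend described above can be reproduced with a small sketch. The article's data set is not available, so the samples below are synthetic (an assumption), and the fit uses NumPy's least-squares `np.polyfit`, which minimizes squared error rather than the absolute error of the loss function above.

```python
import numpy as np

# Hypothetical noisy samples standing in for the article's data.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 15)
d = x**2 + 0.1 * rng.standard_normal(x.size)

def training_error(degree):
    """Least-squares fit of the given degree; returns the mean squared error on the training data."""
    coeffs = np.polyfit(x, d, degree)      # coefficients, highest power first
    predictions = np.polyval(coeffs, x)
    return np.mean((predictions - d) ** 2)

degrees = (1, 2, 3, 4)
errors = [training_error(k) for k in degrees]
for degree, err in zip(degrees, errors):
    print(f"degree {degree}: training MSE = {err:.5f}")
```

Because each polynomial contains the lower-degree ones as special cases, the training error cannot increase as the degree grows. That says nothing yet about unseen data, which is the question the next section takes up.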
Model Capacity/Complexity
There is a term called model capacity or complexity. Model capacity/complexity refers to the level of variation that the
model can work with. The higher the capacity, the more variation the model can cope with. The first model $y_1$ is said to
be of small capacity compared to $y_4$. In our case, the capacity increases by increasing the polynomial degree.
Certainly, the higher the degree of the polynomial equation, the better it fits the training data. But remember that
increasing the polynomial degree increases the complexity of the model. Using a model with a capacity higher than required
may lead to overfitting. The model becomes very complex and fits the training data very well but, unfortunately, it is very
weak for unseen data. The goal of ML is to create a model that is robust not only with the training data but also with
unseen data samples.
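One way to observe this effect is to hold out samples that the model never sees during fitting. The sketch below is only an illustration under assumed conditions: the data are synthetic, the degrees 2 and 8 are arbitrary choices, and the fit is least squares via `np.polyfit`. The high-degree model is guaranteed to have the lower training error; on held-out data it will often do worse, although the exact numbers depend on the random draw.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_samples(n):
    """Draw n hypothetical samples from a noisy quadratic."""
    x = rng.uniform(-1.0, 1.0, n)
    d = x**2 + 0.2 * rng.standard_normal(n)
    return x, d

x_train, d_train = make_samples(12)    # seen data
x_test, d_test = make_samples(200)     # unseen data

def mse(coeffs, x, d):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - d) ** 2)

results = {}
for degree in (2, 8):
    coeffs = np.polyfit(x_train, d_train, degree)
    results[degree] = (mse(coeffs, x_train, d_train), mse(coeffs, x_test, d_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

Comparing the two columns of the printout is the practical check for overfitting: a low training error paired with a clearly higher test error suggests more capacity than the data require.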
The model of the fourth degree ($y_4$) is very complex. Yes, it fits the seen data well, but it will not fit unseen data.
In this case, the newly used feature in $y_4$, which is $x^4$, captures more details than required. Because that new feature
makes the model too complex, we should get rid of it.
In this example, we actually know which feature to remove. So, we can remove it and return to the previous model
of the third degree ($\Theta_3 x^3 + \Theta_2 x^2 + \Theta_1 x + \Theta_0$). But in actual work, we do not know which
features to remove.

Moreover, assume that the new feature is not too bad and we do not want to completely remove it and just want to
penalize it. What should we do?
Looking back at the loss function, its only goal is to minimize/penalize the prediction error. We can set a new objective
to minimize/penalize the effect of the new feature $x^4$ as much as possible. After modifying the loss function to penalize
$x^4$, it will be as follows:
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i| + \Theta_4 x^4}{N}$$
Our objective now is to minimize the loss function. We are now interested only in minimizing the term $\Theta_4 x^4$.
To minimize $\Theta_4 x^4$ we should minimize $\Theta_4$, as it is the only free parameter we can change. We can set its
value to zero if we want to remove that feature completely, in case it is a very bad one, as shown below:
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i| + 0 \cdot x^4}{N}$$
By removing it, we go back to the third-degree polynomial equation ($y_3$). $y_3$ does not fit the seen data as perfectly
as $y_4$ does but, generally, it will perform better on unseen data than $y_4$.
But in case $x^4$ is a relatively good feature and we just want to penalize it rather than remove it completely, we can set
$\Theta_4$ to a value close to zero but not zero (say 0.1), as shown next. By doing that, we limit the effect of $x^4$. As a
result, the new model will not be as complex as before.
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i| + 0.1 \cdot x^4}{N}$$
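The effect of the two choices for the fourth-degree parameter can be checked numerically. In the sketch below all parameter values are hypothetical: setting the parameter to zero reproduces the third-degree model exactly, while setting it to 0.1 keeps the fourth-degree term but limits how far it can move the prediction.

```python
import numpy as np

# Hypothetical parameters, highest power first: Theta4, Theta3, Theta2, Theta1, Theta0.
theta_y4 = np.array([0.5, -1.0, 2.0, 0.3, 1.0])
theta_y3 = theta_y4[1:]                    # the same model without the x^4 term

x = np.linspace(-2.0, 2.0, 9)
y3 = np.polyval(theta_y3, x)

theta_pruned = theta_y4.copy()
theta_pruned[0] = 0.0                      # remove the x^4 feature completely
y_pruned = np.polyval(theta_pruned, x)

theta_shrunk = theta_y4.copy()
theta_shrunk[0] = 0.1                      # keep x^4 but with a small weight
y_shrunk = np.polyval(theta_shrunk, x)

same_as_y3 = np.allclose(y_pruned, y3)
full_effect = np.max(np.abs(np.polyval(theta_y4, x) - y3))   # 0.5 * 2**4 = 8.0
shrunk_effect = np.max(np.abs(y_shrunk - y3))                # 0.1 * 2**4 = 1.6
print(same_as_y3, full_effect, shrunk_effect)
```

On this grid the largest contribution of the fourth-degree term drops from 8.0 to 1.6 when its parameter is shrunk from 0.5 to 0.1, which is what limiting the effect of a feature means in practice.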
Going back to $y_2$, it is simpler than $y_3$. It can work well with both seen and unseen data samples. So, we
should remove the new feature used in $y_3$, which is $x^3$, or just penalize it if it does relatively well. We can modify
the loss function to do that.
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i| + 0.1 \cdot x^4 + \Theta_3 x^3}{N}$$
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i| + 0.1 \cdot x^4 + 0.04 \cdot x^3}{N}$$
Regularization
Note that we actually knew that $y_2$ is the best model to fit the data because the data graph is available to us. It is a
very simple task that we can solve manually. But if such information is not available to us, and as the number of samples
and the data complexity increase, we will not be able to reach such conclusions easily. There must be something automatic to
tell us which degree will fit the data and which features to penalize to get the best predictions for unseen data. This
is regularization.
Regularization helps us select the model complexity that fits the data. It is useful for automatically penalizing features
that make the model too complex. Remember that regularization is useful if the features are not bad and help us relatively
to get good predictions, so that we just need to penalize them, not remove them completely. Regularization penalizes all
used features, not a selected subset. Previously, we penalized just two features, $x^4$ and $x^3$, not all features. That is
not the case with regularization.
Using regularization, a new term is added to the loss function to penalize the features, so the loss function will be as
follows:
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i| + \sum_{j=1}^{N} \lambda \Theta_j}{N}$$
It can also be written as follows after moving $\lambda$ outside the summation:
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i| + \lambda \sum_{j=1}^{N} \Theta_j}{N}$$
The newly added term $\lambda \sum_{j=1}^{N} \Theta_j$ is used to penalize the features to control the level of model
complexity. Note that the index $j$ runs over the model parameters ($\Theta_1$ to $\Theta_4$ for $y_4$), not over the
training samples. Our goal before adding the regularization term was to minimize the prediction error as much as possible.
Now our goal is to minimize the error while being careful not to make the model too complex, and thus to avoid overfitting.
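The regularized loss can be written as a short function. This is a sketch of the formula as stated in the text, with hypothetical parameter and data values; the function name and argument order are illustrative. The text penalizes the raw parameter values, whereas regularizers used in practice usually penalize the absolute value or the square of each parameter, so that negative parameters cannot lower the penalty.

```python
import numpy as np

def regularized_loss(theta, x, d, lam):
    """Loss from the text: (sum_i |f(x_i) - d_i| + lam * sum_{j>=1} theta_j) / N.

    theta is ordered [theta_0, theta_1, ..., theta_n]; theta_0 is not penalized.
    """
    theta = np.asarray(theta, dtype=float)
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    powers = np.arange(theta.size)
    predictions = (x[:, None] ** powers) @ theta   # theta_0 + theta_1 x + theta_2 x^2 + ...
    error_term = np.sum(np.abs(predictions - d))
    penalty_term = lam * np.sum(theta[1:])         # index j starts at 1
    return (error_term + penalty_term) / x.size

# Hypothetical values: f(x) = 1 + 2 x^2 evaluated on three samples.
theta = [1.0, 0.0, 2.0]
x = [0.0, 1.0, 2.0]
d = [1.0, 4.0, 9.0]

loss_plain = regularized_loss(theta, x, d, lam=0.0)   # error term only: 1 / 3
loss_reg = regularized_loss(theta, x, d, lam=3.0)     # (1 + 3 * 2) / 3 = 7 / 3
print(loss_plain, loss_reg)
```

The same parameters give a larger loss once the penalty is switched on, so an optimizer minimizing this loss is pushed towards smaller parameter values.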
There is a regularization parameter called lambda ($\lambda$) which controls how strongly the features are penalized. It is a
hyperparameter with no fixed value; its value varies based on the task at hand. As its value increases, the features are
penalized more heavily and, as a result, the model becomes simpler. As its value decreases, the features are penalized less
and thus the model complexity increases. A value of zero means no penalization of features at all.
When $\lambda$ is zero, the values of $\Theta_j$ are not penalized at all, as shown in the next equations. This is because
setting $\lambda$ to zero removes the regularization term and leaves just the error term. So, our objective returns to
just minimizing the error to be close to zero. When error minimization is the only objective, the model may overfit.
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i| + 0 \cdot \sum_{j=1}^{N} \Theta_j}{N}$$
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i| + 0}{N}$$
$$L_{new} = \frac{\sum_{i=0}^{N} |f_4(x_i) - d_i|}{N}$$
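The case of a zero regularization parameter can be illustrated with a fit that has a closed-form solution. The sketch below swaps in ridge regression, which uses squared error and a squared penalty instead of the absolute error and raw parameter sum of the text, because that variant can be solved directly. The helper name `fit_ridge_polynomial` and the data are assumptions made for this example.

```python
import numpy as np

def fit_ridge_polynomial(x, d, degree, lam):
    """Minimize sum_i (f(x_i) - d_i)^2 + lam * sum_{j>=1} theta_j^2 in closed form.

    Returns [theta_0, ..., theta_degree]; theta_0 is left unpenalized.
    """
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ...
    penalty = lam * np.eye(degree + 1)
    penalty[0, 0] = 0.0                             # do not penalize theta_0
    return np.linalg.solve(X.T @ X + penalty, X.T @ d)

# Hypothetical noisy samples.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
d = x**2 + 0.1 * rng.standard_normal(x.size)

theta_lam0 = fit_ridge_polynomial(x, d, degree=4, lam=0.0)
theta_plain = np.polyfit(x, d, 4)[::-1]             # polyfit returns highest power first
matches_plain_fit = np.allclose(theta_lam0, theta_plain, atol=1e-8)
print(matches_plain_fit)
```

With the parameter set to zero the penalty matrix vanishes and the regularized fit coincides with the ordinary least-squares fit, which mirrors the reduction of the loss shown in the equations above.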
But when the value of the penalization parameter $\lambda$ is very high (say $10^9$), the parameters $\Theta_j$ must be
penalized very heavily in order to keep the loss at its minimum value. As a result, the parameters $\Theta_j$ will be zeros,
and the model ($y_4$) will have its $\Theta_j$ pruned as shown below.
$$y_4 = f_4(x) = \Theta_4 x^4 + \Theta_3 x^3 + \Theta_2 x^2 + \Theta_1 x + \Theta_0$$
$$y_4 = 0 \cdot x^4 + 0 \cdot x^3 + 0 \cdot x^2 + 0 \cdot x + \Theta_0$$
$$y_4 = \Theta_0$$
Please note that the regularization term starts its index $j$ from 1, not zero. We use the regularization term to
penalize features (the powers of $x$). Because $\Theta_0$ has no associated feature, there is no reason to penalize it. In
such a case, the model will be $y_4 = \Theta_0$ with the following graph:
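The collapse to a constant model can be reproduced numerically. As before, this sketch substitutes ridge regression (squared error, squared penalty, unpenalized $\Theta_0$) for the loss in the text, and both the helper name and the data are hypothetical.

```python
import numpy as np

def fit_ridge_polynomial(x, d, degree, lam):
    """Closed-form ridge fit; returns [theta_0, ..., theta_degree] with theta_0 unpenalized."""
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ...
    penalty = lam * np.eye(degree + 1)
    penalty[0, 0] = 0.0                             # index j starts at 1, so theta_0 is skipped
    return np.linalg.solve(X.T @ X + penalty, X.T @ d)

# Hypothetical noisy samples.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
d = x**2 + 0.1 * rng.standard_normal(x.size)

theta_big = fit_ridge_polynomial(x, d, degree=4, lam=1e9)
print("theta_0:", theta_big[0])
print("theta_1..theta_4:", theta_big[1:])
print("mean of d:", d.mean())
```

With a penalty this large, the four penalized parameters end up numerically indistinguishable from zero, and the remaining constant settles at the average of the desired outputs, so the fitted curve is a horizontal line.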
