Mathematics for Artificial Intelligence
Transformation and Applications
Andres Mendez-Vazquez
April 9, 2020
Outline
1 Linear Transformation
   Introduction
   Functions that can be defined using matrices
   Linear Functions
   Kernel and Range
   The Matrix of a Linear Transformation
   Going Back to Homogeneous Equations
   The Rank-Nullity Theorem
2 Derivative of Transformations
   Introduction
   Derivative of a Linear Transformation
   Derivative of a Quadratic Transformation
3 Linear Regression
   The Simplest Functions
   Splitting the Space
   Defining the Decision Surface
   Properties of the Hyperplane $w^T x + w_0$
   Augmenting the Vector
   Least Squared Error Procedure
   The Geometry of a Two-Category Linearly-Separable Case
   The Error Idea
   The Final Error Equation
4 Principal Component Analysis
   Karhunen-Loeve Transform
   Projecting the Data
   Lagrange Multipliers
   The Process
   Example
5 Singular Value Decomposition
   Introduction
   Image Compression
Going further than solving Ax = y
We can go further
We can think of the matrix A as a function!!!
In general
A function f with domain $\mathbb{R}^n$ defines a rule that associates each $x \in \mathbb{R}^n$ with a vector $y \in \mathbb{R}^m$:
$$y = f(x), \quad \text{equivalently} \quad f : \mathbb{R}^n \longrightarrow \mathbb{R}^m$$
We like the second expression because
1 It is easy to identify the domain $\mathbb{R}^n$
2 It is easy to identify the codomain $\mathbb{R}^m$
Examples
A function $f : \mathbb{R} \rightarrow \mathbb{R}^3$
$$f(t) = \begin{pmatrix} x(t) \\ y(t) \\ z(t) \end{pmatrix} = \begin{pmatrix} t \\ 3t^2 + 1 \\ \sin(t) \end{pmatrix}$$
These are called parametric functions
Depending on the context, f could represent the position or the velocity of a point mass.
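As a quick illustration, here is a minimal NumPy sketch of this parametric curve (the function name and sample points are our own choices):

```python
import numpy as np

def f(t: float) -> np.ndarray:
    """Parametric curve f : R -> R^3, f(t) = (t, 3t^2 + 1, sin t)."""
    return np.array([t, 3 * t**2 + 1, np.sin(t)])

# Evaluate the curve at a few parameter values
for t in (0.0, 1.0, np.pi):
    print(t, f(t))
```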
A Classic Example
We have
If A is an m × n matrix, we can use A to define a function.
We will call it
$$f_A : \mathbb{R}^n \rightarrow \mathbb{R}^m$$
In other words
$$f_A(x) = Ax$$
Example
Let
$$A_{2 \times 3} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}$$
This allows us to define
$$f_A \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x + 2y + 3z \\ 4x + 5y + 6z \end{pmatrix}$$
We have
$f_A$ maps each vector $x \in \mathbb{R}^3$ to the vector $Ax \in \mathbb{R}^2$
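A short sketch of this map in NumPy (variable names are ours):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # the 2x3 matrix above

def f_A(x: np.ndarray) -> np.ndarray:
    """f_A : R^3 -> R^2, f_A(x) = Ax."""
    return A @ x

print(f_A(np.array([1.0, 1.0, 1.0])))  # [ 6. 15.] = (1+2+3, 4+5+6)
```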
We have
Definition
A function $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ is said to be linear if
1 $f(x_1 + x_2) = f(x_1) + f(x_2)$
2 $f(cx) = cf(x)$
for all $x_1, x_2 \in \mathbb{R}^n$ and for all scalars $c$.
Thus
A linear function f is also known as a linear transformation.
We have the following proposition
Proposition
$f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ is linear $\iff$ for all vectors $x_1, x_2 \in \mathbb{R}^n$ and for all scalars $c_1, c_2$:
$$f(c_1 x_1 + c_2 x_2) = c_1 f(x_1) + c_2 f(x_2)$$
Proof
Any idea?
Proof
If $A_{m \times n}$ is a matrix, $f_A$ is a linear transformation
How?
First
$$f_A(x_1 + x_2) = A(x_1 + x_2) = Ax_1 + Ax_2 = f_A(x_1) + f_A(x_2)$$
Second
What about $f_A(cx_1)$?
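A quick numerical sanity check of these two properties (a sketch; the random seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
x1, x2 = rng.standard_normal(3), rng.standard_normal(3)
c = 2.5

f = lambda x: A @ x  # f_A(x) = Ax

# Additivity and homogeneity should hold up to floating-point error
assert np.allclose(f(x1 + x2), f(x1) + f(x2))
assert np.allclose(f(c * x1), c * f(x1))
```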
We have
Definition (actually related to the null space)
If $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ is linear, the kernel of f is defined by
$$\mathrm{Ker}(f) = \{v \in \mathbb{R}^n \mid f(v) = 0\}$$
Definition
If $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ is linear, the range of f is defined by
$$\mathrm{Range}(f) = \{y \in \mathbb{R}^m \mid y = f(x) \text{ for some } x \in \mathbb{R}^n\}$$
We have also the following spaces
Row Space
The span of the row vectors of A forms a subspace.
Column Space
The span of the column vectors of A also forms a subspace.
From This
It can be shown that
$\mathrm{Ker}(f)$ is a subspace of $\mathbb{R}^n$
Also
$\mathrm{Range}(f)$ is a subspace of $\mathbb{R}^m$
Assume the following
Let
$$e_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}_{n \times 1}, \; e_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}_{n \times 1}, \; \ldots, \; e_n = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}_{n \times 1}$$
Then any vector $x \in \mathbb{R}^n$ can be written as
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = x_1 e_1 + x_2 e_2 + \ldots + x_n e_n$$
Then
Applying f
$$f(x) = x_1 f(e_1) + x_2 f(e_2) + \ldots + x_n f(e_n)$$
A linear combination of the elements
$$\{f(e_1), f(e_2), \ldots, f(e_n)\}$$
They are column vectors in $\mathbb{R}^m$, so we can stack them into
$$A = \left(f(e_1) \mid f(e_2) \mid \ldots \mid f(e_n)\right)_{m \times n}$$
Thus, we have
Finally, we have
$$f(x) = \left(f(e_1) \mid f(e_2) \mid \ldots \mid f(e_n)\right) x = Ax$$
Definition
The matrix A defined above for the function f is called the matrix of f in the standard basis.
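A minimal sketch of this construction: recover the matrix of a linear map by applying it to the standard basis vectors (the helper name and the example map are our own):

```python
import numpy as np

def matrix_of(f, n: int) -> np.ndarray:
    """Build the matrix of a linear map f : R^n -> R^m in the standard basis."""
    columns = [f(e) for e in np.eye(n)]   # f(e_1), ..., f(e_n)
    return np.column_stack(columns)

# Example: a hypothetical linear map on R^2
f = lambda x: np.array([x[0] - x[1], x[0] + x[1]])
print(matrix_of(f, 2))  # [[ 1. -1.], [ 1.  1.]]
```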
Given an m × n matrix A
The set of all solutions to the homogeneous equation
$$Ax = 0$$
is a subspace V of $\mathbb{R}^n$.
Remember how to prove that a set is a subspace...
Check that $x_1 + x_2 \in V$ and $cx \in V$ for all $x_1, x_2, x \in V$.
Do you remember?
Then, we have
Definition
This important subspace is called the null space of A, and is denoted Null(A)
It is also known as
$$x_H = \{x \mid Ax = 0\}$$
Knowing that Range(f) and Ker(f) are subspaces
Which ones are they?
Any idea?
Range(f)
It is the column space of the matrix A.
Ker(f)
It is the null space of A.
We have a nice theorem
Dimension Theorem
Let $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ be linear. Then
$$\dim(\mathrm{domain}(f)) = \dim(\mathrm{Range}(f)) + \dim(\mathrm{Ker}(f))$$
Where
The dimension of V, written dim(V), is the number of elements in any basis of V.
Rank and Nullity of a Matrix
Definition
The rank of the matrix A is the dimension of the row space of A, and is denoted R(A).
Example
The rank of $I_{n \times n}$ is n.
Then
Theorem
The rank of a matrix in Gauss-Jordan form is equal to the number of leading variables.
Proof
In the Gauss-Jordan form of a matrix, every non-zero row has a leading 1, which is the only non-zero entry in its column.
Then
No elementary row operation can zero out a leading 1, so these non-zero rows are linearly independent.
Thus
We have
Since all the other rows are zero, the dimension of the row space of the Gauss-Jordan form is equal to the number of leading 1's.
Finally
This is the same as the number of leading variables. Q.E.D.
About the Nullity of the Matrix
Definition
The nullity of the matrix A is the dimension of the null space of A, and is denoted by dim[N(A)].
Example
The nullity of I is 0.
Number of Free Variables
Theorem
The nullity of a matrix in Gauss-Jordan form is equal to the number of free variables.
Proof
Suppose A is m × n, and that the Gauss-Jordan form has j leading variables and k free variables:
$$j + k = n$$
Proof
Then, when computing the solution to the homogeneous equation
We solve for the first j (leading) variables in terms of the remaining k free variables:
$$s_1, s_2, s_3, \ldots, s_k$$
Then
The general solutions to the homogeneous equation are:
$$s_1 v_1 + s_2 v_2 + s_3 v_3 + \cdots + s_k v_k$$
Where
The vectors are the canonical ones
Here, a trick!!!
Meaning that $v_1$ has a 1 at position j + 1 with zeros afterwards, and so on for the others.
Therefore the vectors are linearly independent
$$v_1, v_2, v_3, \cdots, v_k$$
Therefore
They are a basis for the null space of A
And there are k of them, the same as the number of free variables.
Now
Definition
The matrix B is said to be row equivalent to A (B ∼ A) if B can be obtained from A by a finite sequence of elementary row operations.
In matrix terms
B ∼ A ⇔ there exist elementary matrices such that
$$B = E_k E_{k-1} \cdots E_1 A$$
If we write $C = E_k E_{k-1} \cdots E_1$
B is row equivalent to A if B = CA with C invertible.
Then, we have
Theorem
If B ∼ A, then Null(B) = Null(A).
Theorem
If B ∼ A, then the row space of B is identical to that of A.
Summarizing
Row operations change neither the row space nor the null space of A.
Corollaries
Corollary 1
If R is the Gauss-Jordan form of A, then R has the same null space and row space as A.
Corollary 2
If B ∼ A, then R(B) = R(A), and N(B) = N(A).
Then
Theorem
The number of linearly independent rows of the matrix A is equal to the number of linearly independent columns of A.
Thus
In particular, the rank of A is also equal to the number of linearly independent columns, and hence to the dimension of the column space of A.
Therefore
The number of linearly independent columns of A is just the number of leading entries in the Gauss-Jordan form of A, which is, as we know, the same as the rank of A.
Proof of the theorem (Dimension Theorem)
First
The rank of A is the same as the rank of the Gauss-Jordan form of A, which is equal to the number of leading entries in the Gauss-Jordan form.
Additionally
The dimension of the null space is equal to the number of free variables in the reduced echelon (Gauss-Jordan) form of A.
Then
We know further that the number of free variables plus the number of leading entries is exactly the number of columns.
Finally
We have
$$\dim(\mathrm{domain}(f)) = \dim(\mathrm{Range}(f)) + \dim(\mathrm{Ker}(f))$$
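A small numerical check of the rank-nullity theorem (a sketch; the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])          # rank 2, so nullity should be 3 - 2 = 1

n = A.shape[1]                         # dim of the domain R^n
rank = np.linalg.matrix_rank(A)        # dim Range(f) = dim column space
nullity = n - rank                     # dim Ker(f) by the theorem

# Verify via SVD: the nonzero singular values count the rank
s = np.linalg.svd(A, compute_uv=False)
assert np.sum(s > 1e-10) == rank
print(f"rank = {rank}, nullity = {nullity}")
```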
As we know
Many times
We want to obtain a maximum or a minimum of a cost function expressed in terms of matrices...
We then need to define matrix derivatives
Thus, this discussion is useful in Machine Learning.
Basic Definition
Let ψ(x) = y
Where y is an m-element vector, and x is an n-element vector
Then, we define the derivative with respect to a vector
$$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix}$$
What is this?
The matrix above is the m × n matrix of first-order partial derivatives
Such a matrix is called the Jacobian matrix of the transformation ψ(x).
Then, we can get our first ideas on derivatives
For linear transformations.
Derivative of y = Ax
Theorem
Let y = Ax, where y is m × 1, x is n × 1, A is m × n, and A does not depend on x. Then
$$\frac{\partial y}{\partial x} = A$$
Proof
The i-th element of y is given by
$$y_i = \sum_{k=1}^{n} a_{ik} x_k$$
We have that
$$\frac{\partial y_i}{\partial x_j} = a_{ij}$$
for all i = 1, 2, ..., m and j = 1, 2, ..., n
Hence
$$\frac{\partial y}{\partial x} = A$$
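A finite-difference sanity check of this result (a sketch; sizes and tolerances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
x = rng.standard_normal(3)
eps = 1e-6

# Numerical Jacobian of x -> Ax, one column per coordinate direction
J = np.column_stack([
    (A @ (x + eps * e) - A @ x) / eps for e in np.eye(3)
])
assert np.allclose(J, A, atol=1e-4)    # the Jacobian of Ax is A itself
```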
Derivative of $y^T A x$
Theorem
Let the scalar α be defined by
$$\alpha = y^T A x$$
where
y is m × 1, x is n × 1, A is m × n, and A does not depend on x or y. Then
$$\frac{\partial \alpha}{\partial x} = y^T A \quad \text{and} \quad \frac{\partial \alpha}{\partial y} = x^T A^T$$
Proof
Define
$$w^T = y^T A$$
Note
$$\alpha = w^T x$$
By the previous theorem
$$\frac{\partial \alpha}{\partial x} = w^T = y^T A$$
In a similar way, you can prove the other statement.
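Again, a quick numerical check (a sketch with arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))
y, x = rng.standard_normal(2), rng.standard_normal(3)
eps = 1e-6

alpha = lambda x_: y @ A @ x_          # alpha = y^T A x

# Numerical gradient with respect to x, one entry per coordinate
grad = np.array([(alpha(x + eps * e) - alpha(x)) / eps for e in np.eye(3)])
assert np.allclose(grad, y @ A, atol=1e-4)   # matches d(alpha)/dx = y^T A
```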
What is it?
First of all, we have a parametric model!!!
Here, we have a hyperplane as a model:
$$g(x) = w^T x + w_0 \quad (1)$$
Note: $w^T x$ is also known as a dot product
In the case of $\mathbb{R}^2$
We have:
$$g(x) = (w_1, w_2) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + w_0 = w_1 x_1 + w_2 x_2 + w_0 \quad (2)$$
Example
Hyperplane in $\mathbb{R}^3$ (figure omitted)
Splitting The Space $\mathbb{R}^2$
Using a simple straight line (hyperplane) to separate two classes (figure omitted)
Splitting the Space?
For example, assume the following vector w and constant $w_0$
$$w = (-1, 2)^T \quad \text{and} \quad w_0 = 0$$
(figure of the resulting hyperplane omitted)
Then, we have
The following results
$$g\begin{pmatrix} 1 \\ 2 \end{pmatrix} = (-1, 2)\begin{pmatrix} 1 \\ 2 \end{pmatrix} = -1 \times 1 + 2 \times 2 = 3$$
$$g\begin{pmatrix} 3 \\ 1 \end{pmatrix} = (-1, 2)\begin{pmatrix} 3 \\ 1 \end{pmatrix} = -1 \times 3 + 2 \times 1 = -1$$
YES!!! We have a positive side and a negative side!!!
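The same computation as a tiny NumPy sketch:

```python
import numpy as np

w, w0 = np.array([-1.0, 2.0]), 0.0
g = lambda x: w @ x + w0               # g(x) = w^T x + w_0

print(g(np.array([1.0, 2.0])))         # 3.0  -> positive side
print(g(np.array([3.0, 1.0])))         # -1.0 -> negative side
```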
The Decision Surface
The equation g(x) = 0 defines a decision surface
Separating the elements into classes $\omega_1$ and $\omega_2$.
When g(x) is linear, the decision surface is a hyperplane
Now assume $x_1$ and $x_2$ are both on the decision surface
$$w^T x_1 + w_0 = 0$$
$$w^T x_2 + w_0 = 0$$
Thus
$$w^T x_1 + w_0 = w^T x_2 + w_0 \quad (3)$$
Defining a Decision Surface
Then
$$w^T (x_1 - x_2) = 0 \quad (4)$$
Therefore
$x_1 - x_2$ lives in the hyperplane, i.e. it is perpendicular to w
Remark: any vector in the hyperplane is a linear combination of elements in a basis of it
Therefore any vector in the hyperplane is perpendicular to w
Therefore
The space is split into two regions by the hyperplane H (example in $\mathbb{R}^3$, figure omitted)
Some Properties of the Hyperplane
Given that g(x) > 0 if $x \in R_1$ (figure omitted)
It is more
We can say the following
Any $x \in R_1$ is on the positive side of H.
Any $x \in R_2$ is on the negative side of H.
In addition, g(x) can give us a way to obtain the distance from x to the hyperplane H
First, we express any x as follows
$$x = x_p + r \frac{w}{\|w\|}$$
Where
$x_p$ is the normal projection of x onto H.
r is the desired distance:
Positive, if x is on the positive side
Negative, if x is on the negative side
We have something like this
(figure of the projection onto H omitted)
Now
Since $g(x_p) = 0$
We have that
$$g(x) = g\left(x_p + r\frac{w}{\|w\|}\right) = w^T\left(x_p + r\frac{w}{\|w\|}\right) + w_0 = w^T x_p + w_0 + r\frac{w^T w}{\|w\|} = g(x_p) + r\frac{\|w\|^2}{\|w\|} = r\|w\|$$
Then, we have
$$r = \frac{g(x)}{\|w\|} \quad (5)$$
In particular
The distance from the origin to H
$$r = \frac{g(0)}{\|w\|} = \frac{w^T 0 + w_0}{\|w\|} = \frac{w_0}{\|w\|} \quad (6)$$
Remarks
If $w_0 > 0$, the origin is on the positive side of H.
If $w_0 < 0$, the origin is on the negative side of H.
If $w_0 = 0$, g has the homogeneous form $w^T x$ and the hyperplane passes through the origin.
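A sketch of the signed-distance computation for an arbitrary point (the function name is ours):

```python
import numpy as np

def signed_distance(x: np.ndarray, w: np.ndarray, w0: float) -> float:
    """Signed distance r = g(x) / ||w|| from x to the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([-1.0, 2.0]), 0.0
print(signed_distance(np.array([1.0, 2.0]), w, w0))   # positive side
print(signed_distance(np.array([3.0, 1.0]), w, w0))   # negative side
```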
We want to solve the independence of $w_0$
We would like $w_0$ to be part of the dot product, by making $x_0 = 1$
$$g(x) = w_0 \times 1 + \sum_{i=1}^{d} w_i x_i = w_0 \times x_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i \quad (7)$$
By making
$$x_{aug} = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{pmatrix} = \begin{pmatrix} 1 \\ x \end{pmatrix}$$
Where
$x_{aug}$ is called an augmented feature vector.
In a similar way
We have the augmented weight vector
$$w_{aug} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix} = \begin{pmatrix} w_0 \\ w \end{pmatrix}$$
Remarks
The addition of a constant component to x preserves all the distance relationships between samples.
The resulting $x_{aug}$ vectors all lie in a d-dimensional subspace, which is the x-space itself.
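A minimal sketch of the augmentation, so that $g(x) = w_{aug}^T x_{aug}$ (function names and values are ours):

```python
import numpy as np

def augment(x: np.ndarray) -> np.ndarray:
    """Prepend the constant 1 to get the augmented feature vector."""
    return np.concatenate(([1.0], x))

w, w0 = np.array([-1.0, 2.0]), 0.5
w_aug = np.concatenate(([w0], w))

x = np.array([1.0, 2.0])
assert np.isclose(w_aug @ augment(x), w @ x + w0)  # same discriminant value
```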
More Remarks
In addition
The hyperplane decision surface H defined by
$$w_{aug}^T x_{aug} = 0$$
passes through the origin in $x_{aug}$-space.
Even though
The corresponding hyperplane H can be in any position of the x-space.
More Remarks
In addition
The distance from $x_{aug}$ to H is:
$$\frac{\left|w_{aug}^T x_{aug}\right|}{\|w_{aug}\|} = \frac{|g(x_{aug})|}{\|w_{aug}\|}$$
Now
Is $\|w\| \leq \|w_{aug}\|$?
Ideas?
$$\sum_{i=1}^{d} w_i^2 \leq \sum_{i=1}^{d} w_i^2 + w_0^2$$
This mapping is quite useful
Because we only need to find a weight vector $w_{aug}$, instead of finding both the weight vector w and the threshold $w_0$.
Initial Supposition
Suppose we have
n samples $x_1, x_2, ..., x_n$, some labeled $\omega_1$ and some labeled $\omega_2$.
We want a weight vector w such that
$w^T x_i > 0$, if $x_i \in \omega_1$.
$w^T x_i < 0$, if $x_i \in \omega_2$.
The name of this weight vector
It is called a separating vector or solution vector.
Now, assume the following
Imagine that your problem has two classes $\omega_1$ and $\omega_2$ in $\mathbb{R}^2$
1 They are linearly separable!!!
2 You are required to label them.
We have a problem!!!
Which is the problem?
We do not know the hyperplane!!!
Thus, what distance does each point have to the hyperplane?
A Simple Solution For Our Quandary
Label the classes
$\omega_1 \Longrightarrow +1$
$\omega_2 \Longrightarrow -1$
We produce the following labels
1 if $x \in \omega_1$ then $y_{ideal} = g_{ideal}(x) = +1$.
2 if $x \in \omega_2$ then $y_{ideal} = g_{ideal}(x) = -1$.
Remark: We have a problem with these labels!!!
Now, What?
Assume the true function is given by
$$y_{noise} = g_{noise}(x) = w^T x + w_0 + e \quad (8)$$
Where the error e
Has a distribution $e \sim N\left(\mu, \sigma^2\right)$
Thus, we can do the following
$$y_{noise} = g_{noise}(x) = g_{ideal}(x) + e \quad (9)$$
Thus, we have
What to do?
$$e = y_{noise} - g_{ideal}(x) \quad (10)$$
(graphical illustration omitted)
Then, we have
A TRICK... Quite a good one!!! Instead of using $y_{noise}$
$$e = y_{noise} - g_{ideal}(x) \quad (11)$$
We use $y_{ideal}$
$$e = y_{ideal} - g_{ideal}(x) \quad (12)$$
We will see
How the geometry will solve the problem with using these labels.
Here, we have multiple errors
What can we do?
Sum Over All the Errors
We can do the following
$$J(w) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} \left(y_i - g_{ideal}(x_i)\right)^2 \quad (13)$$
Remark: This is known as the Least Squared Error cost function
Generalizing
The dimensionality of each sample (data point) is d.
You can extend each sample vector to the augmented form $x^T = (1, x_1, \ldots, x_d)$.
We can use a trick
The following function
$$g_{ideal}(x) = \begin{pmatrix} 1 & x_1 & x_2 & \ldots & x_d \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix} = x^T w$$
We can rewrite the error equation as
$$J(w) = \sum_{i=1}^{N} \left(y_i - g_{ideal}(x_i)\right)^2 = \sum_{i=1}^{N} \left(y_i - x_i^T w\right)^2 \quad (14)$$
Furthermore
We stack all the possible estimations into the product of the Data Matrix and the weight vector
$$Xw = \begin{pmatrix} 1 & (x_1)_1 & \cdots & (x_1)_j & \cdots & (x_1)_d \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & (x_i)_1 & & (x_i)_j & & (x_i)_d \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & (x_N)_1 & \cdots & (x_N)_j & \cdots & (x_N)_d \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix}$$
Note about other representations
We could instead have $x^T = (x_1, x_2, ..., x_d, 1)$, thus
$$X = \begin{pmatrix} (x_1)_1 & \cdots & (x_1)_j & \cdots & (x_1)_d & 1 \\ \vdots & & \vdots & & \vdots & \vdots \\ (x_i)_1 & & (x_i)_j & & (x_i)_d & 1 \\ \vdots & & \vdots & & \vdots & \vdots \\ (x_N)_1 & \cdots & (x_N)_j & \cdots & (x_N)_d & 1 \end{pmatrix} \quad (15)$$
Then, we have the following trick with X
With the Data Matrix
$$Xw = \begin{pmatrix} x_1^T w \\ x_2^T w \\ x_3^T w \\ \vdots \\ x_N^T w \end{pmatrix} \quad (16)$$
Therefore
We have that
$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{pmatrix} - \begin{pmatrix} x_1^T w \\ x_2^T w \\ x_3^T w \\ \vdots \\ x_N^T w \end{pmatrix} = \begin{pmatrix} y_1 - x_1^T w \\ y_2 - x_2^T w \\ y_3 - x_3^T w \\ \vdots \\ y_N - x_N^T w \end{pmatrix}$$
Then, we have the following equality
$$\begin{pmatrix} y_1 - x_1^T w & y_2 - x_2^T w & \cdots & y_N - x_N^T w \end{pmatrix} \begin{pmatrix} y_1 - x_1^T w \\ y_2 - x_2^T w \\ \vdots \\ y_N - x_N^T w \end{pmatrix} = \sum_{i=1}^{N} \left(y_i - x_i^T w\right)^2$$
Then, we have
The following equality
$$\sum_{i=1}^{N} \left(y_i - x_i^T w\right)^2 = (y - Xw)^T (y - Xw) = \|y - Xw\|_2^2 \quad (17)$$
We can expand our quadratic formula!!!
Thus
$$(y - Xw)^T (y - Xw) = y^T y - y^T Xw - w^T X^T y + w^T X^T Xw \quad (18)$$
Now
Differentiate with respect to w
Assume that $X^T X$ is invertible
Therefore
We have the following equivalences
$$\frac{d\, w^T A w}{dw} = w^T\left(A + A^T\right), \quad \frac{d\, w^T A}{dw} = A^T \quad (19)$$
Now, given that the transpose of a scalar is the scalar itself
$$y^T Xw = \left(y^T Xw\right)^T = w^T X^T y$$
Then, when we differentiate with respect to w
We have then
$$\frac{d\left(y^T y - 2w^T X^T y + w^T X^T Xw\right)}{dw} = -2y^T X + w^T\left(X^T X + X^T X\right) = -2y^T X + 2w^T X^T X$$
Setting this equal to the zero row vector
$$-2y^T X + 2w^T X^T X = 0$$
We apply the transpose
$$\left(-2y^T X + 2w^T X^T X\right)^T = 0^T \implies -2X^T y + 2\left(X^T X\right) w = 0 \quad \text{(column vector)}$$
Then, when we derive by w
We have then
d yT y − 2wT XT
y + wT XT
Xw
dw
= −2yT
X + wT
XT
X + XT
X
= −2yT
X + 2wT
XT
X
Making this equal to the zero row vector
− 2yT
X + 2wT
XT
X = 0
We apply the transpose
−2yT
X + 2wT
XT
X
T
= [0]T
− 2XT
y + 2 XT
X w = 0 (column vector)
95 / 140
Then, when we derive by w
We have then
d yT y − 2wT XT
y + wT XT
Xw
dw
= −2yT
X + wT
XT
X + XT
X
= −2yT
X + 2wT
XT
X
Making this equal to the zero row vector
− 2yT
X + 2wT
XT
X = 0
We apply the transpose
−2yT
X + 2wT
XT
X
T
= [0]T
− 2XT
y + 2 XT
X w = 0 (column vector)
95 / 140
Then, when we derive by w
We have then
d yT y − 2wT XT
y + wT XT
Xw
dw
= −2yT
X + wT
XT
X + XT
X
= −2yT
X + 2wT
XT
X
Making this equal to the zero row vector
− 2yT
X + 2wT
XT
X = 0
We apply the transpose
−2yT
X + 2wT
XT
X
T
= [0]T
− 2XT
y + 2 XT
X w = 0 (column vector)
95 / 140
Then, when we derive by w
We have then
d yT y − 2wT XT
y + wT XT
Xw
dw
= −2yT
X + wT
XT
X + XT
X
= −2yT
X + 2wT
XT
X
Making this equal to the zero row vector
− 2yT
X + 2wT
XT
X = 0
We apply the transpose
−2yT
X + 2wT
XT
X
T
= [0]T
− 2XT
y + 2 XT
X w = 0 (column vector)
95 / 140
Solving for w
We have then

$$w = \left(X^T X\right)^{-1} X^T y \qquad (20)$$

Note: $X^T X$ is always positive semi-definite. If it is also invertible, it is positive definite.

Thus, how do we get the discriminant function?
Any Ideas?

96 / 140
The Final Discriminant Function
Very Simple!!!

$$g(x) = x^T w = x^T \left(X^T X\right)^{-1} X^T y \qquad (21)$$

97 / 140
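A minimal end-to-end sketch of equations (20)-(21) in NumPy, assuming synthetic data (every name below is illustrative); note that np.linalg.solve, or np.linalg.lstsq, is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Closed-form least squares, eq. (20), on synthetic data.
rng = np.random.default_rng(0)
N, d = 100, 3
X_raw = rng.normal(size=(N, d))
y = 2.0 * X_raw[:, 0] - X_raw[:, 2] + 0.5 + 0.1 * rng.normal(size=N)

# Augment each sample with a leading 1 so w_0 acts as the bias term.
X = np.hstack([np.ones((N, 1)), X_raw])

# w = (X^T X)^{-1} X^T y, computed via a linear solve rather than an inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

# The discriminant function g(x) = x^T w, eq. (21), at a new point.
x_new = np.hstack([1.0, rng.normal(size=d)])
print("w =", w, " g(x_new) =", x_new @ w)
```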
Outline: 4 Principal Component Analysis (Karhunen-Loeve Transform)
98 / 140
Also Known as Karhunen-Loeve Transform
Setup
Consider a data set of observations $\{x_n\}$ with $n = 1, 2, \ldots, N$ and $x_n \in \mathbb{R}^d$.
Goal
Project the data onto a space with dimensionality $m < d$ (we assume m is given).

99 / 140
Dimensional Variance
Remember the Sample Variance in $\mathbb{R}$

$$VAR(X) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} \qquad (22)$$

You can do the same in the case of two variables X and Y

$$COV(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1} \qquad (23)$$

100 / 140
Now, Define
Given the data

$$x_1, x_2, \ldots, x_N \qquad (24)$$

where $x_i$ is a column vector

Construct the sample mean

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (25)$$

Center the data

$$x_1 - \bar{x}, \; x_2 - \bar{x}, \ldots, x_N - \bar{x} \qquad (26)$$

101 / 140
Build the Sample Covariance
The Covariance Matrix

$$S = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T \qquad (27)$$

Properties
1 The ijth entry of S is the covariance $\sigma_{ij}^2$ between dimensions i and j.
2 The iith entry of S is the variance $\sigma_{ii}^2$ of dimension i.

102 / 140
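A short NumPy sketch of equations (25)-(27), assuming the rows of the array are the samples; np.cov with rowvar=False uses the same N - 1 normalization:

```python
import numpy as np

# Sample mean (25), centering (26), and covariance matrix (27).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))            # N = 200 samples, d = 4 (illustrative)

x_bar = X.mean(axis=0)                   # sample mean, eq. (25)
Xc = X - x_bar                           # centered data, eq. (26)
S = (Xc.T @ Xc) / (X.shape[0] - 1)       # covariance matrix, eq. (27)

assert np.allclose(S, np.cov(X, rowvar=False))
```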
Outline: 4 Principal Component Analysis (Projecting the Data)
103 / 140
Using S to Project Data
For this we use a $u_1$
with $u_1^T u_1 = 1$, a unit (orthonormal) vector
Question
What is the Sample Variance of the Projected Data?

104 / 140
Outline: 4 Principal Component Analysis (Lagrange Multipliers)
105 / 140
Thus we have
Variance of the projected data

$$\frac{1}{N-1} \sum_{i=1}^{N} \left[u_1^T x_i - u_1^T \bar{x}\right]^2 = u_1^T S u_1 \qquad (28)$$

Use Lagrange Multipliers to Maximize

$$u_1^T S u_1 + \lambda_1 \left(1 - u_1^T u_1\right) \qquad (29)$$

106 / 140
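Equation (28) can also be checked numerically; a sketch assuming NumPy (illustrative names):

```python
import numpy as np

# Check eq. (28): the sample variance of the projected data equals u^T S u.
rng = np.random.default_rng(9)
X = rng.normal(size=(300, 3))
u = rng.normal(size=3)
u /= np.linalg.norm(u)                   # unit projection direction

S = np.cov(X, rowvar=False)              # sample covariance with N - 1
proj = X @ u                             # projected data u^T x_i
assert np.isclose(proj.var(ddof=1), u @ S @ u)
```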
Derive by $u_1$
We get

$$S u_1 = \lambda_1 u_1 \qquad (30)$$

Then
$u_1$ is an eigenvector of S.
If we left-multiply by $u_1^T$

$$u_1^T S u_1 = \lambda_1 \qquad (31)$$

107 / 140
What about the second eigenvector $u_2$?
We have the following optimization problem

$$\max \; u_2^T S u_2 \quad \text{s.t.} \quad u_2^T u_2 = 1, \; u_2^T u_1 = 0$$

Lagrangian

$$L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_1 \left(u_2^T u_2 - 1\right) - \lambda_2 \left(u_2^T u_1 - 0\right)$$

108 / 140
Explanation
First the constrained maximization
We want to maximize $u_2^T S u_2$
Given that the second eigenvector is orthonormal
We have then $u_2^T u_2 = 1$
Under orthonormal vectors
The covariance goes to zero

$$cov(u_1, u_2) = u_2^T S u_1 = u_2^T \lambda_1 u_1 = \lambda_1 u_1^T u_2 = 0$$

109 / 140
Meaning
The principal components are perpendicular

$$L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_1 \left(u_2^T u_2 - 1\right) - \lambda_2 \left(u_2^T u_1 - 0\right)$$

Take the derivative with respect to $u_2$

$$\frac{\partial L(u_2, \lambda_1, \lambda_2)}{\partial u_2} = S u_2 - \lambda_1 u_2 - \lambda_2 u_1 = 0$$

Then, we left-multiply by $u_1^T$

$$u_1^T S u_2 - \lambda_1 u_1^T u_2 - \lambda_2 u_1^T u_1 = 0$$

110 / 140
Then, we have that
Something Notable

$$0 - 0 - \lambda_2 = 0$$

so the multiplier of the orthogonality constraint vanishes. We are left with

$$S u_2 - \lambda_1 u_2 = 0$$

Implying
$u_2$ is an eigenvector of S; maximizing $u_2^T S u_2$ subject to orthogonality makes its eigenvalue the second largest one, which we call $\lambda_2$.

111 / 140
Thus
Variance will be the maximum when

$$u_1^T S u_1 = \lambda_1 \qquad (32)$$

is set to the largest eigenvalue. $u_1$ is also known as the First Principal Component.

By Induction
It is possible, for an M-dimensional projection space, to define M eigenvectors $u_1, u_2, \ldots, u_M$ of the data covariance S, corresponding to $\lambda_1, \lambda_2, \ldots, \lambda_M$, that maximize the variance of the projected data.

Computational Cost
1 Full eigenvector decomposition $O(d^3)$
2 Power Method $O(Md^2)$ (Golub and Van Loan, 1996)
3 Use the Expectation Maximization Algorithm

112 / 140
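A compact sketch of the power method mentioned above, assuming NumPy; it finds the dominant eigenpair of a symmetric positive semi-definite matrix by repeated multiplication and normalization (deflation would give the later components):

```python
import numpy as np

def power_method(S, n_iter=1000, tol=1e-10):
    """Sketch: dominant eigenpair of a symmetric PSD matrix S."""
    rng = np.random.default_rng(2)
    u = rng.normal(size=S.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = S @ u
        v /= np.linalg.norm(v)
        if np.linalg.norm(v - u) < tol:  # converged to a fixed direction
            break
        u = v
    lam = u @ S @ u                      # Rayleigh quotient, cf. eq. (31)
    return lam, u

S = np.array([[4.0, 1.0], [1.0, 3.0]])
lam, u = power_method(S)
print(lam, np.linalg.eigvalsh(S).max())  # both approximate the largest eigenvalue
```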
Outline: 4 Principal Component Analysis (The Process)
113 / 140
We have the following steps
Determine the covariance matrix

$$S = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T \qquad (33)$$

Generate the decomposition

$$S = U \Sigma U^T$$

With
Eigenvalues in $\Sigma$ and eigenvectors in the columns of U.

114 / 140
Then
Project the samples $x_i$ into a subspace of dimension k

$$z_i = U_k^T x_i$$

where $U_k$ is the matrix formed by the first k columns of U.

115 / 140
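Putting the steps together, a minimal PCA sketch in NumPy (illustrative names; the centered samples are projected, which is the usual convention, and np.linalg.eigh returns eigenvalues in ascending order, so they are re-sorted):

```python
import numpy as np

def pca_project(X, k):
    """Sketch of the PCA process: center, eigendecompose S, project to k dims."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = (Xc.T @ Xc) / (X.shape[0] - 1)   # covariance matrix, eq. (33)
    lam, U = np.linalg.eigh(S)           # ascending eigenvalues
    order = np.argsort(lam)[::-1]        # largest eigenvalues first
    U_k = U[:, order[:k]]                # first k principal directions
    Z = Xc @ U_k                         # rows are z_i = U_k^T (x_i - x_bar)
    return Z, U_k, lam[order]

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated data
Z, U_k, lam = pca_project(X, k=2)
print(Z.shape, lam[:2])                  # (500, 2) and the two largest eigenvalues
```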
Outline: 4 Principal Component Analysis (Example)
116 / 140
Example
From Bishop [figure] 117 / 140
Example
From Bishop [figure] 118 / 140
Example
From Bishop [figure] 119 / 140
Outline: 5 Singular Value Decomposition (Introduction)
120 / 140
What happens with non-square matrices?
We can still diagonalize them
Thus, we can obtain certain properties.
We want to avoid the problems with

$$S^{-1} A S$$

The eigenvectors in S have three big problems
1 They are usually not orthogonal.
2 There are not always enough eigenvectors.
3 $Ax = \lambda x$ requires A to be square.

121 / 140
Therefore, we can look at the following problem
We have a series of vectors

$$\{x_1, x_2, \ldots, x_d\}$$

Then imagine a set of projection vectors and differences

$$\{\beta_1, \beta_2, \ldots, \beta_d\} \text{ and } \{\alpha_1, \alpha_2, \ldots, \alpha_d\}$$

We want to know a little bit about the relations between them
After all, we are looking at the possibility of using them for our problem.

122 / 140
Using the Hypotenuse
A little bit of geometry, we get [figure]
123 / 140
Therefore
We have two possible quantities for each j

$$\alpha_j^T \alpha_j = x_j^T x_j - a_j^T a_j$$
$$a_j^T a_j = x_j^T x_j - \alpha_j^T \alpha_j$$

Then, we can minimize or maximize, given that $x_j^T x_j$ is a constant, and the two are equivalent:

$$\min \sum_{j=1}^{n} \alpha_j^T \alpha_j \quad \Longleftrightarrow \quad \max \sum_{j=1}^{n} a_j^T a_j$$

124 / 140
Actually, this is known as the dual problem (Weak Duality)
An example of this

$$\min w^T x \quad \text{s.t.} \quad Ax \leq b, \; x \geq 0$$

Then, using what are known as slack variables $x'$,

$$Ax + A'x' = b$$

Each row lives in the row space, but the $y_i$ live in the column space

$$\left(Ax + A'x'\right)_i \to y_i \quad \text{and} \quad x \geq 0$$

125 / 140
Then, we have that
Example

$$\begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$$

Elements of the column space live in three dimensions,
but the row space has dimension 2.

Properties

126 / 140
We have then
Stack such vectors, living in the d-dimensional space,
in a matrix A of size $n \times d$

$$A = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{pmatrix}$$

The matrix works as a Projection Matrix
We are looking for a unit vector v such that the length of the projection is maximized.

127 / 140
Why? Do you remember the Projection onto a single vector p?
Definition of the projection onto a unit vector

$$p = \frac{v^T a_i}{v^T v} v = \left(v^T a_i\right) v$$

Therefore the length of the projected vector is

$$\left\| \left(v^T a_i\right) v \right\| = \left|v^T a_i\right|$$

128 / 140
Then
Thus with a little bit of notation

$$Av = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{pmatrix} v = \begin{pmatrix} a_1^T v \\ a_2^T v \\ \vdots \\ a_n^T v \end{pmatrix}$$

Therefore

$$\|Av\|^2 = \sum_{i=1}^{n} \left(a_i^T v\right)^2$$

129 / 140
Then
It is possible to ask to maximize the length of such a vector (First Singular Vector)

$$v_1 = \arg\max_{\|v\| = 1} \|Av\|$$

Then, we can define the following singular value

$$\sigma_1(A) = \|Av_1\|$$

130 / 140
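A small numerical illustration of this definition, assuming NumPy: no random unit vector should beat the first right singular vector returned by np.linalg.svd:

```python
import numpy as np

# Illustration: v1 = argmax_{||v||=1} ||Av|| and sigma_1(A) = ||A v1||.
rng = np.random.default_rng(4)
A = rng.normal(size=(6, 3))

U, s, Vt = np.linalg.svd(A)
v1 = Vt[0]                               # first right singular vector
sigma1 = np.linalg.norm(A @ v1)          # equals s[0]

vs = rng.normal(size=(10000, 3))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)   # random unit vectors
assert np.all(np.linalg.norm(vs @ A.T, axis=1) <= sigma1 + 1e-9)
print(sigma1, s[0])                      # same value
```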
This is known as
Definition
The best-fit line problem describes the problem of finding the best line for a set of data points, where the quality of the line is measured by the sum of squared (perpendicular) distances of the points to the line.
Remember, we are looking at the dual problem....
Generalization
This can be transferred to higher dimensions: one can find the best-fit k-dimensional subspace, i.e. the subspace which minimizes the sum of the squared distances of the points to the subspace.

131 / 140
Then, in a Greedy Fashion
The second singular vector $v_2$

$$v_2 = \arg\max_{v \perp v_1, \; \|v\| = 1} \|Av\|$$

Then you go through this process
Stop when we have found

$$v_1, v_2, \ldots, v_r$$

as singular vectors and

$$\arg\max_{v \perp v_1, v_2, \ldots, v_r, \; \|v\| = 1} \|Av\| = 0$$

132 / 140
Proving that the strategy is good
Theorem
Let A be an $n \times d$ matrix where $v_1, v_2, \ldots, v_r$ are the singular vectors defined above. For $1 \leq k \leq r$, let $V_k$ be the subspace spanned by $v_1, v_2, \ldots, v_k$. Then for each k, $V_k$ is the best-fit k-dimensional subspace for A.

133 / 140
Proof
For k = 1 the statement holds by the definition of $v_1$.
What about k = 2? Let W be a best-fit 2-dimensional subspace for A.
For any orthonormal basis $w_1, w_2$ of W,
$\|Aw_1\|^2 + \|Aw_2\|^2$ is the sum of the squared lengths of the projections of the rows of A onto W.
Now, choose the basis $w_1, w_2$ so that $w_2$ is perpendicular to $v_1$
This can be a unit vector perpendicular to the projection of $v_1$ onto W.

134 / 140
Do you remember $v_1 = \arg\max_{\|v\|=1} \|Av\|$?
Therefore

$$\|Aw_1\|^2 \leq \|Av_1\|^2 \text{ and } \|Aw_2\|^2 \leq \|Av_2\|^2$$

Then

$$\|Aw_1\|^2 + \|Aw_2\|^2 \leq \|Av_1\|^2 + \|Av_2\|^2$$

In a similar way for k > 2
$V_k$ is at least as good as W and hence is optimal.

135 / 140
Remarks
Every matrix has a singular value decomposition

$$A = U \Sigma V^T$$

Where
The columns of U are an orthonormal basis for the column space.
The columns of V are an orthonormal basis for the row space.
$\Sigma$ is diagonal and the entries on its diagonal $\sigma_i = \Sigma_{ii}$ are positive real numbers, called the singular values of A.

136 / 140
Properties of the Singular Value Decomposition
First
The eigenvalues of the symmetric matrix $A^T A$ are equal to the squares of the singular values of A:

$$A^T A = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$$

Second
The rank of a matrix is equal to the number of non-zero singular values.

137 / 140
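Both properties are easy to check numerically; a sketch assuming NumPy (the matrices are illustrative):

```python
import numpy as np

# Check: eigenvalues of A^T A are the squared singular values,
# and the rank equals the number of non-zero singular values.
rng = np.random.default_rng(5)
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))   # a 5x4 matrix of rank 2

s = np.linalg.svd(A, compute_uv=False)                  # singular values, descending
eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]       # eigenvalues, descending
assert np.allclose(eigs, s ** 2, atol=1e-8)

rank = int(np.sum(s > 1e-10))            # count the non-zero singular values
print(rank, np.linalg.matrix_rank(A))    # both 2
```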
Outline: 5 Singular Value Decomposition (Image Compression)
138 / 140
Singular Value Decomposition as Sums
The singular value decomposition can be viewed as a sum of rank 1 matrices

$$A = A_1 + A_2 + \ldots + A_R \qquad (34)$$

Why?

$$A = U \begin{pmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_R \end{pmatrix} V^T = \begin{pmatrix} u_1 & u_2 & \cdots & u_R \end{pmatrix} \begin{pmatrix} \sigma_1 v_1^T \\ \sigma_2 v_2^T \\ \vdots \\ \sigma_R v_R^T \end{pmatrix} = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_R u_R v_R^T$$

139 / 140
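This identity can be checked directly, as in the following NumPy sketch:

```python
import numpy as np

# Check eq. (34): A equals the sum of its rank-1 terms sigma_i u_i v_i^T.
rng = np.random.default_rng(10)
A = rng.normal(size=(4, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
assert np.allclose(A, A_sum)
```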
Truncating
Truncating the singular value decomposition allows us to represent the matrix with fewer parameters
For a 512 × 512 image
Full representation: 512 × 512 = 262,144
Rank 10 approximation: 512 × 10 + 10 + 10 × 512 = 10,250
Rank 40 approximation: 512 × 40 + 40 + 40 × 512 = 41,000
Rank 80 approximation: 512 × 80 + 80 + 80 × 512 = 82,000

140 / 140
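A minimal sketch of rank-k truncation in NumPy, using a synthetic 512 × 512 array in place of an image (illustrative); the parameter counts above correspond to storing U_k, the k singular values, and V_k^T:

```python
import numpy as np

def svd_truncate(A, k):
    """Sketch: rank-k approximation of A via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(6)
img = rng.normal(size=(512, 64)) @ rng.normal(size=(64, 512))  # synthetic image

for k in (10, 40, 80):
    approx = svd_truncate(img, k)
    params = 512 * k + k + k * 512       # U_k, the sigmas, and V_k^T
    err = np.linalg.norm(img - approx) / np.linalg.norm(img)
    print(f"rank {k}: {params} parameters, relative error {err:.3f}")
```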

More Related Content

What's hot

Linear transformations and matrices
Linear transformations and matricesLinear transformations and matrices
Linear transformations and matricesEasyStudy3
 
linear transformation
linear transformationlinear transformation
linear transformationmansi acharya
 
Null space, Rank and nullity theorem
Null space, Rank and nullity theoremNull space, Rank and nullity theorem
Null space, Rank and nullity theoremRonak Machhi
 
Polya recurrence
Polya recurrencePolya recurrence
Polya recurrenceBrian Burns
 
Lecture 8 nul col bases dim & rank - section 4-2, 4-3, 4-5 & 4-6
Lecture 8   nul col bases dim & rank - section 4-2, 4-3, 4-5 & 4-6Lecture 8   nul col bases dim & rank - section 4-2, 4-3, 4-5 & 4-6
Lecture 8 nul col bases dim & rank - section 4-2, 4-3, 4-5 & 4-6njit-ronbrown
 
Linear transformation.ppt
Linear transformation.pptLinear transformation.ppt
Linear transformation.pptRaj Parekh
 
My presentation all shortestpath
My presentation all shortestpathMy presentation all shortestpath
My presentation all shortestpathCarlostheran
 
Linear Combination, Span And Linearly Independent, Dependent Set
Linear Combination, Span And Linearly Independent, Dependent SetLinear Combination, Span And Linearly Independent, Dependent Set
Linear Combination, Span And Linearly Independent, Dependent SetDhaval Shukla
 
Physical Chemistry Assignment Help
Physical Chemistry Assignment HelpPhysical Chemistry Assignment Help
Physical Chemistry Assignment HelpEdu Assignment Help
 
Lecture 9 dim & rank - 4-5 & 4-6
Lecture 9   dim & rank -  4-5 & 4-6Lecture 9   dim & rank -  4-5 & 4-6
Lecture 9 dim & rank - 4-5 & 4-6njit-ronbrown
 
Refresher probabilities-statistics
Refresher probabilities-statisticsRefresher probabilities-statistics
Refresher probabilities-statisticsSteve Nouri
 
08 decrease and conquer spring 15
08 decrease and conquer spring 1508 decrease and conquer spring 15
08 decrease and conquer spring 15Hira Gul
 
Refresher algebra-calculus
Refresher algebra-calculusRefresher algebra-calculus
Refresher algebra-calculusSteve Nouri
 
Inner Product Space
Inner Product SpaceInner Product Space
Inner Product SpacePatel Raj
 

What's hot (20)

Linear transformations and matrices
Linear transformations and matricesLinear transformations and matrices
Linear transformations and matrices
 
03 Machine Learning Linear Algebra
03 Machine Learning Linear Algebra03 Machine Learning Linear Algebra
03 Machine Learning Linear Algebra
 
linear transformation
linear transformationlinear transformation
linear transformation
 
Vector spaces
Vector spacesVector spaces
Vector spaces
 
Chemistry Assignment Help
Chemistry Assignment Help Chemistry Assignment Help
Chemistry Assignment Help
 
Null space, Rank and nullity theorem
Null space, Rank and nullity theoremNull space, Rank and nullity theorem
Null space, Rank and nullity theorem
 
Polya recurrence
Polya recurrencePolya recurrence
Polya recurrence
 
Lecture 8 nul col bases dim & rank - section 4-2, 4-3, 4-5 & 4-6
Lecture 8   nul col bases dim & rank - section 4-2, 4-3, 4-5 & 4-6Lecture 8   nul col bases dim & rank - section 4-2, 4-3, 4-5 & 4-6
Lecture 8 nul col bases dim & rank - section 4-2, 4-3, 4-5 & 4-6
 
Linear transformation.ppt
Linear transformation.pptLinear transformation.ppt
Linear transformation.ppt
 
My presentation all shortestpath
My presentation all shortestpathMy presentation all shortestpath
My presentation all shortestpath
 
Linear Combination, Span And Linearly Independent, Dependent Set
Linear Combination, Span And Linearly Independent, Dependent SetLinear Combination, Span And Linearly Independent, Dependent Set
Linear Combination, Span And Linearly Independent, Dependent Set
 
Inner product spaces
Inner product spacesInner product spaces
Inner product spaces
 
Physical Chemistry Assignment Help
Physical Chemistry Assignment HelpPhysical Chemistry Assignment Help
Physical Chemistry Assignment Help
 
Lecture 9 dim & rank - 4-5 & 4-6
Lecture 9   dim & rank -  4-5 & 4-6Lecture 9   dim & rank -  4-5 & 4-6
Lecture 9 dim & rank - 4-5 & 4-6
 
Refresher probabilities-statistics
Refresher probabilities-statisticsRefresher probabilities-statistics
Refresher probabilities-statistics
 
Chua's circuit
Chua's circuitChua's circuit
Chua's circuit
 
08 decrease and conquer spring 15
08 decrease and conquer spring 1508 decrease and conquer spring 15
08 decrease and conquer spring 15
 
Divide and Conquer
Divide and ConquerDivide and Conquer
Divide and Conquer
 
Refresher algebra-calculus
Refresher algebra-calculusRefresher algebra-calculus
Refresher algebra-calculus
 
Inner Product Space
Inner Product SpaceInner Product Space
Inner Product Space
 

Similar to 05 linear transformations

Laplace Transform and its applications
Laplace Transform and its applicationsLaplace Transform and its applications
Laplace Transform and its applicationsDeepRaval7
 
Mba Ebooks ! Edhole
Mba Ebooks ! EdholeMba Ebooks ! Edhole
Mba Ebooks ! EdholeEdhole.com
 
Laplace Transformation & Its Application
Laplace Transformation & Its ApplicationLaplace Transformation & Its Application
Laplace Transformation & Its ApplicationChandra Kundu
 
Value Function Geometry and Gradient TD
Value Function Geometry and Gradient TDValue Function Geometry and Gradient TD
Value Function Geometry and Gradient TDAshwin Rao
 
Linear approximations and_differentials
Linear approximations and_differentialsLinear approximations and_differentials
Linear approximations and_differentialsTarun Gehlot
 
Project in Calcu
Project in CalcuProject in Calcu
Project in Calcupatrickpaz
 
Mba Ebooks ! Edhole
Mba Ebooks ! EdholeMba Ebooks ! Edhole
Mba Ebooks ! EdholeEdhole.com
 
Dericavion e integracion de funciones
Dericavion e integracion de funcionesDericavion e integracion de funciones
Dericavion e integracion de funcionesdiegoalejandroalgara
 
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problem
Elementary Landscape Decomposition of the Hamiltonian Path Optimization ProblemElementary Landscape Decomposition of the Hamiltonian Path Optimization Problem
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problemjfrchicanog
 
STA003_WK4_L.pptx
STA003_WK4_L.pptxSTA003_WK4_L.pptx
STA003_WK4_L.pptxMAmir23
 
Differential Equations Assignment Help
Differential Equations Assignment HelpDifferential Equations Assignment Help
Differential Equations Assignment HelpMath Homework Solver
 
Differential Equations Assignment Help
Differential Equations Assignment HelpDifferential Equations Assignment Help
Differential Equations Assignment HelpMaths Assignment Help
 

Similar to 05 linear transformations (20)

Laplace Transform and its applications
Laplace Transform and its applicationsLaplace Transform and its applications
Laplace Transform and its applications
 
Derivative rules.docx
Derivative rules.docxDerivative rules.docx
Derivative rules.docx
 
Mba Ebooks ! Edhole
Mba Ebooks ! EdholeMba Ebooks ! Edhole
Mba Ebooks ! Edhole
 
Laplace Transformation & Its Application
Laplace Transformation & Its ApplicationLaplace Transformation & Its Application
Laplace Transformation & Its Application
 
Value Function Geometry and Gradient TD
Value Function Geometry and Gradient TDValue Function Geometry and Gradient TD
Value Function Geometry and Gradient TD
 
Signals and Systems Assignment Help
Signals and Systems Assignment HelpSignals and Systems Assignment Help
Signals and Systems Assignment Help
 
EPE821_Lecture3.pptx
EPE821_Lecture3.pptxEPE821_Lecture3.pptx
EPE821_Lecture3.pptx
 
Linear approximations and_differentials
Linear approximations and_differentialsLinear approximations and_differentials
Linear approximations and_differentials
 
Mamt2 u1 a1_yaco
Mamt2 u1 a1_yacoMamt2 u1 a1_yaco
Mamt2 u1 a1_yaco
 
1519 differentiation-integration-02
1519 differentiation-integration-021519 differentiation-integration-02
1519 differentiation-integration-02
 
Project in Calcu
Project in CalcuProject in Calcu
Project in Calcu
 
Mba Ebooks ! Edhole
Mba Ebooks ! EdholeMba Ebooks ! Edhole
Mba Ebooks ! Edhole
 
Dericavion e integracion de funciones
Dericavion e integracion de funcionesDericavion e integracion de funciones
Dericavion e integracion de funciones
 
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problem
Elementary Landscape Decomposition of the Hamiltonian Path Optimization ProblemElementary Landscape Decomposition of the Hamiltonian Path Optimization Problem
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problem
 
Trabajo Leo mat 3 2021
Trabajo Leo mat 3 2021Trabajo Leo mat 3 2021
Trabajo Leo mat 3 2021
 
BP106RMT.pdf
BP106RMT.pdfBP106RMT.pdf
BP106RMT.pdf
 
STA003_WK4_L.pptx
STA003_WK4_L.pptxSTA003_WK4_L.pptx
STA003_WK4_L.pptx
 
Function
Function Function
Function
 
Differential Equations Assignment Help
Differential Equations Assignment HelpDifferential Equations Assignment Help
Differential Equations Assignment Help
 
Differential Equations Assignment Help
Differential Equations Assignment HelpDifferential Equations Assignment Help
Differential Equations Assignment Help
 

More from Andres Mendez-Vazquez

05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusAndres Mendez-Vazquez
 
Introduction Machine Learning Syllabus
Introduction Machine Learning SyllabusIntroduction Machine Learning Syllabus
Introduction Machine Learning SyllabusAndres Mendez-Vazquez
 
Schedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine LearningSchedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine LearningAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 
Introduction Machine Learning Syllabus
Introduction Machine Learning SyllabusIntroduction Machine Learning Syllabus
Introduction Machine Learning Syllabus
 
Schedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine LearningSchedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine Learning
 
Schedule Classes DataLab Community
Schedule Classes DataLab CommunitySchedule Classes DataLab Community
Schedule Classes DataLab Community
 
Introduction to logistic regression
Introduction to logistic regressionIntroduction to logistic regression
Introduction to logistic regression
 

Recently uploaded

Module-1-Building Acoustics(Introduction)(Unit-1).pdf
Module-1-Building Acoustics(Introduction)(Unit-1).pdfModule-1-Building Acoustics(Introduction)(Unit-1).pdf
Module-1-Building Acoustics(Introduction)(Unit-1).pdfManish Kumar
 
ADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain studyADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain studydhruvamdhruvil123
 
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organizationchnrketan
 
The Satellite applications in telecommunication
The Satellite applications in telecommunicationThe Satellite applications in telecommunication
The Satellite applications in telecommunicationnovrain7111
 
CS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfCS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfBalamuruganV28
 
Substation Automation SCADA and Gateway Solutions by BRH
Substation Automation SCADA and Gateway Solutions by BRHSubstation Automation SCADA and Gateway Solutions by BRH
Substation Automation SCADA and Gateway Solutions by BRHbirinder2
 
Structural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot MuiliStructural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot MuiliNimot Muili
 
Javier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptxJavier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptxJavier Fernández Muñoz
 
March 2024 - Top 10 Read Articles in Artificial Intelligence and Applications...
March 2024 - Top 10 Read Articles in Artificial Intelligence and Applications...March 2024 - Top 10 Read Articles in Artificial Intelligence and Applications...
March 2024 - Top 10 Read Articles in Artificial Intelligence and Applications...gerogepatton
 
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...arifengg7
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewsandhya757531
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.elesangwon
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosVictor Morales
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
Detection&Tracking - Thermal imaging object detection and tracking
Detection&Tracking - Thermal imaging object detection and trackingDetection&Tracking - Thermal imaging object detection and tracking
Detection&Tracking - Thermal imaging object detection and trackinghadarpinhas1
 
Robotics Group 10 (Control Schemes) cse.pdf
Robotics Group 10  (Control Schemes) cse.pdfRobotics Group 10  (Control Schemes) cse.pdf
Robotics Group 10 (Control Schemes) cse.pdfsahilsajad201
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 

Recently uploaded (20)

Module-1-Building Acoustics(Introduction)(Unit-1).pdf
Module-1-Building Acoustics(Introduction)(Unit-1).pdfModule-1-Building Acoustics(Introduction)(Unit-1).pdf
Module-1-Building Acoustics(Introduction)(Unit-1).pdf
 
ADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain studyADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain study
 
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organization
 
The Satellite applications in telecommunication
The Satellite applications in telecommunicationThe Satellite applications in telecommunication
The Satellite applications in telecommunication
 
CS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfCS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdf
 
Substation Automation SCADA and Gateway Solutions by BRH
Substation Automation SCADA and Gateway Solutions by BRHSubstation Automation SCADA and Gateway Solutions by BRH
Substation Automation SCADA and Gateway Solutions by BRH
 
Structural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot MuiliStructural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
Structural Integrity Assessment Standards in Nigeria by Engr Nimot Muili
 
Javier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptxJavier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptx
 
March 2024 - Top 10 Read Articles in Artificial Intelligence and Applications...
March 2024 - Top 10 Read Articles in Artificial Intelligence and Applications...March 2024 - Top 10 Read Articles in Artificial Intelligence and Applications...
March 2024 - Top 10 Read Articles in Artificial Intelligence and Applications...
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overview
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Community
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitos
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
 
Detection&Tracking - Thermal imaging object detection and tracking
Detection&Tracking - Thermal imaging object detection and trackingDetection&Tracking - Thermal imaging object detection and tracking
Detection&Tracking - Thermal imaging object detection and tracking
 
Robotics Group 10 (Control Schemes) cse.pdf
Robotics Group 10  (Control Schemes) cse.pdfRobotics Group 10  (Control Schemes) cse.pdf
Robotics Group 10 (Control Schemes) cse.pdf
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 

05 linear transformations

  • 1. Mathematics for Artificial Intelligence Transformation and Applications Andres Mendez-Vazquez April 9, 2020 1 / 140
  • 2. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 2 / 140
  • 3. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 3 / 140
  • 4. Going further than solving Ax = y We can go further We can think on the matrix A as a function!!! In general A function f whose domain Rn and defines a rule that associate x ∈ Rn to a vector y ∈ Rm y = f (x) equivalently f : Rn −→ Rm We like the second expression because 1 It is easy to identify the domain Rn 2 It is easy to find the range Rm 4 / 140
  • 5. Going further than solving Ax = y We can go further We can think on the matrix A as a function!!! In general A function f whose domain Rn and defines a rule that associate x ∈ Rn to a vector y ∈ Rm y = f (x) equivalently f : Rn −→ Rm We like the second expression because 1 It is easy to identify the domain Rn 2 It is easy to find the range Rm 4 / 140
  • 6. Going further than solving Ax = y We can go further We can think on the matrix A as a function!!! In general A function f whose domain Rn and defines a rule that associate x ∈ Rn to a vector y ∈ Rm y = f (x) equivalently f : Rn −→ Rm We like the second expression because 1 It is easy to identify the domain Rn 2 It is easy to find the range Rm 4 / 140
  • 7. Going further than solving Ax = y We can go further We can think on the matrix A as a function!!! In general A function f whose domain Rn and defines a rule that associate x ∈ Rn to a vector y ∈ Rm y = f (x) equivalently f : Rn −→ Rm We like the second expression because 1 It is easy to identify the domain Rn 2 It is easy to find the range Rm 4 / 140
  • 8. Examples f : R → R3 f (t) =    x (t) y (t) z (t)    =    t 3t2 + 1 sin (t)    This are called parametric functions Depending on the context, it could represent the position or the velocity of a mass point. 5 / 140
  • 9. Examples f : R → R3 f (t) =    x (t) y (t) z (t)    =    t 3t2 + 1 sin (t)    This are called parametric functions Depending on the context, it could represent the position or the velocity of a mass point. 5 / 140
  • 10. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 6 / 140
  • 11. A Classic Example We have if A is a m × n, we can use A to define a function. We will call them fA : Rn → Rm In other words fA (x) = Ax 7 / 140
  • 12. A Classic Example We have if A is a m × n, we can use A to define a function. We will call them fA : Rn → Rm In other words fA (x) = Ax 7 / 140
  • 13. A Classic Example We have if A is a m × n, we can use A to define a function. We will call them fA : Rn → Rm In other words fA (x) = Ax 7 / 140
  • 14. Example Let A2×3 = 1 2 3 4 5 6 This allows to define fA    x y z    = 1 2 3 4 5 6    x y z    = x + 2y + z 4x + 5y + 6z We have For each vector x ∈ R3 to the vector Ax ∈ R2 8 / 140
  • 15. Example Let A2×3 = 1 2 3 4 5 6 This allows to define fA    x y z    = 1 2 3 4 5 6    x y z    = x + 2y + z 4x + 5y + 6z We have For each vector x ∈ R3 to the vector Ax ∈ R2 8 / 140
  • 16. Example Let A2×3 = 1 2 3 4 5 6 This allows to define fA    x y z    = 1 2 3 4 5 6    x y z    = x + 2y + z 4x + 5y + 6z We have For each vector x ∈ R3 to the vector Ax ∈ R2 8 / 140
  • 17. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 9 / 140
  • 18. We have Definition A function f : Rn → Rm is said to be linear if (1) f (x1 + x2) = f (x1) + f (x2) and (2) f (cx) = cf (x), for all x1, x2 ∈ Rn and all scalars c. Thus a linear function f is also known as a linear transformation. 10 / 140
  • 20. We have the following proposition Proposition f : Rn → Rm is linear ⇐⇒ for all vectors x1, x2 ∈ Rn and for all the scalars c1, c2: f (c1x1 + c2x2) = c1f (x1) + c2f (x2) Proof Any idea? 11 / 140
  • 22. Proof If Am×n is a matrix, fA is a linear transformation How? First fA (x1 + x2) = A (x1 + x2) = Ax1 + Ax2 = fA (x1) + fA (x2) Second What about fA (cx1)? 12 / 140
  • 26. We have Definition (Actually related to the null-space) If f : Rn → Rm is linear, the kernel of f is defined by Ker (f) = {v ∈ Rn | f (v) = 0}. Definition If f : Rn → Rm is linear, the range of f is defined by Range (f) = {y ∈ Rm | y = f (x) for some x ∈ Rn}. 14 / 140
  • 28. We have also the following Spaces Row Space We have that the span of the row vectors of A form a subspace. Column Space We have that the span of the column vectors of A, also, form a subspace. 15 / 140
  • 30. From This It can be shown that Ker (f) is a subspace of Rn Also Range (f) is a subspace of Rm 16 / 140
  • 33. Assume the following Let

$$e_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}_{n\times 1}, \quad e_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}_{n\times 1}, \quad \ldots, \quad e_n = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}_{n\times 1}$$

Then any vector x ∈ Rn can be written as

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = x_1 e_1 + x_2 e_2 + \ldots + x_n e_n$$

18 / 140
  • 35. Then Applying f f (x) = x1f (e1) + x2f (e2) + ... + xnf (en) A linear combination of elements {f (e1) , f (e2) , ..., f (en)} They are column vectors in Rm A = (f (e1) |f (e2) |...|f (en))m×n 19 / 140
  • 38. Thus, we have Finally, we have f (x) = (f (e1) |f (e2) |...|f (en)) x = Ax Definition The matrix A defined above for the function f is called the matrix of f in the standard basis. 20 / 140
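To make the standard-basis construction concrete, here is a minimal NumPy sketch (added, not from the slides; the helper name is illustrative) that recovers A column by column from f(e1), ..., f(en):

```python
import numpy as np

def matrix_in_standard_basis(f, n):
    """Build A = (f(e1) | f(e2) | ... | f(en)) column by column."""
    columns = []
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        columns.append(f(e_i))
    return np.column_stack(columns)

# Example: a linear map R^3 -> R^2 given by an explicit rule
f = lambda x: np.array([x[0] + 2 * x[1] + 3 * x[2],
                        4 * x[0] + 5 * x[1] + 6 * x[2]])
A = matrix_in_standard_basis(f, 3)
print(A)  # recovers [[1, 2, 3], [4, 5, 6]]
```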
  • 41. Given an m × n matrix A The set of all solutions to the homogeneous equation Ax = 0 is a subspace V of Rn. Remember how to prove that a set is a subspace: x1 + x2 ∈ V and cx ∈ V. Do you remember? 22 / 140
  • 43. Then, we have Definition This important subspace is called the null space of A, and is denoted Null(A) It is also known as xH = {x|Ax = 0} 23 / 140
  • 45. Knowing that Range(f) and Ker(f) are sub-spaces Which subspaces are they? Any idea? Range(f) is the column space of the matrix A. Ker(f) is the null space of A. 24 / 140
  • 49. We have a nice theorem Dimension Theorem Let f : Rn → Rm be linear. Then dim (domain (f)) = dim (Range (f)) + dim (Ker (f)) Where The dimension of V , written dim(V ), is the number of elements in any basis of V . 26 / 140
  • 51. Rank and Nullity of a Matrix Definition The rank of the matrix A is the dimension of the row space of A, and is denoted R(A). Example The rank of In×n is n. 27 / 140
  • 53. Then Theorem The rank of a matrix in Gauss-Jordan form is equal to the number of leading variables. Proof In the Gauss-Jordan form of a matrix, every non-zero row has a leading 1, which is the only non-zero entry in its column. Then No elementary row operation can zero out a leading 1, so these non-zero rows are linearly independent. 28 / 140
  • 56. Thus We have Since all the other rows are zero, the dimension of the row space of the Gauss-Jordan form is equal to the number of leading 1’s. Finally This is the same as the number of leading variables. Q.E.D. 29 / 140
  • 58. About the Nullity of the Matrix Definition The nullity of the matrix A is the dimension of the null space of A, and is denoted by dim [N(A)]. Example The nullity of I is 0. 30 / 140
  • 61. Number of Free Variables Theorem The nullity of a matrix in Gauss-Jordan form is equal to the number of free variables. Proof Suppose A is m × n, and that the Gauss-Jordan form has j leading variables and k free variables: j + k = n 31 / 140
  • 62. Proof Then, when computing the solution to the homogeneous equation, we solve for the first j (leading) variables in terms of the remaining k free variables s1, s2, s3, ..., sk. Then the general solution to the homogeneous equation has the form s1v1 + s2v2 + s3v3 + · · · + skvk. 32 / 140
  • 65. Where The vectors are the canonical ones Here, a trick!!! Meaning: in v1 the 1 appears at position j + 1, after many 0's, with zeros afterwards, and so on for the rest. Therefore the vectors v1, v2, v3, · · · , vk are linearly independent. 33 / 140
  • 68. Therefore They are a basis for the null space of A And there are k of them, the same as the number of free variables. 34 / 140
  • 69. Now Definition The matrix B is said to be row equivalent to A (B ∼ A) if B can be obtained from A by a finite sequence of elementary row operations. In matrix terms B ∼ A ⇔ there exist elementary matrices such that B = EkEk−1 · · · E1A. If we write C = EkEk−1 · · · E1, then B is row equivalent to A if B = CA with C invertible. 35 / 140
  • 72. Then, we have Theorem If B ∼ A, then Null(B) = Null(A). Theorem If B ∼ A, then the row space of B is identical to that of A. Summarizing Row operations change neither the row space nor the null space of A. 36 / 140
  • 75. Corollaries Corollary 1 If R is the Gauss-Jordan form of A, then R has the same null space and row space as A. Corollary 2 If B ∼ A, then R(B) = R(A), and N(B) = N(A). 37 / 140
  • 77. Then Theorem The number of linearly independent rows of the matrix A is equal to the number of linearly independent columns of A. Thus In particular, the rank of A is also equal to the number of linearly independent columns, and hence to the dimension of the column space of A Therefore The number of linearly independent columns of A is then just the number of leading entries in the Gauss-Jordan form of A which is, as we know, the same as the rank of A. 38 / 140
  • 80. Proof of the theorem (Dimension Theorem) First The rank of A is the same as the rank of the Gauss-Jordan form of A which is equal to the number of leading entries in the Gauss-Jordan form. Additionally The dimension of the null space is equal to the number of free variables in the reduced echelon (Gauss-Jordan) form of A. Then We know further that the number of free variables plus the number of leading entries is exactly the number of columns. 39 / 140
  • 83. Finally We have dim (domain (f)) = dim (Range (f)) + dim (Ker (f)) 40 / 140
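A small numerical check of the Rank-Nullity Theorem (a sketch added here, assuming NumPy; not from the original slides):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # 2 x 3, so the domain of f_A is R^3

n = A.shape[1]
rank = np.linalg.matrix_rank(A)   # dim(Range(f_A))
nullity = n - rank                # dim(Ker(f_A)), by the theorem

# An explicit basis of the null space: the rows of V^T beyond the rank
U, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:]            # a single vector here, since nullity = 1

print(rank, nullity)                     # 2 1, and 2 + 1 = 3 = n
print(np.allclose(A @ null_basis.T, 0))  # True: these vectors map to 0
```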
  • 85. As we know Many Times We want to obtain a maximum or a minimum of a cost function expressed in terms of matrices.... We need then to define matrix derivatives Thus, this discussion is useful in Machine Learning. 42 / 140
  • 87. Basic Definition Let ψ (x) = y, where y is an m-element vector and x is an n-element vector. Then we define the derivative with respect to a vector as

$$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix}$$

43 / 140
  • 89. What is this The Matrix denotes the m × n matrix of first order partial derivatives Such a matrix is called the Jacobian matrix of the transformation ψ (x). Then, we can get our first ideas on derivatives For Linear Transformations. 44 / 140
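To see the definition in action, here is a small finite-difference sketch (an added illustration, not the author's code; `jacobian` is a hypothetical helper), applied to the parametric function f(t) = (t, 3t² + 1, sin t) from the earlier example:

```python
import numpy as np

def jacobian(psi, x, h=1e-6):
    """Approximate the m x n Jacobian of psi at x by central differences."""
    x = np.asarray(x, dtype=float)
    m = len(psi(x))
    J = np.zeros((m, len(x)))
    for j in range(len(x)):
        dx = np.zeros_like(x)
        dx[j] = h
        J[:, j] = (psi(x + dx) - psi(x - dx)) / (2 * h)
    return J

f = lambda t: np.array([t[0], 3 * t[0] ** 2 + 1, np.sin(t[0])])
print(jacobian(f, [0.5]))  # approx [[1], [3], [cos(0.5)]], i.e. (1, 6t, cos t) at t = 0.5
```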
  • 92. Derivative of y = Ax Theorem Let y = Ax, where y is m × 1, x is n × 1, A is m × n, and A does not depend on x. Then ∂y/∂x = A. 46 / 140
  • 93. Proof Each ith element of y is given by

$$y_i = \sum_{k=1}^{n} a_{ik} x_k$$

We have that ∂yi/∂xj = aij for all i = 1, 2, ..., m and j = 1, 2, ..., n. Hence ∂y/∂x = A. 47 / 140
  • 97. Derivative of yT Ax Theorem Let the scalar α be defined by α = yT Ax, where y is m × 1, x is n × 1, A is m × n, and A does not depend on x and y. Then ∂α/∂x = yT A and ∂α/∂y = xT AT. 49 / 140
  • 99. Proof Define wT = yT A. Note that α = wT x. By the previous theorem, ∂α/∂x = wT = yT A. In a similar way, you can prove the other statement. 50 / 140
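Both identities can be checked numerically; a minimal sketch (added, with an illustrative `grad` helper built from central differences):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
y = rng.normal(size=3)
x = rng.normal(size=4)

def grad(f, v, h=1e-6):
    """Row vector of partial derivatives of the scalar f at v."""
    g = np.zeros(len(v))
    for j in range(len(v)):
        dv = np.zeros(len(v))
        dv[j] = h
        g[j] = (f(v + dv) - f(v - dv)) / (2 * h)
    return g

alpha_x = lambda x_: y @ A @ x_   # alpha as a function of x
alpha_y = lambda y_: y_ @ A @ x   # alpha as a function of y

print(np.allclose(grad(alpha_x, x), y @ A))  # True: d(alpha)/dx = y^T A
print(np.allclose(grad(alpha_y, y), A @ x))  # True: d(alpha)/dy = x^T A^T = (Ax)^T
```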
  • 103. What is it? First of all, we have a parametric model!!! Here, we have a hyperplane as a model: g(x) = wT x + w0 (1). Note: wT x is also known as the dot product. In the case of R2 we have: g (x) = (w1, w2) (x1, x2)T + w0 = w1x1 + w2x2 + w0 (2). 52 / 140
  • 107. Splitting The Space R2 Using a simple straight line (Hyperplane) [figure: two classes separated by a straight line] 55 / 140
  • 108. Splitting the Space? For example, assume the following vector w and constant w0: w = (−1, 2)T and w0 = 0. [figure: the corresponding hyperplane] 56 / 140
  • 110. Then, we have The following results:

$$g\begin{pmatrix} 1 \\ 2 \end{pmatrix} = (-1, 2)\begin{pmatrix} 1 \\ 2 \end{pmatrix} = -1 \times 1 + 2 \times 2 = 3, \qquad g\begin{pmatrix} 3 \\ 1 \end{pmatrix} = (-1, 2)\begin{pmatrix} 3 \\ 1 \end{pmatrix} = -1 \times 3 + 2 \times 1 = -1$$

YES!!! We have a positive side and a negative side!!! 57 / 140
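The same check in a few lines of NumPy (an added sketch, using the w and w0 from the example above):

```python
import numpy as np

w, w0 = np.array([-1.0, 2.0]), 0.0

def g(x):
    """Value of the hyperplane function g(x) = w^T x + w0."""
    return w @ x + w0

print(g(np.array([1, 2])))  #  3.0 -> positive side
print(g(np.array([3, 1])))  # -1.0 -> negative side
```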
  • 112. The Decision Surface The equation g (x) = 0 defines a decision surface separating the elements into classes ω1 and ω2. When g (x) is linear, the decision surface is a hyperplane. Now assume x1 and x2 are both on the decision surface: wT x1 + w0 = 0 and wT x2 + w0 = 0. Thus wT x1 + w0 = wT x2 + w0 (3). 59 / 140
  • 115. Defining a Decision Surface Then wT (x1 − x2) = 0 (4) 60 / 140
  • 116. Therefore x1 − x2 lies in the hyperplane, i.e., it is perpendicular to w. Remark: any vector in the hyperplane is a linear combination of elements of a basis of it; therefore any vector in the hyperplane is perpendicular to w. 61 / 140
  • 117. Therefore The space is split into two regions (example in R3) by the hyperplane H. 62 / 140
  • 119. Some Properties of the Hyperplane Given that g (x) > 0 if x ∈ R1 64 / 140
  • 120. It is more We can say the following: any x ∈ R1 is on the positive side of H, and any x ∈ R2 is on the negative side of H. In addition, g (x) gives us a way to obtain the distance from x to the hyperplane H. First, we express any x as

$$x = x_p + r \frac{w}{\|w\|}$$

where xp is the normal projection of x onto H, and r is the desired distance: positive if x is on the positive side, negative if x is on the negative side. 65 / 140
  • 127. We have something like this [figure: x, its projection xp onto H, and the distance r] 66 / 140
  • 128. Now Since g (xp) = 0, we have that

$$g(x) = g\left(x_p + r\frac{w}{\|w\|}\right) = w^T\left(x_p + r\frac{w}{\|w\|}\right) + w_0 = \left(w^T x_p + w_0\right) + r\frac{w^T w}{\|w\|} = g(x_p) + r\frac{\|w\|^2}{\|w\|} = r\|w\|$$

Then, we have

$$r = \frac{g(x)}{\|w\|} \quad (5)$$

67 / 140
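A short sketch of the signed distance r = g(x)/‖w‖ (added here, with the illustrative w and w0 used earlier):

```python
import numpy as np

w, w0 = np.array([-1.0, 2.0]), 0.0

def signed_distance(x):
    """Signed distance from x to the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

print(signed_distance(np.array([1, 2])))  #  3/sqrt(5) > 0: positive side
print(signed_distance(np.array([3, 1])))  # -1/sqrt(5) < 0: negative side
```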
  • 134. In particular The distance from the origin to H is

$$r = \frac{g(0)}{\|w\|} = \frac{w^T 0 + w_0}{\|w\|} = \frac{w_0}{\|w\|} \quad (6)$$

Remarks: if w0 > 0, the origin is on the positive side of H; if w0 < 0, the origin is on the negative side of H; if w0 = 0, g has the homogeneous form wT x and the hyperplane passes through the origin. 68 / 140
  • 139. We want to solve the independence of w0 We would like w0 to be part of the dot product, by setting x0 = 1:

$$g(x) = w_0 \times 1 + \sum_{i=1}^{d} w_i x_i = w_0 \times x_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i \quad (7)$$

by making

$$x_{aug} = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{pmatrix} = \begin{pmatrix} 1 \\ x \end{pmatrix}$$

where xaug is called an augmented feature vector. 70 / 140
  • 144. In a similar way We have the augmented weight vector

$$w_{aug} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix} = \begin{pmatrix} w_0 \\ w \end{pmatrix}$$

Remarks: the addition of a constant component to x preserves all the distance relationships between samples, and the resulting xaug vectors all lie in a d-dimensional subspace, which is the x-space itself. 71 / 140
  • 147. More Remarks In addition The hyperplane decision surface defined by wT_aug x_aug = 0 passes through the origin in xaug-space, even though the corresponding hyperplane H can be in any position of the x-space. 72 / 140
  • 149. More Remarks In addition The distance from xaug to this hyperplane is

$$\frac{\left|w_{aug}^T x_{aug}\right|}{\|w_{aug}\|} = \frac{|g(x_{aug})|}{\|w_{aug}\|}$$

73 / 140
  • 150. Now Is ‖w‖ ≤ ‖waug‖? Ideas?

$$\sqrt{\sum_{i=1}^{d} w_i^2} \leq \sqrt{\sum_{i=1}^{d} w_i^2 + w_0^2}$$

This mapping is quite useful because we only need to find a single weight vector waug, instead of finding the weight vector w and the threshold w0 separately. 74 / 140
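A minimal sketch of the augmentation trick (added here; the values of x, w, and w0 are made up for illustration):

```python
import numpy as np

x = np.array([3.0, 1.0])
w, w0 = np.array([-1.0, 2.0]), 0.5

# Augment: prepend a 1 to x and prepend w0 to w
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([w0], w))

# g(x) = w^T x + w0 equals w_aug^T x_aug
print(w @ x + w0, w_aug @ x_aug)                   # same value
print(np.linalg.norm(w) <= np.linalg.norm(w_aug))  # True
```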
  • 154. Initial Supposition Suppose we have n samples x1, x2, ..., xn, some labeled ω1 and some labeled ω2. We want a weight vector w such that wT xi > 0 if xi ∈ ω1, and wT xi < 0 if xi ∈ ω2. The name of this weight vector: it is called a separating vector or solution vector. 77 / 140
  • 158. Now, assume the following Imagine that your problem has two classes ω1 and ω2 in R2: (1) they are linearly separable!!! (2) you need to label them. We have a problem!!! What is the problem? We do not know the hyperplane!!! Thus, what distance does each point have to the hyperplane? 78 / 140
  • 162. A Simple Solution For Our Quandary Label the classes: ω1 =⇒ +1 and ω2 =⇒ −1. We produce the following labels: (1) if x ∈ ω1 then yideal = gideal (x) = +1; (2) if x ∈ ω2 then yideal = gideal (x) = −1. Remark: we have a problem with these labels!!! 79 / 140
  • 167. Now, What? Assume the true function is given by

ynoise = gnoise (x) = wT x + w0 + e (8)

where the noise term e has distribution e ∼ N (µ, σ2). Thus, we can write

ynoise = gnoise (x) = gideal (x) + e (9)

81 / 140
  • 170. Thus, we have What to do? e = ynoise − gideal (x) (10) Graphically 82 / 140
  • 172. Then, we have A TRICK... Quite a good one!!! Instead of using ynoise e = ynoise − gideal (x) (11) We use yideal e = yideal − gideal (x) (12) We will see How the geometry will solve the problem with using these labels. 83 / 140
  • 176. Here, we have multiple errors What can we do? 85 / 140
  • 177. Sum Over All the Errors We can do the following:

$$J(w) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} \left(y_i - g_{ideal}(x_i)\right)^2 \quad (13)$$

Remark: this is known as the Least Squared Error cost function. Generalizing: the dimensionality of each sample (data point) is d, and you can extend each sample vector to xT = (1, x′). 86 / 140
  • 180. We can use a trick The following function:

$$g_{ideal}(x) = \begin{pmatrix} 1 & x_1 & x_2 & \cdots & x_d \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix} = x^T w$$

We can rewrite the error equation as

$$J(w) = \sum_{i=1}^{N} \left(y_i - g_{ideal}(x_i)\right)^2 = \sum_{i=1}^{N} \left(y_i - x_i^T w\right)^2 \quad (14)$$

87 / 140
  • 182. Furthermore Then, stacking all the possible estimations into one product of the data matrix and the weight vector:

$$Xw = \begin{pmatrix} 1 & (x_1)_1 & \cdots & (x_1)_j & \cdots & (x_1)_d \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & (x_i)_1 & \cdots & (x_i)_j & \cdots & (x_i)_d \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & (x_N)_1 & \cdots & (x_N)_j & \cdots & (x_N)_d \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix}$$

88 / 140
  • 183. Note about other representations We could instead have xT = (x1, x2, ..., xd, 1), and thus

$$X = \begin{pmatrix} (x_1)_1 & \cdots & (x_1)_j & \cdots & (x_1)_d & 1 \\ \vdots & & \vdots & & \vdots & \vdots \\ (x_i)_1 & \cdots & (x_i)_j & \cdots & (x_i)_d & 1 \\ \vdots & & \vdots & & \vdots & \vdots \\ (x_N)_1 & \cdots & (x_N)_j & \cdots & (x_N)_d & 1 \end{pmatrix} \quad (15)$$

89 / 140
  • 184. Then, we have the following trick with X With the data matrix:

$$Xw = \begin{pmatrix} x_1^T w \\ x_2^T w \\ x_3^T w \\ \vdots \\ x_N^T w \end{pmatrix} \quad (16)$$

90 / 140
  • 185. Therefore We have that

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{pmatrix} - \begin{pmatrix} x_1^T w \\ x_2^T w \\ x_3^T w \\ \vdots \\ x_N^T w \end{pmatrix} = \begin{pmatrix} y_1 - x_1^T w \\ y_2 - x_2^T w \\ y_3 - x_3^T w \\ \vdots \\ y_N - x_N^T w \end{pmatrix}$$

Then we have the following equality:

$$\begin{pmatrix} y_1 - x_1^T w & y_2 - x_2^T w & \cdots & y_N - x_N^T w \end{pmatrix} \begin{pmatrix} y_1 - x_1^T w \\ y_2 - x_2^T w \\ \vdots \\ y_N - x_N^T w \end{pmatrix} = \sum_{i=1}^{N} \left(y_i - x_i^T w\right)^2$$

91 / 140
  • 187. Then, we have The following equality:

$$\sum_{i=1}^{N} \left(y_i - x_i^T w\right)^2 = (y - Xw)^T (y - Xw) = \|y - Xw\|_2^2 \quad (17)$$

92 / 140
  • 188. We can expand our quadratic formula!!! Thus (y − Xw)T (y − Xw) = yT y − yT Xw − wT XT y + wT XT Xw (18) Now Derive with respect to w Assume that XT X is invertible 93 / 140
  • 191. Therefore We have the following equivalences:

$$\frac{d\, w^T A w}{dw} = w^T \left(A + A^T\right), \qquad \frac{d\, w^T A}{dw} = A^T \quad (19)$$

Now, given that the transpose of a scalar is the scalar itself: yT Xw = (yT Xw)T = wT XT y. 94 / 140
  • 193. Then, when we derive by w We have then

$$\frac{d\left( y^T y - 2w^T X^T y + w^T X^T X w \right)}{dw} = -2y^T X + w^T \left( X^T X + \left(X^T X\right)^T \right) = -2y^T X + 2w^T X^T X$$

Making this equal to the zero row vector:

$$-2y^T X + 2w^T X^T X = 0$$

We apply the transpose:

$$\left( -2y^T X + 2w^T X^T X \right)^T = 0^T \implies -2X^T y + 2\left( X^T X \right) w = 0 \quad \text{(a column vector)}$$

95 / 140
  • 198. Solving for w We have then

$$w = \left( X^T X \right)^{-1} X^T y \quad (20)$$

Note: XT X is always positive semi-definite; if it is also invertible, it is positive definite. Thus, how do we get the discriminant function? Any ideas? 96 / 140
  • 200. The Final Discriminant Function Very Simple!!!

$$g(x) = x^T w = x^T \left( X^T X \right)^{-1} X^T y \quad (21)$$

97 / 140
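A minimal NumPy sketch of the closed-form solution on synthetic data (added here; in practice `np.linalg.lstsq` is the numerically safer route than forming the normal equations):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 2
X_raw = rng.normal(size=(N, d))
y = 2.0 + X_raw @ np.array([1.5, -3.0]) + 0.1 * rng.normal(size=N)

# Augment with a leading column of ones (the x0 = 1 trick)
X = np.hstack([np.ones((N, 1)), X_raw])

# Normal equations: w = (X^T X)^(-1) X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approx [2.0, 1.5, -3.0]

# Same answer, more numerically stable:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))  # True
```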
  • 202. Also Known as Karhunen-Loeve Transform Setup Consider a data set of observations {xn} with n = 1, 2, ..., N and xn ∈ Rd. Goal Project data onto space with dimensionality m < d (We assume m is given) 99 / 140
  • 204. Dimensional Variance Remember the variance of a sample in R:

$$VAR(X) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} \quad (22)$$

You can do the same in the case of two variables X and Y:

$$COV(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1} \quad (23)$$

100 / 140
  • 206. Now, Define Given the data x1, x2, ..., xN (24), where each xi is a column vector, construct the sample mean

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \quad (25)$$

and center the data: x1 − x̄, x2 − x̄, ..., xN − x̄ (26). 101 / 140
  • 209. Build the Sample Covariance The covariance matrix:

$$S = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T \quad (27)$$

Properties: (1) the ijth entry of S is the covariance σij between dimensions i and j; (2) the iith entry of S is the variance σ2ii of dimension i. 102 / 140
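A short sketch of the mean, centering, and covariance steps (added, on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # N = 200 observations in R^3

x_bar = X.mean(axis=0)          # sample mean, eq. (25)
Xc = X - x_bar                  # centered data, eq. (26)
S = (Xc.T @ Xc) / (len(X) - 1)  # covariance matrix, eq. (27)

print(np.allclose(S, np.cov(X, rowvar=False)))  # True: matches np.cov
```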
• 212. Using S to Project Data For this we use a unit vector u_1 with u_1^T u_1 = 1 Question What is the sample variance of the projected data? 104 / 140
• 216. Thus we have Variance of the projected data \frac{1}{N-1}\sum_{i=1}^{N}\left(u_1^T x_i - u_1^T \bar{x}\right)^2 = u_1^T S u_1 (28) Use Lagrange Multipliers to Maximize u_1^T S u_1 + \lambda_1\left(1 - u_1^T u_1\right) (29) 106 / 140
• 218. Differentiate with respect to u_1 We get S u_1 = \lambda_1 u_1 (30) Then u_1 is an eigenvector of S. If we left-multiply by u_1^T u_1^T S u_1 = \lambda_1 (31) 107 / 140
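A self-contained numpy check of Eqs. (30)–(31): the top eigenvector of S satisfies S u_1 = λ_1 u_1, and the variance of the data projected onto u_1 equals λ_1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / (X.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(S)       # ascending order; S is symmetric
u1, lam1 = eigvecs[:, -1], eigvals[-1]

assert np.allclose(S @ u1, lam1 * u1)              # Eq. (30)
assert np.isclose(u1 @ S @ u1, lam1)               # Eq. (31)
assert np.isclose((Xc @ u1).var(ddof=1), lam1)     # projected variance = lam1
```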
• 221. What about the second eigenvector u_2 We have the following optimization problem max u_2^T S u_2 s.t. u_2^T u_2 = 1, u_2^T u_1 = 0 Lagrangian L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_1\left(u_2^T u_2 - 1\right) - \lambda_2\left(u_2^T u_1 - 0\right) 108 / 140
• 223. Explanation First, the constrained maximization We want to maximize u_2^T S u_2 Given that the second eigenvector is a unit vector We have then u_2^T u_2 = 1 Under orthonormal vectors The covariance goes to zero cov(u_1, u_2) = u_2^T S u_1 = u_2^T \lambda_1 u_1 = \lambda_1 u_1^T u_2 = 0 109 / 140
• 226. Meaning The principal components are perpendicular L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_1\left(u_2^T u_2 - 1\right) - \lambda_2\left(u_2^T u_1 - 0\right) Take the derivative with respect to u_2 \frac{\partial L(u_2, \lambda_1, \lambda_2)}{\partial u_2} = S u_2 - \lambda_1 u_2 - \lambda_2 u_1 = 0 Then, we left-multiply by u_1^T u_1^T S u_2 - \lambda_1 u_1^T u_2 - \lambda_2 u_1^T u_1 = 0 110 / 140
• 229. Then, we have that Something Notable 0 - 0 - \lambda_2 = 0, so the multiplier of the orthogonality constraint vanishes We have S u_2 - \lambda_1 u_2 = 0, so u_2 is again an eigenvector of S Implying that maximizing u_2^T S u_2 subject to u_2 \perp u_1 selects the eigenvector with the second largest eigenvalue \lambda_2 111 / 140
• 232. Thus Variance will be the maximum when u_1^T S u_1 = \lambda_1 (32) is set to the largest eigenvalue, also known as the First Principal Component By Induction In a d-dimensional space it is possible to define M eigenvectors u_1, u_2, ..., u_M of the data covariance S, corresponding to \lambda_1, \lambda_2, ..., \lambda_M, that maximize the variance of the projected data. Computational Cost 1 Full eigenvector decomposition O(d^3) 2 Power Method O(Md^2) (Golub and Van Loan, 1996) 3 Use the Expectation Maximization Algorithm 112 / 140
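Since the slide cites the power method, here is a minimal sketch of it (assuming, as for a covariance matrix, that S is symmetric positive semi-definite); deflation or the EM approach mentioned above would recover further components.

```python
import numpy as np

def power_method(S, num_iter=1000, tol=1e-10):
    """Leading eigenpair of a symmetric PSD matrix by power iteration."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iter):
        w = S @ v
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:   # converged (no sign flips: S is PSD)
            v = w
            break
        v = w
    return v @ S @ v, v                    # Rayleigh quotient = eigenvalue

# Agrees with the full O(d^3) decomposition on a random covariance matrix
S = np.cov(np.random.default_rng(2).normal(size=(100, 5)), rowvar=False)
lam, u = power_method(S)
assert np.isclose(lam, np.linalg.eigvalsh(S)[-1])
```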
• 236. We have the following steps Determine the covariance matrix S = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^T (33) Generate the decomposition S = U\Sigma U^T With the eigenvalues on the diagonal of \Sigma and the eigenvectors in the columns of U. 114 / 140
• 239. Then Project the samples x_i into the subspace of dimension k z_i = U_k^T x_i With U_k the matrix formed by the k columns of U with the largest eigenvalues 115 / 140
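Putting the three steps together, a minimal sketch (function and variable names hypothetical); note the samples are centered before projecting, consistent with Eq. (33).

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal subspace."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = (Xc.T @ Xc) / (X.shape[0] - 1)      # covariance matrix (Eq. 33)
    eigvals, U = np.linalg.eigh(S)          # S = U Sigma U^T, ascending order
    Uk = U[:, ::-1][:, :k]                  # k eigenvectors, largest first
    return Xc @ Uk, Uk, x_bar               # z_i = U_k^T (x_i - x_bar)

X = np.random.default_rng(3).normal(size=(200, 5))
Z, Uk, x_bar = pca_project(X, k=2)
print(Z.shape)  # (200, 2)
```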
• 245. What happens with non-square matrices? We can still diagonalize them Thus, we can obtain certain properties. We want to avoid the problems with S^{-1}AS The eigenvectors in S have three big problems 1 They are usually not orthogonal. 2 There are not always enough eigenvectors. 3 Ax = \lambda x requires A to be square. 121 / 140
• 248. Therefore, we can look at the following problem We have a series of vectors {x_1, x_2, ..., x_d} Then imagine a set of projection vectors and differences {a_1, a_2, ..., a_d} and {\alpha_1, \alpha_2, ..., \alpha_d} We want to know a little bit about the relations between them After all, we are looking at the possibility of using them for our problem 122 / 140
• 251. Using the Hypotenuse A little bit of geometry (Pythagoras applied to each x_j, split into its projection a_j and its residual \alpha_j) gives x_j^T x_j = a_j^T a_j + \alpha_j^T \alpha_j 123 / 140
• 252. Therefore We have two possible quantities for each j \alpha_j^T \alpha_j = x_j^T x_j - a_j^T a_j and a_j^T a_j = x_j^T x_j - \alpha_j^T \alpha_j Then, since x_j^T x_j is a constant, minimizing the residuals is equivalent to maximizing the projections min \sum_{j=1}^{n} \alpha_j^T \alpha_j \iff max \sum_{j=1}^{n} a_j^T a_j 124 / 140
• 254. Actually this is known as the dual problem (Weak Duality) An example of this min w^T x s.t. Ax \leq b, x \geq 0 Then, using what are known as slack variables x' Ax + A'x' = b Each row lives in the row space, but the y_i live in the column space: \left(Ax + A'x'\right)_i \to y_i and x \geq 0 125 / 140
• 257. Then, we have that Example A = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} Elements of the column space live in three dimensions But the row space has dimension 2 Properties 126 / 140
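The dimensions are quick to confirm numerically; a tiny sketch with the matrix above:

```python
import numpy as np

A = np.array([[0, 0],
              [0, 1],
              [1, 0]])

# Columns live in R^3 and rows in R^2, yet both spaces have dimension 2
assert np.linalg.matrix_rank(A) == 2   # rank = dim(column space) = dim(row space)
```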
• 260. We have then Stack such vectors of the d-dimensional space In a matrix A of n \times d A = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{pmatrix} The matrix works as a Projection Matrix We are looking for a unit vector v such that the length of the projection is maximized. 127 / 140
• 262. Why? Do you remember the projection p onto a single vector? Definition of the projection onto a unit vector v p = \frac{v^T a_i}{v^T v} v = \left(v^T a_i\right) v Therefore the length of the projected vector is \left\|\left(v^T a_i\right) v\right\| = \left|v^T a_i\right| 128 / 140
• 264. Then Thus, with a little bit of notation Av = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{pmatrix} v = \begin{pmatrix} a_1^T v \\ a_2^T v \\ \vdots \\ a_n^T v \end{pmatrix} Therefore \left\|Av\right\| = \sqrt{\sum_{i=1}^{n}\left(a_i^T v\right)^2} 129 / 140
• 266. Then It is possible to ask to maximize the length of such a vector (Singular Vector) v_1 = arg max_{\left\|v\right\| = 1} \left\|Av\right\| Then, we can define the following singular value \sigma_1(A) = \left\|Av_1\right\| 130 / 140
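A quick numerical illustration (random data): the maximizer of ||Av|| over unit vectors is the top right singular vector that np.linalg.svd returns, and no random unit vector beats it.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(50, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1, sigma1 = Vt[0], s[0]                 # first right singular vector / value

assert np.isclose(np.linalg.norm(A @ v1), sigma1)   # sigma_1(A) = ||A v_1||
for _ in range(1000):                                # v_1 maximizes ||Av||
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ v) <= sigma1 + 1e-9
```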
• 268. This is known as Definition The best-fit line problem is the problem of finding the best line for a set of data points, where the quality of the line is measured by the sum of squared (perpendicular) distances of the points to the line. Remember, we are looking at the dual problem.... Generalization This can be transferred to higher dimensions: one can find the best-fit k-dimensional subspace, i.e., the subspace which minimizes the sum of the squared distances of the points to the subspace 131 / 140
• 270. Then, in a Greedy Fashion The second singular vector v_2 v_2 = arg max_{v \perp v_1, \left\|v\right\| = 1} \left\|Av\right\| Then you go through this process Stop when we have found all the singular vectors v_1, v_2, ..., v_r, that is, when \max_{v \perp v_1, v_2, ..., v_r, \left\|v\right\| = 1} \left\|Av\right\| = 0 132 / 140
  • 273. Proving that the strategy is good Theorem Let A be an n × d matrix where v1, v2, ..., vr are the singular vectors defined above. For 1 ≤ k ≤ r, let Vk be the subspace spanned by v1, v2, ..., vk. Then for each k, Vk is the best-fit k-dimensional subspace for A. 133 / 140
• 274. Proof For k = 1 the statement is just the definition of v_1 What about k = 2? Let W be a best-fit 2-dimensional subspace for A. For any basis w_1, w_2 of W, \left|Aw_1\right|^2 + \left|Aw_2\right|^2 is the sum of the squared lengths of the projections of the rows of A onto W. Now, choose a basis w_1, w_2 so that w_2 is perpendicular to v_1 This can be a unit vector perpendicular to the projection of v_1 onto W. 134 / 140
• 277. Do you remember v_1 = arg max_{\left\|v\right\| = 1} \left\|Av\right\|? Therefore \left|Aw_1\right|^2 \leq \left|Av_1\right|^2, and since w_2 \perp v_1, \left|Aw_2\right|^2 \leq \left|Av_2\right|^2 Then \left|Aw_1\right|^2 + \left|Aw_2\right|^2 \leq \left|Av_1\right|^2 + \left|Av_2\right|^2 In a similar way for k > 2 V_k is at least as good as W and hence is optimal. 135 / 140
• 280. Remarks Every matrix has a singular value decomposition A = U\Sigma V^T Where The columns of U are an orthonormal basis for the column space. The columns of V are an orthonormal basis for the row space. \Sigma is diagonal and the entries on its diagonal \sigma_i = \Sigma_{ii} are positive real numbers, called the singular values of A. 136 / 140
• 283. Properties of the Singular Value Decomposition First The eigenvalues of the symmetric matrix A^T A are equal to the squares of the singular values of A A^T A = V\Sigma^T U^T U\Sigma V^T = V\Sigma^2 V^T Second The rank of a matrix is equal to the number of non-zero singular values. 137 / 140
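Both properties are easy to confirm numerically; a short sketch with a random non-square matrix:

```python
import numpy as np

A = np.random.default_rng(5).normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(A, U @ np.diag(s) @ Vt)           # A = U Sigma V^T

# First: eigenvalues of A^T A are the squared singular values
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(eig, s**2)

# Second: rank = number of non-zero singular values
assert np.linalg.matrix_rank(A) == np.sum(s > 1e-12)
```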
• 286. Singular Value Decomposition as Sums The singular value decomposition can be viewed as a sum of rank-1 matrices A = A_1 + A_2 + ... + A_R (34) Why? A = U\Sigma V^T = \begin{pmatrix} u_1 & u_2 & \cdots & u_R \end{pmatrix} \begin{pmatrix} \sigma_1 v_1^T \\ \sigma_2 v_2^T \\ \vdots \\ \sigma_R v_R^T \end{pmatrix} = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_R u_R v_R^T 139 / 140
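Eq. (34) in numpy, as a minimal sketch: summing the rank-1 terms σ_i u_i v_i^T rebuilds A exactly.

```python
import numpy as np

A = np.random.default_rng(6).normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A = sum_i sigma_i u_i v_i^T  (Eq. 34)
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
assert np.allclose(A, A_sum)
```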
• 288. Truncating Truncating the singular value decomposition allows us to represent the matrix with fewer parameters For a 512 \times 512 image Full Representation 512 \times 512 = 262,144 Rank-10 approximation 512 \times 10 + 10 + 10 \times 512 = 10,250 Rank-40 approximation 512 \times 40 + 40 + 40 \times 512 = 41,000 Rank-80 approximation 512 \times 80 + 80 + 80 \times 512 = 82,000 140 / 140
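A minimal sketch of the compression idea, using a random 512 × 512 array as a stand-in for an image: keep the top-k rank-1 terms and count the stored parameters as above.

```python
import numpy as np

img = np.random.default_rng(7).normal(size=(512, 512))  # stand-in for an image
U, s, Vt = np.linalg.svd(img, full_matrices=False)

for k in (10, 40, 80):
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # rank-k approximation
    params = 512 * k + k + k * 512                # U_k, sigma_1..k, V_k^T
    err = np.linalg.norm(img - approx) / np.linalg.norm(img)
    print(f"rank {k:3d}: {params:6d} parameters, relative error {err:.3f}")
```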