# 05 linear transformations

An Introduction to Linear Algebra


1. Mathematics for Artificial Intelligence: Transformation and Applications. Andres Mendez-Vazquez, April 9, 2020.
2. Outline. 1) Linear Transformation: Introduction; Functions that can be defined using matrices; Linear Functions; Kernel and Range; The Matrix of a Linear Transformation; Going Back to Homogeneous Equations; The Rank-Nullity Theorem. 2) Derivative of Transformations: Introduction; Derivative of a Linear Transformation; Derivative of a Quadratic Transformation. 3) Linear Regression: The Simplest Functions; Splitting the Space; Defining the Decision Surface; Properties of the Hyperplane $w^T x + w_0$; Augmenting the Vector; Least Squared Error Procedure; The Geometry of a Two-Category Linearly-Separable Case; The Error Idea; The Final Error Equation. 4) Principal Component Analysis: Karhunen-Loeve Transform; Projecting the Data; Lagrange Multipliers; The Process; Example. 5) Singular Value Decomposition: Introduction; Image Compression.
4. Going further than solving $Ax = y$: we can think of the matrix $A$ as a function! In general, a function $f$ with domain $\mathbb{R}^n$ defines a rule that associates each $x \in \mathbb{R}^n$ with a vector $y \in \mathbb{R}^m$: $y = f(x)$, or equivalently $f : \mathbb{R}^n \to \mathbb{R}^m$. We prefer the second expression because it makes both the domain $\mathbb{R}^n$ and the codomain $\mathbb{R}^m$ easy to identify.
8. Examples: $f : \mathbb{R} \to \mathbb{R}^3$ with $f(t) = (x(t), y(t), z(t))^T = (t,\; 3t^2 + 1,\; \sin t)^T$. These are called parametric functions; depending on the context, such a function could represent the position or the velocity of a point mass.
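A minimal NumPy sketch of this parametric curve (the sample values of $t$ are chosen here purely for illustration):

```python
import numpy as np

def f(t: float) -> np.ndarray:
    # The parametric curve from the slide: f(t) = (t, 3t^2 + 1, sin t)^T.
    return np.array([t, 3 * t**2 + 1, np.sin(t)])

# Evaluate the curve at a few parameter values.
for t in (0.0, 1.0, np.pi):
    print(t, f(t))
```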
10. Outline (section marker): Functions that can be defined using matrices.
11. A classic example: if $A$ is an $m \times n$ matrix, we can use $A$ to define a function, which we call $f_A : \mathbb{R}^n \to \mathbb{R}^m$. In other words, $f_A(x) = Ax$.
14. Example: let $A_{2 \times 3} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}$. This allows us to define $f_A\left((x, y, z)^T\right) = (x + 2y + 3z,\; 4x + 5y + 6z)^T$, so $f_A$ maps each vector $x \in \mathbb{R}^3$ to the vector $Ax \in \mathbb{R}^2$.
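A quick sketch of this example in NumPy (the test vector is illustrative):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # the 2 x 3 matrix from the slide

def f_A(x: np.ndarray) -> np.ndarray:
    # f_A : R^3 -> R^2 defined by f_A(x) = Ax.
    return A @ x

x = np.array([1.0, 1.0, 1.0])
print(f_A(x))  # [ 6. 15.], i.e. (1 + 2 + 3, 4 + 5 + 6)
```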
17. Outline (section marker): Linear Functions.
18. Definition: a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is said to be linear if (1) $f(x_1 + x_2) = f(x_1) + f(x_2)$ and (2) $f(cx) = c f(x)$, for all $x_1, x_2 \in \mathbb{R}^n$ and all scalars $c$. A linear function $f$ is also known as a linear transformation.
20. Proposition: $f : \mathbb{R}^n \to \mathbb{R}^m$ is linear $\iff$ for all vectors $x_1, x_2 \in \mathbb{R}^n$ and all scalars $c_1, c_2$: $f(c_1 x_1 + c_2 x_2) = c_1 f(x_1) + c_2 f(x_2)$. Proof: any idea?
22. Proof that if $A_{m \times n}$ is a matrix, then $f_A$ is a linear transformation. First, $f_A(x_1 + x_2) = A(x_1 + x_2) = Ax_1 + Ax_2 = f_A(x_1) + f_A(x_2)$. Second: what about $f_A(c x_1)$?
25. Outline (section marker): Kernel and Range.
26. Definition (actually related to the null space): if $f : \mathbb{R}^n \to \mathbb{R}^m$ is linear, the kernel of $f$ is defined by $\mathrm{Ker}(f) = \{v \in \mathbb{R}^n \mid f(v) = 0\}$. Definition: if $f : \mathbb{R}^n \to \mathbb{R}^m$ is linear, the range of $f$ is defined by $\mathrm{Range}(f) = \{y \in \mathbb{R}^m \mid y = f(x) \text{ for some } x \in \mathbb{R}^n\}$.
28. We also have the following spaces. Row space: the span of the row vectors of $A$ forms a subspace. Column space: the span of the column vectors of $A$ also forms a subspace.
30. From this, it can be shown that $\mathrm{Ker}(f)$ is a subspace of $\mathbb{R}^n$ and that $\mathrm{Range}(f)$ is a subspace of $\mathbb{R}^m$.
32. Outline (section marker): The Matrix of a Linear Transformation.
33. Assume the following: let $e_1 = (1, 0, \dots, 0)^T$, $e_2 = (0, 1, \dots, 0)^T$, ..., $e_n = (0, 0, \dots, 1)^T$, each of size $n \times 1$. Then any vector $x \in \mathbb{R}^n$ can be written $x = (x_1, x_2, \dots, x_n)^T = x_1 e_1 + x_2 e_2 + \dots + x_n e_n$.
35. Applying $f$: $f(x) = x_1 f(e_1) + x_2 f(e_2) + \dots + x_n f(e_n)$, a linear combination of the elements $\{f(e_1), f(e_2), \dots, f(e_n)\}$. These are column vectors in $\mathbb{R}^m$, so we can stack them into the $m \times n$ matrix $A = (f(e_1) \mid f(e_2) \mid \dots \mid f(e_n))$.
38. Finally, we have $f(x) = (f(e_1) \mid f(e_2) \mid \dots \mid f(e_n))\, x = Ax$. Definition: the matrix $A$ defined above for the function $f$ is called the matrix of $f$ in the standard basis.
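A small sketch of this construction, assuming the linear map is available as a Python callable (the example map is hypothetical, chosen only to exercise the helper):

```python
import numpy as np

def matrix_of(f, n: int) -> np.ndarray:
    # The matrix of a linear f : R^n -> R^m in the standard basis:
    # column j is f(e_j).
    return np.column_stack([f(np.eye(n)[:, j]) for j in range(n)])

# Example: a linear map R^2 -> R^2 that swaps and scales coordinates.
f = lambda x: np.array([2 * x[1], x[0]])
A = matrix_of(f, 2)
print(A)                          # [[0. 2.] [1. 0.]]
x = np.array([3.0, 4.0])
print(np.allclose(A @ x, f(x)))   # True: f(x) = Ax
```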
40. Outline (section marker): Going Back to Homogeneous Equations.
41. Given an $m \times n$ matrix $A$: the set of all solutions to the homogeneous equation $Ax = 0$ is a subspace $V$ of $\mathbb{R}^n$. Remember how to prove that a set is a subspace: $x_1 + x_2 \in V$ and $cx \in V$. Do you remember?
43. Definition: this important subspace is called the null space of $A$, denoted $\mathrm{Null}(A)$. It is also written $x_H = \{x \mid Ax = 0\}$.
45. Knowing that $\mathrm{Range}(f)$ and $\mathrm{Ker}(f)$ are subspaces, which ones are they? Any idea? $\mathrm{Range}(f)$ is the column space of the matrix $A$; $\mathrm{Ker}(f)$ is the null space of $A$.
48. Outline (section marker): The Rank-Nullity Theorem.
49. We have a nice theorem. Dimension Theorem: let $f : \mathbb{R}^n \to \mathbb{R}^m$ be linear. Then $\dim(\mathrm{domain}(f)) = \dim(\mathrm{Range}(f)) + \dim(\mathrm{Ker}(f))$, where the dimension of $V$, written $\dim(V)$, is the number of elements in any basis of $V$.
51. Rank and nullity of a matrix. Definition: the rank of the matrix $A$ is the dimension of the row space of $A$, denoted $R(A)$. Example: the rank of $I_{n \times n}$ is $n$.
53. Theorem: the rank of a matrix in Gauss-Jordan form is equal to the number of leading variables. Proof: in the Gauss-Jordan form of a matrix, every non-zero row has a leading 1, which is the only non-zero entry in its column. No elementary row operation can zero out a leading 1, so these non-zero rows are linearly independent.
56. Since all the other rows are zero, the dimension of the row space of the Gauss-Jordan form equals the number of leading 1's, which is the same as the number of leading variables. Q.E.D.
58. About the nullity of the matrix. Definition: the nullity of the matrix $A$ is the dimension of the null space of $A$, denoted $\dim[N(A)]$. Example: the nullity of $I$ is 0.
61. Theorem: the nullity of a matrix in Gauss-Jordan form is equal to the number of free variables. Proof: suppose $A$ is $m \times n$, and that the Gauss-Jordan form has $j$ leading variables and $k$ free variables, with $j + k = n$.
62. When computing the solution to the homogeneous equation, we solve for the first $j$ (leading) variables in terms of the remaining $k$ free variables $s_1, s_2, s_3, \dots, s_k$. The general solution to the homogeneous equation is then $s_1 v_1 + s_2 v_2 + s_3 v_3 + \dots + s_k v_k$.
65. Here is the trick: the vectors $v_1, \dots, v_k$ are canonical in the free-variable coordinates. In $v_1$, the entry 1 appears at position $j + 1$ (the first free variable), with zeros in the other free-variable positions, and so on. Therefore the vectors $v_1, v_2, v_3, \dots, v_k$ are linearly independent.
68. Therefore they form a basis for the null space of $A$, and there are $k$ of them, the same as the number of free variables.
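A hedged sketch: SciPy's `null_space` computes an orthonormal basis of $\mathrm{Null}(A)$ via the SVD rather than the Gauss-Jordan construction used in the proof, but the dimension it returns is the same nullity:

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])    # rank 2, so the nullity should be 3 - 2 = 1

N = null_space(A)               # columns: an orthonormal basis of Null(A)
print(N.shape[1])               # 1, matching the single free variable
print(np.allclose(A @ N, 0))    # True: every basis vector solves Ax = 0
```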
69. Definition: the matrix $B$ is said to be row equivalent to $A$ ($B \sim A$) if $B$ can be obtained from $A$ by a finite sequence of elementary row operations. In matrix terms, $B \sim A \iff$ there exist elementary matrices such that $B = E_k E_{k-1} \cdots E_1 A$. If we write $C = E_k E_{k-1} \cdots E_1$, then $B$ is row equivalent to $A$ if $B = CA$ with $C$ invertible.
72. Theorem: if $B \sim A$, then $\mathrm{Null}(B) = \mathrm{Null}(A)$. Theorem: if $B \sim A$, then the row space of $B$ is identical to that of $A$. Summarizing: row operations change neither the row space nor the null space of $A$.
75. Corollary 1: if $R$ is the Gauss-Jordan form of $A$, then $R$ has the same null space and row space as $A$. Corollary 2: if $B \sim A$, then $R(B) = R(A)$ and $N(B) = N(A)$.
77. Theorem: the number of linearly independent rows of the matrix $A$ is equal to the number of linearly independent columns of $A$. In particular, the rank of $A$ is also equal to the number of linearly independent columns, and hence to the dimension of the column space of $A$. The number of linearly independent columns of $A$ is just the number of leading entries in the Gauss-Jordan form of $A$, which is, as we know, the same as the rank of $A$.
80. Proof of the Dimension Theorem. First, the rank of $A$ is the same as the rank of the Gauss-Jordan form of $A$, which equals the number of leading entries in that form. Additionally, the dimension of the null space equals the number of free variables in the reduced echelon (Gauss-Jordan) form of $A$. Finally, the number of free variables plus the number of leading entries is exactly the number of columns.
83. Finally, we have $\dim(\mathrm{domain}(f)) = \dim(\mathrm{Range}(f)) + \dim(\mathrm{Ker}(f))$.
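The theorem is easy to check numerically; a short sketch (the random matrix is illustrative):

```python
import numpy as np
from scipy.linalg import null_space

A = np.random.default_rng(0).normal(size=(3, 5))  # f_A : R^5 -> R^3

rank = np.linalg.matrix_rank(A)      # dim Range(f_A), the column space
nullity = null_space(A).shape[1]     # dim Ker(f_A), the null space
print(rank + nullity == A.shape[1])  # True: rank + nullity = n = 5
```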
84. Outline (section marker): Derivative of Transformations, Introduction.
85. As we know, many times we want to obtain a maximum or a minimum of a cost function expressed in terms of matrices, so we need to define matrix derivatives. This discussion is therefore useful in machine learning.
87. Basic definition: let $\psi(x) = y$, where $y$ is an $m$-element vector and $x$ is an $n$-element vector. We define the derivative with respect to a vector as $\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix}$.
89. What is this? This $m \times n$ matrix of first-order partial derivatives is called the Jacobian matrix of the transformation $\psi(x)$. With it, we can get our first results on derivatives of linear transformations.
91. Outline (section marker): Derivative of a Linear Transformation.
92. Derivative of $y = Ax$. Theorem: let $y = Ax$, where $y$ is $m \times 1$, $x$ is $n \times 1$, $A$ is $m \times n$, and $A$ does not depend on $x$. Then $\frac{\partial y}{\partial x} = A$.
93. Proof: the $i$-th element of $y$ is given by $y_i = \sum_{k=1}^{n} a_{ik} x_k$. We have $\frac{\partial y_i}{\partial x_j} = a_{ij}$ for all $i = 1, 2, \dots, m$ and $j = 1, 2, \dots, n$. Hence $\frac{\partial y}{\partial x} = A$.
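A numerical sanity check of this theorem using a finite-difference Jacobian (the helper function and test point are illustrative):

```python
import numpy as np

def jacobian(psi, x: np.ndarray, h: float = 1e-6) -> np.ndarray:
    # Central-difference estimate of the m x n Jacobian of psi at x.
    m, n = psi(x).shape[0], x.shape[0]
    J = np.empty((m, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = h
        J[:, j] = (psi(x + dx) - psi(x - dx)) / (2 * h)
    return J

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])
x0 = np.array([0.3, -1.2, 2.0])
print(np.allclose(jacobian(lambda x: A @ x, x0), A))  # True: d(Ax)/dx = A
```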
96. Outline (section marker): Derivative of a Quadratic Transformation.
97. Derivative of $y^T A x$. Theorem: let the scalar $\alpha$ be defined by $\alpha = y^T A x$, where $y$ is $m \times 1$, $x$ is $n \times 1$, $A$ is $m \times n$, and $A$ does not depend on $x$ or $y$. Then $\frac{\partial \alpha}{\partial x} = y^T A$ and $\frac{\partial \alpha}{\partial y} = x^T A^T$.
99. Proof: define $w^T = y^T A$ and note that $\alpha = w^T x$. By the previous theorem, $\frac{\partial \alpha}{\partial x} = w^T = y^T A$. The other statement is proved in a similar way.
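The same finite-difference idea verifies this theorem; a sketch with illustrative values:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])
y = np.array([0.5, -1.0])        # m-vector
x = np.array([1.0, 2.0, 3.0])    # n-vector
alpha = lambda x: y @ A @ x      # the scalar alpha = y^T A x

# Central-difference gradient of alpha with respect to x.
h = 1e-6
grad_x = np.array([(alpha(x + h * e) - alpha(x - h * e)) / (2 * h)
                   for e in np.eye(3)])
print(np.allclose(grad_x, y @ A))  # True: d(y^T A x)/dx = y^T A
```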
102. Outline (section marker): Linear Regression, The Simplest Functions.
103. What is it? First of all, we have a parametric model: a hyperplane, $g(x) = w^T x + w_0$ (1). Note: $w^T x$ is also known as the dot product. In the case of $\mathbb{R}^2$ we have $g(x) = (w_1, w_2)(x_1, x_2)^T + w_0 = w_1 x_1 + w_2 x_2 + w_0$ (2).
105. Example: a hyperplane in $\mathbb{R}^3$ (figure).
106. Outline (section marker): Splitting the Space.
107. Splitting the space $\mathbb{R}^2$ using a simple straight line (hyperplane) separating two classes (figure).
108. Splitting the space: for example, assume the vector $w = (-1, 2)^T$ and the constant $w_0 = 0$ (hyperplane shown in figure).
110. Then we have the following results: $g\left((1, 2)^T\right) = (-1)(1) + (2)(2) = 3$ and $g\left((3, 1)^T\right) = (-1)(3) + (2)(1) = -1$. Yes! We have a positive side and a negative side!
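The same evaluation in a few lines of NumPy:

```python
import numpy as np

w, w0 = np.array([-1.0, 2.0]), 0.0   # the hyperplane from the example

def g(x: np.ndarray) -> float:
    return w @ x + w0

print(g(np.array([1.0, 2.0])))  #  3.0 -> positive side
print(g(np.array([3.0, 1.0])))  # -1.0 -> negative side
```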
111. Outline (section marker): Defining the Decision Surface.
112. The decision surface: the equation $g(x) = 0$ defines a decision surface separating the elements of classes $\omega_1$ and $\omega_2$. When $g(x)$ is linear, the decision surface is a hyperplane. Now assume $x_1$ and $x_2$ are both on the decision surface: $w^T x_1 + w_0 = 0$ and $w^T x_2 + w_0 = 0$, thus $w^T x_1 + w_0 = w^T x_2 + w_0$ (3).
115. Then $w^T (x_1 - x_2) = 0$ (4).
116. Therefore $x_1 - x_2$ lies in the hyperplane, i.e., it is perpendicular to $w$. Remark: since any vector in the hyperplane is a linear combination of elements of a basis of it, any vector in the hyperplane is perpendicular to $w$.
117. Therefore the space is split into two regions by the hyperplane $H$ (example in $\mathbb{R}^3$, figure).
118. Outline (section marker): Properties of the Hyperplane $w^T x + w_0$.
119. Some properties of the hyperplane: $g(x) > 0$ if $x \in R_1$ (figure).
120. We can say more: any $x \in R_1$ is on the positive side of $H$, and any $x \in R_2$ is on the negative side of $H$. In addition, $g(x)$ gives us a way to obtain the distance from $x$ to the hyperplane $H$. First, express any $x$ as $x = x_p + r \frac{w}{\|w\|}$, where $x_p$ is the normal projection of $x$ onto $H$ and $r$ is the desired distance: positive if $x$ is on the positive side, negative if $x$ is on the negative side.
127. We have something like this (figure: the decomposition $x = x_p + r\, w / \|w\|$).
128. Now, since $g(x_p) = 0$, we have $g(x) = g\left(x_p + r \frac{w}{\|w\|}\right) = w^T\left(x_p + r \frac{w}{\|w\|}\right) + w_0 = w^T x_p + w_0 + r \frac{w^T w}{\|w\|} = g(x_p) + r \frac{\|w\|^2}{\|w\|} = r \|w\|$. Then we have $r = \frac{g(x)}{\|w\|}$ (5).
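A short sketch of equation (5), reusing the example hyperplane from above:

```python
import numpy as np

w, w0 = np.array([-1.0, 2.0]), 0.0

def signed_distance(x: np.ndarray) -> float:
    # r = g(x) / ||w||: positive on the positive side of H, negative otherwise.
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([1.0, 2.0])
r = signed_distance(x)
x_p = x - r * w / np.linalg.norm(w)     # normal projection of x onto H
print(r)                                # 3 / sqrt(5), about 1.342
print(np.isclose(w @ x_p + w0, 0.0))    # True: x_p lies on H
```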
134. In particular, the distance from the origin to $H$ is $r = \frac{g(0)}{\|w\|} = \frac{w^T 0 + w_0}{\|w\|} = \frac{w_0}{\|w\|}$ (6). Remarks: if $w_0 > 0$, the origin is on the positive side of $H$; if $w_0 < 0$, the origin is on the negative side of $H$; if $w_0 = 0$, the hyperplane has the homogeneous form $w^T x = 0$ and passes through the origin.
138. Outline (section marker): Augmenting the Vector.
139. We want to remove the separate treatment of $w_0$: we make $w_0$ part of the dot product by setting $x_0 = 1$, so that $g(x) = w_0 \times 1 + \sum_{i=1}^{d} w_i x_i = w_0 x_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i$ (7). We do this by defining $x_{aug} = (1, x_1, \dots, x_d)^T = (1, x^T)^T$, where $x_{aug}$ is called an augmented feature vector.
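Augmentation itself is a one-liner; a sketch for a batch of samples stored as rows:

```python
import numpy as np

def augment(X: np.ndarray) -> np.ndarray:
    # Prepend x_0 = 1 to every sample (one sample per row).
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[1.0, 2.0],
              [3.0, 1.0]])
print(augment(X))  # [[1. 1. 2.]
                   #  [1. 3. 1.]]
```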
144. In a similar way we have the augmented weight vector $w_{aug} = (w_0, w_1, \dots, w_d)^T = (w_0, w^T)^T$. Remarks: the addition of a constant component to $x$ preserves all the distance relationships between samples, and the resulting $x_{aug}$ vectors all lie in a $d$-dimensional subspace, which is the $x$-space itself.
147. More remarks: the hyperplane decision surface $\hat{H}$ defined by $w_{aug}^T x_{aug} = 0$ passes through the origin of the $x_{aug}$-space, even though the corresponding hyperplane $H$ can be in any position of the $x$-space.
149. In addition, the distance from $x_{aug}$ to $\hat{H}$ is $\frac{|w_{aug}^T x_{aug}|}{\|w_{aug}\|} = \frac{|g(x_{aug})|}{\|w_{aug}\|}$.
150. Now, is $\|w\| \le \|w_{aug}\|$? Ideas? Indeed, $\sum_{i=1}^{d} w_i^2 \le \sum_{i=1}^{d} w_i^2 + w_0^2$. This mapping is quite useful because we only need to find a single weight vector $w_{aug}$, instead of finding both the weight vector $w$ and the threshold $w_0$.
152. Outline (section markers): Least Squared Error Procedure; The Geometry of a Two-Category Linearly-Separable Case.
154. Initial supposition: suppose we have $n$ samples $x_1, x_2, \dots, x_n$, some labeled $\omega_1$ and some labeled $\omega_2$. We want a weight vector $w$ such that $w^T x_i > 0$ if $x_i \in \omega_1$ and $w^T x_i < 0$ if $x_i \in \omega_2$. Such a weight vector is called a separating vector or solution vector.
158. Now assume the following: your problem has two classes $\omega_1$ and $\omega_2$ in $\mathbb{R}^2$; (1) they are linearly separable, and (2) you need to label them. We have a problem! What is the problem? We do not know the hyperplane, so what distance does each point have to the hyperplane?
162. A simple solution for our quandary: label the classes $\omega_1 \Longrightarrow +1$ and $\omega_2 \Longrightarrow -1$. We produce the following labels: (1) if $x \in \omega_1$ then $y_{ideal} = g_{ideal}(x) = +1$; (2) if $x \in \omega_2$ then $y_{ideal} = g_{ideal}(x) = -1$. Remark: we have a problem with these labels!
166. Outline (section marker): The Error Idea.
167. Now what? Assume the true function is given by $y_{noise} = g_{noise}(x) = w^T x + w_0 + e$ (8), where the noise term satisfies $e \sim N(\mu, \sigma^2)$. Thus we can write $y_{noise} = g_{noise}(x) = g_{ideal}(x) + e$ (9).
170. Thus we have $e = y_{noise} - g_{ideal}(x)$ (10), shown graphically in a figure.
172. Then we have a trick, and quite a good one! Instead of using $y_{noise}$ in $e = y_{noise} - g_{ideal}(x)$ (11), we use $y_{ideal}$: $e = y_{ideal} - g_{ideal}(x)$ (12). We will see how the geometry solves the problem with these labels.
175. Outline (section marker): The Final Error Equation.
176. Here we have multiple errors: what can we do? (figure)
177. Sum over all the errors: we can use $J(w) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - g_{ideal}(x_i))^2$ (13). Remark: this is known as the Least Squared Error cost function. Generalizing, the dimensionality of each sample (data point) is $d$, and each sample vector can be extended to the augmented form $x^T = (1, x_1, \dots, x_d)$.
180. We can use a trick: write $g_{ideal}(x) = (1, x_1, x_2, \dots, x_d)(w_0, w_1, w_2, \dots, w_d)^T = x^T w$, so the error equation becomes $J(w) = \sum_{i=1}^{N} (y_i - g_{ideal}(x_i))^2 = \sum_{i=1}^{N} (y_i - x_i^T w)^2$ (14).
182. Furthermore, we can stack all the estimates into one product of the data matrix and the weight vector: $Xw = \begin{pmatrix} 1 & (x_1)_1 & \cdots & (x_1)_d \\ \vdots & \vdots & & \vdots \\ 1 & (x_i)_1 & \cdots & (x_i)_d \\ \vdots & \vdots & & \vdots \\ 1 & (x_N)_1 & \cdots & (x_N)_d \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix}$.
183. Note about other representations: we could instead have $x^T = (x_1, x_2, \dots, x_d, 1)$, and thus $X = \begin{pmatrix} (x_1)_1 & \cdots & (x_1)_d & 1 \\ \vdots & & \vdots & \vdots \\ (x_N)_1 & \cdots & (x_N)_d & 1 \end{pmatrix}$ (15).
184. Then we have the following with the data matrix: $Xw = (x_1^T w,\; x_2^T w,\; x_3^T w,\; \dots,\; x_N^T w)^T$ (16).
185. Therefore $y - Xw = (y_1 - x_1^T w,\; y_2 - x_2^T w,\; \dots,\; y_N - x_N^T w)^T$, and we have the equality $(y - Xw)^T (y - Xw) = \sum_{i=1}^{N} (y_i - x_i^T w)^2$.
187. 187. Then, we have The following equality: $\sum_{i=1}^{N} \left( y_i - x_i^T w \right)^2 = (y - Xw)^T (y - Xw) = \| y - Xw \|_2^2$ (17) 92 / 140
188. 188. We can expand our quadratic formula!!! Thus $(y - Xw)^T (y - Xw) = y^T y - y^T X w - w^T X^T y + w^T X^T X w$ (18). Now differentiate with respect to $w$, assuming that $X^T X$ is invertible. 93 / 140
191. 191. Therefore We have the following identities: $\frac{d\, w^T A w}{dw} = w^T \left( A + A^T \right)$ and $\frac{d\, w^T a}{dw} = a^T$ (19). Now, given that the transpose of a scalar is the scalar itself: $y^T X w = \left( y^T X w \right)^T = w^T X^T y$ 94 / 140
193. 193. Then, when we differentiate by w We have $\frac{d}{dw} \left[ y^T y - 2 w^T X^T y + w^T X^T X w \right] = -2 y^T X + w^T \left( X^T X + \left( X^T X \right)^T \right) = -2 y^T X + 2 w^T X^T X$. Making this equal to the zero row vector: $-2 y^T X + 2 w^T X^T X = 0$. We apply the transpose: $\left( -2 y^T X + 2 w^T X^T X \right)^T = 0^T \;\Rightarrow\; -2 X^T y + 2 \left( X^T X \right) w = 0$ (a column vector) 95 / 140
198. 198. Solving for w We have then $w = \left( X^T X \right)^{-1} X^T y$ (20). Note: $X^T X$ is always positive semi-definite; if it is also invertible, it is positive definite. Thus, how do we get the discriminant function? Any ideas? 96 / 140
200. 200. The Final Discriminant Function Very simple!!! $g(x) = x^T w = x^T \left( X^T X \right)^{-1} X^T y$ (21) 97 / 140
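As a quick sanity check, here is a minimal NumPy sketch of the closed-form solution above; the helper name fit_least_squares and the toy data are illustrative assumptions, not part of the slides. Note that solving the normal equations with np.linalg.solve is numerically preferable to forming the explicit inverse.

```python
import numpy as np

def fit_least_squares(X, y):
    # Solve (X^T X) w = X^T y instead of computing (X^T X)^{-1} explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: N = 5 samples in d = 1, augmented with a leading 1 per slide 180.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.05, 0.0, -0.1])  # noisy labels
X = np.column_stack([np.ones_like(x), x])                    # rows x_i^T = (1, x_i)

w = fit_least_squares(X, y)
g = X @ w                    # discriminant values g(x_i) = x_i^T w, Eq. (21)
print(w)                     # approximately (w_0, w_1) = (1.0, 2.0)
```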
201. 201. Outline: 4 Principal Component Analysis, Karhunen-Loeve Transform 98 / 140
202. 202. Also Known as Karhunen-Loeve Transform Setup Consider a data set of observations $\{x_n\}$ with $n = 1, 2, \dots, N$ and $x_n \in \mathbb{R}^d$. Goal Project the data onto a space of dimensionality $m < d$ (we assume $m$ is given). 99 / 140
204. 204. Dimensional Variance Remember the sample variance in $\mathbb{R}$: $VAR(X) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$ (22). You can do the same in the case of two variables X and Y: $COV(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$ (23) 100 / 140
206. 206. Now, Define Given the data $x_1, x_2, \dots, x_N$ (24), where each $x_i$ is a column vector. Construct the sample mean $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ (25). Center the data: $x_1 - \bar{x},\; x_2 - \bar{x},\; \dots,\; x_N - \bar{x}$ (26) 101 / 140
209. 209. Build the Covariance Matrix The Covariance Matrix $S = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$ (27). Properties 1 The ij-th entry of S is the covariance $\sigma_{ij}$ between dimensions i and j. 2 The ii-th entry of S is the variance $\sigma_i^2$ of dimension i. 102 / 140
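A minimal sketch of the sample mean and covariance computation above; the variable names and the random toy data are assumptions for illustration, and the result is cross-checked against NumPy's built-in estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))              # N = 100 samples, d = 3

x_bar = data.mean(axis=0)                     # sample mean, Eq. (25)
centered = data - x_bar                       # centered data, Eq. (26)
S = centered.T @ centered / (len(data) - 1)   # covariance matrix, Eq. (27)

# rowvar=False tells np.cov that rows are samples, columns are variables.
assert np.allclose(S, np.cov(data, rowvar=False))
```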
211. 211. Outline: 4 Principal Component Analysis, Projecting the Data 103 / 140
212. 212. Using S to Project Data For this we use a vector $u_1$ with $u_1^T u_1 = 1$, i.e., a unit vector. Question What is the sample variance of the projected data? 104 / 140
215. 215. Outline: 4 Principal Component Analysis, Lagrange Multipliers 105 / 140
216. 216. Thus we have Variance of the projected data: $\frac{1}{N-1} \sum_{i=1}^{N} \left[ u_1^T x_i - u_1^T \bar{x} \right]^2 = u_1^T S u_1$ (28). Use Lagrange multipliers to maximize $u_1^T S u_1 + \lambda_1 \left( 1 - u_1^T u_1 \right)$ (29) 106 / 140
218. 218. Differentiate by u1 We get $S u_1 = \lambda_1 u_1$ (30). Then $u_1$ is an eigenvector of S. If we left-multiply by $u_1^T$: $u_1^T S u_1 = \lambda_1$ (31) 107 / 140
221. 221. What about the second eigenvector u2 We have the following optimization problem: $\max\, u_2^T S u_2$ s.t. $u_2^T u_2 = 1$ and $u_2^T u_1 = 0$. Lagrangian: $L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_1 \left( u_2^T u_2 - 1 \right) - \lambda_2 \left( u_2^T u_1 - 0 \right)$ 108 / 140
223. 223. Explanation First, the constrained maximization: we want to maximize $u_2^T S u_2$. Given that the second eigenvector is unit-norm, we have $u_2^T u_2 = 1$. Under orthogonal eigenvectors, the covariance of the projections goes to zero: $cov(u_1, u_2) = u_2^T S u_1 = u_2^T \lambda_1 u_1 = \lambda_1 u_1^T u_2 = 0$ 109 / 140
226. 226. Meaning The principal components are perpendicular. From $L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_1 \left( u_2^T u_2 - 1 \right) - \lambda_2 \left( u_2^T u_1 - 0 \right)$, the derivative with respect to $u_2$ gives $\frac{\partial L}{\partial u_2} = 2 S u_2 - 2 \lambda_1 u_2 - \lambda_2 u_1 = 0$. Then, we left-multiply by $u_1^T$: $2 u_1^T S u_2 - 2 \lambda_1 u_1^T u_2 - \lambda_2 u_1^T u_1 = 0$ 110 / 140
229. 229. Then, we have that Something notable: the first two terms vanish by orthogonality, so $0 - 0 - \lambda_2 = 0$, i.e., $\lambda_2 = 0$. We are left with $S u_2 = \lambda u_2$ with $\lambda = u_2^T S u_2$, so $u_2$ is again an eigenvector of S; maximizing this variance over the directions orthogonal to $u_1$ makes $u_2$ the eigenvector of S with the second largest eigenvalue $\lambda_2$. 111 / 140
232. 232. Thus The variance will be maximal when $u_1^T S u_1 = \lambda_1$ (32) is set to the largest eigenvalue; $u_1$ is also known as the First Principal Component. By induction, it is possible in an M-dimensional space to define M eigenvectors $u_1, u_2, \dots, u_M$ of the data covariance S, corresponding to $\lambda_1, \lambda_2, \dots, \lambda_M$, that maximize the variance of the projected data. Computational cost 1 Full eigenvector decomposition $O(d^3)$ 2 Power Method $O(M d^2)$ (Golub and Van Loan, 1996) 3 Use the Expectation Maximization Algorithm 112 / 140
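Since the slides cite the power method as the cheaper alternative, here is a minimal sketch of it for the leading eigenpair of S; the iteration count, tolerance, and random start are illustrative assumptions. It applies to symmetric positive semi-definite matrices such as the covariance above.

```python
import numpy as np

def power_method(S, num_iters=1000, tol=1e-10):
    # Leading eigenvector/eigenvalue of a symmetric PSD matrix S.
    v = np.random.default_rng(0).normal(size=S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v_new = S @ v                  # repeatedly apply S ...
        v_new /= np.linalg.norm(v_new) # ... and renormalize
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    return v, float(v @ S @ v)         # u_1 and lambda_1 = u_1^T S u_1, Eq. (31)

S = np.cov(np.random.default_rng(1).normal(size=(100, 3)), rowvar=False)
u1, lam1 = power_method(S)
```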
235. 235. Outline: 4 Principal Component Analysis, The Process 113 / 140
236. 236. We have the following steps Determine the covariance matrix $S = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$ (33). Generate the decomposition $S = U \Sigma U^T$, with the eigenvalues in $\Sigma$ and the eigenvectors in the columns of U. 114 / 140
239. 239. Then Project the samples $x_i$ into the subspace of dimension k: $z_i = U_k^T x_i$, where $U_k$ is the matrix holding the first k eigenvector columns of U (the data is assumed centered). 115 / 140
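Putting the steps together, a minimal PCA sketch with NumPy; the function name pca and the choice k = 2 are assumptions for illustration. np.linalg.eigh returns eigenvalues in ascending order for symmetric matrices, hence the reordering.

```python
import numpy as np

def pca(data, k):
    x_bar = data.mean(axis=0)
    centered = data - x_bar
    S = centered.T @ centered / (len(data) - 1)   # Eq. (33)
    eigvals, U = np.linalg.eigh(S)                # ascending for symmetric S
    order = np.argsort(eigvals)[::-1]             # sort descending
    U_k = U[:, order[:k]]                         # first k principal directions
    Z = centered @ U_k                            # rows z_i^T = (x_i - x_bar)^T U_k
    return Z, U_k, eigvals[order]

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5))
Z, U_k, lams = pca(data, k=2)                     # project R^5 data onto R^2
```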
240. 240. Outline: 4 Principal Component Analysis, Example 116 / 140
241. 241. Example Figures from Bishop illustrating PCA (images not reproduced in this transcript). 117-119 / 140
244. 244. Outline: 5 Singular Value Decomposition, Introduction 120 / 140
245. 245. What happens with non-square matrices We can still diagonalize them, and thus obtain useful properties, while avoiding the problems of the eigendecomposition $S^{-1} A S$. The eigenvectors in S have three big problems: 1 They are usually not orthogonal. 2 There are not always enough eigenvectors. 3 $Ax = \lambda x$ requires A to be square. 121 / 140
248. 248. Therefore, we can look at the following problem We have a series of vectors $\{x_1, x_2, \dots, x_n\}$. Then imagine the corresponding projections onto a line and the residual differences, $\{a_1, a_2, \dots, a_n\}$ and $\{\alpha_1, \alpha_2, \dots, \alpha_n\}$. We want to know a little about the relations between them; after all, we are looking at the possibility of using them for our problem. 122 / 140
251. 251. Using the Hypotenuse With a little bit of geometry (each $x_j$ is the hypotenuse of the right triangle formed by its projection $a_j$ and its residual $\alpha_j$), Pythagoras gives $\|x_j\|^2 = \|a_j\|^2 + \|\alpha_j\|^2$. 123 / 140
252. 252. Therefore We have two possible quantities for each j: $\alpha_j^T \alpha_j = x_j^T x_j - a_j^T a_j$ and $a_j^T a_j = x_j^T x_j - \alpha_j^T \alpha_j$. Then, given that $x_j^T x_j$ is a constant, minimizing one is maximizing the other: $\min \sum_{j=1}^{n} \alpha_j^T \alpha_j \iff \max \sum_{j=1}^{n} a_j^T a_j$ 124 / 140
254. 254. Actually, this is known as the dual problem (Weak Duality) An example of this: $\min\, w^T x$ s.t. $Ax \le b$, $x \ge 0$. Then, using what are known as slack variables $s \ge 0$, the inequality becomes the equality $Ax + s = b$; each equation of this system expresses an entry $b_i$ as a combination of the columns, so b lives in the column space of the extended matrix, while the feasible x is constrained row by row. 125 / 140
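To make the primal concrete, a minimal sketch using scipy.optimize.linprog; the numbers in c, A, and b are made-up illustrative data. linprog minimizes $c^T x$ subject to $A_{ub}\, x \le b_{ub}$, with $x \ge 0$ as the default bounds, and reports the slack $b - Ax$ directly.

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -2.0])               # minimize c^T x (i.e., maximize x1 + 2*x2)
A = np.array([[1.0, 1.0], [2.0, 0.5]])   # constraints A x <= b
b = np.array([4.0, 3.0])

res = linprog(c, A_ub=A, b_ub=b)         # default bounds give x >= 0
print(res.x)                             # optimum (0, 4)
print(res.slack)                         # slack variables s = b - A x = (0, 1)
```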
257. 257. Then, we have that Example: $\begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$. The columns live in a three-dimensional space, but the column space has dimension 2, which matches the dimension of the row space. 126 / 140
260. 260. We have then Stack such vectors in the d-dimensional space into a matrix A of size $n \times d$, with rows $a_1^T, a_2^T, \dots, a_n^T$. The matrix acts as a projection device: we are looking for a unit vector v such that the total length of the projections is maximized. 127 / 140
262. 262. Why? Do you remember the projection p onto a single vector? Definition of the projection onto the unit vector v: $p = \frac{v^T a_i}{v^T v} v = \left( v^T a_i \right) v$. Therefore the length of the projected vector is $\left\| \left( v^T a_i \right) v \right\| = \left| v^T a_i \right|$ 128 / 140
264. 264. Then Thus, with a little bit of notation: $Av = \left( a_1^T v, a_2^T v, \dots, a_n^T v \right)^T$. Therefore $\|Av\| = \sqrt{\sum_{i=1}^{n} \left( a_i^T v \right)^2}$ 129 / 140
266. 266. Then It is possible to ask for the direction maximizing the length of such a vector (the first singular vector): $v_1 = \arg\max_{\|v\|=1} \|Av\|$. Then, we can define the first singular value $\sigma_1(A) = \|Av_1\|$ 130 / 140
268. 268. This is known as Definition The best-fit line problem is the problem of finding the best line for a set of data points, where the quality of the line is measured by the sum of squared (perpendicular) distances of the points to the line. Remember, we are looking at the dual problem... Generalization This carries over to higher dimensions: one can find the best-fit k-dimensional subspace, i.e., the subspace which minimizes the sum of the squared distances of the points to the subspace. 131 / 140
270. 270. Then, in a Greedy Fashion The second singular vector: $v_2 = \arg\max_{v \perp v_1,\, \|v\|=1} \|Av\|$. Then you keep going through this process, stopping once you have found $v_1, v_2, \dots, v_r$ as singular vectors and $\max_{v \perp v_1, \dots, v_r,\, \|v\|=1} \|Av\| = 0$, i.e., no direction with positive projected length remains. 132 / 140
273. 273. Proving that the strategy is good Theorem Let A be an $n \times d$ matrix whose singular vectors $v_1, v_2, \dots, v_r$ are defined as above. For $1 \le k \le r$, let $V_k$ be the subspace spanned by $v_1, v_2, \dots, v_k$. Then, for each k, $V_k$ is the best-fit k-dimensional subspace for A. 133 / 140
274. 274. Proof For k = 1 the statement holds by definition, since $v_1$ was chosen to maximize $\|Av\|$ and thus gives the best-fit line. What about k = 2? Let W be a best-fit 2-dimensional subspace for A. For any orthonormal basis $w_1, w_2$ of W, $\|Aw_1\|^2 + \|Aw_2\|^2$ is the sum of the squared lengths of the projections of the rows of A onto W. Now, choose the basis $w_1, w_2$ so that $w_2$ is perpendicular to $v_1$: this can be a unit vector perpendicular to the projection of $v_1$ onto W. 134 / 140
277. 277. Do you remember $v_1 = \arg\max_{\|v\|=1} \|Av\|$? Therefore $\|Aw_1\|^2 \le \|Av_1\|^2$ and, since $w_2 \perp v_1$, $\|Aw_2\|^2 \le \|Av_2\|^2$. Then $\|Aw_1\|^2 + \|Aw_2\|^2 \le \|Av_1\|^2 + \|Av_2\|^2$. In a similar way, for k > 2, $V_k$ is at least as good as W and hence is optimal. 135 / 140
280. 280. Remarks Every matrix has a singular value decomposition $A = U \Sigma V^T$, where the columns of U are an orthonormal basis for the column space, the columns of V are an orthonormal basis for the row space, and $\Sigma$ is diagonal with positive real entries $\sigma_i = \Sigma_{ii}$ on its diagonal, called the singular values of A. 136 / 140
283. 283. Properties of the Singular Value Decomposition First The eigenvalues of the symmetric matrix $A^T A$ are equal to the squares of the singular values of A: $A^T A = V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T$. Second The rank of a matrix is equal to the number of non-zero singular values. 137 / 140
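A minimal numerical check of both properties with NumPy; the random test matrix is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A)                  # singular values, descending
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigenvalues of A^T A, descending

# First property: eigenvalues of A^T A are the squared singular values.
assert np.allclose(eigvals, s**2)

# Second property: the rank equals the number of non-zero singular values.
assert np.linalg.matrix_rank(A) == np.sum(s > 1e-12)
```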
285. 285. Outline: 5 Singular Value Decomposition, Image Compression 138 / 140
286. 286. Singular Value Decomposition as Sums The singular value decomposition can be viewed as a sum of rank-1 matrices: $A = A_1 + A_2 + \dots + A_R$ (34). Why? $A = U \Sigma V^T = \left( u_1, u_2, \dots, u_R \right) \operatorname{diag}\left( \sigma_1, \sigma_2, \dots, \sigma_R \right) \left( v_1, v_2, \dots, v_R \right)^T = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots + \sigma_R u_R v_R^T$ 139 / 140
288. 288. Truncating Truncating the singular value decomposition allows us to represent the matrix with fewer parameters. For a 512 × 512 image: Full representation 512 × 512 = 262,144. Rank-10 approximation 512 × 10 + 10 + 10 × 512 = 10,250. Rank-40 approximation 512 × 40 + 40 + 40 × 512 = 41,000. Rank-80 approximation 512 × 80 + 80 + 80 × 512 = 82,000. 140 / 140
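A minimal sketch of rank-k truncation with NumPy, matching the parameter counts above; the image here is a random stand-in, and the name compress_svd is an assumption.

```python
import numpy as np

def compress_svd(img, k):
    # Rank-k approximation of a 2-D array via the truncated SVD, Eq. (34).
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    # Stored parameters: k columns of U (512*k), k singular values (k),
    # and k rows of Vt (k*512), as counted on the slide above.
    return U[:, :k] * s[:k] @ Vt[:k, :]

img = np.random.default_rng(3).normal(size=(512, 512))
approx = compress_svd(img, k=40)     # stores 512*40 + 40 + 40*512 = 41,000 numbers
err = np.linalg.norm(img - approx) / np.linalg.norm(img)
```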