
05 linear transformations


An Introduction to Linear Algebra


  1. 1. Mathematics for Artificial Intelligence Transformation and Applications Andres Mendez-Vazquez April 9, 2020 1 / 140
  2. 2. Outline
       1 Linear Transformation: Introduction; Functions that can be defined using matrices; Linear Functions; Kernel and Range; The Matrix of a Linear Transformation; Going Back to Homogeneous Equations; The Rank-Nullity Theorem
       2 Derivative of Transformations: Introduction; Derivative of a Linear Transformation; Derivative of a Quadratic Transformation
       3 Linear Regression: The Simplest Functions; Splitting the Space; Defining the Decision Surface; Properties of the Hyperplane wT x + w0; Augmenting the Vector; Least Squared Error Procedure; The Geometry of a Two-Category Linearly-Separable Case; The Error Idea; The Final Error Equation
       4 Principal Component Analysis: Karhunen-Loeve Transform; Projecting the Data; Lagrange Multipliers; The Process; Example
       5 Singular Value Decomposition: Introduction; Image Compression
       2 / 140
  4. 4. Going further than solving Ax = y We can go further We can think of the matrix A as a function! In general A function f with domain Rn defines a rule that associates each x ∈ Rn with a vector y ∈ Rm y = f (x) equivalently f : Rn −→ Rm We like the second expression because 1 It is easy to identify the domain Rn 2 It is easy to find the range Rm 4 / 140
  8. 8. Examples f : R → R3 f (t) = (x (t) , y (t) , z (t))T = (t, 3t² + 1, sin (t))T These are called parametric functions Depending on the context, it could represent the position or the velocity of a mass point. 5 / 140
  11. 11. A Classic Example We have if A is an m × n matrix, we can use A to define a function. We will call it fA : Rn → Rm In other words fA (x) = Ax 7 / 140
  14. 14. Example Let A2×3 = [1 2 3; 4 5 6] This allows us to define fA ((x, y, z)T) = [1 2 3; 4 5 6] (x, y, z)T = (x + 2y + 3z, 4x + 5y + 6z)T We have fA maps each vector x ∈ R3 to the vector Ax ∈ R2 8 / 140
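To make the idea of a matrix acting as a function concrete, here is a minimal sketch (assuming NumPy, which the slides do not prescribe) that evaluates fA (x) = Ax for the 2 × 3 matrix of the example:

```python
import numpy as np

# The 2x3 matrix from the example
A = np.array([[1, 2, 3],
              [4, 5, 6]])

def f_A(x):
    """The function f_A : R^3 -> R^2 defined by the matrix A."""
    return A @ x

x = np.array([1.0, 0.0, -1.0])
print(f_A(x))  # [x + 2y + 3z, 4x + 5y + 6z] at (1, 0, -1) -> [-2. -2.]
```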
  18. 18. We have Definition A function f : Rn → Rm is said to be linear if 1 f (x1 + x2) = f (x1) + f (x2) 2 f (cx) = cf (x) for all x1, x2 ∈ Rn and all scalars c. Thus A linear function f is also known as a linear transformation. 10 / 140
  20. 20. We have the following proposition Proposition f : Rn → Rm is linear ⇐⇒ for all vectors x1, x2 ∈ Rn and all scalars c1, c2: f (c1x1 + c2x2) = c1f (x1) + c2f (x2) Proof Any idea? 11 / 140
  22. 22. Proof If Am×n is a matrix, fA is a linear transformation How? First fA (x1 + x2) = A (x1 + x2) = Ax1 + Ax2 = fA (x1) + fA (x2) Second What about fA (cx1)? Indeed, fA (cx1) = A (cx1) = cAx1 = cfA (x1) 12 / 140
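As a numerical sanity check of the linearity condition f (c1x1 + c2x2) = c1f (x1) + c2f (x2), the following sketch (again assuming NumPy; the matrix, vectors, and scalars are random illustrations) verifies it for fA:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))          # an arbitrary m x n matrix
f_A = lambda x: A @ x                    # the induced map f_A(x) = Ax

x1, x2 = rng.standard_normal(3), rng.standard_normal(3)
c1, c2 = 2.5, -0.7

lhs = f_A(c1 * x1 + c2 * x2)
rhs = c1 * f_A(x1) + c2 * f_A(x2)
print(np.allclose(lhs, rhs))             # True: f_A satisfies the linearity condition
```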
  26. 26. We have Definition (Actually related to the null space) If f : Rn → Rm is linear, the kernel of f is defined by Ker (f) = {v ∈ Rn | f (v) = 0} Definition If f : Rn → Rm is linear, the range of f is defined by Range (f) = {y ∈ Rm | y = f (x) for some x ∈ Rn } 14 / 140
  28. 28. We have also the following Spaces Row Space We have that the span of the row vectors of A forms a subspace. Column Space We have that the span of the column vectors of A also forms a subspace. 15 / 140
  30. 30. From This It can be shown that Ker (f) is a subspace of Rn Also Range (f) is a subspace of Rm 16 / 140
  33. 33. Assume the following Let e1 = (1, 0, ..., 0)T, e2 = (0, 1, ..., 0)T, ..., en = (0, 0, ..., 1)T (each n × 1) Then any vector x ∈ Rn x = (x1, x2, ..., xn)T = x1e1 + x2e2 + ... + xnen 18 / 140
  35. 35. Then Applying f f (x) = x1f (e1) + x2f (e2) + ... + xnf (en) A linear combination of elements {f (e1) , f (e2) , ..., f (en)} They are column vectors in Rm A = (f (e1) |f (e2) |...|f (en))m×n 19 / 140
  38. 38. Thus, we have Finally, we have f (x) = (f (e1) |f (e2) |...|f (en)) x = Ax Definition The matrix A defined above for the function f is called the matrix of f in the standard basis. 20 / 140
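The construction A = (f (e1) |f (e2) |...|f (en)) can be carried out mechanically: apply f to each standard basis vector and stack the results as columns. A short sketch (NumPy assumed; the rotation map is only an illustrative choice, not from the slides):

```python
import numpy as np

def matrix_in_standard_basis(f, n):
    """Build A = (f(e_1) | ... | f(e_n)) by applying f to the standard basis of R^n."""
    return np.column_stack([f(e) for e in np.eye(n)])

# Illustrative linear map: rotation of R^2 by 90 degrees
rotate = lambda x: np.array([-x[1], x[0]])

A = matrix_in_standard_basis(rotate, 2)
x = np.array([3.0, 4.0])
print(A)                              # [[ 0. -1.]
                                      #  [ 1.  0.]]
print(np.allclose(A @ x, rotate(x)))  # True: f(x) = Ax for the matrix of f
```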
  41. 41. Given an m × n matrix A The set of all solutions to the homogeneous equation Ax = 0 is a subspace V of Rn. Remember how to prove the subspace properties... x1 + x2 ∈ V and cx ∈ V Do you remember? 22 / 140
  43. 43. Then, we have Definition This important subspace is called the null space of A, and is denoted Null(A) It is also known as xH = {x|Ax = 0} 23 / 140
  45. 45. Knowing that Range(f) and Ker(f) are sub-spaces Which subspaces are they? Any Idea? Range(f) The column space of the matrix A. Ker(f) It is the null space of A. 24 / 140
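The identification Range(f) = column space of A and Ker(f) = null space of A can be explored numerically. A small sketch, assuming NumPy and SciPy (scipy.linalg.null_space returns an orthonormal basis of the null space):

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])

dim_range = np.linalg.matrix_rank(A)   # dimension of the column space = dim Range(f_A)
K = null_space(A)                      # columns form a basis of Ker(f_A)

print(dim_range, K.shape[1])           # 2 1
print(np.allclose(A @ K, 0.0))         # True: every kernel basis vector maps to 0
```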
  49. 49. We have a nice theorem Dimension Theorem Let f : Rn → Rm be linear. Then dim (domain (f)) = dim (Range (f)) + dim (Ker (f)) Where The dimension of V , written dim(V ), is the number of elements in any basis of V . 26 / 140
  51. 51. Rank and Nullity of a Matrix Definition The rank of the matrix A is the dimension of the row space of A, and is denoted R(A). Example The rank of In×n is n. 27 / 140
  53. 53. Then Theorem The rank of a matrix in Gauss-Jordan form is equal to the number of leading variables. Proof In the Gauss-Jordan form of a matrix, every non-zero row has a leading 1, which is the only non-zero entry in its column. Then No elementary row operation can zero out a leading 1, so these non-zero rows are linearly independent. 28 / 140
  56. 56. Thus We have Since all the other rows are zero, the dimension of the row space of the Gauss-Jordan form is equal to the number of leading 1’s. Finally This is the same as the number of leading variables. Q.E.D. 29 / 140
  58. 58. About the Nullity of the Matrix Definition The nullity of the matrix A is the dimension of the null space of A, and is denoted by dim [N(A)]. Example The nullity of I is 0. 30 / 140
  61. 61. Number of Free Variables Theorem The nullity of a matrix in Gauss-Jordan form is equal to the number of free variables. Proof Suppose A is m × n, and that the Gauss-Jordan form has j leading variables and k free variables: j + k = n 31 / 140
  62. 62. Proof Then, when computing the solution to the homogeneous equation We solve for the first j (leading) variables in terms of the remaining k free variables: s1, s2, s3, ..., sk Then The general solutions to the homogeneous equation are: s1v1 + s2v2 + s3v3 + · · · + skvk 32 / 140
  65. 65. Where The vectors are the Canonical Ones Here, a trick: in v1 the entry 1 appears at position j + 1 (after the leading-variable positions), with zeros at the other free-variable positions, and so on for the other vi. Therefore the vectors v1, v2, v3, · · · , vk are linearly independent 33 / 140
  68. 68. Therefore They are a basis for the null space of A And there are k of them, the same as the number of free variables. 34 / 140
  69. 69. Now Definition The matrix B is said to be row equivalent to A (B ∼ A) if B can be obtained from A by a finite sequence of elementary row operations. In matrix terms B ∼ A ⇔ There exist elementary matrices such that B = EkEk−1 · · · E1A If we write C = EkEk−1 · · · E1 B is row equivalent to A if B = CA with C invertible. 35 / 140
  72. 72. Then, we have Theorem If B ∼ A, then Null(B) = Null(A). Theorem If B ∼ A, then the row space of B is identical to that of A. Summarizing Row operations change neither the row space nor the null space of A. 36 / 140
  75. 75. Corollaries Corollary 1 If R is the Gauss-Jordan form of A, then R has the same null space and row space as A. Corollary 2 If B ∼ A, then R(B) = R(A), and N(B) = N(A). 37 / 140
  77. 77. Then Theorem The number of linearly independent rows of the matrix A is equal to the number of linearly independent columns of A. Thus In particular, the rank of A is also equal to the number of linearly independent columns, and hence to the dimension of the column space of A Therefore The number of linearly independent columns of A is then just the number of leading entries in the Gauss-Jordan form of A which is, as we know, the same as the rank of A. 38 / 140
  80. 80. Proof of the theorem (Dimension Theorem) First The rank of A is the same as the rank of the Gauss-Jordan form of A which is equal to the number of leading entries in the Gauss-Jordan form. Additionally The dimension of the null space is equal to the number of free variables in the reduced echelon (Gauss-Jordan) form of A. Then We know further that the number of free variables plus the number of leading entries is exactly the number of columns. 39 / 140
  83. 83. Finally We have dim (domain (f)) = dim (Range (f)) + dim (Ker (f)) 40 / 140
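A quick numerical illustration of the identity dim (domain (f)) = dim (Range (f)) + dim (Ker (f)) for fA (x) = Ax (NumPy and SciPy assumed; the rank-deficient matrix is a random illustration):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
# A 5 x 7 matrix of rank 3, built as a product of 5x3 and 3x7 factors
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))

n = A.shape[1]                          # dim of the domain R^n
dim_range = np.linalg.matrix_rank(A)    # rank = dim of the column space
dim_ker = null_space(A).shape[1]        # nullity = dim of the null space

print(dim_range, dim_ker, n)            # 3 4 7
print(dim_range + dim_ker == n)         # True: rank + nullity = number of columns
```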
  85. 85. As we know Many Times We want to obtain a maximum or a minimum of a cost function expressed in terms of matrices... We then need to define matrix derivatives Thus, this discussion is useful in Machine Learning. 42 / 140
  87. 87. Basic Definition Let ψ (x) = y Where y is an m-element vector, and x is an n-element vector Then, we define the derivative with respect to a vector as the m × n matrix ∂y/∂x whose (i, j) entry is ∂yi/∂xj (rows run over the components of y, columns over the components of x) 43 / 140
  89. 89. What is this The Matrix denotes the m × n matrix of first order partial derivatives Such a matrix is called the Jacobian matrix of the transformation ψ (x). Then, we can get our first ideas on derivatives For Linear Transformations. 44 / 140
  92. 92. Derivative of y = Ax Theorem Let y = Ax where y is m × 1, x is n × 1, A is m × n and A does not depend on x, then ∂y/∂x = A 46 / 140
  93. 93. Proof The ith element of y is given by yi = Σ_{k=1}^{n} aik xk We have that ∂yi/∂xj = aij for all i = 1, 2, ..., m and j = 1, 2, ..., n Hence ∂y/∂x = A 47 / 140
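The theorem ∂y/∂x = A can be checked with finite differences: the jth column of the numerical Jacobian is approximately (ψ(x + h ej) − ψ(x))/h. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
psi = lambda x: A @ x                    # the linear transformation y = Ax

def numerical_jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian: column j is (f(x + h e_j) - f(x)) / h."""
    cols = [(f(x + h * e) - f(x)) / h for e in np.eye(len(x))]
    return np.column_stack(cols)

x0 = rng.standard_normal(3)
J = numerical_jacobian(psi, x0)
print(np.allclose(J, A, atol=1e-4))      # True: the Jacobian of Ax is A itself
```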
  97. 97. Derivative of yT Ax Theorem Let the scalar α be defined by α = yT Ax where y is m × 1, x is n × 1, A is m × n and A does not depend on x or y, then ∂α/∂x = yT A and ∂α/∂y = xT AT 49 / 140
  99. 99. Proof Define wT = yT A Note α = wT x By the previous theorem ∂α/∂x = wT = yT A In a similar way, you can prove the other statement. 50 / 140
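The two gradients ∂α/∂x = yT A and ∂α/∂y = xT AT of α = yT Ax can be verified with the same finite-difference idea (NumPy assumed; sizes and data are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 3
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
x = rng.standard_normal(n)

alpha = lambda y_, x_: y_ @ A @ x_       # the scalar alpha = y^T A x

h = 1e-6
grad_x = np.array([(alpha(y, x + h * e) - alpha(y, x)) / h for e in np.eye(n)])
grad_y = np.array([(alpha(y + h * e, x) - alpha(y, x)) / h for e in np.eye(m)])

print(np.allclose(grad_x, y @ A, atol=1e-4))   # True: d(alpha)/dx = y^T A
print(np.allclose(grad_y, A @ x, atol=1e-4))   # True: d(alpha)/dy = x^T A^T (as a vector)
```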
  103. 103. What is it? First of all, we have a parametric model!!! Here, we have a hyperplane as a model: g(x) = wT x + w0 (1) Note: wT x is also known as the dot product In the case of R2 We have: g (x) = (w1, w2) (x1, x2)T + w0 = w1x1 + w2x2 + w0 (2) 52 / 140
  105. 105. Example Hyperplane in R3 53 / 140
  107. 107. Splitting The Space R2 Using a simple straight line (Hyperplane) (figure: two classes separated by the line) 55 / 140
  108. 108. Splitting the Space? For example, assume the following vector w and constant w0 w = (−1, 2)T and w0 = 0 Hyperplane 56 / 140
  110. 110. Then, we have The following results g ((1, 2)T) = (−1, 2) (1, 2)T = −1 × 1 + 2 × 2 = 3 g ((3, 1)T) = (−1, 2) (3, 1)T = −1 × 3 + 2 × 1 = −1 YES!!! We have a positive side and a negative side!!! 57 / 140
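The same two evaluations can be reproduced in a few lines (NumPy assumed), which makes the sign-based split of the plane explicit:

```python
import numpy as np

w, w0 = np.array([-1.0, 2.0]), 0.0
g = lambda x: w @ x + w0                 # g(x) = w^T x + w0

print(g(np.array([1.0, 2.0])))           #  3.0 -> positive side of the hyperplane
print(g(np.array([3.0, 1.0])))           # -1.0 -> negative side of the hyperplane
```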
  112. 112. The Decision Surface The equation g (x) = 0 defines a decision surface Separating the elements in classes, ω1 and ω2. When g (x) is linear the decision surface is a hyperplane Now assume x1 and x2 are both on the decision surface wT x1 + w0 = 0 wT x2 + w0 = 0 Thus wT x1 + w0 = wT x2 + w0 (3) 59 / 140
  115. 115. Defining a Decision Surface Then wT (x1 − x2) = 0 (4) 60 / 140
  116. 116. Therefore x1 − x2 lies in the hyperplane i.e. it is perpendicular to w Remark: any vector in the hyperplane is a linear combination of elements in a basis of the hyperplane Therefore any vector in the hyperplane is perpendicular to w 61 / 140
  117. 117. Therefore The space is split in two regions (Example in R3 ) by the hyperplane H 62 / 140
  119. 119. Some Properties of the Hyperplane Given that g (x) > 0 if x ∈ R1 64 / 140
  120. 120. It is more We can say the following Any x ∈ R1 is on the positive side of H. Any x ∈ R2 is on the negative side of H. In addition, g (x) gives us a way to obtain the distance from x to the hyperplane H First, we express any x as follows x = xp + r (w/‖w‖) Where xp is the normal projection of x onto H. r is the desired distance Positive, if x is on the positive side Negative, if x is on the negative side 65 / 140
  127. 127. We have something like this (figure: a point x, its normal projection xp onto H, and the signed distance r) 66 / 140
  128. 128. Now Since g (xp) = 0 We have that g (x) = g (xp + r (w/‖w‖)) = wT (xp + r (w/‖w‖)) + w0 = wT xp + w0 + r (wT w/‖w‖) = g (xp) + r (‖w‖²/‖w‖) = r ‖w‖ Then, we have r = g (x) /‖w‖ (5) 67 / 140
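Equation (5) gives the signed distance r = g (x) /‖w‖ from a point to the hyperplane. A minimal sketch (NumPy assumed; w and w0 are illustrative values, not from the slides):

```python
import numpy as np

w, w0 = np.array([-1.0, 2.0]), 1.0       # illustrative hyperplane g(x) = w^T x + w0
g = lambda x: w @ x + w0

def signed_distance(x):
    """r = g(x) / ||w||: positive on the positive side of H, negative on the other."""
    return g(x) / np.linalg.norm(w)

x = np.array([2.0, 3.0])
r = signed_distance(x)
x_p = x - r * w / np.linalg.norm(w)      # normal projection of x onto H
print(r)                                 # ~2.236
print(np.isclose(g(x_p), 0.0))           # True: x_p lies on the hyperplane
```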
  134. 134. In particular The distance from the origin to H r = g (0) /‖w‖ = (wT 0 + w0) /‖w‖ = w0/‖w‖ (6) Remarks If w0 > 0, the origin is on the positive side of H. If w0 < 0, the origin is on the negative side of H. If w0 = 0, the hyperplane has the homogeneous form wT x and the hyperplane passes through the origin. 68 / 140
  139. 139. We want to solve the independence of w0 We would like w0 as part of the dot product by making x0 = 1 g (x) = w0 × 1 + Σ_{i=1}^{d} wixi = w0 × x0 + Σ_{i=1}^{d} wixi = Σ_{i=0}^{d} wixi (7) By making xaug = (1, x1, ..., xd)T = (1, xT)T Where xaug is called an augmented feature vector. 70 / 140
  144. 144. In a similar way We have the augmented weight vector waug = (w0, w1, ..., wd)T = (w0, wT)T Remarks The addition of a constant component to x preserves all the distance relationships between samples. The resulting xaug vectors all lie in a d-dimensional subspace which is the x-space itself. 71 / 140
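Augmenting the vectors is just a concatenation; the sketch below (NumPy assumed, arbitrary illustrative numbers) confirms that wT x + w0 equals the single dot product wT aug xaug:

```python
import numpy as np

w, w0 = np.array([-1.0, 2.0]), 0.5
x = np.array([3.0, 1.0])

x_aug = np.concatenate(([1.0], x))       # (1, x1, ..., xd)
w_aug = np.concatenate(([w0], w))        # (w0, w1, ..., wd)

print(w @ x + w0)                        # -0.5
print(w_aug @ x_aug)                     # -0.5: same value, one dot product
print(np.linalg.norm(w) <= np.linalg.norm(w_aug))   # True, as argued below
```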
  147. 147. More Remarks In addition The hyperplane decision surface H defined by wT aug xaug = 0 passes through the origin in xaug-space. Even Though The corresponding hyperplane H can be in any position of the x-space. 72 / 140
  149. 149. More Remarks In addition The distance from xaug to H is: |wT aug xaug| /‖waug‖ = |g (xaug)| /‖waug‖ 73 / 140
  150. 150. Now Is ‖w‖ ≤ ‖waug‖? Ideas? Σ_{i=1}^{d} wi² ≤ Σ_{i=1}^{d} wi² + w0² This mapping is quite useful Because we only need to find a weight vector waug instead of finding the weight vector w and the threshold w0. 74 / 140
  154. 154. Initial Supposition Suppose, we have n samples x1, x2, ..., xn some labeled ω1 and some labeled ω2. We want a weight vector w such that wT xi > 0, if xi ∈ ω1. wT xi < 0, if xi ∈ ω2. The name of this weight vector It is called a separating vector or solution vector. 77 / 140
  158. 158. Now, assume the following Imagine that your problem has two classes ω1 and ω2 in R2 1 They are linearly separable!!! 2 You need to label them. We have a problem!!! What is the problem? We do not know the hyperplane!!! Thus, what distance does each point have to the hyperplane? 78 / 140
  162. 162. A Simple Solution For Our Quandary Label the Classes ω1 =⇒ +1 ω2 =⇒ −1 We produce the following labels 1 if x ∈ ω1 then yideal = gideal (x) = +1. 2 if x ∈ ω2 then yideal = gideal (x) = −1. Remark: We have a problem with these labels!!! 79 / 140
  167. 167. Now, What? Assume the true function is given by ynoise = gnoise (x) = wT x + w0 + e (8) Where e ∼ N (µ, σ²) Thus, we can do the following ynoise = gnoise (x) = gideal (x) + e (9) 81 / 140
  170. 170. Thus, we have What to do? e = ynoise − gideal (x) (10) Graphically 82 / 140
  172. 172. Then, we have A TRICK... Quite a good one!!! Instead of using ynoise e = ynoise − gideal (x) (11) We use yideal e = yideal − gideal (x) (12) We will see How the geometry will solve the problem of using these labels. 83 / 140
  176. 176. Here, we have multiple errors What can we do? 85 / 140
  177. 177. Sum Over All the Errors We can do the following J (w) = Σ_{i=1}^{N} ei² = Σ_{i=1}^{N} (yi − gideal (xi))² (13) Remark: This is known as the Least Squared Error cost function Generalizing The dimensionality of each sample (data point) is d. You can extend each sample vector to be xT = (1, x1, ..., xd). 86 / 140
  180. 180. We can use a trick The following function gideal (x) = (1, x1, x2, ..., xd) (w0, w1, w2, ..., wd)T = xT w We can rewrite the error equation as J (w) = Σ_{i=1}^{N} (yi − gideal (xi))² = Σ_{i=1}^{N} (yi − xiT w)² (14) 87 / 140
  182. 182. Furthermore Then stacking all the possible estimations into the product of the Data Matrix and the weight vector Xw, where X is the N × (d + 1) matrix whose ith row is (1, (xi)1, ..., (xi)j, ..., (xi)d) and w = (w0, w1, ..., wd)T 88 / 140
  183. 183. Note about other representations We could have xT = (x1, x2, ..., xd, 1) thus X is the N × (d + 1) matrix whose ith row is ((xi)1, ..., (xi)j, ..., (xi)d, 1) (15) 89 / 140
  184. 184. Then, we have the following trick with X With the data matrix: $Xw = \begin{pmatrix} x_1^T w \\ x_2^T w \\ x_3^T w \\ \vdots \\ x_N^T w \end{pmatrix}$ (16) 90 / 140
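A minimal NumPy sketch, not from the original slides, of building the augmented data matrix and the stacked product Xw of Eq. (16); the small data set and the weight values are illustrative assumptions.

```python
import numpy as np

# Hypothetical small data set: N = 4 samples in d = 2 dimensions
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
N, d = X_raw.shape

# Augment each sample with a leading 1 so that x^T w absorbs the bias w0
X = np.hstack([np.ones((N, 1)), X_raw])        # shape (N, d + 1)

w = np.array([0.5, 1.0, -2.0])                 # (w0, w1, w2)

# Xw stacks every x_i^T w into one column vector, as in Eq. (16)
print(X @ w)
```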
  185. 185. Therefore We have that $\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{pmatrix} - \begin{pmatrix} x_1^T w \\ x_2^T w \\ x_3^T w \\ \vdots \\ x_N^T w \end{pmatrix} = \begin{pmatrix} y_1 - x_1^T w \\ y_2 - x_2^T w \\ y_3 - x_3^T w \\ \vdots \\ y_N - x_N^T w \end{pmatrix}$ Then, we have the following equality: $\begin{pmatrix} y_1 - x_1^T w & y_2 - x_2^T w & \cdots & y_N - x_N^T w \end{pmatrix} \begin{pmatrix} y_1 - x_1^T w \\ y_2 - x_2^T w \\ \vdots \\ y_N - x_N^T w \end{pmatrix} = \sum_{i=1}^{N} \left( y_i - x_i^T w \right)^2$ 91 / 140
  187. 187. Then, we have The following equality: $\sum_{i=1}^{N} \left( y_i - x_i^T w \right)^2 = (y - Xw)^T (y - Xw) = \|y - Xw\|_2^2$ (17) 92 / 140
  188. 188. We can expand our quadratic formula!!! Thus $(y - Xw)^T (y - Xw) = y^T y - y^T Xw - w^T X^T y + w^T X^T X w$ (18) Now Differentiate with respect to w, and assume that $X^T X$ is invertible. 93 / 140
  191. 191. Therefore We have the following equivalences: $\frac{d\, w^T A w}{dw} = w^T \left( A + A^T \right), \quad \frac{d\, w^T A}{dw} = A^T$ (19) Now, given that the transpose of a scalar is the scalar itself: $y^T X w = \left( y^T X w \right)^T = w^T X^T y$ 94 / 140
  193. 193. Then, when we differentiate with respect to w We have $\frac{d}{dw} \left( y^T y - 2 w^T X^T y + w^T X^T X w \right) = -2 y^T X + w^T \left( X^T X + \left( X^T X \right)^T \right) = -2 y^T X + 2 w^T X^T X$ Making this equal to the zero row vector: $-2 y^T X + 2 w^T X^T X = 0$ We apply the transpose: $\left( -2 y^T X + 2 w^T X^T X \right)^T = 0^T \;\Rightarrow\; -2 X^T y + 2 \left( X^T X \right) w = 0$ (a column vector) 95 / 140
  198. 198. Solving for w We have then $w = \left( X^T X \right)^{-1} X^T y$ (20) Note: $X^T X$ is always positive semi-definite; if it is also invertible, it is positive definite. Thus, how do we get the discriminant function? Any ideas? 96 / 140
  200. 200. The Final Discriminant Function Very simple!!! $g(x) = x^T w = x^T \left( X^T X \right)^{-1} X^T y$ (21) 97 / 140
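A minimal NumPy sketch of the normal-equation solution of Eq. (20) and the discriminant of Eq. (21), on a hypothetical synthetic data set; in practice np.linalg.lstsq or a pseudo-inverse is numerically safer than forming $(X^T X)^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N = 50 samples, d = 2 features, noisy linear labels
N, d = 50, 2
X_raw = rng.normal(size=(N, d))
true_w = np.array([1.0, 2.0, -1.0])                  # (w0, w1, w2)
X = np.hstack([np.ones((N, 1)), X_raw])              # augmented data matrix
y = X @ true_w + 0.1 * rng.normal(size=N)            # noisy labels

# Normal equations: w = (X^T X)^{-1} X^T y, as in Eq. (20)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Discriminant function g(x) = x^T w, as in Eq. (21)
def g(x, w=w_hat):
    return np.concatenate([[1.0], x]) @ w

print(w_hat)                      # should be close to true_w
print(g(np.array([0.5, -0.3])))
```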
  201. 201. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 98 / 140
  202. 202. Also Known as the Karhunen-Loeve Transform Setup Consider a data set of observations $\{x_n\}$ with $n = 1, 2, \ldots, N$ and $x_n \in \mathbb{R}^d$. Goal Project the data onto a space with dimensionality $m < d$ (we assume m is given). 99 / 140
  204. 204. Dimensional Variance Remember the sample variance in $\mathbb{R}$: $VAR(X) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$ (22) You can do the same in the case of two variables X and Y: $COV(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$ (23) 100 / 140
  206. 206. Now, Define Given the data $x_1, x_2, \ldots, x_N$ (24), where each $x_i$ is a column vector, construct the sample mean $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ (25) and center the data: $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_N - \bar{x}$ (26) 101 / 140
  209. 209. Build the Sample Covariance The covariance matrix: $S = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$ (27) Properties 1 The ijth entry of S is the covariance $\sigma_{ij}$ between dimensions i and j. 2 The iith entry of S is the variance $\sigma_i^2$ of dimension i. 102 / 140
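A short NumPy sketch, on hypothetical data, of the sample covariance of Eq. (27), checked against np.cov.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))            # N = 100 samples, d = 3 dimensions

x_bar = X.mean(axis=0)                   # sample mean, Eq. (25)
Xc = X - x_bar                           # centered data, Eq. (26)

# Sample covariance, Eq. (27): sum of outer products divided by (N - 1)
S = Xc.T @ Xc / (X.shape[0] - 1)

# np.cov expects variables in rows by default, hence rowvar=False
assert np.allclose(S, np.cov(X, rowvar=False))
print(S)
```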
  211. 211. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 103 / 140
  212. 212. Using S to Project Data For this we use a vector $u_1$ with $u_1^T u_1 = 1$, i.e., a unit vector. Question What is the sample variance of the projected data? 104 / 140
  215. 215. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 105 / 140
  216. 216. Thus we have The variance of the projected data: $\frac{1}{N-1} \sum_{i=1}^{N} \left( u_1^T x_i - u_1^T \bar{x} \right)^2 = u_1^T S u_1$ (28) Use Lagrange multipliers to maximize $u_1^T S u_1 + \lambda_1 \left( 1 - u_1^T u_1 \right)$ (29) 106 / 140
  218. 218. Differentiating with respect to u1 We get $S u_1 = \lambda_1 u_1$ (30) Then $u_1$ is an eigenvector of S. If we left-multiply by $u_1^T$: $u_1^T S u_1 = \lambda_1$ (31) 107 / 140
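A quick NumPy check, again on hypothetical data, that the top eigenvector of S attains the largest projected variance $u^T S u$ among unit vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic data
S = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue

# Projected variance along u1 equals the top eigenvalue, Eq. (31)
print(u1 @ S @ u1, eigvals[-1])

# Random unit vectors never beat it (up to numerical noise)
for _ in range(5):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert v @ S @ v <= eigvals[-1] + 1e-12
```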
  221. 221. What about the second eigenvector u2? We have the following optimization problem: $\max u_2^T S u_2$ s.t. $u_2^T u_2 = 1$, $u_2^T u_1 = 0$ Lagrangian: $L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_1 \left( u_2^T u_2 - 1 \right) - \lambda_2 \left( u_2^T u_1 - 0 \right)$ 108 / 140
  223. 223. Explanation First, the constrained maximization: we want to maximize $u_2^T S u_2$. Given that the second eigenvector is a unit vector, we have $u_2^T u_2 = 1$. For orthogonal directions, the covariance of the projections is zero: $\mathrm{cov}(u_1, u_2) = u_2^T S u_1 = u_2^T \lambda_1 u_1 = \lambda_1 u_1^T u_2 = 0$ 109 / 140
  226. 226. Meaning The principal components are perpendicular: $L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_1 \left( u_2^T u_2 - 1 \right) - \lambda_2 \left( u_2^T u_1 - 0 \right)$ Taking the derivative with respect to $u_2$: $\frac{\partial L(u_2, \lambda_1, \lambda_2)}{\partial u_2} = S u_2 - \lambda_1 u_2 - \lambda_2 u_1 = 0$ Then, we left-multiply by $u_1^T$: $u_1^T S u_2 - \lambda_1 u_1^T u_2 - \lambda_2 u_1^T u_1 = 0$ 110 / 140
  229. 229. Then, we have that Something notable: $0 - 0 - \lambda_2 = 0$, so $\lambda_2 = 0$. We are left with $S u_2 - \lambda_1 u_2 = 0$, so $u_2$ is an eigenvector of S; to maximize $u_2^T S u_2$ while staying orthogonal to $u_1$, it must be the eigenvector with the second largest eigenvalue. 111 / 140
  232. 232. Thus The variance is maximized when $u_1^T S u_1 = \lambda_1$ (32) is set to the largest eigenvalue; $u_1$ is also known as the first principal component. By induction, for an M-dimensional projection space it is possible to define M eigenvectors $u_1, u_2, \ldots, u_M$ of the data covariance S, corresponding to the M largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$, that maximize the variance of the projected data. Computational cost: 1 Full eigenvector decomposition $O(d^3)$ 2 Power method $O(M d^2)$ (Golub and Van Loan, 1996) 3 Use the Expectation-Maximization algorithm 112 / 140
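A minimal sketch of the power method mentioned in the cost list, assuming a symmetric matrix S; it simply iterates $u \leftarrow Su / \|Su\|$ to approximate the leading eigenvector.

```python
import numpy as np

def power_method(S, n_iter=200, seed=0):
    """Approximate the leading eigenvector/eigenvalue of a symmetric matrix S."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=S.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = S @ u
        u = v / np.linalg.norm(v)
    return u, u @ S @ u          # eigenvector estimate, Rayleigh quotient

# Hypothetical covariance matrix for a quick check
S = np.cov(np.random.default_rng(3).normal(size=(500, 4)), rowvar=False)
u1, lam1 = power_method(S)
print(lam1, np.linalg.eigvalsh(S)[-1])   # the two values should roughly agree
```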
  235. 235. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 113 / 140
  236. 236. We have the following steps Determine the covariance matrix: $S = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$ (33) Generate the decomposition $S = U \Sigma U^T$, with the eigenvalues in $\Sigma$ and the eigenvectors in the columns of U. 114 / 140
  239. 239. Then Project the samples $x_i$ into the subspace of dimension k: $z_i = U_k^T x_i$, where $U_k$ is the matrix formed by the first k columns of U. 115 / 140
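A minimal NumPy sketch of these steps on hypothetical data; the data are centered before projection, consistent with Eq. (26).

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                   # center the data
    S = Xc.T @ Xc / (X.shape[0] - 1)                 # covariance, Eq. (33)
    eigvals, U = np.linalg.eigh(S)                   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # sort descending
    U_k = U[:, order[:k]]                            # first k eigenvectors
    Z = Xc @ U_k                                     # z_i = U_k^T (x_i - x_bar)
    return Z, U_k, x_bar

X = np.random.default_rng(4).normal(size=(100, 5))
Z, U_k, x_bar = pca_project(X, k=2)
print(Z.shape)          # (100, 2)
```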
  240. 240. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 116 / 140
  241. 241. Example Figures from Bishop (PCA illustration). 117-119 / 140
  244. 244. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 120 / 140
  245. 245. What happens with non-square matrices? We can still diagonalize them and thus obtain certain properties. We want to avoid the problems with $S^{-1} A S$: the eigenvectors in S have three big problems 1 They are usually not orthogonal. 2 There are not always enough eigenvectors. 3 $Ax = \lambda x$ requires A to be square. 121 / 140
  248. 248. Therefore, we can look at the following problem We have a series of vectors $\{x_1, x_2, \ldots, x_d\}$. Then imagine a set of projection vectors and differences, $\{\beta_1, \beta_2, \ldots, \beta_d\}$ and $\{\alpha_1, \alpha_2, \ldots, \alpha_d\}$. We want to know a little about the relations between them; after all, we are looking at the possibility of using them for our problem. 122 / 140
  251. 251. Using the Hypotenuse A little bit of geometry (see the figure) gives, for each $x_j$, a projection $a_j$ and a residual $\alpha_j$ with $x_j^T x_j = a_j^T a_j + \alpha_j^T \alpha_j$. 123 / 140
  252. 252. Therefore We have two possible quantities for each j: $\alpha_j^T \alpha_j = x_j^T x_j - a_j^T a_j$ and $a_j^T a_j = x_j^T x_j - \alpha_j^T \alpha_j$ Then, given that $x_j^T x_j$ is a constant, minimizing one is equivalent to maximizing the other: $\min \sum_{j=1}^{n} \alpha_j^T \alpha_j \;\Longleftrightarrow\; \max \sum_{j=1}^{n} a_j^T a_j$ 124 / 140
  254. 254. Actually, this is known as the dual problem (weak duality) An example of this: $\min w^T x$ s.t. $Ax \leq b$, $x \geq 0$ Then, using what are known as slack variables, $Ax + A'x' = b$. Each row lives in the row space, but each $y_i$ lives in the column space: $\left( Ax + A'x' \right)_i \to y_i$, with $x' \geq 0$. 125 / 140
  257. 257. Then, we have that Example: $A = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$ Elements of the column space live in three dimensions, but in the row space their dimension is 2. 126 / 140
  260. 260. We have then Stack such vectors, which live in the d-dimensional space, into a matrix A of size $n \times d$: $A = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{pmatrix}$ The matrix works as a projection: we are looking for a unit vector v such that the total length of the projections is maximized. 127 / 140
  262. 262. Why? Do you remember the projection onto a single vector? Definition of the projection onto a unit vector v: $p = \frac{v^T a_i}{v^T v} v = \left( v^T a_i \right) v$ Therefore, the length of the projected vector is $\left\| \left( v^T a_i \right) v \right\| = \left| v^T a_i \right|$ 128 / 140
  264. 264. Then Thus, with a little bit of notation: $Av = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{pmatrix} v = \begin{pmatrix} a_1^T v \\ a_2^T v \\ \vdots \\ a_n^T v \end{pmatrix}$ Therefore $\|Av\| = \sqrt{\sum_{i=1}^{n} \left( a_i^T v \right)^2}$ 129 / 140
  266. 266. Then It is possible to ask for the vector that maximizes this length (the first singular vector): $v_1 = \arg\max_{\|v\|=1} \|Av\|$ Then, we can define the corresponding singular value $\sigma_1(A) = \|A v_1\|$ 130 / 140
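A quick NumPy check, on a hypothetical matrix, that the first right singular vector returned by np.linalg.svd attains the largest $\|Av\|$ over unit vectors.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(8, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                                  # first right singular vector
print(np.linalg.norm(A @ v1), s[0])         # both equal sigma_1(A)

# Random unit vectors give a shorter image (up to numerical noise)
for _ in range(5):
    v = rng.normal(size=4)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ v) <= s[0] + 1e-12
```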
  268. 268. This is known as Definition The best-fit line problem is the problem of finding the best line for a set of data points, where the quality of the line is measured by the sum of squared (perpendicular) distances of the points to the line. Remember, we are looking at the dual problem... Generalization This can be transferred to higher dimensions: one can find the best-fit k-dimensional subspace, i.e., the subspace that minimizes the sum of the squared distances of the points to the subspace. 131 / 140
  270. 270. Then, in a Greedy Fashion The second singular vector $v_2$: $v_2 = \arg\max_{v \perp v_1,\ \|v\|=1} \|Av\|$ Then you continue this process, stopping once you have found singular vectors $v_1, v_2, \ldots, v_r$ such that $\max_{v \perp v_1, v_2, \ldots, v_r,\ \|v\|=1} \|Av\| = 0$ 132 / 140
  273. 273. Proving that the strategy is good Theorem Let A be an n × d matrix where v1, v2, ..., vr are the singular vectors defined above. For 1 ≤ k ≤ r, let Vk be the subspace spanned by v1, v2, ..., vk. Then for each k, Vk is the best-fit k-dimensional subspace for A. 133 / 140
  274. 274. Proof For k = 1 the statement holds by the definition of $v_1$. What about k = 2? Let W be a best-fit 2-dimensional subspace for A. For any orthonormal basis $w_1, w_2$ of W, $|Aw_1|^2 + |Aw_2|^2$ is the sum of the squared lengths of the projections of the rows of A onto W. Now, choose an orthonormal basis $w_1, w_2$ so that $w_2$ is perpendicular to $v_1$: take $w_2$ to be a unit vector in W perpendicular to the projection of $v_1$ onto W. 134 / 140
  277. 277. Do you remember $v_1 = \arg\max_{\|v\|=1} \|Av\|$? Therefore $|Aw_1|^2 \leq |Av_1|^2$, and since $w_2 \perp v_1$, $|Aw_2|^2 \leq |Av_2|^2$. Then $|Aw_1|^2 + |Aw_2|^2 \leq |Av_1|^2 + |Av_2|^2$ Arguing in a similar way for k > 2, $V_k$ is at least as good as W and hence is optimal. 135 / 140
  280. 280. Remarks Every matrix has a singular value decomposition $A = U \Sigma V^T$ where: The columns of U are an orthonormal basis for the column space. The columns of V are an orthonormal basis for the row space. $\Sigma$ is diagonal, and the entries on its diagonal, $\sigma_i = \Sigma_{ii}$, are positive real numbers called the singular values of A. 136 / 140
  283. 283. Properties of the Singular Value Decomposition First The eigenvalues of the symmetric matrix $A^T A$ are equal to the squares of the singular values of A: $A^T A = V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T$ Second The rank of a matrix is equal to the number of non-zero singular values. 137 / 140
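A quick NumPy check, on a hypothetical matrix, of both properties.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 4))   # rank <= 3 by construction

s = np.linalg.svd(A, compute_uv=False)                  # singular values, descending
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]    # eigenvalues of A^T A, descending

print(np.allclose(s**2, eigvals))                       # eigenvalues = squared singular values
print(np.linalg.matrix_rank(A), np.sum(s > 1e-10))      # rank = number of non-zero singular values
```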
  285. 285. Outline 1 Linear Transformation Introduction Functions that can be defined using matrices Linear Functions Kernel and Range The Matrix of a Linear Transformation Going Back to Homogeneous Equations The Rank-Nullity Theorem 2 Derivative of Transformations Introduction Derivative of a Linear Transformation Derivative of a Quadratic Transformation 3 Linear Regression The Simplest Functions Splitting the Space Defining the Decision Surface Properties of the Hyperplane wT x + w0 Augmenting the Vector Least Squared Error Procedure The Geometry of a Two-Category Linearly-Separable Case The Error Idea The Final Error Equation 4 Principal Component Analysis Karhunen-Loeve Transform Projecting the Data Lagrange Multipliers The Process Example 5 Singular Value Decomposition Introduction Image Compression 138 / 140
  286. 286. Singular Value Decomposition as Sums The singular value decomposition can be viewed as a sum of rank-1 matrices: $A = A_1 + A_2 + \ldots + A_R$ (34) Why? $A = U \Sigma V^T = \begin{pmatrix} u_1 & u_2 & \cdots & u_R \end{pmatrix} \begin{pmatrix} \sigma_1 v_1^T \\ \sigma_2 v_2^T \\ \vdots \\ \sigma_R v_R^T \end{pmatrix} = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_R u_R v_R^T$ 139 / 140
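A quick NumPy check, on a hypothetical matrix, that A is recovered as the sum of its rank-1 terms $\sigma_i u_i v_i^T$ from Eq. (34).

```python
import numpy as np

A = np.random.default_rng(7).normal(size=(5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sum of rank-1 matrices sigma_i * u_i * v_i^T, as in Eq. (34)
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(A, A_sum))     # True
```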
  288. 288. Truncating Truncating the singular value decomposition allows us to represent the matrix with fewer parameters. For a 512 × 512 image: Full representation: 512 × 512 = 262,144 Rank-10 approximation: 512 × 10 + 10 + 10 × 512 = 10,250 Rank-40 approximation: 512 × 40 + 40 + 40 × 512 = 41,000 Rank-80 approximation: 512 × 80 + 80 + 80 × 512 = 82,000 140 / 140
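A minimal NumPy sketch of rank-k truncation for image compression; the "image" here is a random array standing in for real pixel data, and any 2-D grayscale array would work.

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k approximation of A, keeping the top k singular triples (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Hypothetical 512 x 512 grayscale "image"
img = np.random.default_rng(8).random((512, 512))

for k in (10, 40, 80):
    approx = truncated_svd(img, k)
    n_params = img.shape[0] * k + k + k * img.shape[1]
    err = np.linalg.norm(img - approx) / np.linalg.norm(img)
    print(f"rank {k}: {n_params} parameters, relative error {err:.3f}")
```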
