LINEAR ALGEBRA, WITH OPTIMIZATION; VECTORS & MATRICES, LINEAR SUB-SPACES, LINEAR INDEPENDENCE & DIMENSION; MATRIX SUB-SPACES, MATRIX RANK, MATRIX INVERSE, SYSTEMS OF EQUATIONS, etc.
15. LINEAR INDEPENDENCE AND DIMENSION
• Vectors 𝒙1, … , 𝒙𝑚 are linearly independent if
  ∑ᵢ₌₁ᵐ 𝛼𝑖𝒙𝑖 = 0 ⟺ 𝛼1 = ⋯ = 𝛼𝑚 = 0
• Equivalently, every vector in their span is a unique linear combination of the 𝒙𝑖
• dim 𝒱 = 𝑚 if 𝒙1, … , 𝒙𝑚 span 𝒱 and are linearly independent
• If 𝒚1, … , 𝒚𝑘 span 𝒱 then
  • 𝑘 ≥ 𝑚
  • If 𝑘 > 𝑚, the 𝒚𝑖 are NOT linearly independent (checked numerically below)
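A quick numerical check of linear independence, sketched with NumPy; the specific vectors are illustrative:

```python
import numpy as np

# Stack the candidate vectors x_1, ..., x_m as the columns of a matrix.
X = np.column_stack([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],   # x_3 = x_1 + x_2, so the set is dependent
])

# Columns are linearly independent iff the rank equals the number of vectors.
m = X.shape[1]
print(np.linalg.matrix_rank(X) == m)   # False for this example
```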
17. MATRIX SUBSPACES
• Matrix 𝑀 ∈ ℝᵐˣⁿ defines two subspaces
  • Column space: col 𝑀 = {𝑀𝛼 : 𝛼 ∈ ℝⁿ} ⊂ ℝᵐ
  • Row space: row 𝑀 = {𝑀ᵀ𝛽 : 𝛽 ∈ ℝᵐ} ⊂ ℝⁿ
• Nullspace of 𝑀: null 𝑀 = {𝑥 ∈ ℝⁿ : 𝑀𝑥 = 0}
  • null 𝑀 ⊥ row 𝑀
  • dim null 𝑀 + dim row 𝑀 = 𝑛
  • The analogous statements hold for col 𝑀 and null 𝑀ᵀ, with 𝑛 replaced by 𝑚
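These subspaces can be computed numerically via the SVD; a minimal NumPy sketch with an illustrative rank-1 matrix:

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])   # rank 1, so dim null M = 3 - 1 = 2

# Right singular vectors with (near-)zero singular values span null M.
U, s, Vt = np.linalg.svd(M)
r = int(np.sum(s > 1e-10))        # numerical rank
null_basis = Vt[r:].T             # columns form a basis for null M

print(np.allclose(M @ null_basis, 0))   # M x = 0 for every x in null M
print(null_basis.T @ Vt[:r].T)          # ~0: null M is orthogonal to row M
```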
18. MATRIX RANK
• rank 𝑀 is the common dimension of the row and column spaces (they are equal)
• If 𝑀 ∈ ℝᵐˣⁿ has rank 𝑘, it can be decomposed into the product of an 𝑚 × 𝑘 and a 𝑘 × 𝑛 matrix
[Diagram: 𝑀 (𝑚 × 𝑛, rank 𝑘) factored as an 𝑚 × 𝑘 matrix times a 𝑘 × 𝑛 matrix]
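One way to build such a rank-𝑘 factorization is from the SVD, keeping only the nonzero singular values; a minimal NumPy sketch with an illustrative matrix:

```python
import numpy as np

# A 3x2 matrix with rank 2.
M = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 7.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = int(np.sum(s > 1e-10))       # numerical rank

A = U[:, :k]                     # m x k
B = s[:k, None] * Vt[:k]         # k x n: rows of Vt scaled by singular values
print(A.shape, B.shape)          # (3, 2) (2, 2)
print(np.allclose(M, A @ B))     # True: M = A B
```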
19. PROPERTIES OF RANK
• For matrices 𝐴 ∈ ℝᵐˣⁿ and 𝐵 (of compatible dimensions)
  1. rank 𝐴 ≤ min(𝑚, 𝑛)
  2. rank 𝐴 = rank 𝐴ᵀ
  3. rank 𝐴𝐵 ≤ min(rank 𝐴, rank 𝐵)
  4. rank(𝐴 + 𝐵) ≤ rank 𝐴 + rank 𝐵
• 𝐴 has full rank if rank 𝐴 = min(𝑚, 𝑛)
  • If 𝑚 > rank 𝐴, the rows are not linearly independent
  • Likewise, if 𝑛 > rank 𝐴, the columns are not linearly independent
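These properties are easy to spot-check numerically; a quick sketch with random (illustrative) matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 3))
C = rng.standard_normal((4, 6))
rank = np.linalg.matrix_rank

print(rank(A) <= min(A.shape))                # 1. rank A <= min(m, n)
print(rank(A) == rank(A.T))                   # 2. rank A = rank A^T
print(rank(A @ B) <= min(rank(A), rank(B)))   # 3. rank AB <= min(rank A, rank B)
print(rank(A + C) <= rank(A) + rank(C))       # 4. rank(A + C) <= rank A + rank C
```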
20. OUTLINE
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue decompositions
• Convex optimization
21. MATRIX INVERSE
• 𝑀 ∈ ℝᵐˣᵐ is invertible iff rank 𝑀 = 𝑚
• The inverse is unique and satisfies
  1. 𝑀⁻¹𝑀 = 𝑀𝑀⁻¹ = 𝐼
  2. (𝑀⁻¹)⁻¹ = 𝑀
  3. (𝑀ᵀ)⁻¹ = (𝑀⁻¹)ᵀ
  4. If 𝐴 is invertible then 𝑀𝐴 is invertible and (𝑀𝐴)⁻¹ = 𝐴⁻¹𝑀⁻¹
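A quick numerical check of these identities, sketched with NumPy; random square matrices are invertible with probability 1:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = rng.standard_normal((4, 4))
Minv = np.linalg.inv(M)

print(np.allclose(Minv @ M, np.eye(4)))           # 1. M^{-1} M = I
print(np.allclose(np.linalg.inv(Minv), M))        # 2. (M^{-1})^{-1} = M
print(np.allclose(np.linalg.inv(M.T), Minv.T))    # 3. (M^T)^{-1} = (M^{-1})^T
print(np.allclose(np.linalg.inv(M @ A),
                  np.linalg.inv(A) @ Minv))       # 4. (MA)^{-1} = A^{-1} M^{-1}
```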
22. SYSTEMS OF EQUATIONS
• Given 𝑀 ∈ ℝᵐˣⁿ and 𝑦 ∈ ℝᵐ, we wish to solve
  𝑀𝑥 = 𝑦
• A solution exists only if 𝑦 ∈ col 𝑀
• There may be infinitely many solutions
• If 𝑀 is invertible, then 𝑥 = 𝑀⁻¹𝑦
  • This is a notational device; do not actually invert matrices
  • Computationally, use a solving routine such as Gaussian elimination, as sketched below
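In code this means calling a solver rather than forming 𝑀⁻¹; a minimal NumPy sketch with an illustrative system:

```python
import numpy as np

M = np.array([[3.0, 1.0],
              [1.0, 2.0]])
y = np.array([9.0, 8.0])

# Solve M x = y via an LU-based routine; cheaper and more numerically
# stable than computing inv(M) and multiplying by it.
x = np.linalg.solve(M, y)
print(x)                        # [2. 3.]
print(np.allclose(M @ x, y))    # True
```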
23. SYSTEMS OF EQUATIONS
• What if 𝑦 ∉ col 𝑀?
• Find the 𝑥̂ whose image 𝑦̂ = 𝑀𝑥̂ is closest to 𝑦
  • 𝑦̂ is the projection of 𝑦 onto col 𝑀
  • Also known as (least-squares) regression
• Assume rank 𝑀 = 𝑛 < 𝑚; then
  𝑥̂ = (𝑀ᵀ𝑀)⁻¹𝑀ᵀ𝑦,  𝑦̂ = 𝑀(𝑀ᵀ𝑀)⁻¹𝑀ᵀ𝑦
  • 𝑀ᵀ𝑀 is invertible under this rank assumption; 𝑀(𝑀ᵀ𝑀)⁻¹𝑀ᵀ is the projection matrix onto col 𝑀
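In practice one calls a least-squares routine instead of forming the normal equations explicitly; a sketch with NumPy and illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((10, 3))   # rank 3 < 10 with probability 1
y = rng.standard_normal(10)

# Preferred in practice: an SVD-based least-squares solver.
x_hat, *_ = np.linalg.lstsq(M, y, rcond=None)

# The slide's normal-equations formula gives the same answer.
x_ne = np.linalg.solve(M.T @ M, M.T @ y)
print(np.allclose(x_hat, x_ne))            # True

y_hat = M @ x_hat                          # projection of y onto col M
print(np.allclose(M.T @ (y - y_hat), 0))   # residual is orthogonal to col M
```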
25. EIGENVALUE DECOMPOSITION
• Eigenvalue decomposition of symmetric 𝑀 ∈ ℝᵐˣᵐ is
  𝑀 = 𝑄Σ𝑄ᵀ = ∑ᵢ₌₁ᵐ 𝜆𝑖 𝒒𝑖𝒒𝑖ᵀ
• Σ = diag(𝜆1, … , 𝜆𝑚) contains the eigenvalues of 𝑀
• 𝑄 is orthogonal and its columns are the eigenvectors 𝒒𝑖 of 𝑀
• If 𝑀 is not symmetric but diagonalizable,
  𝑀 = 𝑄Σ𝑄⁻¹
  • Σ is diagonal but possibly complex
  • 𝑄 is not necessarily orthogonal
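A minimal NumPy sketch of the symmetric case; the matrix is illustrative:

```python
import numpy as np

A = np.random.default_rng(3).standard_normal((4, 4))
M = (A + A.T) / 2                    # symmetrize to get a symmetric M

lam, Q = np.linalg.eigh(M)           # eigenvalues (ascending) and eigenvectors
print(np.allclose(Q @ np.diag(lam) @ Q.T, M))   # M = Q Σ Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))          # Q is orthogonal

# Rank-one expansion: M = Σ_i λ_i q_i q_i^T
M_sum = sum(l * np.outer(q, q) for l, q in zip(lam, Q.T))
print(np.allclose(M_sum, M))                    # True
```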
26. CHARACTERIZATIONS OF EIGENVALUES
• Traditional formulation
𝑀𝑥 = 𝜆𝑥
• Leads to the characteristic polynomial
  det(𝑀 − 𝜆𝐼) = 0
• Rayleigh quotient (symmetric 𝑀):
  max𝑥≠0 (𝑥ᵀ𝑀𝑥) / ‖𝑥‖₂²
  • The maximum is the largest eigenvalue 𝜆ₘₐₓ, attained at its eigenvector (see the power-iteration sketch below)
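The Rayleigh quotient underlies power iteration for the dominant eigenpair; a minimal sketch assuming symmetric 𝑀, with an illustrative matrix and iteration count:

```python
import numpy as np

def power_iteration(M, iters=500, seed=4):
    """Dominant eigenpair of symmetric M via repeated multiplication."""
    x = np.random.default_rng(seed).standard_normal(M.shape[0])
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)       # keep the iterate at unit norm
    # Rayleigh quotient x^T M x / ||x||_2^2 (here ||x|| = 1).
    return x @ M @ x, x

M = np.diag([3.0, 1.0, -2.0])        # eigenvalues 3, 1, -2
lam, q = power_iteration(M)
print(lam)                           # ~3.0, the largest eigenvalue
```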
27. EIGENVALUE PROPERTIES
• For 𝑀 ∈ ℝᵐˣᵐ with eigenvalues 𝜆𝑖
  1. tr 𝑀 = ∑ᵢ₌₁ᵐ 𝜆𝑖
  2. det 𝑀 = 𝜆1𝜆2 ⋯ 𝜆𝑚
  3. rank 𝑀 = number of nonzero eigenvalues (for diagonalizable 𝑀)
• When 𝑀 is symmetric
  • The eigenvalue decomposition is a singular value decomposition (up to the signs of the eigenvalues)
  • Eigenvectors for nonzero eigenvalues give an orthonormal basis for row 𝑀 = col 𝑀
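These properties check out numerically; a quick sketch with an illustrative symmetric matrix, so all three hold exactly:

```python
import numpy as np

A = np.random.default_rng(5).standard_normal((4, 4))
M = A @ A.T                          # symmetric (full rank with probability 1)

lam = np.linalg.eigvalsh(M)
print(np.isclose(np.trace(M), lam.sum()))          # tr M = sum of eigenvalues
print(np.isclose(np.linalg.det(M), lam.prod()))    # det M = product of eigenvalues
print(np.linalg.matrix_rank(M) ==
      int(np.sum(~np.isclose(lam, 0.0))))          # rank = # nonzero eigenvalues
```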
28. SIMPLE EIGENVALUE PROOF
• Why is det(𝑀 − 𝜆𝐼) = 0?
• Assume 𝑀 is symmetric and full rank
  1. 𝑀 = 𝑄Σ𝑄ᵀ
  2. 𝑀 − 𝜆𝐼 = 𝑄Σ𝑄ᵀ − 𝜆𝑄𝑄ᵀ = 𝑄(Σ − 𝜆𝐼)𝑄ᵀ, using 𝑄𝑄ᵀ = 𝐼
  3. If 𝜆 = 𝜆𝑖, the 𝑖th eigenvalue of 𝑀 − 𝜆𝐼 is 𝜆𝑖 − 𝜆 = 0
  4. Since det(𝑀 − 𝜆𝐼) is the product of its eigenvalues, one of the factors is 0, so the product is 0
29. OUTLINE
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue decompositions
• Convex optimization
30. CONVEX OPTIMIZATION
• Find the minimum of a function subject to constraints on the solution
• Business, economics, and game theory
• Resource allocation
• Optimal planning and strategies
• Statistics and Machine Learning
• All forms of regression and classification
• Unsupervised learning
• Control theory
• Keeping planes in the air!
31. CONVEX SETS
• A set 𝐶 is convex if ∀𝑥, 𝑦 ∈ 𝐶 and ∀𝛼 ∈ [0, 1],
  𝛼𝑥 + (1 − 𝛼)𝑦 ∈ 𝐶
• The line segment between any two points of 𝐶 also lies in 𝐶
• Examples
• Intersection of halfspaces
• 𝐿ₚ balls
• Intersection of convex sets
32. CONVEX FUNCTIONS
• A real-valued function 𝑓 is convex if dom 𝑓 is convex and ∀𝑥, 𝑦 ∈ dom 𝑓, ∀𝛼 ∈ [0, 1],
  𝑓(𝛼𝑥 + (1 − 𝛼)𝑦) ≤ 𝛼𝑓(𝑥) + (1 − 𝛼)𝑓(𝑦)
• The graph of 𝑓 is upper-bounded by the line segment (chord) between any two points (𝑥, 𝑓(𝑥)) and (𝑦, 𝑓(𝑦)) on the graph, as checked numerically below
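The defining inequality is easy to test numerically for a candidate function; a sketch using the convex 𝑓(𝑥) = 𝑥² on illustrative random points:

```python
import numpy as np

f = lambda x: x ** 2                 # a convex function

rng = np.random.default_rng(6)
ok = True
for _ in range(1000):
    x, y = rng.uniform(-5.0, 5.0, size=2)
    a = rng.uniform(0.0, 1.0)
    # Definition: f(a x + (1 - a) y) <= a f(x) + (1 - a) f(y)
    ok &= f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
print(ok)                            # True
```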
35. GRADIENT DESCENT
• To minimize 𝑓, move down the gradient
• But not too far!
• Optimum when 𝛻𝑓 = 0
• Given 𝑓, learning rate 𝛼, starting point 𝑥0:
  𝑥 = 𝑥0
  do until 𝛻𝑓(𝑥) = 0:
    𝑥 = 𝑥 − 𝛼𝛻𝑓(𝑥)
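A runnable version of this loop, sketched in Python with a stopping tolerance standing in for the exact 𝛻𝑓 = 0 test; the objective and parameters are illustrative:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Minimize f by stepping down its gradient until ||grad f|| is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stand-in for the exact grad f = 0 test
            break
        x = x - alpha * g
    return x

# Example: f(x) = ||x - c||^2 has gradient 2(x - c), minimized at x = c.
c = np.array([1.0, -2.0])
print(gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0]))   # ~[1, -2]
```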
36. STOCHASTIC GRADIENT DESCENT
• Many learning problems have extra structure
  𝑓(𝜃) = ∑ᵢ₌₁ⁿ 𝐿(𝜃; 𝒙𝑖)
• Computing the gradient requires iterating over all 𝑛 points, which can be too costly
• Instead, compute the gradient at a single training example
37. STOCHASTIC GRADIENT DESCENT
• Given 𝑓(𝜃) = ∑ᵢ₌₁ⁿ 𝐿(𝜃; 𝒙𝑖), learning rate 𝛼, starting point 𝜃0:
  𝜃 = 𝜃0
  do until 𝑓(𝜃) is nearly optimal:
    for 𝑖 = 1 to 𝑛 in random order:
      𝜃 = 𝜃 − 𝛼𝛻𝐿(𝜃; 𝒙𝑖)
• Finds a nearly optimal 𝜃; a runnable sketch follows
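A runnable sketch of this loop for the least-squares loss 𝐿(𝜃; 𝒙𝑖, 𝑦𝑖) = (𝒙𝑖ᵀ𝜃 − 𝑦𝑖)²; the data, learning rate, and epoch count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 200, 3
X = rng.standard_normal((n, d))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.01 * rng.standard_normal(n)

theta = np.zeros(d)
alpha = 0.01
for _ in range(50):                      # epochs: "until f nearly optimal"
    for i in rng.permutation(n):         # i = 1..n in random order
        grad_i = 2 * (X[i] @ theta - y[i]) * X[i]   # gradient of L(theta; x_i)
        theta -= alpha * grad_i
print(theta)                             # ~[2, -1, 0.5]
```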