Linear Algebra Review (with a Small Dose of Optimization)
Hristo Paskov, CS246
Outline
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue decompositions
• Convex optimization
Vectors and Matrices
• Vector $x \in \mathbb{R}^n$: a column of $n$ real entries, $x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$
• May also write $x = (x_1, x_2, \dots, x_n)$
Vectors and Matrices
• Matrix $A \in \mathbb{R}^{m \times n}$: $m$ rows and $n$ columns of real entries
• Written in terms of rows or columns: $A = \begin{bmatrix} a_1^T \\ \vdots \\ a_m^T \end{bmatrix}$ (rows $a_i^T$) or $A = [\, a^1 \ \cdots \ a^n \,]$ (columns $a^j$)
Multiplication
• Vector-vector (inner product): $x^T y = \sum_{i=1}^n x_i y_i$
• Matrix-vector: $Ax$, where $(Ax)_i = a_i^T x$
Multiplication
• Matrix-matrix: for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times k}$, $AB \in \mathbb{R}^{m \times k}$; e.g., a $5 \times 3$ matrix times a $3 \times 4$ matrix gives a $5 \times 4$ matrix
Multiplication
• Matrix-matrix entrywise: $(AB)_{ij} = a_i^T b^j$, rows of $A$ dotted with columns of $B$ (see the sketch below)
Multiplication Properties
• Associative: $(AB)C = A(BC)$
• Distributive: $A(B + C) = AB + AC$
• NOT commutative: in general $AB \neq BA$
  – Dimensions may not even be conformable: if $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times k}$ with $k \neq m$, then $BA$ is undefined
Useful Matrices
• Identity matrix $I$: ones on the diagonal, zeros elsewhere; $AI = IA = A$
• Diagonal matrix $D = \mathrm{diag}(d_1, \dots, d_n)$: entries $d_i$ on the diagonal, zeros elsewhere
Useful Matrices
• Symmetric: $A = A^T$
• Orthogonal $Q$: $Q^T Q = Q Q^T = I$, i.e., $Q^{-1} = Q^T$
  – Columns/rows are orthonormal
• Positive semidefinite $A$: $x^T A x \geq 0$ for all $x$
  – Equivalently, there exists $B$ such that $A = B^T B$ (checked numerically below)
Outline
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue decompositions
• Convex optimization
Norms
• Quantify the “size” of a vector
• Given $x, y \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$, a norm $\|\cdot\|$ satisfies
  1. $\|x\| = 0$ if and only if $x = 0$
  2. $\|\alpha x\| = |\alpha| \, \|x\|$
  3. $\|x + y\| \leq \|x\| + \|y\|$ (triangle inequality)
• Common norms (computed in the sketch below):
  1. Euclidean $\ell_2$-norm: $\|x\|_2 = \sqrt{\sum_i x_i^2}$
  2. $\ell_1$-norm: $\|x\|_1 = \sum_i |x_i|$
  3. $\ell_\infty$-norm: $\|x\|_\infty = \max_i |x_i|$
Linear Subspaces
• A subspace $S \subseteq \mathbb{R}^n$ satisfies
  1. $0 \in S$
  2. If $x, y \in S$ and $\alpha, \beta \in \mathbb{R}$, then $\alpha x + \beta y \in S$
• Vectors $v_1, \dots, v_k$ span $S$ if every $x \in S$ can be written as $x = \sum_i \alpha_i v_i$
Linear Independence and Dimension
• Vectors $v_1, \dots, v_k$ are linearly independent if $\sum_i \alpha_i v_i = 0$ implies $\alpha_i = 0$ for all $i$
  – Every linear combination of the $v_i$ is unique
• $\dim(S) = k$ if $v_1, \dots, v_k$ span $S$ and are linearly independent
  – If $u_1, \dots, u_m$ span $S$ then $m \geq k$
• If $m > \dim(S)$ then $u_1, \dots, u_m$ are NOT linearly independent
Matrix Subspaces
• A matrix $A \in \mathbb{R}^{m \times n}$ defines two subspaces
  – Column space: $\mathrm{col}(A) = \{Ax : x \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$
  – Row space: $\mathrm{row}(A) = \{A^T y : y \in \mathbb{R}^m\} \subseteq \mathbb{R}^n$
• Nullspace of $A$: $\mathrm{null}(A) = \{x : Ax = 0\}$, the vectors orthogonal to $\mathrm{row}(A)$
  – Analog for the column space: the left nullspace $\{y : A^T y = 0\}$
Matrix Rank
• $\mathrm{rank}(A)$ gives the (shared) dimensionality of the row and column spaces
• If $A \in \mathbb{R}^{m \times n}$ has rank $k$, can decompose $A$ into a product of an $m \times k$ and a $k \times n$ matrix
Properties of Rank
• For $A \in \mathbb{R}^{m \times n}$: $\mathrm{rank}(A) \leq \min(m, n)$ and $\mathrm{rank}(A) = \mathrm{rank}(A^T)$
• $A$ has full rank if $\mathrm{rank}(A) = \min(m, n)$
• If $\mathrm{rank}(A) < m$, the rows are not linearly independent
  – Same for the columns if $\mathrm{rank}(A) < n$
Outline
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue decompositions
• Convex optimization
Matrix Inverse
• $A \in \mathbb{R}^{n \times n}$ is invertible iff $\mathrm{rank}(A) = n$
• The inverse $A^{-1}$ is unique and satisfies
  1. $A^{-1} A = A A^{-1} = I$
  2. $(A^{-1})^{-1} = A$
  3. $(A^T)^{-1} = (A^{-1})^T$
  4. If $B$ is invertible then $AB$ is invertible and $(AB)^{-1} = B^{-1} A^{-1}$
Systems of Equations
• Given $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, wish to solve $Ax = b$
  – A solution exists only if $b \in \mathrm{col}(A)$
• Possibly infinite number of solutions
• If $A$ is invertible then $x = A^{-1} b$
  – Notational device; do not actually invert matrices
  – Computationally, use solving routines like Gaussian elimination (see the sketch below)
Systems of Equations
• What if $b \notin \mathrm{col}(A)$?
• Find $x$ whose image $Ax$ is closest to $b$
  – $Ax$ is the projection of $b$ onto $\mathrm{col}(A)$
  – Also known as (least-squares) regression
• Assume $A$ has linearly independent columns: $x = (A^T A)^{-1} A^T b$
  – $A^T A$ is invertible, and $A (A^T A)^{-1} A^T$ is the projection matrix onto $\mathrm{col}(A)$
Systems of Equations
Worked examples:
$\begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 0.5 \\ -2.5 \end{bmatrix} = \begin{bmatrix} 0.5 \\ -1.5 \end{bmatrix} \qquad \begin{bmatrix} -1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} -1 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0.5 \\ -1.5 \end{bmatrix}$
Eigenvalue Decomposition
• Eigenvalue decomposition of symmetric $A$ is $A = Q \Lambda Q^T$
  – $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ contains the eigenvalues of $A$
  – $Q$ is orthogonal and its columns are the eigenvectors of $A$
• If $A$ is not symmetric but diagonalizable, $A = P \Lambda P^{-1}$
  – $\Lambda$ is diagonal but possibly complex
  – $P$ not necessarily orthogonal
Characterizations of Eigenvalues
• Traditional formulation: $Ax = \lambda x$ with $x \neq 0$
  – Leads to the characteristic polynomial $\det(A - \lambda I) = 0$
• Rayleigh quotient (symmetric $A$): $\lambda_{\max}(A) = \max_{x \neq 0} \dfrac{x^T A x}{x^T x}$
Eigenvalue Properties
• For $A \in \mathbb{R}^{n \times n}$ with eigenvalues $\lambda_1, \dots, \lambda_n$:
  1. $\mathrm{tr}(A) = \sum_i \lambda_i$
  2. $\det(A) = \prod_i \lambda_i$
  3. $\mathrm{rank}(A)$ = number of nonzero eigenvalues
• When $A$ is symmetric
  – The eigenvalue decomposition coincides with the singular value decomposition (up to signs)
  – Eigenvectors for nonzero eigenvalues give an orthogonal basis for $\mathrm{row}(A) = \mathrm{col}(A)$
Simple Eigenvalue Proof
• Why is $\det(A - \lambda I) = 0$ when $\lambda$ is an eigenvalue?
• Assume $A$ is symmetric and full rank
  1. $A = Q \Lambda Q^T$
  2. $A - \lambda I = Q \Lambda Q^T - \lambda Q Q^T = Q (\Lambda - \lambda I) Q^T$ (using $Q Q^T = I$)
  3. If $\lambda$ is an eigenvalue of $A$, an eigenvalue of $\Lambda - \lambda I$ is 0
  4. Since $\det$ is the product of eigenvalues, one of the terms is 0, so the product is 0
Outline
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue decompositions
• Convex optimization
Convex Optimization
• Find the minimum of a function subject to solution constraints
• Business/economics/game theory
  – Resource allocation
  – Optimal planning and strategies
• Statistics and Machine Learning
  – All forms of regression and classification
  – Unsupervised learning
• Control theory
  – Keeping planes in the air!
Convex Sets
• A set $C$ is convex if $x, y \in C$ and $\theta \in [0, 1]$ imply $\theta x + (1 - \theta) y \in C$
  – Line segment between points in $C$ also lies in $C$
• Examples
  – Intersection of halfspaces
  – $\ell_p$ balls
  – Intersection of convex sets
Convex Functions
• A real-valued function $f$ is convex if $\mathrm{dom}(f)$ is convex and for all $x, y \in \mathrm{dom}(f)$ and $\theta \in [0, 1]$, $f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)$
  – Graph of $f$ upper bounded by the line segment between $(x, f(x))$ and $(y, f(y))$ on the graph
Gradients
• Differentiable convex $f$ with gradient $\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)$
• Gradient at $x$ gives the linear approximation $f(x + w) \approx f(x) + w^T \nabla f(x)$, which for convex $f$ lower bounds the graph of $f$
Gradient Descent
• To minimize $f$, move down the gradient
  – But not too far!
  – Optimum when $\nabla f(x) = 0$
• Given $f$, learning rate $\eta > 0$, starting point $x_0$: set $x \leftarrow x_0$, then
  do until $\nabla f(x) \approx 0$: $\quad x \leftarrow x - \eta \nabla f(x)$ (runnable sketch below)
Stochastic Gradient Descent
• Many learning problems have extra structure: the objective to minimize is a sum over training examples, $f(\theta) = \sum_{i=1}^N f_i(\theta)$, in the learning parameter $\theta$
• Computing the full gradient requires iterating over all $N$ points, which can be too costly
• Instead, compute the gradient at a single training example
Stochastic Gradient Descent
• Given $f = \sum_i f_i$, learning rate $\eta$, starting point $\theta_0$: set $\theta \leftarrow \theta_0$, then
  do until nearly optimal:
    for $i$ in random order: $\quad \theta \leftarrow \theta - \eta \nabla f_i(\theta)$
• Finds a nearly optimal $\theta$ (sketch below)