The Terms that You Have to Know!
Basis, Linear independence, Orthogonality
Column space, Row space, Rank
Linear combination
Linear transformation
Inner product
Eigenvalue, Eigenvector
Projection
Least Squares Problem:
The normal equation for the LS problem: A^T A x = A^T b
Solving Ax ≈ b means finding the projection of b onto col(A)
The projection matrix: P = A(A^T A)^{-1} A^T ∈ R^{m×m}
Let A ∈ R^{m×n} be a matrix with full column rank
If A has orthonormal columns, then the LS problem becomes easy:
Pb = AA^T b = ∑_{i=1}^{n} A_{·i} A_{·i}^T b
Think of an orthonormal axis system
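As a minimal NumPy sketch (the matrix A, vector b, and random seed are arbitrary examples): solve the normal equation and check that the projection matrix P sends b onto col(A).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # example m x n matrix with full column rank (m > n)
b = rng.standard_normal(6)

# Normal equation: A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# Projection matrix P = A (A^T A)^{-1} A^T projects b onto col(A)
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P @ b, A @ x))  # True: the projection of b equals A x
```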
Matrix Factorization
LU-Factorization: A = LU
Very useful for solving systems of linear equations; some row exchanges may be required
QR-Factorization: A = QR; A ∈ R^{m×n}, Q ∈ R^{m×n}, R ∈ R^{n×n}
Every matrix A ∈ R^{m×n} with linearly independent columns can be factored into A = QR. The columns of Q are orthonormal, and R is upper triangular and invertible. When m = n and all matrices are square, Q becomes an orthogonal matrix (Q^T Q = I).
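A small NumPy sketch of the QR factorization (the example matrix is arbitrary), checking that Q has orthonormal columns, R is upper triangular, and A = QR:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))         # columns assumed linearly independent

Q, R = np.linalg.qr(A)                  # reduced QR: Q is 5x3, R is 3x3
print(np.allclose(Q.T @ Q, np.eye(3)))  # orthonormal columns: Q^T Q = I
print(np.allclose(R, np.triu(R)))       # R is upper triangular
print(np.allclose(Q @ R, A))            # A = QR
```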
QR Factorization Simplifies the Least Squares Problem
The normal equation for the LS problem: A^T A x = A^T b
A^T A x = R^T Q^T Q R x = R^T R x = R^T Q^T b
⇔ Rx = Q^T b (since R^T is invertible)
A_{·j} = Q R_{·j} = ∑_{k=1}^{n} R_{kj} Q_{·k}
Note: the orthogonal matrix Q constructs the column space of matrix A
LS problem: finding the projection of b onto col(A)
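A minimal NumPy sketch (example A and b are arbitrary) showing that solving Rx = Q^T b gives the same least-squares solution as the normal equation:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)                  # reduced QR of A
x_qr = np.linalg.solve(R, Q.T @ b)      # solve R x = Q^T b (R is invertible)

# agrees with the normal-equation solution A^T A x = A^T b
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_qr, x_ne))          # True
```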
Motivation for Computing QR of the term-by-doc Matrix
The basis vectors of the column space of A can be used to describe the semantic content of the corresponding text collection
Let θ_k be the angle between a query q and the document vector A_{·k}:
cos θ_k = A_{·k}^T q / (||A_{·k}||_2 ||q||_2) = (Q R_{·k})^T q / (||Q R_{·k}||_2 ||q||_2) = R_{·k}^T (Q^T q) / (||R_{·k}||_2 ||q||_2)
That means we can keep Q and R instead of A
QR can also be applied to dimensionality reduction
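A short NumPy sketch with a hypothetical 5-term × 4-document matrix and query (both made up for illustration), showing that the cosine scores computed from R and Q^T q match those computed from A directly:

```python
import numpy as np

# Hypothetical 5-term x 4-document matrix and a query vector
A = np.array([[1., 0., 0., 1.],
              [1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
q = np.array([1., 1., 0., 0., 0.])

Q, R = np.linalg.qr(A)

# cos(theta_k) from A directly ...
cos_A = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
# ... and from R and Q^T q only (||Q R_k|| = ||R_k|| since Q has orthonormal columns)
cos_QR = (R.T @ (Q.T @ q)) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

print(np.allclose(cos_A, cos_QR))  # True: Q and R suffice to score the documents
```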
Recall Matrix Notations
Random vector x = [X_1, X_2, …, X_n]^T, where each X_i is a random variable describing the value of the i-th attribute
Expectation: E[x] = μ; covariance: E[(x − μ)(x − μ)^T] = Σ
Expectation of projection: E[w^T x] = E[∑_i w_i X_i] = ∑_i w_i E[X_i] = w^T E[x] = w^T μ
Variance of projection: Var(w^T x) = E[(w^T x − w^T μ)^2] = E[(w^T x − w^T μ)(w^T x − w^T μ)] = E[w^T (x − μ)(x − μ)^T w] = w^T E[(x − μ)(x − μ)^T] w = w^T Σ w
(w^T: 1×n, x: n×1)
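A quick numerical check of Var(w^T x) = w^T Σ w; the mean, covariance, and direction w below are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])    # example covariance of x
mu = np.array([1.0, -2.0])        # example mean
w = np.array([0.6, 0.8])          # example projection direction

X = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are samples of x
z = X @ w                                             # z = w^T x for each sample

print(z.var(), w @ Sigma @ w)     # sample Var(w^T x) is close to w^T Sigma w
```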
Principal Components Analysis (PCA)
Not using the output information. Find a mapping from the inputs in the original n-dimensional space to a new k-dimensional space (k < n) such that when x is projected there, information loss is minimized.
The projection of x on the direction of w is: z = w^T x
Find w such that Var(z) is maximized (after the projection, the differences between the sample points become most apparent)
For a unique solution, ||w|| = 1
[Figure: projection of a sample point x onto the unit direction w (||w|| = 1), giving z = w^T x]
The 1st Principal Component
Maximize Var(z) subject to ||w_1|| = 1:
max_{w_1} w_1^T Σ w_1 − α(w_1^T w_1 − 1)
Taking the derivative w.r.t. w_1 and setting it to 0, we have
2Σw_1 − 2αw_1 = 0 ⇒ Σw_1 = αw_1
That is, w_1 is an eigenvector of Σ and α is the corresponding eigenvalue. Also,
Var(z) = w_1^T Σ w_1 = α w_1^T w_1 = α
We can choose the largest eigenvalue for Var(z) to be maximum.
The 1st principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue, λ_1 = α.
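A minimal NumPy sketch (synthetic 2-D data with an arbitrary covariance): the first principal direction is the eigenvector of the sample covariance with the largest eigenvalue, and the variance of the projected data is close to that eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0],
                                         [1.0, 2.0]], size=100_000)

S = np.cov(X, rowvar=False)           # sample covariance (estimator of Sigma)
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
w1 = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue, ||w1|| = 1

z = X @ w1                            # z = w_1^T x for every sample
print(z.var(), eigvals[-1])           # Var(z) is close to the largest eigenvalue lambda_1
```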
The 2nd Principal Component
Maximize Var(z_2), s.t. ||w_2|| = 1 and w_2 orthogonal to w_1:
max_{w_2} w_2^T Σ w_2 − α(w_2^T w_2 − 1) − β(w_2^T w_1 − 0)
Taking the derivative w.r.t. w_2 and setting it equal to 0, we have
2Σw_2 − 2αw_2 − βw_1 = 0
Premultiplying by w_1^T, we get
2w_1^T Σ w_2 − 2αw_1^T w_2 − βw_1^T w_1 = 0
Note that w_1^T w_2 = 0, and that w_1^T Σ w_2 is a scalar, equal to its transpose, therefore
w_1^T Σ w_2 = w_2^T Σ w_1 = λ_1 w_2^T w_1 = 0
so β = 0, and we have Σw_2 = αw_2
That is, w_2 is the eigenvector of Σ with the second largest eigenvalue λ_2 = α, and so on.
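A similar numerical check for the second principal component (synthetic 3-D data with an arbitrary covariance): the second leading eigenvector w_2 is orthogonal to w_1 and captures the second-largest variance.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[4.0, 1.0, 0.5],
                             [1.0, 3.0, 0.2],
                             [0.5, 0.2, 1.0]], size=100_000)

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)      # ascending order
w1, w2 = eigvecs[:, -1], eigvecs[:, -2]   # two leading principal directions

print(abs(w1 @ w2) < 1e-10)               # w_2 is orthogonal to w_1
print((X @ w2).var(), eigvals[-2])        # Var(z_2) is close to the 2nd largest eigenvalue
```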
Recall from Linear Algebra
Theorem: For a real symmetric matrix, eigenvectors associated with different eigenvalues are orthogonal to each other
Theorem: A real symmetric matrix A can be transformed into a diagonal matrix by P^{-1}AP = D, where P has the eigenvectors of A as its columns
Recall from Linear Algebra (cont.)
Def: Positive definite bilinear form: f(x, x) > 0 for all x ≠ 0. E.g.: f(x, y) = x^T A y
If x^T A x > 0 for all x ≠ 0, the n×n matrix A is called a positive definite matrix
Def: Positive semidefinite bilinear form: f(x, x) ≥ 0 for all x. E.g.: if x^T A x ≥ 0 for all x, A is called a positive semidefinite matrix
Theorem: A (real symmetric) matrix A is positive definite if and only if all the eigenvalues of A are positive
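A small NumPy illustration of these statements on an arbitrary real symmetric, positive definite example matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # real symmetric example matrix

eigvals, P = np.linalg.eigh(A)    # columns of P are orthonormal eigenvectors
D = np.diag(eigvals)

print(np.allclose(np.linalg.inv(P) @ A @ P, D))  # P^{-1} A P = D
print(np.allclose(P.T @ P, np.eye(3)))           # eigenvectors are mutually orthogonal
print(np.all(eigvals > 0))        # all eigenvalues positive -> A is positive definite
```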
What PCA does
Consider an R^n → R^k transformation z = W^T(x − m) (just like z = w_1^T x, z = w_2^T x, …),
where the k columns of W are the k leading eigenvectors of S (the estimator of Σ), and m is the sample mean.
Note: if k = n, WW^T = W^T W = I, so W^{-1} = W^T; if k < n, W^T W = I_{k×k}.
The transformation centers the data at the origin and rotates the axes to those eigenvectors; the variances over the new dimensions are equal to the eigenvalues.
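A minimal NumPy sketch of z = W^T(x − m) on synthetic data (the mean, covariance, and k = 2 are arbitrary choices): the transformed data is centered, and its covariance is diagonal with the leading eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.multivariate_normal([5.0, -3.0, 1.0],
                            [[4.0, 1.5, 0.5],
                             [1.5, 2.0, 0.3],
                             [0.5, 0.3, 1.0]], size=50_000)

m = X.mean(axis=0)                       # sample mean
S = np.cov(X, rowvar=False)              # estimator S of Sigma
eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues

k = 2
W = eigvecs[:, ::-1][:, :k]              # k leading eigenvectors as the columns of W
Z = (X - m) @ W                          # z = W^T (x - m) for every sample

print(Z.mean(axis=0).round(6))           # ~ [0, 0]: data centered at the origin
print(np.cov(Z, rowvar=False).round(3))  # ~ diagonal, entries = two largest eigenvalues
print(np.sort(eigvals)[::-1][:k].round(3))
```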
Singular Value Decomposition (SVD)
A = UΣV^T; A ∈ R^{m×n}, U ∈ R^{m×n}, V ∈ R^{n×n}, Σ ∈ R^{n×n}
The columns of U are eigenvectors of AA^T, and the columns of V are eigenvectors of A^T A
Σ = diag(σ_1, …, σ_r), padded with zeros to size m×n, where r = min(m, n) and σ_1 ≥ σ_2 ≥ … ≥ σ_r
The singular values σ_i are the square roots of the nonzero eigenvalues of both AA^T and A^T A
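A minimal NumPy sketch (arbitrary example matrix) computing the thin SVD and checking that the singular values are the square roots of the eigenvalues of A^T A:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: U 5x3, s (3,), Vt 3x3

print(np.allclose(U * s @ Vt, A))                 # A = U Sigma V^T
# singular values are square roots of the eigenvalues of A^T A (and of A A^T)
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]       # descending order
print(np.allclose(s, np.sqrt(eigvals)))
```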
Singular Value Decomposition (SVD)
Example: for A = [[-1, 1, 0], [0, -1, 1]], the factorization A = UΣV^T is
U = (√2/2)·[[-1, 1], [1, 1]]
Σ = [[√3, 0, 0], [0, 1, 0]]
V^T = [[√6/6, -√6/3, √6/6], [-√2/2, 0, √2/2], [√3/3, √3/3, √3/3]]
A = UΣV^T; A ∈ R^{m×n}, U ∈ R^{m×n}, V ∈ R^{n×n}, Σ ∈ R^{n×n}
AA^T = UΣV^T VΣ^T U^T = UΣΣ^T U^T ⇒ col(A) = col(U)
A^T A = VΣ^T ΣV^T ⇒ row(A) = col(V)
Example: for the single column A = [[-1], [2], [2]],
A = UΣV^T = (1/3)·[[-1, 2, 2], [2, -1, 2], [2, 2, -1]] · [[3], [0], [0]] · [1]
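A quick numerical verification of the 2×3 worked example above, entering the reconstructed factors explicitly:

```python
import numpy as np

A = np.array([[-1., 1., 0.],
              [ 0., -1., 1.]])

U = np.sqrt(2) / 2 * np.array([[-1., 1.],
                               [ 1., 1.]])
S = np.array([[np.sqrt(3), 0., 0.],
              [0.,         1., 0.]])
Vt = np.array([[ np.sqrt(6)/6, -np.sqrt(6)/3, np.sqrt(6)/6],
               [-np.sqrt(2)/2,  0.,           np.sqrt(2)/2],
               [ np.sqrt(3)/3,  np.sqrt(3)/3, np.sqrt(3)/3]])

print(np.allclose(U @ S @ Vt, A))       # the factorization reproduces A
print(np.allclose(U.T @ U, np.eye(2)))  # U has orthonormal columns
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # V is orthogonal
```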
Latent Semantic Indexing (LSI)
Basic idea: explore the correlation between words and documents
Two words are correlated when they co-occur many times
Two documents are correlated when they have many words in common
Latent Semantic Indexing (LSI)
Computation: using singular value decomposition (SVD)
[Figure: SVD of the term-by-document matrix as a concept space — the left factor gives the representation of concepts in term space, the right factor gives the representation of concepts in document space, and m is the number of concepts/topics]
SVD: Properties
rank(S): the maximum number of linearly independent row (or column) vectors of matrix S
SVD produces the best low-rank approximation
Example: X with rank(X) = 9 is approximated by X' with rank(X') = 2
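A minimal NumPy sketch (an arbitrary 9×12 example matrix standing in for X) of the truncated SVD as a low-rank approximation: keeping the two largest singular values gives a rank-2 X', and the Frobenius-norm error equals the energy of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((9, 12))          # example term-by-document style matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X2 = U[:, :k] * s[:k] @ Vt[:k, :]         # X': keep only the 2 largest singular values

print(np.linalg.matrix_rank(X))           # 9
print(np.linalg.matrix_rank(X2))          # 2
# approximation error (Frobenius norm) equals the norm of the discarded singular values
print(np.linalg.norm(X - X2, 'fro'), np.sqrt((s[k:] ** 2).sum()))
```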