
Guaranteed Learning of Latent Variable Models

through Spectral and Tensor Methods

Anima Anandkumar

U.C. Irvine

Application 1: Clustering

Basic operation of grouping data points.

Hypothesis: each data point belongs to an unknown group.

Probabilistic/latent variable viewpoint

The groups represent different distributions (e.g. Gaussian).

Each data point is drawn from one of these distributions (e.g. Gaussian mixtures).

Application 2: Topic Modeling

Document modeling

Observed: words in document corpus.

Hidden: topics.

Goal: carry out document summarization.

Application 3: Understanding Human Communities

Social Networks

Observed: network of social ties, e.g. friendships, co-authorships

Hidden: groups/communities of actors.

Application 4: Recommender Systems

Recommender System

Observed: Ratings of users for various products, e.g. Yelp reviews.

Goal: Predict new recommendations.

Modeling: Find groups/communities of users and products.

Application 5: Feature Learning

Feature Engineering

Learn good features/representations for classification tasks, e.g. image and speech recognition.

Sparse representations, low dimensional hidden structures.

Application 6: Computational Biology

Observed: gene expression levels

Goal: discover gene groups

Hidden variables: regulators controlling gene groups

How to model hidden effects?

Basic Approach: mixtures/clusters

Hidden variable h is categorical.

Advanced: Probabilistic models

Hidden variable h has more general distributions.

Can model mixed memberships.

[Figure: multi-level graphical model with hidden variables h_1, h_2, h_3 and observed variables x_1, …, x_5.]

This talk: basic mixture model. Tomorrow: advanced models.

Challenges in Learning

Basic goal in all mentioned applications

Discover hidden structure in data: unsupervised learning.

Challenge: Conditions for Identifiability

When can the model be identified (given infinite computation and data)?

Does identifiability also lead to tractable algorithms?

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood estimation is NP-hard.

In practice, EM and variational Bayes have no consistency guarantees.

Can we achieve efficient computational and sample complexities?

In this series: guaranteed and efficient learning through spectral methods

What this talk series will cover

This talk

Start with the most basic latent variable model: Gaussian mixtures.

Basic spectral approach: principal component analysis (PCA).

Beyond correlations: higher order moments.

Tensor notations, PCA extension to tensors.

Guaranteed learning of Gaussian mixtures.

Introduce topic models. Present a unified viewpoint.

Tomorrow: using tensor methods for learning latent variable models, and analysis of the tensor decomposition method.

Wednesday: implementation of tensor methods.

Resources for Today’s Talk

“A Method of Moments for Mixture Models and Hidden Markov Models,” by A. Anandkumar, D. Hsu, and S. M. Kakade. Proc. of COLT, June 2012.

“Tensor Decompositions for Learning Latent Variable Models,” by A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, Oct. 2012.

Resources available at http://newport.eecs.uci.edu/anandkumar/MLSS.html

Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion

Warm-up: PCA

Optimization problem

For (centered) points x_i ∈ R^d, find a projection P ∈ R^{d×d} with Rank(P) = k that solves

min_P (1/n) ∑_{i∈[n]} ‖x_i − P x_i‖².

Result: If S = Cov(X) has eigendecomposition S = UΛU⊤, then P = U_{(k)} U_{(k)}⊤, where U_{(k)} holds the top-k eigenvectors.

Proof

By the Pythagorean theorem, ∑_i ‖x_i − P x_i‖² = ∑_i ‖x_i‖² − ∑_i ‖P x_i‖².

So it suffices to maximize (1/n) ∑_i ‖P x_i‖² = (1/n) ∑_i Tr[P x_i x_i⊤ P⊤] = Tr[P S P⊤].
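As a quick sanity check, here is a minimal MATLAB sketch of PCA via an eigendecomposition of the sample covariance; the data and dimensions below are made up for illustration.

% PCA sketch (illustrative): X is n-by-d with rows as data points.
n = 500; d = 10; k = 3;
X = randn(n, k) * randn(k, d) + 0.1 * randn(n, d);   % synthetic data
X = X - repmat(mean(X, 1), n, 1);                    % center
S = (X' * X) / n;                                    % sample covariance
[U, L] = eig(S);
[~, idx] = sort(diag(L), 'descend');
Uk = U(:, idx(1:k));                                 % top-k eigenvectors
P = Uk * Uk';                                        % rank-k projection
obj = norm(X - X * P, 'fro')^2 / n;                  % (1/n) sum_i ||x_i - P x_i||^2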

Review of linear algebra

For a matrix S, u is an eigenvector with eigenvalue λ if Su = λu.

For symmetric S ∈ R^{d×d}, there are d real eigenvalues, and S = ∑_{i∈[d]} λ_i u_i u_i⊤, where U = [u_1 | … | u_d] is orthogonal.

Rayleigh Quotient

For a symmetric matrix S with eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_d and corresponding eigenvectors u_1, …, u_d,

max_{‖z‖=1} z⊤Sz = λ_1,   min_{‖z‖=1} z⊤Sz = λ_d,

and the optimizing vectors are u_1 and u_d respectively.

Optimal Projection

Over rank-k orthogonal projections P (P² = P = P⊤), max Tr(P⊤SP) = λ_1 + λ_2 + … + λ_k, attained when P projects onto span{u_1, …, u_k}.

PCA on Gaussian Mixtures

Mixture of k Gaussians: each sample is x = Ah + z.

h ∈ {e_1, …, e_k}, the basis vectors. E[h] = w.

A ∈ R^{d×k}: columns a_i are the component means.

Let µ := Aw be the overall mean.

z ∼ N(0, σ²I) is white Gaussian noise.

E[(x − µ)(x − µ)⊤] = ∑_{i∈[k]} w_i (a_i − µ)(a_i − µ)⊤ + σ²I.

How the above equation is obtained:

E[(x − µ)(x − µ)⊤] = E[(Ah − µ)(Ah − µ)⊤] + E[zz⊤] = ∑_{i∈[k]} w_i (a_i − µ)(a_i − µ)⊤ + σ²I.

PCA on Gaussian Mixtures

E[(x − µ)(x − µ)⊤] = ∑_{i∈[k]} w_i (a_i − µ)(a_i − µ)⊤ + σ²I.

The vectors {a_i − µ} are linearly dependent: ∑_i w_i (a_i − µ) = 0. So the PSD matrix ∑_{i∈[k]} w_i (a_i − µ)(a_i − µ)⊤ has rank ≤ k − 1.

(k − 1)-PCA on the covariance matrix, together with µ, yields span(A).

The lowest eigenvalue of the covariance matrix yields σ².
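A minimal numerical check of this covariance structure (a sketch with made-up parameters, not part of the original slides):

% Spherical Gaussian mixture: the smallest eigenvalues of the covariance
% recover sigma^2, and the top ones carry the component means.
d = 20; k = 4; n = 100000; sigma = 0.5;
A = randn(d, k);                         % columns a_i = component means
labels = randi(k, n, 1);                 % uniform mixing weights w_i = 1/k
X = A(:, labels) + sigma * randn(d, n);  % samples x = a_h + z (columns)
S = cov(X');                             % sample covariance
lam = sort(eig(S), 'descend');
sigma2_hat = mean(lam(k:end));           % bottom d-k+1 eigenvalues ~= sigma^2
% the top (k-1) eigenvectors of S, together with the sample mean,
% span the column space of A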

How to Learn A?

Learning through Spectral Clustering

Learning A through Spectral Clustering

Project samples x onto span(A).

Distance-based clustering (e.g. k-means).

A series of works, e.g. Vempala & Wang.

Fails to cluster when the component variance is large relative to the separation between the means.
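For completeness, a sketch of this projection-plus-clustering recipe on well-separated synthetic data (the success regime); kmeans here is from MATLAB's Statistics and Machine Learning Toolbox, and all numbers are illustrative.

% Spectral clustering sketch: project onto the top (k-1) principal
% directions, then run distance-based clustering.
d = 50; k = 3; n = 3000; sigma = 0.2;
A = 5 * randn(d, k);                      % well-separated component means
labels = randi(k, n, 1);
X = A(:, labels) + sigma * randn(d, n);
Xc = X - repmat(mean(X, 2), 1, n);        % center
[U, ~] = eigs(cov(Xc'), k - 1);           % top (k-1) eigenvectors
Y = (U' * Xc)';                           % projected samples (n-by-(k-1))
est = kmeans(Y, k);                       % recovers the grouping up to relabeling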

Learning Gaussian Mixtures Without Separation Constraints?

Beyond PCA: Spectral Methods on Tensors

How to learn the component means A (not just its span) without separation constraints?

PCA is a spectral method on (covariance) matrices.

◮ Are higher order moments helpful?

What if the number of components is greater than the observed dimensionality, k > d?

◮ Do higher order moments help to learn such overcomplete models?

What if the data is not Gaussian?

◮ Moment-based estimation of probabilistic latent variable models?

Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion

Tensor Notation for Higher Order Moments

Multi-variate higher order moments form tensors.

Are there spectral operations on tensors akin to PCA?

Matrix

E[x ⊗ x] ∈ R^{d×d} is a second order tensor.

E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}].

For matrices: E[x ⊗ x] = E[xx⊤].

Tensor

E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third order tensor.

E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
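In MATLAB these moments can be stored as plain multidimensional arrays; the following sketch (with made-up data) forms the empirical versions used in the rest of the talk.

% Empirical moments as arrays: M2_hat is d-by-d, M3_hat is d-by-d-by-d.
d = 5; n = 10000;
X = randn(d, n);                          % columns are samples x_t
M2_hat = (X * X') / n;                    % estimate of E[x ⊗ x]
M3_hat = zeros(d, d, d);                  % estimate of E[x ⊗ x ⊗ x]
for t = 1:n
    xt = X(:, t);
    M3_hat = M3_hat + reshape(kron(xt, xt * xt'), d, d, d);
end
M3_hat = M3_hat / n;                      % entry (i1,i2,i3) ~ E[x_{i1} x_{i2} x_{i3}]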

Third order moment for Gaussian mixtures

Consider a mixture of k Gaussians: each sample is x = Ah + z.

h ∈ {e_1, …, e_k}, the basis vectors. E[h] = w.

A ∈ R^{d×k}: columns are the component means. µ := Aw is the mean.

z ∼ N(0, σ²I) is white Gaussian noise.

E[x ⊗ x ⊗ x] = ∑_i w_i a_i ⊗ a_i ⊗ a_i + σ² ∑_i (µ ⊗ e_i ⊗ e_i + e_i ⊗ µ ⊗ e_i + e_i ⊗ e_i ⊗ µ).

Intuition behind the equation:

E[x ⊗ x ⊗ x] = E[(Ah) ⊗ (Ah) ⊗ (Ah)] + E[(Ah) ⊗ z ⊗ z] + E[z ⊗ (Ah) ⊗ z] + E[z ⊗ z ⊗ (Ah)]
             = ∑_i w_i a_i ⊗ a_i ⊗ a_i + σ² ∑_i (µ ⊗ e_i ⊗ e_i + e_i ⊗ µ ⊗ e_i + e_i ⊗ e_i ⊗ µ).

(Terms with an odd number of copies of z vanish, since z has zero mean and zero odd moments.)

How to recover the parameters A and w from the third order moment?

Simplifications

E[x ⊗ x ⊗ x] = ∑_i w_i a_i ⊗ a_i ⊗ a_i + σ² ∑_i (µ ⊗ e_i ⊗ e_i + e_i ⊗ µ ⊗ e_i + e_i ⊗ e_i ⊗ µ).

σ² is obtained from σ_min(Cov(x)).

E[x ⊗ x] = ∑_i w_i a_i ⊗ a_i + σ²I.

Subtracting these noise terms, we can obtain

M3 = ∑_i w_i a_i ⊗ a_i ⊗ a_i,   M2 = ∑_i w_i a_i ⊗ a_i.

How to obtain the parameters A and w from M2 and M3?
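A sketch of these two corrections in MATLAB (illustrative; it assumes estimates E2 ≈ E[x ⊗ x], E3 ≈ E[x ⊗ x ⊗ x] and mu ≈ E[x] are already available, e.g. E2 = M2_hat and E3 = M3_hat from the earlier snippet, with mu the sample mean):

% From raw moments of x = A*h + z to the noise-free M2 and M3.
d = size(E2, 1);
outer3 = @(a, b, c) reshape(a * kron(c, b).', d, d, d);   % rank-1 tensor a⊗b⊗c
sigma2 = min(eig(E2 - mu * mu'));        % sigma^2 = smallest eigenvalue of Cov(x)
M2 = E2 - sigma2 * eye(d);               % sum_i w_i a_i ⊗ a_i
M3 = E3;
for i = 1:d
    ei = zeros(d, 1); ei(i) = 1;
    M3 = M3 - sigma2 * (outer3(mu, ei, ei) + outer3(ei, mu, ei) + outer3(ei, ei, mu));
end                                      % now M3 = sum_i w_i a_i ⊗ a_i ⊗ a_i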

Tensor Slices

M3 = ∑_i w_i a_i ⊗ a_i ⊗ a_i,   M2 = ∑_i w_i a_i ⊗ a_i.

Multilinear transformation of a tensor

M3(B, C, D) := ∑_i w_i (B⊤a_i) ⊗ (C⊤a_i) ⊗ (D⊤a_i).

Slice of a tensor

M3(I, I, r) = ∑_i w_i a_i ⊗ a_i ⟨a_i, r⟩ = A · Diag(w) Diag(A⊤r) · A⊤.

In summary:

M3(I, I, r) = A · Diag(w) Diag(A⊤r) · A⊤,   M2 = A · Diag(w) · A⊤.
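With M3 stored as a d-by-d-by-d array (as in the earlier snippets) and r a d-by-1 vector, a slice is just a weighted sum of the frontal faces; a one-line sketch:

% Slice M3(I, I, r): contract the third mode with r.
slice = reshape(reshape(M3, d*d, d) * r, d, d);   % = sum_l M3(:,:,l) * r(l)
% for the mixture moments above, slice == A * diag(w) * diag(A' * r) * A'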

Eigen-decomposition

M3(I, I, r) = A · Diag(w) Diag(A⊤r) · A⊤,   M2 = A · Diag(w) · A⊤.

Assumption: A ∈ R^{d×k} has full column rank.

Let M2 = UΛU⊤ be its eigen-decomposition, with U ∈ R^{d×k}.

U⊤M2U = Λ ∈ R^{k×k} is invertible.

X = (U⊤M3(I, I, r)U)(U⊤M2U)^{−1} = V · Diag(λ̃) · V^{−1}.

Substitution: X = (U⊤A) Diag(A⊤r) (U⊤A)^{−1}.

Hence v_i ∝ U⊤a_i.

Technical Detail

Take r = Uθ with θ drawn uniformly from the unit sphere, to ensure an eigen-gap.

Learning Gaussian Mixtures through Eigen-decomposition of Tensor Slices

M3(I, I, r) = A · Diag(w) Diag(A⊤r) · A⊤,   M2 = A · Diag(w) · A⊤.

(U⊤M3(I, I, r)U)(U⊤M2U)^{−1} = V Diag(λ̃) V^{−1}.

Recovery of A (method 1)

Since v_i ∝ U⊤a_i, recover A up to scale: a_i ∝ U v_i.

Recovery of A (method 2)

Let Θ ∈ R^{k×k} be a random rotation matrix.

Consider the k slices M3(I, I, θ_i) and find the eigen-decomposition above for each.

Let Λ̃ = [λ̃_1 | … | λ̃_k] be the matrix of eigenvalues of all the slices.

Recover A as A = UΘ^{−1}Λ̃.
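A sketch of recovery method 1 in MATLAB (illustrative; it assumes M2, M3 and the dimensions d, k from the earlier snippets; with exact moments the eigenvalues of X are real and, with high probability, distinct):

% Recover the component means (up to scale and ordering) from one slice.
[U, ~] = eigs(M2, k);                     % top-k eigenvectors of M2
theta = randn(k, 1); theta = theta / norm(theta);
r = U * theta;                            % random direction r = U*theta
slice = reshape(reshape(M3, d*d, d) * r, d, d);        % M3(I, I, r)
X = (U' * slice * U) / (U' * M2 * U);     % (U' M3(I,I,r) U)(U' M2 U)^{-1}
[V, ~] = eig(X);                          % columns v_i proportional to U' a_i
A_hat = U * V;                            % columns proportional to the a_i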

Putting it together

Implications

Learn Gaussian mixtures through slices of the third-order moment.

Guaranteed learning through eigen-decomposition.

“A Method of Moments for Mixture Models and Hidden Markov Models,” by A. Anandkumar, D. Hsu, and S. M. Kakade. Proc. of COLT, June 2012.

Shortcomings

The resulting product is not symmetric, so the eigen-decomposition V Diag(λ̃) V^{−1} does not yield an orthonormal V. More involved in practice.

Recovery requires a good eigen-gap in Diag(λ̃). For r = Uθ, with θ drawn uniformly from the unit sphere, the gap is only of order 1/k^{2.5}, which causes numerical instability in practice.

M3(I, I, r) is only a (random) slice of the tensor; the full information is not utilized.

More efficient learning methods using higher order moments?

Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion

Tensor Factorization

Recover A and w from M2 and M3:

M2 = ∑_i w_i a_i ⊗ a_i,   M3 = ∑_i w_i a_i ⊗ a_i ⊗ a_i.

[Figure: M3 depicted as a sum of rank-1 terms, w_1 · a_1 ⊗ a_1 ⊗ a_1 + w_2 · a_2 ⊗ a_2 ⊗ a_2 + … ]

a ⊗ a ⊗ a is a rank-1 tensor, since its (i_1, i_2, i_3)-th entry is a_{i1} a_{i2} a_{i3}.

M3 is a sum of rank-1 terms.

When is this the most compact representation? (Identifiability.)

Can we recover the decomposition? (Algorithm.)

The most compact such representation is known as the CP decomposition (CANDECOMP/PARAFAC).

Initial Ideas

M3 = ∑_i w_i a_i ⊗ a_i ⊗ a_i.

What if A has orthogonal (unit-norm) columns?

M3(I, a_1, a_1) = ∑_i w_i ⟨a_i, a_1⟩² a_i = w_1 a_1.

The a_i are eigenvectors of the tensor M3.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.
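A tiny numerical check of this eigenvector property when the columns of A are orthonormal (everything below is made up for illustration):

% Build M3 with orthonormal a_i and verify M3(I, a_1, a_1) = w_1 * a_1.
d = 6; k = 3;
[A, ~] = qr(randn(d, k), 0);              % orthonormal columns a_1, ..., a_k
w = rand(k, 1);
outer3 = @(a, b, c) reshape(a * kron(c, b).', d, d, d);
M3 = zeros(d, d, d);
for i = 1:k
    M3 = M3 + w(i) * outer3(A(:, i), A(:, i), A(:, i));
end
a1 = A(:, 1);
lhs = reshape(reshape(M3, d*d, d) * a1, d, d) * a1;    % M3(I, a_1, a_1)
norm(lhs - w(1) * a1)                     % essentially zero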

Two Problems

How to find eigenvectors of a tensor?

A is not orthogonal in general.

Whitening

M3 = ∑_i w_i a_i ⊗ a_i ⊗ a_i,   M2 = ∑_i w_i a_i ⊗ a_i.

Find a whitening matrix W such that V = W⊤A has orthogonal columns.

When A ∈ R^{d×k} has full column rank, this is an invertible transformation.

[Figure: whitening by W maps the non-orthogonal component means a_1, a_2, a_3 to orthogonal vectors v_1, v_2, v_3.]

Using Whitening to Obtain an Orthogonal Tensor

[Figure: the d × d × d tensor M3 is mapped to a k × k × k tensor T.]

Multilinear transform, with M3 ∈ R^{d×d×d} and T ∈ R^{k×k×k}:

T = M3(W, W, W) = ∑_i w_i (W⊤a_i)^{⊗3}.

T = ∑_{i∈[k]} w_i · v_i ⊗ v_i ⊗ v_i is an orthogonal tensor.

Dimensionality reduction when k ≪ d.

How to Find the Whitening Matrix?

M3 = ∑_i w_i a_i ⊗ a_i ⊗ a_i,   M2 = ∑_i w_i a_i ⊗ a_i.

Use the pairwise moments M2 to find W such that W⊤M2W = I.

From the eigen-decomposition M2 = U Diag(λ̃) U⊤, take W = U Diag(λ̃)^{−1/2}.

V := W⊤A Diag(w)^{1/2} is an orthogonal matrix.

T = M3(W, W, W) = ∑_i w_i^{−1/2} (W⊤a_i √w_i)^{⊗3} = ∑_i λ_i v_i ⊗ v_i ⊗ v_i,   λ_i := w_i^{−1/2}.

T is an orthogonal tensor.
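A sketch of this whitening step and the multilinear transform in MATLAB (illustrative; assumes M2, M3 and the dimensions d, k as before, with M2 of rank k; because M3 is symmetric and the same W is applied on every mode, one reshape with a Kronecker product suffices):

% Whitening matrix from M2, then T = M3(W, W, W) as a k-by-k-by-k tensor.
[U, L] = eigs(M2, k);                     % M2 ~= U * L * U'
W = U * diag(1 ./ sqrt(diag(L)));         % W' * M2 * W = I
T = reshape(W' * reshape(M3, d, d*d) * kron(W, W), k, k, k);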

Orthogonal Tensor Eigen Decomposition

T = ∑_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i,   ⟨v_i, v_j⟩ = δ_{i,j} for all i, j.

T(I, v_1, v_1) = ∑_i λ_i ⟨v_i, v_1⟩² v_i = λ_1 v_1.

The v_i are eigenvectors of the tensor T.

Tensor Power Method

Start from an initial vector v and iterate  v ↦ T(I, v, v) / ‖T(I, v, v)‖.

Canonical recovery method

Randomly initialize the power method; run to convergence to obtain an eigenvector v with eigenvalue λ.

Deflate: T ← T − λ v ⊗ v ⊗ v, and repeat.

Putting it together

Gaussian mixture: x = Ah + z, where E[h] = w and z ∼ N(0, σ²I).

M2 = ∑_i w_i a_i ⊗ a_i,   M3 = ∑_i w_i a_i ⊗ a_i ⊗ a_i.

Obtain the whitening matrix W from the SVD of M2.

Use W for the multilinear transform: T = M3(W, W, W).

Find the eigenvectors of T through the power method and deflation.

What about learning other latent variable models?

Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion

Topic Modeling

Probabilistic Topic Models

Useful abstraction for automatic categorization of documents.

Observed: words. Hidden: topics.

Bag of words: the order of words does not matter.

Graphical model representation

l words in a document: x_1, …, x_l.

h: proportions of topics in the document.

Word x_i is generated from topic y_i.

Exchangeability: x_1 ⊥⊥ x_2 ⊥⊥ … | h.

A(i, j) := P[x_m = i | y_m = j]: the topic-word matrix.

[Figure: graphical model with topic mixture h, topics y_1, …, y_5, and words x_1, …, x_5, each word generated via the topic-word matrix A.]

Distribution of the topic proportions vector h

If there are k topics, h has a distribution over the simplex ∆^{k−1},

∆^{k−1} := {h ∈ R^k : h_i ∈ [0, 1], ∑_i h_i = 1}.

Simplification for today: there is only one topic in each document.
◮ h ∈ {e_1, …, e_k}: basis vectors.

Tomorrow: Latent Dirichlet allocation.

Formulation as Linear Models

Distribution of the words x_1, x_2, …

Order the d words in the vocabulary. If x_1 is the j-th word, assign it e_j ∈ R^d.

The distribution of each x_i is supported on the vertices of ∆^{d−1}.

Properties

P[x_1 | topic h = i] = E[x_1 | topic h = i] = a_i, the i-th column of A.

Linear model: E[x_i | h] = Ah.

Multiview model: h is fixed and multiple words x_i are generated.

Geometric Picture for Topic Models

[Figure: the topic proportions vector h lies on the simplex, with each document a point; in the single-topic case h sits at a vertex, and the words x_1, x_2, x_3 are generated through A from that vertex.]

Gaussian Mixtures vs. Topic Models

(Spherical) mixture of Gaussians:

k means a_1, …, a_k.

Component h = i with probability w_i.

Observe x with spherical noise: x = a_i + z, z ∼ N(0, σ_i²I).

(Single-)topic model:

k topics a_1, …, a_k.

Topic h = i with probability w_i.

Observe l exchangeable words x_1, x_2, …, x_l drawn i.i.d. from a_i.

Unified linear model: E[x | h] = Ah.

Gaussian mixture: single view, spherical noise.

Topic model: multi-view, heteroskedastic noise.

Three words per document suffice for learning:

M3 = ∑_i w_i a_i ⊗ a_i ⊗ a_i,   M2 = ∑_i w_i a_i ⊗ a_i.
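Filling in the step behind this claim (using only the model assumptions above): by conditional independence of the words given h, and E[x_j | h] = Ah with h = e_i with probability w_i,

E[x_1 ⊗ x_2] = E[ E[x_1 | h] ⊗ E[x_2 | h] ] = E[(Ah) ⊗ (Ah)] = ∑_i w_i a_i ⊗ a_i = M2,

E[x_1 ⊗ x_2 ⊗ x_3] = E[(Ah) ⊗ (Ah) ⊗ (Ah)] = ∑_i w_i a_i ⊗ a_i ⊗ a_i = M3.

So, unlike the Gaussian case, no noise correction is needed: the cross-moments of distinct words are already the clean M2 and M3.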

Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion

Recap: Basic Tensor Decomposition Method

Toy Example in MATLAB

Simulated Samples: Exchangeable Model

Whiten The Samples
◮ Second Order Moments
◮ Matrix Decomposition

Orthogonal Tensor Eigen Decomposition
◮ Third Order Moments
◮ Power Iteration

Simulated Samples: Exchangeable Model

Model Parameters

Hidden state: h ∈ basis {e_1, …, e_k}, k = 2.

Observed states: x_i ∈ basis {e_1, …, e_d}, d = 3.

Conditional independence: x_1 ⊥⊥ x_2 ⊥⊥ x_3 | h. Transition matrix: A.

Exchangeability: E[x_i | h] = Ah, for all i ∈ {1, 2, 3}.

Generate Samples Snippet

% setup assumed earlier in the script (values here are illustrative):
% n = 10000; k = 2; d = 3; A_true = [0.6 0.2; 0.3 0.3; 0.1 0.5];
% h = zeros(n, k); x1 = zeros(n, d); x2 = zeros(n, d); x3 = zeros(n, d);
for t = 1:n
    % generate h for this sample
    h_category = (rand() > 0.5) + 1;
    h(t, h_category) = 1;
    transition_cum = cumsum(A_true(:, h_category));
    % generate x1 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x1(t, x_category) = 1;
    % generate x2 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x2(t, x_category) = 1;
    % generate x3 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x3(t, x_category) = 1;
end

Whiten The Samples

Second Order Moments

M2 = (1/n) ∑_t x_1^t ⊗ x_2^t.

Whitening Matrix

W = U_w L_w^{−1/2}, where [U_w, L_w] = k-SVD(M2).

Whiten Data

y_1^t = W⊤x_1^t.

Orthogonal Basis

V = W⊤A → V⊤V = I.

Whitening Snippet

fprintf('The second order moment M2:');
M2 = x1' * x2 / n
[Uw, Lw, Vw] = svd(M2);
fprintf('M2 singular values:'); Lw
W = Uw(:, 1:k) * sqrt(pinv(Lw(1:k, 1:k)));
y1 = x1 * W; y2 = x2 * W; y3 = x3 * W;

a_1 is not orthogonal to a_2, but v_1 ⊥ v_2.

Orthogonal Tensor Eigen Decomposition

Third Order Moments

T = (1/n) ∑_{t∈[n]} y_1^t ⊗ y_2^t ⊗ y_3^t ≈ ∑_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i,   V⊤V = I.

Gradient Ascent

T(I, v_1, v_1) = (1/n) ∑_{t∈[n]} ⟨v_1, y_2^t⟩ ⟨v_1, y_3^t⟩ y_1^t ≈ ∑_i λ_i ⟨v_i, v_1⟩² v_i = λ_1 v_1.

The v_i are eigenvectors of the tensor T.

Orthogonal Tensor Eigen Decomposition

T ← T − ∑_j λ_j v_j^{⊗3},   v ← T(I, v, v) / ‖T(I, v, v)‖.

Power Iteration Snippet

% assumes whitened data y1, y2, y3 (n-by-k) plus Maxiter and TOL defined above
V = zeros(k, k); Lambda = zeros(k, 1);
for i = 1:k
    v_old = rand(k, 1); v_old = normc(v_old);
    for iter = 1:Maxiter
        v_new = (y1' * ((y2 * v_old) .* (y3 * v_old))) / n;
        if i > 1
            % deflation: subtract previously recovered components
            for j = 1:i-1
                v_new = v_new - (V(:, j) * (v_old' * V(:, j))^2) * Lambda(j);
            end
        end
        lambda = norm(v_new); v_new = normc(v_new);
        if norm(v_old - v_new) < TOL
            fprintf('Converged at iteration %d.', iter);
            V(:, i) = v_new; Lambda(i, 1) = lambda;
            break;
        end
        v_old = v_new;
    end
end


[Figure: power-method estimates (red) at iterations 0-5 converging to the ground truth (green).]

Conclusion


Gaussian mixtures and topic models. Unified linear representation.

Learning through higher order moments.

Tensor decomposition via whitening and power method.

Tomorrow’s lecture

Latent variable models and moments. Analysis of tensor power method.

Wednesday’s lecture: implementation of tensor methods.

Code snippet available at

http://newport.eecs.uci.edu/anandkumar/MLSS.html
