Guaranteed Learning of Latent Variable Models

through Spectral and Tensor Methods

Anima Anandkumar

U.C. Irvine

Application 1: Clustering

Basic operation of grouping data points.

Hypothesis: each data point belongs to an unknown group.

Probabilistic/latent variable viewpoint

The groups represent different distributions. (e.g. Gaussian).

Each data point is drawn from one of the given distributions (e.g., Gaussian mixtures).


Application 2: Topic Modeling

Document modeling

Observed: words in document corpus.

Hidden: topics.

Goal: carry out document summarization.


Application 3: Understanding Human Communities

Social Networks

Observed: network of social ties, e.g. friendships, co-authorships

Hidden: groups/communities of actors.


Application 4: Recommender Systems

Recommender System

Observed: Ratings of users for various products, e.g. yelp reviews.

Goal: Predict new recommendations.

Modeling: Find groups/communities of users and products.


Application 5: Feature Learning

Feature Engineering

Learn good features/representations for classification tasks, e.g., image and speech recognition.

Sparse representations, low dimensional hidden structures.


Application 6: Computational Biology

Observed: gene expression levels

Goal: discover gene groups

Hidden variables: regulators controlling gene groups


How to model hidden effects?

Basic Approach: mixtures/clusters

Hidden variable h is categorical.

Advanced: Probabilistic models

Hidden variable h has more general distributions.

Can model mixed memberships.

(Figure: a probabilistic model with hidden variables h1, h2, h3 and observed variables x1, . . . , x5.)

This talk: basic mixture model. Tomorrow: advanced models.

Challenges in Learning

Basic goal in all mentioned applications

Discover hidden structure in data: unsupervised learning.

Challenge: Conditions for Identifiability

When can model be identified (given infinite computation and data)?

Does identifiability also lead to tractable algorithms?

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard.

In practice, EM and variational Bayes have no consistency guarantees.

Efficient computational and sample complexities?

In this series: guaranteed and efficient learning through spectral methods

What this talk series will cover

This talk

Start with the most basic latent variable model: Gaussian mixtures.

Basic spectral approach: principal component analysis (PCA).

Beyond correlations: higher order moments.

Tensor notations, PCA extension to tensors.

Guaranteed learning of Gaussian mixtures.

Introduce topic models. Present a unified viewpoint.

Tomorrow: using tensor methods for learning latent variable models and analysis of the tensor decomposition method.

Wednesday: implementation of tensor methods.


Resources for Today’s Talk

“A Method of Moments for Mixture Models and Hidden Markov Models,” by A. Anandkumar, D. Hsu, and S.M. Kakade. Proc. of COLT, June 2012.

“Tensor Decompositions for Learning Latent Variable Models,” by A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade, and M. Telgarsky, Oct. 2012.

Resources available at http://newport.eecs.uci.edu/anandkumar/MLSS.html


Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion

Warm-up: PCA

Optimization problem

For (centered) points xi ∈ R^d, find a projection P ∈ R^{d×d} with Rank(P) = k that solves

min_P (1/n) ∑_{i∈[n]} ‖xi − P xi‖².

Result: If S = Cov(X) has eigen-decomposition S = UΛU⊤, then P = U_(k) U_(k)⊤, where U_(k) holds the top-k eigenvectors.

Proof

By the Pythagorean theorem, ∑_i ‖xi − P xi‖² = ∑_i ‖xi‖² − ∑_i ‖P xi‖², so it suffices to maximize

(1/n) ∑_i ‖P xi‖² = (1/n) ∑_i Tr[P xi xi⊤ P⊤] = Tr[P S P⊤].
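
As an aside (not part of the original slides), here is a minimal MATLAB sketch of this result on synthetic data; all names and sizes are illustrative placeholders.

PCA Snippet (illustrative sketch)

d = 10; k = 3; n = 5000;
X = randn(n, d) * randn(d, d);            % synthetic data, rows are samples
X = X - mean(X, 1);                       % center the points
S = (X' * X) / n;                         % sample covariance S
[U, L] = eig(S);                          % S = U L U'
[lam, idx] = sort(diag(L), 'descend');
U = U(:, idx);                            % eigenvectors ordered by eigenvalue
P = U(:, 1:k) * U(:, 1:k)';               % optimal rank-k projector P = U_(k) U_(k)'
avg_err = mean(sum((X - X * P).^2, 2));   % (1/n) sum_i ||x_i - P x_i||^2
fprintf('avg error %.4f vs. sum of discarded eigenvalues %.4f\n', avg_err, sum(lam(k+1:end)));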

Review of linear algebra

For a matrix S, u is an eigenvector if Su = λu, and λ is the corresponding eigenvalue.

For symmetric S ∈ R^{d×d}, there are d eigenvalues, and S = ∑_{i∈[d]} λi ui ui⊤ with U = [u1, . . . , ud] orthogonal.

Rayleigh Quotient

For a matrix S with eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λd and corresponding eigenvectors u1, . . . , ud,

max_{‖z‖=1} z⊤Sz = λ1,   min_{‖z‖=1} z⊤Sz = λd,

and the optimizing vectors are u1 and ud.

Optimal Projection

max_{P: P² = P, Rank(P) = k} Tr(P⊤SP) = λ1 + λ2 + . . . + λk, and the optimal P spans {u1, . . . , uk}.

PCA on Gaussian Mixtures

k Gaussians: each sample is x = Ah + z.

h ∈ {e1, . . . , ek}, the basis vectors. E[h] = w.

A ∈ R^{d×k}: columns are the component means.

Let µ := Aw be the mean.

z ∼ N(0, σ²I) is white Gaussian noise.

E[(x − µ)(x − µ)⊤] = ∑_{i∈[k]} wi (ai − µ)(ai − µ)⊤ + σ²I.

How the above equation is obtained

E[(x − µ)(x − µ)⊤] = E[(Ah − µ)(Ah − µ)⊤] + E[zz⊤] = ∑_{i∈[k]} wi (ai − µ)(ai − µ)⊤ + σ²I.

PCA on Gaussian Mixtures

E[(x − µ)(x − µ)⊤] = ∑_{i∈[k]} wi (ai − µ)(ai − µ)⊤ + σ²I.

The vectors {ai − µ} are linearly dependent: ∑_i wi (ai − µ) = 0. The PSD matrix ∑_{i∈[k]} wi (ai − µ)(ai − µ)⊤ therefore has rank ≤ k − 1.

(k − 1)-PCA of the covariance matrix, together with the mean µ, yields span(A).

The lowest eigenvalue of the covariance matrix yields σ².
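
A hedged MATLAB sketch of these two facts on simulated data (illustrative only, not the slides' code): the smallest covariance eigenvalues estimate σ², and the top (k − 1) principal directions together with µ span span(A).

Spherical Mixture PCA Snippet (illustrative sketch)

d = 10; k = 3; n = 100000; sigma = 0.5;
A = randn(d, k);                          % columns are component means
labels = randi(k, n, 1);                  % uniform mixing weights w_i = 1/k
X = A(:, labels)' + sigma * randn(n, d);  % x = A h + z, rows are samples
mu = mean(X, 1)';
[U, L] = eig(cov(X));
[lam, idx] = sort(diag(L), 'descend');
U = U(:, idx);
sigma2_hat = mean(lam(k:end));            % the d-k+1 smallest eigenvalues are ~ sigma^2
B = [U(:, 1:k-1), mu];                    % top (k-1) directions together with the mean
Pb = B * pinv(B);                         % projector onto span(B), which should contain every a_i
fprintf('sigma^2: true %.3f, estimated %.3f\n', sigma^2, sigma2_hat);
fprintf('relative span error: %.2e\n', norm(A - Pb * A, 'fro') / norm(A, 'fro'));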

How to Learn A?

Learning through Spectral Clustering

Learning A through Spectral Clustering

Project samples x onto span(A).

Distance-based clustering (e.g. k-means).

A series of works, e.g. Vempala & Wang.

Failure to cluster under large variance.

Learning Gaussian Mixtures Without Separation Constraints?

Beyond PCA: Spectral Methods on Tensors

How to learn the component means A (not just its span) without separation constraints?

PCA is a spectral method on (covariance) matrices.

◮ Are higher order moments helpful?

What if the number of components is greater than the observed dimensionality, k > d?

◮ Do higher order moments help to learn overcomplete models?

What if the data is not Gaussian?

◮ Moment-based Estimation of probabilistic latent variable models?


Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion


Tensor Notation for Higher Order Moments

Multi-variate higher order moments form tensors.

Are there spectral operations on tensors akin to PCA?

Matrix

E[x ⊗ x] ∈ R^{d×d} is a second order tensor.

E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}].

For matrices: E[x ⊗ x] = E[xx⊤].

Tensor

E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third order tensor.

E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
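
As an illustration (not from the original slides), a short MATLAB sketch of these objects as empirical moments over n samples: E[x ⊗ x] becomes a d × d matrix and E[x ⊗ x ⊗ x] a d × d × d array; the data here is a placeholder.

Empirical Moments Snippet (illustrative sketch)

d = 5; n = 20000;
X = randn(n, d) * randn(d, d);            % placeholder data, rows are samples
E2 = (X' * X) / n;                        % (E2)_{i1,i2} = empirical E[x_{i1} x_{i2}]
E3 = zeros(d, d, d);                      % (E3)_{i1,i2,i3} = empirical E[x_{i1} x_{i2} x_{i3}]
for t = 1:n
    xt = X(t, :)';
    E3 = E3 + reshape(xt * reshape(xt * xt', 1, d*d), [d, d, d]);
end
E3 = E3 / n;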


Third order moment for Gaussian mixtures

Consider a mixture of k Gaussians: each sample is x = Ah + z.

h ∈ {e1, . . . , ek}, the basis vectors. E[h] = w.

A ∈ R^{d×k}: columns are the component means. Let µ := Aw be the mean.

z ∼ N(0, σ²I) is white Gaussian noise.

E[x ⊗ x ⊗ x] = ∑_i wi ai ⊗ ai ⊗ ai + σ² ∑_i (µ ⊗ ei ⊗ ei + . . .)

Intuition behind the equation

E[x ⊗ x ⊗ x] = E[(Ah) ⊗ (Ah) ⊗ (Ah)] + E[(Ah) ⊗ z ⊗ z] + . . .
= ∑_i wi · ai ⊗ ai ⊗ ai + σ² ∑_i (µ ⊗ ei ⊗ ei + . . .)

How to recover parameters A and w from third order moment?

Simplifications

E[x ⊗ x ⊗ x] = ∑_i wi ai ⊗ ai ⊗ ai + σ² ∑_i (µ ⊗ ei ⊗ ei + . . .)

σ² is obtained from σ_min(Cov(x)).

E[x ⊗ x] = ∑_i wi ai ⊗ ai + σ²I.

Can obtain

M3 = ∑_i wi ai ⊗ ai ⊗ ai,   M2 = ∑_i wi ai ⊗ ai.
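
A hedged MATLAB sketch of this step on simulated mixture data (illustrative, not the slides' code). It spells out the permutation terms hidden in the “+ . . .” of the σ² correction, namely µ ⊗ ei ⊗ ei, ei ⊗ µ ⊗ ei and ei ⊗ ei ⊗ µ.

Moment Correction Snippet (illustrative sketch)

d = 8; k = 3; n = 50000; sigma = 0.4;
A = randn(d, k);                          % component means
labels = randi(k, n, 1);                  % uniform weights w_i = 1/k
X = A(:, labels)' + sigma * randn(n, d);  % x = A h + z, rows are samples
mu = mean(X, 1)';
E2 = (X' * X) / n;                        % raw E[x x']
E3 = zeros(d, d, d);                      % raw E[x (x) x (x) x]
for t = 1:n
    xt = X(t, :)';
    E3 = E3 + reshape(xt * reshape(xt * xt', 1, d*d), [d, d, d]);
end
E3 = E3 / n;
sigma2 = min(eig(cov(X)));                % sigma^2 from the smallest covariance eigenvalue
M2 = E2 - sigma2 * eye(d);                % approx. sum_i w_i a_i a_i'
M3 = E3;
for i = 1:d                               % subtract sigma^2 (mu (x) e_i (x) e_i + permutations)
    M3(:, i, i) = M3(:, i, i) - sigma2 * mu;
    M3(i, :, i) = M3(i, :, i) - sigma2 * mu';
    M3(i, i, :) = M3(i, i, :) - reshape(sigma2 * mu, [1, 1, d]);
end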

How to obtain parameters A and w from M2 and M3?

Tensor Slices

M3 = ∑_i wi ai ⊗ ai ⊗ ai,   M2 = ∑_i wi ai ⊗ ai.

Multilinear transformation of a tensor

M3(B, C, D) := ∑_i wi (B⊤ai) ⊗ (C⊤ai) ⊗ (D⊤ai)

Slice of a tensor

M3(I, I, r) = ∑_i wi ai ⊗ ai ⟨ai, r⟩ = A · Diag(w) Diag(A⊤r) · A⊤

M2 = A · Diag(w) · A⊤
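
The slice formula can be checked numerically. The following MATLAB sketch (illustrative only) builds an exact M3 from random ai and wi, contracts its third mode with a random r, and compares the result against A Diag(w) Diag(A⊤r) A⊤.

Tensor Slice Snippet (illustrative sketch)

d = 6; k = 3;
A = randn(d, k); w = rand(k, 1); r = randn(d, 1);
M3 = zeros(d, d, d);                      % M3 = sum_i w_i a_i (x) a_i (x) a_i
for i = 1:k
    ai = A(:, i);
    M3 = M3 + w(i) * reshape(ai * reshape(ai * ai', 1, d*d), [d, d, d]);
end
S = zeros(d, d);                          % the slice M3(I, I, r)
for j = 1:d
    S = S + M3(:, :, j) * r(j);
end
rhs = A * diag(w) * diag(A' * r) * A';
fprintf('max abs difference: %.2e\n', max(abs(S(:) - rhs(:))));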

Eigen-decomposition

M3(I, I, r) = A · Diag(w) Diag(A⊤r) · A⊤,   M2 = A · Diag(w) · A⊤.

Assumption: A ∈ R^{d×k} has full column rank.

Let M2 = UΛU⊤ be the (rank-k) eigen-decomposition, with U ∈ R^{d×k}.

U⊤M2U = Λ ∈ R^{k×k} is invertible.

X = (U⊤ M3(I, I, r) U)(U⊤ M2 U)^{-1} = V · Diag(λ̃) · V^{-1}.

Substitution: X = (U⊤A) Diag(A⊤r) (U⊤A)^{-1}.

We have vi ∝ U⊤ai.

Technical Detail

r = Uθ with θ drawn uniformly from the sphere, to ensure an eigen gap.

Learning Gaussian Mixtures through Eigen-decomposition of Tensor Slices

M3(I, I, r) = A · Diag(w) Diag(A⊤r) · A⊤,   M2 = A · Diag(w) · A⊤.

(U⊤ M3(I, I, r) U)(U⊤ M2 U)^{-1} = V Diag(λ̃) V^{-1}.

Recovery of A (method 1)

Since vi ∝ U⊤ai, recover A up to scale: ai ∝ U vi.

Recovery of A (method 2)

Let Θ ∈ R^{k×k} be a random rotation matrix.

Consider the k slices M3(I, I, θi) and find the eigen-decomposition above for each.

Let Λ̃ = [λ̃1 | . . . | λ̃k] be the matrix of eigenvalues of all the slices.

Recover A as A = UΘ^{-1}Λ̃.
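
An illustrative MATLAB sketch of method 1 on exact population moments (not the slides' code); the columns of the recovered matrix match the ai only up to permutation and scale, as noted above.

Slice Eigen-decomposition Snippet (illustrative sketch)

d = 8; k = 3;
A = randn(d, k); w = rand(k, 1);
M2 = A * diag(w) * A';
r = randn(d, 1);
S = A * diag(w) * diag(A' * r) * A';      % the slice M3(I, I, r)
[U, L] = eig(M2);
[lam, idx] = sort(diag(L), 'descend');
U = U(:, idx(1:k));                       % top-k eigenvectors of M2 (M2 has rank k)
Xm = (U' * S * U) / (U' * M2 * U);        % (U' M3(I,I,r) U)(U' M2 U)^{-1}
[V, D] = eig(Xm);                         % Xm = V Diag(lambda_tilde) V^{-1}
A_hat = real(U * V);                      % columns proportional to the a_i
An = A ./ sqrt(sum(A.^2, 1)); Ahn = A_hat ./ sqrt(sum(A_hat.^2, 1));
fprintf('max |cosine| per estimated column: '); fprintf('%.3f ', max(abs(An' * Ahn))); fprintf('\n');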

Putting it together

Implications

Learn Gaussian mixtures through slices of third-order moment.

Guaranteed learning through eigen decomposition.

“A Method of Moments for Mixture Models and Hidden Markov Models,” by A. Anandkumar, D. Hsu, and S.M. Kakade. Proc. of COLT, June 2012.

Shortcomings

The resulting product is not symmetric: the eigen-decomposition V Diag(λ̃) V^{-1} does not yield an orthonormal V. More involved in practice.

Requires a good eigen-gap in Diag(λ̃) for recovery. For r = Uθ, with θ drawn uniformly from the unit sphere, the gap is 1/k^{2.5}. Numerical instability in practice.

M3(I, I, r) is only a (random) slice of the tensor. The full information is not utilized.

More efficient learning methods using higher order moments?


Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion

Tensor Factorization

Recover A and w from M2 and M3.

M2 = ∑_i wi ai ⊗ ai,   M3 = ∑_i wi ai ⊗ ai ⊗ ai.

(Figure: the tensor M3 drawn as a sum of rank-1 terms, w1 · a1 ⊗ a1 ⊗ a1 + w2 · a2 ⊗ a2 ⊗ a2 + . . .)

a ⊗ a ⊗ a is a rank-1 tensor, since its (i1, i2, i3)-th entry is a_{i1} a_{i2} a_{i3}.

M3 is a sum of rank-1 terms.

When is it the most compact representation? (Identifiability).

Can we recover the decomposition? (Algorithm?)

The most compact representation is known as the CP-decomposition (CANDECOMP/PARAFAC).

Initial Ideas

M3 = ∑_i wi ai ⊗ ai ⊗ ai.

What if A has orthogonal columns?

M3(I, a1, a1) = ∑_i wi ⟨ai, a1⟩² ai = w1 a1.

The ai are eigenvectors of the tensor M3.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Two Problems

How to find eigenvectors of a tensor?

A is not orthogonal in general.


Whitening

M3 = ∑_i wi ai ⊗ ai ⊗ ai,   M2 = ∑_i wi ai ⊗ ai.

Find a whitening matrix W s.t. W⊤A = V is an orthogonal matrix.

When A ∈ R^{d×k} has full column rank, this is an invertible transformation.

(Figure: whitening W maps the non-orthogonal component means a1, a2, a3 to orthonormal vectors v1, v2, v3.)


Using Whitening to Obtain Orthogonal Tensor

(Figure: the d × d × d tensor M3 is mapped to a smaller k × k × k tensor T.)

Multi-linear transform

M3 ∈ R^{d×d×d} and T ∈ R^{k×k×k}.

T = M3(W, W, W) = ∑_i wi (W⊤ai)^{⊗3}.

T = ∑_{i∈[k]} wi · vi ⊗ vi ⊗ vi is orthogonal.

Dimensionality reduction when k ≪ d.


How to Find Whitening Matrix?

M3 = ∑_i wi ai ⊗ ai ⊗ ai,   M2 = ∑_i wi ai ⊗ ai.

Use the pairwise moments M2 to find W s.t. W⊤M2W = I.

From the eigen-decomposition M2 = U Diag(λ̃) U⊤, take W = U Diag(λ̃^{-1/2}).

V := W⊤A Diag(w)^{1/2} is an orthogonal matrix.

T = M3(W, W, W) = ∑_i wi^{-1/2} (W⊤ai √wi)^{⊗3} = ∑_i λi vi ⊗ vi ⊗ vi,   λi := wi^{-1/2}.

T is an orthogonal tensor.
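
A hedged MATLAB sketch of the whitening step on exact population moments (illustrative only): it builds W from M2, checks that V = W⊤A Diag(w)^{1/2} is orthogonal, forms T = M3(W, W, W), and verifies T(I, v1, v1) = w1^{-1/2} v1.

Population Whitening Snippet (illustrative sketch)

d = 8; k = 3;
A = randn(d, k); w = rand(k, 1);
M2 = A * diag(w) * A';
M3 = zeros(d, d, d);
for i = 1:k
    ai = A(:, i);
    M3 = M3 + w(i) * reshape(ai * reshape(ai * ai', 1, d*d), [d, d, d]);
end
[U, L] = eig(M2);
[lam, idx] = sort(diag(L), 'descend');
U = U(:, idx(1:k)); lam = lam(1:k);
W = U * diag(lam.^(-0.5));                % whitening matrix: W' M2 W = I_k
V = W' * A * diag(sqrt(w));               % should be orthogonal
T = zeros(k, k, k);                       % T = M3(W, W, W)
for j = 1:d
    Gj = W' * M3(:, :, j) * W;            % contract modes 1 and 2 with W
    for c = 1:k
        T(:, :, c) = T(:, :, c) + W(j, c) * Gj;   % contract mode 3 with W
    end
end
v1 = V(:, 1);
Tv = zeros(k, 1);
for c = 1:k
    Tv = Tv + (T(:, :, c) * v1) * v1(c);  % Tv = T(I, v1, v1)
end
fprintf('||V''V - I|| = %.2e\n', norm(V' * V - eye(k), 'fro'));
fprintf('||T(I,v1,v1) - w1^(-1/2) v1|| = %.2e\n', norm(Tv - w(1)^(-0.5) * v1));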

Orthogonal Tensor Eigen Decomposition

T = ∑_{i∈[k]} λi vi ⊗ vi ⊗ vi,   ⟨vi, vj⟩ = δ_{i,j} ∀ i, j.

T(I, v1, v1) = ∑_i λi ⟨vi, v1⟩² vi = λ1 v1.

The vi are eigenvectors of the tensor T.

Tensor Power Method

Start from an initial vector v and iterate v ↦ T(I, v, v) / ‖T(I, v, v)‖.

Canonical recovery method

Randomly initialize the power method. Run to convergence to obtain v with eigenvalue λ.

Deflate: T ← T − λ v ⊗ v ⊗ v and repeat.
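
A rough, self-contained MATLAB sketch of this recovery loop on a small synthetic orthogonal tensor (illustrative only; the slides' own sample-based power iteration snippet appears in the MATLAB walkthrough below).

Dense Power Method Snippet (illustrative sketch)

k = 3;
lam_true = 1 + rand(k, 1);                % positive eigenvalues
[Vt, ~] = qr(randn(k));                   % random orthonormal eigenvectors v_i
T = zeros(k, k, k);                       % T = sum_i lambda_i v_i (x) v_i (x) v_i
for i = 1:k
    vi = Vt(:, i);
    T = T + lam_true(i) * reshape(vi * reshape(vi * vi', 1, k*k), [k, k, k]);
end
V_est = zeros(k, k); lam_est = zeros(k, 1);
for i = 1:k
    v = randn(k, 1); v = v / norm(v);     % random initialization
    for it = 1:100
        Tv = zeros(k, 1);                 % Tv = T(I, v, v)
        for c = 1:k
            Tv = Tv + (T(:, :, c) * v) * v(c);
        end
        v_new = Tv / norm(Tv);
        if norm(v_new - v) < 1e-12, v = v_new; break; end
        v = v_new;
    end
    lam_est(i) = norm(Tv); V_est(:, i) = v;
    T = T - lam_est(i) * reshape(v * reshape(v * v', 1, k*k), [k, k, k]);   % deflate
end
disp([sort(lam_est), sort(lam_true)]);    % estimated vs. true eigenvalues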


Putting it together

Gaussian mixture: x = Ah+ z, where E[h] = w.

z ∼ N (0, σ2I).

M2 = ∑_i wi ai ⊗ ai,   M3 = ∑_i wi ai ⊗ ai ⊗ ai.

Obtain whitening matrix W from SVD of M2.

Use W for multilinear transform: T = M3(W,W,W ).

Find eigenvectors of T through power method and deflation.

What about learning other latent variable models?


Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion


Topic Modeling


Probabilistic Topic Models

Useful abstraction for automatic categorization of documents

Observed: words. Hidden: topics.

Bag of words: order of words does not matter

Graphical model representation

l words in a document x1, . . . , xl.

h: proportions of topics in a document.

Word xi generated from topic yi.

Exchangeability: x1 ⊥⊥ x2 ⊥⊥ . . . |h

A(i, j) := P[xm = i | ym = j]: the topic-word matrix.

(Figure: graphical model for a document; the topic mixture h generates topics y1, . . . , y5, which generate words x1, . . . , x5 through the topic-word matrix A.)

Distribution of the topic proportions vector h

If there are k topics, h is distributed over the simplex ∆^{k−1},

∆^{k−1} := {h ∈ R^k : hi ∈ [0, 1], ∑_i hi = 1}.

Simplification for today: there is only one topic in each document.
◮ h ∈ {e1, . . . , ek}: basis vectors.

Tomorrow: Latent Dirichlet allocation.


Formulation as Linear Models

Distribution of the words x1, x2, . . .

Order the d words in the vocabulary. If x1 is the j-th word, assign x1 = ej ∈ R^d.

Distribution of each xi: supported on vertices of ∆d−1.

Properties

Pr[x1|topic h = i] = E[x1|topic h = i] = ai

Linear Model: E[xi|h] = Ah .

Multiview model: h is fixed and multiple words (xi) are generated.

Geometric Picture for Topic Models

(Figures: a document corresponds to a topic proportions vector h on the simplex; with a single topic, h sits at a vertex; words x1, x2, x3 are then generated through the topic-word matrix A.)


Gaussian Mixtures vs. Topic Models

(Spherical) mixture of Gaussians:

k means: a1, . . . , ak

Component h = i with prob. wi

Observe x with spherical noise: x = ai + z, z ∼ N(0, σi²I)

(Single) topic model:

k topics: a1, . . . , ak

Topic h = i with prob. wi

Observe l (exchangeable) words x1, x2, . . . , xl i.i.d. from ai

Unified Linear Model: E[x|h] = Ah

Gaussian mixture: single view, spherical noise.

Topic model: multi-view, heteroskedastic noise.

Three words per document suffice for learning.

M3 = ∑_i wi ai ⊗ ai ⊗ ai,   M2 = ∑_i wi ai ⊗ ai.
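
A short worked derivation of these moment formulas for the single-topic model, added here for completeness; it only uses the conditional independence x1 ⊥⊥ x2 ⊥⊥ x3 | h and E[xj | h] = Ah:

M2 = E[x1 ⊗ x2] = E[ E[x1 | h] ⊗ E[x2 | h] ] = E[(Ah) ⊗ (Ah)] = ∑_{i∈[k]} wi ai ⊗ ai,

M3 = E[x1 ⊗ x2 ⊗ x3] = E[ E[x1 | h] ⊗ E[x2 | h] ⊗ E[x3 | h] ] = ∑_{i∈[k]} wi ai ⊗ ai ⊗ ai.

Unlike the Gaussian case, no σ² correction is needed, which is why three exchangeable words per document already determine M2 and M3.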


Outline

1 Introduction

2 Warm-up: PCA and Gaussian Mixtures

3 Higher order moments for Gaussian Mixtures

4 Tensor Factorization Algorithm

5 Learning Topic Models

6 Conclusion


Recap: Basic Tensor Decomposition Method

Toy Example in MATLAB

Simulated Samples: Exchangeable Model

Whiten the Samples
◮ Second Order Moments
◮ Matrix Decomposition

Orthogonal Tensor Eigen Decomposition
◮ Third Order Moments
◮ Power Iteration

Simulated Samples: Exchangeable Model

Model Parameters

Hidden state: h ∈ basis {e1, . . . , ek}, k = 2

Observed states: xi ∈ basis {e1, . . . , ed}, d = 3

Conditional independence: x1 ⊥⊥ x2 ⊥⊥ x3 | h. Transition matrix: A

Exchangeability: E[xi | h] = Ah, ∀ i ∈ {1, 2, 3}

Generate Samples Snippet

for t = 1 : n
    % generate h for this sample
    h_category = (rand() > 0.5) + 1;
    h(t, h_category) = 1;
    transition_cum = cumsum(A_true(:, h_category));
    % generate x1 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x1(t, x_category) = 1;
    % generate x2 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x2(t, x_category) = 1;
    % generate x3 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x3(t, x_category) = 1;
end


Whiten The Samples

Second Order Moments: M2 = (1/n) ∑_t x1^{(t)} ⊗ x2^{(t)}

Whitening Matrix: W = Uw Lw^{-0.5}, where [Uw, Lw] = k-svd(M2)

Whiten Data: y1^{(t)} = W⊤ x1^{(t)}

Orthogonal Basis: V = W⊤A → V⊤V = I

Whitening Snippet

fprintf('The second order moment M2:');
M2 = x1' * x2 / n
[Uw, Lw, Vw] = svd(M2);
fprintf('M2 singular values:'); Lw
W = Uw(:, 1:k) * sqrt(pinv(Lw(1:k, 1:k)));
y1 = x1 * W; y2 = x2 * W; y3 = x3 * W;

(Figure: before whitening a1 ̸⊥ a2; after whitening v1 ⊥ v2.)


Orthogonal Tensor Eigen Decomposition

Third Order Moments

T = (1/n) ∑_{t∈[n]} y1^{(t)} ⊗ y2^{(t)} ⊗ y3^{(t)} ≈ ∑_{i∈[k]} λi vi ⊗ vi ⊗ vi,   V⊤V = I

Gradient Ascent

T(I, v1, v1) = (1/n) ∑_{t∈[n]} ⟨v1, y2^{(t)}⟩ ⟨v1, y3^{(t)}⟩ y1^{(t)} ≈ ∑_i λi ⟨vi, v1⟩² vi = λ1 v1.

The vi are eigenvectors of the tensor T.


Orthogonal Tensor Eigen Decomposition

T ← T − ∑_j λj vj^{⊗3},   v ← T(I, v, v) / ‖T(I, v, v)‖

Power Iteration Snippet

V = zeros(k, k); Lambda = zeros(k, 1);
for i = 1:k
    v_old = rand(k, 1); v_old = normc(v_old);
    for iter = 1:Maxiter
        v_new = (y1' * ((y2 * v_old) .* (y3 * v_old))) / n;
        if i > 1
            % deflation
            for j = 1:i-1
                v_new = v_new - (V(:, j) * (v_old' * V(:, j))^2) * Lambda(j);
            end
        end
        lambda = norm(v_new); v_new = normc(v_new);
        if norm(v_old - v_new) < TOL
            fprintf('Converged at iteration %d.', iter);
            V(:, i) = v_new; Lambda(i, 1) = lambda;
            break;
        end
        v_old = v_new;
    end
end


(Figure: power iteration estimates at iterations 0-5. Green: ground truth. Red: estimate at each iteration.)


Conclusion

(Figure: multi-view model with hidden variable h and observed variables x1, x2, x3.)

Gaussian mixtures and topic models. Unified linear representation.

Learning through higher order moments.

Tensor decomposition via whitening and power method.

Tomorrow’s lecture

Latent variable models and moments. Analysis of tensor power method.

Wednesday’s lecture: Implementation of tensor method.

Code snippet available at

http://newport.eecs.uci.edu/anandkumar/MLSS.html