Matrix Decomposition Methods in Information Retrieval
Thomas Hofmann, Department of Computer Science, Brown University, www.cs.brown.edu/people/th (& Chief Scientist, RecomMind Inc.)
In collaboration with: Jan Puzicha, UC Berkeley & RecomMind; David Cohen, CMU & Burning Glass
KerMIT & NeuroCOLT Workshop, April 30th to May 2nd 2001, Cumberland Lodge



Overview

1. Introduction: A Brief History of Mechanical IR
2. Latent Semantic Analysis
3. Probabilistic Latent Semantic Analysis
4. Learning (from) Hyperlink Graphs
5. Collaborative Filtering
6. Future Work and Conclusion


1. Introduction: A Brief History of Mechanical IR


Memex – “As we may think.”

Vannevar Bush (1945)

The idea of an easily accessible, individually configurable storehouse of knowledge, the beginning of the literature on mechanized information retrieval:

“Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”

“The world has arrived at an age of cheap complex devices of great reliability; and something is bound to come of it.”


Memex – “As we may think.”

Vannevar Bush (1945)

The civilizational challenge:

“The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships.”

V. Bush, “As we may think”, Atlantic Monthly, 176 (1945), pp. 101-108.


The Thesaurus Approach

Hans Peter Luhn (1957, 1961)
- Words of similar or related meaning are grouped into “notional families”
- Encoding of documents in terms of notional elements
- Matching by measuring the degree of notional similarity
- A common language for annotating documents; key word in context (KWIC) indexing
- “… the faculty of interpretation is beyond the talent of machines.”
- Statistical cues extracted by machines to assist the human indexer; a vocabulary method for detecting similarities

H.P. Luhn, “A statistical approach to mechanical literature searching”, New York, IBM Research Center, 1957.
H.P. Luhn, “The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Text”, Information Retrieval and Machine Translation, 3(2), pp. 1021-1028, 1961.


To Punch or not to punch …

T. Joyce & R.M. Needham (1958)

Lattices & hierarchies of search terms:
“As in other systems, the documents are represented by holes in punched cards which represent the various terms, and in addition, when a hole is punched in any term card, all the terms at higher levels of the lattice […] are also punched.”
The postcoordinate revolution: card sorting at search time!
“Investigations […] to lessen the physical work are continuing.”

T. Joyce & R.M. Needham, “The Thesaurus Approach to Information Retrieval”, American Documentation, 9, pp. 192-197, 1958.


Term Associations

Lauren B. Doyle (1962)

Unusual co-occurrences of pairs of words = associations of words in text

Statistical testing: Chi-square and Pearson correlation coefficient to determine pairwise correlations

Term association maps for interactive retrieval

Today: semantic maps

L.B. Doyle, “Indexing and Abstracting by Association”, Unisys Corporation, 1962.


Vector Space Model

Gerard Salton (1960s/70s)
- Instead of indexing documents by selected index terms, preserve (almost) all terms in automatic indexing
- Represent documents by a high-dimensional vector; each term can be associated with a weight
- Geometrical interpretation

G. Salton, “The SMART Retrieval System – Experiments in Automatic Document Processing”, 1971.


Term-Document Matrix

D = {documents in database}, W = {terms in vocabulary}
The term-document matrix has one row per document $d_i$ and one column per term $w_j$, with entries $c(d_i, w_j)$: the frequency of term $w_j$ in document $d_i$, possibly transformed by term weighting.
Example: the document “Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]” maps to a sparse count vector $x_d$ with non-zero entries for terms such as “artificial” and “intelligence” and zero entries for terms such as “interest” and “artifact”.


Documents in “Inner” Space
Retrieval method:
- rank documents according to their similarity with the query
- term weighting schemes, for example, TFIDF
- used in the SMART system and many successor systems; highly popular

Similarity between document and query: the cosine of the angle between the query and document vectors,
$\mathrm{sim}(d, q) = \cos(d, q) = \frac{\langle d, q \rangle}{\|d\|\,\|q\|}$
[Figure: documents and a query as points on the unit sphere, with example cosine similarities 0.75 and 0.64]


Advantages of the Vector Space Model

- No subjective selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improve retrieval performance)
- Various extensions: document clustering, relevance feedback (modifying the query vector)
- Geometric foundation


2. Latent Semantic Analysis


Limitations of the Vector Space Model
- Dimensionality: the vector space representation is high-dimensional (several 10-100K); learning and estimation have to deal with the curse of dimensionality.
- Sparseness: document vectors are typically very sparse, so cosine similarity can be noisy and inaccurate.
- Semantics: the inner product can only match occurrences of exactly the same terms; the vector representation does not capture semantic relations between words.
- Independence: the bag-of-words representation is unable to capture phrases and semantic/syntactic regularities.


The Lost Meaning of Words …

Ambiguity and association in natural language

- Polysemy: words often have a multitude of meanings and different types of usage (more urgent for very heterogeneous collections). The vector space model is unable to discriminate between different meanings of the same word, so $\mathrm{sim}(d,q) = \cos(d,q)$ can overestimate the true similarity.
- Synonymy: different terms may have an identical or a similar meaning (weaker: words indicating the same topic). No associations between words are made in the vector space representation, so $\mathrm{sim}(d,q) = \cos(d,q)$ can underestimate the true similarity.


Polysemy and Context

Document similarity on the single-word level: polysemy and context.
[Figure: a term such as “saturn” with two context clusters. Meaning 1: ring, jupiter, space, voyager, planet, …; meaning 2: car, company, dodge, ford, … The shared term contributes to similarity if used in the 1st meaning in both documents, but not if one document uses the 2nd.]


Latent Semantic Analysis

General idea:
- Map documents (and terms) to a low-dimensional representation.
- Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
- Compute document similarity based on the inner product in the latent semantic space.
Goals:
- Similar terms map to similar locations in the low-dimensional space.
- Noise reduction by dimension reduction.


LSA: Matrix Decomposition by SVD
Dimension reduction by singular value decomposition of the term-document matrix:
$C = U \Sigma V^t \;\approx\; \hat U \hat\Sigma \hat V^t = \hat C$
- original term-document matrix $C = (c_{ij})$, $c_{ij} = c(d_i, w_j)$: word frequencies, possibly transformed (document length normalization, sublinear transformation such as log, global term weight)
- $U$, $V$: term/document vectors; $\hat\Sigma$: thresholded singular values
- $\hat C$: reconstructed term-document matrix, the L2-optimal approximation of $C$


Background: SVD
Singular value decomposition, definition:
$C = U \Sigma V^t$, \quad sizes: $(n \times m) = (n \times n)(n \times n)(n \times m)$
- $U$, $V$: orthonormal columns
- $\Sigma$: diagonal with singular values (ordered)
Properties:
- existence & uniqueness
- thresholding small singular values yields an optimal low-rank approximation (in the sense of the Frobenius norm):
$\hat C = \hat U \hat\Sigma \hat V^t$, \quad sizes: $(n \times m) = (n \times k)(k \times k)(k \times m)$
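A small numpy sketch of the rank-k truncation described above (the toy matrix and the choice k = 2 are illustrative): compute the SVD, keep the top k singular values, and measure the Frobenius reconstruction error.

```python
import numpy as np

# toy term-document count matrix C (terms x documents)
C = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 3, 1],
              [0, 2, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2  # number of retained dimensions
C_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: C_hat is the best rank-k approximation in Frobenius norm
err = np.linalg.norm(C - C_hat, "fro")
print("singular values:", np.round(s, 3))
print("rank-2 reconstruction error:", round(err, 3))
```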


SVD and PCA
If (!) the rows of $C$ were shifted such that their mean is zero, then
$C C^t = U \Sigma V^t (U \Sigma V^t)^t = U \Sigma^2 U^t$,
and one would essentially perform a projection onto the principal axes defined by the columns of $U$.
Yet this shift would destroy the sparseness of the term-document matrix (and consequently might hurt the performance of SVD methods).


Canonical Analysis
Hirschfeld 1935, Hotelling 1936, Fisher 1940: correlation analysis for contingency tables,
$c_{ij} = c_{i\cdot}\, c_{\cdot j} \Big( 1 + \sum_{k=2}^{K} \lambda_k u_{ik} v_{jk} \Big)$,
with marginals $c_{i\cdot} = \sum_{j=1}^{J} c_{ij}$ and $c_{\cdot j} = \sum_{i=1}^{I} c_{ij}$, subject to the constraints
$\sum_{i=1}^{I} c_{i\cdot} u_{ik} = \sum_{j=1}^{J} c_{\cdot j} v_{jk} = 0$ \quad and \quad $\sum_{i=1}^{I} c_{i\cdot} u_{ik} u_{il} = \sum_{j=1}^{J} c_{\cdot j} v_{jk} v_{jl} = \delta_{kl}$.


Canonical & Correspondence Analysis
Correspondence analysis (as a method of scaling): Guttman 1941, Torgerson 1958, Benzecri 1969, Hill 1974; Whitaker 1967: “gradient analysis”.
“Reciprocal averaging”:
$u_i = \frac{1}{c_{i\cdot}} \sum_j c_{ij} v_j, \qquad v_j = \frac{1}{c_{\cdot j}} \sum_i c_{ij} u_i$
Solutions: the unit vectors and the scores of canonical analysis; equivalently, the SVD of the rescaled matrix with entries $c_{ij} / \sqrt{c_{i\cdot}\, c_{\cdot j}}$.
(Not exactly what is done in LSA.)


Semantic Inner Product / Kernel
Similarity: the inner product in the lower dimensional space. Note that
$\hat C^t \hat C = \hat V \hat\Sigma \hat U^t\, \hat U \hat\Sigma \hat V^t = (\hat\Sigma \hat V^t)^t (\hat\Sigma \hat V^t)$,
so the columns of $\hat\Sigma \hat V^t$ provide the lower dimensional document representation.
For a given decomposition, additional documents or queries can be mapped to the semantic space (folding-in). Since $C = U \Sigma V^t$ implies $V^t = \Sigma^{-1} U^t C$, a new document/query $q$ maps to
$\hat q = \hat\Sigma^{-1} \hat U^t q$.
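A self-contained numpy sketch of folding-in under the identities above (the toy data is reused from the SVD sketch; scoring both query and documents in Sigma-scaled coordinates is my choice, consistent with the inner product $\hat C^t \hat C$):

```python
import numpy as np

C = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 3, 1],
              [0, 2, 0, 1]], dtype=float)
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# folding-in: from C = U Sigma V^t it follows that V^t = Sigma^{-1} U^t C,
# so a new term-count vector q maps to q_hat = Sigma_k^{-1} U_k^t q
q = np.array([1.0, 1.0, 0.0, 0.0])
q_hat = (Uk.T @ q) / sk

# semantic inner products between the folded-in query and the documents
scores = (sk * q_hat) @ (np.diag(sk) @ Vtk)
print(np.round(scores, 3))
```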


Term Associations from LSA

(taken from slide by S. Dumais)

[Figure: terms plotted on axes “Term 1” and “Term 2”, with a “Concept” direction between them]


LSA: Discussion

Pros:
- The low-dimensional document representation is able to capture synonyms.
- Noise removal and robustness by dimension reduction.
- Experimentally: advantages over the naïve vector space model.

Cons:
- “Formally”: the L2 norm is inappropriate as a distance function for count vectors (the reconstruction may contain negative entries).
- “Conceptually”:
  - The problem of polysemy is not addressed: principle of linear superposition, no active disambiguation.
  - The context of terms is not taken into account.
  - Directions in latent space are hard to interpret.
  - No probabilistic model of term occurrences.
  - [ad hoc selection of the number of dimensions, ...]


Features of IR Methods

Feature                          | VSM | LSA
Quantitative relevance score     | yes | yes
Partial query matching           | yes | yes
Document similarity              | yes | yes
Word correlations, synonyms      | no  | yes
Low-dimensional representation   | no  | yes
Notional families, concepts      | no  | not really
Dealing with polysemy            | no  | no
Probabilistic model              | no  | no
Sparse representation            | yes | no


3. Probabilistic Latent Semantic Analysis


Documents as Information Sources

D = {documents in database}, W = {words in vocabulary}; term-document matrix with entries $c(d_i, w_j)$.
A “real” document gives an empirical probability distribution over words, the relative frequencies $\hat P(w|d) = \frac{c(d,w)}{c(d)}$.
Think of it as a sample from an “ideal” document: a (memoryless) information source with distribution $P(w|d)$, from which other documents could also be drawn.


Information Source Models in IR

Bayes rule: the probability of relevance of a document w.r.t. a query,
$P(d|q) \propto P(q|d)\, P(d)$, \quad with prior probability of relevance $P(d)$.
Query translation model: the probability that $q$ is “generated” from $d$,
$P(q|d) = \prod_{t \in q} P(t|d)$,
where the probability that a query term is generated combines a translation model and a language model:
$P(t|d) = \sum_{w} P(t|w)\, P(w|d)$.

J. Ponte & W.B. Croft, “A Language Model Approach to Information Retrieval”, SIGIR 1998.
A. Berger & J. Lafferty, “Information Retrieval as Statistical Translation”, SIGIR 1999.


Probabilistic Latent Semantic Analysis
- How can we learn document-specific language models? Sparseness problem, even for unigrams.
- Probabilistic dimension reduction techniques to overcome the data sparseness problem.
- Factor analysis for count data: factors = concepts.

$P(w|d) = \sum_z P(w|z)\, P(z|d)$
with (topic) factor “sources” $P(w|z)$, document-specific mixing proportions $P(z|d)$, and a latent variable $z$ with a “small” number of states; as a joint model over document “sources”,
$P(d,w) = \sum_z P(w|z)\, P(d|z)\, P(z)$.

T. Hofmann, “Probabilistic Latent Semantic Analysis”, UAI 1999.


PLSA: Graphical Model
[Plate diagram, built up over several slides: an outer plate ranges over the N documents of the collection, an inner plate over the c(d) word occurrences within a single document. For each occurrence, a latent topic $z$ is drawn from $P(z|d)$, which is shared by all words in a document, and a word $w$ is drawn from $P(w|z)$, which is shared by all documents in the collection; together $P(w|d) = \sum_z P(w|z)\, P(z|d)$.]


Probabilistic Latent Semantic Space
- Documents are represented as points in a low-dimensional sub-simplex (dimensionality reduction for probability distributions).
- KL-divergence projection, not an orthogonal one.
[Figure: the probability simplex over words with vertices $P(w|z_1)$, $P(w|z_2)$, $P(w|z_3)$ spanning a sub-simplex; the empirical distribution $\hat P(w|d)$ is embedded in the simplex and projected onto the sub-simplex as $P(w|d)$]


Positive Matrix Decomposition
Mixture decomposition in matrix notation:
$\tilde C = P_d\, \Sigma\, P_w^t$, \quad $(P_d)_{i,k} = P(d_i|z_k)$, \; $(P_w)_{j,k} = P(w_j|z_k)$, \; $\Sigma = \mathrm{diag}(P(z_1), \ldots, P(z_K))$
Constraints:
- non-negativity of all matrices
- normalization according to the L1 norm
- (no orthogonality)

D.D. Lee & H.S. Seung, “Learning the parts of objects by non-negative matrix factorization”, Nature, 1999.


Positive Matrix Decomposition & SVD
Mixture decomposition in matrix notation:
$\tilde C = P_d\, \Sigma\, P_w^t$, \quad compare to \quad $C = U \Sigma V^t \approx \hat U \hat\Sigma \hat V^t = \hat C$
- probabilistic approach vs. linear algebra decomposition
- the conditional independence assumption “replaces” the outer product
- class-conditional distributions “replace” the left/right eigenvectors
- maximum likelihood instead of minimum L2 norm as the fitting criterion:
$\mathcal{L} = \sum_{i,j} c_{ij} \log \tilde c_{ij} = \sum_{i,j} c_{ij} \log \sum_z P(w_j|z)\, P(d_i|z)\, P(z)$


Expectation Maximization Algorithm
Maximizing the log-likelihood by (tempered) EM iterations.
E-step (posterior probabilities of the latent variables):
$P(z|d,w) = \frac{P(d|z)\, P(w|z)\, P(z)}{\sum_{z'} P(d|z')\, P(w|z')\, P(z')}$
(the probability that a term occurrence of $w$ within $d$ is “explained” by topic $z$); in the tempered variant the likelihood part is raised to a power $\beta$:
$P(z|d,w) \propto [P(d|z)\, P(w|z)]^{\beta}\, P(z)$
M-step (maximization of the expected complete-data log-likelihood):
$P(w|z) \propto \sum_d c(d,w)\, P(z|d,w)$, \quad $P(d|z) \propto \sum_w c(d,w)\, P(z|d,w)$, \quad $P(z) \propto \sum_{d,w} c(d,w)\, P(z|d,w)$
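A compact numpy sketch of these updates (an illustration of the formulas above, not the talk's implementation; beta = 1 gives plain, untempered EM, and the random counts are a stand-in for real data):

```python
import numpy as np

def plsa_em(C, K, iters=100, beta=1.0, seed=0):
    """Fit PLSA to a count matrix C (I x J) by (tempered) EM; beta=1 is plain EM."""
    rng = np.random.default_rng(seed)
    I, J = C.shape
    Pz = np.full(K, 1.0 / K)                        # P(z)
    Pd = rng.random((I, K)); Pd /= Pd.sum(axis=0)   # P(d|z), columns sum to 1
    Pw = rng.random((J, K)); Pw /= Pw.sum(axis=0)   # P(w|z)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to [P(d|z) P(w|z)]^beta P(z)
        post = (Pd[:, None, :] * Pw[None, :, :]) ** beta * Pz
        post /= post.sum(axis=2, keepdims=True)
        # M-step: reweight the posteriors by the observed counts c(d,w)
        Nz = np.einsum("ij,ijk->k", C, post)
        Pd = np.einsum("ij,ijk->ik", C, post) / Nz
        Pw = np.einsum("ij,ijk->jk", C, post) / Nz
        Pz = Nz / C.sum()
    return Pz, Pd, Pw

# toy run; the log-likelihood is L = sum_ij c_ij log c~_ij
C = np.random.default_rng(1).integers(1, 5, size=(8, 12)).astype(float)
Pz, Pd, Pw = plsa_em(C, K=3)
print(round((C * np.log(np.einsum("ik,jk,k->ij", Pd, Pw, Pz))).sum(), 2))
```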


Example: Science Magazine Papers

- Dataset with approx. 12K papers from Science Magazine
- Selected concepts from a model with K=200


Example: TDT1 News Stories
- TDT1 = document collection with approx. 16,000 news stories (Reuters, CNN, years 1994/95)
- Results based on a decomposition with 128 concepts
- 2 main factors each for “flight” and “love” (most probable words by $P(w|z)$):

“flight”, factor 1: plane, airport, crash, flight, safety, aircraft, air, passenger, board, airline
“flight”, factor 2: space, shuttle, mission, astronauts, launch, station, crew, nasa, satellite, earth
“love”, factor 1: home, family, like, just, kids, mother, life, happy, friends, cnn
“love”, factor 2: film, movie, music, new, best, hollywood, love, actor, entertainment, star


Folding-in a Document/Query
- TDT1 collection: approx. 16,000 news stories; PLSA model with 128 dimensions
- Query keywords: “aid food medical people UN war”
- 4 most probable factors for the query, with their most probable keywords:
  1. un, bosnian, serbs, bosnia, serb, sarajevo, nato, peacekeep., nations, peace, bihac, war
  2. iraq, iraqui, sanctions, kuwait, un, council, gulf, saddam, baghdad, hussein, resolution, border
  3. refugees, aid, rwanda, relief, people, camps, zaire, camp, food, rwandan, un, goma
  4. building, city, people, rescue, buildings, workers, kobe, victims, area, earthquake, disaster, missing
- Track the posteriors for every keyword during folding-in.


Folding-in a Document/Query (Iteration 1)
[Figure: posterior probabilities of the four factors above for each query keyword (aid, food, medical, people, un, war) after the first folding-in iteration]


Folding-in a Document/Query (Iteration 2)
[Figure: posterior probabilities of the four factors for each query keyword after two iterations]


Folding-in a Document/Query (Iteration 5)
[Figure: posterior probabilities of the four factors for each query keyword after five iterations]


Folding-in a Document/Query (further iterations)
[Figure: posterior probabilities of the four factors for each query keyword after further iterations]


Experiments: Precision-Recall
- 4 test collections (each with approx. 1000-3500 docs)
[Figure: precision [%] vs. recall [%] curves on MED, CRAN, CACM, and CISI, comparing cos (vector space), LSI, and PLSI*]


Experimental Results: TFIDF
[Bar chart: average precision-recall on Medline, CRAN, CACM, and CISI for VSM, LSA, and PLSA]


Experimental Results: TFIDF
[Bar chart: relative gain in average precision-recall (0-50% axis) on Medline, CRAN, CACM, and CISI for VSM, LSA, and PLSA]


From Probabilistic Models to Kernels: The Fisher Kernel
Use the idea of a Fisher kernel:
- Main idea: derive a kernel or similarity function from a generative model.
- How do ML estimates of the parameters change around a point in sample space?
- Derive Fisher scores from the model: $U_x = \nabla_\theta \log P(x|\theta)$, where $\theta$ are the model parameters and $x$ is a sample point.
- Kernel/similarity function: $\mathrm{sim}(x,y) = U_x^t\, I(\hat\theta)^{-1}\, U_y$, with Fisher information matrix $I(\hat\theta)$.

T. Jaakkola & D. Haussler, “Exploiting Generative Models for Discriminative Training”, NIPS 1999.


Semantic Kernel from PLSA: Outline

Outline of the technical derivation:
- Parameterize the multinomials by variance-stabilizing parameters (= square-root parameterization).
- Assume information orthogonality of the parameters of different multinomials (an approximation).
- In each block, an isometric embedding with constant Fisher information is obtained (the inversion problem for the information matrix is circumvented).
… and the result …


Semantic Kernel from PLSA: Result
$\mathrm{sim}(d_i, d_m) = \sum_{k=1}^{K} P(z_k|d_i)\, P(z_k|d_m) \;+\; \alpha \sum_{j=1}^{J} \frac{c_{ij}\, c_{mj}}{c_i\, c_m} \sum_{k=1}^{K} P(z_k|d_i, w_j)\, P(z_k|d_m, w_j)$
- First term: topical overlap, the probability that a randomly chosen word in the first and in the second document refer to the same topic/concept.
- Second term: word overlap (do both documents contain common terms?), weighted by word sense(!) overlap (do both terms refer to the same concept?).
- K=1 essentially reduces to the Vector Space Model (!)
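A small numpy sketch of this similarity as I read the formula above (alpha, the toy inputs, and the function name plsa_fisher_sim are illustrative assumptions, not the paper's code):

```python
import numpy as np

def plsa_fisher_sim(ci, cm, Pz_di, Pz_dm, Pz_diw, Pz_dmw, alpha=1.0):
    """sim(d_i, d_m) per the slide: topical overlap plus
    alpha-weighted word overlap scored by word-sense agreement.

    ci, cm       : term-count vectors of the two documents, shape (J,)
    Pz_di, Pz_dm : P(z|d) for each document, shape (K,)
    Pz_diw       : P(z|d_i, w_j), shape (J, K); likewise Pz_dmw
    """
    topical = Pz_di @ Pz_dm
    word_w = (ci / ci.sum()) * (cm / cm.sum())      # c_ij c_mj / (c_i c_m)
    sense = np.einsum("jk,jk->j", Pz_diw, Pz_dmw)   # per-term topic agreement
    return topical + alpha * (word_w @ sense)

# toy example: J = 4 terms, K = 2 topics
ci = np.array([2.0, 0.0, 1.0, 0.0]); cm = np.array([1.0, 1.0, 1.0, 0.0])
Pz_di = np.array([0.8, 0.2]);        Pz_dm = np.array([0.6, 0.4])
Pz_diw = np.tile(Pz_di, (4, 1));     Pz_dmw = np.tile(Pz_dm, (4, 1))
print(round(plsa_fisher_sim(ci, cm, Pz_di, Pz_dm, Pz_diw, Pz_dmw), 4))
```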


Text Categorization: SVM with PLSA
- Standard text collection: Reuters-21578 (5 main categories), with the standard kernel and the PLSA (Fisher) kernel
- Substantial improvement if additional unlabeled documents are available
[Bar chart: classification error (%) on the categories earn, acq, money, grain, crude for SVM vs. SVM+ (PLSA kernel), with 5%, 20%, and 100% of the labeled training data]


Latent Class Analysis: Example
- Document collection with approx. 1,400 abstracts on “clustering” (INSPEC 1991-1997); preprocessing: stemming, stop word list
- 4 main factors (K=128) for the term “SEGMENT” (most probable, stemmed words):

image segmentation: imag, SEGMENT, textur, color, tissu, brain, slice, cluster, mri, volum
motion segmentation: video, sequenc, motion, frame, scene, SEGMENT, shot, imag, cluster, visual
line matching: constraint, line, match, locat, imag, geometr, impos, SEGMENT, fundament, recogn
speech recognition: speaker, speech, recogni, signal, train, HMM, sourc, speaker-indep., SEGMENT, sound


Document Similarity: Example (1)
Factor loadings on the four “SEGMENT” factors (“image”, “speech”, “video”, “line”):

Multiresolution wavelet decomposition and neuro-fuzzy clustering for segmentation of radiographic images. (loadings: 0.5534, 0.0000, 0.0012, 0.0000)
“Segmentation of medical images is a challenging problem in the field of image analysis. Several diagnostics are based on proper segmentation of the digitized image. Segmentation of medical images is needed for applications involving estimation of the boundary of an object, classification of tissue abnormalities, shape analysis, contour detection and texture segmentation. […]”

Unknown-multiple signal source clustering problem using ergodic HMM and applied to speaker classification. (loadings: 0.0002, 0.6689, 0.0455, 0.0000)
“The authors consider signals originated from a sequence of sources. More specifically, the problems of segmenting such signals and relating the segments to their sources are addressed. This issue has wide applications in many fields. The report describes a resolution method that is based on an ergodic hidden Markov model (HMM), in which each HMM state corresponds to a signal source. […]”

relative similarity (VSM): 1.4; relative similarity (PLSA): 0.7


Document Similarity: Example (2)

McCalpin, J.P.; Nishenko, S.P.: Holocene paleoseismicity, temporal clustering, and probabilities of future large (M>7) earthquakes on the Wasatch fault zone, Utah.
“The chronology of M>7 paleoearthquakes on the central five segments of the Wasatch fault zone (WFZ) contains 16 earthquakes in the past 5500 years with an average repeat time of 350 years. Four of the central five segments ruptured between 620+or-30 and 1230+or-60 calendar years B.P. The remaining segment (Brigham City segment) has not ruptured in the past 2120+or-100 years. Comparison of the WFZ space-time diagram of paleoearthquakes with synthetic paleoseismic histories indicates that the observed temporal clusters and gaps have about an equal probability (depending on model assumptions) of reflecting random coincidence as opposed to intersegment contagion. Regional seismicity suggests […]”

Blatt, M.; Wiseman, S.; Domany, E.: Clustering data through an analogy to the Potts model.
“A new approach for clustering is proposed. This method is based on an analogy to a physical model; the ferromagnetic Potts model at thermal equilibrium is used as an analog computer for this hard optimization problem. We do not assume any structure of the underlying distribution of the data. Phase space of the Potts model is divided into three regions; ferromagnetic, super-paramagnetic and paramagnetic phases. The region of interest is that corresponding to the super-paramagnetic one, where domains of aligned spins appear. The range of temperatures where these structures are stable is indicated by […]”

relative similarity (VSM): 1.0; relative similarity (PLSA): 0.5


Features of IR Methods

Feature                          | LSA        | PLSA
Quantitative relevance score     | yes        | yes
Partial query matching           | yes        | yes
Document similarity              | yes        | yes
Word correlations, synonyms      | yes        | yes
Low-dimensional representation   | yes        | yes
Notional families, concepts      | not really | yes
Dealing with polysemy            | no         | yes
Probabilistic model              | no         | yes
Sparse representation            | no         | yes


4. Learning (from) Hyperlink Graphs


The Importance of Hyperlinks in IR

- Hyperlinks provide latent human annotation: a hyperlink represents an implicit endorsement of the page being pointed to.
- Social structures are reflected in the Web graph (cyber/virtual/Web communities).
- The link structure allows an assessment of page authority: it goes beyond content-based analysis and potentially discriminates between high- and low-quality sites.


HITS (Hyperlink Induced Topic Search)
Jon Kleinberg and the Smart group (IBM). HITS:
- Retrieve a subset of Web pages based on a query-based search: result set + context graph.
- Extract the hyperlink graph of the pages in the subset.
- Rescoring method with hub and authority weights, using the adjacency matrix of the Web subgraph:
Authority scores: $x_p^{(t+1)} = \sum_{q\,:\,(q,p)\in E} y_q^{(t)}$ \quad Hub scores: $y_q^{(t+1)} = \sum_{p\,:\,(q,p)\in E} x_p^{(t+1)}$
- Solution: the left/right eigenvectors (SVD) of the adjacency matrix.

J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment”, 1998.


Learning a Semantic Model of the Web

Making sense of the text:
- Probabilistic latent semantic analysis automatically identifies concepts and topics.
Making sense of the link structure:
- A probabilistic graph model, i.e., a predictive model for additional links/nodes based on existing ones
- Centered around the notion of “Web communities”
- A probabilistic version of HITS
- Enables predicting the existence of hyperlinks: estimate the entropy of the Web graph


Finding Web Communities
Probabilistic model: $P(s,t) = \sum_z P(z)\, P(s|z)\, P(t|z)$
- source nodes $s$ and target nodes $t$ (the two node sets are identical), with community-specific distributions $P(s|z)$ and $P(t|z)$
- Web community: a densely connected bipartite subgraph
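Since this is the PLSA mixture with the term-document matrix swapped for the link matrix, the plsa_em sketch given after the EM slide applies unchanged; an illustrative use (the 5-node link matrix is made up, and plsa_em is the assumed helper from that earlier sketch):

```python
import numpy as np

# links[s, t] = 1 iff source page s links to target page t
links = np.array([[1, 1, 0, 0, 0],
                  [1, 1, 1, 0, 0],
                  [0, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1],
                  [0, 0, 1, 1, 1]], dtype=float)

# reusing plsa_em from the earlier EM sketch: P(z), P(s|z), P(t|z)
Pz, Ps, Pt = plsa_em(links, K=2)

# community membership of each source node: P(z|s) proportional to P(s|z) P(z)
Pzs = Ps * Pz
Pzs /= Pzs.sum(axis=1, keepdims=True)
print(np.round(Pzs, 2))
```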


Decomposing the Web Graph
[Figure: a Web subgraph decomposed into Community 1, Community 2, and Community 3]
- Links (probabilistically) belong to exactly one community.
- Nodes may belong to multiple communities.


Linking Hyperlinks and Content
PLSA and PHITS (probabilistic HITS) can be combined into one joint decomposition model:
[Diagram: a shared latent concept/topic variable $z$ with source distribution $P(z|s)$ generates both words $w$ via $P(w|z)$ (content) and link targets $t$ via $P(t|z)$ (Web community)]


“Ulysses” Webs: Space, War, and Genius (no heroes wanted)
- Decomposition of a base set generated from Altavista with the query “Ulysses”
- Combined decomposition based on links and text; three factors, with their most probable terms and URLs:

Space: ulysses 0.022082, space 0.015334, page 0.013885, home 0.011904, nasa 0.008915, science 0.007417, solar 0.007143, esa 0.006757, mission 0.006090; ulysses.jpl.nasa.gov/ 0.028583, helio.estec.esa.nl/ulysses 0.026384, www.sp.ph.ic.ak.uk/Ulysses 0.026384

War: grant 0.019197, s 0.017092, ulysses 0.013781, online 0.006809, war 0.006619, school 0.005966, poetry 0.005762, president 0.005259, civil 0.005065; www.lib.siu.edu/projects/usgrant/ 0.019358, www.whitehouse.gov/WH/glimpse/presidents/ug18.html 0.017598, saints.css.edu/mkelsey/gppg.html 0.015838

Genius: page 0.020032, ulysses 0.013361, new 0.010455, web 0.009060, site 0.009009, joyce 0.008430, net 0.007799, teachers 0.007236, information 0.007170; http://www.purchase.edu/Joyce/Ulysses.htm 0.008469, http://www.bibliomania.com/Fiction/joyce/ulysses/index.html 0.007274, http://teachers.net/chatroom/ 0.005082

D. Cohn & T. Hofmann, “The Missing Link”, NIPS 2001.


5. Collaborative Filtering


Personalized Information Filtering
[Diagram: users/customers connected to objects by a judgement/selection relation, e.g., “likes”, “has seen”]


Predicting Preferences and Actions
Example user profile:
- Dr. Strangelove: *****
- Three Colors: Blue: *****
- Fargo: *****
- Pretty Woman: *
Movie? Rating? Predict the rating this user would assign to a movie not yet rated.


Collaborative and Content-Based Filtering

Collaborative/social filtering:
- Properties of persons or similarities between persons are used to improve predictions.
- Makes use of user profile data.
- Formally: the starting point is a sparse matrix of user ratings.

Content-based filtering:
- Properties of objects or similarities between objects are used to improve predictions.


PLSA for Predicting User Ratings

- Multi-valued (or real-valued) rating $v \in \{0, 1, 2, 3, 4, 5\}$
- Latent variable model over user $u$, item $y$, and rating $v$: the preference $v$ is independent of the person $u$ given the latent state $z$ (“community-based” variant)
- Each user is represented by a specific probability distribution $P(z|u)$
- Analogy to IR: [user = document], [items = terms]
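A sketch of the prediction step under this model (my notation; $P(v|y,z)$ and $P(z|u)$ would come from an EM fit like the PLSA sketch earlier): the expected rating is $E[v|u,y] = \sum_v v \sum_z P(v|y,z)\, P(z|u)$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, Y, V = 3, 5, 6          # communities, items, rating values 0..5

# assumed model parameters (in practice fitted by EM)
Pv_yz = rng.random((Y, K, V))
Pv_yz /= Pv_yz.sum(axis=2, keepdims=True)   # P(v | y, z)
Pz_u = np.array([0.7, 0.2, 0.1])            # P(z | u) for one user

def expected_rating(y):
    # P(v|u,y) = sum_z P(v|y,z) P(z|u);  E[v] = sum_v v P(v|u,y)
    Pv = Pz_u @ Pv_yz[y]                    # shape (V,)
    return float(np.arange(V) @ Pv)

for y in range(Y):
    print(f"item {y}: predicted rating {expected_rating(y):.2f}")
```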


PLSA vs. Memory-Based Approaches

Standard approach (memory-based):
- Given the active user, compute the correlation with all user profiles in the database (e.g., Pearson).
- Transform the correlations into relative weights and perform a weighted prediction over the neighbors.

PLSA:
- Explicitly decomposes preferences: interests are inherently “multi-dimensional”; no global similarity function is used (!)
- Probabilistic model
- Data mining: interest groups


EachMovie Data Set (I)
- EachMovie: >40K users, >1.6K movies, >2M votes
- Experimental evaluation: comparison with a memory-based method (competitive), leave-one-out protocol
- Prediction accuracy: Baseline 33.4, Memory-based 35.3, PLSA (K=20) 39.9, PLSA (K=200) 40.8


EachMovie Data Set (II)
- Absolute deviation (lower is better): Baseline 1.091, Memory-based 0.951, PLSA (K=20) 0.947, PLSA (K=200) 0.924


EachMovie Data Set (III)
- Ranking score (exponential fall-off of weights with position in the recommendation list): Baseline 26.95, Memory-based 27.89, PLSA (K=20) 44.64, PLSA (K=200) 45.98


Interest Groups, EachMovie
[Figure: an example interest group extracted from the EachMovie data]


Dis-Interest Groups, EachMovie
[Figure: an example dis-interest group extracted from the EachMovie data]


6. Open Problems & Conclusions


Scalability of Matrix Decomposition

- RecomMind Inc., retrieval engine: >1M documents, >50K vocabulary, >1K concepts
- Internet Archive (www.archive.org): large-scale Web experiments, >10M sites


Conclusion: Matrix Decomposition

- Enables semantic document indexing: concepts, notional families
- Increased robustness in information retrieval
- Text/data mining: finding regularities & patterns
- Improved categorization by providing more suitable document representations
- The probabilistic nature of the models allows the use of formal inference
- Very versatile: term-document matrix, adjacency matrix, rating matrix, etc.


Open Problems

Conceptual:
- Bayesian model learning and model combination
- Distributed learning of latent class models
- Relational Bayesian networks (Koller et al.)
- Principled ways to exploit sparseness in algorithm design
- Beyond bag-of-words models (string kernels, bigram language models)

Applications:
- Combining content filtering with collaborative filtering
- Personalized information retrieval
- Interactive retrieval using extracted structure
- Multimedia retrieval
- New application domains