CIS 732 / 830: Machine Learning / Advanced Topics in AI
Computing & Information Sciences, Kansas State University

Lecture 24 of 42

Monday, 24 March 2008

William H. Hsu

Department of Computing and Information Sciences, KSU

KSOL course pages: http://snurl.com/1ydii / http://snipurl.com/1y5ih

Course web site: http://www.kddresearch.org/Courses/Spring-2008/CIS732

Instructor home page: http://www.cis.ksu.edu/~bhsu

Reading:

Today: Section 7.5, Han & Kamber 2e

After spring break: Sections 7.6 – 7.7, Han & Kamber 2e

Model-Based Clustering: Expectation-Maximization


• Organizing data into classes such that there is

• high intra-class similarity

• low inter-class similarity

• Finding the class labels and the number of classes directly from the data (in contrast to classification).

• More informally, finding natural groupings among objects.

Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.

What is Clustering?

Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn


Pedro (Portuguese): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)

Cristovao (Portuguese): Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English)

Miguel (Portuguese): Michalis (Greek), Michael (English), Mick (Irish!)

Hierarchical Clustering: Names (using String Edit Distance)

[Dendrogram: the name variants above (Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter, Peder, Peka, Peadar, Michalis, Michael, Miguel, Mick, Cristovao, Christopher, Christophe, Christoph, Crisdean, Cristobal, Cristoforo, Kristoffer, Krystof) clustered by string edit distance.]

Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn


[Dendrogram: the same name variants clustered by linguistic similarity.]

Pedro (Portuguese/Spanish): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)

Hierarchical Clustering: Names by Linguistic Similarity

Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn


Nearest Neighbor Clustering (not to be confused with Nearest Neighbor Classification)

• Items are iteratively merged into the existing clusters that are closest.

• Incremental

• A threshold, t, is used to determine whether items are added to an existing cluster or a new cluster is created (see the sketch after this slide).

Incremental Clustering [1]
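The following is a minimal sketch (not from the original slides) of the threshold-based incremental clustering loop described above; the running-mean cluster centers, the threshold value, and the toy data are illustrative assumptions.

```python
# Minimal sketch of threshold-based incremental ("leader") clustering.
import numpy as np

def incremental_cluster(points, t):
    """Assign each point to the nearest existing cluster if within
    threshold t, otherwise start a new cluster. Order dependent."""
    centers, members = [], []                    # cluster centers and member lists
    for x in points:
        if centers:
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] <= t:                    # close enough: absorb and update center
                members[j].append(x)
                centers[j] = np.mean(members[j], axis=0)
                continue
        centers.append(np.asarray(x, dtype=float))     # otherwise open a new cluster
        members.append([np.asarray(x, dtype=float)])
    return centers, members

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
    centers, members = incremental_cluster(data, t=2.0)
    print(len(centers), "clusters found")
```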


[Scatter plot: points on a 10 x 10 grid grouped into two initial clusters (1, 2), each with radius given by threshold t.]

Incremental Clustering [2]


[Scatter plot: clusters 1, 2, and 3 on the same grid.]

New data point arrives…

It is within the threshold for cluster 1, so add it to that cluster and update the cluster center.

Incremental Clustering [3]


[Scatter plot: clusters 1 through 4 on the same grid.]

New data point arrives…

It is not within the threshold for cluster 1, so create a new cluster, and so on.

Algorithm is highly order dependent…

It is difficult to determine t in advance…

Incremental Clustering [4]


Similarity and Clustering


Motivation

Problem 1: A query word could be ambiguous. E.g., the query "Star" retrieves documents about astronomy, plants, animals, etc. Solution: visualisation:

Clustering document responses to queries along lines of different topics.

Problem 2: Manual construction of topic hierarchies and taxonomies. Solution:

Preliminary clustering of large samples of web documents.

Problem 3: Speeding up similarity search. Solution:

Restrict the search for documents similar to a query to the most representative cluster(s).


Example

Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)


Clustering

Task: evolve measures of similarity to cluster a collection of documents/terms into groups within which similarity within a cluster is larger than across clusters.

Cluster Hypothesis: given a `suitable' clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.

Similarity measures: represent documents by TFIDF vectors; distance between document vectors; cosine of the angle between document vectors (see the sketch below).

Issues: large number of noisy dimensions; the notion of noise is application dependent.
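As an illustration of the similarity measures listed above, here is a small sketch that builds raw TFIDF vectors and compares documents by the cosine of the angle between them; the toy corpus and the unsmoothed TFIDF weighting are assumptions made only for the example.

```python
# Sketch: TFIDF document vectors and cosine similarity (toy corpus is illustrative).
import math
from collections import Counter

docs = ["star wars space opera film", "space opera classic film", "star cluster astronomy"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
N = len(docs)
df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}   # document frequency

def tfidf(toks):
    tf = Counter(toks)
    return [tf[w] * math.log(N / df[w]) for w in vocab]              # raw tf x idf

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(t) for t in tokenized]
print(round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```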


Top-down clustering

k-Means: Repeat… Choose k arbitrary ‘centroids’ Assign each document to nearest centroid Recompute centroids

Expectation maximization (EM): Pick k arbitrary ‘distributions’ Repeat:

Find probability that document d is generated from distribution f for all d and f

Estimate distribution parameters from weighted contribution of documents
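A minimal k-means sketch matching the loop on this slide (choose k arbitrary centroids, assign each document to the nearest centroid, recompute centroids); the initialization scheme and the convergence test are illustrative choices rather than part of the slide.

```python
# Minimal k-means sketch: assign to nearest centroid, then recompute centroids.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # arbitrary initial centroids
    for _ in range(iters):
        # assignment step: nearest centroid for every point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: recompute each centroid as the mean of its points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```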


Choosing `k'

Mostly problem driven Could be ‘data driven’ only when either

Data is not sparse Measurement dimensions are not too noisy

Interactive Data analyst interprets results of structure discovery


Choosing `k': Approaches

Hypothesis testing: Null Hypothesis (Ho): Underlying density is a mixture of ‘k’ distributions

Require regularity conditions on the mixture likelihood function (Smith’85)

Bayesian Estimation Estimate posterior distribution on k, given data and prior on k. Difficulty: Computational complexity of integration Autoclass algorithm of (Cheeseman’98) uses approximations (Diebolt’94) suggests sampling techniques


Choosing `k': Approaches

Penalised Likelihood: to account for the fact that Lk(D) is a non-decreasing function of k, penalise the number of parameters. Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML.

Assumption: penalised criteria are asymptotically optimal (Titterington 1985)

Cross-Validation Likelihood: find the ML estimate on part of the training data; choose the k that maximises the average of the M cross-validated average likelihoods on held-out data Dtest.

Cross-validation techniques: Monte Carlo Cross Validation (MCCV), v-fold cross validation (vCV)
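A small sketch of the penalised-likelihood idea above: fit mixture models for several values of k and keep the one with the lowest BIC. It assumes scikit-learn's GaussianMixture is available and uses synthetic data; a cross-validated held-out likelihood could be plugged into the same loop.

```python
# Sketch: pick k by a penalized-likelihood criterion (BIC), assuming scikit-learn is available.
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_values):
    scores = {}
    for k in k_values:
        gm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
        scores[k] = gm.bic(X)       # BIC penalizes the number of parameters; smaller is better
    return min(scores, key=scores.get), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.3, (100, 2)) for m in (0.0, 3.0, 6.0)])
    best_k, scores = choose_k_by_bic(X, range(1, 7))
    print("selected k =", best_k)
```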


Similarity and Clustering


Motivation

Problem 1: A query word could be ambiguous. E.g., the query "Star" retrieves documents about astronomy, plants, animals, etc. Solution: visualisation:

Clustering document responses to queries along lines of different topics.

Problem 2: Manual construction of topic hierarchies and taxonomies. Solution:

Preliminary clustering of large samples of web documents.

Problem 3: Speeding up similarity search. Solution:

Restrict the search for documents similar to a query to the most representative cluster(s).


Example

Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)


Clustering

Task: evolve measures of similarity to cluster a collection of documents/terms into groups within which similarity within a cluster is larger than across clusters.

Cluster Hypothesis: given a `suitable' clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.

Collaborative filtering: clustering of two or more sets of objects which have a bipartite relationship


Clustering (contd)

Two important paradigms: Bottom-up agglomerative clustering Top-down partitioning

Visualisation techniques: Embedding of corpus in a low-dimensional space

Characterising the entities: Internally : Vector space model, probabilistic models Externally: Measure of similarity/dissimilarity between pairs

Learning: Supplement stock algorithms with experience with data


Clustering: Parameters

Similarity measure s(d_1, d_2): e.g., cosine similarity

Distance measure \delta(d_1, d_2): e.g., Euclidean distance

Number "k" of clusters

Issues: large number of noisy dimensions; the notion of noise is application dependent


Clustering: Formal specification

Partitioning Approaches Bottom-up clustering Top-down clustering

Geometric Embedding Approaches Self-organization map Multidimensional scaling Latent semantic indexing

Generative models and probabilistic approaches Single topic per document Documents correspond to mixtures of multiple topics


Partitioning Approaches

Partition the document collection into k clusters \{D_1, D_2, \ldots, D_k\}

Choices:

Minimize intra-cluster distance: \sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)

Maximize intra-cluster semblance: \sum_i \sum_{d_1, d_2 \in D_i} s(d_1, d_2)

If cluster representations \bar{D}_i are available: minimize \sum_i \sum_{d \in D_i} \delta(d, \bar{D}_i) or maximize \sum_i \sum_{d \in D_i} s(d, \bar{D}_i)

Soft clustering: d assigned to D_i with `confidence' z_{d,i}; find z_{d,i} so as to minimize \sum_i \sum_{d \in D_i} z_{d,i} \, \delta(d, \bar{D}_i) or maximize \sum_i \sum_{d \in D_i} z_{d,i} \, s(d, \bar{D}_i)

Two ways to get partitions: bottom-up clustering and top-down clustering


Bottom-up clustering (HAC)

Initially G is a collection of singleton groups, each with one document

Repeat: find Γ, Δ in G with the maximum similarity measure s(Γ ∪ Δ); merge group Γ with group Δ

For each Γ, keep track of the best Δ

Use the above information to plot the hierarchical merging process (dendrogram)

To get the desired number of clusters: cut across any level of the dendrogram


Dendrogram

A dendrogram presents the progressive, hierarchy-forming merging process pictorially.
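A short sketch of bottom-up group-average clustering and the dendrogram plot, assuming SciPy and Matplotlib are available; the data, the two-cluster cut, and the "average" linkage choice are illustrative.

```python
# Sketch: bottom-up (agglomerative) clustering and a dendrogram, assuming SciPy is available.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (10, 2)), rng.normal(4, 0.4, (10, 2))])

Z = linkage(X, method="average")                   # group-average merging criterion
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree to obtain 2 clusters
print(labels)

dendrogram(Z)                                      # plot the hierarchical merging process
plt.show()
```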


Similarity measure

Typically s(Γ) decreases with an increasing number of merges

Self-similarity: average pairwise similarity between documents in Γ:

s(\Gamma) = \frac{1}{\binom{|\Gamma|}{2}} \sum_{d_1, d_2 \in \Gamma} s(d_1, d_2)

where s(d_1, d_2) is an inter-document similarity measure (say, cosine of TFIDF vectors)

Other criteria: maximum/minimum pairwise similarity between documents in the clusters


Computation

Un-normalized group profile: \hat{p}(\Gamma) = \sum_{d \in \Gamma} p(d)

Can show:

s(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma| \,(|\Gamma| - 1)}

\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle

O(n^2 \log n) algorithm with n^2 space
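A small numerical check of the profile-based computation reconstructed above, assuming unit-normalized document profiles: the group self-similarity obtained from the summed profile vector matches the average pairwise similarity computed directly.

```python
# Sketch: group self-similarity from the summed profile vector, assuming unit-normalized
# document profiles p(d). Checks <p_hat, p_hat> = |G| + sum of pairwise similarities.
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((6, 10))
P /= np.linalg.norm(P, axis=1, keepdims=True)       # p(d): unit-length document profiles

# direct average pairwise similarity over distinct pairs
n = len(P)
pairwise = sum(P[i] @ P[j] for i in range(n) for j in range(n) if i != j)
s_direct = pairwise / (n * (n - 1))

# same quantity from the un-normalized group profile p_hat = sum of p(d)
p_hat = P.sum(axis=0)
s_profile = (p_hat @ p_hat - n) / (n * (n - 1))

print(np.isclose(s_direct, s_profile))              # True
```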


Similarity

s(d_1, d_2) = \frac{\langle g(c(d_1)), g(c(d_2)) \rangle}{\|g(c(d_1))\| \, \|g(c(d_2))\|}   (inner product)

Normalized document profile: p(d) = \frac{g(c(d))}{\|g(c(d))\|}

Profile for document group \Gamma: p(\Gamma) = \frac{\sum_{d \in \Gamma} p(d)}{\| \sum_{d \in \Gamma} p(d) \|}


Switch to top-down

Bottom-up

Requires quadratic time and space Top-down or move-to-nearest

Internal representation for documents as well as clusters Partition documents into `k’ clusters 2 variants

“Hard” (0/1) assignment of documents to clusters “soft” : documents belong to clusters, with fractional scores

Termination when assignment of documents to clusters ceases to change much OR When cluster centroids move negligibly over successive iterations


Top-down clustering

Hard k-Means: repeat... choose k arbitrary `centroids'; assign each document to the nearest centroid; recompute centroids.

Soft k-Means: don't break close ties between document assignments to clusters; don't make documents contribute to a single cluster that wins narrowly.

The contribution for updating cluster centroid \mu_c from document d is related to the current similarity between \mu_c and d:

\frac{\exp(-\|d - \mu_c\|^2)}{\sum_{\gamma} \exp(-\|d - \mu_\gamma\|^2)}
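A sketch of the soft k-means update just described: every document pulls every centroid toward itself with a weight given by the normalized exp(-||d - mu_c||^2) terms; the learning rate and toy data are assumptions made for illustration.

```python
# Sketch of a soft k-means step: softmax-weighted pull of all centroids toward each document.
import numpy as np

def soft_kmeans_step(X, centroids, lr=0.1):
    for d in X:
        sq = np.sum((centroids - d) ** 2, axis=1)        # |d - mu_c|^2 for every cluster
        w = np.exp(-sq)
        w /= w.sum()                                     # soft assignment weights
        centroids += lr * w[:, None] * (d - centroids)   # weighted pull toward d
    return centroids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
mu = rng.normal(2, 1, (2, 2))
for _ in range(20):
    mu = soft_kmeans_step(X, mu)
print(mu)
```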


Seeding `k' clusters

Randomly sample O(\sqrt{kn}) documents

Run the bottom-up group-average clustering algorithm to reduce them to k groups or clusters: O(kn \log n) time

Iterate assign-to-nearest O(1) times:

Move each document to the nearest cluster

Recompute cluster centroids

Total time taken is O(kn)

Non-deterministic behavior


Choosing `k'

Mostly problem driven Could be ‘data driven’ only when either

Data is not sparse Measurement dimensions are not too noisy

Interactive Data analyst interprets results of structure discovery


Choosing `k': Approaches

Hypothesis testing: Null Hypothesis (Ho): Underlying density is a mixture of ‘k’ distributions

Require regularity conditions on the mixture likelihood function (Smith’85)

Bayesian Estimation Estimate posterior distribution on k, given data and prior on k. Difficulty: Computational complexity of integration Autoclass algorithm of (Cheeseman’98) uses approximations (Diebolt’94) suggests sampling techniques


Choosing `k': Approaches

Penalised Likelihood: to account for the fact that Lk(D) is a non-decreasing function of k, penalise the number of parameters. Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML.

Assumption: penalised criteria are asymptotically optimal (Titterington 1985)

Cross-Validation Likelihood: find the ML estimate on part of the training data; choose the k that maximises the average of the M cross-validated average likelihoods on held-out data Dtest.

Cross-validation techniques: Monte Carlo Cross Validation (MCCV), v-fold cross validation (vCV)


Visualisation techniques

Goal: embedding of the corpus in a low-dimensional space

Hierarchical Agglomerative Clustering (HAC): lends itself easily to visualisation

Self-Organization Map (SOM): a close cousin of k-means

Multidimensional Scaling (MDS): minimize the distortion of interpoint distances in the low-dimensional embedding as compared to the dissimilarity given in the input data

Latent Semantic Indexing (LSI): linear transformations to reduce the number of dimensions


Self-Organization Map (SOM)

Like soft k-means:

Determine the association between clusters and documents

Associate a representative vector with each cluster and iteratively refine it

Unlike k-means:

Embed the clusters in a low-dimensional space right from the beginning

A large number of clusters can be initialised even if eventually many are to remain devoid of documents

Each cluster can be a slot in a square/hexagonal grid. The grid structure defines the neighborhood N(c) for each cluster c

Also involves a proximity function h(\gamma, c) between clusters \gamma and c


SOM: Update Rule

Like a neural network: data item d activates the neuron c_d (closest cluster) as well as the neighborhood neurons N(c_d)

E.g., Gaussian neighborhood function:

h(\gamma, c) = \exp\!\left( - \frac{\|\gamma - c\|^2}{2 \sigma^2(t)} \right)

Update rule for node \gamma under the influence of d:

\mu_\gamma(t+1) = \mu_\gamma(t) + \eta(t) \, h(\gamma, c_d) \, (d - \mu_\gamma(t))

where \sigma(t) is the neighborhood width and \eta(t) is the learning rate parameter
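A sketch of one SOM update step on a one-dimensional grid of nodes, following the rule reconstructed above; the grid size and the schedules for sigma(t) and eta(t) are illustrative assumptions.

```python
# Sketch of one SOM update: the winner and its grid neighbours move toward data item d,
# weighted by a Gaussian neighbourhood function h(gamma, c_d).
import numpy as np

def som_step(d, weights, grid_pos, sigma, eta):
    dists = np.linalg.norm(weights - d, axis=1)
    winner = dists.argmin()                                   # c_d: closest cluster/neuron
    h = np.exp(-((grid_pos - grid_pos[winner]) ** 2) / (2 * sigma ** 2))
    weights += eta * h[:, None] * (d - weights)               # neighbourhood-weighted update
    return weights

rng = np.random.default_rng(0)
weights = rng.random((10, 2))            # 10 grid nodes, 2-D data
grid_pos = np.arange(10, dtype=float)    # positions of the nodes on the grid
for t, d in enumerate(rng.random((200, 2))):
    sigma = 3.0 * (0.99 ** t)            # shrinking neighbourhood width sigma(t)
    eta = 0.5 * (0.99 ** t)              # decaying learning rate eta(t)
    weights = som_step(d, weights, grid_pos, sigma, eta)
```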


SOM: Example I

SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.


SOM: Example II

Another example of SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica at http://antarcti.ca/.


Multidimensional Scaling (MDS)

Goal: "distance preserving" low-dimensional embedding of documents

Symmetric inter-document distances: given a priori or computed from the internal representation

Coarse-grained user feedback: the user provides similarity \hat{d}_{ij} between documents i and j; with increasing feedback, prior distances are overridden

Objective: minimize the stress of the embedding

stress = \frac{\sum_{i,j} (\hat{d}_{ij} - d_{ij})^2}{\sum_{i,j} d_{ij}^2}


MDS: issues

Stress is not easy to optimize

Iterative hill climbing:

1. Points (documents) are assigned random coordinates by an external heuristic

2. Points are moved by a small distance in the direction of locally decreasing stress

For n documents: each point takes O(n) time to be moved, so in total O(n^2) time per relaxation
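A small sketch of the stress objective and of the "move points by a small distance if it locally decreases stress" relaxation; the target distances, step size, and normalization follow the reconstruction above and are illustrative.

```python
# Sketch: stress of a low-dimensional embedding plus one naive relaxation pass.
import numpy as np

def stress(target, X):
    emb = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # embedded distances d_ij
    return np.sum((target - emb) ** 2) / np.sum(target ** 2)

def relax_once(target, X, step=0.05, seed=1):
    rng = np.random.default_rng(seed)
    for i in range(len(X)):                   # move each point if a small random
        trial = X.copy()                      # displacement lowers the stress
        trial[i] += rng.normal(0, step, X.shape[1])
        if stress(target, trial) < stress(target, X):
            X = trial
    return X

rng = np.random.default_rng(0)
P = rng.random((8, 5))                        # original high-dimensional points
target = np.linalg.norm(P[:, None] - P[None, :], axis=2)
X = rng.random((8, 2))                        # random initial 2-D coordinates
print(stress(target, X), stress(target, relax_once(target, X)))
```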


Fast Map [Faloutsos '95]

No internal representation of documents is available

Goal: find a projection from an n-dimensional space to a space with a smaller number k of dimensions

Iterative projection of documents along lines of maximum spread

Each 1-D projection preserves distance information


Best line

Pivots for a line: two points (a and b) that determine it

Avoid exhaustive checking by picking pivots that are far apart

First coordinate x_1 of point x on the "best line" (a, b):

x_1 = \frac{d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2}{2 \, d_{a,b}}
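A sketch of the pivot-line coordinate above, with a numerical check that it equals the length of the projection of x - a onto the line through the pivots a and b; the test vectors are illustrative.

```python
# Sketch of the FastMap first coordinate from distances to the two pivots a and b.
import numpy as np

def first_coordinate(d_ax, d_ab, d_bx):
    return (d_ax ** 2 + d_ab ** 2 - d_bx ** 2) / (2.0 * d_ab)

# quick check against explicit vectors (the distance function is known here only for testing)
rng = np.random.default_rng(0)
a, b, x = rng.random(5), rng.random(5), rng.random(5)
d = lambda u, v: np.linalg.norm(u - v)
x1 = first_coordinate(d(a, x), d(a, b), d(b, x))
# x1 equals the length of the projection of (x - a) onto the line through a and b
print(np.isclose(x1, np.dot(x - a, b - a) / d(a, b)))
```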


Iterative projection

For i = 1 to k:

1. Find the next (i-th) "best" line. A "best" line is one which gives the maximum variance of the point set in the direction of the line

2. Project the points onto the line

3. Project the points onto the "hyperspace" orthogonal to the above line


Projection

Purpose: to correct the inter-point distances between points by taking into account the components already accounted for by the first pivot line:

d'^2_{x',y'} = d^2_{x,y} - (x_1 - y_1)^2

where (x_1, y_1) are the first coordinates of x and y, and (x', y') are their projections onto the orthogonal hyperspace

Project recursively up to 1-D space

Time: O(nk)


Issues

Detecting noise dimensions Bottom-up dimension composition too slow Definition of noise depends on application

Running time Distance computation dominates Random projections Sublinear time w/o losing small clusters

Integrating semi-structured information Hyperlinks, tags embed similarity clues A link is worth a ? words


Expectation maximization (EM): Pick k arbitrary ‘distributions’ Repeat:

Find probability that document d is generated from distribution f for all d and f

Estimate distribution parameters from weighted contribution of documents
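Since EM is the headline topic of this lecture, here is a compact sketch of the two steps for a one-dimensional mixture of two Gaussians; the data, the initial values, and the iteration count are illustrative assumptions.

```python
# Minimal EM sketch for a 1-D mixture of two Gaussians, mirroring the E/M loop above.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibility of each component for each point
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights, means, variances from the weighted points
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi.round(2), mu.round(2), var.round(2))
```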


Extended similarity

Where can I fix my scooter? A great garage to repair your 2-wheeler is at …

"auto" and "car" co-occur often, so documents having related words are related

Useful for search and clustering

Two basic approaches:

Hand-made thesaurus (WordNet)

Co-occurrence and associations

[Illustration: documents containing "car", documents containing "auto", and documents containing both, linked through co-occurrence.]


Latent semantic indexing

[Diagram: the terms-by-documents matrix A is factored by SVD as A = U D V^T; keeping the top k singular values maps each document to a k-dimensional vector in latent space, and related terms such as "car" and "auto" map to nearby directions.]
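A small sketch of LSI as a rank-k truncated SVD of the term-document matrix; the tiny matrix, the value of k, and the row labels are illustrative assumptions.

```python
# Sketch: latent semantic indexing via truncated SVD of the term-document matrix A.
import numpy as np

# rows = terms, columns = documents
A = np.array([[2, 1, 0, 0],    # "car"
              [1, 2, 0, 0],    # "auto"
              [0, 0, 1, 2],    # "flower"
              [0, 0, 2, 1]],   # "petal"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T     # each document as a k-dim vector
term_vectors = U[:, :k] * s[:k]               # each term as a k-dim vector

# "car" and "auto" map to nearby directions in the latent space
cos = term_vectors[0] @ term_vectors[1] / (
    np.linalg.norm(term_vectors[0]) * np.linalg.norm(term_vectors[1]))
print(round(float(cos), 3))
```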


Collaborative recommendation

[Table: viewers (Lyle, Ellen, Jason, Fred, Dean, Karen) by movies (Batman, Rambo, Andre, Hiver, Whispers, StarWars) preference matrix.]

People=record, movies=features People and features to be clustered

Mutual reinforcement of similarity

Need advanced models

From Clustering methods in collaborative filtering, by Ungar and Foster


A model for collaboration

People and movies belong to unknown classes Pk = probability a random person is in class k

Pl = probability a random movie is in class l

Pkl = probability of a class-k person liking a class-l movie

Gibbs sampling: iterate Pick a person or movie at random and assign to a class with probability

proportional to Pk or Pl

Estimate new parameters


Aspect Model

Metric data vs. dyadic data vs. proximity data vs. ranked preference data

Dyadic data: a domain with two finite sets of objects

Observations: dyads of X and Y

Unsupervised learning from dyadic data

Two sets of objects: X = \{x_1, \ldots, x_n\} and Y = \{y_1, \ldots, y_m\}


Aspect Model (contd)

Two main tasks:

Probabilistic modeling: learning a joint or conditional probability model over X \times Y

Structure discovery: identifying clusters and data hierarchies


Aspect Model

Statistical models: empirical co-occurrence frequencies are the sufficient statistics

Data sparseness: empirical frequencies are either 0 or significantly corrupted by sampling noise

Solution: smoothing

Back-off method [Katz '87]

Model interpolation with held-out data [JM '80, Jel '85]

Similarity-based smoothing techniques [ES '92]

Model-based statistical approach: a principled approach to deal with data sparseness


Aspect Model

Model-based statistical approach: a principled approach to deal with data sparseness

Finite Mixture Models [TSM '85]

Latent class [And '97]

Specification of a joint probability distribution for latent and observable variables [Hofmann '98]

Unifies:

Statistical modeling: probabilistic modeling by marginalization

Structure detection (exploratory data analysis): posterior probabilities by Bayes' rule on the latent space of structures


Aspect Model

Sample S = (x_n, y_n), 1 \le n \le N, is a realisation of an underlying sequence of random variables (X_n, Y_n), 1 \le n \le N

Two assumptions:

All co-occurrences in sample S are i.i.d.

X_n and Y_n are independent given the latent class A_n

P(c) are the mixture components


Aspect Model: Latent Classes

Latent-class structures with an increasing degree of restriction on the latent space:

Aspects: A = \{a_1, \ldots, a_K\}, one latent variable per observation: \{A(x_n, y_n)\}_{1 \le n \le N}

One-sided clusters: C = \{c_1, \ldots, c_K\}, one latent class per x-object: \{C(x_n)\}_{1 \le n \le N}

Two-sided clusters: C = \{c_1, \ldots, c_K\} on X and D = \{d_1, \ldots, d_L\} on Y: \{(C(x_n), D(y_n))\}_{1 \le n \le N}


Aspect Model

Symmetric parameterization:

P(S, a) = \prod_{n=1}^{N} P(x_n, y_n, a_n) = \prod_{n=1}^{N} P(a_n) P(x_n \mid a_n) P(y_n \mid a_n)

P(S) = \prod_{n=1}^{N} P(x_n, y_n) = \prod_{x \in X} \prod_{y \in Y} \left[ \sum_{a \in A} P(a) P(x \mid a) P(y \mid a) \right]^{n(x,y)}

Asymmetric parameterization:

P(S) = \prod_{x \in X} \prod_{y \in Y} \left[ P(x) \sum_{a \in A} P(a \mid x) P(y \mid a) \right]^{n(x,y)}
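A sketch of EM for the symmetric aspect model above, estimating P(a), P(x|a), and P(y|a) from a co-occurrence count matrix n(x, y); the matrix size, the number of aspects, and the iteration count are illustrative assumptions.

```python
# Sketch: EM for the symmetric aspect model P(x,y) = sum_a P(a) P(x|a) P(y|a),
# given a co-occurrence count matrix n(x, y).
import numpy as np

rng = np.random.default_rng(0)
n_xy = rng.integers(0, 5, size=(8, 6)).astype(float)   # n(x, y) co-occurrence counts
K = 2                                                  # number of aspects

Pa = np.full(K, 1.0 / K)
Px_a = rng.random((8, K)); Px_a /= Px_a.sum(axis=0)
Py_a = rng.random((6, K)); Py_a /= Py_a.sum(axis=0)

for _ in range(200):
    # E-step: P(a | x, y) for every cell
    joint = Pa[None, None, :] * Px_a[:, None, :] * Py_a[None, :, :]   # shape (x, y, a)
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate P(a), P(x|a), P(y|a) from expected counts
    weighted = n_xy[:, :, None] * post
    Pa = weighted.sum(axis=(0, 1)); Pa /= Pa.sum()
    Px_a = weighted.sum(axis=1); Px_a /= Px_a.sum(axis=0, keepdims=True)
    Py_a = weighted.sum(axis=0); Py_a /= Py_a.sum(axis=0, keepdims=True)

print(Pa.round(3))
```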


Clustering vs. Aspect

Clustering model: a constrained aspect model

For flat clustering: P(a \mid x_n, c) = P\{A = a \mid X = x_n, C(x_n) = c\} = \delta_{a,c}

For hierarchical clustering: P(a \mid x_n, c) is restricted to the classes a compatible with c

Group structure on the object spaces, as opposed to partitioning the observations

Notation: P(.) are the parameters; P{.} are posteriors


Hierarchical Clustering Model

One-sided clustering and hierarchical clustering share a likelihood of the form

P(S) = \prod_{x \in X} P(x)^{n(x)} \sum_{c \in C} P(c) \prod_{y \in Y} \left[ \sum_{a \in A} P(a \mid x, c) P(y \mid a) \right]^{n(x,y)}

where for one-sided (flat) clustering P(a \mid x, c) collapses the inner sum to P(y \mid c), and for hierarchical clustering the sum ranges over the nodes a above c.


Comparison of E-steps

Aspect model:

P\{A = a \mid x_n, y_n; \theta\} = \frac{P(a) P(x_n \mid a) P(y_n \mid a)}{\sum_{a'} P(a') P(x_n \mid a') P(y_n \mid a')}

One-sided aspect model:

P\{C(x) = c \mid S; \theta\} = \frac{P(c) \prod_{y \in Y} [P(y \mid c)]^{n(x,y)}}{\sum_{c'} P(c') \prod_{y \in Y} [P(y \mid c')]^{n(x,y)}}

Hierarchical aspect model:

P\{C(x) = c \mid S; \theta\} = \frac{P(c) \prod_{y \in Y} \left[\sum_{a \in A} P(y \mid a) P(a \mid x, c)\right]^{n(x,y)}}{\sum_{c'} P(c') \prod_{y \in Y} \left[\sum_{a \in A} P(y \mid a) P(a \mid x, c')\right]^{n(x,y)}}

P\{A = a \mid x_n, y_n, C(x) = c; \theta\} = \frac{P(a \mid x, c) P(y_n \mid a)}{\sum_{a'} P(a' \mid x, c) P(y_n \mid a')}


Tempered EM (TEM)

Additively (on the log scale) discount the likelihood part in Bayes' formula:

P\{A = a \mid x_n, y_n; \theta\} = \frac{P(a) \, [P(x_n \mid a) P(y_n \mid a)]^{\beta}}{\sum_{a'} P(a') \, [P(x_n \mid a') P(y_n \mid a')]^{\beta}}

1. Set \beta = 1 and perform EM until the performance on held-out data deteriorates (early stopping).

2. Decrease \beta, e.g., by setting \beta \leftarrow \eta \beta with some rate parameter \eta < 1.

3. As long as the performance on held-out data improves, continue TEM iterations at this value of \beta.

4. Stop on \beta, i.e., stop when decreasing \beta does not yield further improvements; otherwise go to step (2).

5. Perform some final iterations using both training and held-out data.
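A sketch of the tempered E-step above: raising the likelihood part to a power beta <= 1 before normalizing flattens the posterior; the toy parameter tables are illustrative assumptions.

```python
# Sketch of the tempered E-step: P{A=a | x, y} is proportional to P(a) * [P(x|a) P(y|a)]^beta.
import numpy as np

def tempered_posterior(Pa, Px_a, Py_a, x, y, beta):
    unnorm = Pa * (Px_a[x] * Py_a[y]) ** beta
    return unnorm / unnorm.sum()

Pa = np.array([0.5, 0.5])
Px_a = np.array([[0.7, 0.1], [0.3, 0.9]])   # P(x|a), rows indexed by x
Py_a = np.array([[0.6, 0.2], [0.4, 0.8]])   # P(y|a), rows indexed by y
for beta in (1.0, 0.5, 0.1):                # smaller beta gives a smoother posterior
    print(beta, tempered_posterior(Pa, Px_a, Py_a, x=0, y=0, beta=beta).round(3))
```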


M-Steps

1. Aspect model:

P(x \mid a) = \frac{\sum_{y} n(x, y) \, P\{a \mid x, y; \theta'\}}{\sum_{x', y} n(x', y) \, P\{a \mid x', y; \theta'\}}, \qquad
P(y \mid a) = \frac{\sum_{x} n(x, y) \, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y') \, P\{a \mid x, y'; \theta'\}}, \qquad
P(a) = \frac{1}{N} \sum_{n} P\{a \mid x_n, y_n; \theta'\}

2. Asymmetric model:

P(x) = \frac{n(x)}{N}, \qquad
P(a \mid x) = \frac{\sum_{y} n(x, y) \, P\{a \mid x, y; \theta'\}}{n(x)}, \qquad P(y \mid a) as above

3. Hierarchical x-clustering:

P(x) = \frac{n(x)}{N}, \qquad
P(y \mid a) = \frac{\sum_{x} n(x, y) \, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y') \, P\{a \mid x, y'; \theta'\}}

4. One-sided x-clustering:

P(x) = \frac{n(x)}{N}, \qquad
P(y \mid c) = \frac{\sum_{x} n(x, y) \, P\{C(x) = c \mid S; \theta'\}}{\sum_{x} n(x) \, P\{C(x) = c \mid S; \theta'\}}


Example Model [Hofmann and Popat CIKM 2001]

Hierarchy of document categories


Example Application


Topic Hierarchies

To overcome the sparseness problem in topic hierarchies with a large number of classes

Sparseness problem: small number of positive examples

Topic hierarchies reduce variance in parameter estimation

Automatically differentiate: make use of term distributions estimated for more general, coarser text aspects to provide better, smoothed estimates of the class-conditional term distributions

Convex combination of term distributions in a hierarchical mixture model:

P(w \mid c) = \sum_{a \uparrow c} P(a \mid c) \, P(w \mid a)

where a \uparrow c refers to all inner nodes a above the terminal class node c.
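A sketch of the convex-combination smoothing above, mixing the term distributions of the nodes on the path above a class; the two-level hierarchy, its distributions, and the weights P(a|c) are illustrative assumptions.

```python
# Sketch: smoothed class-conditional term distribution P(w|c) as a convex combination
# of the term distributions of the nodes above class c.
import numpy as np

P_w_given_a = {                      # term distributions at the root, an inner node, and the leaf class
    "root":   np.array([0.25, 0.25, 0.25, 0.25]),
    "sport":  np.array([0.10, 0.40, 0.40, 0.10]),
    "soccer": np.array([0.05, 0.15, 0.70, 0.10]),
}
ancestors = ["root", "sport", "soccer"]          # nodes a above (and including) class c = "soccer"
P_a_given_c = np.array([0.2, 0.3, 0.5])          # mixing weights P(a|c), sum to 1

P_w_given_c = sum(w * P_w_given_a[a] for w, a in zip(P_a_given_c, ancestors))
print(P_w_given_c, P_w_given_c.sum())            # still a proper distribution (sums to 1)
```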


Topic Hierarchies (Hierarchical X-Clustering)

X = document, Y = word

[This slide instantiates the hierarchical x-clustering E-step and M-step formulas of the preceding slides with X = documents and Y = words; the posteriors P{C(x) = c | S} and P{A = a | x, y, C(x) = c} and the estimates P(y | a), P(x), and P(a | x, c) take the same form as before.]


Document Classification Exercise

Modification of Naïve Bayes:

P(w \mid c) = \sum_{a \uparrow c} P(a \mid c) \, P(w \mid a)

P(c \mid x) = \frac{P(c) \prod_{i} P(y_i \mid c)}{\sum_{c'} P(c') \prod_{i} P(y_i \mid c')}

where the products run over the terms y_i occurring in document x.


Mixture vs. Shrinkage

Shrinkage [McCallum & Rosenfeld, AAAI '98]: interior nodes in the hierarchy represent coarser views of the data, obtained by a simple pooling scheme of term counts

Mixture: interior nodes represent abstraction levels with their corresponding specific vocabulary

Predefined hierarchy [Hofmann and Popat, CIKM 2001]

Creation of a hierarchical model from unlabeled data [Hofmann, IJCAI '99]


Mixture Density Networks (MDN) [Bishop CM '94, Mixture Density Networks]

A broad and flexible class of distributions capable of modeling completely general continuous distributions

Superimpose simple component densities with well-known properties to generate or approximate more complex distributions

Two modules:

Mixture model: the output has a distribution given as a mixture of distributions

Neural network: its outputs determine the parameters of the mixture model


MDN: Example

A conditional mixture density network with Gaussian component densities


MDN

Parameter estimation: use the Generalized EM (GEM) algorithm to speed up

Inference: even for a linear mixture, a closed-form solution is not possible; use Monte Carlo simulation as a substitute


Document model

Vocabulary V, term w_i; document d represented by c(d) = (f(w_i, d) : w_i \in V), where f(w_i, d) is the number of times w_i occurs in document d

Most f values are zero for a single document

Monotone component-wise damping function g, such as log or square root: g(c(d)) = (g(f(w_i, d)) : w_i \in V)


Terminology

Expectation-Maximization (EM) Algorithm Iterative refinement: repeat until convergence to a locally optimal label

Expectation step: estimate parameters with which to simulate data

Maximization step: use simulated (“fictitious”) data to update parameters

Unsupervised Learning and Clustering Constructive induction: using unsupervised learning for supervised learning

Feature construction: “front end” - construct new x values

Cluster definition: “back end” - use these to reformulate y

Clustering problems: formation, segmentation, labeling

Key criterion: distance metric (points closer intra-cluster than inter-cluster)

Algorithms

AutoClass: Bayesian clustering

Principal Components Analysis (PCA), factor analysis (FA)

Self-Organizing Maps (SOM): topology-preserving transform (dimensionality reduction) for competitive unsupervised learning


Summary Points

Expectation-Maximization (EM) Algorithm

Unsupervised Learning and Clustering

Types of unsupervised learning

Clustering, vector quantization

Feature extraction (typically, dimensionality reduction)

Constructive induction: unsupervised learning in support of supervised learning

Feature construction (aka feature extraction)

Cluster definition

Algorithms

EM: mixture parameter estimation (e.g., for AutoClass)

AutoClass: Bayesian clustering

Principal Components Analysis (PCA), factor analysis (FA)

Self-Organizing Maps (SOM): projection of data; competitive algorithm

Clustering problems: formation, segmentation, labeling

Next Lecture: Time Series Learning and Characterization