CIS 732 / 830: Machine Learning / Advanced Topics in AI
Computing & Information Sciences, Kansas State University

Lecture 24 of 42

Monday, 24 March 2008

William H. Hsu

Department of Computing and Information Sciences, KSU

KSOL course pages: http://snurl.com/1ydii / http://snipurl.com/1y5ih

Course web site: http://www.kddresearch.org/Courses/Spring-2008/CIS732

Instructor home page: http://www.cis.ksu.edu/~bhsu

Reading:

Today: Section 7.5, Han & Kamber 2e

After spring break: Sections 7.6 – 7.7, Han & Kamber 2e

Model-Based Clustering: Expectation-Maximization


• Organizing data into classes such that there is

• high intra-class similarity

• low inter-class similarity

• Finding the class labels and the number of classes directly from the data (in contrast to classification).

• More informally, finding natural groupings among objects.

Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.

What is Clustering?

Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn


Pedro (Portuguese): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)

Cristovao (Portuguese): Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English)

Miguel (Portuguese): Michalis (Greek), Michael (English), Mick (Irish!)

Hierarchical Clustering: Names (using String Edit Distance)

[Dendrogram: the name variants above (Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter, Peder, Peka, Peadar, Michalis, Michael, Miguel, Mick, Cristovao, Christopher, Christophe, Christoph, Crisdean, Cristobal, Cristoforo, Kristoffer, Krystof) clustered by string edit distance.]

Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn


[Dendrogram: the same name variants clustered by linguistic similarity.]

Pedro (Portuguese/Spanish): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)

Hierarchical Clustering: Names by Linguistic Similarity

Adapted from slides © 2003 Eamonn Keogh http://www.cs.ucr.edu/~eamonn


Nearest Neighbor Clustering (not to be confused with Nearest Neighbor Classification)

• Items are iteratively merged into the existing clusters that are closest.

• Incremental

• A threshold, t, is used to determine whether items are added to an existing cluster or a new cluster is created (see the sketch after this slide).

Incremental Clustering [1]
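The following is a minimal sketch (not from the original slides) of the threshold-based incremental clustering loop described above; the running-mean cluster centers, the threshold value, and the toy data are illustrative assumptions.

```python
# Minimal sketch of threshold-based incremental ("leader") clustering.
import numpy as np

def incremental_cluster(points, t):
    """Assign each point to the nearest existing cluster if within
    threshold t, otherwise start a new cluster. Order dependent."""
    centers, members = [], []                    # cluster centers and member lists
    for x in points:
        if centers:
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] <= t:                    # close enough: absorb and update center
                members[j].append(x)
                centers[j] = np.mean(members[j], axis=0)
                continue
        centers.append(np.asarray(x, dtype=float))     # otherwise open a new cluster
        members.append([np.asarray(x, dtype=float)])
    return centers, members

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
    centers, members = incremental_cluster(data, t=2.0)
    print(len(centers), "clusters found")
```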


[Scatter plot: points on a 10 x 10 grid grouped into two initial clusters (1, 2), each with radius given by threshold t.]

Incremental Clustering [2]


[Scatter plot: clusters 1, 2, and 3 on the same grid.]

New data point arrives…

It is within the threshold for cluster 1, so add it to that cluster and update the cluster center.

Incremental Clustering [3]


[Scatter plot: clusters 1 through 4 on the same grid.]

New data point arrives…

It is not within the threshold for cluster 1, so create a new cluster, and so on.

Algorithm is highly order dependent…

It is difficult to determine t in advance…

Incremental Clustering [4]


Similarity and Clustering


Motivation

Problem 1: A query word could be ambiguous. E.g., the query "Star" retrieves documents about astronomy, plants, animals, etc. Solution: visualisation:

Clustering document responses to queries along lines of different topics.

Problem 2: Manual construction of topic hierarchies and taxonomies. Solution:

Preliminary clustering of large samples of web documents.

Problem 3: Speeding up similarity search. Solution:

Restrict the search for documents similar to a query to the most representative cluster(s).


Example

Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)


Clustering

Task: evolve measures of similarity to cluster a collection of documents/terms into groups within which similarity within a cluster is larger than across clusters.

Cluster Hypothesis: given a `suitable' clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.

Similarity measures: represent documents by TFIDF vectors; distance between document vectors; cosine of the angle between document vectors (see the sketch below).

Issues: large number of noisy dimensions; the notion of noise is application dependent.
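As an illustration of the similarity measures listed above, here is a small sketch that builds raw TFIDF vectors and compares documents by the cosine of the angle between them; the toy corpus and the unsmoothed TFIDF weighting are assumptions made only for the example.

```python
# Sketch: TFIDF document vectors and cosine similarity (toy corpus is illustrative).
import math
from collections import Counter

docs = ["star wars space opera film", "space opera classic film", "star cluster astronomy"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
N = len(docs)
df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}   # document frequency

def tfidf(toks):
    tf = Counter(toks)
    return [tf[w] * math.log(N / df[w]) for w in vocab]              # raw tf x idf

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(t) for t in tokenized]
print(round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```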


Top-down clustering

k-Means: Repeat… Choose k arbitrary ‘centroids’ Assign each document to nearest centroid Recompute centroids

Expectation maximization (EM): Pick k arbitrary ‘distributions’ Repeat:

Find probability that document d is generated from distribution f for all d and f

Estimate distribution parameters from weighted contribution of documents
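A minimal k-means sketch matching the loop on this slide (choose k arbitrary centroids, assign each document to the nearest centroid, recompute centroids); the initialization scheme and the convergence test are illustrative choices rather than part of the slide.

```python
# Minimal k-means sketch: assign to nearest centroid, then recompute centroids.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # arbitrary initial centroids
    for _ in range(iters):
        # assignment step: nearest centroid for every point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: recompute each centroid as the mean of its points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```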


Choosing `k'

Mostly problem driven Could be ‘data driven’ only when either

Data is not sparse Measurement dimensions are not too noisy

Interactive Data analyst interprets results of structure discovery


Choosing `k': Approaches

Hypothesis testing: Null Hypothesis (Ho): Underlying density is a mixture of ‘k’ distributions

Require regularity conditions on the mixture likelihood function (Smith’85)

Bayesian Estimation Estimate posterior distribution on k, given data and prior on k. Difficulty: Computational complexity of integration Autoclass algorithm of (Cheeseman’98) uses approximations (Diebolt’94) suggests sampling techniques


Choosing `k': Approaches

Penalised Likelihood: to account for the fact that Lk(D) is a non-decreasing function of k, penalise the number of parameters. Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML.

Assumption: penalised criteria are asymptotically optimal (Titterington 1985)

Cross-Validation Likelihood: find the ML estimate on part of the training data; choose the k that maximises the average of the M cross-validated average likelihoods on held-out data Dtest.

Cross-validation techniques: Monte Carlo Cross Validation (MCCV), v-fold cross validation (vCV)
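A small sketch of the penalised-likelihood idea above: fit mixture models for several values of k and keep the one with the lowest BIC. It assumes scikit-learn's GaussianMixture is available and uses synthetic data; a cross-validated held-out likelihood could be plugged into the same loop.

```python
# Sketch: pick k by a penalized-likelihood criterion (BIC), assuming scikit-learn is available.
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_values):
    scores = {}
    for k in k_values:
        gm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
        scores[k] = gm.bic(X)       # BIC penalizes the number of parameters; smaller is better
    return min(scores, key=scores.get), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.3, (100, 2)) for m in (0.0, 3.0, 6.0)])
    best_k, scores = choose_k_by_bic(X, range(1, 7))
    print("selected k =", best_k)
```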


Similarity and Clustering


Motivation

Problem 1: A query word could be ambiguous. E.g., the query "Star" retrieves documents about astronomy, plants, animals, etc. Solution: visualisation:

Clustering document responses to queries along lines of different topics.

Problem 2: Manual construction of topic hierarchies and taxonomies. Solution:

Preliminary clustering of large samples of web documents.

Problem 3: Speeding up similarity search. Solution:

Restrict the search for documents similar to a query to the most representative cluster(s).


Example

Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)


Clustering

Task: evolve measures of similarity to cluster a collection of documents/terms into groups within which similarity within a cluster is larger than across clusters.

Cluster Hypothesis: given a `suitable' clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.

Collaborative filtering: clustering of two or more sets of objects which have a bipartite relationship


Clustering (contd)

Two important paradigms: Bottom-up agglomerative clustering Top-down partitioning

Visualisation techniques: Embedding of corpus in a low-dimensional space

Characterising the entities: Internally : Vector space model, probabilistic models Externally: Measure of similarity/dissimilarity between pairs

Learning: Supplement stock algorithms with experience with data


Clustering: Parameters

Similarity measure s(d_1, d_2): e.g., cosine similarity

Distance measure \delta(d_1, d_2): e.g., Euclidean distance

Number "k" of clusters

Issues: large number of noisy dimensions; the notion of noise is application dependent


Clustering: Formal specification

Partitioning Approaches Bottom-up clustering Top-down clustering

Geometric Embedding Approaches Self-organization map Multidimensional scaling Latent semantic indexing

Generative models and probabilistic approaches Single topic per document Documents correspond to mixtures of multiple topics


Partitioning Approaches

Partition the document collection into k clusters \{D_1, D_2, \ldots, D_k\}

Choices:

Minimize intra-cluster distance: \sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)

Maximize intra-cluster semblance: \sum_i \sum_{d_1, d_2 \in D_i} s(d_1, d_2)

If cluster representations \bar{D}_i are available: minimize \sum_i \sum_{d \in D_i} \delta(d, \bar{D}_i) or maximize \sum_i \sum_{d \in D_i} s(d, \bar{D}_i)

Soft clustering: d assigned to D_i with `confidence' z_{d,i}; find z_{d,i} so as to minimize \sum_i \sum_{d \in D_i} z_{d,i} \, \delta(d, \bar{D}_i) or maximize \sum_i \sum_{d \in D_i} z_{d,i} \, s(d, \bar{D}_i)

Two ways to get partitions: bottom-up clustering and top-down clustering


Bottom-up clustering (HAC)

Initially G is a collection of singleton groups, each with one document

Repeat: find Γ, Δ in G with the maximum similarity measure s(Γ ∪ Δ); merge group Γ with group Δ

For each Γ, keep track of the best Δ

Use the above information to plot the hierarchical merging process (dendrogram)

To get the desired number of clusters: cut across any level of the dendrogram


Dendrogram

A dendrogram presents the progressive, hierarchy-forming merging process pictorially.
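A short sketch of bottom-up group-average clustering and the dendrogram plot, assuming SciPy and Matplotlib are available; the data, the two-cluster cut, and the "average" linkage choice are illustrative.

```python
# Sketch: bottom-up (agglomerative) clustering and a dendrogram, assuming SciPy is available.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (10, 2)), rng.normal(4, 0.4, (10, 2))])

Z = linkage(X, method="average")                   # group-average merging criterion
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree to obtain 2 clusters
print(labels)

dendrogram(Z)                                      # plot the hierarchical merging process
plt.show()
```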


Similarity measure

Typically s(Γ) decreases with an increasing number of merges

Self-similarity: average pairwise similarity between documents in Γ:

s(\Gamma) = \frac{1}{\binom{|\Gamma|}{2}} \sum_{d_1, d_2 \in \Gamma} s(d_1, d_2)

where s(d_1, d_2) is an inter-document similarity measure (say, cosine of TFIDF vectors)

Other criteria: maximum/minimum pairwise similarity between documents in the clusters


Computation

Un-normalized group profile: \hat{p}(\Gamma) = \sum_{d \in \Gamma} p(d)

Can show:

s(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma| \,(|\Gamma| - 1)}

\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle

O(n^2 \log n) algorithm with n^2 space
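A small numerical check of the profile-based computation reconstructed above, assuming unit-normalized document profiles: the group self-similarity obtained from the summed profile vector matches the average pairwise similarity computed directly.

```python
# Sketch: group self-similarity from the summed profile vector, assuming unit-normalized
# document profiles p(d). Checks <p_hat, p_hat> = |G| + sum of pairwise similarities.
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((6, 10))
P /= np.linalg.norm(P, axis=1, keepdims=True)       # p(d): unit-length document profiles

# direct average pairwise similarity over distinct pairs
n = len(P)
pairwise = sum(P[i] @ P[j] for i in range(n) for j in range(n) if i != j)
s_direct = pairwise / (n * (n - 1))

# same quantity from the un-normalized group profile p_hat = sum of p(d)
p_hat = P.sum(axis=0)
s_profile = (p_hat @ p_hat - n) / (n * (n - 1))

print(np.isclose(s_direct, s_profile))              # True
```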


Similarity

s(d_1, d_2) = \frac{\langle g(c(d_1)), g(c(d_2)) \rangle}{\|g(c(d_1))\| \, \|g(c(d_2))\|}   (inner product)

Normalized document profile: p(d) = \frac{g(c(d))}{\|g(c(d))\|}

Profile for document group \Gamma: p(\Gamma) = \frac{\sum_{d \in \Gamma} p(d)}{\| \sum_{d \in \Gamma} p(d) \|}


Switch to top-down

Bottom-up

Requires quadratic time and space Top-down or move-to-nearest

Internal representation for documents as well as clusters Partition documents into `k’ clusters 2 variants

“Hard” (0/1) assignment of documents to clusters “soft” : documents belong to clusters, with fractional scores

Termination when assignment of documents to clusters ceases to change much OR When cluster centroids move negligibly over successive iterations


Top-down clustering

Hard k-Means: repeat... choose k arbitrary `centroids'; assign each document to the nearest centroid; recompute centroids.

Soft k-Means: don't break close ties between document assignments to clusters; don't make documents contribute to a single cluster that wins narrowly.

The contribution for updating cluster centroid \mu_c from document d is related to the current similarity between \mu_c and d:

\frac{\exp(-\|d - \mu_c\|^2)}{\sum_{\gamma} \exp(-\|d - \mu_\gamma\|^2)}
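A sketch of the soft k-means update just described: every document pulls every centroid toward itself with a weight given by the normalized exp(-||d - mu_c||^2) terms; the learning rate and toy data are assumptions made for illustration.

```python
# Sketch of a soft k-means step: softmax-weighted pull of all centroids toward each document.
import numpy as np

def soft_kmeans_step(X, centroids, lr=0.1):
    for d in X:
        sq = np.sum((centroids - d) ** 2, axis=1)        # |d - mu_c|^2 for every cluster
        w = np.exp(-sq)
        w /= w.sum()                                     # soft assignment weights
        centroids += lr * w[:, None] * (d - centroids)   # weighted pull toward d
    return centroids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
mu = rng.normal(2, 1, (2, 2))
for _ in range(20):
    mu = soft_kmeans_step(X, mu)
print(mu)
```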


Seeding `k' clusters

Randomly sample O(\sqrt{kn}) documents

Run the bottom-up group-average clustering algorithm to reduce them to k groups or clusters: O(kn \log n) time

Iterate assign-to-nearest O(1) times:

Move each document to the nearest cluster

Recompute cluster centroids

Total time taken is O(kn)

Non-deterministic behavior


Choosing `k'

Mostly problem driven Could be ‘data driven’ only when either

Data is not sparse Measurement dimensions are not too noisy

Interactive Data analyst interprets results of structure discovery


Choosing `k': Approaches

Hypothesis testing: Null Hypothesis (Ho): Underlying density is a mixture of ‘k’ distributions

Require regularity conditions on the mixture likelihood function (Smith’85)

Bayesian Estimation Estimate posterior distribution on k, given data and prior on k. Difficulty: Computational complexity of integration Autoclass algorithm of (Cheeseman’98) uses approximations (Diebolt’94) suggests sampling techniques


Choosing `k': Approaches

Penalised Likelihood: to account for the fact that Lk(D) is a non-decreasing function of k, penalise the number of parameters. Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML.

Assumption: penalised criteria are asymptotically optimal (Titterington 1985)

Cross-Validation Likelihood: find the ML estimate on part of the training data; choose the k that maximises the average of the M cross-validated average likelihoods on held-out data Dtest.

Cross-validation techniques: Monte Carlo Cross Validation (MCCV), v-fold cross validation (vCV)


Visualisation techniques

Goal: embedding of the corpus in a low-dimensional space

Hierarchical Agglomerative Clustering (HAC): lends itself easily to visualisation

Self-Organization Map (SOM): a close cousin of k-means

Multidimensional Scaling (MDS): minimize the distortion of interpoint distances in the low-dimensional embedding as compared to the dissimilarity given in the input data

Latent Semantic Indexing (LSI): linear transformations to reduce the number of dimensions


Self-Organization Map (SOM)

Like soft k-means:

Determine the association between clusters and documents

Associate a representative vector with each cluster and iteratively refine it

Unlike k-means:

Embed the clusters in a low-dimensional space right from the beginning

A large number of clusters can be initialised even if eventually many are to remain devoid of documents

Each cluster can be a slot in a square/hexagonal grid. The grid structure defines the neighborhood N(c) for each cluster c

Also involves a proximity function h(\gamma, c) between clusters \gamma and c


SOM: Update Rule

Like a neural network: data item d activates the neuron c_d (closest cluster) as well as the neighborhood neurons N(c_d)

E.g., Gaussian neighborhood function:

h(\gamma, c) = \exp\!\left( - \frac{\|\gamma - c\|^2}{2 \sigma^2(t)} \right)

Update rule for node \gamma under the influence of d:

\mu_\gamma(t+1) = \mu_\gamma(t) + \eta(t) \, h(\gamma, c_d) \, (d - \mu_\gamma(t))

where \sigma(t) is the neighborhood width and \eta(t) is the learning rate parameter
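A sketch of one SOM update step on a one-dimensional grid of nodes, following the rule reconstructed above; the grid size and the schedules for sigma(t) and eta(t) are illustrative assumptions.

```python
# Sketch of one SOM update: the winner and its grid neighbours move toward data item d,
# weighted by a Gaussian neighbourhood function h(gamma, c_d).
import numpy as np

def som_step(d, weights, grid_pos, sigma, eta):
    dists = np.linalg.norm(weights - d, axis=1)
    winner = dists.argmin()                                   # c_d: closest cluster/neuron
    h = np.exp(-((grid_pos - grid_pos[winner]) ** 2) / (2 * sigma ** 2))
    weights += eta * h[:, None] * (d - weights)               # neighbourhood-weighted update
    return weights

rng = np.random.default_rng(0)
weights = rng.random((10, 2))            # 10 grid nodes, 2-D data
grid_pos = np.arange(10, dtype=float)    # positions of the nodes on the grid
for t, d in enumerate(rng.random((200, 2))):
    sigma = 3.0 * (0.99 ** t)            # shrinking neighbourhood width sigma(t)
    eta = 0.5 * (0.99 ** t)              # decaying learning rate eta(t)
    weights = som_step(d, weights, grid_pos, sigma, eta)
```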


SOM: Example I

SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.


SOM: Example II

Another example of SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica at http://antarcti.ca/.


Multidimensional Scaling (MDS)

Goal: "distance preserving" low-dimensional embedding of documents

Symmetric inter-document distances: given a priori or computed from the internal representation

Coarse-grained user feedback: the user provides similarity \hat{d}_{ij} between documents i and j; with increasing feedback, prior distances are overridden

Objective: minimize the stress of the embedding

stress = \frac{\sum_{i,j} (\hat{d}_{ij} - d_{ij})^2}{\sum_{i,j} d_{ij}^2}


MDS: issues

Stress is not easy to optimize

Iterative hill climbing:

1. Points (documents) are assigned random coordinates by an external heuristic

2. Points are moved by a small distance in the direction of locally decreasing stress

For n documents: each point takes O(n) time to be moved, so in total O(n^2) time per relaxation
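A small sketch of the stress objective and of the "move points by a small distance if it locally decreases stress" relaxation; the target distances, step size, and normalization follow the reconstruction above and are illustrative.

```python
# Sketch: stress of a low-dimensional embedding plus one naive relaxation pass.
import numpy as np

def stress(target, X):
    emb = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # embedded distances d_ij
    return np.sum((target - emb) ** 2) / np.sum(target ** 2)

def relax_once(target, X, step=0.05, seed=1):
    rng = np.random.default_rng(seed)
    for i in range(len(X)):                   # move each point if a small random
        trial = X.copy()                      # displacement lowers the stress
        trial[i] += rng.normal(0, step, X.shape[1])
        if stress(target, trial) < stress(target, X):
            X = trial
    return X

rng = np.random.default_rng(0)
P = rng.random((8, 5))                        # original high-dimensional points
target = np.linalg.norm(P[:, None] - P[None, :], axis=2)
X = rng.random((8, 2))                        # random initial 2-D coordinates
print(stress(target, X), stress(target, relax_once(target, X)))
```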


Fast Map [Faloutsos '95]

No internal representation of documents is available

Goal: find a projection from an n-dimensional space to a space with a smaller number k of dimensions

Iterative projection of documents along lines of maximum spread

Each 1-D projection preserves distance information


Best line

Pivots for a line: two points (a and b) that determine it

Avoid exhaustive checking by picking pivots that are far apart

First coordinate x_1 of point x on the "best line" (a, b):

x_1 = \frac{d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2}{2 \, d_{a,b}}
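A sketch of the pivot-line coordinate above, with a numerical check that it equals the length of the projection of x - a onto the line through the pivots a and b; the test vectors are illustrative.

```python
# Sketch of the FastMap first coordinate from distances to the two pivots a and b.
import numpy as np

def first_coordinate(d_ax, d_ab, d_bx):
    return (d_ax ** 2 + d_ab ** 2 - d_bx ** 2) / (2.0 * d_ab)

# quick check against explicit vectors (the distance function is known here only for testing)
rng = np.random.default_rng(0)
a, b, x = rng.random(5), rng.random(5), rng.random(5)
d = lambda u, v: np.linalg.norm(u - v)
x1 = first_coordinate(d(a, x), d(a, b), d(b, x))
# x1 equals the length of the projection of (x - a) onto the line through a and b
print(np.isclose(x1, np.dot(x - a, b - a) / d(a, b)))
```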


Iterative projection

For i = 1 to k:

1. Find the next (i-th) "best" line. A "best" line is one which gives the maximum variance of the point set in the direction of the line

2. Project the points onto the line

3. Project the points onto the "hyperspace" orthogonal to the above line


Projection

Purpose: to correct the inter-point distances between points by taking into account the components already accounted for by the first pivot line:

d'^2_{x',y'} = d^2_{x,y} - (x_1 - y_1)^2

where (x_1, y_1) are the first coordinates of x and y, and (x', y') are their projections onto the orthogonal hyperspace

Project recursively up to 1-D space

Time: O(nk)


Issues

Detecting noise dimensions Bottom-up dimension composition too slow Definition of noise depends on application

Running time Distance computation dominates Random projections Sublinear time w/o losing small clusters

Integrating semi-structured information Hyperlinks, tags embed similarity clues A link is worth a ? words


Expectation maximization (EM): Pick k arbitrary ‘distributions’ Repeat:

Find probability that document d is generated from distribution f for all d and f

Estimate distribution parameters from weighted contribution of documents
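Since EM is the headline topic of this lecture, here is a compact sketch of the two steps for a one-dimensional mixture of two Gaussians; the data, the initial values, and the iteration count are illustrative assumptions.

```python
# Minimal EM sketch for a 1-D mixture of two Gaussians, mirroring the E/M loop above.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibility of each component for each point
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights, means, variances from the weighted points
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi.round(2), mu.round(2), var.round(2))
```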


Extended similarity

Where can I fix my scooter? A great garage to repair your 2-wheeler is at …

"auto" and "car" co-occur often, so documents having related words are related

Useful for search and clustering

Two basic approaches:

Hand-made thesaurus (WordNet)

Co-occurrence and associations

[Illustration: documents containing "car", documents containing "auto", and documents containing both, linked through co-occurrence.]


Latent semantic indexing

[Diagram: the terms-by-documents matrix A is factored by SVD as A = U D V^T; keeping the top k singular values maps each document to a k-dimensional vector in latent space, and related terms such as "car" and "auto" map to nearby directions.]
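A small sketch of LSI as a rank-k truncated SVD of the term-document matrix; the tiny matrix, the value of k, and the row labels are illustrative assumptions.

```python
# Sketch: latent semantic indexing via truncated SVD of the term-document matrix A.
import numpy as np

# rows = terms, columns = documents
A = np.array([[2, 1, 0, 0],    # "car"
              [1, 2, 0, 0],    # "auto"
              [0, 0, 1, 2],    # "flower"
              [0, 0, 2, 1]],   # "petal"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T     # each document as a k-dim vector
term_vectors = U[:, :k] * s[:k]               # each term as a k-dim vector

# "car" and "auto" map to nearby directions in the latent space
cos = term_vectors[0] @ term_vectors[1] / (
    np.linalg.norm(term_vectors[0]) * np.linalg.norm(term_vectors[1]))
print(round(float(cos), 3))
```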


Collaborative recommendation

[Table: viewers (Lyle, Ellen, Jason, Fred, Dean, Karen) by movies (Batman, Rambo, Andre, Hiver, Whispers, StarWars) preference matrix.]

People=record, movies=features People and features to be clustered

Mutual reinforcement of similarity

Need advanced models

From Clustering methods in collaborative filtering, by Ungar and Foster


A model for collaboration

People and movies belong to unknown classes Pk = probability a random person is in class k

Pl = probability a random movie is in class l

Pkl = probability of a class-k person liking a class-l movie

Gibbs sampling: iterate Pick a person or movie at random and assign to a class with probability

proportional to Pk or Pl

Estimate new parameters


Aspect Model

Metric data vs. dyadic data vs. proximity data vs. ranked preference data

Dyadic data: a domain with two finite sets of objects

Observations: dyads of X and Y

Unsupervised learning from dyadic data

Two sets of objects: X = \{x_1, \ldots, x_n\} and Y = \{y_1, \ldots, y_m\}


Aspect Model (contd)

Two main tasks:

Probabilistic modeling: learning a joint or conditional probability model over X \times Y

Structure discovery: identifying clusters and data hierarchies


Aspect Model

Statistical models: empirical co-occurrence frequencies are the sufficient statistics

Data sparseness: empirical frequencies are either 0 or significantly corrupted by sampling noise

Solution: smoothing

Back-off method [Katz '87]

Model interpolation with held-out data [JM '80, Jel '85]

Similarity-based smoothing techniques [ES '92]

Model-based statistical approach: a principled approach to deal with data sparseness


Aspect Model

Model-based statistical approach: a principled approach to deal with data sparseness

Finite Mixture Models [TSM '85]

Latent class [And '97]

Specification of a joint probability distribution for latent and observable variables [Hofmann '98]

Unifies:

Statistical modeling: probabilistic modeling by marginalization

Structure detection (exploratory data analysis): posterior probabilities by Bayes' rule on the latent space of structures


Aspect Model

Sample S = (x_n, y_n), 1 \le n \le N, is a realisation of an underlying sequence of random variables (X_n, Y_n), 1 \le n \le N

Two assumptions:

All co-occurrences in sample S are i.i.d.

X_n and Y_n are independent given the latent class A_n

P(c) are the mixture components


Aspect Model: Latent Classes

Latent-class structures with an increasing degree of restriction on the latent space:

Aspects: A = \{a_1, \ldots, a_K\}, one latent variable per observation: \{A(x_n, y_n)\}_{1 \le n \le N}

One-sided clusters: C = \{c_1, \ldots, c_K\}, one latent class per x-object: \{C(x_n)\}_{1 \le n \le N}

Two-sided clusters: C = \{c_1, \ldots, c_K\} on X and D = \{d_1, \ldots, d_L\} on Y: \{(C(x_n), D(y_n))\}_{1 \le n \le N}


Aspect Model

Symmetric parameterization:

P(S, a) = \prod_{n=1}^{N} P(x_n, y_n, a_n) = \prod_{n=1}^{N} P(a_n) P(x_n \mid a_n) P(y_n \mid a_n)

P(S) = \prod_{n=1}^{N} P(x_n, y_n) = \prod_{x \in X} \prod_{y \in Y} \left[ \sum_{a \in A} P(a) P(x \mid a) P(y \mid a) \right]^{n(x,y)}

Asymmetric parameterization:

P(S) = \prod_{x \in X} \prod_{y \in Y} \left[ P(x) \sum_{a \in A} P(a \mid x) P(y \mid a) \right]^{n(x,y)}
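A sketch of EM for the symmetric aspect model above, estimating P(a), P(x|a), and P(y|a) from a co-occurrence count matrix n(x, y); the matrix size, the number of aspects, and the iteration count are illustrative assumptions.

```python
# Sketch: EM for the symmetric aspect model P(x,y) = sum_a P(a) P(x|a) P(y|a),
# given a co-occurrence count matrix n(x, y).
import numpy as np

rng = np.random.default_rng(0)
n_xy = rng.integers(0, 5, size=(8, 6)).astype(float)   # n(x, y) co-occurrence counts
K = 2                                                  # number of aspects

Pa = np.full(K, 1.0 / K)
Px_a = rng.random((8, K)); Px_a /= Px_a.sum(axis=0)
Py_a = rng.random((6, K)); Py_a /= Py_a.sum(axis=0)

for _ in range(200):
    # E-step: P(a | x, y) for every cell
    joint = Pa[None, None, :] * Px_a[:, None, :] * Py_a[None, :, :]   # shape (x, y, a)
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate P(a), P(x|a), P(y|a) from expected counts
    weighted = n_xy[:, :, None] * post
    Pa = weighted.sum(axis=(0, 1)); Pa /= Pa.sum()
    Px_a = weighted.sum(axis=1); Px_a /= Px_a.sum(axis=0, keepdims=True)
    Py_a = weighted.sum(axis=0); Py_a /= Py_a.sum(axis=0, keepdims=True)

print(Pa.round(3))
```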


Clustering vs. Aspect

Clustering model: a constrained aspect model

For flat clustering: P(a \mid x_n, c) = P\{A = a \mid X = x_n, C(x_n) = c\} = \delta_{a,c}

For hierarchical clustering: P(a \mid x_n, c) is restricted to the classes a compatible with c

Group structure on the object spaces, as opposed to partitioning the observations

Notation: P(.) are the parameters; P{.} are posteriors


Hierarchical Clustering Model

One-sided clustering and hierarchical clustering share a likelihood of the form

P(S) = \prod_{x \in X} P(x)^{n(x)} \sum_{c \in C} P(c) \prod_{y \in Y} \left[ \sum_{a \in A} P(a \mid x, c) P(y \mid a) \right]^{n(x,y)}

where for one-sided (flat) clustering P(a \mid x, c) collapses the inner sum to P(y \mid c), and for hierarchical clustering the sum ranges over the nodes a above c.


Comparison of E-steps

Aspect model:

P\{A = a \mid x_n, y_n; \theta\} = \frac{P(a) P(x_n \mid a) P(y_n \mid a)}{\sum_{a'} P(a') P(x_n \mid a') P(y_n \mid a')}

One-sided aspect model:

P\{C(x) = c \mid S; \theta\} = \frac{P(c) \prod_{y \in Y} [P(y \mid c)]^{n(x,y)}}{\sum_{c'} P(c') \prod_{y \in Y} [P(y \mid c')]^{n(x,y)}}

Hierarchical aspect model:

P\{C(x) = c \mid S; \theta\} = \frac{P(c) \prod_{y \in Y} \left[\sum_{a \in A} P(y \mid a) P(a \mid x, c)\right]^{n(x,y)}}{\sum_{c'} P(c') \prod_{y \in Y} \left[\sum_{a \in A} P(y \mid a) P(a \mid x, c')\right]^{n(x,y)}}

P\{A = a \mid x_n, y_n, C(x) = c; \theta\} = \frac{P(a \mid x, c) P(y_n \mid a)}{\sum_{a'} P(a' \mid x, c) P(y_n \mid a')}


Tempered EM (TEM)

Additively (on the log scale) discount the likelihood part in Bayes' formula:

P\{A = a \mid x_n, y_n; \theta\} = \frac{P(a) \, [P(x_n \mid a) P(y_n \mid a)]^{\beta}}{\sum_{a'} P(a') \, [P(x_n \mid a') P(y_n \mid a')]^{\beta}}

1. Set \beta = 1 and perform EM until the performance on held-out data deteriorates (early stopping).

2. Decrease \beta, e.g., by setting \beta \leftarrow \eta \beta with some rate parameter \eta < 1.

3. As long as the performance on held-out data improves, continue TEM iterations at this value of \beta.

4. Stop on \beta, i.e., stop when decreasing \beta does not yield further improvements; otherwise go to step (2).

5. Perform some final iterations using both training and held-out data.
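A sketch of the tempered E-step above: raising the likelihood part to a power beta <= 1 before normalizing flattens the posterior; the toy parameter tables are illustrative assumptions.

```python
# Sketch of the tempered E-step: P{A=a | x, y} is proportional to P(a) * [P(x|a) P(y|a)]^beta.
import numpy as np

def tempered_posterior(Pa, Px_a, Py_a, x, y, beta):
    unnorm = Pa * (Px_a[x] * Py_a[y]) ** beta
    return unnorm / unnorm.sum()

Pa = np.array([0.5, 0.5])
Px_a = np.array([[0.7, 0.1], [0.3, 0.9]])   # P(x|a), rows indexed by x
Py_a = np.array([[0.6, 0.2], [0.4, 0.8]])   # P(y|a), rows indexed by y
for beta in (1.0, 0.5, 0.1):                # smaller beta gives a smoother posterior
    print(beta, tempered_posterior(Pa, Px_a, Py_a, x=0, y=0, beta=beta).round(3))
```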


M-Steps

1. Aspect model:

P(x \mid a) = \frac{\sum_{y} n(x, y) \, P\{a \mid x, y; \theta'\}}{\sum_{x', y} n(x', y) \, P\{a \mid x', y; \theta'\}}, \qquad
P(y \mid a) = \frac{\sum_{x} n(x, y) \, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y') \, P\{a \mid x, y'; \theta'\}}, \qquad
P(a) = \frac{1}{N} \sum_{n} P\{a \mid x_n, y_n; \theta'\}

2. Asymmetric model:

P(x) = \frac{n(x)}{N}, \qquad
P(a \mid x) = \frac{\sum_{y} n(x, y) \, P\{a \mid x, y; \theta'\}}{n(x)}, \qquad P(y \mid a) as above

3. Hierarchical x-clustering:

P(x) = \frac{n(x)}{N}, \qquad
P(y \mid a) = \frac{\sum_{x} n(x, y) \, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y') \, P\{a \mid x, y'; \theta'\}}

4. One-sided x-clustering:

P(x) = \frac{n(x)}{N}, \qquad
P(y \mid c) = \frac{\sum_{x} n(x, y) \, P\{C(x) = c \mid S; \theta'\}}{\sum_{x} n(x) \, P\{C(x) = c \mid S; \theta'\}}


Example Model [Hofmann and Popat CIKM 2001]

Hierarchy of document categories


Example Application


Topic Hierarchies

To overcome the sparseness problem in topic hierarchies with a large number of classes

Sparseness problem: small number of positive examples

Topic hierarchies reduce variance in parameter estimation

Automatically differentiate: make use of term distributions estimated for more general, coarser text aspects to provide better, smoothed estimates of the class-conditional term distributions

Convex combination of term distributions in a hierarchical mixture model:

P(w \mid c) = \sum_{a \uparrow c} P(a \mid c) \, P(w \mid a)

where a \uparrow c refers to all inner nodes a above the terminal class node c.
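A sketch of the convex-combination smoothing above, mixing the term distributions of the nodes on the path above a class; the two-level hierarchy, its distributions, and the weights P(a|c) are illustrative assumptions.

```python
# Sketch: smoothed class-conditional term distribution P(w|c) as a convex combination
# of the term distributions of the nodes above class c.
import numpy as np

P_w_given_a = {                      # term distributions at the root, an inner node, and the leaf class
    "root":   np.array([0.25, 0.25, 0.25, 0.25]),
    "sport":  np.array([0.10, 0.40, 0.40, 0.10]),
    "soccer": np.array([0.05, 0.15, 0.70, 0.10]),
}
ancestors = ["root", "sport", "soccer"]          # nodes a above (and including) class c = "soccer"
P_a_given_c = np.array([0.2, 0.3, 0.5])          # mixing weights P(a|c), sum to 1

P_w_given_c = sum(w * P_w_given_a[a] for w, a in zip(P_a_given_c, ancestors))
print(P_w_given_c, P_w_given_c.sum())            # still a proper distribution (sums to 1)
```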


Topic Hierarchies (Hierarchical X-Clustering)

X = document, Y = word

[This slide instantiates the hierarchical x-clustering E-step and M-step formulas of the preceding slides with X = documents and Y = words; the posteriors P{C(x) = c | S} and P{A = a | x, y, C(x) = c} and the estimates P(y | a), P(x), and P(a | x, c) take the same form as before.]


Document Classification Exercise

Modification of Naïve Bayes:

P(w \mid c) = \sum_{a \uparrow c} P(a \mid c) \, P(w \mid a)

P(c \mid x) = \frac{P(c) \prod_{i} P(y_i \mid c)}{\sum_{c'} P(c') \prod_{i} P(y_i \mid c')}

where the products run over the terms y_i occurring in document x.


Mixture vs. Shrinkage

Shrinkage [McCallum & Rosenfeld, AAAI '98]: interior nodes in the hierarchy represent coarser views of the data, obtained by a simple pooling scheme of term counts

Mixture: interior nodes represent abstraction levels with their corresponding specific vocabulary

Predefined hierarchy [Hofmann and Popat, CIKM 2001]

Creation of a hierarchical model from unlabeled data [Hofmann, IJCAI '99]


Mixture Density Networks (MDN) [Bishop CM '94, Mixture Density Networks]

A broad and flexible class of distributions capable of modeling completely general continuous distributions

Superimpose simple component densities with well-known properties to generate or approximate more complex distributions

Two modules:

Mixture model: the output has a distribution given as a mixture of distributions

Neural network: its outputs determine the parameters of the mixture model


MDN: Example

A conditional mixture density network with Gaussian component densities


MDN

Parameter estimation: use the Generalized EM (GEM) algorithm to speed up

Inference: even for a linear mixture, a closed-form solution is not possible; use Monte Carlo simulation as a substitute


Document model

Vocabulary V, term w_i; document d represented by c(d) = (f(w_i, d) : w_i \in V), where f(w_i, d) is the number of times w_i occurs in document d

Most f values are zero for a single document

Monotone component-wise damping function g, such as log or square root: g(c(d)) = (g(f(w_i, d)) : w_i \in V)


Terminology

Expectation-Maximization (EM) Algorithm Iterative refinement: repeat until convergence to a locally optimal label

Expectation step: estimate parameters with which to simulate data

Maximization step: use simulated (“fictitious”) data to update parameters

Unsupervised Learning and Clustering Constructive induction: using unsupervised learning for supervised learning

Feature construction: “front end” - construct new x values

Cluster definition: “back end” - use these to reformulate y

Clustering problems: formation, segmentation, labeling

Key criterion: distance metric (points closer intra-cluster than inter-cluster)

Algorithms

AutoClass: Bayesian clustering

Principal Components Analysis (PCA), factor analysis (FA)

Self-Organizing Maps (SOM): topology-preserving transform (dimensionality reduction) for competitive unsupervised learning


Summary Points

Expectation-Maximization (EM) Algorithm

Unsupervised Learning and Clustering

Types of unsupervised learning

Clustering, vector quantization

Feature extraction (typically, dimensionality reduction)

Constructive induction: unsupervised learning in support of supervised learning

Feature construction (aka feature extraction)

Cluster definition

Algorithms

EM: mixture parameter estimation (e.g., for AutoClass)

AutoClass: Bayesian clustering

Principal Components Analysis (PCA), factor analysis (FA)

Self-Organizing Maps (SOM): projection of data; competitive algorithm

Clustering problems: formation, segmentation, labeling

Next Lecture: Time Series Learning and Characterization