Page 1: Unsupervised Methods


Unsupervised Methods

Page 2: Unsupervised Methods


Association Measures

• Association between items: assoc(x,y)
  – term-term, term-document, term-category, …

• Simple measure: freq(x,y), log(freq(x,y))+1

• Based on contingency table

Page 3: Unsupervised Methods


Mutual Information

• The term corresponding to the pair (x,y) in the Mutual Information of X,Y:

$$MI(x,y) = \log\frac{P(x,y)}{P(x)\,P(y)} = \log\frac{P(x|y)}{P(x)} = \log\frac{P(y|x)}{P(y)}$$

• Disadvantage: the MI value is inflated for low freq(x,y)

• Examples: results for two NLP articles
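As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the pointwise MI term above; the base-2 logarithm and the function name are choices of this sketch:

```python
import math

def pmi(freq_xy, freq_x, freq_y, n):
    """Pointwise mutual information of the pair (x, y).

    freq_xy, freq_x, freq_y are corpus counts and n is the total number
    of (co-)occurrence events: MI(x, y) = log P(x,y) / (P(x) P(y)).
    """
    p_xy = freq_xy / n
    p_x = freq_x / n
    p_y = freq_y / n
    return math.log2(p_xy / (p_x * p_y))

# A pair seen 20 times, x seen 100 times, y seen 50 times, n = 10,000 events:
# pmi(20, 100, 50, 10_000)  ->  log2(40), about 5.3
# A pair seen once between words each seen once gets log2(10_000), about 13.3,
# which is the low-frequency inflation mentioned above.
```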

Page 4: Unsupervised Methods


Log-Likelihood Ratio Test

• Comparing the likelihood of the data given two competing hypotheses (Dunning, 1993)

• Does not depend heavily on assumptions of normality, can be applied to small samples

• Used to test if p(x|y) = p(x|~y) = p(x), by comparing it to the general case (inequality)

• High log-likelihood score indicates that the data is much less likely if assuming equality

Page 5: Unsupervised Methods


Log-Likelihood (cont.)

• Likelihood function: $H(p_1, p_2, \ldots;\ k_1, k_2, \ldots)$

• The likelihood ratio:

$$\lambda = \frac{\max_{H_0} H(p;k)}{\max_{H} H(p;k)}$$

• $-2\log\lambda$ is asymptotically $\chi^2$ distributed

• High $-2\log\lambda$: the data is less likely given $H_0$

Page 6: Unsupervised Methods


Log-Likelihood for Bigrams

$$p_1 = p(x|y) = \frac{a}{a+c} = \frac{k_1}{n_1}$$

$$p_2 = p(x|\neg y) = \frac{b}{b+d} = \frac{k_2}{n_2}$$

$$p = p(x) = \frac{a+b}{a+b+c+d} = \frac{k_1+k_2}{n_1+n_2}$$

(a, b, c, d are the cells of the contingency table for x and y; the $k_i$ are the observed counts and the $n_i$ the sizes of the two samples.)

Page 7: Unsupervised Methods


Log-Likelihood for Binomial

$$H(p; n, k) = p^k (1-p)^{n-k}$$

$$H(p_1, p_2;\ n_1, k_1, n_2, k_2) = H(p_1; n_1, k_1)\cdot H(p_2; n_2, k_2)$$

$$\lambda = \frac{\max_{p}\ H(p, p;\ n_1, k_1, n_2, k_2)}{\max_{p_1, p_2}\ H(p_1, p_2;\ n_1, k_1, n_2, k_2)}$$

• Maximum obtained for:

$$p_1 = \frac{k_1}{n_1}\ ;\qquad p_2 = \frac{k_2}{n_2}\ ;\qquad p = \frac{k_1+k_2}{n_1+n_2}$$
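To tie the pieces together, here is a small Python sketch of Dunning's ratio for a bigram, computed directly from the counts defined above (function names are this sketch's own, not from the slides):

```python
import math

def _log_binom_l(p, n, k):
    """log of the binomial likelihood H(p; n, k) = p^k (1 - p)^(n - k)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def log_likelihood_ratio(k1, n1, k2, n2):
    """-2 log(lambda) for H0: p(x|y) = p(x|~y).

    k1 = count(x, y),  n1 = count(y)
    k2 = count(x, ~y), n2 = count(~y)
    """
    p1 = k1 / n1
    p2 = k2 / n2
    p = (k1 + k2) / (n1 + n2)
    log_lambda = (_log_binom_l(p, n1, k1) + _log_binom_l(p, n2, k2)
                  - _log_binom_l(p1, n1, k1) - _log_binom_l(p2, n2, k2))
    return -2.0 * log_lambda   # large values: the data is unlikely under equality

# log_likelihood_ratio(k1=20, n1=1000, k2=30, n2=99000)
```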

Page 8: Unsupervised Methods


Measuring Term Topicality

• For query relevance ranking: Inverse Document Frequency

• For term extraction:
  – Frequency
  – Frequency ratio for a specialized vs. a general corpus
  – Entropy of the term co-occurrence distribution
  – Burstiness (see the sketch after this list):
    • Entropy of the term's distribution (frequency) over documents
    • Proportion of topical documents for the term (freq > 1) within all documents containing the term (Katz, 1996)
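As a small illustration (helper names assumed here, not taken from the slides), the two burstiness indicators in the last sub-list could be computed roughly as follows:

```python
import math

def distribution_entropy(doc_freqs):
    """Entropy of a term's frequency distribution over the documents containing it.

    doc_freqs: list of freq(term, doc).  Evenly spread (non-topical) terms get
    high entropy; bursty terms concentrate their mass in few documents.
    """
    total = sum(doc_freqs)
    probs = [f / total for f in doc_freqs if f > 0]
    return -sum(p * math.log2(p) for p in probs)

def topical_proportion(doc_freqs, threshold=1):
    """Katz (1996) style indicator: share of 'topical' documents (freq > threshold)
    among all documents containing the term."""
    containing = [f for f in doc_freqs if f > 0]
    topical = [f for f in containing if f > threshold]
    return len(topical) / len(containing) if containing else 0.0
```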

Page 9: Unsupervised Methods


Similarity Measures

• Cosine:

$$sim(u,v) = \frac{\sum_{att} \log freq(u,att)\cdot \log freq(v,att)}{\sqrt{\sum_{att} \log^2 freq(u,att)}\ \sqrt{\sum_{att} \log^2 freq(v,att)}}$$

• Min/Max:

$$sim(u,v) = \frac{\sum_{att} \min\big(I(u,att),\ I(v,att)\big)}{\sum_{att} \max\big(I(u,att),\ I(v,att)\big)}$$

• KL to Average:

$$A(u,v) = \sum_{att} \left[ P(att|u)\log\frac{2\,P(att|u)}{P(att|u)+P(att|v)} + P(att|v)\log\frac{2\,P(att|v)}{P(att|u)+P(att|v)} \right]$$
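For concreteness, a minimal Python sketch of the first two measures over sparse association vectors (dictionaries mapping attributes to assoc values, which is this sketch's assumed representation); the divergence-based measure is sketched later, after the smoothing slides:

```python
import math

def cosine(u, v):
    """Cosine similarity; u, v map attribute -> assoc(word, att), e.g. log freq."""
    dot = sum(u[a] * v[a] for a in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def min_max(u, v):
    """Min/Max similarity over non-negative associations, e.g. I(u, att)."""
    atts = u.keys() | v.keys()
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    return num / den if den else 0.0

# u = {"print": 2.1, "open": 1.3, "erase": 0.7}
# v = {"print": 1.8, "open": 0.9, "browse": 1.1}
# cosine(u, v), min_max(u, v)
```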

Page 10: Unsupervised Methods


A Unifying Schema of Similarity (with Erez Lotan)

• A general schema encoding most measures

• Identifies explicitly the important factors that determine (word) similarity

• Provides the basis for:
  – a general and efficient similarity computation procedure
  – evaluating and comparing alternative measures and components

Page 11: Unsupervised Methods


Mapping to Unified Similarity Scheme

[Diagram: sparse attribute vectors for words u and v; each cell holds assoc(u,att), derived from count(u,att), and each vector has a global weight W(u) = g(assoc(u,att) over its attributes). The shared attributes feed the joint association:]

$$SJ(u,v) = \sum_{att \in Both(u,v)} joint\big(assoc(u,att),\ assoc(v,att)\big)$$

$$sim(u,v) = f\big(SJ(u,v),\ norm(SJ(u,v), W(u), W(v))\big)$$

Page 12: Unsupervised Methods


Association and Joint Association

• assoc(u,att): quantify association strength
  – mutual information, weighted log frequency, conditional probability (orthogonal to the scheme)

• joint(assoc(u,att), assoc(v,att)): quantify the “similarity” of the two associations
  – ratio, difference, min, product

$$SJ(u,v) = \sum_{att \in Both(u,v)} joint\big(assoc(u,att),\ assoc(v,att)\big)$$

$$Both(u,v) = \{att : freq(u,att) > 0,\ freq(v,att) > 0\}$$

Page 13: Unsupervised Methods


Normalization

• Global weight of a word vector:

$$W(u) = g\big(\{assoc(u,att) : att \in Just(u)\}\big), \qquad Just(u) = \{att : freq(u,att) > 0\}$$

  – For cosine:

$$W(u) = \sqrt{\sum_{att \in Just(u)} assoc(u,att)^2}$$

• Normalization factor:

$$Norm\_Factor(u,v) = norm\big(SJ(u,v),\ W(u),\ W(v)\big)$$

  – For cosine:

$$Norm\_Factor(u,v) = W(u)\cdot W(v)$$

Page 14: Unsupervised Methods


The General Similarity Scheme

$$sim(u,v) = \frac{SJ(u,v)}{Norm\_Factor(u,v)} = \frac{SJ(u,v)}{norm\big(SJ(u,v), W(u), W(v)\big)}$$

where

$$SJ(u,v) = \sum_{att \in Both(u,v)} joint\big(assoc(u,att),\ assoc(v,att)\big)$$

• For example, cosine:

$$sim(u,v) = \frac{SJ(u,v)}{W(u)\cdot W(v)}$$
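The scheme above lends itself to a pluggable implementation. The sketch below (function and parameter names are assumptions of this sketch) passes assoc, joint, weight, and norm as components and shows cosine as one instantiation:

```python
import math

def scheme_sim(u_counts, v_counts, assoc, joint, weight, norm):
    """Unified similarity scheme with pluggable components (a sketch).

    u_counts, v_counts: dict attribute -> count(word, att)
    assoc(count)        association strength of a single cell
    joint(a_u, a_v)     joint association of the two strengths
    weight(assoc_vec)   global vector weight W(word)
    norm(sj, w_u, w_v)  normalization factor
    """
    a_u = {att: assoc(c) for att, c in u_counts.items() if c > 0}
    a_v = {att: assoc(c) for att, c in v_counts.items() if c > 0}
    both = a_u.keys() & a_v.keys()                         # Both(u, v)
    sj = sum(joint(a_u[att], a_v[att]) for att in both)    # SJ(u, v)
    denom = norm(sj, weight(a_u), weight(a_v))
    return sj / denom if denom else 0.0

def cosine_from_scheme(u_counts, v_counts):
    """Cosine as one instantiation of the scheme (assoc = weighted log frequency)."""
    return scheme_sim(
        u_counts, v_counts,
        assoc=lambda c: math.log(c + 1),
        joint=lambda a, b: a * b,                                     # product
        weight=lambda vec: math.sqrt(sum(a * a for a in vec.values())),
        norm=lambda sj, wu, wv: wu * wv,
    )
```

Association measures that need corpus-wide statistics (e.g. mutual information) would take more than a single cell count, but the decomposition into assoc, joint, weight, and norm stays the same.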

Page 15: Unsupervised Methods


Min/Max Measures

$$sim(u,v) = \frac{\sum_{att} \min\big(assoc(u,att),\ assoc(v,att)\big)}{\sum_{att} \max\big(assoc(u,att),\ assoc(v,att)\big)}$$

• May be viewed as:

$$joint = \frac{\min\big(assoc(u,att),\ assoc(v,att)\big)}{\max\big(assoc(u,att),\ assoc(v,att)\big)}, \qquad weight(att) = \max\big(assoc(u,att),\ assoc(v,att)\big)$$

Page 16: Unsupervised Methods


Associations Used with Min/Max

• Log-frequency and Global Entropy Weight (Grefenstette, 1994):

$$assoc(u,att) = \log\big(freq(u,att)+1\big)\cdot Gew(att)$$

$$Gew(att) = 1 + \frac{1}{\log(nrels)} \sum_{v} P(v|att)\,\log P(v|att) \ \in [0,1]$$

• Mutual information (Dagan et al., 1993/5):

$$assoc(u,att) = \log\frac{P(u,att)}{P(u)\,P(att)} = \log\frac{P(att|u)}{P(att)} = \log\frac{P(u|att)}{P(u)}$$

Page 17: Unsupervised Methods


Cosine Measure

• Used for word similarity (Ruge, 1992) with: assoc(u,att)=ln(freq(u,att))

• Popular for document ranking (vector space)

$$\cos(u,v) = \frac{\sum_{att} assoc(u,att)\cdot assoc(v,att)}{\sqrt{\sum_{att} assoc(u,att)^2}\ \sqrt{\sum_{att} assoc(v,att)^2}}$$

$$assoc(doc,term) = tf\cdot idf, \qquad tf = \frac{freq(doc,term)}{\max_{term'} freq(doc,term')}, \qquad idf = \log\frac{\max_{term'} docfreq(term')}{docfreq(term)}$$

Page 18: Unsupervised Methods


Methodological Benefits

Joint work with Erez Lotan (Dagan 2000 and in preparation)

• Uniform understanding of similarity measure structure

• Modular evaluation/comparison of measure components

• Modular implementation architecture, easy experimentation by “plugging” alternative measure combinations

Page 19: Unsupervised Methods


Empirical Evaluation

• Thesaurus for query expansion (e.g. “insurance laws”):

Similar words for law:

  Word          Similarity   Judgment
  regulation    0.050242     +
  rule          0.048414     +
  legislation   0.038251     +
  guideline     0.035041     +
  commission    0.034499     -
  bill          0.033414     +
  budget        0.031043     -
  regulator     0.031006     +
  code          0.030998     +
  circumstance  0.030534     -

• Precision and comparative Recall at each point in the list

Page 20: Unsupervised Methods


Comparing Measure Combinations

[Figure: Precision vs. Recall curves for the compared measure combinations]

• Min/Max schemes worked better than cosine and Jensen-Shannon (by almost 20 points) and were stable across association measures

Page 21: Unsupervised Methods


Effect of Co-occurrence Type on Semantic Similarity

Page 22: Unsupervised Methods


Computational Benefits

• Complexity reduced by a “sparseness” factor: #non-zero cells / total #cells (two orders of magnitude in corpus data)

[Diagram: sparse matrix of words by attributes (att_i), and the resulting word-by-word similarity matrix]

• Efficient implementation through sparse matrix indexing, computing over common attributes only (Both(u,v))

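A rough sketch of that sparse computation (names assumed here): an inverted index from attributes to words lets the Min/Max numerator be accumulated over non-zero cells only, and the denominator follows from per-word totals.

```python
from collections import defaultdict

def min_max_all_pairs(vectors):
    """All-pairs Min/Max similarity touching only shared (non-zero) attributes.

    vectors: dict word -> {attribute: assoc value > 0}
    Pairs with no common attribute never appear (their similarity is 0).
    """
    index = defaultdict(list)                 # inverted index: attribute -> (word, assoc)
    for word, vec in vectors.items():
        for att, a in vec.items():
            index[att].append((word, a))

    totals = {w: sum(vec.values()) for w, vec in vectors.items()}
    min_sums = defaultdict(float)             # (u, v) -> sum of min over common attributes

    for entries in index.values():
        for i, (u, a_u) in enumerate(entries):
            for v, a_v in entries[i + 1:]:
                min_sums[(u, v)] += min(a_u, a_v)

    # sum of max over the union = total(u) + total(v) - sum of min
    return {(u, v): m / (totals[u] + totals[v] - m) for (u, v), m in min_sums.items()}
```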

Page 23: Unsupervised Methods


General Scheme - Conclusions

• A general mathematical scheme
• Identifies the important factors for measuring similarity
• Efficient general procedure based on the scheme
• Empirical comparison of different measure components (measure structure and assoc)
• Successful application in an Internet crawler for thesaurus construction (small corpora)

Page 24: Unsupervised Methods


Clustering Methods

• Input: A set of objects (words, documents)

• Output: A set of clusters (sets of elements)

• Based on a criterion for the quality of a class, which guides cluster split/merge/modification
  – a distance function between objects/classes
  – a global quality function

Page 25: Unsupervised Methods


Clustering Types

• Soft / Hard
• Hierarchical / Flat
• Top-down / Bottom-up
• Predefined number of clusters or not
• Input:
  – all point-to-point distances
  – original vector representation for points, computing needed distances during clustering

Page 26: Unsupervised Methods


Applications of Clustering

• Word clustering
  – Constructing a hierarchical thesaurus
  – Compactness and generalization in word cooccurrence modeling (will be discussed later)

• Document clustering
  – Browsing of document collections and search query output
  – Assistance in defining a set of supervised categories

Page 27: Unsupervised Methods


Hierarchical Agglomerative Clustering Methods (HACM)

1. Initialize every point as a cluster
2. Compute a merge score for all cluster pairs
3. Perform the best scoring merge
4. Compute the merge score between the new cluster and all other clusters
5. If more than one cluster remains, return to 3
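The five steps map almost directly to code. A naive sketch follows (it recomputes all pair scores each round instead of updating them incrementally as in step 4, and the names are this sketch's own):

```python
def hacm(points, merge_score, until=1):
    """Hierarchical agglomerative clustering over the steps listed above.

    merge_score(c1, c2): score for merging clusters c1, c2 (lower is better),
    e.g. single link = distance between the two nearest points.
    Stops when `until` clusters remain.
    """
    clusters = [[p] for p in points]                       # 1. every point is a cluster
    while len(clusters) > until:                           # 5. repeat while possible
        pairs = ((merge_score(clusters[i], clusters[j]), i, j)   # 2./4. score all pairs
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        _, i, j = min(pairs)
        merged = clusters[i] + clusters[j]                 # 3. best scoring merge
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Single-link clustering of 1-D points into two clusters:
# hacm([1.0, 1.1, 5.0, 5.2],
#      merge_score=lambda c1, c2: min(abs(a - b) for a in c1 for b in c2),
#      until=2)
```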

Page 28: Unsupervised Methods


Types of Merge Score

• Minimal distance between the two candidates for the merge. Alternatives for cluster distance:
  – Single link: distance between the two nearest points
  – Complete link: distance between the two furthest points
  – Group average: average pairwise distance over all points
  – Centroid: distance between the two cluster centroids

• Based on the “quality” of the merged class:
  – Ward’s method: minimal increase in total within-group sum of squares (average squared distance to centroid)

• Based on a global criterion (in Brown et al., 1992: minimal reduction in average mutual information)

Page 29: Unsupervised Methods


Unsupervised Statistics and Generalizations for Classification

• Many supervised methods use cooccurrence statistics as features or probability estimates
  – eat a {peach, beach}
  – fire a missile vs. fire the prime minister

• Sparse data problem: if alternative cooccurrences never occurred, how to estimate their probabilities, or their relative “strength” as features?

Page 30: Unsupervised Methods


Application: Semantic Disambiguation

• Traditional AI-style approach: manually encoded semantic preferences/constraints, e.g. an <object – verb> preference over a hand-built hierarchy

[Diagram: Weapon → Bombs → grenade; Actions → Cause_movement → throw, drop; an <object – verb> preference links the two branches]

• Anaphora resolution (Dagan, Justeson, Lappin, Leass, Ribak 1995):

  “The terrorist pulled the grenade from his pocket and threw it at the policeman” (to what does “it” refer?)

Page 31: Unsupervised Methods


Statistical Approach

Corpus (text collection)

<verb–object: throw-grenade> 20 times

<verb–object: throw-pocket> 1 time

“Semantic” Judgment

• Semantic confidence combined with syntactic preferences: it → grenade

• “Language modeling” for disambiguation

Page 32: Unsupervised Methods


What about sense disambiguation? (for translation)

  I bought soap bars:    sense1 (‘chafisa’) or sense2 (‘sorag’) ?
  I bought window bars:  sense1 (‘chafisa’) or sense2 (‘sorag’) ?

Corpus (text collection)

Sense1:
  <noun-noun: soap-bar>       20 times
  <noun-noun: chocolate-bar>  15 times

Sense2:
  <noun-noun: window-bar>     17 times
  <noun-noun: iron-bar>       22 times

• “Hidden” senses – supervised labeling required?

Page 33: Unsupervised Methods


Solution: Mapping to Another Language

English(-English)-Hebrew Dictionary:

  bar1 → ‘chafisa’    soap → ‘sabon’
  bar2 → ‘sorag’      window → ‘chalon’

Map ambiguous constructs to the second language (all possibilities) and count them in a Hebrew corpus:

  <noun-noun: soap-bar>    1: <noun-noun: ‘chafisat-sabon’>   20 times
                           2: <noun-noun: ‘sorag-sabon’>       0 times

  <noun-noun: window-bar>  1: <noun-noun: ‘chafisat-chalon’>   0 times
                           2: <noun-noun: ‘sorag-chalon’>     15 times

• Exploiting the difference in ambiguities between the languages

• Principle – intersecting redundancies (Dagan and Itai 1994)

Page 34: Unsupervised Methods


Selection Model Highlights

• Multinomial model, under certain linguistic assumptions

• Selection “confidence” – lower bound for the odds ratio:

$$Conf(i) = \ln\frac{n_i}{n_j} - Z_{1-\alpha}\sqrt{\frac{1}{n_i}+\frac{1}{n_j}} \ \leq\ \ln\frac{p_i}{p_j}$$

• Overlapping ambiguous constructs are resolved through constraint propagation, by decreasing confidence order.

• Results (Hebrew→English): coverage ~70%, precision within coverage ~90%
  – ~20% improvement over choosing the most frequent translation (the common baseline)
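A small sketch of the confidence bound as reconstructed above (the reconstruction itself is hedged; parameter names are this sketch's own). The bound is compared against a threshold before a selection is made:

```python
import math

def selection_confidence(n_i, n_j, z=1.645):
    """Lower confidence bound for ln(p_i / p_j) from target-language counts.

    n_i, n_j: counts of the two competing alternatives (both assumed > 0);
    z = Z_{1-alpha}, e.g. 1.645 for a one-sided 95% bound.  Alternative i is
    selected only if the bound exceeds a confidence threshold.
    """
    return math.log(n_i / n_j) - z * math.sqrt(1.0 / n_i + 1.0 / n_j)

# throw-grenade seen 20 times vs. throw-pocket seen once:
# selection_confidence(20, 1)  ->  about 1.3
```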

Page 35: Unsupervised Methods


Data Sparseness and Similarity

<verb–object: ‘hidpis-tikiya’> → ?

  <verb–object: print-folder>        0 times
  <verb–object: print-file_cabinet>  0 times

• Standard approach: “back-off” to single term frequency

• Similarity-based inference:

[Diagram: ‘folder’ is similar to file, directory, record, …; ‘file_cabinet’ is similar to cupboard, closet, suitcase, …; the <verb–object: print, ·> counts observed for these similar words support the inference]

Page 36: Unsupervised Methods


Computing Distributional Similarity

[Diagram: ‘folder’ and its similar word ‘file’ share context attributes such as print, erase, open, retrieve, browse, save, …]

• Association between word u (“folder”) and its “attributes” (context words/features) is based on mutual information:

$$I(u,att) = \log_2\frac{P(att,u)}{P(att)\,P(u)} = \log_2\frac{P(att|u)}{P(att)}$$

• Similarity between u and v (weighted Jaccard, in [0,1]):

$$sim(u,v) = \frac{\sum_{att} \min\big(I(u,att),\ I(v,att)\big)}{\sum_{att} \max\big(I(u,att),\ I(v,att)\big)}$$

Page 37: Unsupervised Methods


Disambiguation Algorithm

Selection of the preferred alternative:

• Hypothesized similarity-based frequency derived from the average association for similar words (incorporating single term frequency)

• Comparing hypothesized frequencies

[Diagram as on the previous slides: the similar-word sets for ‘folder’ and ‘file_cabinet’ with their <verb–object: print, ·> associations]

Page 38: Unsupervised Methods


Computation and Evaluation

• Heuristic search used to speed up computation of the k most similar words

• Results (Hebrew→English):
  – 15% coverage increase, while decreasing precision by 2%
  – Accuracy 15% better than back-off to single word frequency

(Dagan, Marcus and Markovitch 1995)

Page 39: Unsupervised Methods


Probabilistic Framework - Smoothing

• Counts are obtained from a sample of the probability space

• Maximum Likelihood Estimate proportional to sample counts:
  – MLE estimate: 0 probability for unobserved events

• Smoothing discounts observed events, leaving probability “mass” to unobserved events:
  – discounted estimate for observed events
  – positive estimate for unobserved events

Page 40: Unsupervised Methods


Smoothing Conditional Attribute Probability

• Good-Turing smoothing scheme – discount & redistribute:

$$P(att|u) = P_d(att|u) \quad \text{if } count(u,att) > 0, \qquad norm(u) = 1 - \sum_{att:\,count(u,att)>0} P_d(att|u)$$

• Katz’s seminal back-off scheme (speech language modeling):

$$P(att|u) = norm(u)\cdot P(att) \quad \text{if } count(u,att) = 0$$

• Similarity-based smoothing (Dagan, Lee, Pereira 1999):

$$P(att|u) = norm(u)\cdot P_{SIM}(att|u) \quad \text{if } count(u,att) = 0$$

$$\text{where } P_{SIM}(att|u) = \sum_{u' \in SIM(u)} \frac{f(u,u')}{\sum_{u'' \in SIM(u)} f(u,u'')}\, P(att|u')$$
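A schematic Python sketch of the similarity-based case (all helper callables here are assumptions of the sketch, standing in for the discounted estimates, the similar-word list SIM(u), and the weighting function f):

```python
def smoothed_prob(att, u, counts, p_discounted, p_cond, norm, sim_neighbors, f):
    """Similarity-based smoothing of P(att|u), following the scheme above.

    counts[(u, att)]      observed count
    p_discounted(att, u)  discounted (Good-Turing) estimate for observed events
    p_cond(att, u2)       P(att|u2) for a similar word u2
    norm(u)               probability mass left for unobserved events
    sim_neighbors(u)      the most similar words SIM(u)
    f(u, u2)              similarity weight, e.g. exp(-beta * A(u, u2))
    """
    if counts.get((u, att), 0) > 0:
        return p_discounted(att, u)                     # observed: discounted estimate
    neighbors = sim_neighbors(u)
    total_w = sum(f(u, u2) for u2 in neighbors)
    # P_SIM(att|u): similarity-weighted average of the neighbors' estimates
    p_sim = sum(f(u, u2) / total_w * p_cond(att, u2) for u2 in neighbors)
    return norm(u) * p_sim                              # unobserved: redistributed mass
```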

Page 41: Unsupervised Methods


Similarity/Distance Functions for Probability Distributions

• L1 norm:

$$L_1(u,v) = \sum_{att} \big|P(att|u) - P(att|v)\big|, \qquad f_{SIM,L}(u,v) = 2 - L_1(u,v)$$

• Jensen-Shannon divergence (KL-distance to the average):

$$A(u,v) = D_{KL}\big(P_u \,\|\, \bar{P}\big) + D_{KL}\big(P_v \,\|\, \bar{P}\big), \qquad \bar{P}(att) = \tfrac{1}{2}\big(P(att|u) + P(att|v)\big)$$

Information loss by approximating u and v by their average

$$f_{SIM,A}(u,v) = \exp\big(-\beta\, A(u,v)\big)$$

β controls the relative influence of close vs. remote neighbors
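For completeness, a small sketch of the two distance functions and the resulting neighbor weight over P(att|·) dictionaries (names are this sketch's own):

```python
import math

def l1_distance(p_u, p_v):
    """L1 norm between two attribute distributions (dict att -> P(att|word))."""
    atts = p_u.keys() | p_v.keys()
    return sum(abs(p_u.get(a, 0.0) - p_v.get(a, 0.0)) for a in atts)

def divergence_to_average(p_u, p_v):
    """A(u,v): KL-distance of each distribution to their average."""
    a = 0.0
    for att in p_u.keys() | p_v.keys():
        pu, pv = p_u.get(att, 0.0), p_v.get(att, 0.0)
        avg = 0.5 * (pu + pv)
        if pu > 0:
            a += pu * math.log2(pu / avg)
        if pv > 0:
            a += pv * math.log2(pv / avg)
    return a

def neighbor_weight(p_u, p_v, beta=1.0):
    """f_SIM,A(u,v) = exp(-beta * A(u,v)); larger beta favors the closest neighbors."""
    return math.exp(-beta * divergence_to_average(p_u, p_v))
```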

Page 42: Unsupervised Methods


Sample Results

• Several smoothing experiments (A performed best):
  – Language modeling for speech (hunt bears / pears)
  – Perplexity (predicting test corpus likelihood)
  – Data recovery task (similar to sense disambiguation)
  – Insensitive to the exact value of β

• Most similar words to “guy”:

  Measure   Closest Words
  A         guy kid thing lot man mother doctor friend boy son
  L         guy kid lot thing man doctor girl rest son bit
  PC        role people fire guy man year lot today way part

Typical common verb contexts: see get give tell take …

PC : an earlier attempt for similarity-based smoothing

Page 43: Unsupervised Methods


Class-Based Generalization

• Obtain a cooccurrence-based clustering of words and model a word cooccurrence by word-class or class-class cooccurrence

• Brown et al., CL 1992: Mutual information clustering; class-based model interpolated to n-gram model

• Pereira, Tishby, Lee, ACL 1993: soft, top-down distributional clustering for bigram modeling

• Similarity/class-based methods: general effectiveness yet to be shown

Page 44: Unsupervised Methods


Conclusions

• (Relatively) simple models cover a wide range of applications

• Usefulness in (hybrid) systems: automatic processing and knowledge acquisition

Page 45: Unsupervised Methods


Discussion