Page 1: Unsupervised Methods


Unsupervised Methods

Page 2: Unsupervised Methods


Association Measures

• Association between items: assoc(x,y)
  – term-term, term-document, term-category, …

• Simple measure: freq(x,y), log(freq(x,y))+1

• Based on contingency table

Page 3: Unsupervised Methods


Mutual Information

• The term corresponding to the pair (x,y) in the Mutual Information of X,Y:

$$MI(x,y) = \log\frac{P(x,y)}{P(x)\,P(y)} = \log\frac{P(x|y)}{P(x)} = \log\frac{P(y|x)}{P(y)}$$

• Disadvantage: the MI value is inflated for low freq(x,y)

• Examples: results for two NLP articles
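As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the pointwise MI term above; the base-2 logarithm and the function name are choices of this sketch:

```python
import math

def pmi(freq_xy, freq_x, freq_y, n):
    """Pointwise mutual information of the pair (x, y).

    freq_xy, freq_x, freq_y are corpus counts and n is the total number
    of (co-)occurrence events: MI(x, y) = log P(x,y) / (P(x) P(y)).
    """
    p_xy = freq_xy / n
    p_x = freq_x / n
    p_y = freq_y / n
    return math.log2(p_xy / (p_x * p_y))

# A pair seen 20 times, x seen 100 times, y seen 50 times, n = 10,000 events:
# pmi(20, 100, 50, 10_000)  ->  log2(40), about 5.3
# A pair seen once between words each seen once gets log2(10_000), about 13.3,
# which is the low-frequency inflation mentioned above.
```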

Page 4: Unsupervised Methods


Log-Likelihood Ratio Test

• Comparing the likelihood of the data given two competing hypotheses (Dunning, 1993)

• Does not depend heavily on assumptions of normality, can be applied to small samples

• Used to test if p(x|y) = p(x|~y) = p(x), by comparing it to the general case (inequality)

• High log-likelihood score indicates that the data is much less likely if assuming equality

Page 5: Unsupervised Methods


Log-Likelihood (cont.)

• Likelihood function: $H(p_1, p_2, \ldots;\ k_1, k_2, \ldots)$

• The likelihood ratio:

$$\lambda = \frac{\max_{H_0} H(p;k)}{\max_{H} H(p;k)}$$

• $-2\log\lambda$ is asymptotically $\chi^2$ distributed

• High $-2\log\lambda$: the data is less likely given $H_0$

Page 6: Unsupervised Methods


Log-Likelihood for Bigrams

$$p_1 = p(x|y) = \frac{a}{a+c} = \frac{k_1}{n_1}$$

$$p_2 = p(x|\neg y) = \frac{b}{b+d} = \frac{k_2}{n_2}$$

$$p = p(x) = \frac{a+b}{a+b+c+d} = \frac{k_1+k_2}{n_1+n_2}$$

(a, b, c, d are the cells of the contingency table for x and y; the $k_i$ are the observed counts and the $n_i$ the sizes of the two samples.)

Page 7: Unsupervised Methods


Log-Likelihood for Binomial

$$H(p; n, k) = p^k (1-p)^{n-k}$$

$$H(p_1, p_2;\ n_1, k_1, n_2, k_2) = H(p_1; n_1, k_1)\cdot H(p_2; n_2, k_2)$$

$$\lambda = \frac{\max_{p}\ H(p, p;\ n_1, k_1, n_2, k_2)}{\max_{p_1, p_2}\ H(p_1, p_2;\ n_1, k_1, n_2, k_2)}$$

• Maximum obtained for:

$$p_1 = \frac{k_1}{n_1}\ ;\qquad p_2 = \frac{k_2}{n_2}\ ;\qquad p = \frac{k_1+k_2}{n_1+n_2}$$
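To tie the pieces together, here is a small Python sketch of Dunning's ratio for a bigram, computed directly from the counts defined above (function names are this sketch's own, not from the slides):

```python
import math

def _log_binom_l(p, n, k):
    """log of the binomial likelihood H(p; n, k) = p^k (1 - p)^(n - k)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def log_likelihood_ratio(k1, n1, k2, n2):
    """-2 log(lambda) for H0: p(x|y) = p(x|~y).

    k1 = count(x, y),  n1 = count(y)
    k2 = count(x, ~y), n2 = count(~y)
    """
    p1 = k1 / n1
    p2 = k2 / n2
    p = (k1 + k2) / (n1 + n2)
    log_lambda = (_log_binom_l(p, n1, k1) + _log_binom_l(p, n2, k2)
                  - _log_binom_l(p1, n1, k1) - _log_binom_l(p2, n2, k2))
    return -2.0 * log_lambda   # large values: the data is unlikely under equality

# log_likelihood_ratio(k1=20, n1=1000, k2=30, n2=99000)
```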

Page 8: Unsupervised Methods


Measuring Term Topicality

• For query relevance ranking: Inverse Document Frequency

• For term extraction:
  – Frequency
  – Frequency ratio for a specialized vs. a general corpus
  – Entropy of the term co-occurrence distribution
  – Burstiness (see the sketch after this list):
    • Entropy of the term's distribution (frequency) over documents
    • Proportion of topical documents for the term (freq > 1) within all documents containing the term (Katz, 1996)
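As a small illustration (helper names assumed here, not taken from the slides), the two burstiness indicators in the last sub-list could be computed roughly as follows:

```python
import math

def distribution_entropy(doc_freqs):
    """Entropy of a term's frequency distribution over the documents containing it.

    doc_freqs: list of freq(term, doc).  Evenly spread (non-topical) terms get
    high entropy; bursty terms concentrate their mass in few documents.
    """
    total = sum(doc_freqs)
    probs = [f / total for f in doc_freqs if f > 0]
    return -sum(p * math.log2(p) for p in probs)

def topical_proportion(doc_freqs, threshold=1):
    """Katz (1996) style indicator: share of 'topical' documents (freq > threshold)
    among all documents containing the term."""
    containing = [f for f in doc_freqs if f > 0]
    topical = [f for f in containing if f > threshold]
    return len(topical) / len(containing) if containing else 0.0
```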

Page 9: Unsupervised Methods


Similarity Measures

• Cosine:

$$sim(u,v) = \frac{\sum_{att} \log freq(u,att)\cdot \log freq(v,att)}{\sqrt{\sum_{att} \log^2 freq(u,att)}\ \sqrt{\sum_{att} \log^2 freq(v,att)}}$$

• Min/Max:

$$sim(u,v) = \frac{\sum_{att} \min\big(I(u,att),\ I(v,att)\big)}{\sum_{att} \max\big(I(u,att),\ I(v,att)\big)}$$

• KL to Average:

$$A(u,v) = \sum_{att} \left[ P(att|u)\log\frac{2\,P(att|u)}{P(att|u)+P(att|v)} + P(att|v)\log\frac{2\,P(att|v)}{P(att|u)+P(att|v)} \right]$$
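For concreteness, a minimal Python sketch of the first two measures over sparse association vectors (dictionaries mapping attributes to assoc values, which is this sketch's assumed representation); the divergence-based measure is sketched later, after the smoothing slides:

```python
import math

def cosine(u, v):
    """Cosine similarity; u, v map attribute -> assoc(word, att), e.g. log freq."""
    dot = sum(u[a] * v[a] for a in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def min_max(u, v):
    """Min/Max similarity over non-negative associations, e.g. I(u, att)."""
    atts = u.keys() | v.keys()
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    return num / den if den else 0.0

# u = {"print": 2.1, "open": 1.3, "erase": 0.7}
# v = {"print": 1.8, "open": 0.9, "browse": 1.1}
# cosine(u, v), min_max(u, v)
```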

Page 10: Unsupervised Methods


A Unifying Schema of Similarity (with Erez Lotan)

• A general schema encoding most measures

• Identifies explicitly the important factors that determine (word) similarity

• Provides the basis for:
  – a general and efficient similarity computation procedure
  – evaluating and comparing alternative measures and components

Page 11: Unsupervised Methods


Mapping to Unified Similarity Scheme

[Diagram: sparse attribute vectors for words u and v; each cell holds assoc(u,att), derived from count(u,att), and each vector has a global weight W(u) = g(assoc(u,att) over its attributes). The shared attributes feed the joint association:]

$$SJ(u,v) = \sum_{att \in Both(u,v)} joint\big(assoc(u,att),\ assoc(v,att)\big)$$

$$sim(u,v) = f\big(SJ(u,v),\ norm(SJ(u,v), W(u), W(v))\big)$$

Page 12: Unsupervised Methods


Association and Joint Association

• assoc(u,att): quantify association strength
  – mutual information, weighted log frequency, conditional probability (orthogonal to the scheme)

• joint(assoc(u,att), assoc(v,att)): quantify the “similarity” of the two associations
  – ratio, difference, min, product

$$SJ(u,v) = \sum_{att \in Both(u,v)} joint\big(assoc(u,att),\ assoc(v,att)\big)$$

$$Both(u,v) = \{att : freq(u,att) > 0,\ freq(v,att) > 0\}$$

Page 13: Unsupervised Methods


Normalization

• Global weight of a word vector:

$$W(u) = g\big(\{assoc(u,att) : att \in Just(u)\}\big), \qquad Just(u) = \{att : freq(u,att) > 0\}$$

  – For cosine:

$$W(u) = \sqrt{\sum_{att \in Just(u)} assoc(u,att)^2}$$

• Normalization factor:

$$Norm\_Factor(u,v) = norm\big(SJ(u,v),\ W(u),\ W(v)\big)$$

  – For cosine:

$$Norm\_Factor(u,v) = W(u)\cdot W(v)$$

Page 14: Unsupervised Methods


The General Similarity Scheme

$$sim(u,v) = \frac{SJ(u,v)}{Norm\_Factor(u,v)} = \frac{SJ(u,v)}{norm\big(SJ(u,v), W(u), W(v)\big)}$$

where

$$SJ(u,v) = \sum_{att \in Both(u,v)} joint\big(assoc(u,att),\ assoc(v,att)\big)$$

• For example, cosine:

$$sim(u,v) = \frac{SJ(u,v)}{W(u)\cdot W(v)}$$
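The scheme above lends itself to a pluggable implementation. The sketch below (function and parameter names are assumptions of this sketch) passes assoc, joint, weight, and norm as components and shows cosine as one instantiation:

```python
import math

def scheme_sim(u_counts, v_counts, assoc, joint, weight, norm):
    """Unified similarity scheme with pluggable components (a sketch).

    u_counts, v_counts: dict attribute -> count(word, att)
    assoc(count)        association strength of a single cell
    joint(a_u, a_v)     joint association of the two strengths
    weight(assoc_vec)   global vector weight W(word)
    norm(sj, w_u, w_v)  normalization factor
    """
    a_u = {att: assoc(c) for att, c in u_counts.items() if c > 0}
    a_v = {att: assoc(c) for att, c in v_counts.items() if c > 0}
    both = a_u.keys() & a_v.keys()                         # Both(u, v)
    sj = sum(joint(a_u[att], a_v[att]) for att in both)    # SJ(u, v)
    denom = norm(sj, weight(a_u), weight(a_v))
    return sj / denom if denom else 0.0

def cosine_from_scheme(u_counts, v_counts):
    """Cosine as one instantiation of the scheme (assoc = weighted log frequency)."""
    return scheme_sim(
        u_counts, v_counts,
        assoc=lambda c: math.log(c + 1),
        joint=lambda a, b: a * b,                                     # product
        weight=lambda vec: math.sqrt(sum(a * a for a in vec.values())),
        norm=lambda sj, wu, wv: wu * wv,
    )
```

Association measures that need corpus-wide statistics (e.g. mutual information) would take more than a single cell count, but the decomposition into assoc, joint, weight, and norm stays the same.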

Page 15: Unsupervised Methods


Min/Max Measures

$$sim(u,v) = \frac{\sum_{att} \min\big(assoc(u,att),\ assoc(v,att)\big)}{\sum_{att} \max\big(assoc(u,att),\ assoc(v,att)\big)}$$

• May be viewed as:

$$joint = \frac{\min\big(assoc(u,att),\ assoc(v,att)\big)}{\max\big(assoc(u,att),\ assoc(v,att)\big)}, \qquad weight(att) = \max\big(assoc(u,att),\ assoc(v,att)\big)$$

Page 16: Unsupervised Methods


Associations Used with Min/Max

• Log-frequency and Global Entropy Weight (Grefenstette, 1994):

$$assoc(u,att) = \log\big(freq(u,att)+1\big)\cdot Gew(att)$$

$$Gew(att) = 1 + \frac{1}{\log(nrels)} \sum_{v} P(v|att)\,\log P(v|att) \ \in [0,1]$$

• Mutual information (Dagan et al., 1993/5):

$$assoc(u,att) = \log\frac{P(u,att)}{P(u)\,P(att)} = \log\frac{P(att|u)}{P(att)} = \log\frac{P(u|att)}{P(u)}$$

Page 17: Unsupervised Methods


Cosine Measure

• Used for word similarity (Ruge, 1992) with: assoc(u,att)=ln(freq(u,att))

• Popular for document ranking (vector space)

$$\cos(u,v) = \frac{\sum_{att} assoc(u,att)\cdot assoc(v,att)}{\sqrt{\sum_{att} assoc(u,att)^2}\ \sqrt{\sum_{att} assoc(v,att)^2}}$$

$$assoc(doc,term) = tf\cdot idf, \qquad tf = \frac{freq(doc,term)}{\max_{term'} freq(doc,term')}, \qquad idf = \log\frac{\max_{term'} docfreq(term')}{docfreq(term)}$$

Page 18: Unsupervised Methods


Methodological Benefits

Joint work with Erez Lotan (Dagan 2000 and in preparation)

• Uniform understanding of similarity measure structure

• Modular evaluation/comparison of measure components

• Modular implementation architecture, easy experimentation by “plugging” alternative measure combinations

Page 19: Unsupervised Methods


Empirical Evaluation

• Thesaurus for query expansion (e.g. “insurance laws”):

Similar words for law:

  Word          Similarity   Judgment
  regulation    0.050242     +
  rule          0.048414     +
  legislation   0.038251     +
  guideline     0.035041     +
  commission    0.034499     -
  bill          0.033414     +
  budget        0.031043     -
  regulator     0.031006     +
  code          0.030998     +
  circumstance  0.030534     -

• Precision and comparative Recall at each point in the list

Page 20: Unsupervised Methods


Comparing Measure Combinations

[Figure: Precision vs. Recall curves for the compared measure combinations]

• Min/Max schemes worked better than cosine and Jensen-Shannon (by almost 20 points) and were stable across association measures

Page 21: Unsupervised Methods


Effect of Co-occurrence Type on Semantic Similarity

Page 22: Unsupervised Methods


Computational Benefits

• Complexity reduced by a “sparseness” factor: #non-zero cells / total #cells (two orders of magnitude in corpus data)

[Diagram: sparse matrix of words by attributes (att_i), and the resulting word-by-word similarity matrix]

• Efficient implementation through sparse matrix indexing, computing over common attributes only (Both(u,v))

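A rough sketch of that sparse computation (names assumed here): an inverted index from attributes to words lets the Min/Max numerator be accumulated over non-zero cells only, and the denominator follows from per-word totals.

```python
from collections import defaultdict

def min_max_all_pairs(vectors):
    """All-pairs Min/Max similarity touching only shared (non-zero) attributes.

    vectors: dict word -> {attribute: assoc value > 0}
    Pairs with no common attribute never appear (their similarity is 0).
    """
    index = defaultdict(list)                 # inverted index: attribute -> (word, assoc)
    for word, vec in vectors.items():
        for att, a in vec.items():
            index[att].append((word, a))

    totals = {w: sum(vec.values()) for w, vec in vectors.items()}
    min_sums = defaultdict(float)             # (u, v) -> sum of min over common attributes

    for entries in index.values():
        for i, (u, a_u) in enumerate(entries):
            for v, a_v in entries[i + 1:]:
                min_sums[(u, v)] += min(a_u, a_v)

    # sum of max over the union = total(u) + total(v) - sum of min
    return {(u, v): m / (totals[u] + totals[v] - m) for (u, v), m in min_sums.items()}
```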

Page 23: Unsupervised Methods


General Scheme - Conclusions

• A general mathematical scheme
• Identifies the important factors for measuring similarity
• Efficient general procedure based on the scheme
• Empirical comparison of different measure components (measure structure and assoc)
• Successful application in an Internet crawler for thesaurus construction (small corpora)

Page 24: Unsupervised Methods


Clustering Methods

• Input: A set of objects (words, documents)

• Output: A set of clusters (sets of elements)

• Based on a criterion for the quality of a class, which guides cluster split/merge/modification
  – a distance function between objects/classes
  – a global quality function

Page 25: Unsupervised Methods


Clustering Types

• Soft / Hard
• Hierarchical / Flat
• Top-down / Bottom-up
• Predefined number of clusters or not
• Input:
  – all point-to-point distances
  – original vector representation for points, computing needed distances during clustering

Page 26: Unsupervised Methods


Applications of Clustering

• Word clustering
  – Constructing a hierarchical thesaurus
  – Compactness and generalization in word cooccurrence modeling (will be discussed later)

• Document clustering
  – Browsing of document collections and search query output
  – Assistance in defining a set of supervised categories

Page 27: Unsupervised Methods


Hierarchical Agglomerative Clustering Methods (HACM)

1. Initialize every point as a cluster
2. Compute a merge score for all cluster pairs
3. Perform the best scoring merge
4. Compute the merge score between the new cluster and all other clusters
5. If more than one cluster remains, return to 3
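The five steps map almost directly to code. A naive sketch follows (it recomputes all pair scores each round instead of updating them incrementally as in step 4, and the names are this sketch's own):

```python
def hacm(points, merge_score, until=1):
    """Hierarchical agglomerative clustering over the steps listed above.

    merge_score(c1, c2): score for merging clusters c1, c2 (lower is better),
    e.g. single link = distance between the two nearest points.
    Stops when `until` clusters remain.
    """
    clusters = [[p] for p in points]                       # 1. every point is a cluster
    while len(clusters) > until:                           # 5. repeat while possible
        pairs = ((merge_score(clusters[i], clusters[j]), i, j)   # 2./4. score all pairs
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        _, i, j = min(pairs)
        merged = clusters[i] + clusters[j]                 # 3. best scoring merge
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Single-link clustering of 1-D points into two clusters:
# hacm([1.0, 1.1, 5.0, 5.2],
#      merge_score=lambda c1, c2: min(abs(a - b) for a in c1 for b in c2),
#      until=2)
```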

Page 28: Unsupervised Methods


Types of Merge Score

• Minimal distance between the two candidates for the merge. Alternatives for cluster distance:
  – Single link: distance between the two nearest points
  – Complete link: distance between the two furthest points
  – Group average: average pairwise distance over all points
  – Centroid: distance between the two cluster centroids

• Based on the “quality” of the merged class:
  – Ward’s method: minimal increase in total within-group sum of squares (average squared distance to centroid)

• Based on a global criterion (in Brown et al., 1992: minimal reduction in average mutual information)

Page 29: Unsupervised Methods


Unsupervised Statistics and Generalizations for Classification

• Many supervised methods use cooccurrence statistics as features or probability estimates
  – eat a {peach, beach}
  – fire a missile vs. fire the prime minister

• Sparse data problem: if alternative cooccurrences never occurred, how to estimate their probabilities, or their relative “strength” as features?

Page 30: Unsupervised Methods


Application: Semantic Disambiguation

• Traditional AI-style approach: manually encoded semantic preferences/constraints, e.g. an <object – verb> preference over a hand-built hierarchy

[Diagram: Weapon → Bombs → grenade; Actions → Cause_movement → throw, drop; an <object – verb> preference links the two branches]

• Anaphora resolution (Dagan, Justeson, Lappin, Leass, Ribak 1995):

  “The terrorist pulled the grenade from his pocket and threw it at the policeman” (to what does “it” refer?)

Page 31: Unsupervised Methods


Statistical Approach

Corpus (text collection)

<verb–object: throw-grenade> 20 times

<verb–object: throw-pocket> 1 time

“Semantic” Judgment

• Semantic confidence combined with syntactic preferences: it → grenade

• “Language modeling” for disambiguation

Page 32: Unsupervised Methods


What about sense disambiguation? (for translation)

  I bought soap bars:    sense1 (‘chafisa’) or sense2 (‘sorag’) ?
  I bought window bars:  sense1 (‘chafisa’) or sense2 (‘sorag’) ?

Corpus (text collection)

Sense1:
  <noun-noun: soap-bar>       20 times
  <noun-noun: chocolate-bar>  15 times

Sense2:
  <noun-noun: window-bar>     17 times
  <noun-noun: iron-bar>       22 times

• “Hidden” senses – supervised labeling required?

Page 33: Unsupervised Methods


Solution: Mapping to Another Language

English(-English)-Hebrew Dictionary:

  bar1 → ‘chafisa’    soap → ‘sabon’
  bar2 → ‘sorag’      window → ‘chalon’

Map ambiguous constructs to the second language (all possibilities) and count them in a Hebrew corpus:

  <noun-noun: soap-bar>    1: <noun-noun: ‘chafisat-sabon’>   20 times
                           2: <noun-noun: ‘sorag-sabon’>       0 times

  <noun-noun: window-bar>  1: <noun-noun: ‘chafisat-chalon’>   0 times
                           2: <noun-noun: ‘sorag-chalon’>     15 times

• Exploiting the difference in ambiguities between the languages

• Principle – intersecting redundancies (Dagan and Itai 1994)

Page 34: Unsupervised Methods


Selection Model Highlights

• Multinomial model, under certain linguistic assumptions

• Selection “confidence” – lower bound for the odds ratio:

$$Conf(i) = \ln\frac{n_i}{n_j} - Z_{1-\alpha}\sqrt{\frac{1}{n_i}+\frac{1}{n_j}} \ \leq\ \ln\frac{p_i}{p_j}$$

• Overlapping ambiguous constructs are resolved through constraint propagation, by decreasing confidence order.

• Results (Hebrew→English): coverage ~70%, precision within coverage ~90%
  – ~20% improvement over choosing the most frequent translation (the common baseline)
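A small sketch of the confidence bound as reconstructed above (the reconstruction itself is hedged; parameter names are this sketch's own). The bound is compared against a threshold before a selection is made:

```python
import math

def selection_confidence(n_i, n_j, z=1.645):
    """Lower confidence bound for ln(p_i / p_j) from target-language counts.

    n_i, n_j: counts of the two competing alternatives (both assumed > 0);
    z = Z_{1-alpha}, e.g. 1.645 for a one-sided 95% bound.  Alternative i is
    selected only if the bound exceeds a confidence threshold.
    """
    return math.log(n_i / n_j) - z * math.sqrt(1.0 / n_i + 1.0 / n_j)

# throw-grenade seen 20 times vs. throw-pocket seen once:
# selection_confidence(20, 1)  ->  about 1.3
```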

Page 35: Unsupervised Methods


Data Sparseness and Similarity

<verb–object: ‘hidpis-tikiya’> → ?

  <verb–object: print-folder>        0 times
  <verb–object: print-file_cabinet>  0 times

• Standard approach: “back-off” to single term frequency

• Similarity-based inference:

[Diagram: ‘folder’ is similar to file, directory, record, …; ‘file_cabinet’ is similar to cupboard, closet, suitcase, …; the <verb–object: print, ·> counts observed for these similar words support the inference]

Page 36: Unsupervised Methods


Computing Distributional Similarity

[Diagram: ‘folder’ and its similar word ‘file’ share context attributes such as print, erase, open, retrieve, browse, save, …]

• Association between word u (“folder”) and its “attributes” (context words/features) is based on mutual information:

$$I(u,att) = \log_2\frac{P(att,u)}{P(att)\,P(u)} = \log_2\frac{P(att|u)}{P(att)}$$

• Similarity between u and v (weighted Jaccard, in [0,1]):

$$sim(u,v) = \frac{\sum_{att} \min\big(I(u,att),\ I(v,att)\big)}{\sum_{att} \max\big(I(u,att),\ I(v,att)\big)}$$

Page 37: Unsupervised Methods


Disambiguation Algorithm

Selection of the preferred alternative:

• Hypothesized similarity-based frequency derived from the average association for similar words (incorporating single term frequency)

• Comparing hypothesized frequencies

[Diagram as on the previous slides: the similar-word sets for ‘folder’ and ‘file_cabinet’ with their <verb–object: print, ·> associations]

Page 38: Unsupervised Methods


Computation and Evaluation

• Heuristic search used to speed up computation of the k most similar words

• Results (Hebrew→English):
  – 15% coverage increase, while decreasing precision by 2%
  – Accuracy 15% better than back-off to single word frequency

(Dagan, Marcus and Markovitch 1995)

Page 39: Unsupervised Methods


Probabilistic Framework - Smoothing

• Counts are obtained from a sample of the probability space

• Maximum Likelihood Estimate proportional to sample counts:
  – MLE estimate: 0 probability for unobserved events

• Smoothing discounts observed events, leaving probability “mass” to unobserved events:
  – discounted estimate for observed events
  – positive estimate for unobserved events

Page 40: Unsupervised Methods


Smoothing Conditional Attribute Probability

• Good-Turing smoothing scheme – discount & redistribute:

$$P(att|u) = P_d(att|u) \quad \text{if } count(u,att) > 0, \qquad norm(u) = 1 - \sum_{att:\,count(u,att)>0} P_d(att|u)$$

• Katz’s seminal back-off scheme (speech language modeling):

$$P(att|u) = norm(u)\cdot P(att) \quad \text{if } count(u,att) = 0$$

• Similarity-based smoothing (Dagan, Lee, Pereira 1999):

$$P(att|u) = norm(u)\cdot P_{SIM}(att|u) \quad \text{if } count(u,att) = 0$$

$$\text{where } P_{SIM}(att|u) = \sum_{u' \in SIM(u)} \frac{f(u,u')}{\sum_{u'' \in SIM(u)} f(u,u'')}\, P(att|u')$$
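A schematic Python sketch of the similarity-based case (all helper callables here are assumptions of the sketch, standing in for the discounted estimates, the similar-word list SIM(u), and the weighting function f):

```python
def smoothed_prob(att, u, counts, p_discounted, p_cond, norm, sim_neighbors, f):
    """Similarity-based smoothing of P(att|u), following the scheme above.

    counts[(u, att)]      observed count
    p_discounted(att, u)  discounted (Good-Turing) estimate for observed events
    p_cond(att, u2)       P(att|u2) for a similar word u2
    norm(u)               probability mass left for unobserved events
    sim_neighbors(u)      the most similar words SIM(u)
    f(u, u2)              similarity weight, e.g. exp(-beta * A(u, u2))
    """
    if counts.get((u, att), 0) > 0:
        return p_discounted(att, u)                     # observed: discounted estimate
    neighbors = sim_neighbors(u)
    total_w = sum(f(u, u2) for u2 in neighbors)
    # P_SIM(att|u): similarity-weighted average of the neighbors' estimates
    p_sim = sum(f(u, u2) / total_w * p_cond(att, u2) for u2 in neighbors)
    return norm(u) * p_sim                              # unobserved: redistributed mass
```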

Page 41: Unsupervised Methods


Similarity/Distance Functions for Probability Distributions

• L1 norm:

$$L_1(u,v) = \sum_{att} \big|P(att|u) - P(att|v)\big|, \qquad f_{SIM,L}(u,v) = 2 - L_1(u,v)$$

• Jensen-Shannon divergence (KL-distance to the average):

$$A(u,v) = D_{KL}\big(P_u \,\|\, \bar{P}\big) + D_{KL}\big(P_v \,\|\, \bar{P}\big), \qquad \bar{P}(att) = \tfrac{1}{2}\big(P(att|u) + P(att|v)\big)$$

Information loss by approximating u and v by their average

$$f_{SIM,A}(u,v) = \exp\big(-\beta\, A(u,v)\big)$$

β controls the relative influence of close vs. remote neighbors
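For completeness, a small sketch of the two distance functions and the resulting neighbor weight over P(att|·) dictionaries (names are this sketch's own):

```python
import math

def l1_distance(p_u, p_v):
    """L1 norm between two attribute distributions (dict att -> P(att|word))."""
    atts = p_u.keys() | p_v.keys()
    return sum(abs(p_u.get(a, 0.0) - p_v.get(a, 0.0)) for a in atts)

def divergence_to_average(p_u, p_v):
    """A(u,v): KL-distance of each distribution to their average."""
    a = 0.0
    for att in p_u.keys() | p_v.keys():
        pu, pv = p_u.get(att, 0.0), p_v.get(att, 0.0)
        avg = 0.5 * (pu + pv)
        if pu > 0:
            a += pu * math.log2(pu / avg)
        if pv > 0:
            a += pv * math.log2(pv / avg)
    return a

def neighbor_weight(p_u, p_v, beta=1.0):
    """f_SIM,A(u,v) = exp(-beta * A(u,v)); larger beta favors the closest neighbors."""
    return math.exp(-beta * divergence_to_average(p_u, p_v))
```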

Page 42: Unsupervised Methods


Sample Results

• Several smoothing experiments (A performed best):
  – Language modeling for speech (hunt bears / pears)
  – Perplexity (predicting test corpus likelihood)
  – Data recovery task (similar to sense disambiguation)
  – Insensitive to the exact value of β

• Most similar words to “guy”:

  Measure   Closest Words
  A         guy kid thing lot man mother doctor friend boy son
  L         guy kid lot thing man doctor girl rest son bit
  PC        role people fire guy man year lot today way part

Typical common verb contexts: see get give tell take …

PC : an earlier attempt for similarity-based smoothing

Page 43: Unsupervised Methods


Class-Based Generalization

• Obtain a cooccurrence-based clustering of words and model a word cooccurrence by word-class or class-class cooccurrence

• Brown et al., CL 1992: Mutual information clustering; class-based model interpolated to n-gram model

• Pereira, Tishby, Lee, ACL 1993: soft, top-down distributional clustering for bigram modeling

• Similarity/class-based methods: general effectiveness yet to be shown

Page 44: Unsupervised Methods


Conclusions

• (Relatively) simple models cover a wide range of applications

• Usefulness in (hybrid) systems: automatic processing and knowledge acquisition

Page 45: Unsupervised Methods


Discussion