1
CS 430: Information Discovery
Lecture 25
Cluster Analysis 2
Thesaurus Construction
2
Course Administration
Next week
• No lecture on Tuesday
Assignments
• Assignment 3 grades have been returned
Final examination
• Friday, Dec 14, 9:00 - 10:30 a.m.
• There will not be an early examination
3
Similarities: Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
      alpha  bravo  charlie  delta  echo  foxtrot  golf
D1      1      1       1       1     1       1      1
D2      1      0       0       1     0       0      1
D3      0      1       1       0     1       1      0
D4      1      0       0       1     0       1      1
n       3      2       2       3     2       3      3
4
Term similarity matrix
          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha       -     0.2     0.2     0.5   0.2    0.33    0.5
bravo      0.2     -      0.5     0.2   0.5    0.4     0.2
charlie    0.2    0.5      -      0.2   0.5    0.4     0.2
delta      0.5    0.2     0.2      -    0.2    0.33    0.5
echo       0.2    0.5     0.5     0.2    -     0.4     0.2
foxtrot    0.33   0.4     0.4     0.33  0.4     -      0.33
golf       0.5    0.2     0.2     0.5   0.2    0.33     -
Using the incidence matrix and Dice weighting
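The matrix values can be reproduced from the four example documents. The following is a minimal sketch, not the lecture's own code. It assumes the weighting is (number of shared documents) / (n_i + n_j), because that is what matches the figures on the slide; this is half of the conventional Dice coefficient 2c / (n_i + n_j). The names `postings` and `term_similarity` are illustrative.

```python
from itertools import combinations

# The four example documents from the slide.
docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# Incidence: for each term, the set of documents that contain it.
postings = {}
for name, text in docs.items():
    for term in set(text.split()):
        postings.setdefault(term, set()).add(name)


def term_similarity(a, b):
    """Shared documents divided by n_a + n_b (assumed weighting, see lead-in)."""
    shared = len(postings[a] & postings[b])
    return shared / (len(postings[a]) + len(postings[b]))


terms = sorted(postings)
sim = {(a, b): term_similarity(a, b) for a, b in combinations(terms, 2)}

for (a, b), s in sim.items():
    print(f"{a:8s} {b:8s} {s:.2f}")
```

Term frequencies within a document are ignored here: only presence or absence counts, as in the incidence array.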
5
Example -- single link
[Dendrogram figure. Leaves, left to right: alpha, delta, golf, bravo, echo, charlie, foxtrot. Merge steps are numbered 1 to 6.]
This style of diagram is called a dendrogram.
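Single-link clustering of the example terms can be sketched as follows. This is a naive illustration under stated assumptions, not the lecture's implementation: it recomputes the term similarities with the weighting shared / (n_i + n_j) that matches the slide values, and it breaks ties by taking the first best pair found. The function name `single_link` is illustrative.

```python
from itertools import combinations

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

postings = {}
for name, text in docs.items():
    for term in set(text.split()):
        postings.setdefault(term, set()).add(name)


def sim(a, b):
    """Term similarity: shared documents / (n_a + n_b), assumed weighting."""
    return len(postings[a] & postings[b]) / (len(postings[a]) + len(postings[b]))


def single_link(terms):
    """Naive agglomerative clustering; returns merges as (similarity, cluster)."""
    clusters = [frozenset([t]) for t in terms]
    merges = []
    while len(clusters) > 1:
        # Single link: similarity of two clusters is their best term-to-term similarity.
        best, x, y = max(
            ((max(sim(a, b) for a in c1 for b in c2), c1, c2)
             for c1, c2 in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        clusters = [c for c in clusters if c not in (x, y)] + [x | y]
        merges.append((best, x | y))
    return merges


merges = single_link(sorted(postings))
for level, cluster in merges:
    print(f"{level:.2f}  {sorted(cluster)}")
```

On the example data the six merges occur at similarities 0.5 (four times), 0.4, and 0.33, giving the groups {alpha, delta, golf} and {bravo, charlie, echo}, with foxtrot joining the second group before the final merge.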
6
Example 2: Concept Spaces for Scientific Terms
Large-scale searches can only match terms specified by the user to terms appearing in documents. Cluster analysis can be used to provide information retrieval by concepts, rather than by terms.
Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin, Ann P. Bishop (University of Illinois), Hsinchun Chen (University of Arizona), Federating Diverse Collections of Scientific Literature, IEEE Computer, May 1996.
7
Methodology
Approach: Use cluster analysis to generate "concept spaces" automatically, i.e., clusters of terms that embrace a single semantic concept.
Data set 1: All terms in 400,000 records from INSPEC, containing 270,000 terms with 4,000,000 links.
[24.5 hours of CPU on 16-node Silicon Graphics supercomputer.]
Data set 2: 4,000,000 abstracts from the Compendex database covering all of engineering as the collection, partitioned along classification code lines into some 600 community repositories.
[Four days of CPU on 64-processor Convex Exemplar.]
8
Concept Space
A concept space is a similarity matrix based on co-occurrence of terms.
In the largest experiment, 10,000,000 abstracts were divided into sets of 100,000, and the concept space for each set was generated separately. The sets were selected by an existing classification scheme.
9
Objectives
• Semantic retrieval (using concept spaces for term suggestion)
• Semantic interoperability (vocabulary switching across subject domains)
• Semantic indexing (concept identification of document content)
• Information representation (information units for uniform manipulation)
10
Use of Concept Space: Term Suggestion
11
Future Use of Concept Space: Vocabulary Switching
"I'm a civil engineer who designs bridges. I'm interested in using fluid dynamics to compute the structural effects of wind currents on long structures. Ocean engineers who design undersea cables probably do similar computations for the structural effects of water currents on long structures. I want you [the system] to change my civil engineering fluid dynamics terms into the ocean engineering terms and search the undersea cable literature."
12
Visual thesaurus for browsing large collections of geographic images
Methodology:
• Divide images into small regions.
• Create a similarity measure based on properties of these images.
• Use cluster analysis tools to generate clusters of similar images.
• Provide alternative representations of clusters.
Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of Visual Thesauri for Browsing Large Collections of Geographic Images, May 1997. (http://ai.bpa.arizona.edu/~mramsey/papers/visualThesaurus/visualThesaurus.html)
13
14
Self-Organizing Maps (SOM)
15
Decisions in creating a thesaurus
1. Which terms should be included in the thesaurus?
2. How should the terms be grouped?
16
Terms to include
• Only terms that are likely to be of interest for content identification
• Ambiguous terms should be coded for the senses likely to be important in the document collection
• Each thesaurus class should have approximately the same frequency of occurrence
• Terms of negative discrimination should be eliminated
after Salton and McGill
17
Discriminant value
Discriminant value is the degree to which a term is able to discriminate between the documents of a collection.
Discriminant value of term k = (average document similarity without term k) - (average document similarity with term k)
Good discriminators decrease the average document similarity, so their discriminant value is positive.
Note that this definition uses the document similarity.
18
Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
      alpha  bravo  charlie  delta  echo  foxtrot  golf  total
D1      1      1       1       1     1       1      1      7
D2      1      0       0       1     0       0      1      3
D3      0      1       1       0     1       1      0      4
D4      1      0       0       1     0       1      1      4
19
Document similarity matrix
       D1     D2     D3     D4
D1      -    0.65   0.76   0.76
D2    0.65     -    0.00   0.87
D3    0.76   0.00     -    0.25
D4    0.76   0.87   0.25     -
Average similarity = 0.55
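The slide does not name the similarity measure. The values are consistent with cosine similarity over binary incidence vectors, so the sketch below assumes that measure. The names `vectors` and `cosine` are illustrative.

```python
from itertools import combinations
from math import sqrt

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# Binary incidence: each document is the set of distinct terms it contains.
vectors = {name: set(text.split()) for name, text in docs.items()}


def cosine(a, b):
    """Cosine similarity of two binary term vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


pairs = {
    (x, y): cosine(vectors[x], vectors[y])
    for x, y in combinations(sorted(vectors), 2)
}
average = sum(pairs.values()) / len(pairs)

for (x, y), s in pairs.items():
    print(f"{x} {y} {s:.2f}")
print(f"Average similarity = {average:.2f}")
```

The average is taken over the six distinct document pairs, excluding the diagonal.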
20
Discriminant value
Average similarity = 0.55
Term removed   Average similarity without term   DV
alpha                      0.53                 -0.02
bravo                      0.56                 +0.01
charlie                    0.56                 +0.01
delta                      0.53                 -0.02
echo                       0.56                 +0.01
foxtrot                    0.52                 -0.03
golf                       0.53                 -0.02
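The table can be recomputed by removing each term in turn and averaging the pairwise document similarities again. This is a minimal sketch that assumes cosine similarity over binary incidence vectors, which matches the averages shown; the function names are illustrative. Discriminant values computed from unrounded averages may differ in the last digit from the table, which appears to subtract the rounded figures.

```python
from itertools import combinations
from math import sqrt

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}
vectors = {name: set(text.split()) for name, text in docs.items()}


def cosine(a, b):
    """Cosine similarity of two binary term vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def average_similarity(vectors, removed=None):
    """Average pairwise similarity, optionally with one term removed everywhere."""
    reduced = [v - {removed} for v in vectors.values()]
    sims = [cosine(a, b) for a, b in combinations(reduced, 2)]
    return sum(sims) / len(sims)


baseline = average_similarity(vectors)
terms = sorted(set().union(*vectors.values()))
without = {t: average_similarity(vectors, removed=t) for t in terms}
# DV = (average similarity without term) - (average similarity with term).
dv = {t: without[t] - baseline for t in terms}

for t in terms:
    print(f"{t:8s} {without[t]:.2f} {dv[t]:+.2f}")
```

A positive value means that removing the term makes the documents look more alike, so the term was helping to tell them apart.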
21
Phrase construction
In a thesaurus, term classes may contain phrases.
Informal definitions:
pair-frequency (i, j) is the frequency with which a pair of words occurs in context (e.g., in succession within a sentence)
phrase is a pair of words, i and j, that occur in context with a higher frequency than would be expected from their overall frequencies
cohesion (i, j) = pair-frequency (i, j) / (frequency(i) * frequency(j))
22
Phrase construction
Salton and McGill algorithm
1. Compute pair-frequency for all pairs of terms.
2. Reject all pairs whose pair-frequency falls below a certain threshold.
3. Calculate cohesion values.
4. If cohesion is above a threshold value, consider the word pair a phrase.
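The four steps can be sketched as follows. This is an illustration under stated assumptions, not Salton and McGill's published code: "in context" is taken to mean two words in immediate succession within a sentence, cohesion is pair-frequency divided by the product of the word frequencies, and the function name, both thresholds, and the toy sentences are hypothetical.

```python
from collections import Counter


def find_phrases(sentences, min_pair_freq=2, min_cohesion=0.1):
    """Return {(word_i, word_j): cohesion} for pairs that pass both thresholds."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        word_freq.update(words)
        # Step 1: pair-frequency, with "context" assumed to mean adjacency.
        pair_freq.update(zip(words, words[1:]))

    phrases = {}
    for (i, j), pf in pair_freq.items():
        # Step 2: reject pairs that occur too rarely.
        if pf < min_pair_freq:
            continue
        # Step 3: cohesion relative to the overall word frequencies.
        cohesion = pf / (word_freq[i] * word_freq[j])
        # Step 4: keep the pair as a phrase if cohesion is high enough.
        if cohesion > min_cohesion:
            phrases[(i, j)] = cohesion
    return phrases


# Hypothetical toy data, made up for this illustration.
sentences = [
    "information retrieval uses cluster analysis",
    "cluster analysis groups similar terms",
    "the retrieval of information uses terms",
]
phrases = find_phrases(sentences)
print(phrases)
```

With these toy sentences only the pair (cluster, analysis) occurs twice in succession, so it is the only candidate to reach the cohesion step.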
Automatic phrase construction by statistical methods is rarely used in practice. There is promising research on phrase identification using methods of computational linguistics.