Automatic Term Weighting, Lexical Statistics and …… Quantitative Terminology
Kyo Kageura
National Institute of Informatics
July 05, 2003

Project

To rescue/recover the sphere of lexicology
To release the richness and productivity of lexico-conceptual sets from the dominance of discourse ……
while maintaining a traceable procedure in the process of doing this, and starting from textual corpora

Contents

Sphere of Texts and Sphere of Lexicon/ology
Three (representative) methods of automatic term weighting and their meanings
From corpus-based lexical statistics to (still) corpus-based quantitative lexicology
Measuring lexical productivity in lexicon (i.e. lexicological concept of productivity) from textual data, with some experiments

Conclusions

Textual Sphere and Lexicological Sphere

[Figure: diagram of the Textual Sphere and the Lexicological Sphere. Labels: terms, complex terms, lexicon, lexicology, quantitative lexicology, “This exists”.]

So what about talking about lexicology when talking about corpus-based…

Lexicological Sphere and Texts

Lexicology deals with the actual set of words
  which does not mean it is natural history
A lexicological model with expectations
  addresses the “realistic possibility of existence,” not permissible forms or a fantasy land
  thus actual data is required
  primary language data is texts
Thus the recovery of lexicological characteristics becomes the task of lexicology

Automatic Term Weighting (ATW)

Reviewing some representative ATW methods gives important insights into the current topic, while at the same time giving insights into ATW itself

We look at
  Tfidf (its info-theoretic interpretation by Aizawa)
  Term representativeness (by Hisamitsu)
  Lexical measure (by Nakagawa), which goes from texts to lexicology, almost.

ATW1: tfidf

        d1   d2   …   dD
t1
…            fij
tT

(rows: terms t1 … tT; columns: documents d1 … dD; cell fij: frequency of term ti in document dj)

Tfidf and many similar measures (in fact most of those used in IR) are based on the document-term matrix, which has a formal duality between terms and documents.

Thus the weight of a term is always and only meaningful vis-à-vis the given set of documents or its population (Dfitf thus also makes sense, as in the probabilistic model).
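The duality of the document-term matrix can be illustrated with a minimal sketch. It assumes the common weighting f_ij × log(D / df_i); the source does not say which tfidf variant is meant, and the function names and the toy matrix are hypothetical. Applying the same function to the transposed matrix gives the "dfitf" weighting of documents against terms.

```python
import math

def tfidf(matrix):
    """matrix[i][j] = f_ij, the frequency of row item i in column item j.

    Returns w_ij = f_ij * log(n_columns / df_i), where df_i is the number
    of columns in which row item i occurs (one common tf-idf variant,
    assumed here for illustration).
    """
    n_cols = len(matrix[0])
    weights = []
    for row in matrix:
        df = sum(1 for f in row if f > 0)
        idf = math.log(n_cols / df) if df else 0.0
        weights.append([f * idf for f in row])
    return weights

def transpose(matrix):
    """Swap the roles of terms and documents."""
    return [list(col) for col in zip(*matrix)]

# Toy data (hypothetical): 3 terms (rows) x 4 documents (columns).
F = [
    [2, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 3, 0],
]

term_weights = tfidf(F)            # tf-idf: terms weighted against documents
doc_weights = tfidf(transpose(F))  # "df-itf": documents weighted against terms
```

In both directions the weight is defined only relative to the other axis of the matrix: a term occurring in every document, or a document containing every term, receives weight zero.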

ATW2: Term representativeness

You shall know the meaning of a word by the company it keeps (or see a person's friends to know the person … if there are any, anyway)

To calculate the weight of a term ti, take the distribution of words that accompany ti within a certain window size and calculate the distance between this and the distribution of a random chunk of the same size (NB: size normalisation is necessary due to the LNRE nature of language data).
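The procedure just described can be sketched roughly as follows. This is not Hisamitsu's actual formulation: the Jensen-Shannon divergence is swapped in as the distance, the "random chunk" is approximated by drawing tokens at random with replacement, and size normalisation is handled only by matching the number of tokens. All function names and parameters are hypothetical.

```python
import math
import random
from collections import Counter

def context_tokens(tokens, term, window):
    """Collect the tokens within `window` positions of each occurrence of `term`."""
    out = []
    for pos, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            out.extend(tokens[lo:pos] + tokens[pos + 1:hi])
    return out

def js_divergence(a, b):
    """Jensen-Shannon divergence (base 2) between two non-empty token lists."""
    ca, cb = Counter(a), Counter(b)
    na, nb = len(a), len(b)
    total = 0.0
    for w in set(ca) | set(cb):
        p, q = ca[w] / na, cb[w] / nb
        m = (p + q) / 2
        if p:
            total += 0.5 * p * math.log2(p / m)
        if q:
            total += 0.5 * q * math.log2(q / m)
    return total

def representativeness(tokens, term, window=2, draws=20, seed=0):
    """Mean distance between the context of `term` and random samples of equal size."""
    ctx = context_tokens(tokens, term, window)
    if not ctx:
        return 0.0
    rng = random.Random(seed)
    dists = [js_divergence(ctx, rng.choices(tokens, k=len(ctx)))
             for _ in range(draws)]
    return sum(dists) / draws
```

Because the random sample always has the same number of tokens as the observed context, a term is not rewarded merely for occurring often.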

ATW2: Term representativeness

This method discards the factor of dominant discourse or minor discourse at the level of observed texts (or does not favour people who randomly buy friends with money).

This method calculates the characteristic that the term ti, if it appears at all, can attract at the level of discourse (depending on the nature of the window the method takes, of course).

ATW3: Nakagawa’s method

Observe the number of different elements (element types) that accompany ti within the complex lexical units in texts.

This reflects, therefore, a nature of lexical productivity of the focal element ti, but together with the degree of its use in discourse (texts)
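The counting step can be sketched as below. Only the observation of distinct adjacent element types is shown; how Nakagawa's method combines these counts into a term score is not reproduced. The function name and the toy occurrences are hypothetical.

```python
from collections import defaultdict

def neighbour_types(complex_terms):
    """For each element, collect the distinct elements adjacent to it
    on the left and on the right inside complex lexical units."""
    left, right = defaultdict(set), defaultdict(set)
    for term in complex_terms:           # one entry per occurrence in the text
        for a, b in zip(term, term[1:]):
            right[a].add(b)              # b follows a
            left[b].add(a)               # a precedes b
    return left, right

# Toy occurrences (hypothetical), each a tuple of constituent elements.
occurrences = [
    ("expert", "system"),
    ("knowledge", "base", "system"),
    ("expert", "system"),
    ("system", "model"),
]
left, right = neighbour_types(occurrences)
n_left = len(left["system"])    # distinct element types preceding "system"
n_right = len(right["system"])  # distinct element types following "system"
```

A repeated occurrence such as the second "expert system" adds no new type, but a longer text offers more chances to meet new ones, which is why the count also reflects the degree of use in discourse.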

ATW to Quantitative Lexicology

To characterise the lexicological nature of elements from their occurrence in texts:
  As in Hisamitsu's term representativeness method, the “discourse size” factor should be reduced, and more essentially so;
  As in Nakagawa’s method, the point of observation should be limited to complex terms (or those which are supposed to be registered or can be registered to the lexicon/lexicological sphere).

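A Quantitative Terminological Study

Aim: To recover the productivity of constituent elements of simplex and complex terms as head.

Observe, like Nakagawa, the window range of simplex and complex terms in texts, e.g.

＜理論 /物理＞と＜教育 /心理＞は＜観察＞できる＜範囲＞では似通っており、＜計算 /機 /科学＞は＜理論 /物理＞より高い＜値＞で＜推移＞している。
(English gloss: ＜theoretical / physics＞ and ＜educational / psychology＞ are similar within the ＜range＞ that can be ＜observed＞, while ＜compute / machine / science＞, i.e. computer science, ＜moves＞ at higher ＜values＞ than ＜theoretical / physics＞.)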

Some preconditions/assumptions

Corpus and the target terminological space should:
  belong to and represent the same domain
  cover the same period of time
  in general match qualitatively

We are concerned with defining a measure which can compare the “productivity” of elements in the same lexicological/terminological sphere.

Definition of measures (a)

f(i,N): frequency of ti in the text of size N
  This is the extent of use in discourse, nothing to do with lexicological productivity
d(i,N): number of different complex words whose head is ti in the text of size N
  the first manifestation of lexicological productivity
  basically identical to Nakagawa (2000)
  thus this is the point of departure
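As a rough illustration, the two measures can be counted from text in which terms are marked with ＜…＞ and constituent elements are separated by "/", as in the example sentence of the study. Two assumptions are made here that the source does not state: the head is taken to be the rightmost element, and f counts occurrences of the element as head of a simplex or complex term. The function name is hypothetical.

```python
import re
from collections import Counter, defaultdict

TERM = re.compile(r"＜(.*?)＞")

def measures(text):
    """Return (f, d): f[h] = occurrences of h as head of a marked term,
    d[h] = number of distinct complex terms whose head is h."""
    f = Counter()
    complex_types = defaultdict(set)
    for marked in TERM.findall(text):
        elements = tuple(e.strip() for e in marked.split("/"))
        head = elements[-1]              # assumption: rightmost element is the head
        f[head] += 1
        if len(elements) > 1:            # only complex terms contribute to d
            complex_types[head].add(elements)
    d = {head: len(types) for head, types in complex_types.items()}
    return f, d

text = ("＜理論 /物理＞と＜教育 /心理＞は＜観察＞できる＜範囲＞では似通っており、"
        "＜計算 /機 /科学＞は＜理論 /物理＞より高い＜値＞で＜推移＞している。")
f, d = measures(text)
```

Here 物理 occurs twice as head but only in one distinct complex term, so f grows with use in discourse while d grows only when a new complex term appears.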

Definition of measures (b)

d(i,N) means the manifestation of the productivity of ti as it occurs in the corpus

d(i,N) is sensitive to the extent of use of the focal element in the textual corpus, e.g. the following can be the case…

          X=N    X=2N
d(i,X)    500    600
d(j,X)    400    800

Definition of measures (c)

Better measure for manifested productivity

d(i,λN): the overall transition pattern of d(i,λN), where λ takes a positive real value (à la Hisamitsu).

The measure for potential productivity

d(i) = lim_{λ→∞} d(i,λN): discards all quantitative factors

Can be computed by LNRE models
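Part of the transition pattern can be sketched without fitting any model. For 0 < λ ≤ 1, binomial interpolation gives the expected number of distinct types in a sample of size λN from the frequencies observed at size N. Whether the study computed this range in this way is an assumption; extrapolation to λ > 1 and the limit λ → ∞ require a fitted LNRE model and are not attempted here. The function name and the toy frequencies are hypothetical.

```python
def expected_types(type_freqs, lam):
    """Expected number of distinct types at sample size lam * N.

    type_freqs: frequency in the full corpus (size N) of each distinct
    complex term headed by the focal element.
    A type with frequency k is missed with probability (1 - lam) ** k.
    """
    if not 0 < lam <= 1:
        raise ValueError("binomial interpolation only covers 0 < lam <= 1")
    return sum(1 - (1 - lam) ** k for k in type_freqs)

# Toy frequencies (hypothetical): six distinct complex terms with one head.
freqs = [5, 3, 1, 1, 1, 1]
curve = [(lam, expected_types(freqs, lam)) for lam in (0.25, 0.5, 0.75, 1.0)]
```

At λ = 1 the function returns the observed d(i,N); the shape of the curve below that point is what distinguishes an element that is still producing new types from one that has nearly stopped.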

The measures and prob. distributions

Three distributions

1) The occurrence probability of heads in theoretical lexicological space.

2) The occurrence probability of modifiers for each head.

3) The probability of use of the head in the text.

Relations…

f(i,N) ⇔ 3)
d(i) ⇔ 1)
d(i,N) ⇔ 2),3)

Experiments (1/5)

Artificial intelligence abstracts in Japanese

4 elements are observed: “system” and “model” (general), and “knowledge” and “information” (specific)

#Abstracts   #Tokens (simplex / complex)   #Types (simplex / complex)
1816         299846 / 230708               8764 / 23243

Experiments (2/5)

              f(i,N)   f_simplex(i,N)   f_complex(i,N)   d(i,N)
system        1970     723              1247             502
model         1015     328              687              263
knowledge     1191     748              443              137
information   637      369              268              155

Experiments (3/5)

Experiments (4/5)

              LNRE model   p-value   MSE    d(i)
system        GIGP         0.96      2.19   273,402,688,337
model         IGP          0.47      2.88   3,676,671,255
knowledge     LogN         0.88      2.72   689
information   IGP          0.84      2.32   667

Experiments (5/5)

(S = system, M = model, K = knowledge, I = information)

f(i,N)     S > K > M > I
d(i,λN)    S > M > I > K
d(i)       S > M > K > I

General elements, such as “system” or “model,” have high lexicological productivity, while subject-specific elements, such as “knowledge” or “information,” have rather low productivity.
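The three orderings can be checked against the figures reported in the experiment tables with a few lines of code. One caveat: the d(i,λN) ordering in the source refers to the overall transition pattern, and only the single point λ = 1, i.e. d(i,N), is available here as a stand-in. The dictionary keys are hypothetical names.

```python
# Values copied from the experiment tables.
data = {
    "system":      {"f": 1970, "d_N": 502, "d": 273_402_688_337},
    "model":       {"f": 1015, "d_N": 263, "d": 3_676_671_255},
    "knowledge":   {"f": 1191, "d_N": 137, "d": 689},
    "information": {"f": 637,  "d_N": 155, "d": 667},
}

def ranking(key):
    """Elements sorted from highest to lowest value of the given measure."""
    ordered = sorted(data.items(), key=lambda kv: kv[1][key], reverse=True)
    return [name for name, _ in ordered]

by_use = ranking("f")          # extent of use in discourse
by_manifest = ranking("d_N")   # manifested productivity at lambda = 1
by_potential = ranking("d")    # potential productivity from the LNRE fit
```

The specific elements “knowledge” and “information” rank differently on each measure, whereas “system” stays first throughout.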

Summary

Starting from the observation of ATW methods and going into examining corpus-based quantitative terminological study, we
  clarified the position of lexicology/lexicon
  clarified the basic framework of quantitative lexicology/terminology, with relevant measures
  gave some corresponding distributions
  gave the framework of interpretation to measures
  carried out experiments …

Remaining problems

Concepts of “lexicologisation” and “word”
  To be registered to the lexicon
  To be consolidated as a lexical unit within the syntagmatic stream of language manifestations
Distribution of complex words in texts and word unit
  “reference+head” vs. “modifier+head”

The former is related to an essential concept(ualisation) of lexicon/lexicology…
