Automatic Term Weighting, Lexical Statistics and …… Quantitative Terminology
Kyo Kageura
National Institute of Informatics
July 05, 2003
Project
To rescue/recover the sphere of lexicology
To release the richness and productivity of lexico-conceptual sets from the dominance of discourse ……
while maintaining a traceable procedure in the process of doing this, and starting from textual corpora
Contents
Sphere of Texts and Sphere of Lexicon/ology
Three (representative) methods of automatic term weighting and their meanings
From corpus-based lexical statistics to (still) corpus-based quantitative lexicology
Measuring lexical productivity in the lexicon (i.e. the lexicological concept of productivity) from textual data, with some experiments
Conclusions
Textual Sphere and Lexicological Sphere
[Diagram: the Textual Sphere and the Lexicological Sphere. Labels: terms, complex terms, lexicon, lexicology, quantitative lexicology, “This exists”.]
So what about talking about lexicology when talking about corpus-based…
Lexicological Sphere and Texts
Lexicology deals with the actual set of words, which does not mean it is natural history
A lexicological model with expectations addresses the “realistic possibility of existence,” not permissible forms or fantasy land
thus actual data is required
the primary language data is texts
Thus the recovery of lexicological characteristics becomes the task of lexicology
Automatic Term Weighting (ATW)
Reviewing some representative ATW methods gives important insights into the current topic, while at the same time giving insights into ATW itself
We look at
Tfidf (its info-theoretic interpretation by Aizawa)
Term representativeness (by Hisamitsu)
Lexical measure (by Nakagawa), which goes from texts to lexicology, almost.
ATW1: tfidf
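[Document-term matrix: columns are the documents d1, d2, …, dD; rows are the terms t1, …, tT; the cell for term ti and document dj holds the frequency fij.]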
Tfidf and many other similar measures, in fact most of those used in IR, are based on the document-term matrix, which has a formal duality.
Thus the weight of a term is always and only meaningful vis-à-vis the given set of documents or its population (Dfitf thus makes sense, as in the probabilistic model).
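The duality of the document-term matrix can be illustrated with a minimal sketch. It assumes the common weighting f × log(D / df); the slides do not say which tfidf variant is meant, and the function names and the toy matrix are hypothetical. Applying the same function to the transposed matrix weights documents against the set of terms, which is one plausible reading of "Dfitf".

```python
import math

def tfidf(matrix):
    """Weight each cell by f * log(n_columns / df).

    matrix[i][j] is the frequency of row item i in column item j;
    df is the number of columns in which row item i occurs.
    (Assumed variant; the slides do not fix a formula.)
    """
    n_cols = len(matrix[0])
    weights = []
    for row in matrix:
        df = sum(1 for f in row if f > 0)
        idf = math.log(n_cols / df) if df else 0.0
        weights.append([f * idf for f in row])
    return weights

def transpose(matrix):
    """Swap rows and columns (terms <-> documents)."""
    return [list(col) for col in zip(*matrix)]

# Hypothetical toy matrix: 3 terms (rows) x 3 documents (columns).
F = [
    [3, 0, 0],
    [1, 1, 1],
    [0, 2, 4],
]

term_weights = tfidf(F)            # terms weighted against documents
doc_weights = tfidf(transpose(F))  # the dual: documents weighted against terms
```

A term that occurs in every document gets weight zero everywhere, which shows that the weight says nothing about the term outside this particular document set.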
ATW2: Term representativeness
You shall know the meaning of a word by the company it keeps (or see friends to know a person … if there is any, anyway)
To calculate the weight of a term ti, take the distribution of words that accompany ti within a certain window size and calculate the distance between this and the distribution of a random chunk of the same size (NB: size normalisation is necessary due to the LNRE nature of language data).
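The procedure can be sketched roughly as follows. This is a simplified illustration, not Hisamitsu's actual formulation: it assumes KL divergence as the distance, and it normalises for size by dividing by the average distance of randomly drawn chunks of the same size. All function names and parameters are hypothetical.

```python
import math
import random
from collections import Counter

def context_words(tokens, term, window):
    """Collect the words within `window` positions of each occurrence of `term`."""
    bag = []
    for k, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, k - window)
            bag.extend(tokens[lo:k] + tokens[k + 1:k + 1 + window])
    return bag

def kl_distance(sample, background):
    """KL divergence of the sample word distribution from the background one.

    Every word in `sample` must also occur in `background`.
    """
    s, b = Counter(sample), Counter(background)
    n_s, n_b = len(sample), len(background)
    return sum((c / n_s) * math.log((c / n_s) / (b[w] / n_b)) for w, c in s.items())

def representativeness(tokens, term, window=2, trials=200, seed=0):
    """Distance of the term's context from the corpus, relative to random chunks."""
    rng = random.Random(seed)
    bag = context_words(tokens, term, window)
    if not bag:
        return 0.0
    observed = kl_distance(bag, tokens)
    # Size normalisation (assumed form): compare against random chunks of equal size.
    baseline = sum(
        kl_distance(rng.choices(tokens, k=len(bag)), tokens) for _ in range(trials)
    ) / trials
    return observed / baseline if baseline > 0 else float("inf")
```

A value well above 1 would suggest that the term attracts a context noticeably different from what a random chunk of the same size looks like.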
ATW2: Term representativeness
This method discards the factor of dominant discourse or minor discourse at the level of observed texts (or does not do favor to people who randomly buy friends by money).
This method calculates the characteristic that the term ti, if it appears at all, can attract at the level of discourse (depending on the nature of the window the method takes, of course).
ATW3: Nakagawa’s method
Observe the number of different elements (element types) that accompany ti within the complex lexical units in texts.
This reflects, therefore, the lexical productivity of the focal element ti, but together with the degree of its use in discourse (texts).
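The counting step can be sketched as below. It only counts the distinct element types adjacent to each element inside complex units; how Nakagawa's measure combines these counts into a score is not given on the slide and is not attempted here. The function name and the sample terms are hypothetical.

```python
from collections import defaultdict

def accompanying_types(complex_terms):
    """For each element, count the distinct element types seen immediately
    to its left and to its right inside complex terms.

    complex_terms: iterable of tuples of elements, one tuple per occurrence.
    Returns {element: (n_left_types, n_right_types)}.
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for term in complex_terms:
        for a, b in zip(term, term[1:]):
            right[a].add(b)
            left[b].add(a)
    return {e: (len(left[e]), len(right[e])) for e in set(left) | set(right)}

# Hypothetical occurrences of complex terms in a text.
occurrences = [
    ("information", "system"),
    ("expert", "system"),
    ("system", "model"),
    ("information", "system"),  # repeated use adds no new type
]
counts = accompanying_types(occurrences)
```

Because only types are counted, the repeated occurrence of the same complex term does not raise the count, but an element that is used more often in texts still has more chances to show new partners.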
ATW to Quantitative Lexicology
To characterise the lexicological nature of elements from their occurrence in texts:
As in Hisamitsu’s method of term representativeness, the “discourse size” factor should be reduced, and in a more essential way;
As in Nakagawa’s method, the point of observation should be limited to complex terms (or those which are supposed to be registered or can be registered to the lexicon/lexicological sphere).
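A Quantitative Terminological Study
Aim: To recover the productivity of constituent elements of simplex and complex terms as head.
Observe, like Nakagawa, the window range of simplex and complex terms in texts, e.g.
<理論/物理>と<教育/心理>は<観察>できる<範囲>では似通っており、<計算/機/科学>は<理論/物理>より高い<値>で<推移>している。
(English gloss: <theory/physics> and <education/psychology> are similar within the <range> that can be <observed>, and <computation/machine/science> <moves> at a higher <value> than <theory/physics>.)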
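A small sketch of how such marked-up text could be turned into observation units. It assumes the markup shown in the example, with terms delimited by `<` and `>` and elements separated by `/`, and it assumes the head is the last element, as in head-final Japanese compounds. The function name is hypothetical.

```python
import re

def extract_terms(marked_text):
    """Return each <...> span as a tuple of its '/'-separated elements."""
    terms = []
    for body in re.findall(r"<([^<>]+)>", marked_text):
        elements = tuple(e.strip() for e in body.split("/") if e.strip())
        if elements:
            terms.append(elements)
    return terms

sentence = (
    "<理論/物理>と<教育/心理>は<観察>できる<範囲>では似通っており、"
    "<計算/機/科学>は<理論/物理>より高い<値>で<推移>している。"
)
terms = extract_terms(sentence)
# Assumption: the head of a term is its last element.
heads = [term[-1] for term in terms]
complex_terms = [term for term in terms if len(term) > 1]
```

Simplex terms come out as one-element tuples, so the same list serves for counting both simplex and complex occurrences.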
Some preconditions/assumptions
The corpus and the target terminological space should:
belong to and represent the same domain
cover the same period of time
in general, match qualitatively
We are concerned with defining a measure which can compare the “productivity” of elements in the same lexicological/terminological sphere.
Definition of measures (a)
f(i,N): frequency of ti in the text of size N
This is the extent of use in discourse, nothing to do with lexicological productivity
d(i,N): number of different complex words whose head is ti in the text of size N
the first manifestation of lexicological productivity
basically identical to Nakagawa (2000)
thus this is the point of departure
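The two measures can be computed from a list of term occurrences as in the following sketch. It assumes that f(i,N) counts occurrences of ti as the head of a simplex or complex term and that the head is the last element; the slides do not state either point explicitly. The function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def f_and_d(term_occurrences):
    """Return (f, d) for the heads observed in a text.

    term_occurrences: one tuple of elements per term occurrence.
    f[head]: number of occurrences with that head (simplex or complex).
    d[head]: number of different complex words with that head.
    """
    f = Counter()
    complex_types = defaultdict(set)
    for term in term_occurrences:
        head = term[-1]  # assumption: head-final terms
        f[head] += 1
        if len(term) > 1:
            complex_types[head].add(term)
    d = {head: len(types) for head, types in complex_types.items()}
    return f, d

# Hypothetical term occurrences.
occurrences = [
    ("system",),
    ("expert", "system"),
    ("expert", "system"),
    ("information", "system"),
    ("model",),
    ("system", "model"),
]
f, d = f_and_d(occurrences)
```

Here "system" is used four times but heads only two different complex words, which is the gap between use in discourse and manifested productivity.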
Definition of measures (b)
d(i,N) means the manifestation of the productivity of ti as it occurs in the corpus
d(i,N) is sensitive to the extent of use of the focal element in the textual corpus, e.g. the following can be the case…
          X=N    X=2N
d(i,X)    500    600
d(j,X)    400    800
Definition of measures (c)
Better measure for manifested productivity
d(i,λN): the overall transition pattern of d(i,λN), where λ takes a positive real value (a la Hisamitsu).
The measure for potential productivity
d(i) = d(i,λN) as λ→∞: discards all the quantitative factors
Can be computed by LNRE models
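For 0 < λ ≤ 1 the transition pattern can be sketched without an LNRE model, using the standard expectation of the number of types in a random subsample drawn without replacement. This is an illustrative stand-in, not the procedure used in the study; extrapolation to λ > 1 and to λ→∞ needs a fitted LNRE model and is not attempted here. The function names and the sample frequencies are hypothetical.

```python
from math import comb

def expected_d(word_freqs, text_size, sample_size):
    """Expected number of distinct complex words with a given head that
    appear in a random subsample of `sample_size` tokens.

    word_freqs: token frequency of each complex word with that head
                in the full text of `text_size` tokens.
    A word with frequency f is absent from the subsample with probability
    C(text_size - f, sample_size) / C(text_size, sample_size).
    """
    total = comb(text_size, sample_size)
    return sum(1 - comb(text_size - f, sample_size) / total for f in word_freqs)

def transition(word_freqs, text_size, lambdas):
    """d(i, lambda * N) for each lambda in (0, 1]."""
    return [
        (lam, expected_d(word_freqs, text_size, round(lam * text_size)))
        for lam in lambdas
    ]

# Hypothetical head with four complex-word types in a 100-token text.
curve = transition([5, 3, 1, 1], 100, [0.25, 0.5, 1.0])
```

At λ = 1 the curve reaches the observed d(i,N); comparing whole curves rather than single points avoids the kind of rank reversal shown in Definition of measures (b).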
The measures and prob. distributions
Three distributions
1) The occurrence probability of heads in theoretical lexicological space.
2) The occurrence probability of modifiers for each head.
3) The probability of use of the head in the text.
Relations…
f(i,N) ⇔ 3)
d(i) ⇔ 1)
d(i,N) ⇔ 2),3)
Experiments (1/5)
Artificial intelligence abstracts in Japanese
4 elements are observed: “system” and “model” (general), and “knowledge” and “information” (specific)
#Abst   #Token (simplex/complex)   #Type (simplex/complex)
1816    299846 / 230708            8764 / 23243
Experiments (2/5)
             f(i,N)   f_simplex(i,N)   f_complex(i,N)   d(i,N)
system       1970     723              1247             502
model        1015     328              687              263
knowledge    1191     748              443              137
information  637      369              268              155
Experiments (3/5)
Experiments (4/5)
             LNRE   p-value   MSE    d(i)
system       GIGP   0.96      2.19   273,402,688,337
model        IGP    0.47      2.88   3,676,671,255
knowledge    LogN   0.88      2.72   689
information  IGP    0.84      2.32   667
Experiments (5/5)
(S = system, M = model, K = knowledge, I = information)
f(i,N) S > K > M > I
d(i,λN) S > M > I > K
d(i) S > M > K > I
General elements, such as “system” or “model,” have high lexicological productivity, while subject-specific elements, such as “knowledge” or “information,” have rather low productivity.
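The three rankings can be checked against the figures reported in Experiments (2/5) and (4/5). The sketch below makes two assumptions: d(i,N) is used as a stand-in for the d(i,λN) pattern at λ = 1, and the d(i) values are read as single integers written with thousands separators.

```python
# Values copied from the tables in Experiments (2/5) and (4/5).
f = {"system": 1970, "model": 1015, "knowledge": 1191, "information": 637}
d_observed = {"system": 502, "model": 263, "knowledge": 137, "information": 155}
d_potential = {
    "system": 273_402_688_337,
    "model": 3_676_671_255,
    "knowledge": 689,
    "information": 667,
}

def ranking(scores):
    """Names sorted from highest to lowest score."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]

rank_f = ranking(f)
rank_d_observed = ranking(d_observed)
rank_d_potential = ranking(d_potential)
```

"knowledge" is second by frequency but falls behind "model" on both productivity measures, which is the contrast between use in discourse and lexicological productivity that the slide summarises.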
Summary
Starting from the observation of ATW methods and going on to examine corpus-based quantitative terminological study, we
clarified the position of lexicology/lexicon
clarified the basic framework of quantitative lexicology/terminology, with relevant measures
gave some corresponding distributions
gave the framework of interpretation to measures
carried out experiments …
Remaining problems
Concepts of “lexicologisation” and “word”
To be registered to the lexicon
To be consolidated as a lexical unit within the syntagmatic stream of language manifestations
Distribution of complex words in texts and word unit
“reference+head” vs. “modifier+head”
The former is related to an essential concept(ualisation) of lexicon/lexicology…