Rubryx
Document Classification Technology
Authors: V.N. Polyakov, V.V. Sinitsin
State of the Art
- The classification task is a part of the information retrieval (IR) task
- There are some successful solutions
- There are benchmarks (the most popular is the Reuters-21578 text categorization test collection)
- The best reported F1 values range from 0.753 to 0.92 (Sebastiani, 1999)
- Existing machine learning technologies are not low-cost (a large volume of manual work is needed)
Rubryx Technology
- General Features
- Method Description
- Formal Task Description
- Machine Learning Technology
- Dictionary Development Technology
- Examples Selection Technology
- Tests Results and New Heuristics
- Applications and Tools
General Features
Rubryx can be characterized as follows:
- is based on a controlled dictionary;
- uses collocations in ranking texts;
- uses machine learning technology;
- uses hard classification;
- uses multi-label text categorization;
- uses both category-pivoted and document-pivoted text categorization.
Another characteristic feature of the program, not yet widely used but highly promising, is its lexical-meaning-based approach.
Method Description
1. Compile a directory and a general thematic dictionary
2. For every rubric, an expert selects sample texts for the category (five documents)
3. Generate a micro-dictionary in a special format for the category (rubric), based on the frequency of occurrence of terms from the general dictionary in the sample texts. Set a threshold for every rubric
4. Carry out a complete classification under the category
Formal Task Description
According to Sebastiani (1999), the general classification task is defined as follows: assign a Boolean value to each pair $(d_i, c_j) \in D \times C$, where $D$ is a domain of documents and $C$ is a set of predefined categories (rubrics). A value $P_{ij} = \text{True}$ means that document $d_i$ is filed under category $c_j$; $P_{ij} = \text{False}$ means that it is not.

Let there be a terminological dictionary containing the sets $L_1$, $L_2$, $L_3$, where $L_1$ is a set of one-word terms, $L_2$ a set of two-word terms, and $L_3$ a set of three-word terms.
A set of documents: $D = \{d_1, \dots, d_n\}$.
$W_{1i}$, $W_{2i}$, $W_{3i}$ are the sets of one-, two-, and three-word terms from document $d_i$.

1. Selection of sample documents for a category $c_j$. $D^*_j$ is the subset of samples, $D^*_j \subset D$.

2. Generation of a micro-dictionary for the category ($M_{1j}$, $M_{2j}$ and $M_{3j}$). Find the intersection of the sets of terms from the sample documents and the dictionary:

$$M_{1j} = (W^*_{11} \cup W^*_{12} \cup \dots \cup W^*_{1n}) \cap L_1 \quad \text{(one-word terms)}$$
$$M_{2j} = (W^*_{21} \cup W^*_{22} \cup \dots \cup W^*_{2n}) \cap L_2 \quad \text{(two-word terms)}$$
$$M_{3j} = (W^*_{31} \cup W^*_{32} \cup \dots \cup W^*_{3n}) \cap L_3 \quad \text{(three-word terms)}$$

3. Classifying. For $i = 1, \dots, n$:

$$E_{1ij} = M_{1j} \cap W_{1i}, \qquad E_{2ij} = M_{2j} \cap W_{2i}, \qquad E_{3ij} = M_{3j} \cap W_{3i} \tag{1}$$

Find the cardinalities:

$$N_{1ij} = |E_{1ij}|, \qquad N_{2ij} = |E_{2ij}|, \qquad N_{3ij} = |E_{3ij}|$$

$K_{1ij}$, $K_{2ij}$, $K_{3ij}$ are the intermediate coefficients of one-, two-, and three-word terms for document $d_i$ and category $c_j$. For $i = 1, \dots, n$:

$$K_{1ij} = \frac{N_{1ij}}{|M_{1j}|} \cdot 100\%, \qquad K_{2ij} = \frac{N_{2ij}}{|M_{2j}|} \cdot 100\%, \qquad K_{3ij} = \frac{N_{3ij}}{|M_{3j}|} \cdot 100\%$$

$K_{ij}$ is the relevance coefficient of document $d_i$ to category $c_j$:

$$K_{ij} = \frac{0.2\,K_{1ij} + 1.3\,K_{2ij} + 1.5\,K_{3ij}}{3} \tag{2}$$

$$P_{ij} = \begin{cases} 1, & \text{if } K_{ij} \ge K^* \\ 0, & \text{if } K_{ij} < K^* \end{cases}$$

$K^*$ is the threshold of $K$. $P_{ij}$ is the resulting decision (1 = True, 0 = False) on filing document $d_i$ under category $c_j$.

Machine Learning Technology
1. Compile a directory
2. For every rubric, an expert selects sample texts for the category (five documents)
3. Generate a micro-dictionary
4. Set a threshold for every rubric
5. After these four steps, Rubryx is ready for use
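The formal description and the preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Rubryx implementation: the whitespace tokenizer, all function names, and the use of a union over the sample documents when building the micro-dictionary are assumptions, and the weights 0.2, 1.3 and 1.5 are a reading of equation (2).

```python
from typing import List, Set, Tuple

# Weights of one-, two-, and three-word terms, as read from equation (2).
# Treat them as an assumption if the equation is read differently.
WEIGHTS = (0.2, 1.3, 1.5)

TermSets = Tuple[Set[str], Set[str], Set[str]]  # (L1, L2, L3) or (M1j, M2j, M3j)


def ngrams(tokens: List[str], n: int) -> Set[str]:
    """All contiguous n-word sequences of a token list, joined by spaces."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def document_terms(text: str, dictionary: TermSets) -> TermSets:
    """W1i, W2i, W3i: dictionary terms of length 1, 2 and 3 found in the text."""
    tokens = text.lower().split()  # naive tokenizer (assumption)
    found = [ngrams(tokens, n) & dictionary[n - 1] for n in (1, 2, 3)]
    return (found[0], found[1], found[2])


def micro_dictionary(samples: List[str], dictionary: TermSets) -> TermSets:
    """M1j, M2j, M3j: dictionary terms that occur in the sample texts."""
    micro: List[Set[str]] = [set(), set(), set()]
    for text in samples:
        for level, terms in enumerate(document_terms(text, dictionary)):
            micro[level] |= terms  # union over samples (assumption)
    return (micro[0], micro[1], micro[2])


def relevance(text: str, micro: TermSets, dictionary: TermSets) -> float:
    """Kij: weighted mean of the coverage percentages K1ij, K2ij, K3ij."""
    found = document_terms(text, dictionary)
    coverage = [
        100.0 * len(m & w) / len(m) if m else 0.0  # empty Mkj counts as 0 (assumption)
        for m, w in zip(micro, found)
    ]
    return sum(weight * k for weight, k in zip(WEIGHTS, coverage)) / 3


def classify(text: str, micro: TermSets, dictionary: TermSets, threshold: float) -> bool:
    """Pij: True when Kij reaches the rubric threshold K*."""
    return relevance(text, micro, dictionary) >= threshold
```

Because the two- and three-word weights are larger than the one-word weight, a document that matches the collocations of a rubric scores much higher than one that only shares isolated words with it.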
Dictionary Development Technology
1. We use an electronic terminological dictionary for the whole directory in special formats: three files, for one-word, two-word, and three-word terms respectively
2. For every sample, we determine the list of terms in the format used, with their frequencies of occurrence
3. A term is placed in the micro-dictionary if it occurs in at least M samples
4. The final micro-dictionary can be corrected by an expert
Remarks:
1. Using collocations gives us lexical meaning disambiguation
2. Frequencies are normalized to a text size of 1000 words
3. Usually M = 2
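The selection rule above (a term enters the micro-dictionary only if it occurs in at least M samples, with frequencies normalized to 1000 words) might look as follows. The function names, the whitespace tokenizer, and the choice to store the mean normalized frequency per term are assumptions made for illustration only; the special file format used by Rubryx is not reproduced here.

```python
from collections import Counter
from typing import Dict, List, Set


def term_frequencies(text: str, terms: Set[str], per: int = 1000) -> Dict[str, float]:
    """Frequency of each dictionary term in a text, normalized to `per` words."""
    tokens = text.lower().split()  # naive tokenizer (assumption)
    if not tokens:
        return {}
    counts: Counter = Counter()
    for n in (1, 2, 3):  # one-, two-, and three-word terms
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in terms:
                counts[gram] += 1
    return {term: count * per / len(tokens) for term, count in counts.items()}


def build_micro_dictionary(
    samples: List[str], terms: Set[str], min_samples: int = 2
) -> Dict[str, float]:
    """Keep terms found in at least `min_samples` samples (M, usually 2).

    The stored value is the mean normalized frequency over all samples,
    which is an illustrative choice rather than the documented format.
    """
    per_sample = [term_frequencies(text, terms) for text in samples]
    support = Counter(term for freqs in per_sample for term in freqs)
    return {
        term: sum(freqs.get(term, 0.0) for freqs in per_sample) / len(per_sample)
        for term, n_samples in support.items()
        if n_samples >= min_samples
    }
```

With M = 2, a term that appears in only one sample is treated as incidental and left out, which keeps the micro-dictionary small enough for an expert to review by hand.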
Examples Selection Technology
1. Samples are selected by an expert. Samples are the documents most relevant to each rubric
2. Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies
3. Machine learning in Rubryx also depends on the expert's qualifications, but needs less manual work
Preliminary Results of Rubryx Testing on the Reuters-21578 text categorization test collection
- F1 = 0.85 on the “places” and “topic” categories
- F1 = 1 on the “exchanges” category
- The “people” and “org” categories require new dictionaries of proper names to be developed
- Some new heuristics were generated to improve results in the “places” and “topic” categories: taking into account the position of terms in the clause, the grouping of terms in the text, and proper names
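For reference, the F1 measure quoted above is the harmonic mean of precision and recall. The sketch below uses the standard definition; the function name is hypothetical and any counts passed to it are illustrative, not figures from the Reuters-21578 tests.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall for one category."""
    if true_pos == 0:
        return 0.0  # no correct assignments: precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

A value of 1 therefore means that every document filed under the category belongs there and none was missed.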
Summary of Advantages and Know-How
- Lexical-meaning-based approach
- Using collocations gives us lexical meaning disambiguation
- We use an electronic terminological dictionary and micro-dictionaries in special formats: three files, for one-word, two-word, and three-word terms respectively
- Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies
- Comparable classification quality with low-cost machine learning
Applications and Tools
- Rubryx – text classification program (versions 1 and 2; see www.sowsoft.com/rubryx)
- DicTools – utility for dictionary development
- Spider – application program for collecting texts from the Internet with preliminary classification
- Dictionaries
Rubryx – text classification program
Status: Completed application
DicTools – utility for dictionary development
Status: Completed application
Spider – application program for collecting texts from the Internet with preliminary classification
Starting from a given web address, the application collects all pages relevant to the rubric of interest.
1. We input a category and a starting URL
2. Spider recursively follows all links and loads the pages. All pages are classified, and link paths that are not of interest are cut
3. As a result, we get a substantial saving of traffic and time
Status: Evaluation and testing
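The crawling strategy described for Spider (classify every loaded page and stop following links from pages that do not fit the rubric) can be sketched as below. Everything here is hypothetical: `focused_crawl`, `fetch`, `is_relevant`, and the in-memory `WEB` stand in for the real network access and for the Rubryx classifier, and breadth-first order is an arbitrary choice.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Page = Tuple[str, List[str]]  # (page text, outgoing links)


def focused_crawl(
    start_url: str,
    fetch: Callable[[str], Page],
    is_relevant: Callable[[str], bool],
    max_pages: int = 100,
) -> List[str]:
    """Return URLs of relevant pages reachable through relevant pages only."""
    seen = {start_url}
    queue = deque([start_url])
    relevant: List[str] = []
    loaded = 0
    while queue and loaded < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        loaded += 1
        if not is_relevant(text):
            continue  # cut this path: links of an irrelevant page are not followed
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant


# Tiny in-memory "web" used instead of real HTTP requests (illustration only).
WEB: Dict[str, Page] = {
    "a": ("steel production news", ["b", "c"]),
    "b": ("football results", ["d"]),
    "c": ("steel alloys overview", ["a"]),
    "d": ("steel prices", []),
}


def demo_fetch(url: str) -> Page:
    """Look a page up in the in-memory web."""
    return WEB[url]
```

In this toy graph, page "d" is never loaded because it is reachable only through the irrelevant page "b"; that pruning is the source of the traffic and time savings mentioned above, at the cost of missing relevant pages hidden behind irrelevant ones.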
English Dictionaries
- Natural Language Processing (7775 terms)
- Geography (5941 terms)
- Metallurgy (4946 terms)
- Polytechnical (37488 terms)
- Economics (1806 terms)
- Names of market exchanges (69080 terms)
Publications
V.N. Polyakov, V.V. Sinitsin “Method Automatic Classification of Web-resource by Patterns” in Text Processing and Cognitive Technologies. Paper Collection. Issue 6. Edited by V.D. Solovyev, V.N. Polyakov. Kazan, Otechestvo, 120-126 (2001) (Article in Russian with abstract in English)
V.N. Polyakov, V.V. Sinitsin “Rubryx: Technology of Text Classification Using Lexical Meaning Based Approach” in Proc. of International Conference Speech and Computer. SPECOM-2003. Moscow, MSLU, 137-143 (2003)
Contact Information
Vladimir N. Polyakov, Moscow State Linguistic University, [email protected]
Vladimir V. Sinitsyn, Moscow State Steel and Alloys Institute (Technological University), [email protected]
Rubryx home pages (shareware): www.sowsoft.com/rubryx/, www.rubryx.narod.ru