Rubryx Document Classification Technology Authors: V.N. Polyakov, V.V. Sinitsin


State of the Art

- The classification task is a part of the information retrieval (IR) task.
- There are some successful solutions.
- There are benchmarks; the most popular is the Reuters-21578 text categorization test collection.
- The best reported values of the F1 measure range from 0.753 to 0.92 (Sebastiani, 1999).
- Existing machine learning technologies are not low-cost: a large volume of manual work is needed.

Rubryx Technology

- General Features
- Method Description
- Formal Task Description
- Machine Learning Technology
- Dictionary Development Technology
- Examples Selection Technology
- Tests Results and New Heuristics
- Applications and Tools

General Features

Rubryx can be characterized as follows:
- is based on a controlled dictionary;
- uses collocations in ranking texts;
- uses machine learning technology;
- uses hard classification;
- uses multi-label text categorization;
- uses both category-pivoted and document-pivoted text categorization.

One more characteristic feature of the program can be added to the list, one that has not been widely used so far yet is highly promising: the lexical meaning based approach.

Method Description

1. Compile a directory and a general thematic dictionary.

2. Have an expert select sample texts (five documents) for every category (rubric).

3. Generate a micro-dictionary in a special format for each category (rubric), based on the frequency of occurrence of terms from the general dictionary in the sample texts. Set a threshold for every rubric.

4. Carry out a complete classification under the category.
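Step 3 depends on finding the dictionary terms of one, two and three words that occur in a text. A minimal sketch of such a lookup is given here; the function name `extract_terms`, the regex tokenization, the dictionary layout (a mapping from term length to a set of terms) and the sample terms are all assumptions made for illustration, not the Rubryx dictionary format, and no stemming or lemmatization is attempted.

```python
import re

def extract_terms(text, dictionary, max_len=3):
    """Return {n: set of n-word dictionary terms found in text} for n = 1..max_len.

    `dictionary` maps a term length n to a set of lowercase n-word terms
    (a hypothetical in-memory layout, not the Rubryx file format).
    """
    # Assumption: simple lowercase alphanumeric tokenization.
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = {n: set() for n in range(1, max_len + 1)}
    for n in range(1, max_len + 1):
        known = dictionary.get(n, set())
        # Slide a window of n words over the text and look each window up.
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if candidate in known:
                found[n].add(candidate)
    return found

# Illustrative data only.
dictionary = {
    1: {"steel", "alloy", "market"},
    2: {"stock exchange", "blast furnace"},
    3: {"new york stock"},
}
text = "The blast furnace produces steel. Steel prices moved on the stock exchange."
terms = extract_terms(text, dictionary)
print(terms[1], terms[2], terms[3])
```

Keeping the three term lengths in separate sets mirrors the separate treatment of one-, two- and three-word terms in the method.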

Formal Task Description

According to Sebastiani (1999), the general classification task is defined as follows: assign a Boolean value to each pair $(d_i, c_j) \in D \times C$, where $D$ is a domain of documents and $C$ is a set of pre-defined categories (rubrics). $P_{ij} = \text{True}$ means that document $d_i$ is filed under category $c_j$; $P_{ij} = \text{False}$ means that it is not.

Let there be a terminological dictionary containing the sets $L_1$, $L_2$, $L_3$, where $L_1$ is the set of one-word terms, $L_2$ the set of two-word terms, and $L_3$ the set of three-word terms.

The set of documents is $D = \{d_1, \dots, d_n\}$; $W_{1i}$, $W_{2i}$, $W_{3i}$ are the sets of one-, two-, and three-word terms from document $d_i$.

1. Selection of sample documents for a category $c_j$. $D^*_j$ is the subset of samples, $D^*_j \subset D$.

2. Generation of a micro-dictionary for the category ($M_{1j}$, $M_{2j}$ and $M_{3j}$). Find the intersection of the sets of terms from the sample documents and the dictionary:

$$M_{1j} = (W^*_{11} \cup W^*_{12} \cup \dots \cup W^*_{1n}) \cap L_1 \quad \text{(one-word terms)}$$

$$M_{2j} = (W^*_{21} \cup W^*_{22} \cup \dots \cup W^*_{2n}) \cap L_2 \quad \text{(two-word terms)}$$

$$M_{3j} = (W^*_{31} \cup W^*_{32} \cup \dots \cup W^*_{3n}) \cap L_3 \quad \text{(three-word terms)}$$

where the starred sets are the term sets of the sample documents in $D^*_j$.

3. Classifying:

$$E_{1ij} = M_{1j} \cap W_{1i}, \quad E_{2ij} = M_{2j} \cap W_{2i}, \quad E_{3ij} = M_{3j} \cap W_{3i}, \quad i = 1, \dots, n \tag{1}$$

Find the cardinalities: $N_{1ij} = |E_{1ij}|$, $N_{2ij} = |E_{2ij}|$, $N_{3ij} = |E_{3ij}|$.

$K_{1ij}$, $K_{2ij}$, $K_{3ij}$ are intermediate coefficients for the one-, two-, and three-word terms of document $d_i$ and category $c_j$:

$$K_{1ij} = \frac{N_{1ij}}{|M_{1j}|} \cdot 100\%, \quad K_{2ij} = \frac{N_{2ij}}{|M_{2j}|} \cdot 100\%, \quad K_{3ij} = \frac{N_{3ij}}{|M_{3j}|} \cdot 100\%, \quad i = 1, \dots, n$$

$K_{ij}$ is the relevance coefficient of document $d_i$ to category $c_j$:

$$K_{ij} = \frac{0.2\,K_{1ij} + 1.3\,K_{2ij} + 1.5\,K_{3ij}}{3} \tag{2}$$

$$P_{ij} = \begin{cases} 1, & \text{if } K_{ij} \ge K^* \\ 0, & \text{if } K_{ij} < K^* \end{cases}$$

where $K^*$ is the threshold of $K$ and $P_{ij}$ is the conditional probability of filing document $d_i$ under category $c_j$.
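The relevance coefficient of formula (2) and the threshold decision can be illustrated with a short sketch. The names `relevance` and `classify`, the term sets and the threshold values are hypothetical; treating an empty micro-dictionary as contributing zero is also an assumption, since the formal description does not cover that case.

```python
# Weights of one-, two- and three-word terms from formula (2); they sum to 3.
WEIGHTS = {1: 0.2, 2: 1.3, 3: 1.5}

def relevance(doc_terms, micro_dict):
    """Relevance coefficient K (in percent) of a document to a category.

    Both arguments map a term length n (1, 2, 3) to a set of terms.
    """
    total = 0.0
    for n, weight in WEIGHTS.items():
        m = micro_dict.get(n, set())
        if not m:
            continue  # assumption: an empty micro-dictionary contributes 0
        e = m & doc_terms.get(n, set())      # E = M intersected with W, formula (1)
        k_n = 100.0 * len(e) / len(m)        # intermediate coefficient, in percent
        total += weight * k_n
    return total / 3.0

def classify(doc_terms, micro_dict, threshold):
    """Hard decision: True if K >= K* (the per-rubric threshold)."""
    return relevance(doc_terms, micro_dict) >= threshold

# Illustrative data only.
micro_dict = {
    1: {"steel", "alloy", "furnace", "ore"},
    2: {"blast furnace", "rolled steel"},
    3: {"open hearth furnace"},
}
doc_terms = {1: {"steel", "ore", "market"}, 2: {"blast furnace"}, 3: set()}

k = relevance(doc_terms, micro_dict)
print(round(k, 2))                             # K1 = 50, K2 = 50, K3 = 0
print(classify(doc_terms, micro_dict, 20.0))   # hypothetical threshold K* = 20
```

With these sets, the document matches two of four one-word terms and one of two two-word terms, so K = (0.2·50 + 1.3·50 + 1.5·0) / 3 = 25. The larger weights on two- and three-word terms mean that matching collocations raises the score far more than matching single words.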

Machine Learning Technology

1. Compile a directory.

2. Have an expert select sample texts (five documents) for every rubric.

3. Generate a micro-dictionary.

4. Set a threshold for every rubric.

After these four steps Rubryx is ready for use.

Dictionary Development Technology

1. We use an electronic terminological dictionary for the whole directory in a special format: three files, for one-word, two-word and three-word terms respectively.

2. For every sample we determine the list of terms in that format, together with their frequencies of occurrence.

3. A term is placed in the micro-dictionary if it occurs in at least M samples.

4. The final micro-dictionary can be corrected by an expert.

Remarks:
1. Using collocations gives us lexical meaning disambiguation.
2. Frequencies are normalized to a text size of 1000 words.
3. Usually M = 2.
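The selection rule (a term enters the micro-dictionary if it occurs in at least M samples) and the normalization of frequencies to 1000 words can be sketched as follows. The function names and sample data are hypothetical, and how the normalized frequencies are stored in the micro-dictionary format is not specified here, so the two helpers are shown separately.

```python
from collections import Counter

def normalized_frequencies(term_counts, text_length_words):
    """Scale raw term counts to occurrences per 1000 words of text."""
    return {
        term: count * 1000.0 / text_length_words
        for term, count in term_counts.items()
    }

def build_micro_dictionary(sample_term_sets, min_samples=2):
    """Keep the terms that occur in at least `min_samples` sample documents.

    `sample_term_sets` holds one set of dictionary terms per sample text.
    """
    samples_containing = Counter()
    for terms in sample_term_sets:
        samples_containing.update(set(terms))  # count each term once per sample
    return {t for t, c in samples_containing.items() if c >= min_samples}

# Illustrative data only: terms found in three sample texts of one rubric.
samples = [
    {"steel", "alloy"},
    {"steel", "furnace"},
    {"alloy", "steel", "ore"},
]
micro = build_micro_dictionary(samples, min_samples=2)
freqs = normalized_frequencies({"steel": 3}, text_length_words=1500)
print(sorted(micro))
print(freqs)
```

Here "furnace" and "ore" each appear in only one sample and are dropped, which is the effect of M = 2: terms incidental to a single sample text do not enter the micro-dictionary.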

Examples Selection Technology

1. Samples are selected by an expert. The samples are the documents most relevant to each rubric.

2. Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies.

3. Machine learning in Rubryx also depends on the expert's qualification, but it needs less manual work.

Preliminary Results of Rubryx Testing on the Reuters-21578 text categorization test collection

- F1 = 0.85 on the “places” and “topic” categories.
- F1 = 1 on the “exchanges” category.
- The “people” and “org” categories require new dictionaries of proper names to be developed.
- Some new heuristics were devised to improve results in the “places” and “topic” categories: taking into account the position of terms in a clause, the grouping of terms in the text, and proper names.
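For reference, the F1 measure is the harmonic mean of precision and recall. A small sketch of its computation for a single category is given here; the document identifiers are invented, and the way F1 was averaged across categories in the tests above is not stated, so no averaging is shown.

```python
def f1_score(assigned, relevant):
    """F1 for one category.

    `assigned`: documents the classifier filed under the category.
    `relevant`: documents that truly belong to the category.
    """
    true_positives = len(assigned & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(assigned)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Illustrative data only.
assigned = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d4", "d5", "d6"}
score = f1_score(assigned, relevant)
print(round(score, 3))  # precision 3/4, recall 3/5
```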

Summary of Advantages and Know-How

- Lexical meaning based approach.
- Using collocations gives us lexical meaning disambiguation.
- We use an electronic terminological dictionary and micro-dictionaries in a special format: three files, for one-word, two-word and three-word terms respectively.
- Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies.
- Comparable classification quality, achieved with low-cost machine learning.

Applications and Tools

- Rubryx – text classification program (versions 1 and 2; see http://www.sowsoft.com/rubryx)
- DicTools – utility for dictionary development
- Spider – application program for text collection from the Internet with preliminary classification
- Dictionaries

Rubryx – text classification program

Status: Completed application

Spider – application program for text collection from the Internet with preliminary classification

Starting from a given web address, the application collects all pages relevant to the rubric of interest.

1. We input a category and a starting URL.

2. Spider recursively follows all links and loads the pages. Every page is classified, and link paths that are not of interest are cut.

3. As a result, we get a substantial saving of traffic and time.

Status: Evaluation and testing
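The pruning idea behind Spider can be sketched with a small in-memory link graph in place of real HTTP requests. Everything in this sketch is an assumption made for illustration: the `crawl` function, the `fetch` and `is_relevant` callbacks, the page data, and the use of a queue instead of recursion. In Rubryx the relevance test would be the classifier itself; here a keyword check stands in for it.

```python
from collections import deque

# Hypothetical miniature "web": url -> (page text, outgoing links).
PAGES = {
    "a": ("steel news", ["b", "c"]),
    "b": ("football results", ["d"]),
    "c": ("steel alloy prices", ["a"]),
    "d": ("steel furnace", []),
}
fetched = []  # records every page that was actually loaded

def fetch(url):
    """Stand-in for downloading a page."""
    fetched.append(url)
    return PAGES[url]

def is_relevant(text):
    """Stand-in for classifying a page under the chosen rubric."""
    return "steel" in text

def crawl(start_url, fetch, is_relevant):
    """Collect relevant pages, following links only from relevant pages."""
    visited = {start_url}
    collected = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        if not is_relevant(text):
            continue  # cut this link path: its links are never followed
        collected.append(url)
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return collected

collected = crawl("a", fetch, is_relevant)
print(collected)
print(fetched)
```

Page "d" is reachable only through the irrelevant page "b", so it is never loaded. This is where the saving of traffic comes from, at the cost of missing relevant pages that sit behind irrelevant ones.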

English Dictionaries

- Natural Language Processing (7775 terms)
- Geography (5941 terms)
- Metallurgy (4946 terms)
- Polytechnical (37488 terms)
- Economics (1806 terms)
- Names of market exchanges (69080 terms)

Publications

V.N. Polyakov, V.V. Sinitsin “Method Automatic Classification of Web-resource by Patterns” in Text Processing and Cognitive Technologies. Paper Collection. Issue 6. Edited by V.D. Solovyev, V.N. Polyakov. Kazan, Otechestvo, 120-126 (2001) (Article in Russian with abstract in English)

V.N. Polyakov, V.V. Sinitsin “Rubryx: Technology of Text Classification Using Lexical Meaning Based Approach” in Proc. of International Conference Speech and Computer. SPECOM-2003. Moscow, MSLU, 137-143 (2003)

Contact Information

Vladimir N. Polyakov, Moscow State Linguistic University, [email protected]

Vladimir V. Sinitsyn, Moscow State Steel and Alloys Institute (Technological University), [email protected]

Rubryx home pages (shareware): www.sowsoft.com/rubryx/ and www.rubryx.narod.ru
