Complex Linguistic Features for Text Classification: A Comprehensive Study

Alessandro Moschitti and Roberto Basili
University of Texas at Dallas, University of Rome Tor Vergata

ECIR 2004

Abstract

Previous research on advanced representations for document retrieval has shown that state-of-the-art statistical models are not improved by a variety of different linguistic representations.

Phrases, word senses and syntactic relations derived by NLP techniques were found ineffective at increasing retrieval accuracy.

For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available.

In this paper, extensive experiments on representative classifiers (Rocchio and SVM) have been carried out to study how some NLP techniques impact TC.

Cross validation over 4 different corpora in two languages allowed us to gather overwhelming evidence that complex nominals, proper nouns and word senses are not adequate to improve TC accuracy.

Introduction

Several attempts have been made to design complex and effective features for document retrieval and filtering:

Document lemmas
  Base forms of morphological categories

Phrases
  Simple n-grams, e.g. officials said
  Noun phrases, such as named entities
  <head, modifier_1, …, modifier_n> tuples

Word senses
  Defined by means of an explanation, as in a dictionary entry
  Defined by using other words that share the same sense, as in a thesaurus, e.g. WordNet

Phrases and Document Retrieval

The goal of using both phrases and word senses is to increase the precision of concept matching.

Phrases were tested at the TREC conferences, and some conclusions were drawn:

1. The higher computational cost of the NLP algorithms employed prevents their application in operational IR scenarios.

2. The NLP representations tested can improve basic retrieval models (e.g. SMART), but yield no improvement for advanced statistical retrieval models [Strzalkowski, 1998].

Word Senses and Document Retrieval

In [Smeaton, 1998], NLP resources such as WordNet were tested, rather than NLP techniques.
  Positive results were obtained only after the senses were manually validated.
  Word sense disambiguation (WSD) performance, ranging between 60% and 70%, was not adequate to improve document retrieval.

For text indexing and query expansion, semantic information taken directly from WordNet without performing WSD does not help IR at all.

The high computational cost of the adopted NLP algorithms, the small improvement produced and the lack of accurate WSD tools are the reasons for the failure of NLP in IR.

Text Categorization and IR

Since TC is a subtask of IR, why should we try to use the same NLP techniques for TC?
  In TC, sets of both positive and negative documents describing the categories are available.
  Categories differ from queries in that they are static, i.e., a predefined set of training documents stably defines the target category.

Effective WSD algorithms can be applied to documents, whereas this was not the case for queries (especially short queries).

Recent evaluations in SENSEVAL have shown accuracies of 70% for verbs, 75% for adjectives and 80% for nouns.

As TC is a relatively new research area, there are fewer studies that employ NLP techniques for it, and several of them report noticeable improvements over the bag-of-words.

The Goal

In this paper, the impact of richer document representations on TC is investigated in depth on four corpora in two languages, using cross-validation analysis.

Phrase and sense representations were tested on three classification systems:
  Rocchio [J. Rocchio, 1971], an efficient classifier
  The Parameterized Rocchio Classifier (PRC) [A. Moschitti, 2003]
  SVM-light [T. Joachims, 1999], a state-of-the-art TC model

Richer representations can be really useful only if:

a. Accuracy increases with respect to the bag-of-words baseline for the different systems, or

b. They improve computationally efficient classifiers so that they approach the accuracy of state-of-the-art models.

Natural Language Feature Engineering

The linguistic features used to train classifiers:

POS-tag information
  The Brill tagger is used (~95% accuracy).

Phrases
  Proper nouns: persons, locations, artifacts.
  Complex nominals expressing domain concepts, e.g., bond issues or beach wagon.

Word senses

Automatic Phrase Extraction

For proper nouns
  Detection is achieved by applying a grammar that takes into account the capital letters of nouns, e.g., International Bureau of Law.

For complex nominals
  Based on an integration of symbolic and statistical models presented in [R. Basili, 1997].
  Three steps:

1. The detection of atomic terms ht, e.g. issue.

2. The identification of admissible candidates, i.e. linguistic structures headed by ht (satisfying linguistically principled grammars).

3. The selection of the final complex nominals via a statistical filter such as mutual information.

The phrases were extracted per category.
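
The statistical filter of step 3 can be sketched as follows. This is only a minimal illustration, assuming that two-word candidates have already been produced by steps 1 and 2. The function name pmi_filter, the threshold value and the toy sentences are hypothetical and are not taken from [R. Basili, 1997]; pointwise mutual information over adjacent-word counts is used here as one possible form of the mutual information filter.

```python
import math
from collections import Counter

def pmi_filter(sentences, candidates, threshold=1.0):
    """Keep candidate (modifier, head) pairs whose pointwise mutual
    information, estimated from adjacent-word counts, exceeds threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    selected = {}
    for pair in candidates:
        if bigrams[pair] == 0:
            continue  # candidate never observed as an adjacent pair
        p_pair = bigrams[pair] / n_bi
        p_mod = unigrams[pair[0]] / n_uni
        p_head = unigrams[pair[1]] / n_uni
        score = math.log2(p_pair / (p_mod * p_head))
        if score > threshold:
            selected[pair] = score
    return selected

# Toy corpus (hypothetical), already tokenized and lower-cased.
sentences = [
    ["the", "bond", "issue", "was", "announced"],
    ["a", "new", "bond", "issue", "is", "planned"],
    ["the", "issue", "was", "the", "price"],
    ["the", "price", "of", "the", "bond", "fell"],
]
# Candidates headed by the atomic term "issue" (assumed output of step 2).
candidates = [("bond", "issue"), ("the", "issue")]
selected = pmi_filter(sentences, candidates)
```

On this toy data the pair (bond, issue) passes the filter, while (the, issue) is rejected because "the" is frequent everywhere. Since the phrases were extracted per category, such a filter would be run once on the documents of each category.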

WSD Algorithms

WordNet is used to assign senses to nouns only, for which WSD accuracy is higher.

Three WSD algorithms:

Baseline: assigns each noun its most frequent sense.

An algorithm based on gloss information:
  Exploits the definition of each synset, e.g. {hit, noun}#1 = (a successful stroke in an athletic contest (especially in baseball); "he came all the way around on Williams' hit")
  Selects the sense whose local context (the definition of the synset) best matches the global context (the context of the target noun).
  The match is measured by counting the number of nouns that are in both contexts.

The WSD system developed by LCC (Language Computer Corporation):
  The one that won SENSEVAL [A. Kilgarriff, 2000].
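
The gloss-based algorithm can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the sense inventory is a hand-written dictionary rather than WordNet, the nouns of each gloss are assumed to be already extracted, and the second sense of hit is invented for the example.

```python
def disambiguate(context_nouns, sense_glosses):
    """Return the sense whose gloss nouns overlap most with the context.

    context_nouns: nouns around the target noun (the global context).
    sense_glosses: dict mapping a sense id to the nouns of its definition
                   (the local context). Ties go to the first sense listed,
                   which assumes senses are ordered by frequency.
    """
    context = set(context_nouns)
    best_sense, best_overlap = None, -1
    for sense, gloss_nouns in sense_glosses.items():
        # Count the nouns that appear in both contexts.
        overlap = len(context & set(gloss_nouns))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

hit_senses = {
    # Nouns of the gloss quoted above for {hit, noun}#1.
    "hit#1": ["stroke", "contest", "baseball"],
    # Hypothetical second sense, written only for this example.
    "hit#2": ["success", "song", "film", "audience"],
}
context = ["pitcher", "baseball", "inning", "contest"]
sense = disambiguate(context, hit_senses)
```

With no overlap at all, the function falls back to the first sense listed, which coincides with the most-frequent-sense baseline when senses are ordered by frequency.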

Experiments on Linguistic Features

The evaluation of phrases and POS information
  Using Rocchio, PRC and SVM over the Reuters3, Ohsumed and ANSA collections

The evaluation of semantic information (senses)
  Using SVM on the Reuters-21578 and 20NewsGroups corpora

Experimental Set-Up

Document collections

The Reuters-21578 corpus, Apté split: 12,902 documents for 90 classes, with a fixed split between testing and training (3,299 vs. 9,603).

The Reuters3 corpus: 11,099 documents for 93 classes, with a split of 3,309 vs. 7,789 between testing and training.

The ANSA collection: 16,000 news items in Italian from the ANSA news agency, 8 target categories.

The Ohsumed collection: 50,216 medical abstracts. Only the first 20,000 documents in the 23 MeSH diseases categories are used.

The 20NewsGroups corpus (20NG): 19,997 articles for 20 categories taken from the Usenet newsgroups collection. It differs from Reuters and Ohsumed in its larger vocabulary.

Experimental Set-Up

Two sets of tokens are considered as baselines:

The Tokens set, which contains a larger number of features and should provide the most general bag-of-words results.

The Linguistic-Tokens set (i.e. only the nouns, verbs and adjectives), which is selected using the POS information.

+CN indicates that proper nouns and other complex nominals are used as features for the classifiers.

+POS indicates that features are tokens augmented with their POS tags in context.

The NLP-derived features are added to the standard token sets, instead of replacing some of them.
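
A minimal sketch of how these feature sets relate to each other is given below. It assumes a document is already POS-tagged as (word, tag) pairs with Penn-Treebank-style tags and that a list of complex nominals is available; the function names and the word/TAG and word_word feature formats are hypothetical choices made for the example.

```python
# Tag prefixes for nouns, verbs and adjectives (an assumption: the
# source only says that POS information is used for the selection).
CONTENT_PREFIXES = ("NN", "VB", "JJ")

def tokens(tagged):
    """Tokens: every word of the document."""
    return [word for word, _ in tagged]

def linguistic_tokens(tagged):
    """Linguistic-Tokens: only nouns, verbs and adjectives."""
    return [word for word, tag in tagged if tag.startswith(CONTENT_PREFIXES)]

def add_pos(base, tagged):
    """+POS: add word/TAG features for the words of the base set."""
    kept = set(base)
    return list(base) + [f"{w}/{t}" for w, t in tagged if w in kept]

def add_complex_nominals(base, words, nominals):
    """+CN: add one feature per occurrence of a known phrase."""
    features = list(base)
    for phrase in nominals:
        n = len(phrase)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == phrase:
                features.append("_".join(phrase))
    return features

tagged = [("the", "DT"), ("bond", "NN"), ("issue", "NN"),
          ("was", "VBD"), ("large", "JJ")]
base = linguistic_tokens(tagged)
plus_pos = add_pos(base, tagged)
plus_cn = add_complex_nominals(base, tokens(tagged), [("bond", "issue")])
```

Both +POS and +CN keep the base tokens and append the new features, which mirrors the choice of adding the NLP-derived features rather than replacing tokens.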

Experimental Set-Up

Evaluation (microaverage for global performance)
  Breakeven Point (BEP): the point at which precision = recall
  F1 measure: F1 = 2PR / (P + R)
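
Micro-averaging pools the decisions of all categories before computing precision and recall. The sketch below shows this for the F1 measure; the per-category counts are invented for illustration, and the BEP (which requires sweeping a decision threshold until P = R) is not implemented.

```python
def micro_average(counts):
    """counts: dict category -> (tp, fp, fn). Returns (P, R, F1)
    computed on the counts summed over all categories."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical (true positive, false positive, false negative) counts.
counts = {"earn": (90, 10, 10), "acq": (40, 10, 20), "corn": (5, 5, 10)}
p, r, f1 = micro_average(counts)
```

Because the counts are summed first, large categories dominate the micro-average, which is why it is used as a measure of global performance.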

Cross-Corpora/Classifier Validations of Phrases and POS-information

Table 2. Breakeven points of PRC over Reuters3 corpus. The linguistic features are added to the Linguistic-Tokens set.

Do linguistic features improve the results? A single experiment is not conclusive:

An alternative feature set could outperform the bag-of-words in a single experiment.

The classifier parameters could be better suited to a particular training/test-set split.

20 randomly generated splits (70%-30%) are therefore used for cross validation.
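
The splitting procedure can be sketched as follows. This is an illustration under assumptions: the splits are drawn uniformly at random without stratification by category, and the function name, the seed and the use of integer document ids are hypothetical.

```python
import random

def random_splits(doc_ids, n_splits=20, train_fraction=0.7, seed=0):
    """Yield (train, test) lists for n_splits random 70%-30% splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    doc_ids = list(doc_ids)
    n_train = round(len(doc_ids) * train_fraction)
    for _ in range(n_splits):
        shuffled = doc_ids[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]

splits = list(random_splits(range(100)))
```

Each feature set would then be trained and evaluated on every split, and the mean and standard deviation of the scores compared, so that a difference observed on one lucky split is not mistaken for a real improvement.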


Table 3. Rocchio, PRC and SVM performances on different feature sets of the Reuters3 corpus


Table 4. Rocchio, PRC and SVM performances on different feature sets of the Ohsumed corpus

The Neonatal category is improved by the extended features → this should be considered a normal record of cases.

Cross Validation on Word Senses

The performance of SVM over Tokens is compared with its performance over the Semantic feature sets (= Tokens + disambiguated noun senses).

An indicative evaluation of the WSD algorithms (on 250 manually disambiguated nouns from Reuters-21578 documents):
  Baseline: 78.43%
  Algorithm 1 (gloss-based): 77.12%
  Algorithm 2 (LCC): 80.55%


Table 6. Performance of SVM on the Reuters-21578 corpus.

Does semantic information (WSD) enhance the classifier?


Table 7. SVM μf1 performances on 20NewsGroups.

When the words are richer in terms of possible senses, the baseline performs lower than Alg2.

When all the nouns are replaced with their disambiguated senses, performance is lower (by 1% to 3%) than with the bag-of-words.

Why Do Phrases Not Help?

Two properties of phrases are possible explanations.

Loss of coverage:
  Word information cannot be easily subsumed by the phrase information, e.g. George_Bush → Bush

Poor effectiveness:
  The information added by word sequences is poorer than that of the word set.
  Two necessary conditions for a phrase to be better than its word set:
    The words in the sequence should appear non-sequentially in some incorrect documents, e.g. George and Bush appear non-sequentially in a sports document.
    All the correct documents that contain one of the compounding words (e.g. George or Bush) should at the same time contain the whole sequence.
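
The two conditions can be checked mechanically on a labeled collection. The sketch below is one possible reading of them, not a procedure from the paper: the first condition is interpreted as "some incorrect document contains all the words of the phrase, but not as a sequence", and the three toy documents are invented.

```python
def contains_sequence(words, phrase):
    """True if the words of phrase occur contiguously, in order."""
    n = len(phrase)
    return any(tuple(words[i:i + n]) == tuple(phrase)
               for i in range(len(words) - n + 1))

def phrase_conditions(docs, phrase):
    """docs: list of (words, is_correct) pairs, where is_correct says
    whether the document belongs to the target category.
    Returns (condition_1, condition_2)."""
    parts = set(phrase)
    # Condition 1: some incorrect document has all the words, not in sequence.
    cond1 = any(
        not correct
        and parts <= set(words)
        and not contains_sequence(words, phrase)
        for words, correct in docs
    )
    # Condition 2: every correct document containing any of the words
    # also contains the whole sequence.
    cond2 = all(
        contains_sequence(words, phrase)
        for words, correct in docs
        if correct and parts & set(words)
    )
    return cond1, cond2

docs = [
    (["george", "bush", "signed", "the", "bill"], True),
    (["president", "bush", "met", "congress"], True),
    (["george", "scored", "near", "the", "bush"], False),
]
result = phrase_conditions(docs, ("george", "bush"))
```

Here the first condition holds (the sports document), but the second fails because a correct document mentions Bush alone. In such a case the single word already covers a document that the phrase misses, which is the loss of coverage described above.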

Why Do Senses Not Help?

The senses of a noun in the documents of a category tend to be always the same.

Moreover, different categories are characterized by different words rather than different senses.

A general view: textual representations are always very good at capturing the overall semantics of documents, at least as good as linguistically justified representations.

IR methods oriented to textual representations of document semantics should be investigated first, and they should stress the role of words as vehicles of natural language semantics (as opposed to logic systems of semantic types, like ontologies).

Conclusions

This paper reports a study of advanced document representations for TC.

Several combinations of different feature sets have been extensively tested with three classifiers (Rocchio, PRC and SVM) over 4 corpora in two languages.

The results have shown that neither semantic information (word senses) nor syntactic information (phrases and POS tags) achieves the targeted improvement.

The outcome of this careful analysis is not a negative statement on the role of complex linguistic features in TC, but it suggests that the elementary textual representation based on words is very effective.
