N-gram Tokenization for Indian Language Text Retrieval
Paul McNamee
13 December 2008
Talk Outline
Introduction
Monolingual Experiments from CLEF 2000-2007
  Words
  Stemmed words (Snowball)
  Character n-grams (n=4,5)
  N-gram stems
  Automatically segmented words (Morfessor algorithm)
  Skipgrams (n-grams with skips)
Why are n-grams effective?
Bilingual Experiments (CLEF)
FIRE Results
Summary
Morphological Processes
Inflection: box, boxes (plural); actor (male), actress (female)
Conjugation: write, written, writing; swim, swam, swum
Derivation: sleep, sleepy; play (verb), player (noun), playful (adjective)
Word Formation
  Compounding: news + paper = newspaper; air + port = airport
  Clipping: professor -> prof; facsimile -> fax
  Acronyms: GOI = Government of India
Why Do We Normalize Text?
It seems desirable to group related words together for query/document processing
Why? To make lexicographers happy? To improve system performance?
If performance is the goal, then it ought not to matter whether the indexing terms look like morphemes, or not
Rule-Based Stemming: Snowball
Applicable to alphabetic languages
An approximation to lemmatization
Identify a root morpheme by chopping off prefixes and suffixes
Used for Dutch, English, Finnish, French, German, Italian, Spanish, and Swedish
Snowball rulesets also exist for Hungarian and Portuguese
No Indian language support
Most stemmers are rule-based
  -ing => (removed): juggling => juggl
  -es => (removed): juggles => juggl
  -le => -l: juggle => juggl
The Snowball project provides high quality, rule-based stemmers for many European languages
http://snowball.tartarus.org/
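The three suffix rules on this slide can be sketched as a toy stemmer. This is only an illustration of rule-based suffix stripping, not the Snowball algorithm; the name `toy_stem`, the rule table, and the minimum-length guard are assumptions made for the example.

```python
# Toy suffix rules taken from the slide: (suffix, replacement).
SUFFIX_RULES = [("ing", ""), ("es", ""), ("le", "l")]


def toy_stem(word):
    """Apply the first matching suffix rule; leave very short words alone."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # The length guard (an assumption) avoids reducing short words to nothing.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word
```

With these rules juggling, juggles, and juggle all conflate to the same stem, which is the behaviour the slide illustrates.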
N-gram Tokenization
Advantages: simple, addresses morphology, surrogate for short phrases, robust against spelling & diacritical errors, language-independent
Disadvantages: conflation (e.g., simmer, slimmer, glimmer, immerse); n-grams incur both speed and disk usage penalties
Represent text as overlapping substrings
A fixed length n of 4 or 5 is effective in alphabetic languages
For text of length m, there are m-n+1 n-grams
Example: 5-grams of "swimmers" (with _ marking the word boundary): _swim, swimm, wimme, immer, mmers, mers_
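A minimal sketch of the tokenization shown in the swimmers example, assuming the text is lowercased and that word boundaries (including spaces between words) are written as a single `_`. The function name `char_ngrams` and these normalization choices are assumptions for illustration.

```python
def char_ngrams(text, n=5, pad="_"):
    """Return the overlapping character n-grams of text.

    Whitespace is collapsed to a single boundary marker, and the text is
    padded with the marker at both ends, so a padded text of length m
    yields m - n + 1 n-grams.
    """
    normalized = pad + pad.join(text.lower().split()) + pad
    if len(normalized) <= n:
        # Text shorter than the window: keep it as a single term.
        return [normalized]
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
```

For "swimmers" the padded text has length 10, so n=5 gives 10-5+1 = 6 terms, matching the six 5-grams on the slide.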
Single N-gram Stemming
Traditional (rule-based) stemming attempts to remove the morphologically variable portion of words
  Negative effects from over- and under-conflation
4-grams of two example words (occurrence counts in parentheses):
Hungarian       Bulgarian
_hun (20547)    _bul (10222)
hung (4329)     bulg (963)
unga (1773)     ulga (1955)
ngar (1194)     lgar (1480)
gari (2477)     gari (2477)
aria (11036)    aria (11036)
rian (18485)    rian (18485)
ian_ (49777)    ian_ (49777)
Short n-grams covering affixes occur frequently - those around the morpheme tend to occur less often. This motivates the following approach:
(1) For each word choose the least frequently occurring character 4-gram (using a 4-gram index)
(2) Benefits of n-grams with run-time efficiency of stemming
Continues work in Mayfield and McNamee, ‘Single N-gram Stemming’, SIGIR 2003
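Step (1) can be sketched as follows, using the counts from the Hungarian/Bulgarian table as a stand-in for a real 4-gram index. The names `FOUR_GRAM_COUNTS` and `single_ngram_stem`, and the choice to ignore n-grams missing from the index, are assumptions for this example.

```python
# Counts copied from the slide's table; a real system would read them
# from the 4-gram index of the collection.
FOUR_GRAM_COUNTS = {
    "_hun": 20547, "hung": 4329, "unga": 1773, "ngar": 1194,
    "_bul": 10222, "bulg": 963, "ulga": 1955, "lgar": 1480,
    "gari": 2477, "aria": 11036, "rian": 18485, "ian_": 49777,
}


def four_grams(word):
    """Padded character 4-grams of a single word."""
    padded = "_" + word.lower() + "_"
    return [padded[i:i + 4] for i in range(len(padded) - 3)]


def single_ngram_stem(word, counts):
    """Pick the least frequent 4-gram of the word as its pseudo-stem."""
    # Assumption: n-grams absent from the index are not considered.
    candidates = [g for g in four_grams(word) if g in counts]
    if not candidates:
        return word.lower()
    return min(candidates, key=lambda g: counts[g])
```

With the table's counts, Hungarian is represented by `ngar` and Bulgarian by `bulg`, rather than by the frequent suffix n-grams the two words share.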
Statistical Segmentation
Morfessor Algorithm
  Given a dictionary list, learns to split words into segments
  A form of statistical stemming based on Minimum Description Length (MDL)
  > 70% of world languages have concatenative morphology
  Creutz & Lagus, ACL-2002
  http://www.cis.hut.fi/projects/morpho
2007 Morphology Challenge
  Successful on an IR task
  Multiple segments per word are generated
Examples: affect+ion+ate, author+ized, juggle+d, juggle+r+s, sea+gull+s
See McNamee, Nicholas, & Mayfield, ‘Don’t Have a Stemmer? Be un+concern+ed’, SIGIR 2008
Character Skipgrams
Character n-grams: robust matching technique
Skipgrams: super robust matching
  Some letters are omitted (essentially a wildcard match)
  sw*m matches swim / swam / swum
  f**t matches foot / feet
Skip bi-grams for fuzzy matching
  Pirkola et al. (2002): learning cross-lingual translation mappings in related languages
  Mustafa (2004): monolingual Arabic retrieval
Example: 4,2 skipgrams for Hopkins (4 letters, 2 skips)
  hkin, hpin, hpkn, hoin, hokn, hopn
  oins, okns, okis, opns, opis, opks
  Note: more skipgrams than plain n-grams
Slight gains in Czech, Hungarian, Persian
Application to OCR’d docs?
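The Hopkins example is consistent with the following reading, offered as a sketch rather than the system's actual definition: slide a window of n + skips letters over the word, always keep the first and last letter, and drop every combination of `skips` interior letters. The function name `skipgrams` and the absence of boundary padding are assumptions inferred from the listed terms.

```python
from itertools import combinations


def skipgrams(word, n=4, skips=2):
    """Character skipgrams: n kept letters drawn from a window of n + skips."""
    word = word.lower()
    span = n + skips
    grams = []
    for start in range(len(word) - span + 1):
        window = word[start:start + span]
        # Only interior positions may be skipped; the window ends are kept.
        for skipped in combinations(range(1, span - 1), skips):
            grams.append("".join(ch for i, ch in enumerate(window) if i not in skipped))
    return grams
```

Each window of 6 letters produces C(4,2) = 6 skipgrams, so the two windows of "hopkins" give 12 terms where plain 4-grams would give only 4 unpadded ones.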
Generating Indexing Terms
Word | Snowball | Morfessor | 5-grams
authored | author | author+ed | _auth, autho, uthor, thore, hored, ored_
authorized | author | author+ized | _auth, autho, uthor, thori, horiz, orize, rized, ized_
authorship | authorship | author+ship | _auth, autho, uthor, thors, horsh, orshi, rship, ship_
reauthorization | reauthor | re+author+ization | _reau, reaut, eauth, autho, uthor, thori, horiz, oriza, rizat, izati, zatio, ation, tion_
afoot | afoot | a+foot | _afoo, afoot, foot_
footballs | footbal | football+s | _foot, footb, ootba, otbal, tball, balls, alls_
footloose | footloos | foot+loose | _foot, footl, ootlo, otloo, tloos, loose, oose_
footprint | footprint | foot+print | _foot, footp, ootpr, otpri, tprin, print, rint_
feet | feet | feet | _feet, feet_
juggle | juggl | juggle | _jugg, juggl, uggle, ggle_
juggled | juggl | juggle+d | _jugg, juggl, uggle, ggled, gled_
jugglers | juggler | juggle+r+s | _jugg, juggl, uggle, ggler, glers, lers_
JHU/APL HAIRCUT System
The Hopkins Automatic Information Retriever for Combing Unstructured Text (HAIRCUT)
Uses state-of-the-art statistical language model
  Ponte & Croft, ‘A Language Modeling Approach to Information Retrieval,’ SIGIR-98
  Miller, Leek, and Schwartz, ‘A Hidden Markov Model Information Retrieval System’, SIGIR-99
  $$P(D \mid Q) \propto \prod_{t \in Q} \left[ \lambda \cdot P(t \mid D) + (1 - \lambda) \cdot P(t \mid C) \right]$$
  Typically set λ to 0.5
Language-neutral
Supports large dictionaries
Used at TREC (10x), CLEF (9x), NTCIR (2x)
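The ranking formula on this slide, a product over query terms of λ·P(t|D) + (1−λ)·P(t|C), can be sketched in log space as below. This is not HAIRCUT's code: the function name `lm_score`, the maximum-likelihood estimates for P(t|D) and P(t|C), and the decision to skip query terms that never occur in the collection are all assumptions for illustration.

```python
import math
from collections import Counter


def lm_score(query_terms, doc_terms, collection_counts, collection_len, lam=0.5):
    """Log of prod over t in Q of [lam * P(t|D) + (1 - lam) * P(t|C)]."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_score = 0.0
    for term in query_terms:
        # Relative frequency in the whole collection (assumed estimate).
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0.0:
            # Assumption: terms unseen in the collection carry no evidence.
            continue
        # Relative frequency in the document (assumed estimate).
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        log_score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_score
```

The terms may be words, stems, n-grams, or skipgrams; the scoring is the same, which is what makes the model language-neutral.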
CLEF Ad Hoc Test Sets (2000 – 2007)
Language | #docs | Size | Topics per year, 2000-2007 (years without topics omitted) | Total topics
Bulgarian (BG) | 69 k | 213 MB | 49, 50, 50 | 149
Czech (CS) | 82 k | 178 MB | 50 | 50
Dutch (NL) | 190 k | 540 MB | 50, 50, 56 | 156
English (EN) | 170 k | 580 MB | 33, 47, 42, 54, 42, 50, 49, 50 | 367
Finnish (FI) | 55 k | 137 MB | 30, 45, 45 | 120
French (FR) | 178 k | 470 MB | 34, 49, 50, 52, 49, 50, 49 | 333
German (DE) | 295 k | 660 MB | 37, 49, 50, 56 | 192
Hungarian (HU) | 50 k | 105 MB | 50, 48, 50 | 148
Italian (IT) | 157 k | 363 MB | 34, 47, 49, 51 | 181
Portuguese (PT) | 107 k | 340 MB | 46, 50, 50 | 146
Russian (RU) | 17 k | 68 MB | 28, 34 | 62
Spanish (ES) | 453 k | 1086 MB | 49, 50, 57 | 156
Swedish (SV) | 143 k | 352 MB | 49, 53 | 102
Tokenization Alternatives
Stemming
  Effective in Romance languages
  Not always available
N-grams
  Language-neutral
  Large gains in complex languages
Other techniques
  Statistical stemming beats words
    Segmentation
    Single n-gram stems
  No run-time penalty
Monolingual Tokenization
IR & Language Family
5-gram Gains
  Tied to morphological complexity
  Small improvements in Romance family
Estimating Complexity
  Mean word length: Spearman rho = 0.77
  Information-theoretic approach: Spearman rho = 0.67 (Kettunen et al., Juola)
[Figure: plots with points labeled by language code (HU, FI, DE, CS, SV, NL, RU, BG)]
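For readers who want to reproduce this kind of analysis, here is a small sketch of the two ingredients: mean word length as a complexity estimate, and Spearman's rho computed as the Pearson correlation of average ranks. The function names are assumptions, and no values from the talk are included; per-language complexity estimates and 5-gram gains would have to be supplied.

```python
def mean_word_length(text):
    """Average number of characters per whitespace-separated word."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation (assumes neither input is constant)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Calling `spearman_rho(complexity_by_language, gain_by_language)` on two equally ordered lists gives a coefficient comparable to the rho values quoted on the slide.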
Why are N-grams Effective?
(1) Spelling
  N-grams localize single-letter spelling errors
  In news, about 1 in 2000 words is misspelled
(2) Phrasal Clues
  Word-spanning n-grams hint at phrases
  Only slight differences observed
(3) Because of Morphological Variation?
N-grams might gain their power by controlling for morphological variation
  N-grams focused on root morphemes tend to match across inflected forms
Juola (1998) and Kettunen (2006) did experiments ‘removing’ morphology from language
  Such as replacing each surface form with a 6-digit number
I compared words and 5-grams under normal and permuted letter conditions
  golfer: legfro
  golfed: dofegl
  golfing: ligfron
Source of N-gram Power
Idea: remove morphology from a language
  Letter order of words was randomly permuted
  golfer -> legfro, team -> eamt
  golfing, golfer, golfed no longer share a morpheme
4 conditions: {words, 5-grams} x {normal, shuffled}
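A sketch of the shuffled condition, under the assumption that each distinct word is assigned one fixed random permutation so that repeated occurrences of the same word still match each other. The helper name `make_shuffler` and the seeding are hypothetical; the talk does not say how the permutation was implemented.

```python
import random


def make_shuffler(seed=0):
    """Return a function that maps each word to a fixed random letter permutation."""
    rng = random.Random(seed)
    cache = {}

    def shuffle_word(word):
        # Reuse the stored permutation so a word always maps to the same form.
        if word not in cache:
            letters = list(word)
            rng.shuffle(letters)
            cache[word] = "".join(letters)
        return cache[word]

    return shuffle_word
```

Because related forms such as golfer and golfed are permuted independently, their shuffled forms no longer share a contiguous stem, while exact word matches are preserved.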
Corpus-Based Translation
Given aligned parallel texts and a particular term to translate
  Find set of documents (sentences) in the source language containing the term
  Examine corresponding foreign documents
  Extract ‘good’ candidate(s)
  Goodness can be based on term similarity measures (Dice, MI, IBM Model 1, etc.)
Example aligned sentences:
  El precio del petróleo aumentó ayer. La economía reaccionó agudamente …
  The price of oil increased yesterday. The economy reacted sharply …
The Rosetta Stone was discovered in 1799 by Napoleonic forces in Egypt. British physicist Thomas Young determined that cartouches were names of royalty. In 1821 Jean François Champollion began deciphering hieroglyphics using parallel data in Demotic and Greek.
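The procedure above can be sketched with the Dice coefficient as the goodness measure, using the slide's two Spanish/English sentence pairs (without the trailing ellipses) as a toy aligned corpus. The names `dice_candidates` and `PAIRS`, the word-level tokenization, and counting each term once per sentence are assumptions for this example; the same counting would apply to n-gram terms.

```python
import re
from collections import Counter

# Toy aligned corpus built from the slide's example sentences.
PAIRS = [
    ("El precio del petróleo aumentó ayer.", "The price of oil increased yesterday."),
    ("La economía reaccionó agudamente", "The economy reacted sharply"),
]


def tokenize(sentence):
    """Lowercased set of word tokens (each term counted once per sentence)."""
    return set(re.findall(r"\w+", sentence.lower()))


def dice_candidates(term, aligned_pairs):
    """Score target-side terms by Dice = 2 * both / (source_df + target_df)."""
    term = term.lower()
    cooccur = Counter()    # target term -> sentences where it aligns with term
    target_df = Counter()  # target term -> sentences containing it
    source_df = 0          # source sentences containing term
    for source, target in aligned_pairs:
        target_terms = tokenize(target)
        target_df.update(target_terms)
        if term in tokenize(source):
            source_df += 1
            cooccur.update(target_terms)
    return {t: 2 * c / (source_df + target_df[t]) for t, c in cooccur.items()}
```

On this toy corpus, "price" scores 1.0 as a candidate for "precio", while "the", which also occurs in the other English sentence, scores lower. A corpus this small cannot separate "price" from "oil"; more aligned text is needed for that.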
N-gram Translations
Character n-grams can be statistically translated, just like words
N-grams (such as n=4,5) are smaller than words
  May capture affixes and morphological roots
  ‘work’ (from working) maps to ‘abaj’ (as in trabajaba)
  ‘yrup’ (from syrup) maps to ‘rabe’ (as in jarabe)
Suitable with Proper Nouns
  ‘therl’ (from Netherlands) to ‘ses b’ (as in Países Bajos)
Term | German | Italian
word | milch | latte
stem | milch | latt
4-grams | milc, ilch | latt, latt
5-grams | _milc, milch, ilch_ | _latt, _latt, latte
Term | French | Dutch
word | lait | melk
stem | lait | melk
4-grams | lait | melk
5-grams | _lait, lait_ | _melk, melk_
Parallel Sources
Corpus | Size | Genre | CLEF Languages
Bible | 785k | Religious | CZ, DE, EN, ES, FI, FR, IT, NL, PT, RU, SV
JRC/Acquis | 32M | EU Law | BG, CZ, DE, EN, ES, FI, FR, HU, IT, NL, PT, RU, SV
Europarl | 33M | Parliamentary Debate | DE, EN, ES, FI, FR, IT, NL, PT, SV
OJEU | 84M | Governmental Affairs | DE, EN, ES, FI, FR, IT, NL, PT, SV
Sample text from each source:
Bible: Therefore was the name of it called Babel; because Jehovah did there confound the language of all the earth: and from thence did Jehovah scatter them abroad upon the face of all the earth.
Acquis: (24) In order to contribute to the conservation of octopus and in particular to protect the juveniles, it is necessary to establish, in 2006, a minimum size of octopus from the maritime waters under the sovereignty or jurisdiction of third countries and situated in the CECAF region pending the adoption of a regulation amending Regulation (EC) No 850/98.
Europarl: Mr President, the tsunami tragedy should be no less significant to the world’s leaders and to Europe than 11 September.
OJEU: 11. Trafficking in women for sexual exploitation. A4-0372/97. Resolution on the Communication from the Commission to the Council and the European Parliament on trafficking in women for the purpose of sexual exploitation (COM(96)0567 - C4-0638/96). The European Parliament,
Effectiveness & Corpus Size
English queries translated using Europarl Corpus sub-sampled from 1 to 100%.
Effectiveness by size (2)
FIRE Index Characteristics
Vocabulary size in the Indian languages (ILs) seems abnormally small
  Possibly a bug in my pre-processing or tokenization, perhaps related to Unicode (e.g., continuation or modification characters)
Tokenization for FIRE 2008
Difficult to interpret results with anomalous vocabulary
  Need failure analysis
Performance using words in ILs seems quite depressed
Hindi 5-gram run had good relative performance
  Difference vs. 4-grams much larger than typically seen
Relative Gains w/ Relevance Feedback
Query expansion using top 10 documents
  50 terms (words), 150 terms (4/5-grams), 400 terms (sk41)
Fairly effective: 20-40% gains
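A rough sketch of this kind of query expansion (pseudo-relevance feedback). The slide gives only the number of feedback documents and expansion terms, so the term weighting used here (term frequency in the feedback documents times log inverse document frequency) is an assumption, as are the names `expand_query`, `doc_freq`, and `num_docs`.

```python
import math
from collections import Counter


def expand_query(query_terms, ranked_docs, doc_freq, num_docs, top_docs=10, num_terms=50):
    """Add the highest-weighted terms from the top-ranked documents to the query.

    ranked_docs: documents (lists of indexing terms) from an initial retrieval run.
    doc_freq: term -> number of collection documents containing the term.
    """
    weights = Counter()
    for doc in ranked_docs[:top_docs]:
        for term, tf in Counter(doc).items():
            # Assumed weighting: tf * idf; terms found everywhere get weight 0.
            weights[term] += tf * math.log(num_docs / doc_freq[term])
    expansion = [term for term, _ in weights.most_common(num_terms)]
    # Keep the original terms first and drop duplicates while preserving order.
    return list(dict.fromkeys(list(query_terms) + expansion))
```

Following the slide, `num_terms` would be 50 for word runs, 150 for 4-gram or 5-gram runs, and 400 for the sk41 run.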
In Conclusion
Compared several forms of representing text
  In European languages n-grams obtain 20% gain over words
  Rule-based stemming good in Romance languages
  Morfessor segments, n-gram stems better than words, not as good as Snowball stemmer
N-gram gains
  Greatest in morphologically richer languages
  Lost when morphology ‘removed’ from language
FIRE
  N-grams and RF also effective in ILs
  Must resolve vocabulary issue
  Difficulty finding parallel text, but would like to investigate bilingual retrieval