mitchell-pitts
Agenda for today
• Questions Chapter 1
• Chapter 2: Term vocabulary & posting lists
• Chapter 2: Posting lists with positions
• Homework/lab assignment
Chapter 2 Overview
Preprocessing of documents
• choose the unit of indexing (granularity)
• tokenization (removing punctuation, splitting into words)
• stop list?
• normalization: case folding, stemming versus lemmatizing, ...
• extensions to postings lists
Tokens, types and terms
token: each separate word occurrence in the text
type: identical words belong to one type
(index) term: what is finally included in the index
An index term is an equivalence class of tokens and/or types
Tokens, types and terms
The Lord of the Rings
• Number of tokens? 5
• Number of types? 4
• Number of terms? 4? 2? 1?
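The counts on this slide can be reproduced with a minimal sketch. It assumes that "The" and "the" count as one type (which is what gives 4 types) and that a tiny, purely illustrative stop list of "the" and "of" is applied to get from 4 terms to 2; the names `tokenize` and `STOP_WORDS` are hypothetical.

```python
import re

# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"the", "of"}


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


title = "The Lord of the Rings"
tokens = tokenize(title)                 # every word occurrence
types = {t.lower() for t in tokens}      # "The" and "the" treated as one type
terms = types - STOP_WORDS               # what remains after the stop list

print(len(tokens), len(types), len(terms))  # prints: 5 4 2
```

Indexing the whole title as a single phrase would correspond to the third answer on the slide, 1 term.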
26-01-12
Equivalence classes
• Case folding
• Diacritics
• Stemming/lemmatisation
• Decompounding
• Synonym lists
• Variant spellings
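The first two items in this list can be sketched with the Python standard library. This is only one possible mapping rule, assuming that stripping all combining marks is acceptable for the language at hand; the function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(token):
    """Map a token to its equivalence class: case-fold, then strip diacritics."""
    folded = token.casefold()
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize("Résumé"))  # prints: resume
print(normalize("NAÏVE") == normalize("naive"))  # prints: True
```

The same function would have to be applied to both the documents and the queries, otherwise the equivalence classes do not line up.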
Equivalence classes
• Implicit: mapping rules
• Relational: query expansion
• Relational: double indexing
• Mapping should be done at:
– Indexing
– Querying
Words and word forms
• Inflection (D: verbuiging/vervoeging)
- changing a word to express person, case, aspect, ...
- for determiners, nouns, pronouns, adjectives: declension (D: verbuiging)
- for verbs: conjugation (D: vervoeging)
• Derivation (D: afleiding)
- formation of a new word from another word (e.g. by adding an affix (prefix or suffix) or changing the grammatical category)
Inflection examples
Determiners
E: the D: de, het G: der, des, dem, den, die, das
Adjectives
E: young D: jong, jonge G: junger, junge, junges, jungen
Nouns
E: man, men D: man, mannen G: Mann, Mannes, ...
Verbs
E: write / writes / wrote / written
D: schrijf / schrijft / schrijven / schreef / schreven / geschreven
G: schreibe / schreibst / schreibt / schreiben / schrieben / geschrieben
Derivation examples
to browse -> a browser
red -> to redden, reddish
Google -> to google
arm(s) -> to arm, to disarm -> disarmament, disarming
Stemming and lemmatizing
verb forms: inform, informs, informed, informing
derivations: information, informative, informal??
stem: inform
lemma: inform, information, informative, informal
verb forms: sing, sings, sang, sung, singing
derivations: singer, singers, song, songs
stem: sing, sang, sung, song
lemma: sing, singer, song
Discussion
Why is stemming used when lemmatizing is much more precise?
Lemmatizing is a more complex process; it needs:
- a vocabulary (problem: new words)
- morphological analysis (knowledge of inflection rules)
- syntactic analysis, parsing (noun or verb?)
Compound splitting
Marketingjargon -> marketing AND jargon
• Increased recall (more documents retrieved)
• Decreased precision
• Must be applied to both query and index!
• But what to do with the query marketing jargon ?
• And with spreekwoord appel boom (D: proverb apple tree) ?
Efficient merging of postings
For X AND Y, we have to intersect 2 lists
Most documents will contain only one of the two terms
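The basic merge can be sketched as follows, assuming each postings list is a Python list of document IDs sorted in ascending order. The function name `intersect` and the sample lists are illustrative only.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with a linear two-pointer merge."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller document ID
        else:
            j += 1
    return answer


print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # prints: [2, 31]
```

Each step advances at least one pointer, so the cost is linear in the combined length of the two lists, even when most entries match only one of the terms.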
Skip pointers
• Makes intersection of 2 lists more efficient
– think of millions of list items
• How many skip pointers and where?
• Trade-off:
– More pointers: often useful, but small skips.
– Fewer pointers …
• Heuristic: for a list of n items, skip distance √n, evenly distributed
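A minimal sketch of intersection with skip pointers, following the √n heuristic. It assumes sorted in-memory lists and simulates the pointers by index arithmetic (a pointer at every position that is a multiple of the skip distance) rather than storing them; the names `advance` and `intersect_with_skips` are hypothetical.

```python
import math


def advance(postings, pos, skip, target):
    """Return the next position to examine, given postings[pos] < target.

    A skip is only followed when it lands on a value <= target,
    so no possible match is jumped over.
    """
    if pos % skip == 0 and pos + skip < len(postings) and postings[pos + skip] <= target:
        while pos + skip < len(postings) and postings[pos + skip] <= target:
            pos += skip
        return pos
    return pos + 1


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using simulated skip pointers."""
    skip1 = max(1, math.isqrt(len(p1)))  # skip distance ~ sqrt(n)
    skip2 = max(1, math.isqrt(len(p2)))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, skip1, p2[j])
        else:
            j = advance(p2, j, skip2, p1[i])
    return answer


print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))  # prints: [2, 8]
```

The result is the same as for the plain merge; only the number of comparisons changes, and the gain depends on how often a skip can actually be taken.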
Skip pointers: useful?
Yes, certainly in the past
With very fast CPUs less important
Especially in a rather static index
If a list keeps changing less effective
Extensions of the simple term index
To support phrase queries
• “information retrieval”
• “retrieval of information”
Different approaches
• biword indexes
• phrase indexes
• positional indexes
• combinations
Biword and phrase indexes
• Holding terms together in the index
• Simple biword index:
– retrieval of, of information
• Sophisticated: POS tagger selects nouns
– N X* N (noun, any other words, noun): retrieval of this information
• Phrase index: includes variable lengths of word sequences
– terms of 1 and 2 words both included
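The simple biword variant can be sketched as below, assuming whitespace tokenization and lowercasing. The three sample documents and the function names are made up for illustration.

```python
from collections import defaultdict


def build_biword_index(docs):
    """Index every pair of adjacent words as a single term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for first, second in zip(words, words[1:]):
            index[(first, second)].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Return documents containing every biword of the phrase."""
    words = phrase.lower().split()
    biwords = list(zip(words, words[1:]))
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result


docs = {
    1: "retrieval of information",
    2: "information retrieval systems",
    3: "retrieval of images and storage of information",
}
index = build_biword_index(docs)
print(sorted(phrase_candidates(index, "retrieval of information")))  # prints: [1, 3]
```

Document 3 contains both "retrieval of" and "of information" but not the full phrase, so for phrases longer than two words the result is a set of candidates that may include false positives.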
Positional index
Add in the postings lists for each doc the list of positions of the term
- for phrase queries
- for proximity search
Example
[information, 4] : [1:<4,22, 35>, 2:<5,17, 30>, …]
[retrieval, 2] : [1:<5,20>, 2:<18,31>]
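A two-word phrase query over such an index can be sketched as follows. The data reuses only the two documents shown in the example; the nested-dictionary layout, the function name `phrase_positions`, and the offset parameter `k` are assumptions made for illustration.

```python
# term -> {doc_id: sorted list of positions}, taken from the example above
# (only the two documents shown; the elided postings are left out).
index = {
    "information": {1: [4, 22, 35], 2: [5, 17, 30]},
    "retrieval": {1: [5, 20], 2: [18, 31]},
}


def phrase_positions(index, first, second, k=1):
    """Find docs where `second` occurs exactly k positions after `first`.

    Returns {doc_id: [positions of `first` that start a match]}.
    """
    hits = {}
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = set(index.get(second, {}).get(doc_id, []))
        starts = [p for p in first_positions if p + k in second_positions]
        if starts:
            hits[doc_id] = starts
    return hits


print(phrase_positions(index, "information", "retrieval"))
# prints: {1: [4], 2: [17, 30]}
print(phrase_positions(index, "retrieval", "information", k=2))
# prints: {1: [20]}
```

The second call corresponds to a pattern like "retrieval of information", with one word in between. Replacing the exact offset by a range check would turn this into a proximity search.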
Combination schemes
Often queried combinations: phrase index
• names of persons and organizations
• esp. combinations of common terms (!)
• find out from query log
For other phrases a positional index
Williams et al.: next word index added
H.E. Williams, J. Zobel, and D. Bahle (2004) Fast Phrase Querying With Combined Indexes (ACM Digital Library):
Phrase querying with a combination of three approaches (next word index, phrase index and inverted file) ... is more than 60% faster on average than using an inverted index alone ... requires structures that total only 20% of the size of the collection.