ujjawal
INFORMATION RETRIEVAL
TOKENIZATION
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
A tokenizer relies on simple heuristics. Examples
  A continuous run of alphabetic characters forms one token
  Tokens are separated by white space or punctuation
  Punctuation and white space may or may not be included in the resulting list of tokens
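The heuristics above can be sketched in a few lines of Python. This is only an illustration: the function name `tokenize`, the `keep_punctuation` flag, and the restriction to ASCII letters and digits are assumptions made for the example, not part of the slides.

```python
import re

def tokenize(text, keep_punctuation=False):
    """Split text into tokens using the simple heuristics above.

    Assumption: a token is a run of ASCII letters or digits; every other
    non-space character is treated as a one-character punctuation token.
    """
    if keep_punctuation:
        pattern = r"[A-Za-z0-9]+|[^A-Za-z0-9\s]"
    else:
        pattern = r"[A-Za-z0-9]+"
    return re.findall(pattern, text)

print(tokenize("Hello, world!"))
print(tokenize("Hello, world!", keep_punctuation=True))
```

The flag mirrors the last heuristic: whether punctuation appears in the token list is a design choice of the tokenizer, not a fixed rule.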
STOP WORD REMOVAL
Stop words are words filtered out prior to or after the processing of natural language data.
There is no definitive stop word list.
Features
  Extremely common words
  Contribute little to selecting documents that match a user's need
Most common examples
  Function words such as the, is, at, which, on
  The most common words, including lexical words
Strategy – Sort the terms by collection frequency and take the most frequent terms as the stop list
Advantage – Using a stop list greatly reduces the number of postings a system has to store
Exception – Phrase search (“Flight to London”), where dropping "to" is likely to lose the meaning of the query
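A minimal sketch of the frequency-based strategy, assuming whitespace tokenization and lower-casing. The names `build_stop_list` and `remove_stop_words` and the tiny document collection are hypothetical; in practice the most frequent terms are often hand-filtered before use.

```python
from collections import Counter

def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(size)}

def remove_stop_words(tokens, stop_list):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_list]

# Hypothetical toy collection, for illustration only
docs = ["the cat sat on the mat", "the dog is on the log", "the end"]
stop_list = build_stop_list(docs, size=2)
print(remove_stop_words("the cat sat on the mat".split(), stop_list))
```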
LEMMATISATION AND STEMMING
The goal is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
Example
  am, are, is => be
  car, cars, car’s, cars’ => car
The result of this mapping applied to a text would be
  The boy’s cars are different colours => the boy car be differ colour
DIFFERENCE
Stemming usually refers to a crude heuristic process that chops off the ends of words.
  Often includes the removal of derivational affixes
Lemmatisation refers to doing things properly, with the use of a vocabulary and morphological analysis of words.
  Aims to remove only the inflectional endings
  Returns the base or dictionary form of a word, called the lemma
Example
  Word => saw
  Stemming => s
  Lemmatisation => see or saw, depending on whether the token is used as a verb or a noun
PORTER’S ALGORITHM
Consists of 5 phases of word reductions, applied sequentially.
Within each phase there are various conventions for selecting rules.
Measure of a word – loosely checks the number of syllables to see whether a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix rather than as part of the stem of the word
  Rule: (m>1) EMENT -> (empty suffix)
  Would map replacement to replac, but not cement to c
The Porter stemmer stems all of the following words to oper: operate, operating, operates, operation, operative, operatives, operational
We will lose considerable precision on queries such as
  Operational and research
  Operating and system
  Operative and dentistry
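The measure m and the EMENT rule can be illustrated as follows. This is a sketch of a single rule, not the full five-phase stemmer. It assumes lower-case input and uses Porter's definition of the measure: a stem has the form [C](VC)^m[V], where C and V are runs of consonants and vowels, and 'y' counts as a vowel only when it follows a consonant. The function names are made up for the example.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a vowel only when the preceding letter is a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC sequences in the stem, i.e. m in [C](VC)^m[V]."""
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    # Each vowel-to-consonant transition closes one VC sequence
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")

def apply_ement_rule(word):
    """Apply only the rule (m>1) EMENT -> (empty suffix)."""
    suffix = "ement"
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 1:
            return stem
    return word

print(apply_ement_rule("replacement"))  # stem "replac" has m = 2
print(apply_ement_rule("cement"))       # stem "c" has m = 0, so unchanged
```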
MORE ALGORITHMS
Lookup algorithm
  Looks for the inflected form in a lookup table
  Simple, fast, and handles exceptions easily
  New or unfamiliar words are not handled
The production technique
  The lookup table is generally produced semi-automatically
  Ex. run => running, runs, runned, runly
  The last two forms are valid constructions but unlikely
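A lookup stemmer reduces to a dictionary from inflected form to root. The sketch below assumes the forms for each root are already available (here the hand-written "run" example from above); `build_lookup_table` and `lookup_stem` are hypothetical names.

```python
def build_lookup_table(forms_by_root):
    """Invert {root: [inflected forms]} into {inflected form: root}."""
    table = {}
    for root, forms in forms_by_root.items():
        table[root] = root
        for form in forms:
            table[form] = root
    return table

def lookup_stem(word, table):
    """Return the root for a known form; unknown words come back unchanged."""
    return table.get(word.lower(), word)

table = build_lookup_table({"run": ["running", "runs", "runned", "runly"]})
print(lookup_stem("running", table))
print(lookup_stem("jogging", table))  # not in the table, so not handled
```

The second call shows the limitation noted above: a word missing from the table is simply passed through.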
Suffix-stripping algorithm
  A set of rules provides a path for the algorithm to find the root form
    if the word ends in 'ed', remove the 'ed'
    if the word ends in 'ing', remove the 'ing'
    if the word ends in 'ly', remove the 'ly'
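The three rules above translate almost directly into code. This is a deliberately naive sketch: the name `strip_suffix` and the guard that keeps at least one character of the word are assumptions added for the example.

```python
SUFFIX_RULES = ("ed", "ing", "ly")

def strip_suffix(word):
    """Remove the first matching suffix, if any, from the word."""
    for suffix in SUFFIX_RULES:
        # Guard (an assumption): never strip the entire word
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

for w in ("walked", "jumping", "quickly", "running"):
    print(w, "=>", strip_suffix(w))
```

Note that "running" becomes "runn" rather than "run", which shows why practical stemmers such as Porter's add conditions and recoding steps on top of plain suffix removal.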
NAMED ENTITY RECOGNITION
A subtask of information extraction.
Seeks to locate and classify elements in text into pre-defined categories such as names of persons, organizations, locations, quantities, and monetary values.
Takes an unannotated block of text, like
  Jim bought 300 shares of Acme Corp. in 2006
and produces an annotated block of text that highlights the names of entities
  [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time
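To make the input/output relationship concrete, here is a toy annotator that reproduces the example above. It is not how practical NER systems work (those typically rely on statistical or neural models); the hand-made gazetteer, the four-digit-year pattern, and the name `annotate` are all assumptions for illustration.

```python
import re

# Hypothetical hand-made gazetteer; a real system would not hard-code names
GAZETTEER = {
    "Jim": "Person",
    "Acme Corp.": "Organization",
}
# Assumption: a four-digit number starting with 19 or 20 is a year
YEAR_REGEX = r"\b(?:19|20)\d{2}\b"

def annotate(text):
    """Wrap known names and years in [entity]Category markers."""
    names = sorted(GAZETTEER, key=len, reverse=True)  # longest match first
    name_regex = "|".join(re.escape(name) for name in names)
    pattern = re.compile(f"(?P<name>{name_regex})|(?P<year>{YEAR_REGEX})")

    def tag(match):
        name = match.group("name")
        if name is not None:
            return f"[{name}]{GAZETTEER[name]}"
        return f"[{match.group('year')}]Time"

    return pattern.sub(tag, text)

print(annotate("Jim bought 300 shares of Acme Corp. in 2006"))
```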
PARTS OF SPEECH
A part of speech is a linguistic category of words.
  Noun – any abstract or concrete entity: a person, place, thing, or idea
  Pronoun – a substitute for a noun or noun phrase
  Adjective – a qualifier of a noun
  Verb – an action, occurrence, or state of being
  Adverb – a qualifier of an adjective, verb, or other adverb
  Preposition – an establisher of relation and syntactic context
  Conjunction – a syntactic connector
  Interjection – an emotional greeting or exclamation
PARSING
The process of analysing a string of symbols.
  Analysis of a sentence by a computer into its constituents
  Results in a parse tree showing their syntactic relation to each other
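As a small illustration of how a sentence becomes a parse tree, the sketch below implements a recursive-descent parser for a toy grammar. The grammar (S -> NP VP, NP -> Det N | N, VP -> V NP), the lexicon, and the tuple-based tree format are assumptions chosen for the example; real parsers handle far larger grammars and ambiguity.

```python
# Hypothetical toy lexicon mapping words to part-of-speech tags
LEXICON = {
    "the": "Det", "a": "Det",
    "boy": "N", "car": "N",
    "sees": "V", "drives": "V",
}

def parse_np(tokens, i):
    """NP -> Det N | N. Returns (subtree, next position)."""
    children = []
    if i < len(tokens) and LEXICON.get(tokens[i]) == "Det":
        children.append(("Det", tokens[i]))
        i += 1
    if i < len(tokens) and LEXICON.get(tokens[i]) == "N":
        children.append(("N", tokens[i]))
        return ("NP", children), i + 1
    raise ValueError(f"expected a noun at position {i}")

def parse_vp(tokens, i):
    """VP -> V NP. Returns (subtree, next position)."""
    if i < len(tokens) and LEXICON.get(tokens[i]) == "V":
        verb = ("V", tokens[i])
        noun_phrase, i = parse_np(tokens, i + 1)
        return ("VP", [verb, noun_phrase]), i
    raise ValueError(f"expected a verb at position {i}")

def parse_sentence(text):
    """S -> NP VP. Returns the parse tree as nested (label, children) tuples."""
    tokens = text.lower().split()
    noun_phrase, i = parse_np(tokens, 0)
    verb_phrase, i = parse_vp(tokens, i)
    if i != len(tokens):
        raise ValueError("unexpected trailing words")
    return ("S", [noun_phrase, verb_phrase])

print(parse_sentence("The boy drives a car"))
```

The nested tuples are the parse tree: each node carries a constituent label and its children, which makes the syntactic relation between the words explicit.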