Information retrieval


TOKENIZATION
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. A tokenizer relies on simple heuristics. Example:
- A continuous run of alphabetic characters forms one token
- Tokens are separated by white space or punctuation
- Punctuation and white space may or may not be included in the resulting list of tokens
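The heuristics above can be sketched in a few lines of Python. This is only a minimal illustration, assuming plain ASCII letters; the function name `tokenize` and the `keep_punctuation` flag are hypothetical and not part of the original slides.

```python
import re

def tokenize(text, keep_punctuation=False):
    """Split text into tokens using the simple heuristics described above."""
    if keep_punctuation:
        # A run of letters is one token; every other non-space character
        # (punctuation, digits) becomes a token of its own.
        return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)
    # Only runs of letters are kept; white space and punctuation act as separators.
    return re.findall(r"[A-Za-z]+", text)

print(tokenize("Hello, world! It's fine."))
print(tokenize("Hello, world!", keep_punctuation=True))
```

Note that the apostrophe in "It's" splits the word into two tokens, which shows how crude such a heuristic is on real text.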

STOP WORD REMOVAL
Stop words are words filtered out prior to or after processing of natural language data. There is no definitive stop word list.

Features
- Extremely common words
- Contribute little to selecting the documents that match a query

Most common examples
- Function words such as the, is, at, which, on
- The most common words, including lexical words

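Strategy – Sort the terms by collection frequency (the total number of times each term appears in the collection) and take the most frequent terms as the stop list.

Advantage – Using a stop list greatly reduces the number of postings a system has to store.

Exception – Phrase search (“Flight to London”), where dropping the stop word "to" changes the meaning of the query.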
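The strategy can be sketched as follows. This is a minimal illustration that assumes documents are plain strings split on white space; the names `build_stop_list` and `remove_stop_words` are hypothetical.

```python
from collections import Counter

def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    counts = Counter()
    for doc in documents:
        # Collection frequency counts every occurrence of a term in every document.
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(size)}

def remove_stop_words(tokens, stop_list):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_list]

docs = ["the cat sat on the mat", "the dog is on the log", "the bird is here"]
stop_list = build_stop_list(docs, 3)
print(remove_stop_words("the cat is on the mat".split(), stop_list))
```

In practice a list produced this way is usually reviewed by hand before use, since frequent terms can still carry meaning in a given domain.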

LEMMATISATION AND STEMMING

The goal is to reduce inflectional and derivationally related forms of a word to a common base form. Example:

am, are, is => be
car, cars, car’s, cars’ => car

The result of this mapping applied to a text would be:

The boy’s cars are different colours => the boy car be differ colour

DIFFERENCE
Stemming usually refers to a process that chops off the ends of words.
- Often includes the removal of derivational affixes

Lemmatisation refers to doing things properly, with the use of a vocabulary and morphological analysis of words.
- Aims to remove inflectional endings only
- Returns the base or dictionary form of a word, called the lemma

Example for the word saw:
- Stemming => s
- Lemmatisation => see or saw, depending on whether the word is used as a verb or a noun

PORTER’S ALGORITHM
- Consists of 5 phases of word reductions, applied sequentially
- Within each phase there are various conventions for selecting rules
- Measure of a word – loosely checks the number of syllables to see whether a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix rather than as part of the stem of the word

Example rule:
(m>1) EMENT ->
would map replacement to replac, but not cement to c
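The measure condition in this rule can be sketched in Python. The sketch follows Porter's definition of the measure m, where a word has the form [C](VC)^m[V] and y counts as a vowel only after a consonant. It applies this single rule only and is not a full Porter stemmer; the function names are hypothetical.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    kinds = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if prev == "v" and cur == "c")

def apply_ement_rule(word):
    """Apply only the rule (m>1) EMENT -> (nothing)."""
    suffix = "ement"
    if word.endswith(suffix):
        stem = word[:-len(suffix)]
        if measure(stem) > 1:
            return stem
    return word

print(apply_ement_rule("replacement"))  # replac
print(apply_ement_rule("cement"))       # cement
```

The stem replac has measure 2, so the suffix is removed; the stem c has measure 0, so cement is left unchanged.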

The Porter stemmer stems all of the following words to oper: operate, operating, operates, operation, operative, operatives, operational.

Because operate in its various forms is a common verb, we lose considerable precision on queries such as:
- operational and research
- operating and system
- operative and dentistry

MORE ALGORITHMS

Lookup algorithm
- Looks up the inflected form in a lookup table
- Simple and fast, with easy exception handling
- New or unfamiliar words are not handled

The production technique
- The lookup table is generally produced semi-automatically
- Ex. run => running, runs, runned, runly
- The last two forms are valid constructions but unlikely
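A lookup stemmer and the table production step can be sketched as follows. The generation rule (four fixed suffixes, with doubling of a final consonant before a vowel-initial suffix) is an assumption chosen to reproduce the run example; all names here are hypothetical.

```python
VOWELS = "aeiou"

def generate_forms(root):
    """Naively generate candidate inflected forms of a root word."""
    forms = []
    for suffix in ("ing", "s", "ed", "ly"):
        stem = root
        # Double the final consonant of a consonant-vowel-consonant ending
        # when the suffix starts with a vowel (run + ing -> running).
        if (suffix[0] in VOWELS and len(root) >= 3
                and root[-1] not in VOWELS
                and root[-2] in VOWELS
                and root[-3] not in VOWELS):
            stem = root + root[-1]
        forms.append(stem + suffix)
    return forms

def build_table(roots):
    """Map every generated form back to its root."""
    return {form: root for root in roots for form in generate_forms(root)}

def lookup_stem(word, table):
    """Return the root from the table, or the word itself if it is unknown."""
    return table.get(word.lower(), word)

table = build_table(["run"])
print(generate_forms("run"))         # ['running', 'runs', 'runned', 'runly']
print(lookup_stem("Running", table))  # run
print(lookup_stem("jogging", table))  # jogging (not in the table)
```

The last call shows the main weakness named above: a word missing from the table is returned unchanged.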

Suffix-stripping algorithm
A set of rules provides the path for the algorithm:
- if the word ends in 'ed', remove the 'ed'
- if the word ends in 'ing', remove the 'ing'
- if the word ends in 'ly', remove the 'ly'
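These three rules translate almost directly into code. The sketch below assumes the rules are tried in the order listed and that only the first matching rule is applied; `strip_suffix` is a hypothetical name.

```python
SUFFIX_RULES = ("ed", "ing", "ly")

def strip_suffix(word):
    """Remove the first matching suffix from the rule list, if any."""
    for suffix in SUFFIX_RULES:
        # Never strip a suffix that makes up the whole word.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

for w in ("walked", "walking", "quickly", "running", "cat"):
    print(w, "=>", strip_suffix(w))
```

The result for running is runn, not run, because no rule undoes the doubled consonant. Such rules need no lookup table, but they handle exceptions poorly.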

NAMED ENTITY RECOGNITION

Subtask of information extraction. Seeks to locate and classify elements in text into pre-defined categories such as names of persons, organizations, locations, quantities, and monetary values.

Takes an unannotated block of text, like

Jim bought 300 shares of Acme Corp. in 2006

and produces an annotated block of text that highlights the names of entities:

[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time
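The annotation format can be illustrated with a toy rule-based tagger. Everything in this sketch is an assumption made for illustration: the patterns, the small list of person names, and the function name `annotate`. Practical named entity recognisers typically rely on statistical models or far richer rules.

```python
import re

# Hypothetical hand-written patterns, one per entity category.
PATTERNS = [
    ("Organization", re.compile(r"\b[A-Z][a-z]+ (?:Corp|Inc|Ltd)\.")),
    ("Time", re.compile(r"\b(?:19|20)\d{2}\b")),
    ("Person", re.compile(r"\b(?:Jim|Mary|John)\b")),
]

def annotate(text):
    """Wrap every recognised entity as [entity]Category."""
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    pieces, pos = [], 0
    for start, end, label in spans:
        if start < pos:
            continue  # skip a span that overlaps one already emitted
        pieces.append(text[pos:start])
        pieces.append(f"[{text[start:end]}]{label}")
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)

print(annotate("Jim bought 300 shares of Acme Corp. in 2006"))
# [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time
```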

PARTS OF SPEECH
A part of speech is a linguistic category of words.
- Noun – any abstract or concrete entity: a person, place, thing, or idea
- Pronoun – a substitute for a noun or noun phrase
- Adjective – a qualifier of a noun
- Verb – an action, occurrence, or state of being
- Adverb – a qualifier of an adjective, verb, or other adverb
- Preposition – an establisher of relation and syntactic context
- Conjunction – a syntactic connector
- Interjection – an emotional greeting or exclamation

PARSING
- Process of analysing a string of symbols
- Analysis of a sentence by a computer into its constituents
- Results in a parse tree showing the syntactic relation of the constituents to each other
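As a small illustration of a parse tree, the sketch below parses sentences of one fixed shape. The grammar (S -> NP VP, NP -> Det N, VP -> V NP) and the lexicon are toy assumptions, not taken from the slides, and the tree is represented as nested tuples.

```python
# Hypothetical toy lexicon mapping words to part-of-speech tags.
LEXICON = {"the": "Det", "a": "Det", "boy": "N", "car": "N", "sees": "V", "buys": "V"}

def parse_sentence(words):
    """Return a parse tree for Det N V Det N sentences, or None if parsing fails."""
    tags = [LEXICON.get(w.lower()) for w in words]

    def parse_np(i):
        # NP -> Det N
        if i + 1 < len(words) and tags[i] == "Det" and tags[i + 1] == "N":
            return ("NP", ("Det", words[i]), ("N", words[i + 1])), i + 2
        return None, i

    def parse_vp(i):
        # VP -> V NP
        if i < len(words) and tags[i] == "V":
            obj, j = parse_np(i + 1)
            if obj is not None:
                return ("VP", ("V", words[i]), obj), j
        return None, i

    # S -> NP VP, and the whole input must be consumed.
    subject, i = parse_np(0)
    if subject is None:
        return None
    predicate, j = parse_vp(i)
    if predicate is None or j != len(words):
        return None
    return ("S", subject, predicate)

print(parse_sentence("the boy sees a car".split()))
```

Each tuple starts with the constituent label, followed by its children, so the nesting mirrors the syntactic relations between the constituents.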
