
Information retrieval concept, practice and challenge


Page 1: Information retrieval   concept, practice and challenge

Information Retrieval
Concept, Practice and Challenge
GUEST LECTURE BY GAN KENG HOON

14 APRIL 2016

SCHOOL OF COMPUTER SCIENCES, UNIVERSITI SAINS MALAYSIA.


Page 2: Information retrieval   concept, practice and challenge

Outline

Concept
1. Conceptual Model
2. Retrieval Unit
3. Document Representation
4. Information Needs
5. Indexing
6. Retrieval Functions
7. Evaluation

Practice
1. Search Engine

Challenge
1. Cross Lingual IR
2. Big Data
3. Personalization
4. Domain Specific IR
5. ……


Page 3: Information retrieval   concept, practice and challenge

Concept


Page 4: Information retrieval   concept, practice and challenge

A Conceptual Model for IR

[Diagram]
Components: Documents, Document Representation, Information Needs, Query, Retrieved Documents
Processes: Indexing, Formulation, Retrieval Function, Relevance Feedback


Page 5: Information retrieval   concept, practice and challenge

Definitions of IR

“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). (Manning, 2008)


Page 6: Information retrieval   concept, practice and challenge

Document/Retrieval Unit
◦ Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, etc.
◦ A retrieval unit can be
  ◦ Part of a document, e.g. a paragraph, a slide, a page, etc.
  ◦ In different structural forms, e.g. HTML, XML, plain text, etc.
  ◦ Of different sizes/lengths.


Page 7: Information retrieval   concept, practice and challenge

Document Representation

Full Text Representation
◦ Keep everything. Complete.
◦ Requires huge resources. Too much may not be good.

Reduced (partial) Content Representation
◦ Remove unimportant content, e.g. stopwords.
◦ Standardize to reduce overlapping content, e.g. stemming.
◦ Retain only important content, e.g. noun phrases, headers, etc.


Page 8: Information retrieval   concept, practice and challenge

Document Representation
Think of a representation as a way of storing the document.

Bag of Words Model
Store a document as the bag (multiset) of its words, disregarding grammar and even word order.

Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"

From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }
The list has 8 distinct words. Each document is then stored as a vector of word counts in that order:

Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
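The bag-of-words construction can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the helper names (`tokenize`, `build_vocabulary`, `to_vector`) are hypothetical, and tokenization is assumed to be plain lowercasing plus whitespace splitting.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace, enough for this toy example.
    return text.lower().split()


def build_vocabulary(docs):
    # Keep words in order of first appearance so the vectors line up
    # with the word list shown on the slide.
    vocab = []
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)
    return vocab


def to_vector(doc, vocab):
    # Count how often each vocabulary word occurs in the document.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]


docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Word order inside each document is lost: only the counts per vocabulary word survive, which is exactly what the model intends.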


Page 9: Information retrieval   concept, practice and challenge

Information Needs
The things you want Google to answer for you are information needs.

Examples from my search history
◦ Query: weka text classification
◦ Information Need: I want to find a tutorial about using Weka for text classification.

◦ Query: Dell i7 laptop
◦ Information Need: I want to find information about any Dell laptop that runs on an Intel i7 processor. Actually, I want to buy one, so an online store would be relevant.


Page 10: Information retrieval   concept, practice and challenge

Information Needs
Normally, you are required to formulate your information need as a few keywords, known as a query.

Simple Query
◦ A few keywords or more.

Boolean Query
◦ ‘neural network AND speech recognition’

Special Query
◦ 400 myr in usd


Page 11: Information retrieval   concept, practice and challenge

Retrieved Documents
From the original collection, a subset of documents is obtained.

What determines which documents are returned?

Simple Term Matching Approach
1. Compare the terms in a document and the query.
2. Compute the “similarity” between each document in the collection and the query, based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. The output is a ranked list displayed to the user; the top documents are the most relevant as judged by the system.
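The four steps above can be sketched as follows. This is only an illustration under stated assumptions: "similarity" is taken to be the number of distinct terms a document shares with the query, which is one of many possible choices, and the function names are hypothetical. The three short documents are the ones used later in the vector space model slides.

```python
def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    return text.lower().split()


def overlap_score(query, document):
    # Steps 1 and 2: similarity = number of distinct terms in common.
    return len(set(tokenize(query)) & set(tokenize(document)))


def rank(query, collection):
    scored = [(overlap_score(query, text), doc_id) for doc_id, text in collection.items()]
    # Keep only documents that share at least one term with the query.
    matches = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Step 3: sort by decreasing similarity (ties broken by document id).
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    # Step 4: the ranked list returned to the user.
    return [doc_id for _, doc_id in matches]


collection = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
ranking = rank("straits times", collection)
print(ranking)
```

D1 shares both query terms and comes first; D2 and D3 each share one term, so a finer-grained weighting is needed to separate them.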


Page 12: Information retrieval   concept, practice and challenge

Indexing
Convert documents into a representation or data structure that improves the efficiency of retrieval.
Generate a set of useful terms, called indexes.

Why?
◦ A wide variety of words is used in texts, but not all are important.
◦ Among the important words, some are more contextually relevant.

Some basic processes involved
◦ Tokenization
◦ Stop Words Removal
◦ Stemming
◦ Phrases
◦ Inverted File


Page 13: Information retrieval   concept, practice and challenge

Indexing (Tokenization)
Convert a sequence of characters into a sequence of tokens with some basic meaning.

“The cat chases the mouse.”
the | cat | chases | the | mouse

“Bigcorp's 2007 bi-annual report showed profits rose 10%.”
bigcorp | 2007 | biannual | report | showed | profits | rose | 10%


Page 14: Information retrieval   concept, practice and challenge

Indexing (Tokenization)
A token can be a single term or multiple terms.

“Samsung Galaxy S7 Edge, redefines what a phone can do.”

samsung | galaxy | s7 | edge | redefines | what | a | phone | can | do

or

samsung galaxy s7 edge | redefines | what | a | ….


Page 15: Information retrieval   concept, practice and challenge

Indexing (Tokenization)
Common Issues
1. Capitalized words can have different meanings from lowercase words
◦ Bush fires the officer. Query: Bush fire
◦ The bush fire lasted for 3 days. Query: bush fire

2. Apostrophes can be a part of a word, a part of a possessive, or just a mistake
◦ rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's


Page 16: Information retrieval   concept, practice and challenge

Indexing (Tokenization)

3. Numbers can be important, including decimals
◦ nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358

4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
◦ I.B.M., Ph.D., cs.umass.edu, F.E.A.R.

Note: tokenizing steps for queries must be identical to steps for documents
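One way to honour the note above is to route both documents and queries through a single tokenizer function. The sketch below is an assumption-laden illustration, not the lecture's tokenizer: the regular expression (`TOKEN_PATTERN`, a hypothetical name) lowercases text and keeps periods and apostrophes only when they sit between letters or digits, which handles a few of the cases listed but by no means all of them.

```python
import re

# Assumption: a token is a run of letters/digits, optionally continued by
# ".xxx" or "'xxx" pieces, so "6.5", "cs.umass.edu" and "can't" stay whole.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[.'][a-z0-9]+)*")


def tokenize(text):
    # The same function is applied to documents and to queries.
    return TOKEN_PATTERN.findall(text.lower())


document_tokens = tokenize("QuickTime 6.5 Pro")
query_tokens = tokenize("quicktime 6.5 pro")
print(document_tokens, query_tokens)
print(tokenize("The cat chases the mouse."))
```

Because lowercasing is applied on both sides, the query still matches the document. The price is that the "Bush fire" versus "bush fire" distinction from issue 1 is lost.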


Page 17: Information retrieval   concept, practice and challenge

Indexing (Stopping)
Top 50 Words from AP89 News Collection

Recall: indexes should be useful terms that link to a document. Are the terms in the figure on the right useful?


Page 18: Information retrieval   concept, practice and challenge

Indexing (Stopping)
A stopword list can be created from high-frequency words or based on a standard list.

Lists are customized for applications, domains, and even parts of documents
◦ e.g., “click” is a good stopword for anchor text

The best policy is to index all words in documents and make decisions about which words to use at query time.
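As a rough sketch of the frequency-based option, a stopword list can be derived by counting words over the collection and taking the top k. The function names and the value of k below are illustrative assumptions; the two toy documents are the bag-of-words examples from earlier in the lecture. In line with the policy above, the filter is applied to the query, not to the index.

```python
from collections import Counter


def top_k_words(documents, k):
    # Count every word over the whole collection and keep the k most frequent.
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return {word for word, _ in counts.most_common(k)}


def remove_stopwords(tokens, stopwords):
    # Applied at query time; the index itself still contains every word.
    return [token for token in tokens if token not in stopwords]


documents = ["the cat sat on the hat", "the dog ate the cat and the hat"]
stopwords = top_k_words(documents, k=1)
query_terms = remove_stopwords("the cat sat on the hat".split(), stopwords)
print(stopwords, query_terms)
```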


Page 19: Information retrieval   concept, practice and challenge

Indexing (Stemming)
Many morphological variations of words
◦ inflectional (plurals, tenses)
◦ derivational (making verbs nouns etc.)

In most cases, these have the same or very similar meanings

Stemmers attempt to reduce morphological variations of words to a common stem
◦ usually involves removing suffixes

Can be done at indexing time or as part of query processing (like stopwords)


Page 20: Information retrieval   concept, practice and challenge

Indexing (Stemming)
Porter Stemmer
◦ Algorithmic stemmer used in IR experiments since the 70s
◦ Consists of a series of rules designed to remove the longest possible suffix at each step
◦ Produces stems, not words
◦ Example: Step 1 (right figure)
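The figure for Step 1 is not reproduced in this transcript. As a stand-in, the sketch below implements step 1a as defined in Porter's original 1980 algorithm (SSES → SS, IES → I, SS → SS, S → nothing); the variant shown in the lecture figure may differ in detail, and the function name is hypothetical. The rules are tried longest suffix first, which is how "the longest possible suffix" is selected.

```python
# Step 1a of the original Porter stemmer: (suffix, replacement) pairs,
# ordered so that the longest matching suffix is tried first.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def porter_step_1a(word):
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            # Apply only the first (longest) matching rule, then stop.
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["caresses", "ponies", "caress", "cats"]:
    print(word, "->", porter_step_1a(word))
```

The output for "ponies" is "poni", which illustrates the point that the stemmer produces stems, not dictionary words.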


Page 21: Information retrieval   concept, practice and challenge

Indexing (Stemming)
Comparison between two stemmers.


Page 22: Information retrieval   concept, practice and challenge

Indexing (Phrases)
Recall that meaningful tokens make better indexes, e.g. phrases.
Text processing issue: how are phrases recognized?
Three possible approaches:
◦ Identify syntactic phrases using a part-of-speech (POS) tagger
◦ Use word n-grams
◦ Store word positions in indexes and use proximity operators in queries
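The second approach, word n-grams, is simple enough to sketch directly. The function name is an assumption made for illustration; every run of n consecutive tokens becomes a candidate phrase, with no linguistic filtering at all.

```python
def word_ngrams(tokens, n):
    # Slide a window of n tokens over the sequence; each window is one n-gram.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat chases the mouse".split()
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Many of the resulting n-grams (for example "chases the") are not meaningful phrases, so in practice they are usually filtered, for instance by frequency.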


Page 23: Information retrieval   concept, practice and challenge

Indexing (Phrases)
POS taggers use statistical models of text to predict syntactic tags of words
◦ Example tags:
  ◦ NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), and MD (modal auxiliary, e.g., “can”, “will”).

Phrases can then be defined as simple noun groups, for example


Page 24: Information retrieval   concept, practice and challenge

Indexing (Phrases)
POS Tagging Example


Page 25: Information retrieval   concept, practice and challenge

Indexing (Phrases)
Example Noun Phrases

* Other methods, such as word n-grams, can also be used.


Page 26: Information retrieval   concept, practice and challenge

Indexing (Inverted Index)
Recall that indexes are designed to support search.
Each index term is associated with an inverted list
◦ Contains lists of documents, or lists of word occurrences in documents, and other information.
◦ Each entry is called a posting.
◦ The part of the posting that refers to a specific document or location is called a pointer
◦ Each document in the collection is given a unique number
◦ Lists are usually document-ordered (sorted by document number)
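A minimal sketch of such an index, assuming whitespace tokenization and a posting of the form (document number, count). The function name and the posting layout are illustrative choices, and the two toy documents are reused from the bag-of-words slide because the lecture's sample collection is only shown as a figure.

```python
from collections import Counter, defaultdict


def build_inverted_index(collection):
    # collection maps a unique document number to its text.
    index = defaultdict(list)
    # Visiting documents in increasing number keeps every list document-ordered.
    for doc_id in sorted(collection):
        counts = Counter(collection[doc_id].lower().split())
        for term, count in counts.items():
            # One posting per (term, document): the pointer plus a term count.
            index[term].append((doc_id, count))
    return dict(index)


collection = {1: "the cat sat on the hat", 2: "the dog ate the cat and the hat"}
index = build_inverted_index(collection)
print(index["the"])
print(index["dog"])
```

Looking up a query term is now a single dictionary access that returns only the documents containing it, instead of a scan over the whole collection.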


Page 27: Information retrieval   concept, practice and challenge

Indexing (Inverted Index)
Sample collection: 4 sentences from the Wikipedia entry for Tropical Fish


Page 28: Information retrieval   concept, practice and challenge

Indexing (Inverted Index)
Simple inverted index.


Page 29: Information retrieval   concept, practice and challenge

Indexing (Inverted Index)
Inverted index with counts.

Supports better ranking algorithms.


Page 30: Information retrieval   concept, practice and challenge

Indexing (Inverted Index)
Inverted index with positions.

Supports proximity matching.
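To show why positions enable proximity matching, here is a small sketch of a positional index and an adjacency check. All names are hypothetical, positions are assumed to be 1-based word offsets, and the three short documents are borrowed from the vector space model slides.

```python
from collections import defaultdict


def build_positional_index(collection):
    # term -> {document number -> list of word positions (1-based)}
    index = defaultdict(dict)
    for doc_id in sorted(collection):
        words = collection[doc_id].lower().split()
        for position, term in enumerate(words, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def adjacent_match(index, first, second):
    # Documents in which `second` occurs immediately after `first`.
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        following = set(second_postings[doc_id])
        if any(position + 1 in following for position in first_postings[doc_id]):
            hits.append(doc_id)
    return hits


collection = {1: "new straits times", 2: "new straits daily", 3: "north borneo times"}
index = build_positional_index(collection)
print(adjacent_match(index, "straits", "times"))
print(adjacent_match(index, "new", "times"))
```

A wider proximity window can be expressed by replacing the `position + 1` test with a check on the distance between the two positions.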


Page 31: Information retrieval   concept, practice and challenge

Retrieval Function
Ranking
Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm


Page 32: Information retrieval   concept, practice and challenge

Retrieval Function (Boolean Retrieval)
Advantages
◦ Results are predictable, relatively easy to explain
◦ Many different features can be incorporated
◦ Efficient processing since many documents can be eliminated from search

Disadvantages
◦ Effectiveness depends entirely on user
◦ Simple queries usually don’t work well
◦ Complex queries are difficult


Page 33: Information retrieval   concept, practice and challenge

Retrieval Function (Boolean Retrieval)
Sequence of queries driven by the number of retrieved documents
◦ e.g. “lincoln” search of news articles
◦ president AND lincoln
◦ president AND lincoln AND NOT (automobile OR car)
◦ president AND lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)
◦ president AND lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)
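Boolean operators map directly onto set operations over posting sets: AND is intersection, OR is union, AND NOT is difference. The sketch below uses a tiny made-up collection (the four document texts are invented purely for illustration) to replay the first two queries of the sequence; the helper names are hypothetical.

```python
def build_postings(collection):
    # term -> set of document numbers containing the term
    postings = {}
    for doc_id, text in collection.items():
        for term in set(text.lower().split()):
            postings.setdefault(term, set()).add(doc_id)
    return postings


# Invented toy documents, not taken from any real news collection.
collection = {
    1: "president lincoln biography",
    2: "lincoln car dealership",
    3: "president lincoln automobile museum",
    4: "president obama biography",
}
postings = build_postings(collection)


def docs_with(term):
    # Unknown terms simply match no documents.
    return postings.get(term, set())


# president AND lincoln
q1 = docs_with("president") & docs_with("lincoln")
# president AND lincoln AND NOT (automobile OR car)
q2 = q1 - (docs_with("automobile") | docs_with("car"))
print(sorted(q1), sorted(q2))
```

The result is an unranked set: a document either satisfies the expression or it does not, which is why the user has to keep refining the query by hand.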


Page 34: Information retrieval   concept, practice and challenge

Retrieval Function (Vector Space Model)
Rank-based method.

Documents and query represented by a vector of term weights

Collection represented by a matrix of term weights


Page 35: Information retrieval   concept, practice and challenge

Retrieval Function (Vector Space Model)

D1: new straits times
D2: new straits daily
D3: north borneo times

Vector of terms

      borneo  daily  new  north  straits  times
D1    0       0      1    0      1        1
D2    0       1      1    0      1        0
D3    1       0      0    1      0        1


Page 36: Information retrieval   concept, practice and challenge

Retrieval Function (Vector Space Model)

      borneo  daily  new    north  straits  times
D1    0       0      0.176  0      0.176    0.176
D2    0       0.477  0.176  0      0.176    0
D3    0.477   0      0      0.477  0        0.176

idf(borneo) = log(3/1) = 0.477
idf(daily) = log(3/1) = 0.477
idf(new) = log(3/2) = 0.176
idf(north) = log(3/1) = 0.477
idf(straits) = log(3/2) = 0.176
idf(times) = log(3/2) = 0.176
then multiply by tf

tf.idf weight
◦ Term frequency weight measures importance in a document.
◦ Inverse document frequency measures importance in the collection. In the values above it is the base-10 logarithm of N/n, where N = 3 is the number of documents and n is the number of documents containing the term.
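The table can be reproduced with a short script. It assumes what the worked numbers imply: idf is the base-10 logarithm of N/n, and tf is the raw count of the term in the document (the slide's own tf formula is not preserved in this transcript, so raw counts are an assumption). Function names are hypothetical.

```python
import math

docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
terms = ["borneo", "daily", "new", "north", "straits", "times"]


def idf(term, docs):
    # n_t: number of documents containing the term; N: collection size.
    n_t = sum(1 for text in docs.values() if term in text.split())
    return math.log10(len(docs) / n_t)


def tfidf_vector(text, docs, terms):
    tokens = text.split()
    # Assumption: tf is the raw count of the term in the document.
    return [round(tokens.count(term) * idf(term, docs), 3) for term in terms]


weights = {doc_id: tfidf_vector(text, docs, terms) for doc_id, text in docs.items()}
for doc_id, vector in weights.items():
    print(doc_id, vector)
```

Terms that occur in only one document ("borneo", "daily", "north") receive the highest weight, while terms shared by two of the three documents are discounted.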


Page 37: Information retrieval   concept, practice and challenge

Retrieval Function (Vector Space Model)
Documents ranked by distance between points representing query and documents
◦ Similarity measure more common than a distance or dissimilarity measure
◦ e.g. Cosine correlation


Page 38: Information retrieval   concept, practice and challenge

Retrieval Function (Vector Space Model)
Consider two documents D1, D2 and a query Q
Q = “straits times”
Compare against the collection terms (borneo, daily, new, north, straits, times)
Q = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)

\[
\mathrm{Cosine}(D_1, Q) = \frac{(0 \cdot 0) + (0 \cdot 0) + (0.176 \cdot 0) + (0 \cdot 0) + (0.176 \cdot 0.176) + (0.176 \cdot 0.176)}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)\,(0.176^2 + 0.176^2)}} = 0.816
\]

Find Cosine(D2, Q).
Which document is more relevant?
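A small helper makes it easy to check the worked value and to try the exercise. This is a plain-Python sketch of the standard cosine formula (dot product divided by the product of the vector lengths), using the rounded weights from the slide; the function name is an assumption.

```python
import math


def cosine(x, y):
    # Dot product of the two vectors divided by the product of their lengths.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Term order: borneo, daily, new, north, straits, times
Q = [0, 0, 0, 0, 0.176, 0.176]
D1 = [0, 0, 0.176, 0, 0.176, 0.176]
D2 = [0, 0.477, 0.176, 0, 0.176, 0]

cos_d1 = cosine(D1, Q)
cos_d2 = cosine(D2, Q)
print(round(cos_d1, 3), round(cos_d2, 3))
```

The document with the larger cosine value points in a direction closer to the query vector and is therefore ranked first.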


Page 39: Information retrieval   concept, practice and challenge

Evaluation
It is a must to evaluate the retrieval function, preprocessing steps, etc.

Standard Collection
◦ Task specific
◦ Human experts are used to judge relevant results.

Performance Metric
◦ Precision
◦ Recall


Page 40: Information retrieval   concept, practice and challenge

Evaluation (Collection)
Test collections consisting of documents, queries, and relevance judgments, e.g.,


Page 41: Information retrieval   concept, practice and challenge

Evaluation (Collection)


Page 42: Information retrieval   concept, practice and challenge

Evaluation (Collection)
Obtaining relevance judgments is an expensive, time-consuming process
◦ who does it?
◦ what are the instructions?
◦ what is the level of agreement?


Page 43: Information retrieval   concept, practice and challenge

Evaluation (Collection)
Exhaustive judgment of all documents in a collection is not practical

Pooling technique is used in TREC
◦ top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines (or retrieval algorithms) are merged into a pool
◦ duplicates are removed
◦ documents are presented in some random order to the relevance judges

Produces a large number of relevance judgments for each query, although still incomplete
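The pooling steps can be sketched as follows. This is an illustration only, not TREC's actual tooling: the function name, the `seed` parameter, and the toy rankings are all assumptions.

```python
import random


def build_pool(rankings, k, seed=None):
    pool = set()
    for ranking in rankings:
        # Take the top k results of each system; the set removes duplicates.
        pool.update(ranking[:k])
    pooled = sorted(pool)
    # Shuffle so judges cannot tell which system or rank a document came from.
    random.Random(seed).shuffle(pooled)
    return pooled


# Hypothetical rankings produced by two retrieval systems for one query.
rankings = [
    ["d3", "d1", "d7", "d2"],
    ["d1", "d4", "d3", "d9"],
]
pool = build_pool(rankings, k=3, seed=0)
print(pool)
```

Documents outside every system's top k (here d2 and d9) never reach the judges, which is why the resulting judgments remain incomplete.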


Page 44: Information retrieval   concept, practice and challenge

Evaluation (Effectiveness Measures)
A is the set of relevant documents, B is the set of retrieved documents
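With these two sets, the standard definitions are precision = |A ∩ B| / |B| (the fraction of retrieved documents that are relevant) and recall = |A ∩ B| / |A| (the fraction of relevant documents that were retrieved). The slide's own formulas are not preserved in this transcript, so the sketch below relies on those textbook definitions; the document identifiers are made up.

```python
def precision(relevant, retrieved):
    # Fraction of the retrieved documents that are relevant.
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant, retrieved):
    # Fraction of the relevant documents that were retrieved.
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


A = {"d1", "d2", "d3", "d4"}  # relevant documents (hypothetical ids)
B = {"d2", "d4", "d5"}        # retrieved documents (hypothetical ids)

p = precision(A, B)
r = recall(A, B)
print(p, r)
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were found, so the two measures capture different kinds of error.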


Page 45: Information retrieval   concept, practice and challenge

Evaluation (Ranking Effectiveness)


Page 46: Information retrieval   concept, practice and challenge

Practice
SEARCH ENGINE


Page 47: Information retrieval   concept, practice and challenge

Search Engine
The most relevant application of Information Retrieval.

Do you agree?

Search on the Web is a daily activity for many people throughout the world.


Page 48: Information retrieval   concept, practice and challenge

What about Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
◦ e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.

Easy to compare fields with well-defined semantics to queries in order to find matches

Text is more difficult


Page 49: Information retrieval   concept, practice and challenge

Search Query vs DB Query
Example bank database query
◦ Find records with balance > $50,000 in branches located in Amherst, MA.
◦ Matches easily found by comparison with field values of records

Example search engine query
◦ bank scandals in western mass
◦ This text must be compared to the text of entire news stories


Page 50: Information retrieval   concept, practice and challenge

Dimensions of IR
IR is more than just text, and more than just web search
◦ although these are central

People doing IR work with different media, different types of search applications, and different tasks


Page 51: Information retrieval   concept, practice and challenge

Other Media
New applications increasingly involve new media
◦ e.g., video, photos, music, speech

Like text, content is difficult to describe and compare
◦ text may be used to represent them (e.g. tags)

IR approaches to search and evaluation are appropriate


Page 52: Information retrieval   concept, practice and challenge

Dimensions of IR

Content        Applications        Tasks
Text           Web search          Ad hoc search
Images         Vertical search     Filtering
Video          Enterprise search   Classification
Scanned docs   Desktop search      Question answering
Audio          Forum search
Music          P2P search
               Literature search


Page 53: Information retrieval   concept, practice and challenge

IR and Search Engines

Information Retrieval
◦ Relevance: effective ranking
◦ Evaluation: testing and measuring
◦ Information needs: user interaction

Search Engines
◦ Performance: efficient search and indexing
◦ Incorporating new data: coverage and freshness
◦ Scalability: growing with data and users
◦ Adaptability: tuning for applications
◦ Specific problems: e.g. spam


Page 54: Information retrieval   concept, practice and challenge

Search Engine Issues
Performance
◦ Measuring and improving the efficiency of search
  ◦ e.g., reducing response time, increasing query throughput, increasing indexing speed
◦ Indexes are data structures designed to improve search efficiency
  ◦ designing and implementing them are major issues for search engines


Page 55: Information retrieval   concept, practice and challenge

Search Engine Issues
Dynamic data
◦ The “collection” for most real applications is constantly changing in terms of updates, additions, deletions
  ◦ e.g., web pages
◦ Acquiring or “crawling” the documents is a major task
  ◦ Typical measures are coverage (how much has been indexed) and freshness (how recently was it indexed)
◦ Updating the indexes while processing queries is also a design issue


Page 56: Information retrieval   concept, practice and challenge

Search Engine Issues
Scalability
◦ Making everything work with millions of users every day, and many terabytes of documents
◦ Distributed processing is essential

Adaptability
◦ Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applications


Page 57: Information retrieval   concept, practice and challenge

Challenge


Page 58: Information retrieval   concept, practice and challenge

Let’s Define the Challenge Together
1. Cross Lingual IR
2. Big Data
3. Personalization
4. Domain Specific IR
5. Multi modal IR
6. ….


Page 59: Information retrieval   concept, practice and challenge

IR Research Direction
Latest Research at Google
http://research.google.com/pubs/InformationRetrievalandtheWeb.html


Page 60: Information retrieval   concept, practice and challenge

Acknowledgement

Thank you for Watching …

This presentation was prepared with content adapted from the following sources.

a. Introduction to Information Retrieval, C. D. Manning, P. Raghavan and H. Schütze, 2009.

b. Introduction to Information Retrieval, IR Summer School 2001, Mounia Lalmas.

c. Search Engines: Information Retrieval in Practice, D. Metzler, T. Strohman and W. B. Croft, 2009.
