40
1 Intelligent Information Retrieval (and Web Search) Professor Celso A A Kaestner, PhD. Brazil

Intelligent Information Retrieval (and Web Search)

Embed Size (px)

DESCRIPTION

Intelligent Information Retrieval (and Web Search). Professor Celso A A Kaestner, PhD. Brazil. Site: www.dainf.ct.utfpr.edu.br/~kaestner/Konstanz/iir.htm. Introduction. Introduction: Information Retrieval. IR: representation, storage, organization of, and access to information items; - PowerPoint PPT Presentation

Citation preview

Page 1: Intelligent Information Retrieval (and Web Search)

1

Intelligent Information Retrieval(and Web Search)

Professor Celso A A Kaestner, PhD.

Brazil

Page 2: Intelligent Information Retrieval (and Web Search)

2

Site:www.dainf.ct.utfpr.edu.br/~kaestner/Konstanz/iir.htm

Page 3: Intelligent Information Retrieval (and Web Search)

3

Introduction

Page 4: Intelligent Information Retrieval (and Web Search)

4

Introduction: Information Retrieval

• IR: representation, storage, organization of, and access to information items;

• Focus is on the user information need;• User information need:

– Find all docs containing information on college football teams which: (1) are maintained by an university and (2) participate in the national tournament.

• Emphasis is on the retrieval of information (not data).

Page 5: Intelligent Information Retrieval (and Web Search)

5

Data retrieval x Information retrieval

• Data Retrieval:– which docs. contain a set of keywords?– well defined semantics;– a single erroneous object implies failure!

• Information Retrieval (IR):– information about a subject or topic;– semantics is frequently loose;– small errors are tolerated.

• IR system:– interpret contents of information items;– generate a ranking which reflects relevance;– notion of relevance is most important.

Page 6: Intelligent Information Retrieval (and Web Search)

6

Information Retrieval (IR)

• The indexing and retrieval of textual documents.

• Searching for pages on the World Wide Web is the most recent “killer app.”

• Concerned firstly with retrieving relevant documents to a query.

• Concerned secondly with retrieving from large sets of documents efficiently.

Page 7: Intelligent Information Retrieval (and Web Search)

7

Typical IR Task

• Given:– A corpus of textual natural-language

documents.– A user query in the form of a textual

string.

• Find:– A ranked set of documents that are

relevant to the query.

Page 8: Intelligent Information Retrieval (and Web Search)

8

IR System

IRSystem

Query String

Documentcorpus

RankedDocuments

1. Doc12. Doc23. Doc3 . .

Page 9: Intelligent Information Retrieval (and Web Search)

9

Relevance

• Relevance is a subjective judgment and may include:– Being on the proper subject.– Being timely (recent information).– Being authoritative (from a trusted

source).– Satisfying the goals of the user and

his/her intended use of the information (information need).

Page 10: Intelligent Information Retrieval (and Web Search)

10

Keyword Search

• Simplest notion of relevance is that the query string appears verbatim in the document.

• Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Page 11: Intelligent Information Retrieval (and Web Search)

11

Problems with Keywords

• May not retrieve relevant documents that include synonymous terms.– “restaurant” vs. “café”– “PRC” vs. “China”

• May retrieve irrelevant documents that include ambiguous terms.– “bat” (baseball vs. mammal)– “Apple” (company vs. fruit)– “bit” (unit of data vs. act of eating)

Page 12: Intelligent Information Retrieval (and Web Search)

12

Beyond Keywords

• We will cover the basics of keyword-based IR, but…

• We will focus on extensions and recent developments that go beyond keywords.

• We will cover the basics of building an efficient IR system, but…

• We will focus on basic capabilities and algorithms rather than system’s issues that allow scaling to industrial size databases.

Page 13: Intelligent Information Retrieval (and Web Search)

13

Intelligent IR

• Taking into account the meaning of the words used.

• Taking into account the order of words in the query.

• Adapting to the user based on direct or indirect feedback.

• Taking into account the authority of the source.

Page 14: Intelligent Information Retrieval (and Web Search)

14

IR System Architecture

TextDatabase

DatabaseManager

Indexing

Index

QueryOperations

Searching

RankingRanked

Docs

UserFeedback

Text Operations

User Interface

RetrievedDocs

UserNeed

Text

Query

Logical View

Inverted file

Page 15: Intelligent Information Retrieval (and Web Search)

15

IR System Components

• Text Operations forms index words (tokens).– Standardization (caps …)– Stopword removal– Stemming

• Indexing constructs an inverted index of word to document pointers.

• Searching retrieves documents that contain a given query token from the inverted index.

• Ranking scores all retrieved documents according to a relevance metric.

Page 16: Intelligent Information Retrieval (and Web Search)

16

IR System Components (continued)

• User Interface manages interaction with the user:– Query input and document output.– Relevance feedback.– Visualization of results.

• Query Operations transform the query to improve retrieval:– Query expansion using a thesaurus.

– Query transformation using relevance feedback.

Page 17: Intelligent Information Retrieval (and Web Search)

17

• IR at the center of the stage:– Advent of the Web changed this

perception once and for all:• universal repository of knowledge; • free (low cost) universal access;• no central editorial board;• many problems though: IR seen as key to

finding the solutions!

IR and the Web

Page 18: Intelligent Information Retrieval (and Web Search)

18

IR and the Web

• And more: • Most of the human task employ the

treatment of information in textual and/ or graphic form (Lyman, 2003);

• How Much Information project (Berkeley):

www.sims.berkeley.edu/how-much-info-2003.

• Each person generates 800 Mbytes / year.

Page 19: Intelligent Information Retrieval (and Web Search)

19

In 2002: 5 Exabytes of new information;• Magnetic media (HD’s): 92%; • Films: 7%;• Print material: 0,01%;• Optical media: 0,002%.

5 Exabytes = 5 million Terabytes = 5.000.000.000.000.000.000 bytes;

2 times the amount of 1999, given an increasing rate of 30% / year.

IR and the Web

Page 20: Intelligent Information Retrieval (and Web Search)

20

Information flow - radio, TV, Internet:• 18 Exabytes of new information in 2002;• 3,5 times of the amount stored;• Telephone lines (and cell phones): 98%;• 320 million hours of radio and TV

transmissions, with 70 million new hours, with 81 Gigabytes of texts.

IR and the Web

Page 21: Intelligent Information Retrieval (and Web Search)

21

Email:• 31 billion of e-mails / year = 400.000 Tbytes

of new information;

The Internet (Web):• 170 Tbytes of information = 17 times the

printed content of the US Library of Congress.

IR and the Web

Page 22: Intelligent Information Retrieval (and Web Search)

22

Search sites:• “Yahoo”, “Google”, etc. = the 1st option of

access for the users;• A typical Internet user: 11 h 20 m / month;• Access to the desired information = 1 / 3 of

the period;• The user is obliged to verify if the received

information is the desired one, and several times is impossible to recover the information needed.

IR and the Web

Page 23: Intelligent Information Retrieval (and Web Search)

23

• Information Glut or Information Overload: is the main challenge to be surpassed by automatic text treatment systems.

IR and the Web

Page 24: Intelligent Information Retrieval (and Web Search)

24

Web Search

• Application of IR to HTML documents on the World Wide Web.

• Differences:– Must assemble document corpus by

spidering the web.– Can exploit the structural layout

information in HTML (XML).– Documents change uncontrollably.– Can exploit the link structure of the web.

Page 25: Intelligent Information Retrieval (and Web Search)

25

Web Search System

Query String

IRSystem

RankedDocuments

1. Page12. Page23. Page3 . .

Documentcorpus

Web Spider

Page 26: Intelligent Information Retrieval (and Web Search)

26

Other IR-Related Tasks

• Automated document categorization• Information filtering (spam filtering)• Information routing• Automated document clustering• Recommending information or products• Information extraction• Information integration• Question answering

Page 27: Intelligent Information Retrieval (and Web Search)

27

History of IR

• 1960-70’s:– Initial exploration of text retrieval systems

for “small” corpora of scientific abstracts, and law and business documents.

– Development of the basic Boolean and vector-space models of retrieval.

– Prof. Salton and his students at Cornell University are the leading researchers in the area.

Page 28: Intelligent Information Retrieval (and Web Search)

28

IR History Continued

• 1980’s:– Large document database systems, many

run by companies:• Lexis-Nexis• Dialog• MEDLINE

Page 29: Intelligent Information Retrieval (and Web Search)

29

IR History Continued

• 1990’s:– Searching FTPable documents on the

Internet• Archie• WAIS

– Searching the World Wide Web• Lycos• Yahoo• Altavista

Page 30: Intelligent Information Retrieval (and Web Search)

30

IR History Continued

• 1990’s continued:– Organized Competitions

• NIST TREC

– Recommender Systems• Ringo• Amazon• NetPerceptions

– Automated Text Categorization & Clustering

Page 31: Intelligent Information Retrieval (and Web Search)

31

Recent IR History

• 2000’s– Link analysis for Web Search

• Google

– Automated Information Extraction• Whizbang• Fetch• Burning Glass

– Question Answering• TREC Q/A track

Page 32: Intelligent Information Retrieval (and Web Search)

32

Recent IR History

• 2000’s continued:– Multimedia IR

• Image• Video• Audio and music

– Cross-Language IR• DARPA Tides

– Document Summarization

Page 33: Intelligent Information Retrieval (and Web Search)

33

Related Areas

• Database Management

• Library and Information Science

• Artificial Intelligence

• Natural Language Processing

• Machine Learning

Page 34: Intelligent Information Retrieval (and Web Search)

34

Database Management

• Focused on structured data stored in relational tables rather than free-form text.

• Focused on efficient processing of well-defined queries in a formal language (SQL).

• Clearer semantics for both data and queries.• Recent move towards semi-structured data

(XML) brings it closer to IR.

Page 35: Intelligent Information Retrieval (and Web Search)

35

Library and Information Science

• Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization).

• Concerned with effective categorization of human knowledge.

• Concerned with citation analysis and bibliometrics (structure of information).

• Recent work on digital libraries brings it closer to CS & IR.

Page 36: Intelligent Information Retrieval (and Web Search)

36

Artificial Intelligence

• Focused on the representation of knowledge, reasoning, and intelligent action.

• Formalisms for representing knowledge and queries:– First-order Predicate Logic– Bayesian Networks– Others …

• Recent work on web ontologies and intelligent information agents brings it closer to IR.

Page 37: Intelligent Information Retrieval (and Web Search)

37

Natural Language Processing

• Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse.

• Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.

Page 38: Intelligent Information Retrieval (and Web Search)

38

Natural Language Processing:IR Directions

• Methods for determining the sense of an ambiguous word based on context (word sense disambiguation).

• Methods for identifying specific pieces of information in a document (information extraction).

• Methods for answering specific NL questions from document corpora.

Page 39: Intelligent Information Retrieval (and Web Search)

39

Machine Learning

• Focused on the development of computational systems that improve their performance with experience.

• Automated classification of examples based on learning concepts from labeled training examples (supervised learning).

• Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).

Page 40: Intelligent Information Retrieval (and Web Search)

40

Machine Learning:IR Directions

• Text Categorization– Automatic hierarchical classification (Yahoo).– Adaptive filtering/routing/recommending.– Automated spam filtering.

• Text Clustering– Clustering of IR query results.– Automatic formation of hierarchies (Yahoo).

• Learning for Information Extraction• Text Mining• Text Summarization