Information Retrieval Systems Info624 – Week 1 Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University

Page 1: Information Retrieval Systems Info624 – Week 1 Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University

Information Retrieval Systems
Info624 – Week 1

Dr. Xia Lin
Associate Professor
College of Information Science and Technology
Drexel University

Page 2:

Self-Introduction

My Journey in America
  Atlanta, GA
  Denton, TX
  College Park, MD
  San Jose, CA
  White Plains, NY
  Lexington, KY
  Philadelphia, PA

Page 3:

Have you ever asked:
  How could search engines find the information I request so quickly, out of millions and millions of web pages?
Which statement do you like the best?
  It is easy to find just about anything on the Web.
  It’s impossible to find anything on the Web; I always find so many things that I don’t want.
How about these:
  I like search engines very much.
  I hate search engines!

Page 4:

More questions

What kinds of questions are easy (or difficult) to answer on the Web? Why?
Are there any ways to make it easier?
What solutions are we looking for?
  Technical solution?
  Cognitive solution?

Page 5:

Google is the solution?

Everyone likes Google.
  True or False?
What would happen if Google disappears from the Web tomorrow morning?
Better Search Results than Google?
  CNN report, Jan. 5, 2004
  vivisimo.com/
  Grokker2
  TouchGraph

Page 6:

How to defeat Google?

Microsoft Way
  I will buy you,
  Or I will netscape you!
Open Source Way
  Watch Nutch
    Under the leadership of Doug Cutting
    The “Linux” of search engines

Page 7:

Course Overview

What this course is about ...
  How people search and find information.
  How computers store and retrieve information.
  How computer systems are designed to help people find information they need.

Page 8:

Course Overview

The course will emphasize understanding of
  Theories
  Tools
  Algorithms, and
  Evaluations
for Information Retrieval Systems.

Page 9:

Course Overview

What this course is NOT ...
  An algorithm design course
    We might use several related algorithms, but not study them in detail.
    Our textbook could be used for such a course.
  A system development course
    Except that some assignments may require you to compile some C procedures.
    We look at an IR system as a whole, not as individual components.

Page 10:

Required skills

Know how to create HTML pages
Have access to a Web server
  If you don’t, the best way is to apply for a dunx1 account from Drexel.
  Make sure you request
    • Web server access.
    • Shell access.
Have access to a C compiler
  Having dunx1 shell access will do it.

Page 11:

Project Idea 1

Install and implement an IR system:
  Index a sample document collection or a Web site.
  Test and evaluate all the functionalities of the system.
  Compare this IR system with others.
  Demonstrate the implementation in class.

Page 12:

Project Idea 2

Conduct an evaluation experiment on one or two selected IR systems
  Identify the systems
  Install the systems, if necessary
  Design the experimental methods
  Test the experimental methods
  Analyze the data and write the final report.

Page 13:

Project Idea 3

Customize an IR system
  Using open source retrieval software
    Apache Lucene
  Implementing a crawler
    With some open source code
  Designing a new retrieval interface

Page 14:

What is IR?

IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information. (System-centered view)
IR involves helping users find information that matches their information needs. (User-centered view)

Page 15:

IR Systems

IR systems contain three components:
  System
  People
  Documents (information items)
(Diagram labels: User, System, Documents)

Page 16:

Web brings IR to the Center of the Stage

IR has become a center of focus in the Web era. Its theories, techniques, and applications have reached many fields where processing large amounts of information is essential.

Page 17:

Challenges of IR

(Diagram labels: User, Info. Needs, Queries, Search/select, Stored Information, Information)
  Translating info. needs to queries
  Matching queries to stored information
  Query result evaluation: Does the information found match the user’s information needs?

Page 18:

Examples:

Where can I find information needed for my term project?
Challenges:
  How do you translate the question to a query?
  What information needs to be stored in the system in order to answer the question?
  Which system will match the request best?

Page 19:

Examples:

Which IST course is most useful?
  Challenges:
    Information may not exist anywhere.
    It’s personal opinion.
Where is bin Laden now?
  Challenges:
    Intelligence analysis
    Need first-hand information

Page 20:

Abstraction Principles

First Abstraction Principle
  Abstract data from the “real world” and make them available to the system.
Second Abstraction Principle
  Abstract the user’s information needs into a form the system understands.

Page 21:

Users

The user
  anyone who needs to find some information
The user groups
  group by their knowledge of the system
    novice users vs. experienced users
    end users vs. information specialists
  group by their domain knowledge
    domain experts vs. general public
  group by information needs
    need to locate a particular item
    need some information
    need all information on a subject

Page 22:

User’s Information Needs

People depend on information to carry out their daily activities.
  need to accomplish some goals.
  need to solve some problems.
People realize a lack of information
  perceive a gap in their knowledge state
    ASK -- Anomalous State of Knowledge
  desire to fill the gap
(Diagram labels: Reality, Goals?)

Page 23:

User’s information needs

(Diagram labels: Reality, Goals?, Info. Needs, Info. Systems, ??)

Page 24:

Queries

(Diagram labels: Reality, Goals?, Info. Needs, Info. Systems, ??, Request, Problems, Data, ??, First Abstraction Principle, Second Abstraction Principle)

Page 25:

Data and Information

Data
  String of symbols associated with objects, people, and events
  Values of an attribute
  Data need not have meaning to everyone.
  Data must be interpreted with associated attributes.

Page 26:

Data and Information

Information
  The meaning of the data interpreted by a person or a system
  Data that changes the state of a person or system that perceives it.
  Data that reduces uncertainty.
    If data contain no uncertainty, there is no information in the data.
    Examples:
      It snows in the winter.
      It does not snow this winter.
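The two snow statements can be put into numbers with a small sketch. The slide does not name a formula; the measure used below is Shannon's self-information, log2(1/p), brought in here as an assumption, and both probabilities are made-up values chosen only for illustration.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Self-information in bits: the less expected a statement, the more it tells us."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / probability)


# Hypothetical probabilities, not taken from the slides.
P_SNOWS_IN_WINTER = 1.0        # treated as certain, so nothing new is learned
P_NO_SNOW_THIS_WINTER = 0.125  # treated as a 1-in-8 event

print(surprisal_bits(P_SNOWS_IN_WINTER))      # 0.0 bits
print(surprisal_bits(P_NO_SNOW_THIS_WINTER))  # 3.0 bits
```

Under these assumed values, the statement everyone already expects carries no information, while the unexpected one carries three bits.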

Page 27:

Information and Knowledge

Knowledge
  Structured information
    through structuring, information becomes understandable
  Processed information
    through processing, information becomes meaningful and useful
  Information shared and agreed upon within a community
(Diagram labels: Data, information, knowledge)

Page 28:

Text

Strings of ASCII symbols or Unicode
  structured by the author
  indexed by information service providers
Representation of natural languages people use
  To convey meanings
  To communicate between readers and authors.
Data or information?
  If it can be understood, it’s information.
    By whom? A person or a system?

Page 29:

Documents

Logical unit of text
  articles, books, links, web pages
Other components that come with the text
  figures, charts, graphics
  multimedia

Page 30:

Textual Data

Repository of human intellect
  Rich and diverse resources for all answers.
  If it is written, it is there (in text).
Meaningful and understandable (to users).
Simple ASCII representation
Free of pre-formatted structures
  continuous
  separated into documents
Easy to process by the computer
  Machine intensive (not labor intensive)

Page 31:

Problems with Text

Massive
  Any IR system needs the capability of large-scale data processing.
  Use of indexes and various representations is required.
Inconsistent
  It’s a human language
  Syntactic and semantic variances
    • Same information expressed in different ways.
    • Different information expressed in similar ways.
Incomplete
  It uses common knowledge.
  It’s an open system.
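The point about indexes can be illustrated with an inverted index, which maps each term to the documents that contain it. The slides do not name a specific index structure, so this choice is an assumption, and the sample documents and the names `tokenize`, `build_index`, and `lookup` are hypothetical.

```python
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map every term to the set of document ids in which it occurs."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def lookup(index: dict[str, set[int]], term: str) -> list[int]:
    """Return the sorted ids of documents containing the term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical three-document collection.
DOCS = {
    1: "Search engines index web pages.",
    2: "An index maps terms to documents.",
    3: "Users search for documents.",
}
INDEX = build_index(DOCS)

print(lookup(INDEX, "index"))      # [1, 2]
print(lookup(INDEX, "search"))     # [1, 3]
print(lookup(INDEX, "documents"))  # [2, 3]
```

A lookup touches only the entry for the query term instead of scanning every document, which is what makes a massive collection searchable. The sketch does nothing about the "inconsistent" problem: "document" and "documents" would still be treated as different terms.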

Page 32:

Retrieval

Retrieval
  What do we retrieve?
    Data
    Information
    Knowledge
We retrieve documents that contain text which carries information.
  Information can be anywhere
    in the text, in the links, in the process of text.

Page 33:

Information Retrieval

Are they the same?
  Text retrieval
  Document retrieval
  Information retrieval

Page 34:

Information Retrieval

Conceptually, information retrieval is used to cover all related problems in finding needed information.
Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit.
Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.

Page 35:

Summary

The goal of IR systems is to help users find information that satisfies their information needs.
The main process of IR systems is to match data abstracted from the real world to queries abstracted from users’ information needs.
Information retrieval is much more difficult than data retrieval.

Page 36:

Data Retrieval vs. Information Retrieval

                       Data retrieval       Information retrieval
Content                Data                 Information
Data object            Table                Document
Matching               Exact match          Partial match, best match
Items wanted           Matching             Relevant
Query language         SQL (artificial)     Natural
Query specification    Complete             Incomplete
Model                  Deterministic        Probabilistic
                       Highly structured    Less structured
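The "Matching" row can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method from the course: the records are invented, and ranking by the number of shared terms is a deliberately simple stand-in for "best match".

```python
def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return {w.strip(".,;:!?") for w in text.lower().split()} - {""}


# Hypothetical records, invented for this example.
RECORDS = [
    {"id": 1, "title": "Information Retrieval Systems"},
    {"id": 2, "title": "Database Systems"},
    {"id": 3, "title": "Retrieval of Web Information"},
]


def exact_match(records: list[dict], title: str) -> list[int]:
    """Data-retrieval style: a record either satisfies the condition or it does not."""
    return [r["id"] for r in records if r["title"] == title]


def best_match(records: list[dict], query: str) -> list[int]:
    """IR style: rank records by shared query terms and drop those with none."""
    query_terms = tokenize(query)
    scored = []
    for r in records:
        overlap = len(query_terms & tokenize(r["title"]))
        if overlap > 0:
            scored.append((overlap, r["id"]))
    # Highest overlap first; ties are broken by id so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


print(exact_match(RECORDS, "Database Systems"))           # [2]
print(exact_match(RECORDS, "web information retrieval"))  # []
print(best_match(RECORDS, "web information retrieval"))   # [3, 1]
```

The exact match returns nothing unless the query is complete and identical to a stored value, while the best match accepts an incomplete, natural-language query and returns a ranked list of partially matching items.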