Information Retrieval Systems
Info624 – Week 1
Dr. Xia Lin
Associate Professor
College of Information Science and Technology
Drexel University
Self-Introduction
My Journey in America
  Atlanta, GA
  Denton, TX
  College Park, MD
  San Jose, CA
  White Plains, NY
  Lexington, KY
  Philadelphia, PA
Have you ever asked:
How could search engines find the information I request so quickly, out of millions and millions of web pages?
Which statement do you like the best?
  It is easy to find just about anything on the Web.
  It’s impossible to find anything on the Web; I always find so many things that I don’t want.
How about these:
  I like search engines very much.
  I hate search engines!
More questions
What kinds of questions are easy (or difficult) to answer on the Web? Why?
Are there any ways to make it easier?
What solutions are we looking for?
  Technical solution?
  Cognitive solution?
Google is the solution?
Everyone likes Google. True or False?
What would happen if Google disappeared from the Web tomorrow morning?
Better Search Results than Google?
  CNN report, Jan. 5, 2004
  vivisimo.com/
  Grokker2
  TouchGraph
How to defeat Google?
Microsoft Way
  I will buy you,
  Or I will netscape you!
Open Source Way
  Watch Nutch
  Under the leadership of Doug Cutting
  The “Linux” of search engines
Course Overview
What this course is about ...
  How people search and find information.
  How computers store and retrieve information.
  How computer systems are designed to help people find the information they need.
Course Overview
The course will emphasize understanding of
  Theories
  Tools
  Algorithms, and
  Evaluations
for Information Retrieval Systems.
Course Overview
What this course is NOT ...
  An algorithm design course
    We might use several related algorithms, but not study them in detail.
    Our textbook could be used for such a course.
  A system development course
    Except that some assignments may require you to compile some C procedures.
    We look at an IR system as a whole, not as individual components.
Required skills
Know how to create HTML pages
Have access to a Web server
  If you don’t, the best way is to apply for a dunx1 account from Drexel.
  Make sure you request
    • Web server access.
    • Shell access.
Have access to a C compiler
  Having dunx1 shell access will do it.
Project Idea 1
Install and implement an IR system:
  Index a sample document collection or a Web site
  Test and evaluate all the functionalities of the system.
  Compare this IR system with others.
  Demonstrate the implementation in class.
Project Idea 2
Conduct an evaluation experiment on one or two selected IR systems
  Identify the systems
  Install the systems, if necessary
  Design the experimental methods
  Test the experimental methods
  Analyze the data and write the final report.
Project Idea 3
Customize an IR system
  Using open source retrieval software
    Apache Lucene
  Implementing a crawler
    With some open source code
  Designing a new retrieval interface
What is IR?
IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information.
IR involves helping users find information that matches their information needs.
System-centered View
User-centered
IR Systems
IR systems contain three components:
  System
  People
  Documents (information items)
User
System
Documents
Web brings IR to the Center of the Stage
IR has become a center of focus in the Web era. Its theories, techniques, and applications have reached many fields where processing large amounts of information is essential.
Challenges of IR
[Diagram: User, Search/select, Information; Info. Needs, Queries, Stored Information]
Translating info. needs to queries
Matching queries to stored information
Query result evaluation: Does the information found match the user’s information needs?
Examples:
Where can I find information needed for my term project?
Challenges:
  How do you translate the question to a query?
  What information needs to be stored in the system in order to answer the question?
  Which system will match the request best?
Examples:
Which IST course is most useful?
Challenges:
  Information may not exist anywhere
  It’s personal opinion.
Where is bin Laden now?
Challenges:
  Intelligence Analysis
  Need the first-hand information
Abstraction Principles
First Abstraction Principle
  Abstract data from the “real world” and make them available to the system.
Second Abstraction Principle
  Abstract the user’s information needs into a form the system understands.
Users
The user
  anyone who needs to find some information
The user groups
  group by their knowledge of the system
    novice users vs. experienced users
    end users vs. information specialists
  group by their domain knowledge
    domain experts vs. general public
  group by information needs
    need to locate a particular item
    need some information
    need all information on a subject
User’s Information Needs
People depend on information to carry out their daily activities.
  need to accomplish some goals.
  need to solve some problems.
People realize a lack of information
  perceive a gap in their knowledge state
    ASK -- Anomalous State of Knowledge
  desire to fill the gap
[Diagram: User’s information needs. Labels: Reality, Goals?, Info. Needs, Request, Problems, Queries, Info. Systems, Data, First Abstraction Principle, Second Abstraction Principle]
Data and Information
Data
  String of symbols associated with objects, people, and events
  Values of an attribute
  Data need not have meaning to everyone
  Data must be interpreted with associated attributes.
Data and Information
Information
  The meaning of the data interpreted by a person or a system
  Data that changes the state of a person or system that perceives it.
  Data that reduces uncertainty.
    If data contain no uncertainty, there is no information in the data.
    Examples:
      It snows in the winter.
      It does not snow this winter.
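One common way to make "data that reduces uncertainty" concrete is Shannon's self-information, which the slides do not name; it is brought in here only as an illustration. The sketch below scores a statement by how unlikely it was before it was heard. The function name `surprisal_bits` and both probabilities are assumptions picked for the example, not measured values.

```python
import math

def surprisal_bits(probability: float) -> float:
    """Self-information log2(1/p), in bits, of an event with probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / probability)

# Hypothetical probabilities, chosen only for illustration.
p_snow_in_winter = 1.0         # treated as certain, so nothing new is learned
p_no_snow_this_winter = 0.125  # treated as unlikely, so the statement is news

certain = surprisal_bits(p_snow_in_winter)
unexpected = surprisal_bits(p_no_snow_this_winter)

print(f"It snows in the winter: {certain} bits")
print(f"It does not snow this winter: {unexpected} bits")
```

Under these assumed probabilities the first statement carries no information and the second carries some, which matches the contrast the two example sentences are making.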
Information and Knowledge
Knowledge
  Structured information
    through structuring, information becomes understandable
  Processed information
    through processing, information becomes meaningful and useful
  Information shared and agreed upon within a community
Data
information
knowledge
Text
Strings of ASCII symbols or Unicode
  structured by the author
  indexed by information service providers
Representation of natural languages people use
  To convey meanings
  To communicate between readers and authors.
Data or information?
  If it can be understood, it’s information.
  By whom? A person or a system?
Documents
Logical unit of text
  articles, books, links, web pages
Other components that come with the text
  figures, charts, graphics
  multimedia
Textual Data
Repository of human intellect
  Rich and diverse resources for all answers.
  If it is written, it is there (in text)
  Meaningful and understandable (to users).
Simple ASCII representation
  Free of pre-formatted structures
    continuous
    separated into documents
  Easy to process by the computer
  Machine intensive (not labor intensive)
Problems with Text
Massive
  Any IR system needs the capability of large-scale data processing.
  Use of indexes and various representations is required.
Inconsistent
  It’s a human language
  Syntactic and semantic variances
    • Same information expressed in different ways.
    • Different information expressed in similar ways.
Incomplete
  It uses common knowledge.
  It’s an open system.
Retrieval
What do we retrieve?
  Data
  Information
  Knowledge
We retrieve documents that contain text which carries information.
Information can be anywhere
  in the text, in the links, in the process of text.
Information Retrieval
Are they the same?
  Text retrieval
  Document retrieval
  Information retrieval
Information Retrieval
Conceptually, information retrieval is used to cover all related problems in finding needed information
Historically, information retrieval is about document retrieval, emphasizing document as the basic unit
Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.
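The technical view above (string manipulation, indexing, matching, querying) can be sketched with a small inverted index. This is a minimal illustration under stated assumptions, not the design of any particular IR system: the function names `tokenize`, `build_index`, and `search`, the three sample documents, and the rule that a document must contain every query term are all choices made for this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """String manipulation: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Querying and matching: ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample collection, used only for illustration.
documents = {
    1: "How computers store and retrieve information",
    2: "How people search and find information",
    3: "Search engines index millions of web pages",
}
index = build_index(documents)
hits = search(index, "search information")
print(sorted(hits))
```

Because the index is built once and each query only looks up its own terms, the cost of a query depends on the terms involved rather than on scanning every document.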
Summary
The goal of IR systems is to help users find information that satisfies their information needs.
The main process of IR systems is to match data abstracted from the real world to queries abstracted from user’s information needs.
Information retrieval is much more difficult than data retrieval.
Data Retrieval vs. Information Retrieval
                     Data retrieval      Information retrieval
Content              Data                Information
Data object          Table               Document
Matching             Exact match         Partial match, best match
Items wanted         Matching            Relevant
Query language       SQL (artificial)    Natural
Query specification  Complete            Incomplete
Model                Deterministic       Probabilistic
                     Highly structured   Less structured
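The "Matching" row of the comparison can be illustrated with a short sketch. Exact match returns only records that equal the requested value, while best match ranks documents by how much they overlap with the query. The function names, the sample records, and the term-overlap score are assumptions made for this example; real IR systems use more refined ranking models.

```python
def exact_match(records, field, value):
    """Data retrieval: return only records whose field equals the value."""
    return [record for record in records if record.get(field) == value]

def best_match(documents, query):
    """Information retrieval: rank documents by the number of shared query terms."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:  # documents sharing no term are not returned
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

# Hypothetical data, used only for illustration.
records = [
    {"id": 1, "title": "information retrieval systems"},
    {"id": 2, "title": "database systems"},
]
documents = {
    1: "information retrieval systems",
    2: "database systems",
    3: "web design",
}

exact_hits = exact_match(records, "title", "retrieval systems")
ranked_hits = best_match(documents, "retrieval systems")
print(exact_hits)
print(ranked_hits)
```

The same query finds nothing under exact match but produces a ranked list under best match, which is why the items wanted are described as "relevant" rather than "matching" on the information retrieval side.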