information retrieval 1-1
Chapter 1 Introduction
Motivation
• Information retrieval
– To retrieve information which might be useful or relevant to the user
– Issues: representation, storage, organization, access
• Information need (in practice): can it be satisfied with a simple query?
– Find all the pages containing information on college tennis teams which
(1) are maintained by a university in the USA and
(2) participate in the NCAA tennis tournament.
– To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.
Winter Vacation Assignment
• Find material on "door couplets" (門聯)
General Search Engine Architecture
[Figure labels: user, Web browser, request, response, query server, indexed database, Web robot, search engine, Internet]
Information versus Data Retrieval
• Data retrieval
– Determine which documents of a collection contain the keywords in the user query
– Retrieve all objects which satisfy clearly defined conditions in a regular expression or relational algebra expression
– Data has a well-defined structure and semantics
– Solution for the user of a database system
• Information retrieval
Database Management
• A specified set of attributes is used to characterize each item.
EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)
• Exact match between the attributes used in query formulations and those attached to the record.
SELECT BDATE, ADDR
FROM EMPLOYEE
WHERE NAME = 'John Smith'
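The exact-match behavior of the query above can be tried with a minimal sketch using Python's built-in sqlite3 module. Only the EMPLOYEE schema and the SELECT statement come from the slide; the column types, the two sample rows, and the helper names make_db and lookup are assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample rows; only the schema and the query come from the slide.
ROWS = [
    ("John Smith", "000-00-0001", "1965-01-09", "731 Fondren, Houston, TX", "M", 30000.0, 5),
    ("Joyce English", "000-00-0002", "1972-07-31", "5631 Rice, Houston, TX", "F", 25000.0, 5),
]

def make_db():
    """Create an in-memory database holding the EMPLOYEE relation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE EMPLOYEE "
        "(NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
    )
    conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)
    return conn

def lookup(conn, name):
    """Return (BDATE, ADDR) for every record whose NAME equals `name` exactly."""
    cursor = conn.execute("SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", (name,))
    return cursor.fetchall()

conn = make_db()
print(lookup(conn, "John Smith"))  # the attribute value matches exactly: one record
print(lookup(conn, "Smith"))       # a partial name matches nothing
```

A record is either returned or not; there is no notion of a partially relevant answer, which is the contrast with information retrieval drawn in these slides.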
Basic Concepts for IR
• Content identifiers (keywords, index terms, descriptors) characterize the stored texts.
• Retrieval is based on the degree of coincidence between the sets of identifiers attached to queries and documents.
[Figure labels: content analysis, query formulation, user task, logical view of the documents]
The User Task
• Convey the semantics of information need
• Retrieval and browsing
[Figure labels: Retrieval, Browsing, Database]
Logical View of Documents
• Full text representation
• A set of index terms
– Elimination of stop-words
– The use of stemming
– The identification of noun groups
– …
From full text to a set of index terms
[Figure: text operations applied to a document. Stages: structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; automatic or manual indexing. Resulting views: text + structure, structure (e.g., e-mail), text, full text, index terms.]
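The path from full text to index terms can be illustrated with a small sketch. It covers only accent removal, case folding, stop-word elimination, and stemming; the stop-word list and the suffix-stripping rule are simplified assumptions, not a standard stemmer, and the function names are hypothetical.

```python
import re
import unicodedata

# Illustrative stop-word list (an assumption, far shorter than a real one).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "are", "which", "by"}

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def naive_stem(token):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Full text -> list of index terms (normalized, stop-words removed, stemmed)."""
    normalized = strip_accents(text).lower()
    tokens = re.findall(r"[a-z0-9]+", normalized)
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(index_terms("The Café is indexing the documents"))
# prints ['cafe', 'index', 'document']
```

Noun-group identification and structure recognition are left out; they would need a tagger or parser rather than string rules.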
Indexing
• Indexing: assign identifiers to text items.
• Assign: manual vs. automatic indexing
• Identifiers:
– objective vs. nonobjective text identifiers: cataloging rules define, e.g., author names, publisher names, dates of publication, …
– controlled vs. uncontrolled vocabularies: instruction manuals, terminological schedules, …
– single-term vs. term phrase
The retrieval process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Searching, Ranking, Indexing, Index, DB Manager Module, Text Database. Flows: user need, text, logical view, user feedback, query, retrieved documents, ranked documents.]
Information Retrieval
• Generic information retrieval system: select and return to the user desired documents from a large set of documents in accordance with criteria specified by the user
• Functions
– document search: the selection of documents from an existing collection of documents
– document routing: the dissemination of incoming documents to appropriate users on the basis of user interest profiles
Detection Need
• Definition: a set of criteria specified by the user which describes the kind of information desired.
– queries in document search task
– profiles in routing task
• Forms
– keywords
– keywords with Boolean operators
– free text
– example documents
– ...
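As one illustration of the "keywords with Boolean operators" form, a Detection Need can be written as a nested expression and tested against the set of terms in a document. The tuple encoding and the function name matches are assumptions chosen for this sketch, not a notation used by any particular system.

```python
def matches(need, doc_terms):
    """Evaluate a Boolean Detection Need against a set of document terms.

    `need` is either a keyword string or a tuple whose first element is
    "AND", "OR", or "NOT" and whose remaining elements are sub-needs.
    """
    if isinstance(need, str):
        return need in doc_terms
    operator, *operands = need
    if operator == "AND":
        return all(matches(op, doc_terms) for op in operands)
    if operator == "OR":
        return any(matches(op, doc_terms) for op in operands)
    if operator == "NOT":
        return not matches(operands[0], doc_terms)
    raise ValueError(f"unknown operator: {operator}")

# tennis AND (ncaa OR coach) AND NOT missile
need = ("AND", "tennis", ("OR", "ncaa", "coach"), ("NOT", "missile"))
print(matches(need, {"tennis", "coach", "email"}))  # prints True
print(matches(need, {"tennis", "team"}))            # prints False
```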
search vs. routing
• The search process matches a single Detection Need against the stored corpus to return a subset of documents.
• Routing matches a single document against a group of Profiles to determine which users are interested in the document.
• Profiles stand as long-term expressions of user needs.
• Search queries are ad hoc in nature.
• A generic detection architecture can be used for both search and routing.
Search
• Retrieval of desired documents from an existing corpus
• Retrospective search is frequently interactive.
• Methods
– indexing the corpus by keyword, stem and/or phrase
– apply statistical and/or learning techniques to better understand the content of the corpus
– analyze free text Detection Needs to compare with the indexed corpus or a single document
– ...
Document Detection: Search
Document Detection: Search (Continued)
• Document Corpus
– the content of the corpus may be significant for the performance in some applications
• Preprocessing of Document Corpus
– stemming
– a list of stop words
– phrases, multi-term items
– ...
Document Detection: Search (Continued)
• Building Index from Stems
– key place for optimizing run-time performance
– cost to build the index for a large corpus
• Document Index
– a list of terms, stems, phrases, etc.
– frequency of terms in the document and corpus
– frequency of the co-occurrence of terms within the corpus
– index may be as large as the original document corpus
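A document index of the kind listed above can be sketched as three structures: postings with per-document term frequencies, corpus-wide term frequencies, and document-level co-occurrence counts. The function name build_index, the choice to count co-occurrence per document, and the toy corpus are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_index(docs):
    """Build a small index from `docs`, a dict mapping doc id -> list of index terms."""
    postings = defaultdict(dict)  # term -> {doc_id: frequency in that document}
    corpus_freq = Counter()       # term -> frequency in the whole corpus
    cooccurrence = Counter()      # (term_a, term_b) -> number of documents containing both
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        for term, freq in counts.items():
            postings[term][doc_id] = freq
            corpus_freq[term] += freq
        # Sorting makes each unordered pair appear under a single key.
        for pair in combinations(sorted(counts), 2):
            cooccurrence[pair] += 1
    return postings, corpus_freq, cooccurrence

# Hypothetical toy corpus of already preprocessed documents.
docs = {
    "d1": ["tennis", "team", "rank", "tennis"],
    "d2": ["tennis", "coach", "email"],
    "d3": ["missile", "talk"],
}
postings, corpus_freq, cooccurrence = build_index(docs)
print(postings["tennis"])                 # prints {'d1': 2, 'd2': 1}
print(corpus_freq["tennis"])              # prints 3
print(cooccurrence[("team", "tennis")])   # prints 1
```

Storing every co-occurring pair is one reason an index can grow as large as the corpus itself.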
Document Detection: Search (Continued)
• Detection Need
– the user's criteria for a relevant document
• Convert Detection Need to System Specific Query
– first transformed into a detection query, and then into a retrieval query
– detection query: specific to the retrieval engine, but independent of the corpus
– retrieval query: specific to the retrieval engine, and to the corpus
Document Detection: Search (Continued)
• Compare Query with Index
• Resultant Rank Ordered List of Documents
– Return the top 'N' documents
– Rank the list of relevant documents from the most relevant to the query to the least relevant
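The comparison and rank-ordering step can be sketched with a simple TF-IDF score. The slides do not name a weighting scheme, so TF-IDF is an assumption here, as are the function name rank and the toy corpus.

```python
import math
from collections import Counter

def rank(query_terms, docs, top_n=2):
    """Return the top_n (doc_id, score) pairs, most relevant first.

    Score = sum over query terms of (term frequency in the document)
    times log(N / document frequency). Documents scoring 0 are dropped.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = sum(
            tf[t] * math.log(n_docs / doc_freq[t])
            for t in query_terms
            if doc_freq[t]
        )
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score; break ties by document id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]

# Hypothetical toy corpus of already preprocessed documents.
docs = {
    "d1": ["tennis", "team", "rank", "tennis"],
    "d2": ["tennis", "coach", "email"],
    "d3": ["missile", "talk"],
    "d4": ["team", "coach"],
}
print([doc_id for doc_id, _ in rank(["tennis", "rank"], docs)])  # prints ['d1', 'd2']
```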
Routing
Routing (Continued)
• Profile of Multiple Detection Needs
– A Profile is a group of individual Detection Needs that describes a user's areas of interest.
– All Profiles will be compared to each incoming document (via the Profile index).
– If a document matches a Profile, the user is notified about the existence of a relevant document.
Routing (Continued)
• Convert Detection Need to System Specific Query
• Building Index from Queries
– similar to building the corpus index for searching
– the quantity of source data (Profiles) is usually much less than a document corpus
– Profiles may have more specific, structured data in the form of SGML tagged fields
Routing (Continued)
• Routing Profile Index
– The index will be system specific and will make use of all the preprocessing techniques employed by a particular detection system.
• Document to be routed
– A stream of incoming documents is handled one at a time to determine where each should be directed.
– Routing implementation may handle multiple document streams and multiple Profiles.
Routing (Continued)
• Preprocessing of Document
– A document is preprocessed in the same manner that a query would be set up in a search
– The document and query roles are reversed compared with the search process
• Compare Document with Index
– Identify which Profiles are relevant to the document
– Given a document, which of the indexed profiles match it?
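The reversed roles can be made concrete with a small routing sketch: the Profiles are indexed, and each incoming document acts as the query. Treating a Detection Need as a set of terms that must all appear in the document is a simplifying assumption, as are the function names and the sample profiles.

```python
from collections import defaultdict

def build_profile_index(profiles):
    """Index Profiles by term.

    `profiles` maps user -> list of Detection Needs, each a set of required terms.
    The result maps term -> set of (user, need_number) pairs that mention it.
    """
    index = defaultdict(set)
    for user, needs in profiles.items():
        for number, need in enumerate(needs):
            for term in need:
                index[term].add((user, number))
    return index

def route(document_terms, profiles, index):
    """Return the sorted list of users with at least one Detection Need matched."""
    doc = set(document_terms)
    candidates = set()
    for term in doc:
        candidates |= index.get(term, set())
    # A need matches when every one of its terms occurs in the document.
    matched = {user for user, number in candidates if profiles[user][number] <= doc}
    return sorted(matched)

# Hypothetical users and interests.
profiles = {
    "alice": [{"tennis", "ncaa"}, {"missile"}],
    "bob": [{"tennis", "coach"}],
}
index = build_profile_index(profiles)
print(route(["ncaa", "tennis", "rank"], profiles, index))  # prints ['alice']
```

Only the Profiles that share a term with the document are examined, which is the point of indexing the Profiles rather than scanning all of them for every document in the stream.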
Routing (Continued)
• Resultant List of Profiles
– The list of Profiles identifies which users should receive the document
Summary
• Generate a representation of the meaning or content of each object based on its description.
• Generate a representation of the meaning of the information need.
• Compare these two representations to select those objects that are most likely to match the information need.
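The three steps in this summary can be sketched with term-frequency vectors as the representation and cosine similarity as the comparison. Both choices are assumptions made for illustration; the slides do not commit to a particular representation or measure.

```python
import math
from collections import Counter

def cosine(a_terms, b_terms):
    """Cosine similarity between the term-frequency vectors of two term lists."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(need_terms, objects):
    """Return the name of the object whose representation is closest to the need."""
    return max(objects, key=lambda name: cosine(need_terms, objects[name]))

# Hypothetical object representations.
objects = {
    "d1": ["tennis", "team", "rank"],
    "d2": ["missile", "talk"],
}
print(best_match(["tennis", "coach"], objects))  # prints d1
```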
Basic Architecture of an Information Retrieval System
[Figure labels: Documents*, Queries, Document Representation, Query Representation, Comparison]
* This model can be extended to information presented in other media.
Text retrieval system
[Figure labels: user query, query formulation, representation of a query; document collections, indexing, representation of documents; matching of similarity; results of retrieval; user; relevant results]
Research Issues
• Given a set of descriptions for objects in the collection and a description of an information need, we must consider
• Issue 1
– What makes a good document representation?
– What are retrievable units and how are they organized?
– How can a representation be generated from a description of the document?
Research Issues (Continued)
• Issue 2
– How can we represent the information need, and how can we acquire this representation either from a description of the information need or through interaction with the user?
• Issue 3
– How can we compare representations to judge the likelihood that a document matches an information need?
Research Issues (Continued)
• Issue 4
– How can we evaluate the effectiveness of the retrieval process?
Text Data Mining Tasks
• Information extraction: facts, fill database
• Summarization (automatic summarization)
• Categorization
• Clustering
• Associations
• Temporal analysis of document collections
Information Extraction: Beyond Document Retrieval
• Question Answering
– Q: Who is the author of the book "The Iron Lady: A Biography of Margaret Thatcher"?
A: Hugo Young
– Q: What was the monetary value of the Nobel Peace Prize in 1989?
A: $469,000
Information Extraction
• Generic Information Extraction System
An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.
Information Extraction (Continued)
• What are the transducers or modules?
• What are their input and output?
• What structure is added?
• What information is lost?
• What is the form of the rules?
• How are the rules applied?
• How are the rules acquired?
Example: Parser
• transducer: parser
• input: the sequence of words or lexical items
• output: a parse tree
• information added: predicate-argument and modification relations
• information lost: none
• rule form: unification grammars
• application method: chart parser
• acquisition method: manual
Modules
• Text Zoner: turn a text into a set of text segments
• Preprocessor: turn a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes
• Filter: turn a set of sentences into a smaller set of sentences by filtering out the irrelevant ones
• Preparser: take a sequence of lexical items and try to identify various reliably determinable, small-scale structures
Modules (Continued)
• Parser: input a sequence of lexical items and perhaps small-scale structures (phrases) and output a set of parse tree fragments, possibly complete
• Fragment Combiner: turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence
• Semantic Interpreter: generate a semantic structure or logical form from a parse tree or from parse tree fragments
Modules (Continued)
• Lexical Disambiguation: turn a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates
• Co-reference Resolution, or Discourse Processing: turn a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text
• Template Generator: derive the templates from the semantic structures
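The idea of a cascade can be sketched with the first three modules only (Text Zoner, Preprocessor, Filter), each a function whose output feeds the next. The segmentation rules, the single lexical attribute, the keyword-based relevance test, and all function names are simplifying assumptions; real systems use far richer rules, and the later modules need grammars and semantic resources.

```python
import re

def text_zoner(text):
    """Turn a text into segments (here: paragraphs separated by blank lines)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]

def preprocessor(segment):
    """Turn a segment into sentences, each a list of lexical items.

    A lexical item is a word plus its attributes; only capitalization is kept here.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", segment) if s]
    return [
        [{"word": w, "capitalized": w[0].isupper()} for w in re.findall(r"\w+", s)]
        for s in sentences
    ]

def sentence_filter(sentences, keywords):
    """Keep only sentences that contain at least one keyword (structure kept, text lost)."""
    return [s for s in sentences if any(item["word"].lower() in keywords for item in s)]

def cascade(text, keywords):
    """Run the three modules in sequence over every segment of the text."""
    kept = []
    for segment in text_zoner(text):
        kept.extend(sentence_filter(preprocessor(segment), keywords))
    return kept

# Hypothetical input text and keywords.
sample = "The talks resumed today. The weather was fine.\n\nWashington will sell missiles to Seoul."
result = cascade(sample, {"talks", "missiles"})
print([[item["word"] for item in sentence] for sentence in result])
# prints [['The', 'talks', 'resumed', 'today'], ['Washington', 'will', 'sell', 'missiles', 'to', 'Seoul']]
```

Each step adds structure (segments, sentences, lexical items) and the filter discards information judged irrelevant, mirroring the definition of a generic information extraction system given earlier.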
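<DOC><DOCID> NTU-AIR_LAUNCH- 中國時報 -19970612-002 </DOCID><DATASET> Air Vehicle Launch </DATASET><DD> 1997/06/12 </DD><DOCTYPE> newspaper report </DOCTYPE><DOCSRC> 中國時報 (China Times) </DOCSRC><TEXT>
[Compiled by this newspaper from foreign wire reports, New York and Washington, the 11th] On the day after Washington announced its first-ever sale of "Stinger" shoulder-fired air-defense missiles to South Korea, the United States and North Korea today resumed long-delayed talks in New York. The talks, scheduled to last three days, will focus on North Korea's missile development, including reports that North Korea is preparing to deploy the "Rodong-1" long-range missile, whose range covers almost all of Japan.
U.S. State Department spokesman Burns (柏恩斯) said: "On the issue of North Korean missile proliferation, the U.S. side does have a number of concerns." U.S. officials have also long suspected that North Korea is exporting missiles to Iran and Syria, and hope that Pyongyang will join the international convention banning the proliferation of such weapons. U.S. officials have informed North Korea that if it wishes to establish normal diplomatic relations with the United States, it must reduce its missile exports.
These talks on North Korea's missile program are follow-up negotiations to the first talks, which the two sides held in Berlin, Germany, in April 1996. At that meeting the United States asked North Korea to stop producing, testing, and selling missiles to other countries, in particular Syria and Iran.
U.S. Deputy Assistant Secretary of State Einhorn (艾恩宏) and Ri Hyong-chol (李衡哲), director of the external affairs bureau of the North Korean foreign ministry, are the two sides' chief negotiators. The talks are scheduled to end on the 13th.
Burns said: "The U.S. side is very concerned about all missile issues involving North Korea itself, or North Korea together with Communist China, Iran, or other countries. We believe that holding talks with them on this matter is very important."
To strengthen the self-defense capability of the South Korean army, the United States announced yesterday that it is prepared to sell South Korea 1,065 Stinger missiles and other weapons worth US$307 million, saying that the deal will not worsen tensions on the Korean Peninsula.
The Pentagon said: "This sale of equipment and support will not affect the basic military balance in the region."
The State Department also expressed full support for the deal, which includes 213 launchers, support equipment, spare parts, and training.
Burns said: "This deal has the full support of everyone in the administration, and it is consistent with our policy on the Korean Peninsula." He stressed: "Our first priority is the defense of South Korea."
If Congress approves, this will be Washington's first sale of air-defense missiles to South Korea.
</TEXT></DOC>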
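<ID="3"> the 11th (十一日)
<ID="4" REF="3"> today (今天)
<ID="5" REF="3"> the day after the sale of "Stinger" shoulder-fired air-defense missiles to South Korea (出售「刺針」肩射防空飛彈給南韓的第二天)
<ID="63"> long-delayed talks (延擱已久的會談)
<ID="66" REF="63"> follow-up negotiations to the first talks held in Berlin, Germany, in April 1996 (一九九六年四月在德國柏林舉行的首度會談的後續談判)
<ID="65" REF="63"> these talks on North Korea's missile program (這項有關北韓飛彈計劃的會談)
<ID="70" REF="65"> talks (會談)
<ID="69" REF="65"> talks (會談)
<ID="64" REF="63"> the talks scheduled to last three days (這項預定三天的會談)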
The Advanced Research and Development Activity (ARDA)
• a joint activity of the Intelligence Community (IC) and the Department of Defense (DOD) in late November 1998
• the intelligence community's (IC) center for conducting advanced research and development related to extracting intelligence from, and providing security for, information stored, transmitted, or manipulated by electronic means
ARDA R&D Programs
• Information Exploitation
– Pulling Information
– Pushing Information
– Visualizing and Navigating Information
• Quantum Information Science & Photonics
• Digital Network Intelligence
Pulling Information
• Providing answers to complex, multifaceted questions that analysts pose
• The analyst seeks to "pull" the answer out of multiple, very large, heterogeneous data sources that may physically reside in diverse locations
Pulling Information (Continued)
• Accepting complex questions in a form natural to the analyst.
• Questions may include judgment terms, and an acceptable answer may need to be based upon conclusions and decisions reached by the system and may require the summarization, fusion, and synthesis of information drawn from multiple sources.
• Translating analytic questions into multiple queries appropriate to the various data sets to be searched.
• Finding relevant information in distributed, multimedia, multilingual, multi-agency data sets.
• Analyzing, fusing, and summarizing information into a coherent answer.
• Providing the answer to the analyst in the form that he/she wants.
Pushing Information
• Providing information that analysts did not ask for, drawn from multiple, very large, heterogeneous data sources
• The system discovers information through profiling, clustering, pattern recognition, data mining, or some other means and "pushes" this information to analysts who the system determines might have an interest.
Pushing Information (Continued)
• Profiling and blind clustering of new data.
• Detecting anomalies, patterns and changes in large volumes of data.
• Analyzing the nature and description of the anomalies, patterns, and changes.
• Alerting the appropriate analyst(s) of the newly discovered information.
Topics
• Introduction to Information Retrieval and Extraction
• Modeling
• Retrieval Evaluation
• Query Languages
• Query Operations
• Text and Multimedia Languages and Properties
• Text Operations
• Indexing and Searching
Topics (Continued)
• User Interfaces and Visualization
• Multimedia IR: Models and Languages
• Multimedia IR: Indexing and Searching
• Searching the Web
• Digital Libraries
• Information Extraction (Jerry R. Hobbs)
• Text Data Mining (Marti Hearst)
[Figure labels: Text IR; Retrieval Models and Evaluation; Improvements on Retrieval; Efficient Processing; Interfaces & Visualization; Multimedia Modeling & Searching; Human-Computer Interaction for IR; Multimedia IR; Applications for IR; Bibliographic Systems; The Web; Digital Libraries]
Information Sources
• Books
– Ricardo Baeza-Yates and Berthier Ribeiro-Neto (1999) Modern Information Retrieval, Addison-Wesley. Taiwan importer: 華通書坊, phone: (03)5720317
– Salton, G. (1989) Automatic Text Processing. The Transformation, Analysis and Retrieval of Information by Computer. Reading, MA: Addison-Wesley.
– Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992) Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.
– Cheong, F. (1996) Internet Agents: Spiders, Wanderers, Brokers, and Bots. Indianapolis, IN: New Riders.
– Karen Sparck Jones and Peter Willett (1997) Readings in Information Retrieval, CA: Morgan Kaufmann Publishers.
Information Sources (Continued)
• Conference Proceedings
– ACM SIGIR Annual International Conference on Research and Development in Information Retrieval (1978-)
– ACM International Conference on Digital Libraries
– ACM Conference on Information and Knowledge Management
– Text Retrieval Conference
Information Sources (Continued)
• Journals
– ACM Transactions on Information Systems
– Information Processing and Management (formerly Information Storage and Retrieval)
– Journal of the American Society for Information Science (formerly American Documentation)
– Journal of Documentation
– Information Systems
– Information Retrieval
– Knowledge and Information Systems