Multimedia Database Management System - Chapter 4
Text Document Indexing and Retrieval
Rachmat Wahid Saleh Insani, S.Kom
Objectives
• Main differences between IR systems and DBMSs.
• General automatic indexing process and Boolean retrieval model.
• Vector space, probabilistic, and cluster-based retrieval models.
• Nontraditional IR methods.
• Performance measurement of IR.
• Performance comparison of different retrieval techniques.
• WWW.
Differences Between IR Systems and DBMSs
• Indexing and retrieval (IR) system:
- No structured records and no fixed attributes.
- Retrieval depends on the degree of coincidence (partial match) between the query and each document.
- A retrieved item may not be relevant.
• DBMS:
- Each record has a set of attributes.
- Retrieval is based on exact match.
- A retrieved item is definitely relevant.
Basic Document Retrieval Process
Basic Boolean Retrieval Model
• Documents indexed by a set of keywords.
• Queries are represented by a set of keywords and logical operators.
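The Boolean model above can be sketched with Python sets. This is a minimal illustration, not the implementation of any particular IR system; the document collection and the helper names (`DOCS`, `matching`, `boolean_and`, `boolean_or`, `boolean_not`) are assumptions made for the example.

```python
# Toy collection: each document is indexed by a set of keywords (hypothetical data).
DOCS = {
    "d1": {"information", "retrieval", "index"},
    "d2": {"database", "index"},
    "d3": {"information", "database"},
}

def matching(term):
    """Return the ids of all documents indexed by `term`."""
    return {doc_id for doc_id, keywords in DOCS.items() if term in keywords}

def boolean_and(a, b):
    """Documents present in both result sets."""
    return a & b

def boolean_or(a, b):
    """Documents present in either result set."""
    return a | b

def boolean_not(a):
    """Documents of the collection that are absent from the result set."""
    return set(DOCS) - a

# Query: information AND (retrieval OR database)
result = boolean_and(
    matching("information"),
    boolean_or(matching("retrieval"), matching("database")),
)
print(sorted(result))
```

A document either satisfies the Boolean expression or it does not, so the result is an unranked set.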
File Structure
A retrieved document is called a record. A record may consist of many sentences and terms, and a term may occur in many records.
File Structure
• File structures used in IR systems include:
- Flat file. One or more documents are stored in a file.
- Inverted file. Each term has a separate index that stores the identifiers of all records containing that term.
- Signature file. Contains bit patterns that represent documents.
Inverted Files
In an inverted file, for each term a separate index is constructed that stores the record identifiers for all records
containing that term.
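As a rough sketch of this structure, the following builds an inverted file from a few records. The record texts and the function name `build_inverted_file` are hypothetical, and terms are obtained by simple whitespace splitting, which is an assumption of the example.

```python
from collections import defaultdict

# Hypothetical records: record identifier -> text.
RECORDS = {
    "R1": "information retrieval systems index text",
    "R2": "database systems store structured records",
    "R3": "text retrieval uses an inverted file",
}

def build_inverted_file(records):
    """Map each term to the sorted list of identifiers of records containing it."""
    inverted = defaultdict(list)
    for record_id in sorted(records):
        # set() ensures a record is listed once per term, however often the term occurs.
        for term in set(records[record_id].split()):
            inverted[term].append(record_id)
    return dict(inverted)

inverted_file = build_inverted_file(RECORDS)
print(inverted_file["retrieval"])
```

Looking up a query term then costs one dictionary access instead of a scan over every record.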
Extension of Inverted File Operation
• So far we have ignored two important factors: term positions and term weights.
• The relationship between two or more terms can be strengthened by adding nearness parameters: within sentence and adjacency.
• For example, "term1 within sentence term2" means that term1 and term2 occur in a common sentence of a retrieved record.
• For example, "term1 adjacency term2" means that term1 and term2 occur adjacent to each other in the retrieved documents.
General Structure of Extended Inverted File Operation
For example, if an inverted file has the following entries:
• information: R99, 10, 8, 3; R155, 15, 3, 6; R166, 2, 3, 1
• retrieval: R77, 9, 7, 2; R99, 10, 8, 4; R166, 10, 2, 5
Which record will be retrieved if the query is “information within sentence retrieval”?
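The question can be explored with a short sketch. The slide does not state what the three numbers after each record identifier mean; the code below assumes they are paragraph number, sentence number, and word position, in that order. The function names `within_sentence` and `adjacency` are hypothetical, and adjacency is assumed to hold in either word order.

```python
# Entries copied from the example; the field layout
# (record, paragraph, sentence, word position) is an ASSUMPTION.
INVERTED = {
    "information": [("R99", 10, 8, 3), ("R155", 15, 3, 6), ("R166", 2, 3, 1)],
    "retrieval": [("R77", 9, 7, 2), ("R99", 10, 8, 4), ("R166", 10, 2, 5)],
}

def within_sentence(term1, term2, inverted):
    """Records in which both terms occur in the same sentence of the same paragraph."""
    sentences1 = {(r, p, s) for r, p, s, _ in inverted[term1]}
    return sorted({r for r, p, s, _ in inverted[term2] if (r, p, s) in sentences1})

def adjacency(term1, term2, inverted):
    """Records in which the terms occupy neighbouring word positions in one sentence."""
    positions1 = set(inverted[term1])
    return sorted({
        r for r, p, s, w in inverted[term2]
        if (r, p, s, w - 1) in positions1 or (r, p, s, w + 1) in positions1
    })

print(within_sentence("information", "retrieval", INVERTED))
print(adjacency("information", "retrieval", INVERTED))
```

Under the assumed layout, only R99 holds both terms in the same sentence (paragraph 10, sentence 8). R166 contains both terms, but in different paragraphs.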
Term Operation and Automatic Indexing
• A document contains many words, but not every word is useful for indexing, e.g., function words such as “of”, “the”, and “a”.
• Terms are processed with several operations, e.g., stemming, thesaurus lookup, and weighting.
• Stemming: conflating related word forms into a common stem.
• Thesaurus: a list of synonymous terms and, sometimes, the relationships among them.
• Weighting: assigning each term a weight that reflects its importance in a document.
Weighting Formula
• $W_{ij}$: weight of term $j$ in document $i$.
• $tf_{ij}$: frequency of term $j$ in document $i$.
• $N$: total number of documents.
• $df_j$: number of documents that contain term $j$.
$$W_{ij} = tf_{ij} \cdot \log\left(\frac{N}{df_j}\right)$$
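A direct transcription of the formula, as a minimal sketch. The slide does not specify the base of the logarithm; the natural logarithm is assumed here, and the function name `term_weight` is hypothetical.

```python
import math

def term_weight(tf, n_docs, df):
    """W_ij = tf_ij * log(N / df_j); natural log is an assumption of this sketch."""
    return tf * math.log(n_docs / df)

# A term that occurs in every document carries no discriminating power.
print(term_weight(tf=3, n_docs=100, df=100))
# A rare term that is frequent in this document gets a high weight.
print(term_weight(tf=2, n_docs=1000, df=10))
```

The weight grows with the term's frequency in the document and shrinks as the term appears in more documents of the collection.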
Automatic Document Indexing
• The automatic indexing process consists of the following steps:
- Identify words in the title, abstract, or document.
- Eliminate stop words.
- Identify synonyms.
- Stem words using a stemming algorithm.
- Count stem frequencies in each document.
- Calculate term weights.
- Create the inverted file based on the above terms and weights.
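The steps above can be sketched end to end as follows. Everything concrete here is an assumption made for illustration: the stop list, the synonym table, the naive suffix-stripping `stem` function (a stand-in for a real stemming algorithm), and the use of the natural logarithm in the weights.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"of", "the", "a", "and", "in", "is"}         # tiny illustrative stop list
SYNONYMS = {"documents": "records", "document": "record"}  # hypothetical thesaurus

def stem(word):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Identify words, drop stop words, map synonyms, stem, and count stems."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    words = [SYNONYMS.get(w, w) for w in words]
    return Counter(stem(w) for w in words)

def build_weighted_inverted_file(docs):
    """Return term -> {doc_id: weight}, with weight = tf * log(N / df)."""
    frequencies = {doc_id: index_terms(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    df = Counter(term for counts in frequencies.values() for term in counts)
    inverted = defaultdict(dict)
    for doc_id, counts in frequencies.items():
        for term, tf in counts.items():
            inverted[term][doc_id] = tf * math.log(n_docs / df[term])
    return dict(inverted)

DOCS = {
    "d1": "Indexing of the documents",
    "d2": "Retrieval of records in a database",
}
inverted_file = build_weighted_inverted_file(DOCS)
print(inverted_file)
```

In this toy collection the stem "record" appears in both documents, so its weight is zero, while stems unique to one document receive a positive weight.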
Vector Space Retrieval Model
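A fixed set of index terms is used to represent documents and queries. Document $D_i$ and query $Q_j$ are represented as weight vectors:
$$D_i = [T_{i1}, T_{i2}, T_{i3}, \ldots, T_{ik}, \ldots, T_{iN}]$$
$$Q_j = [Q_{j1}, Q_{j2}, Q_{j3}, \ldots, Q_{jk}, \ldots, Q_{jN}]$$
• $T_{ik}$: weight of term $k$ in document $i$.
• $Q_{jk}$: weight of term $k$ in query $j$.
• $N$: total number of index terms used in documents and queries.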
Vector Space Retrieval Model
• To compensate for differences in document and query sizes, the similarity between document $D_i$ and query $Q_j$ is calculated as follows:
$$S(D_i, Q_j) = \frac{\sum_{k=1}^{N} T_{ik} \cdot Q_{jk}}{\sqrt{\sum_{k=1}^{N} T_{ik}^{2} \cdot \sum_{k=1}^{N} Q_{jk}^{2}}}$$
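The normalized similarity can be sketched as a cosine measure over two equal-length weight vectors. The function name `similarity` and the choice to return 0.0 for an all-zero vector are assumptions of this example.

```python
import math

def similarity(doc, query):
    """Dot product of the weight vectors, normalized by the product of their lengths."""
    numerator = sum(t * q for t, q in zip(doc, query))
    denominator = math.sqrt(sum(t * t for t in doc) * sum(q * q for q in query))
    # An all-zero vector has no direction; treat it as matching nothing.
    return numerator / denominator if denominator else 0.0

query = [1, 0, 1]
print(similarity([1, 0, 1], query))  # same direction as the query
print(similarity([2, 0, 2], query))  # same direction, different size
print(similarity([0, 1, 0], query))  # no terms in common
```

Because of the normalization, a document whose term weights are all scaled up scores the same as the shorter original, which is the size compensation the formula is meant to provide.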
Relevance Feedback Techniques
• Relevance feedback takes users' judgements about the relevance of retrieved documents and uses them to modify the query or the document indexes. The query is modified according to the following rules:
- Terms that occur in relevant documents are added to the original query, or their weights are increased.
- Terms that occur in irrelevant documents are deleted from the query, or their weights are reduced.
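One possible reading of these query-modification rules, as a minimal sketch. The function name `modify_query`, the fixed `step` size, and the choice to delete a term once its weight reaches zero are all assumptions; the slide gives no specific amounts.

```python
def modify_query(query, relevant_terms, irrelevant_terms, step=0.5):
    """Return a new query (term -> weight) adjusted by relevance feedback.

    `step` is a hypothetical adjustment amount.
    """
    updated = dict(query)
    # Terms from relevant documents: add them, or increase their weight.
    for term in relevant_terms:
        updated[term] = updated.get(term, 0.0) + step
    # Terms from irrelevant documents: reduce their weight, deleting at zero.
    for term in irrelevant_terms:
        if term in updated:
            updated[term] -= step
            if updated[term] <= 0:
                del updated[term]
    return updated

query = {"retrieval": 1.0, "database": 0.5}
print(modify_query(query, ["retrieval", "index"], ["database"]))
```

The original query is left untouched, so the feedback step can be repeated or undone.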
Relevance Feedback Techniques
• Document index terms are modified using query terms, so the changes also affect other users. Document modification uses the following rules, based on relevance feedback:
- Terms in the query but not in the user-judged relevant documents are added to the document index list with an initial weight.
- Weights of index terms that are in both the query and the relevant documents are increased by a certain amount.
- Weights of index terms that are in the relevant documents but not in the query are decreased by a certain amount.
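The three document-modification rules can be sketched for a single document that the user judged relevant. The function name `modify_document_index` and the values of `initial` and `delta` are hypothetical; the slide only says "an initial weight" and "a certain amount".

```python
def modify_document_index(doc_index, query_terms, initial=0.5, delta=0.25):
    """Return a new index (term -> weight) for one user-judged relevant document."""
    updated = dict(doc_index)
    for term in query_terms:
        if term in doc_index:
            # In both the query and the document: increase the weight.
            updated[term] = doc_index[term] + delta
        else:
            # In the query only: add with an initial weight.
            updated[term] = initial
    for term in doc_index:
        if term not in query_terms:
            # In the document only: decrease the weight, never below zero.
            updated[term] = max(doc_index[term] - delta, 0.0)
    return updated

doc_index = {"retrieval": 1.0, "database": 0.5}
print(modify_document_index(doc_index, {"retrieval", "index"}))
```

Since the stored index itself changes, later queries from any user see the adjusted weights.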
Traditional IR Method Issues
• Individual words do not contain all the information encoded in language.
• One word may have multiple meanings.
• A number of words may have a similar meaning.
• Phrases have meanings beyond the sum of individual words.
Ways to Improve IR Performance
• Natural language processing.
• Knowledge-based IR models.
Performance Measurement
• Information retrieval performance is measured using three parameters:
- Retrieval speed.
- Recall.
- Precision.
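Recall and precision can be computed from the set of retrieved items and the set of relevant items, using the standard definitions: recall is the fraction of relevant items that were retrieved, and precision is the fraction of retrieved items that are relevant. The function name `recall_precision` and the sample data are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were actually retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Four items retrieved, three items relevant, two of them in common.
recall, precision = recall_precision(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"recall={recall:.2f} precision={precision:.2f}")
```

The two measures usually trade off against each other: retrieving more items tends to raise recall and lower precision.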
Performance Comparison Among Different IR Techniques
• Automatic indexing is as good as manual indexing.
• Retrieval performance of partial match techniques is better than that of exact match techniques.
• The use of relevance feedback improves retrieval performance.
• Significant user input produces higher retrieval performance than no or limited user input.
• The use of domain knowledge and user profiles significantly improves retrieval performance.
World Wide Web
The WWW is a collection of interlinked documents distributed around the world.
Introduction to WWW
• Hypertext: an information management approach in which data is stored in a network of nodes connected by computer-supported links. A hypertext document is made up of a number of nodes and links.
• Hypermedia: an extension of hypertext in which anchors and nodes can be any type of media, e.g., graphics, audio, or video.
• WWW is the integration of hypermedia and the Internet.
Architecture of WWW
Resource Discovery
• Resource discovery is the process of finding and retrieving information on the Internet.
• Locations of documents on the WWW and the Internet are specified using Uniform Resource Locators (URLs). The general format is protocol://server-name[:port]/document-name.
• There are two ways to find and retrieve documents on the Internet:
- Organizing and browsing
- Searching
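The URL format protocol://server-name[:port]/document-name given above can be taken apart with Python's standard `urllib.parse.urlsplit`. The URL itself is a made-up example.

```python
from urllib.parse import urlsplit

url = "http://www.example.com:8080/docs/chapter4.html"  # hypothetical URL
parts = urlsplit(url)

protocol = parts.scheme                # "protocol" in the slide's format
server_name = parts.hostname           # "server-name"
port = parts.port                      # optional "[:port]"; None when omitted
document_name = parts.path.lstrip("/") # "document-name"

print(protocol, server_name, port, document_name)
```

When the port is left out of the URL, `parts.port` is `None` and the client falls back to the protocol's default port.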
Major Differences Between IR Systems and WWW Search Engines
• WWW documents are distributed around the Internet; IR system documents are centrally located.
• The number of WWW documents is much greater than the number of IR system documents.
• WWW documents are more dynamic and heterogeneous.
• WWW documents are structured with HTML; IR system documents are normally plain text.
• WWW search engines are used by more users, and more frequently, than IR systems.
General Structure of WWW Search Engine
Spider
It visits a Web page, reads it, and then follows links to other pages within the site. The spider may return to the site on a regular basis, such as every month or two, to look for changes.
Index
A collection containing a copy of every Web page that the spider finds. If a Web page changes, the index is updated with the new information.
Search Engine
A program that sifts through the millions of pages recorded in the index to find matches to a search and ranks them in order of what it estimates to be most relevant.
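The three components (spider, index, search program) can be imitated on a tiny in-memory "Web". This is only a toy sketch: the pages in `WEB` are invented, no network access takes place, and ranking by raw query-term counts is an assumption standing in for whatever relevance estimate a real engine uses.

```python
from collections import deque

# Hypothetical in-memory site: page URL -> (page text, outgoing links).
WEB = {
    "/home": ("information retrieval on the web", ["/ir", "/db"]),
    "/ir": ("retrieval models rank retrieval results", ["/home"]),
    "/db": ("database systems use exact match", []),
    "/orphan": ("no page links here", []),
}

def spider(start):
    """Visit pages breadth-first by following links; keep a copy of each page found."""
    index, queue = {}, deque([start])
    while queue:
        url = queue.popleft()
        if url in index or url not in WEB:
            continue  # already visited, or a dead link
        text, links = WEB[url]
        index[url] = text
        queue.extend(links)
    return index

def search(index, query):
    """Rank indexed pages by how many times the query terms occur in them."""
    terms = query.lower().split()
    scores = {}
    for url, text in index.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score:
            scores[url] = score
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = spider("/home")
print(search(index, "retrieval"))
```

The page `/orphan` never enters the index because no visited page links to it, which illustrates why a spider that only follows links cannot see unlinked pages.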
Search Engine Example