Phrase Based Indexing By Bala Abirami

Phrase Based Indexing and Information Retrieval

Slides on phrase-based indexing and information retrieval.

Phrase Based Indexing

By Bala Abirami

• Introduction to Phrase Based Indexing

• What is Phrase Based Indexing?

• Background of Invention

• Summary of Invention

• Spam Detection

Introduction

• An information retrieval system uses phrases to index, retrieve, organize and describe documents.

• Phrase based indexing was described in a US patent application submitted by Google engineer Anna Lynn Patterson.

• Application filed: July, 2004

• Published: January, 2006

Background of Invention

• Information retrieval systems, generally called search engines, are now an essential tool for finding information in large-scale, diverse, and growing corpuses such as the Internet.

• A document is retrieved in response to a query containing a number of query terms, typically based on having some number of query terms present in the document.

• The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like.

Cont…

• Concepts are often expressed in phrases, such as "Australian Shepherd," "President of the United States," or "Sundance Film Festival".

• Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases.

Summary

An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection.

1. Identifying Phrases and Related Phrases

2. Indexing Documents with respect to Phrases

3. Ranking Documents with respect to Phrases

4. Creating a Description for the Document

5. Elimination of Duplicate Documents

Identifying Phrases and Related Phrases

• Phrases are identified based on their ability to predict the presence of other phrases in a document.

• The system looks for phrases that have frequent and/or distinctive usage.

• A prediction measure is used for identifying related phrases.

• The prediction measure relates the actual co-occurrence rate of two phrases to their expected co-occurrence rate.

• Information gain = actual co-occurrence rate / expected co-occurrence rate

Cont…

• Two phrases are related to each other when the prediction measure exceeds the prediction threshold.

• Example: the phrase "President of the United States" predicts the related phrases "White House" and "George Bush", among others.
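
The prediction measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: the function name `information_gain`, the toy document sets, and the threshold value 1.5 are all made up for the example, and the expected co-occurrence rate is assumed to be the product of the two phrases' individual document rates (what independence would predict).

```python
def information_gain(docs_with_a, docs_with_b, total_docs):
    """Ratio of actual to expected co-occurrence rate for two phrases."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    actual = len(docs_with_a & docs_with_b) / total_docs
    # Assumption: under independence, expected rate = P(a) * P(b).
    expected = (len(docs_with_a) / total_docs) * (len(docs_with_b) / total_docs)
    if expected == 0:
        return 0.0
    return actual / expected


# Toy collection of 10 documents identified by integer IDs (made-up data).
TOTAL_DOCS = 10
president = {1, 2, 3, 4}      # docs containing "President of the United States"
white_house = {2, 3, 4, 5}    # docs containing "White House"
shepherd = {6, 7, 8, 9, 10}   # docs containing "Australian Shepherd"

PREDICTION_THRESHOLD = 1.5    # hypothetical value chosen for this example

gain_related = information_gain(president, white_house, TOTAL_DOCS)
gain_unrelated = information_gain(president, shepherd, TOTAL_DOCS)

print(gain_related > PREDICTION_THRESHOLD)    # the pair counts as related
print(gain_unrelated > PREDICTION_THRESHOLD)  # the pair does not
```

In this toy data the two phrases co-occur in 3 of 10 documents, while independence would predict 0.4 × 0.4 = 0.16, so the gain is well above 1.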

Indexing documents based on related Phrases

• An information retrieval system indexes documents in the document collection by the valid or good phrases.

• Posting List = documents that contain the phrase

• Second List = data indicating which of the related phrases of the given phrase are also present in each document that contains the given phrase
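
One way to picture the two lists is a dictionary keyed by phrase, whose entries map each document ID to the related phrases found in that document. The sketch below is an assumption-laden illustration: the `RELATED` table, the `build_index` function, and the plain substring matching are hypothetical and stand in for whatever the real system does.

```python
from collections import defaultdict

# Hypothetical related-phrase table; in practice it would come from the
# phrase identification step.
RELATED = {
    "president of the united states": ["white house", "george bush"],
}


def build_index(documents, related=RELATED):
    """Return {phrase: {doc_id: [related phrases also present in that doc]}}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for phrase, related_phrases in related.items():
            if phrase in lowered:
                # The key doc_id is the posting-list entry; the value is the
                # "second list" data for that document.
                index[phrase][doc_id] = [r for r in related_phrases if r in lowered]
    return dict(index)


docs = {
    1: "The President of the United States spoke at the White House.",
    2: "George Bush was President of the United States.",
    3: "The Australian Shepherd is a herding dog.",
}
index = build_index(docs)
posting_list = sorted(index["president of the united states"])
print(posting_list)
print(index["president of the united states"])
```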

Ranking

• Ranking of documents is based on two factors:

1. Ranking documents based on contained phrases

2. Ranking documents based on anchor phrases

• Document Score = Body Hit Score + Anchor Hit Score

• For example: Body Hit Score = 0.30, Anchor Hit Score = 0.70

• Document Score = 0.30 + 0.70 = 1.00
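
The scoring rule can be applied to a handful of documents to produce a ranking. The sketch below assumes the two partial scores are already available; the document names and score values are made up, and `document_score` is a hypothetical helper that simply adds the two scores as the slide does.

```python
def document_score(body_hit_score, anchor_hit_score):
    """Combine the two partial scores as on the slide: a plain sum."""
    return body_hit_score + anchor_hit_score


# Made-up partial scores: (body hit score, anchor hit score) per document.
candidates = {
    "doc_a": (0.30, 0.70),
    "doc_b": (0.50, 0.20),
    "doc_c": (0.10, 0.40),
}

# Highest combined score first.
ranked = sorted(
    candidates,
    key=lambda name: document_score(*candidates[name]),
    reverse=True,
)
print(ranked)
```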

Phrase Extension

• The information retrieval system is also adapted to use the phrases when searching for documents in response to a query.

• A user may enter an incomplete phrase in a search query, such as "President of the".

• Incomplete phrases such as these may be identified and replaced by a phrase extension, such as "President of the United States."
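
Phrase extension can be illustrated as a prefix lookup over known phrases. This is a minimal sketch, not the system's actual method: the `KNOWN_PHRASES` table, its usage counts, and the rule of picking the most frequent matching phrase are all assumptions made for the example.

```python
# Hypothetical table of known phrases with a made-up usage count for each.
KNOWN_PHRASES = {
    "president of the united states": 120,
    "president of the senate": 15,
    "sundance film festival": 40,
}


def extend_phrase(query, known_phrases=KNOWN_PHRASES):
    """Return the most frequent known phrase that starts with the query words."""
    words = query.lower().split()
    if not words:
        return query
    matches = [
        phrase
        for phrase in known_phrases
        if len(phrase.split()) > len(words)
        and phrase.split()[: len(words)] == words
    ]
    if not matches:
        return query  # nothing to extend; leave the query unchanged
    return max(matches, key=known_phrases.get)


print(extend_phrase("President of the"))
```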

Descriptions for Documents

• Phrase information is used to create a description of a document.

• The system identifies the query phrases, related phrases, and phrase extensions present in each sentence and keeps a count for each sentence.

• It ranks the sentences based on the count.

• It selects some number of top-ranking sentences as the description and includes it in the search results.
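
The sentence-counting steps can be sketched as follows. The function name `describe`, the regular-expression sentence splitter, and the sample text are assumptions for illustration; the phrase list stands in for the query phrases, related phrases, and phrase extensions the system would supply.

```python
import re


def describe(text, phrases, max_sentences=2):
    """Pick the sentences that mention the most query-related phrases."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        lowered = sentence.lower()
        count = sum(lowered.count(p.lower()) for p in phrases)
        scored.append((count, position, sentence))
    # Highest count first; ties keep document order.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    # Present the selected sentences in their original order.
    return " ".join(s for _, _, s in sorted(top, key=lambda item: item[1]))


text = (
    "The weather was mild. "
    "The President of the United States lives in the White House. "
    "George Bush visited the White House. "
    "Tickets sold out quickly."
)
phrases = ["President of the United States", "White House", "George Bush"]
summary = describe(text, phrases)
print(summary)
```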

Eliminating Duplicate Documents

• Duplicate documents are identified and eliminated while crawling a document or when processing a search query.

• The description of every document is stored, in association with that document, in a hash table.

• The system compares the description of a newly crawled page with the values stored in the hash table. A match indicates that the current document is a duplicate.

• The system keeps the document with the higher PageRank or greater document significance and removes the duplicate, which will not appear in future search results for any query.
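
A hash-table check of this kind might look like the sketch below. It is an illustration only: the choice of SHA-256, the normalization step, the `add_document` helper, and the example URLs and rank values are all assumptions, not details taken from the slides.

```python
import hashlib


def description_hash(description):
    """Hash a normalized description (SHA-256 is an arbitrary choice here)."""
    normalized = " ".join(description.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def add_document(table, url, description, page_rank):
    """Insert a document; return the URL dropped as a duplicate, if any."""
    key = description_hash(description)
    existing = table.get(key)
    if existing is None:
        table[key] = (url, page_rank)
        return None
    if page_rank > existing[1]:
        # The new page ranks higher, so it replaces the stored one.
        table[key] = (url, page_rank)
        return existing[0]
    return url  # the new page is the duplicate and is not stored


seen = {}
first = add_document(
    seen, "http://example.com/a",
    "President of the United States. White House.", 0.4,
)
second = add_document(
    seen, "http://example.org/copy",
    "president of the  united states. white house.", 0.7,
)
print(first, second)
```

Here the second page has the same description after normalization and a higher rank, so the first URL is the one dropped.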

Functions of Indexing System

• Identifies phrases in documents

• Indexes documents according to their phrases by accessing various websites

Functions of Front End Server

• Receives queries from a user

• Provides those queries to the search system

Functions of Searching System

• Searches for documents relevant to the search query

• Identifies the phrases in the search query

• Ranks the documents

Functions of Presentation System

• Modifies the search results, including removing duplicate content.

• Generating topical descriptions of documents and provides modified

Spam Detection

• “Spam” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages”.

• Pages containing specific words and phrases that advertisers might be interested in are often called "honeypots," and are created for search engines to display along with paid advertisements.

Cont…

• A phrase based indexing system knows the number of related phrases in a document.

• A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection.

• A spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases.
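
The related-phrase count lends itself to a simple threshold test. In the sketch below, the cutoff of 100 is an assumption taken from the lower end of the spam range quoted above, and the function names, the synthetic phrase list, and the substring matching are hypothetical.

```python
def count_related_phrases(text, related_phrases):
    """Count how many distinct related phrases occur in the text."""
    lowered = text.lower()
    return sum(1 for phrase in related_phrases if phrase.lower() in lowered)


def looks_like_spam(text, related_phrases, max_related=100):
    """Flag a page whose related-phrase count exceeds the assumed cutoff."""
    return count_related_phrases(text, related_phrases) > max_related


# Synthetic related phrases, zero-padded so none is a substring of another.
related = [f"topic-{i:03d}" for i in range(150)]

normal_page = " ".join(related[:12])   # 12 related phrases, within 8 to 20
stuffed_page = " ".join(related)       # 150 related phrases

print(looks_like_spam(normal_page, related))
print(looks_like_spam(stuffed_page, related))
```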

Advantages of Phrase Based Indexing

• Detecting Duplicate Pages

• Spam Detection

• Saves time

Other Patent Applications

• Phrase identification in an information retrieval system

• Phrase-based searching in an information retrieval system

• Phrase-based generation of document descriptions

• Detecting spam documents in a phrase based information retrieval system

• Efficient Phrase Based Document Indexing for Document Clustering

According to data collected from users of European Web analytics provider OneStat, most people use 2- or 3-word queries in search engines.

• Two-word phrases: 28.38 percent

• Three-word phrases: 27.15 percent

• Four-word phrases: 16.42 percent

• One-word phrases: 13.48 percent

• Five-word phrases: 8.03 percent

• Six-word phrases: 3.67 percent

• Seven-word phrases: 1.63 percent

• Eight-word phrases: 0.73 percent

• Nine-word phrases: 0.34 percent

• Ten-word phrases: 0.16 percent

Thank you