Phrase Based Indexing and Information Retrieval

Phrase Based Indexing

By Bala Abirami

• Introduction to Phrase Based Indexing

• What is Phrase Based Indexing?

• Background of Invention

• Summary of Invention

• Spam Detection

Introduction

• An information retrieval system uses phrases to index, retrieve, organize and describe documents.

• It originated as a US patent application filed by Google engineer Anna Lynn Patterson.

• Application filed: July, 2004

• Published: January, 2006

Background of Invention

• Information retrieval systems, generally called search engines, are now an essential tool for finding information in large-scale, diverse, and growing corpuses such as the Internet.

• A document is retrieved in response to a query containing a number of query terms, typically based on having some number of query terms present in the document.

• The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like.

Cont…

• Concepts are often expressed in phrases, such as "Australian Shepherd," "President of the United States," or "Sundance Film Festival".

• Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases.

Summary

An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection.

1. Identifying Phrases and Related Phrases
2. Indexing Documents w.r.t. Phrases
3. Ranking Documents w.r.t. Phrases
4. Creating a description for the document
5. Elimination of Duplicate Documents

Identifying Phrases and Related Phrases

• Phrases are identified based on their ability to predict the presence of other phrases in a document.

• The system looks for phrases that have frequent and/or distinctive usage.

• A prediction measure is used for identifying related phrases.

• The prediction measure relates the actual co-occurrence rate of two phrases to the expected co-occurrence rate of the two phrases.

• Information gain = actual co-occurrence rate / expected co-occurrence rate

Cont…

• Two phrases are related to each other when the prediction measure exceeds the prediction threshold.

• Example:

The phrase "President of the United States" predicts related phrases such as "White House" and "George Bush".
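The prediction measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: co-occurrence rates are measured as fractions of documents, the expected rate assumes the two phrases occur independently, and the function name `related_phrases` and the default threshold of 1.2 are hypothetical, not values from the patent.

```python
from collections import Counter
from itertools import combinations


def related_phrases(docs, threshold=1.2):
    """Return {(phrase_a, phrase_b): information_gain} for related pairs.

    docs is a list of sets, each holding the phrases found in one document.
    The threshold is an arbitrary illustrative value.
    """
    n = len(docs)
    phrase_count = Counter()   # number of documents containing each phrase
    pair_count = Counter()     # number of documents containing both phrases
    for phrases in docs:
        phrase_count.update(phrases)
        pair_count.update(combinations(sorted(phrases), 2))

    related = {}
    for (a, b), together in pair_count.items():
        actual = together / n
        # Assumption: expected rate under independence of the two phrases.
        expected = (phrase_count[a] / n) * (phrase_count[b] / n)
        gain = actual / expected
        if gain > threshold:
            related[(a, b)] = gain
    return related
```

Pairs are keyed in alphabetical order, so each pair of phrases is scored once regardless of which phrase appears first in a document.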

Indexing documents based on related Phrases

• An information retrieval system indexes documents in the document collection by the valid or good phrases.

• Posting list = the documents that contain the phrase

• Second list = data indicating, for each document in the posting list, which of the phrase's related phrases are also present in that document
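The two lists can be pictured with a small Python sketch. The layout here is an assumption made for illustration: one dictionary per phrase maps each document id (the posting list) to a list of booleans (the second list), one flag per related phrase. The name `build_index` and the input shapes are hypothetical.

```python
from collections import defaultdict


def build_index(docs, related):
    """Build a phrase index.

    docs:    {doc_id: set of phrases in that document}
    related: {phrase: ordered list of its related phrases}
    Returns  {phrase: {doc_id: [flag per related phrase]}}
    """
    index = defaultdict(dict)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            # One flag per related phrase: is it also present in this document?
            flags = [r in phrases for r in related.get(phrase, [])]
            index[phrase][doc_id] = flags
    return index
```

The keys of `index[phrase]` play the role of the posting list, and the flag lists play the role of the second list.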

Ranking

• Ranking is based on two factors:

1. Ranking documents based on contained phrases
2. Ranking documents based on anchor phrases

• Document Score = Body Hit Score + Anchor Hit Score

• For example: Body Hit Score = 0.30, Anchor Hit Score = 0.70

• Document Score = 0.30 + 0.70 = 1.00
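The scoring rule above can be written as a short Python sketch. The function names `document_score` and `rank` and the input shape are hypothetical; how the body and anchor hit scores themselves are computed is not covered here.

```python
def document_score(body_hit_score, anchor_hit_score):
    """Combine the two hit scores as a plain sum."""
    return body_hit_score + anchor_hit_score


def rank(scores):
    """Order document ids best first.

    scores: {doc_id: (body_hit_score, anchor_hit_score)}
    """
    return sorted(scores, key=lambda d: document_score(*scores[d]), reverse=True)
```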

Phrase Extension

• The information retrieval system is also adapted to use the phrases when searching for documents in response to a query.

• A user may enter an incomplete phrase in a search query, such as "President of the".

• Incomplete phrases such as these may be identified and replaced by a phrase extension, such as "President of the United States."
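A minimal sketch of phrase extension, assuming the system keeps a table of known phrases with usage counts and picks the most frequent phrase that begins with the incomplete one. Both the table and the "most frequent wins" rule are assumptions made for illustration, and `extend_phrase` is a hypothetical name.

```python
def extend_phrase(incomplete, known_phrases):
    """Replace an incomplete phrase with its most frequent known extension.

    known_phrases: {phrase: usage count} (hypothetical lookup table).
    Returns the input unchanged when no known phrase extends it.
    """
    prefix = incomplete.lower().strip()
    candidates = [p for p in known_phrases if p.lower().startswith(prefix + " ")]
    if not candidates:
        return incomplete
    # Assumption: prefer the extension that is used most often.
    return max(candidates, key=lambda p: known_phrases[p])
```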

Descriptions for Documents

• Phrase information is used to create a description of a document.

• The system identifies the query phrases, related phrases, and phrase extensions present in each sentence and keeps a count for each sentence.

• It ranks the sentences by count.

• It selects some number of the top-ranking sentences as the description and includes it in the search results.
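The sentence-ranking steps above can be sketched as follows. The sentence splitter is a naive regular expression, matching is a case-insensitive substring count, and the names `describe` and `top_n` are hypothetical; a real system would use more careful tokenization.

```python
import re


def describe(document, phrases, top_n=2):
    """Pick the top_n sentences that mention the given phrases most often.

    phrases: query phrases, related phrases and phrase extensions.
    """
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    def count(sentence):
        lowered = sentence.lower()
        return sum(lowered.count(p.lower()) for p in phrases)

    # sorted() is stable, so tied sentences keep their document order.
    ranked = sorted(sentences, key=count, reverse=True)
    return ranked[:top_n]
```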

Eliminating Duplicate documents

• Duplicate documents are identified and eliminated while crawling or when processing a search query.

• The description is stored in association with every document in a hash table.

• The system hashes the description of a newly crawled page and checks it against the values stored in the hash table. A match indicates that the current document is a duplicate.

• The system keeps the document with the higher page rank or greater document significance and removes the duplicate, which then no longer appears in search results for any query.
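A minimal sketch of the duplicate check, assuming each document arrives with its description and a page rank value. SHA-256 is chosen here only for illustration (the slides do not name a hash function), and `deduplicate` is a hypothetical name.

```python
import hashlib


def deduplicate(docs):
    """Keep one document per distinct description, preferring higher page rank.

    docs: iterable of (doc_id, description, page_rank) tuples.
    Returns the sorted ids of the documents that are kept.
    """
    kept = {}  # description hash -> (doc_id, page_rank)
    for doc_id, description, page_rank in docs:
        key = hashlib.sha256(description.encode("utf-8")).hexdigest()
        # A matching hash means a duplicate; keep the higher-ranked copy.
        if key not in kept or page_rank > kept[key][1]:
            kept[key] = (doc_id, page_rank)
    return sorted(doc_id for doc_id, _ in kept.values())
```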

Functions of Indexing System

• Identifies phrases in documents

• Indexes documents according to their phrases by accessing various websites

Functions of Front End Server

• Receives queries from a user

• Provides those queries to the search system

Functions of Searching System

• Searches for documents relevant to the search query

• Identifies the phrases in the search query

• Ranks the documents

Functions of Presentation System

• Modifies the search results, including removal of duplicate content

• Generates topical descriptions of documents and provides the modified search results

Spam Detection

• “Spam” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages”.

• Pages containing specific words and phrases that advertisers might be interested in are often called “honeypots,” and are created for search engines to display along with paid advertisements.

Cont…

• A phrase based indexing system knows the number of related phrases in a document.

• A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection.

• A spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases.
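The related-phrase count lends itself to a simple check, sketched below. The cut-off of 100 is taken from the lower end of the spam range quoted above and is only an illustrative default; the function names and the input shapes are hypothetical.

```python
def count_related_phrases(doc_phrases, related):
    """Count distinct related phrases that are present in the document.

    doc_phrases: set of phrases found in the document.
    related:     {phrase: iterable of its related phrases}
    """
    present = set()
    for phrase in doc_phrases:
        for r in related.get(phrase, ()):
            if r in doc_phrases:
                present.add(r)
    return len(present)


def looks_like_spam(doc_phrases, related, threshold=100):
    """Flag a document whose related-phrase count reaches the threshold."""
    return count_related_phrases(doc_phrases, related) >= threshold
```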

Advantages of Phrase Based Indexing

• Detecting Duplicate Pages

• Spam Detection

• Save time

Other Patent Applications

• Phrase identification in an information retrieval system

• Phrase-based searching in an information retrieval system

• Phrase-based generation of document descriptions

• Detecting spam documents in a phrase based information retrieval system

• Efficient Phrase Based Document Indexing for Document Clustering

According to data collected from users of European Web analytics provider OneStat, most people use 2- or 3-word queries in search engines:

Two-word phrases -- 28.38 percent
Three-word phrases -- 27.15 percent
Four-word phrases -- 16.42 percent
One-word phrases -- 13.48 percent
Five-word phrases -- 8.03 percent
Six-word phrases -- 3.67 percent
Seven-word phrases -- 1.63 percent
Eight-word phrases -- 0.73 percent
Nine-word phrases -- 0.34 percent
Ten-word phrases -- 0.16 percent

Thank you
