
Page 1:

The Anatomy of a Large-Scale Hypertextual Web Search Engine

S. Brin, L. Page

Presenter: Abhishek Taneja

Page 2:

Why was Google introduced?

Because existing search engines had problems. For example:

• Human-maintained lists/indices

-- subjective, expensive to build and maintain

-- slow to improve

-- cannot cover all esoteric topics

• Automated search engines

-- rely on keyword matching

-- easy to mislead


Page 3:

Some facts about Google

• Why Google is called Google

-- Because "Google" is a common spelling of googol, or 10^100, which fits well with the goal of building a very large-scale search engine.

• Note that we are talking about the Google of 1997. Many of the modules it incorporated then were made open source, so we know a lot about the Google of 1997. We do not know much about the Google of 2010, because most of its modules are proprietary.


Page 4:

Goals behind Google

• Scalability

-- Number of pages indexed.

-- Number of queries handled.

• Quality

-- To provide high quality search results

• Eliminating junk results by using link structure and anchor text for quality filtering.

• To push more development and understanding into the academic realm.

• To increase usability.

• To set up a space-lab-like environment where researchers and even students can propose and run interesting experiments on Google's large-scale web data.


Page 5:

Features of Google Search Engine

• Uses the link structure of the web to calculate a quality ranking for each web page, called PageRank.

• The probability that a random surfer visits a page is its PageRank. It gives an approximation of the page's importance and quality.

• PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

-- where PR(A) is the PageRank of page A

-- T1…Tn are the pages that link to page A

-- PR(T1) is the PageRank of page T1, which links to page A

-- C(T1) is the number of outbound links on page T1

-- PR(Tn)/C(Tn) means the same term is added for every page pointing to page A

-- d is a damping factor: the probability, at each page, that the random surfer gets bored and requests another random page. It is nominally set to 0.85.
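As a worked illustration of the formula above, here is a minimal sketch of an iterative PageRank computation in Python. The tiny link graph, the iteration limit, and the convergence tolerance are assumptions for the example, not values from the paper.

```python
# Minimal sketch: iterative PageRank using the formula above.
# The small link graph, iteration limit, and tolerance are illustrative
# assumptions, not values from the paper.

def pagerank(links, d=0.85, iterations=50, tol=1e-6):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                     # initial PageRank guess
    # Precompute, for every page, the pages that point to it and out-degrees.
    incoming = {p: [q for q in pages if p in links[q]] for p in pages}
    out_count = {p: len(links[p]) for p in pages}

    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) over pages Ti linking to A
            new_pr[p] = (1 - d) + d * sum(pr[q] / out_count[q] for q in incoming[p])
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
            pr = new_pr
            break
        pr = new_pr
    return pr

# Example: three pages with a simple link structure.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```

Each iteration recomputes PR(A) from the current estimates of the pages linking to A; on a small graph like this the values settle after a few dozen iterations.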


Page 6:

Features of Google (cont.)

• Anchor Text.

-- Google utilizes the data in anchor text and associates it with the page the link points to. For example, anchors may exist for documents that cannot be indexed by a text-based search engine, such as images, programs, and databases.

• Google keeps location information for all hits, so it makes extensive use of proximity in search.

• Google keeps track of some visual presentation details, such as the font size of words. Words in <h1> or <b> tags are weighted more heavily than other words.

• The full raw HTML of pages is available in the repository.
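To make the proximity and font-weighting points above concrete, here is a rough sketch of type- and proximity-weighted hit scoring in Python. The weight values, the proximity bonus, and the data layout are illustrative assumptions, not the scheme Google actually used.

```python
# Rough sketch of type- and proximity-weighted hit scoring, in the spirit of
# the hit lists described above. Weights and the proximity bonus are
# illustrative assumptions.

TYPE_WEIGHTS = {"title": 8, "anchor": 8, "h1": 4, "bold": 2, "plain": 1}

def score_hits(hits_per_word):
    """hits_per_word: one list per query word of (position, hit_type) tuples
    found in a single document."""
    # Type component: each hit contributes according to where it occurred.
    type_score = sum(TYPE_WEIGHTS.get(t, 1)
                     for hits in hits_per_word for _, t in hits)

    # Proximity component: closer co-occurrence of query words scores higher.
    proximity_score = 0
    if len(hits_per_word) > 1:
        positions = [[pos for pos, _ in hits] for hits in hits_per_word]
        closest = min(abs(a - b)
                      for i in range(len(positions))
                      for j in range(i + 1, len(positions))
                      for a in positions[i] for b in positions[j])
        proximity_score = max(0, 10 - closest)   # nearer pairs score more

    return type_score + proximity_score

# Example: two query words; the second word also appears in anchor text.
doc_hits = [[(5, "title"), (42, "plain")], [(7, "plain"), (120, "anchor")]]
print(score_hits(doc_hits))
```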


Page 7:

Google Architecture

Page 8:


Google Architecture (cont.)

• URL Server: sends lists of URLs to be fetched to the crawlers.

• Crawlers: multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and roughly 300 connections open at once.

• Store Server: compresses the fetched web pages and stores them in the repository.

• Repository: contains the entire HTML of every web page. Each document is prefixed by its docID, length, and URL.

• Indexer: reads the repository, uncompresses the documents, and parses them into hit lists. It also parses out all the links in every web page and stores important information about them in the anchors file.

• URL Resolver: converts relative URLs into absolute URLs and then into docIDs.
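As a concrete illustration of the repository format described above, here is a minimal sketch of writing and reading one record in Python. The header layout and field widths are assumptions for illustration; zlib is used here only as an example compressor.

```python
# Minimal sketch of a repository record: each document stored compressed,
# prefixed by its docID, length, and URL. The struct layout is an
# illustrative assumption, not the format from the paper.
import struct
import zlib

def write_record(f, doc_id, url, html):
    compressed = zlib.compress(html.encode("utf-8"))
    url_bytes = url.encode("utf-8")
    # Header: docID, URL length, compressed document length.
    f.write(struct.pack("<QII", doc_id, len(url_bytes), len(compressed)))
    f.write(url_bytes)
    f.write(compressed)

def read_record(f):
    header = f.read(16)
    if not header:
        return None
    doc_id, url_len, doc_len = struct.unpack("<QII", header)
    url = f.read(url_len).decode("utf-8")
    html = zlib.decompress(f.read(doc_len)).decode("utf-8")
    return doc_id, url, html

with open("repository.dat", "wb") as f:
    write_record(f, 1, "http://example.com/", "<html><h1>Hello</h1></html>")

with open("repository.dat", "rb") as f:
    print(read_record(f))
```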

Page 9:


Google Architecture (cont.)

• URL Resolver: maps absolute URLs into docIDs stored in the doc index, puts the anchor text into the barrels, and generates the links database (pairs of docIDs).

• Indexer: parses hit lists and distributes them into the barrels.

• Sorter: creates the inverted index, so that the document list (docIDs and hit lists) can be retrieved given a wordID.

• Lexicon: an in-memory hash table that maps words to wordIDs, with a pointer for each wordID to the doc list in the barrel it falls into.

• Barrels: partially sorted forward indexes, sorted by docID. Each barrel stores the hit lists for a particular range of wordIDs.

• Doc Index: a docID-keyed index where each entry includes a pointer to the document in the repository, a checksum, statistics, status, and so on. It also contains URL information if the document has been crawled; if not, it contains just the URL.
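To show how a lexicon and wordID-range barrels could fit together, here is a rough in-memory sketch in Python. The dictionaries stand in for the hash table and on-disk barrels described above, and the barrel range size is an assumption for illustration.

```python
# Rough sketch of a lexicon mapping words to wordIDs plus barrels that each
# hold the posting lists for a range of wordIDs. The in-memory dicts stand in
# for the hash table and on-disk barrels; the range size is an assumption.

BARREL_RANGE = 1000   # each barrel covers 1000 wordIDs (illustrative)

class Index:
    def __init__(self):
        self.lexicon = {}          # word -> wordID
        self.barrels = {}          # barrel number -> {wordID: [(docID, hit)]}

    def word_id(self, word):
        if word not in self.lexicon:
            self.lexicon[word] = len(self.lexicon)
        return self.lexicon[word]

    def add_document(self, doc_id, words):
        # Record a hit (here just the word position) for every word in the doc.
        for position, word in enumerate(words):
            wid = self.word_id(word)
            barrel = self.barrels.setdefault(wid // BARREL_RANGE, {})
            barrel.setdefault(wid, []).append((doc_id, position))

    def lookup(self, word):
        # Given a word, find its wordID, locate the barrel for that wordID
        # range, and return the list of (docID, hit) pairs.
        wid = self.lexicon.get(word)
        if wid is None:
            return []
        return self.barrels.get(wid // BARREL_RANGE, {}).get(wid, [])

index = Index()
index.add_document(1, ["anatomy", "of", "a", "search", "engine"])
index.add_document(2, ["search", "engine", "architecture"])
print(index.lookup("search"))    # -> [(1, 3), (2, 0)]
```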

Page 10:

Google Architecture (cont.)

• DumpLexicon: the list of wordIDs produced by the sorter, together with the lexicon created by the indexer, is used to create a new lexicon used by the searcher. The lexicon stores about 14 million words.

• There are two kinds of barrels: short barrels, whose hit lists contain only title and anchor hits, and full (long) barrels for all hit lists.
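A small sketch of the two-pass lookup implied by the short and full barrels described above: consult the short barrels (title and anchor hits) first, and fall back to the full barrels only when results are sparse. The threshold and the dict-based layout are illustrative assumptions.

```python
# Sketch of a two-pass barrel lookup: short barrels hold only title and
# anchor hits, full barrels hold all hits. Searching the short barrels first
# is cheaper; fall back to the full barrels if too few documents match.
# The threshold and the dict layout are illustrative assumptions.

def search(word_id, short_barrels, full_barrels, min_results=10):
    docs = short_barrels.get(word_id, [])      # title/anchor hits only
    if len(docs) < min_results:
        docs = full_barrels.get(word_id, [])   # all hits
    return docs

short_barrels = {42: [(1, "title"), (7, "anchor")]}
full_barrels = {42: [(1, "title"), (3, "plain"), (7, "anchor"), (9, "plain")]}
print(search(42, short_barrels, full_barrels))
```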

Page 11:

Results and Performance

• The performance of a search engine depends on the quality of its search results, and that quality is judged by its users.

-- After collecting feedback from users and researchers, the results were found to be of good quality. For example, Google at that time was able to produce top search results with no broken links.

• Google also placed heavy importance on the proximity of word occurrences. For example, a search for Bill Clinton does not produce independent results for Bill and for Clinton.

• Storage efficiency was achieved by using compression techniques such as zlib and bzip.

• System performance was improved by optimizing the indexer, running multiple sorters in parallel, and optimizing the data structures used to store the information.
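As a small illustration of the zlib/bzip trade-off mentioned above, the following sketch compares the two compressors on a synthetic page. The sample text is an illustrative stand-in for a crawled page; the numbers it prints are not measurements from the paper.

```python
# Small illustration of the compression trade-off: bzip2 typically compresses
# tighter than zlib but is slower. The sample page is a synthetic stand-in.
import bz2
import time
import zlib

page = ("<html><body>"
        + "<p>The anatomy of a large-scale search engine.</p>" * 200
        + "</body></html>").encode("utf-8")

for name, compress in (("zlib", zlib.compress), ("bzip2", bz2.compress)):
    start = time.perf_counter()
    compressed = compress(page)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(page)} -> {len(compressed)} bytes "
          f"({len(compressed) / len(page):.1%}) in {elapsed * 1000:.2f} ms")
```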


Page 12:

Conclusions

• Google is designed to be a scalable search engine.

• The primary goal is to provide high-quality search results over a rapidly growing World Wide Web.

• Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information.

• Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.


Page 13:

Pros of the paper

• A landmark paper that gives insight into the architecture of the Google search engine.

• The first known public description of PageRank.

• New ways of ranking are proposed, based on link structure, that come very close to the notion of "relevant" documents.


Page 14:

Cons of the paper

• As noted, the paper describes the Google of 1997, and a number of the goals it proposed were not implemented, for example making Google part of the academic realm.

• Judging the quality of a web page only by PageRank and anchor text data is not sufficient.
