Web indexing finale

Web Indexing

About

• Includes back-of-book-style indexes to individual websites or an intranet.

• Creation of keyword metadata to provide a more useful vocabulary for Internet or on-site search engines.

• It is also becoming important for periodical websites as their number increases.

Purpose

• Collects, parses, and stores data to facilitate fast and accurate information retrieval.

• Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science.

• Popular engines focus on the full-text indexing of online, natural language documents.

Purpose (Contd.)

• Media types such as video and audio and graphics are also searchable.

• Cache-based search engines permanently store the index along with the corpus.

• Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size.

Back-of-the-book-style

• Back-of-the-book-style web indexes may be called "web site A-Z indexes."

• The implication with "A-Z" is that there is an alphabetical browse view or interface.

• An A-Z index could be used to index multiple sites rather than the multiple pages of a single site, but this is unusual.

Metadata web indexing

• Metadata web indexing involves assigning keywords or phrases to web pages or web sites within a meta-tag.

• The web page or web site can be retrieved with a search engine that is customized to search the keywords field.

• This may or may not involve using keywords restricted to a controlled vocabulary list.
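
As a minimal sketch of metadata indexing, the snippet below reads the keywords meta-tag of a page and optionally filters the keywords against a controlled vocabulary. The names `KeywordMetaParser`, `extract_keywords` and `CONTROLLED_VOCABULARY` are hypothetical, and the sample page and vocabulary are made up for illustration.

```python
from html.parser import HTMLParser


class KeywordMetaParser(HTMLParser):
    """Collects the keywords listed in <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "keywords":
            # Keywords are conventionally comma-separated.
            for keyword in attr.get("content", "").split(","):
                keyword = keyword.strip().lower()
                if keyword:
                    self.keywords.append(keyword)


def extract_keywords(html):
    parser = KeywordMetaParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical controlled vocabulary; a real list would be curated by indexers.
CONTROLLED_VOCABULARY = {"indexing", "metadata", "search engines"}

page = '<html><head><meta name="keywords" content="Indexing, Metadata, cats"></head></html>'
found = extract_keywords(page)
approved = [kw for kw in found if kw in CONTROLLED_VOCABULARY]
print(found)     # all keywords assigned to the page
print(approved)  # only those in the controlled vocabulary
```

A search engine customized for the keywords field would index `found` (or `approved`) instead of the full page text.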

Purpose

• To optimize speed and performance in finding relevant documents for a search query.

• Without an index, the search engine would have to scan every document in the corpus, which would require considerable time and computing power.

• The additional storage required to hold the index, and the longer time required for an update to take place, are traded off for the time saved during information retrieval.

Index Design Factors

• Merge factors

– The indexer must first check whether it is updating old content or adding new content.

– Similar in concept to the SQL MERGE command and other merge algorithms.

• Storage techniques – whether information should be data compressed or filtered.

Index Design Factors (Contd.)

• Index size – the computer storage required to support the index.

• Lookup speed – how quickly a word can be found in the inverted index.

– The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed.

Index Design Factors (Contd.)

• Maintenance – how the index is maintained over time.

• Fault tolerance – the service must be reliable. Issues include:

– dealing with index corruption,

– determining whether bad data can be treated in isolation,

– dealing with bad hardware,

– partitioning, with schemes such as hash-based or composite partitioning,

– replication.
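
Hash-based partitioning can be illustrated with a short sketch in which each term is assigned to a shard by hashing it. The functions `shard_for` and `partition_index` are hypothetical names, and the use of CRC32 is an assumption chosen only because it gives the same value on every run.

```python
import zlib


def shard_for(term, num_shards):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(term.encode("utf-8")) % num_shards


def partition_index(index, num_shards):
    """Split a term -> postings mapping into num_shards smaller mappings."""
    shards = [dict() for _ in range(num_shards)]
    for term, postings in index.items():
        shards[shard_for(term, num_shards)][term] = postings
    return shards


index = {"web": [1, 2], "index": [1], "trie": [3], "merge": [2, 3]}
shards = partition_index(index, 3)
for number, shard in enumerate(shards):
    print(number, sorted(shard))
```

Because every term maps to exactly one shard, a fault on one machine affects only the terms stored there; replication would keep extra copies of each shard.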

Index Data Structures

• Suffix tree

– Structured like a tree; supports linear-time lookup.

– Built by storing the suffixes of words.

– Supports extendible hashing, which is important for search engine indexing.

– Used for searching for patterns in DNA sequences and for clustering.

Index Data Structures (Contd.)

• Trie

– An ordered tree data structure that stores an associative array whose keys are strings.

– Regarded as faster than a hash table but less space-efficient.

• Inverted index – stores a list of occurrences of each atomic search criterion.

• Citation index – stores citations or hyperlinks between documents to support citation analysis.
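
The ordered, string-keyed tree described on this slide is usually called a trie. The sketch below is one minimal way to write it, with a hypothetical `Trie` class that maps string keys to values and can list the keys sharing a prefix.

```python
class Trie:
    """Associative array with string keys, stored one character per level."""

    def __init__(self):
        self.children = {}     # character -> child Trie node
        self.terminal = False  # True if a key ends at this node
        self.value = None

    def insert(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True
        node.value = value

    def get(self, key, default=None):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return default
        return node.value if node.terminal else default

    def keys_with_prefix(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, key = stack.pop()
            if current.terminal:
                found.append(key)
            for ch, child in current.children.items():
                stack.append((child, key + ch))
        return sorted(found)


trie = Trie()
trie.insert("index", 1)
trie.insert("indexing", 2)
trie.insert("inverted", 3)
print(trie.get("index"))
print(trie.keys_with_prefix("ind"))
```

Lookup cost depends on the length of the key rather than on the number of keys stored, at the price of one node per character.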

Index Data Structures (Contd.)

• N-gram index – stores sequences of n consecutive items of data to support other types of retrieval or text mining.

• Term-document matrix – used in latent semantic analysis; stores the occurrences of words in documents in a two-dimensional sparse matrix.
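
Both structures on this slide can be sketched in a few lines. The example assumes word-level n-grams and whitespace tokenization, and represents the sparse matrix as a dictionary holding only the non-zero cells; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def word_ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_index(docs, n=2):
    """Map each n-gram to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in word_ngrams(text.lower().split(), n):
            index[gram].add(doc_id)
    return index


def build_term_document_matrix(docs):
    """Sparse matrix as a dict: (term, doc_id) -> occurrence count."""
    matrix = defaultdict(int)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            matrix[(term, doc_id)] += 1
    return dict(matrix)


docs = {1: "the cat sat", 2: "the cat ran"}
ngram_index = build_ngram_index(docs)
matrix = build_term_document_matrix(docs)
print(sorted(ngram_index[("the", "cat")]))
print(matrix[("cat", 1)])
```

Cells that would be zero, such as ("sat", 2), are simply absent, which is what makes the representation sparse.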

Indexes vs. Taxonomies

• Hierarchical taxonomy vs. alphabetical index.

• Two-step process of taxonomy development and content linking vs. integrated indexing/index creation.

• Each is more suitable for different kinds of content.

• Sometimes have both, as different means to access the same content.

Challenges in Parallelism

• Management of parallel computing processes.

• Many opportunities for race conditions and coherent faults.

• Collision between two competing tasks.

• The search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison.

• It is more difficult to maintain a fully synchronized, distributed, parallel architecture.
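
A small single-machine illustration of the race-condition problem: several indexer threads update a shared term count, and a lock serializes the read-modify-write step so that no update is lost. This is only a sketch with hypothetical names; a distributed engine would need coordination across machines, not just threads.

```python
import threading

term_counts = {}
index_lock = threading.Lock()


def index_document(text):
    for word in text.lower().split():
        # Without the lock, two threads could read the same old count
        # and one of the two increments would be lost.
        with index_lock:
            term_counts[word] = term_counts.get(word, 0) + 1


documents = ["shared term"] * 100
workers = [threading.Thread(target=index_document, args=(doc,)) for doc in documents]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(term_counts["shared"])
```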

Index Merging

• The inverted index is filled via a merge or a rebuild.

• A rebuild is similar to a merge but first deletes the contents of the inverted index.

• A merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.

• The inverted index is a word-sorted forward index.
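
The difference between a merge and a rebuild can be sketched as follows. Plain dictionaries stand in for both the in-memory postings and the on-disk index cache, which is a simplifying assumption, and the function names are hypothetical.

```python
def merge_index(main_index, new_postings):
    """Fold newly indexed postings into main_index, keeping doc ids sorted."""
    for word, doc_ids in new_postings.items():
        merged = set(main_index.get(word, [])) | set(doc_ids)
        main_index[word] = sorted(merged)
    return main_index


def rebuild_index(main_index, new_postings):
    """Like a merge, but the existing contents are deleted first."""
    main_index.clear()
    return merge_index(main_index, new_postings)


main_index = {"cat": [1, 2], "dog": [2]}
new_postings = {"cat": [3], "emu": [3]}

merged = merge_index(dict(main_index), new_postings)
rebuilt = rebuild_index(dict(main_index), new_postings)
print(merged)
print(rebuilt)
```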

The Forward Index

• Stores a list of words for each document.

• As documents are parsed, it is better to immediately store the words per document.

• Sorted to transform it into an inverted index.

• Converting the forward index to an inverted index is only a matter of sorting the pairs by the words.

• Essentially a list of pairs consisting of a document and a word, collated by the document.
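
The conversion described on this slide can be sketched directly: build the (document, word) pairs, then sort them by word and group. Whitespace tokenization and the function names are assumptions made for the example.

```python
from itertools import groupby


def forward_pairs(docs):
    """Forward index as (doc_id, word) pairs, collated by document."""
    pairs = []
    for doc_id in sorted(docs):
        for word in docs[doc_id].lower().split():
            pairs.append((doc_id, word))
    return pairs


def invert(pairs):
    """Sort the pairs by word, then group them into word -> [doc ids]."""
    by_word = sorted(set((word, doc_id) for doc_id, word in pairs))
    return {
        word: [doc_id for _, doc_id in group]
        for word, group in groupby(by_word, key=lambda pair: pair[0])
    }


docs = {1: "the cat sat", 2: "the dog sat"}
pairs = forward_pairs(docs)
inverted = invert(pairs)
print(pairs)
print(inverted)
```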

Compression

• Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge.

• Compression reduces the size of the indices on disk.

• The tradeoff is the time and processing power required to perform compression and decompression.
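
The slides do not name a compression scheme. As one common illustration, the sketch below stores the gaps between sorted document ids and writes each gap as a variable-length sequence of bytes, so small gaps take a single byte. The function names are hypothetical.

```python
def encode_postings(doc_ids):
    """Gap-encode sorted doc ids, then variable-byte encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Emit 7 bits at a time, low-order bits first.
        # A set high bit means more bytes of this gap follow.
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


postings = [3, 7, 300, 100000]
encoded = encode_postings(postings)
print(len(encoded), "bytes instead of", 4 * len(postings))
print(decode_postings(encoded) == postings)
```

The decoding loop is the cost side of the tradeoff: every lookup pays some processing time to recover the original ids.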

A Scenario

• A full-text Internet search engine:

– 6,000,000,000 different web pages exist as of the year 2008.

– 250 words on each web page (based on the assumption they are similar to the pages of a novel).

– 8 bits (or 1 byte) to store a single character.

– The average number of characters in any given word on a page may be estimated at 5.

– The average personal computer comes with 100 to 250 gigabytes of usable space.

Under the Same Scenario

• An uncompressed index (assuming a non-conflated, simple index) for 6 billion web pages would need to store 1,500 billion word entries.

• At 1 byte per character, or 5 bytes per word, this would require 7,500 gigabytes of storage space alone.

• The index can be reduced to a fraction of this size using a suitable compression algorithm.
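
The estimate can be checked by multiplying out the figures from the scenario (6 billion pages, 250 words per page, 5 one-byte characters per word). The only added assumption is that 1 gigabyte is taken as 10^9 bytes.

```python
pages = 6_000_000_000
words_per_page = 250
bytes_per_word = 5            # 5 characters at 1 byte each
bytes_per_gigabyte = 10**9    # assumption: decimal gigabytes

word_entries = pages * words_per_page
index_bytes = word_entries * bytes_per_word
index_gigabytes = index_bytes // bytes_per_gigabyte

# Number of personal computers needed at 250 GB and at 100 GB of usable space.
pcs_at_250_gb = index_gigabytes // 250
pcs_at_100_gb = index_gigabytes // 100

print(word_entries)      # 1,500 billion word entries
print(index_gigabytes)   # 7,500 gigabytes
print(pcs_at_250_gb, pcs_at_100_gb)
```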

Document Parsing

• Breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices.

• The words found are called tokens, so in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization.

Document Parsing (Contd.)

• Also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis.

• The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

Challenges in Natural Language Processing

• Word Boundary Ambiguity

• Language Ambiguity

• Diverse File Formats

• Faulty Storage

Tokenization

• Computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences.

• The computer must be programmed to identify what constitutes an individual or distinct word, referred to as a token.

• Such a program is commonly called a tokenizer, parser, or lexer.
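
A very small tokenizer can be sketched with a regular expression. The rule used here (runs of ASCII letters or digits, optionally followed by an apostrophe and more letters) is an assumption that suits simple English text only; real tokenizers must handle the word-boundary and language ambiguities listed earlier.

```python
import re

# Assumed token rule: letters/digits, with an optional apostrophe suffix ("don't").
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Split text into lowercase tokens, dropping punctuation and spaces."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("Don't index the index, twice!")
print(tokens)
```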

Format Analysis

• HTML

• ASCII text files (a text document without specific computer-readable formatting)

• Adobe's Portable Document Format (PDF)

• PostScript (PS)

• LaTeX

• UseNet netnews server formats

• XML and derivatives like RSS

• SGML

• Multimedia metadata formats like ID3

• Microsoft Word

• Microsoft Excel

• Microsoft PowerPoint

• IBM Lotus Notes

Format Analysis (Compressed)

• ZIP - Zip archive file

• RAR - Roshal ARchive file

• CAB - Microsoft Windows Cabinet file

• Gzip - file compressed with gzip

• BZIP - file compressed using bzip2

• TAR - Tape ARchive, Unix archive file, not (itself) compressed

• TAR.Z, TAR.GZ or TAR.BZ2 - Unix archive files compressed with Compress, GZIP or BZIP2
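
Before a compressed file can be parsed, the indexer has to recognize its format. One common approach, sketched here, is to compare the first bytes of the file with known signatures. The function name `sniff_compression` is hypothetical, and plain TAR files are not covered because their marker does not sit at the start of the file.

```python
import bz2
import gzip

# Leading byte signatures of several formats from the list above.
MAGIC_NUMBERS = [
    (b"PK\x03\x04", "zip"),
    (b"Rar!\x1a\x07", "rar"),
    (b"MSCF", "cab"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\x1f\x9d", "compress"),
]


def sniff_compression(header):
    """Return a format name for the leading bytes, or None if unrecognized."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return None


print(sniff_compression(gzip.compress(b"hello")))
print(sniff_compression(bz2.compress(b"hello")))
print(sniff_compression(b"plain text"))
```

Once the format is known, the indexer can decompress the content and pass it on to the parser for the underlying document type.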
