Web – Based Information Retrieval System. World Wide Web or Web - a massive collection of web pages stored on the millions of computers across the world

Web-Based Information Retrieval System

I. Definition of Terms

World Wide Web or Web
- a massive collection of web pages stored on millions of computers across the world that are linked by the Internet

Online
- refers to a computer or user currently connected to a network or to the Internet. Online is often used to refer to resources available on the Internet.

The Web

- No design/co-ordination
- Distributed content creation, linking
- Content includes truth, lies, obsolete information, contradictions …
- Structured (databases), semi-structured …
- Scale larger than previous text corpora … (now, corporate records)
- Growth: slowed down from the initial “volume doubling every few months”
- Content can be dynamically generated

II. History of the Web

Vannevar Bush envisioned hypertext (1940s)

Tim Berners-Lee and his colleagues at CERN created a protocol called HTTP (Hypertext Transfer Protocol), which standardizes communication between servers and clients

III. Complexities of the Web

1. Distributed nature of the web
   There is no uniform standard used for the creation and processing of web information resources
2. Size and growth of the web
3. Deep vs the surface web
4. Type and format of the documents
5. Quality of the information
6. Frequency
7. Ownership
8. Distributed users
9. Multiple languages
10. Resource requirements

IV. Traditional vs Web IR

“Volume of Use”

Web Information: volume and growth

Distribution of websites from 1998 to 2002

Year    Number of websites
1998    2,851,000
1999    4,882,000
2000    7,399,000
2001    8,745,000
2002    9,040,000

Growth in the number of websites from 1998-2002

Period       Growth
1998-1999    71%
1999-2000    52%
2000-2001    18%
2001-2002    3%
1998-2002    217% (whole period)
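The growth percentages can be reproduced from the yearly website counts listed earlier. The following is a minimal sketch, assuming growth is computed as (later count - earlier count) / earlier count and rounded to a whole percent; the names `site_counts` and `growth_percent` are illustrative and not part of the original material.

```python
# Website counts per year, copied from the distribution table.
site_counts = {
    1998: 2_851_000,
    1999: 4_882_000,
    2000: 7_399_000,
    2001: 8_745_000,
    2002: 9_040_000,
}


def growth_percent(earlier, later):
    """Relative growth from `earlier` to `later`, rounded to a whole percent."""
    return round((later - earlier) / earlier * 100)


years = sorted(site_counts)

# Year-over-year growth for each consecutive pair of years.
yearly_growth = {
    (a, b): growth_percent(site_counts[a], site_counts[b])
    for a, b in zip(years, years[1:])
}

# Growth over the whole period, first year to last year.
total_growth = growth_percent(site_counts[years[0]], site_counts[years[-1]])

for (a, b), pct in yearly_growth.items():
    print(f"{a}-{b}: {pct}%")
print(f"{years[0]}-{years[-1]}: {total_growth}%")
```

Under this assumption the four year-over-year values come out as 71%, 52%, 18% and 3%, and the 217% figure corresponds to the whole 1998-2002 period rather than to a single year.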

Distribution of public websites by country of origin in 2002

Country        Share
US             55%
Germany        6%
Japan          5%
UK             3%
Canada         3%
Italy          2%
France         2%
Netherlands    2%
Others         18%
Unknown        4%

Distribution of public websites by language in 2002

Language      Share
English       72%
German        7%
Japanese      6%
Spanish       3%
French        3%
Italian       2%
Dutch         2%
Chinese       2%
Korean        1%
Portuguese    1%
Russian       1%
Polish        1%

In spite of this…

As of 2003, Google was reported to be the largest search engine, having indexed 3.8 billion web pages, and until today…

Categories:

• Web Search Tools
• Activating the appropriate program from a particular web page

Information on the web can be categorized into two classes:

Deep Web - the part of the web that is hidden and cannot be easily accessed

Surface Web - the part that can be easily accessed

Web tools for Information Retrieval

1. Web browser
- a computer program essential for getting access to the web
- URL (Uniform Resource Locator) or the web address
- e.g. Netscape Navigator, Microsoft Internet Explorer

Capabilities:

a) knows how to go to a web server on the internet and request a page

b) knows how to interpret the set of HTML tags
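The two browser capabilities can be illustrated with Python's standard library. This is only a sketch, not how any real browser is built: `fetch_page`, `PageSummary` and `SAMPLE_HTML` are hypothetical names, and the parser looks at just two tags (`<title>` and `<a>`), whereas a browser interprets the full tag set and renders the page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_page(url):
    """Capability (a): contact the web server named in the URL and request a page."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class PageSummary(HTMLParser):
    """Capability (b): interpret HTML tags (here only <title> and <a href=...>)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# A hard-coded sample page keeps the sketch runnable without network access;
# in practice the HTML would come from fetch_page(some_url).
SAMPLE_HTML = """<html><head><title>Sample page</title></head>
<body><a href="/about.html">About</a> <a href="http://example.com/">Example</a></body></html>"""

summary = PageSummary()
summary.feed(SAMPLE_HTML)
print(summary.title)
print(summary.links)
```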

2. Search engines
- allow users to enter search terms such as keywords/phrases
- retrieve from their database web pages that match the search terms entered by the user

3. Web Directories or “Link Directory”
- a directory on the WWW which specializes in linking to other websites and categorizing links
- not a search engine; does not display lists of web pages based on keywords
- often allows site owners to directly submit their site for inclusion

RSS directories are similar to web directories, but contain collections of RSS feeds instead of links to web sites.

Search Engines: How They Work

- They search or select parts of the internet according to a set of criteria
- They keep an index of the words or phrases they find, with specific information such as where they found them and how many times they found them
- They allow users to search for words or phrases, or combinations of words or phrases, found in that index
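The index described here (words, where they were found, and how many times) is commonly implemented as an inverted index. Below is a minimal sketch under simple assumptions: the `pages` collection is made up, words are split on non-alphanumeric characters, and results are ordered by raw occurrence counts, which is far cruder than the ranking of a real search engine.

```python
import re
from collections import defaultdict

# Hypothetical mini-collection standing in for pages found on the web.
pages = {
    "page1.html": "web search engines index the web",
    "page2.html": "a spider crawls the web",
    "page3.html": "search engines rank pages",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {page: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in docs.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """Return pages containing every query word, most occurrences first."""
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    matches = set(postings[0]).intersection(*postings[1:])

    def score(url):
        return sum(posting[url] for posting in postings)

    return sorted(matches, key=lambda url: (-score(url), url))


index = build_index(pages)
print(index["web"])                     # where "web" occurs and how often
print(search(index, "web"))             # single-word query
print(search(index, "search engines"))  # combination of words
```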

Three Main Components

Spider
- a program that automatically fetches web pages for search engines; crawls over the web.

Search engine software and interface
- information retrieval program that performs two major tasks:
- searches through millions of terms recorded in the index to find matches to a search
- ranks the retrieved records (web pages) by relevance, most relevant first

Index

Crawling and Indexing process (Google)

Web Crawling
- A URL server sends lists of URLs to be fetched to the crawlers
- The web pages that are fetched are sent to the store server, which compresses and stores the web pages into a repository
- Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page
- The indexing function is performed by the indexer and the sorter
- The indexer reads the repository, uncompresses the documents, and parses them
- Each document is converted into a set of word occurrences called hits. The hits record the word, position in the document, an approximation of the font size, and capitalization
- The indexer distributes these hits into a set of barrels, creating a partially sorted forward index
- The indexer also parses out all the links in every web page and stores important information about them in an anchor file
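The indexing steps above can be sketched in a few lines. This is a heavily simplified illustration and not Google's implementation: compression, the sorter and font-size information are left out, and the number of barrels, the rule `word_id % NUM_BARRELS`, and all names (`PageParser`, `index_page`, `repository`) are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

NUM_BARRELS = 2  # assumption: count and assignment rule are illustrative only


class PageParser(HTMLParser):
    """Collects words and (href, anchor text) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.anchors = []
        self._href = None
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.anchors.append((self._href, " ".join(self._anchor_text)))
            self._href = None

    def handle_data(self, data):
        tokens = re.findall(r"[A-Za-z0-9]+", data)
        self.words.extend(tokens)
        if self._href:
            self._anchor_text.extend(tokens)


doc_ids = {}      # URL -> docID
lexicon = {}      # word -> wordID
barrels = [dict() for _ in range(NUM_BARRELS)]  # each barrel: docID -> list of hits
anchor_file = []  # (source docID, target docID, anchor text)


def doc_id_for(url):
    """Assign the next free docID the first time a URL is seen."""
    return doc_ids.setdefault(url, len(doc_ids))


def index_page(url, html):
    """Parse one stored page into hits and link records."""
    doc_id = doc_id_for(url)
    parser = PageParser()
    parser.feed(html)
    for position, token in enumerate(parser.words):
        word_id = lexicon.setdefault(token.lower(), len(lexicon))
        hit = (word_id, position, token[0].isupper())  # word, position, capitalization
        barrels[word_id % NUM_BARRELS].setdefault(doc_id, []).append(hit)
    for href, text in parser.anchors:
        # A URL parsed out of a page receives a docID even before it is fetched.
        anchor_file.append((doc_id, doc_id_for(href), text))


# Hypothetical repository of already-fetched pages.
repository = {
    "http://a.example/": '<p>Web search</p><a href="http://b.example/">Search tips</a>',
    "http://b.example/": "<p>tips for the Web</p>",
}
for url, html in repository.items():
    index_page(url, html)

print(doc_ids)
print(barrels)
print(anchor_file)
```

Each barrel maps a docID to the hits of that document, which is why the result is a forward index; turning it into an index keyed by word is a separate step not shown here.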

7 Categories of Search engines

A. Major Search Engines

B. News search engines

C. Speciality search engines

D. Kids’ search engines

E. Metacrawlers

F. Multimedia search engines

G. Regional and country search engines

B. News search engines

LexisNexis claims to be the "world’s largest collection of public records, unpublished opinions, forms, legal, news, and business information" while offering its products to a wide range of professionals in the legal, risk management, corporate, government, law enforcement, accounting and academic markets.


Types of Search Engines (by model)

1. Search engines

2. Meta search engines
- tools that allow users to conduct concurrent searches on more than one search engine
- search engines that automatically submit your keyword search to several other search tools and retrieve results from all their databases. Convenient time-savers for relatively simple keyword searches (one or two keywords or phrases in " ").

3. Open source search engines

4. Social search engines

5. Personal search engines

6. Visual search engines

7. Desktop search engines
- the name for the field of search tools which search the contents of a user's own computer files, rather than searching the Internet. These tools are designed to find information on the user's PC, including web browser histories, e-mail archives, text documents, sound files, images and video.

8. Usenet
- bulletin-board-like network featuring thousands of "newsgroups." Google incorporates the historic file of Usenet newsgroups (back to 1981) into its Google Groups. Yahoo Groups offers a similar service, but does not include the old Usenet newsgroups. Blogs are replacing some of the need for this type of community sharing and information exchange.
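The meta search idea from item 2 (one query submitted to several engines at once, results gathered from all of them) might look roughly like the sketch below. The two "engines" are hypothetical stub functions returning fixed lists, and the merge rule (pages returned by more engines first, then best rank) is just one possible choice.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for real search engines: each takes a query and
# returns a ranked list of URLs. A real tool would call each engine's own interface.
def engine_a(query):
    return ["http://x.example/", "http://y.example/"]


def engine_b(query):
    return ["http://y.example/", "http://z.example/"]


ENGINES = {"A": engine_a, "B": engine_b}


def meta_search(query):
    """Send the query to every engine concurrently and merge the result lists."""
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in ENGINES.items()}
        results = {name: future.result() for name, future in futures.items()}

    merged = {}  # URL -> (best rank seen, names of engines that returned it)
    for name, urls in results.items():
        for rank, url in enumerate(urls):
            best, sources = merged.get(url, (rank, []))
            merged[url] = (min(best, rank), sources + [name])

    # Pages returned by more engines come first, then best rank, then URL.
    ordered = sorted(merged, key=lambda u: (-len(merged[u][1]), merged[u][0], u))
    return [(url, merged[url][1]) for url in ordered]


for url, sources in meta_search("information retrieval"):
    print(url, sources)
```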

Open Source Search engines

Open source search engines allow participants to make changes and contribute to the improvement of the software.

They are generally free and use the GPL or other open source licensing schemes.

In most cases, anyone can use the software on a site or incorporate it in a product, but must share improvements and additional functionality with the other source users.

Note that these search engines generally require all options to be set using command lines or configuration files, rather than interactive browser-based graphic interfaces. Changes are often done on the server, requiring root access and passwords.

Sphinx is a full-text search engine, distributed under GPL version 2. A commercial license is also available for embedded use. Generally, it is a standalone search engine, meant to provide fast, size-efficient and relevant full-text search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Currently, built-in data sources support fetching data either via direct connection to MySQL or PostgreSQL, or using an XML pipe mechanism (a pipe to the indexer in a special XML-based format which Sphinx recognizes). As for the name, Sphinx is an acronym which is officially decoded as SQL Phrase Index (not to be confused with CMU's Sphinx project).

Social Search engines

Social search or a social search engine is a type of web search method that determines the relevance of search results by considering the interactions or contributions of users. When applied to web search, this user-based approach to relevance is in contrast to established algorithmic or machine-based approaches, where relevance is determined by analyzing the text of each document or the link structure of the documents.

Social search takes many forms, ranging from simple shared bookmarks or tagging of content with descriptive labels to more sophisticated approaches that combine human intelligence with computer algorithms.

The search experience revolves around the outcome of collaborative harvesting, collaborative directories, tag engines, social ranking, and commenting on bookmarks, news, images, videos, podcasts and other web pages. Example forms of user input include social bookmarking or direct interaction with the search results, such as promoting or demoting results the user feels are more or less relevant to their query.
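Promoting and demoting results can be pictured as adjusting an algorithmic score with user votes. The sketch below is purely illustrative: the base scores, the votes and the `VOTE_WEIGHT` constant are invented, and actual social search engines combine user signals in their own, usually more elaborate, ways.

```python
# Hypothetical algorithmic scores (e.g. from text matching) for one query.
base_scores = {"page1": 0.80, "page2": 0.75, "page3": 0.40}

# User feedback: +1 for each promotion, -1 for each demotion.
votes = {"page1": [-1, -1], "page2": [+1, +1, +1], "page3": [+1]}

VOTE_WEIGHT = 0.05  # assumption: how much a single vote shifts the score


def social_rank(scores, feedback, weight=VOTE_WEIGHT):
    """Re-rank results by adding weighted net user votes to the base score."""
    adjusted = {
        page: score + weight * sum(feedback.get(page, []))
        for page, score in scores.items()
    }
    return sorted(adjusted, key=lambda page: (-adjusted[page], page))


print(social_rank(base_scores, {}))     # ranking without any user input
print(social_rank(base_scores, votes))  # ranking after promotions and demotions
```

With these made-up numbers, page2 overtakes page1 once the votes are counted, which is the effect the user-based approach aims for.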

Personal Search Engines

Desktop Search Engines

Apple Inc's Spotlight is an example of a desktop search tool.

X1 Technologies' X1 Professional Client is an example of a desktop search tool for Windows.
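The core idea of desktop search, looking through a user's own files instead of the Internet, can be sketched as a plain directory scan. This is a naive illustration and does not reflect how Spotlight or X1 work internally (such tools typically maintain an index rather than rescanning files); the function name `desktop_search` and the sample files are hypothetical.

```python
import os
import tempfile


def desktop_search(root, term):
    """Return paths of files under `root` whose name or contents contain `term`."""
    term = term.lower()
    found = []
    for folder, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    text = handle.read()
            except OSError:
                continue  # unreadable file: skip it
            if term in name.lower() or term in text.lower():
                found.append(path)
    return sorted(found)


# Demonstration on a throwaway directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "mail"))
    with open(os.path.join(root, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("Meeting about the search index")
    with open(os.path.join(root, "mail", "inbox.txt"), "w", encoding="utf-8") as f:
        f.write("Lunch on Friday?")
    hits = [os.path.relpath(p, root) for p in desktop_search(root, "index")]
    print(hits)
```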
