1 Searching the Web Prepared By: Hasan Ba-Abdullah. 425121603 Supervised By: Dr. Mourad Ykhlef King Saud University College of Computer & Information Sciences Information System Department IS 531:Document Storage and Retrieval Systems


Searching the Web

Prepared By: Hasan Ba-Abdullah. 425121603

Supervised By: Dr. Mourad Ykhlef

King Saud University
College of Computer & Information Sciences

Information System Department
IS 531: Document Storage and Retrieval Systems


Agenda

1. Introduction

2. Challenges of Searching the Web

3. Measuring the Web

4. Search Engines (Google)

5. Web Directories

6. Metasearchers

7. Google Searching Guidelines


1. Introduction

• The Web can be seen as a very large, unstructured but ubiquitous database.

• So we need efficient tools to manage, retrieve, and filter the information.

• There are 3 different forms of searching the Web:

1. Search Engines, which index a portion of Web pages as a full-text database.

2. Web Directories, which classify selected Web documents by subject.

3. Searching by hyperlink structure.


2. Challenges of Searching the Web

• Problems with the data itself:
– Distributed data.
– High percentage of volatile data: it is estimated that 40% of the Web changes every month.
– Unstructured and redundant data: no conceptual model, no organization, no constraints. By some estimates, about 30% of the Web is redundant.
– Quality of data: data can be false, invalid, outdated, poorly written, or full of errors.
– Heterogeneous data: multiple media types, formats, languages, and alphabets.

• Problems regarding the user and their interaction with the retrieval system:
– How to specify a query.
– How to interpret the answer provided by the system.


3. Measuring the Web

• Detailed Domain Counts and Internet Statistics.

• Source: http://www.whois.sc/internet-statistics/


3. Measuring the Web (Cont.)

• Country IP counts on April 1st, 2006.

Source: http://www.whois.sc/internet-statistics/country-ip-counts.html


4. Search Engine

• A search engine is a program designed to help find information stored on a computer system such as the World Wide Web, or a personal computer.

• The search engine allows one to ask for content meeting specific criteria and retrieves a list of references that match those criteria.

• Two main architectures:
1. Centralized: Using crawlers, information is gathered into a single site, where it is indexed; the site then processes all user queries.
2. Distributed: Searching is a coordinated effort of many information gatherers and brokers.


4.1 Centralized Architecture

• Most search engines use a centralized crawler-indexer architecture.

• Components: Crawlers, Index, Query Engine, and Interface.


4.2 Distributed Architecture

• Harvest is an example of a distributed architecture.

• Main drawback: it requires the coordination of several Web servers.

• Components:
1. Gatherers:
– Extract information from the documents stored on one or more Web servers.
– Can handle documents in many formats: HTML, PDF, PostScript, etc.
2. Broker: provides the indexing mechanism and query interface.
3. Replicator: replicates servers.
4. Object Cache: reduces network and server load.
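The division of labour between gatherers and a broker can be sketched in a few lines. This is only a minimal illustration, assuming in-memory dictionaries in place of real Web servers; the class names, the `example` URLs, and the word-set "summary" are hypothetical and are not Harvest's actual interfaces.

```python
from collections import defaultdict


class Gatherer:
    """Extracts a summary (here: the set of lowercase words) from one server's documents."""

    def __init__(self, documents):
        # documents: mapping of URL -> plain text, standing in for one Web server
        self.documents = documents

    def gather(self):
        return {url: set(text.lower().split()) for url, text in self.documents.items()}


class Broker:
    """Builds an index from gatherer output and answers single-word queries."""

    def __init__(self):
        self.index = defaultdict(set)  # word -> set of URLs containing it

    def ingest(self, summaries):
        for url, words in summaries.items():
            for word in words:
                self.index[word].add(url)

    def query(self, word):
        return sorted(self.index.get(word.lower(), set()))


# Two hypothetical servers, each with its own gatherer.
server_a = Gatherer({
    "http://a.example/1": "web search engines",
    "http://a.example/2": "digital libraries",
})
server_b = Gatherer({"http://b.example/1": "distributed search"})

broker = Broker()
for gatherer in (server_a, server_b):
    broker.ingest(gatherer.gather())

hits = broker.query("search")
print(hits)
```

Because each gatherer runs next to its own data, only the summaries travel to the broker, which is the load reduction the distributed design aims for.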


4.2 Distributed Architecture (Cont.)


4.3 About Google

• The name "Google" is a play on the word "googol", which refers to the number represented by 1 followed by one hundred zeros.

• Google receives over 200 million queries each day through its various services.

• As of January 2006, Google has indexed 9.7 billion web pages, 1.3 billion images, and over one billion Usenet messages — in total, approximately 12 billion items. It also caches much of the content that it indexes.


User Interfaces


Google Services and Tools

Source: http://en.wikipedia.org/wiki/List_of_Google_services_and_tools


How Google works


Google finds important pages

• The idea is that the documents on the web have different degrees of "importance".

• Google will show the most important pages first.

• The idea is that more important pages are likely to be more relevant to any query than unimportant pages.


Google Relevance Factors

• Google considers over 100 factors, including:

1. PageRank algorithm.
2. Popularity of page.
3. Position and size of the search terms within page.
4. Unique content.
5. Terms order.
6. Page size and load time.
7. Error-free websites.
8. Important incoming links.
9. Website optimization.


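Google PageRank

• A numeric value that measures how important a page is.

• PageRank (PR) is the actual ranking of a page, as determined by Google.

• In its normalized form, PageRank is a probability, expressed as a numeric value between 0 and 1. The "real" PageRank values in the table below are not normalized.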

Toolbar PageRank    Real PageRank
0/10                0.15 - 0.9
1/10                0.9 - 5.4
2/10                5.4 - 32.4
3/10                32.4 - 194.4
4/10                194.4 - 1,166.4
5/10                1,166.4 - 6,998.4
6/10                6,998.4 - 41,990.4
7/10                41,990.4 - 251,942.4
8/10                251,942.4 - 1,511,654.4
9/10                1,511,654.4 - 9,069,926.4
10/10               9,069,926.4 - 0.85 × N + 0.15
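Each range in the table starts at six times the start of the previous one (0.15, 0.9, 5.4, ...), so the mapping is logarithmic. The sketch below reproduces the table under that reading; the base of 6 and the half-open intervals are assumptions inferred from the table, and the function name is hypothetical. It is not a published Google formula.

```python
def toolbar_pagerank(real_pr, base=6, minimum=0.15):
    """Map a 'real' PageRank value to the 0-10 toolbar scale using the table's ranges.

    Assumes each range is half-open: lower bound included, upper bound excluded.
    """
    if real_pr < minimum:
        raise ValueError("real PageRank below the table's minimum of 0.15")
    level = 0
    upper = minimum * base  # upper bound of the 0/10 range (0.9)
    while level < 10 and real_pr >= upper:
        level += 1
        upper *= base  # each range is six times wider than the previous one
    return level


print(toolbar_pagerank(1.0))         # falls in 0.9 - 5.4
print(toolbar_pagerank(100.0))       # falls in 32.4 - 194.4
print(toolbar_pagerank(10_000_000))  # above 9,069,926.4
```

The practical consequence of the logarithmic scale is that moving up one toolbar step requires roughly six times more real PageRank than the previous step did.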


Google System Features

[Figure: page A with pages T1, T2, …, Tn and C1, C2, …, Cm]

• PageRank – Bring order to the web

– PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

PR(A) is the PageRank of page A.
PR(T1) is the PageRank of a page T1 that links to page A.
C(T1) is the number of links going out of page T1.
d is a damping factor, usually set to 0.85.
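Since every page's PageRank depends on the PageRank of the pages linking to it, the formula is usually evaluated by repeated substitution until the values settle. The following is a minimal sketch of that iteration; the three-page link graph, the function name, and the fixed iteration count are illustrative assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over every page T linking to A.

    links maps each page to the list of pages it links to; every linked page
    must also appear as a key, and every page needs at least one outgoing link.
    """
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            incoming = sum(
                ranks[source] / len(targets)  # PR(T) / C(T)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical graph: T1 and T2 both link to A, and A links back to T1.
ranks = pagerank({"A": ["T1"], "T1": ["A"], "T2": ["A"]})
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

In this form of the formula the values are not normalized: T2 has no incoming links, so it keeps the minimum value 1 - d, and the values of all pages add up to the number of pages.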


Example

• PageRank calculation:

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

These equations can easily be solved. We get the following PageRank values for the single pages:

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
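The solution can be checked with exact fractions. The sketch below assumes what the equations imply: a damping factor d = 0.5 and a three-page graph in which A links to B and C, B links to C, and C links to A (the original diagram is not available, so this graph is inferred).

```python
from fractions import Fraction

d = Fraction(1, 2)

# Substituting PR(B) into PR(C), then PR(C) into PR(A), leaves one unknown:
#   PR(C) = 3/4 + 3/8 * PR(A)
#   PR(A) = 1/2 + 1/2 * (3/4 + 3/8 * PR(A))  =>  (13/16) * PR(A) = 7/8
pr_a = Fraction(7, 8) / Fraction(13, 16)
pr_b = (1 - d) + d * (pr_a / 2)
pr_c = (1 - d) + d * (pr_a / 2 + pr_b)

# The solved values must also satisfy the first equation.
assert pr_a == (1 - d) + d * pr_c

print(pr_a, pr_b, pr_c)  # 14/13 10/13 15/13
```

The three values sum to 3, the number of pages, which is a quick sanity check for this non-normalized form of PageRank.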


Indexing

• For example, the word "civil" might occur in documents 3, 8, 22, 56, 68, and 92, while the word "war" might occur in documents 2, 8, 15, 22, 68, and 77.

• Suppose someone comes to Google and types in civil war. In order to present and score the results, we need to do two things:

1. Find the set of pages that contain the user’s query somewhere

2. Rank the matching pages in order of relevance
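The first step, finding the pages that contain every query word, amounts to intersecting the posting lists of the two words. Here is a minimal sketch using the document numbers from the example; the dictionary layout and function name are illustrative, not a description of Google's internal data structures.

```python
# Posting lists from the example: word -> sorted IDs of documents containing it.
index = {
    "civil": [3, 8, 22, 56, 68, 92],
    "war": [2, 8, 15, 22, 68, 77],
}


def intersect(a, b):
    """Walk two sorted posting lists in step, keeping only the IDs present in both."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


matches = intersect(index["civil"], index["war"])
print(matches)  # [8, 22, 68]
```

Only documents 8, 22, and 68 contain both words, so these are the candidates handed to the ranking step.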


Web crawler

• A web crawler (also known as a web spider) is a program which browses the World Wide Web in a methodical, automated manner.

• Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.

• A crawler starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.
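The crawl loop described above can be sketched without touching the network. In this illustration the "Web" is a hypothetical in-memory dictionary, `fetch_links` stands in for downloading and parsing a page, and the only policy applied is a cap on the number of pages visited.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> hyperlinks found on that page.
WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)  # the list of URLs still to visit
    seen = set(seeds)        # URLs already queued, so none is visited twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://example.org/"], lambda url: WEB.get(url, []))
print(order)
```

A real crawler would add further policies on top of this loop, such as which pages to revisit and how politely to space requests to one server.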


Google Search Engine Architecture

• URL Server - provides URLs to be fetched
• Crawler - is distributed
• Store Server - compresses and stores pages
• Repository - holds pages for indexing
• Indexer - parses documents, records words, positions, font size, capitalization
• Lexicon - list of unique words found
• Barrels hold
• Anchors - keep information about links found in web pages
• URL Resolver - converts relative URLs to absolute
• Sorter - generates Doc Index
• Doc Index - inverted index of all words in all documents (except stop words)
• Links - stores info about links to each page (used for PageRank)
• PageRank - computes a rank for each page retrieved
• Searcher - answers queries
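As a small illustration of the URL Resolver step, Python's standard library can turn the relative links found on a page into absolute URLs. The base URL and link values here are hypothetical, and this shows only the general technique, not Google's implementation.

```python
from urllib.parse import urljoin

# Hypothetical page being parsed and the href values found in its anchors.
base = "http://www.example.org/docs/index.html"
hrefs = ["intro.html", "../images/logo.gif", "/about"]

absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
```

Resolving links this way lets the link store refer to each target page by one canonical absolute URL, regardless of how the linking page wrote it.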


5. Web Directories

• Web directory: A classification of Web pages by subject.

• Principles:
– Classification is by a hierarchical taxonomy.
– A directory may be specific to a subject, a region, or a language.
– Pages are submitted and reviewed before they are included.
– Automatic classification is not successful enough.

• Advantage: if found, the answer will be useful in most cases.

• Disadvantages:
– Classification is not specialized enough.
– Not all Web pages are classified.


6. Metasearchers

• Metasearcher: a Web server that sends a given query to several search engines and Web directories, collects the answers and unifies them.
– Examples: Metacrawler, Savvysearch, MetaSearch, Mamma.

• Advantages:
– Combine the results of many sources.
– Save users from the need to pose queries to multiple searchers.
– Ability to sort the results by different attributes.
– Pages retrieved by multiple searchers are more relevant.
– Improve coverage: individual searchers cover a small fraction of the Web.

• Issues:
– How to translate the given query to the specific language of each search engine?
– How to rank the unified results?
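One possible answer to the ranking question is to score each page by its position in every result list that returns it, so that pages found by several searchers rise to the top. The sketch below uses reciprocal rank fusion, which is a technique chosen here for illustration and is not named in the slides; the engine results and the constant k = 60 are assumptions.

```python
from collections import defaultdict


def fuse(result_lists, k=60):
    """Merge ranked URL lists with reciprocal rank fusion (one possible choice)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Higher score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical answers from two search engines for the same query.
engine_1 = ["a.example", "b.example", "c.example"]
engine_2 = ["b.example", "d.example", "a.example"]

merged = fuse([engine_1, engine_2])
print(merged)
```

Because the method needs only rank positions, it sidesteps the problem that different engines report relevance scores on incompatible scales.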


7. Google Searching Guidelines

• Query modifiers
– Use these commands in the search window.

• intitle:test

• allintitle:test results

• inurl:testresults

• allinurl:testresults personality

• allintext:test results personality

• allinanchor:test results personality

• site:loc.gov

• filetype:doc
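These modifiers are plain text inside the query string, so a query that uses them can also be composed by a program. The helper below is a hypothetical sketch; it assumes the common `q` parameter of the Google search URL and simply joins free-text terms with `operator:value` pairs.

```python
from urllib.parse import urlencode


def build_query(terms, **modifiers):
    """Join free-text terms with operator:value modifiers such as site= or filetype=."""
    parts = [terms] if terms else []
    parts += [f"{name}:{value}" for name, value in modifiers.items()]
    return " ".join(parts)


query = build_query("test results", site="loc.gov", filetype="doc")
url = "https://www.google.com/search?" + urlencode({"q": query})

print(query)
print(url)
```

`urlencode` takes care of escaping the spaces and colons, so the same helper works for any combination of the modifiers listed above.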


intitle:test results

• This search returns sites with the word test in the title and results anywhere in the document.


inurl:test results

• inurl:test results – only test must be found in the web address (URL)


allintext

• Sometimes you get pages that do not have your search term/phrase in them.

• Use allintext to get only those pages that have your search terms in them.
– Compare the searches in the next two slides…


Example: crash test results


allintext:crash test results

Different pages float to the top of your “hit list”.

And you get fewer pages than before.


site:

• Limit your search to a specific web site.

• Enter the search terms, then the qualifier.

• EXAMPLES:
– “students” site:ksu.edu
  Finds student(s) on the King Saud University site.


filetype:

• You can specify a type of document to search.

• EXAMPLES:
– pdf – Adobe readable files
– doc – Microsoft Word documents
– mdb – Microsoft Access databases
– jpg, gif, tif – graphics, photos
– ppt – Microsoft PowerPoint presentations


define:

• Provides definitions of the given words, gathered from various online sources.


Funny Google News

• Google Bombing “Miserable Failure”

• Google decided to use the surface of the Moon to promote itself and its products by placing its logo on the Moon's surface.


Summary

• Search engines are among the most important applications or services on the web.

• The success of the Google search engine was mainly due to its simple, easy-to-use, no-ad interface, and its powerful PageRank algorithm.


Thanks

Any Questions?