
Web Search - Lecture 10 - Web Information Systems (4011474FNR)


This lecture is part of a Web Information Systems course given at the Vrije Universiteit Brussel.


2 December 2005

Web Information Systems: Web Search

Prof. Beat Signer

Department of Computer Science

Vrije Universiteit Brussel

http://www.beatsigner.com


Search Engine Result Pages (SERP)


Search Engine Result Pages (SERP) ...


Vertical Search Result Pages


Search Engine Market Share (2013)


Search Engine Result Page

There is a variety of information shown on a search engine result page (SERP)
- organic search results
- non-organic search results
- meta-information about the result (e.g. number of result pages)
- vertical navigation
- advanced search options
- query refinement suggestions
- ...


Search Engine History

Early "search engines" include various systems

starting with Bush's Memex

Archie (1990) first Internet search engine

indexing of files on FTP servers

W3Catalog (September 1993) first "web search engine"

mirroring and integration of manually maintained catalogues

JumpStation (December 1993) first web search engine combining crawling, indexing and

searching


Search Engine History ...

In the following two years (1994/1995) many new search engines appeared
- AltaVista, Infoseek, Excite, Inktomi, Yahoo!, ...
Two categories of early Web search solutions
- full text search
  - based on an index that is automatically created by a web crawler in combination with an indexer
  - e.g. AltaVista or InfoSeek
- manually maintained classification (hierarchy) of webpages
  - significant human editing effort
  - e.g. Yahoo!


Information Retrieval

Precision and recall can be used to measure the performance of different information retrieval algorithms

precision = |relevant documents ∩ retrieved documents| / |retrieved documents|
recall = |relevant documents ∩ retrieved documents| / |relevant documents|

Example: for a collection of ten documents D1-D10, a query retrieves the documents {D1, D3, D8, D9, D10}; three of the five retrieved documents are relevant and there are four relevant documents in total, so

precision = 3/5 = 0.6
recall = 3/4 = 0.75
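A minimal Python sketch of these two measures, applied to the example above; the slide does not spell out which documents are relevant, so the relevant set below is only an assumption that is consistent with the reported values.

retrieved = {"D1", "D3", "D8", "D9", "D10"}   # documents returned for the query
relevant = {"D3", "D5", "D8", "D9"}           # hypothetical relevant set (4 documents)

def precision(relevant, retrieved):
    # fraction of the retrieved documents that are relevant
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    # fraction of the relevant documents that were retrieved
    return len(relevant & retrieved) / len(relevant)

print(precision(relevant, retrieved))   # 0.6
print(recall(relevant, retrieved))      # 0.75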


Information Retrieval ...

Often a combination of precision and recall, the so-called F-score (harmonic mean), is used as a single measure

F-score = 2 · precision · recall / (precision + recall)

Example 1: retrieving {D1, D3, D8, D9, D10} gives precision = 0.6 and recall = 0.75, hence F-score = 0.67
Example 2: additionally retrieving D2 and D5 gives precision = 0.57 and recall = 1, hence F-score = 0.73
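The same numbers in a short Python sketch; the helper simply evaluates the harmonic mean for the two precision/recall pairs from the slide.

def f_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f_score(0.6, 0.75), 2))    # 0.67
print(round(f_score(4 / 7, 1.0), 2))   # 0.73 (precision 4/7 ≈ 0.57, recall 1)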


Boolean Model

Based on set theory and boolean logic
Exact matching of documents to a user query
Uses the boolean AND, OR and NOT operators
- query: Shopping AND Ghent AND NOT Delhaize
- computation: 101110 AND 100111 AND 000111 = 000110
- result: document set {D4, D5}

[inverted index: one 0/1 row per term (Bank, Delhaize, Ghent, Metro, Shopping, Train) and one column per document D1-D6]
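A minimal sketch of this boolean retrieval step in Python, using integers as bit vectors over D1-D6 (D1 is the leftmost bit); the rows for Shopping and Ghent are taken from the computation above, the Delhaize row is inferred from its negation 000111, and the remaining rows of the inverted index are not needed for this query.

# Inverted index rows as 6-bit vectors over the documents D1..D6.
index = {
    "Shopping": 0b101110,
    "Ghent":    0b100111,
    "Delhaize": 0b111000,
}
ALL = 0b111111   # mask for NOT over the six documents

# Query: Shopping AND Ghent AND NOT Delhaize
result = index["Shopping"] & index["Ghent"] & (ALL & ~index["Delhaize"])
print(format(result, "06b"))   # 000110

matches = [f"D{i + 1}" for i in range(6) if result & (1 << (5 - i))]
print(matches)   # ['D4', 'D5']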


Boolean Model ...

Advantages
- relatively easy to implement and scalable
- fast query processing based on parallel scanning of indexes
Disadvantages
- does not pay attention to synonymy
  - different words with similar meaning
- does not pay attention to polysemy
  - a single word with different meanings
- no ranking of output
- often the user has to learn a special syntax, such as the use of double quotes to search for phrases
Variants of the boolean model form the basis of many search engines


Vector Space Model

Algebraic model representing text documents and queries as vectors based on the index terms
- one dimension for each term
Compute the similarity (angle) between the query vector and the document vectors
Advantages
- simple model based on linear algebra
- partial matching with relevance scoring for results
- potential query re-evaluation based on user relevance feedback
Disadvantages
- computationally expensive (similarity measures for each query)
- limited scalability
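A small cosine-similarity sketch of this idea; the three index terms and the term weights are made up for illustration and do not come from the slides.

import math

def cosine(a, b):
    # cosine of the angle between two term vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# One dimension per index term, e.g. (bank, ghent, shopping); weights are illustrative.
query = [0, 1, 1]
documents = {"D1": [1, 0, 2], "D2": [0, 3, 1], "D3": [1, 1, 0]}

for name, vector in sorted(documents.items(), key=lambda d: -cosine(query, d[1])):
    print(name, round(cosine(query, vector), 2))   # documents ranked by similarity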


Web Search Engines

Most web search engines are based on traditional information retrieval techniques but they have to be adapted to deal with the characteristics of the Web
- immense amount of web resources (>50 billion webpages)
- hyperlinked resources
- dynamic content with frequent updates
- self-organised web resources
Evaluation of performance
- no standard collections
- often based on user studies (satisfaction)
Of course not only the precision and recall but also the query answer time is an important issue


Web Search Engine Architecture

[architecture diagram with the components: WWW, Crawler, URL Pool, URL Handler (filter, normalisation and duplicate elimination), URL Repository, Storage Manager ("content already added?"), Page Repository, Indexers, Document Index, Inverted Index, Special Indexes, Query Handler, Ranking and Client]


Web Crawler

A web crawler or spider is used to create an index of webpages to be used by a web search engine
- any web search is then based on this index
Web crawler has to deal with the following issues
- freshness
  - the index should be updated regularly (based on webpage update frequency)
- quality
  - since not all webpages can be indexed, the crawler should give priority to "high quality" pages
- scalability
  - it should be possible to increase the crawl rate by just adding additional servers (modular architecture)
  - e.g. the estimated number of Google servers in 2007 was 1'000'000 (including not only the crawler but the entire Google platform)


Web Crawler ...

- distribution
  - the crawler should be able to run in a distributed manner (computer centers all over the world)
- robustness
  - the Web contains a lot of pages with errors and a crawler has to deal with these problems
  - e.g. deal with a web server that creates an unlimited number of "virtual webpages" (crawler trap)
- efficiency
  - resources (e.g. network bandwidth) should be used in the most efficient way
- crawl rates
  - the crawler should pay attention to existing web server policies (e.g. revisit-after HTML meta tag or robots.txt file)

Example robots.txt file:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
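A highly simplified crawler loop in Python, sketching the URL pool, robots.txt check and duplicate elimination mentioned above; this is an illustrative sketch only (no politeness delays, no distribution) and the seed URL is just a placeholder.

from collections import deque
from urllib import robotparser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen
import re

def crawl(seed, max_pages=20):
    robots = robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    robots.read()                       # respect the server's crawling policy
    pool, seen = deque([seed]), set()   # URL pool and duplicate elimination
    while pool and len(seen) < max_pages:
        url = urldefrag(pool.popleft()).url
        if url in seen or not robots.can_fetch("*", url):
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                    # robustness: skip pages that cause errors
        seen.add(url)                   # here the page would be handed to the indexer
        for link in re.findall(r'href="(.*?)"', html):
            pool.append(urljoin(url, link))
    return seen

# crawl("https://www.example.org/")    # placeholder seed URL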


Pre-1998 Web Search

Find all documents for a given query term
- use information retrieval (IR) solutions
  - boolean model
  - vector space model
  - ...
- ranking based on "on-page factors"
- problem: poor quality of search results (order)
Larry Page and Sergey Brin proposed to compute the absolute quality of a page called PageRank
- based on the number and quality of pages linking to a page (votes)
- query-independent


Origins of PageRank

Developed as part of an academic project at Stanford University
- research platform to aid understanding of large-scale web data and enable researchers to easily experiment with new search technologies
Larry Page and Sergey Brin worked on the project about a new kind of search engine (1995-1998) which finally led to a functional prototype called Google


PageRank

A page Pi has a high PageRank Ri
- if there are many pages linking to it
- or, if there are some pages with a high PageRank linking to it
Total score = IR score × PageRank

[example graph: eight pages P1-P8 with PageRanks R1-R8]


Basic PageRank Algorithm

R(Pi) = Σ_{Pj ∈ Bi} R(Pj) / Lj

where Bi is the set of pages that link to page Pi and Lj is the number of outgoing links of page Pj

[example figures: small three-page graphs P1, P2 and P3, one with PageRanks 1, 1 and 1 and one with PageRanks 1.5, 1.5 and 0.75]


Matrix Representation

Let us define a hyperlink matrix H with

Hij = 1/Lj  if Pj ∈ Bi
Hij = 0     otherwise

For a three-page example with the links P1 → P2, P2 → P1, P2 → P3 and P3 → P1 this gives

H = | 0    1/2   1 |
    | 1    0     0 |
    | 0    1/2   0 |

With the PageRank vector R = (R(P1), ..., R(Pn)) we obtain R = HR, i.e. R is an eigenvector of H with eigenvalue 1


Matrix Representation ...

We can use the power method to find R
- sparse matrix H with 40 billion columns and rows but only an average of 10 non-zero entries in each column

R(t+1) = H R(t)

For our example this results in R = (2, 2, 1) or, normalised, R = (0.4, 0.4, 0.2)
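A small power-iteration sketch for the three-page example; the matrix below is the reconstruction of H given above, and the fixed iteration count is an arbitrary stopping rule.

# Power iteration R(t+1) = H R(t) for the three-page example.
H = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

R = [1 / 3] * 3                          # start with a uniform distribution
for _ in range(100):                     # fixed number of iterations as stopping rule
    R = [sum(H[i][j] * R[j] for j in range(3)) for i in range(3)]

print([round(r, 2) for r in R])          # [0.4, 0.4, 0.2]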


Dangling Pages (Rank Sink)

Problem with pages that have no outgoing links (e.g. P2)
Stochastic adjustment
- if page Pj has no outgoing links then replace column j with the value 1/n in every entry (n = number of pages)
New stochastic matrix S always has a stationary vector R
- can also be interpreted as a Markov chain

Example with the link P1 → P2 and a dangling page P2:

H = | 0  0 |   results in R = (0, 0)
    | 1  0 |

C = | 0  1/2 |   and   S = H + C = | 0  1/2 |
    | 0  1/2 |                     | 1  1/2 |


Strongly Connected Pages (Graph)

Add new transition probabilities between all pages
- with probability d we follow the hyperlink structure S
- with probability 1-d we choose a random page
- matrix G becomes irreducible
Google matrix G reflects a random surfer
- no modelling of back button

G = d·S + (1-d)·(1/n)·E   (E is the matrix with every entry equal to 1)

R = G·R

[example graph: five pages P1-P5 with random-jump probability 1-d between all pages]
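A sketch of the full construction in Python, assuming the commonly used damping factor d = 0.85 (the slides do not fix a value); it builds S with the dangling-page adjustment, forms G = d·S + (1-d)/n·E and runs the power iteration on the three-page example from above.

def google_matrix(links, n, d=0.85):
    # links[j] = list of pages that page j links to (0-based page indices)
    S = [[0.0] * n for _ in range(n)]
    for j in range(n):
        targets = links.get(j, [])
        if targets:                            # ordinary column of the hyperlink matrix
            for i in targets:
                S[i][j] = 1.0 / len(targets)
        else:                                  # dangling page: replace column with 1/n
            for i in range(n):
                S[i][j] = 1.0 / n
    # G = d*S + (1-d)/n * E, with E the all-ones matrix
    return [[d * S[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

def pagerank(G, iterations=100):
    n = len(G)
    R = [1.0 / n] * n
    for _ in range(iterations):
        R = [sum(G[i][j] * R[j] for j in range(n)) for i in range(n)]
    return R

# Three-page example: P1 -> P2, P2 -> P1 and P3, P3 -> P1
G = google_matrix({0: [1], 1: [0, 2], 2: [0]}, n=3)
print([round(r, 3) for r in pagerank(G)])      # close to the (0.4, 0.4, 0.2) above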


Examples

[example graph: three pages A1, A2 and A3 with PageRanks 0.26, 0.37 and 0.37, computed with the Google matrix G defined above]


Examples ...

[example graph: two sets of pages A1-A3 and B1-B3 with PageRanks A1 = 0.13, A2 = 0.185, A3 = 0.185 and B1 = 0.13, B2 = 0.185, B3 = 0.185; P(A) = 0.5, P(B) = 0.5]


Examples

PageRank leakage

[example graph: pages A1-A3 and B1-B3 with PageRanks A1 = 0.10, A2 = 0.14, A3 = 0.14 and B1 = 0.22, B2 = 0.20, B3 = 0.20; P(A) = 0.38, P(B) = 0.62]


Examples ...

[example graph: pages A1-A3 and B1-B3 with PageRanks A1 = 0.3, A2 = 0.23, A3 = 0.18 and B1 = 0.10, B2 = 0.095, B3 = 0.095; P(A) = 0.71, P(B) = 0.29]


Examples

PageRank feedback

[example graph: pages A1-A3 and B1-B3 with PageRanks A1 = 0.35, A2 = 0.24, A3 = 0.18 and B1 = 0.09, B2 = 0.07, B3 = 0.07; P(A) = 0.77, P(B) = 0.23]


Examples ...

[example graph: pages A1-A4 and B1-B3 with PageRanks A1 = 0.33, A2 = 0.17, A3 = 0.175, A4 = 0.125 and B1 = 0.08, B2 = 0.06, B3 = 0.06; P(A) = 0.80, P(B) = 0.20]


Google Webmaster Tools

Various services and infor-

mation about a website

Site configuration

submission of sitemap

crawler access

URLs of indexed pages

settings

- e.g. preferred domain

Your site on the web

search queries

keywords

internal and external links


Google Webmaster Tools ...

Diagnostics
- crawl rates and errors
- HTML suggestions
Use HTML suggestions for on-page factor optimisation
- meta description
  - duplicate meta descriptions
  - too long meta descriptions
- title tag
  - missing or duplicate title tags
  - too long or too short title tags
- non-indexable content
Similar tools offered by other search engines
- e.g. Bing Webmaster Tools


XML Sitemaps

List of URLs that should be crawled and indexed

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.tenera.ch/trommelreibe-classic-p-2259-l-de.html</loc>
    <lastmod>2013-07-06</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.4</priority>
  </url>
  <url>
    <loc>https://www.tenera.ch/universalmesser-weiss-p-34-l-de.html</loc>
    <lastmod>2012-12-05</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.1</priority>
  </url>
  ...
</urlset>
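Such a file can also be generated with Python's standard library; the following sketch reproduces the two entries above, and the output file name is arbitrary.

import xml.etree.ElementTree as ET

def write_sitemap(entries, filename="sitemap.xml"):
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag, value in entry.items():          # loc, lastmod, changefreq, priority
            ET.SubElement(url, tag).text = value
    ET.ElementTree(urlset).write(filename, encoding="UTF-8", xml_declaration=True)

write_sitemap([
    {"loc": "https://www.tenera.ch/trommelreibe-classic-p-2259-l-de.html",
     "lastmod": "2013-07-06", "changefreq": "weekly", "priority": "0.4"},
    {"loc": "https://www.tenera.ch/universalmesser-weiss-p-34-l-de.html",
     "lastmod": "2012-12-05", "changefreq": "weekly", "priority": "0.1"},
])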


XML Sitemaps ...

All major search engines support the sitemap format
The URLs of a sitemap are not guaranteed to be added to a search engine's index
- helps search engines to find pages that are not yet indexed
Additional metadata might be provided to search engines
- relative page relevance (priority)
- date of last modification (lastmod)
- update frequency (changefreq)


Questions

Is PageRank fair?

What about Google's power and influence?

What about Web 2.0 or Web 3.0 and web search?
- "non-existent" webpages such as offered by Rich Internet Applications (e.g. using AJAX) may bring problems for traditional search engines (hidden web)
- new forms of social search
  - Delicious
  - ...
- social marketing


The Google Effect

A recent study by Sparrow et al. shows that people are less likely to remember things that they believe to be accessible online
- Internet as a transactive memory
Does our memory work differently in the age of Google?
What implications will the future of the Internet and new search have?


Search Engine Marketing (SEM)

For many companies Internet marketing has become a big business
Search engine marketing (SEM) aims to increase the visibility of a website
- search engine optimisation (SEO)
- paid search advertising (non-organic search)
- social media marketing
SEO should not be decoupled from a website's content, structure, design and used technologies
SEO has to be seen as a continuous process in a rapidly changing environment
- different search engines with regular changes in ranking


Structural Choices

Keep the website structure as flat as possible
- minimise link depth
- avoid pages with much more than 100 links
Think about your website's internal link structure
- which pages are directly linked from the homepage?
- create many internal links for important pages
- be "careful" about where to put outgoing links
  - PageRank leakage
- use keyword-rich anchor texts
- dynamically create links between related content
  - e.g. "customers who bought this also bought ..." or "visitors who viewed this also viewed ..."
Increase the number of pages


Technological Choices

Use an SEO-friendly content management system (CMS)
Dynamic URLs vs. static URLs
- avoid session IDs and parameters in URLs
- use URL rewriting to get descriptive URLs containing keywords
Think carefully about the use of dynamic content
- Rich Internet Applications (RIAs) based on AJAX etc.
- content hidden behind pull-down menus etc.
Address webpages consistently
- http://www.vub.ac.be vs. http://www.vub.ac.be/index.php
Some notes about the Google toolbar
- shows logarithmic PageRank value (from 0 to 10)
- information not frequently updated ("Google dance")


Consistent Addressing of Webpages


Search Engine Optimisations

Different things can be optimised
- on-page factors
- off-page factors
It is assumed that some search engines use more than 200 on-page and off-page factors for their ranking
Difference between optimisation and breaking the "search engine rules"
- white hat and black hat optimisations
A bad ranking or removal from the index can cost a company a lot of money or even mark the end of the company
- e.g. supplemental index ("Google hell")


Positive On-Page Factors

Use of keywords at relevant places
- in title tag (preferably one of the first words)
- in URL
- in domain name
- in header tags (e.g. <h1>)
- multiple times in body text
Provide metadata
- e.g. <meta name="description"> is also used by search engines to create the text snippets on the SERPs
Quality of HTML code
Uniqueness of content across the website
Page freshness (changes from time to time)


Negative On-Page Factors

Links to "bad neighbourhood"

Link selling in 2007 Google announced a campaign against

paid links that transfer PageRank

Over optimisation penalty (keyword stuffing)

Text with same colour as background (hidden content)

Automatic redirect via the refresh meta tag

Cloaking different pages for spider and user

Malware being hosted on the page


Negative On-Page Factors ...

Duplicate or similar content

Duplicate page titles or meta tags

Slow page load time

Any copyright violations

...


Positive Off-Page Factors

Links from pages with a high PageRank

Keywords in anchor text of inbound links

Links from topically relevant sites

High clickthrough rate (CTR) from search engine for a given keyword
Listed in DMOZ / Open Directory Project (ODP) and Yahoo directories
High number of shares on social networks
- e.g. Facebook, Google+ or Twitter


Positive Off-Page Factors ...

Site age (stability)
- Google sandbox?

Domain expiration date

High PageRank

...


Negative Off-Page Factors

Site often not accessible to crawlers
- e.g. server problem
High bounce rate
- users immediately press the back button
Link buying
- rapidly increasing number of inbound links
Use of link farms
Participation in link sharing programmes
Links from bad neighbourhood?
Competitor attack (e.g. via duplicate content)?


Black Hat Optimisations (Don'ts)

Link farms

Spamdexing in guestbooks, Wikipedia etc.
- "solution": <a rel="nofollow" href="...">...</a>
Keyword stuffing
- overuse of keywords
  - content keyword stuffing
  - image keyword stuffing
  - keywords in meta tags
  - invisible text with keywords
Selling/buying links
- "big" business until 2007
- costs based on the PageRank of the linking site


Black Hat Optimisations (Don'ts) ...

Doorway pages (cloaking)
- doorway pages are normally just designed for search engines
  - user is automatically redirected to the target page
- e.g. BMW Germany and Ricoh Germany banned in February 2006


Nofollow Link Example

Nofollow value for hyperlinks introduced by Google in 2005 to avoid spamdexing
- <a rel="nofollow" href="...">...</a>
Links with a nofollow value were not counted in the PageRank computation
- division by the number of outgoing links
- e.g. a page with 9 outgoing links of which 3 are nofollow links
  - PageRank divided by 6 and distributed across the 6 "really linked" pages
SEO experts started to use (misuse) the nofollow links for PageRank sculpting
- control the flow of PageRank within a website


Nofollow Link Example ...

In June 2009 Google decided to treat nofollow links differently to avoid PageRank sculpting
- division by the total number of outgoing links
- e.g. a page with 9 outgoing links of which 3 are nofollow links
  - PageRank divided by 9 and distributed across the 6 "really linked" pages
- no longer a good solution to prevent spamdexing since we lose (diffuse) some PageRank
SEO experts started to use alternative techniques to replace nofollow links
- e.g. obfuscated JavaScript links
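The difference can be illustrated with a tiny numeric sketch for the 9-link example above; PR stands for the PageRank of the linking page and its value is arbitrary.

PR = 1.0           # PageRank of the linking page (arbitrary value)
outgoing = 9       # total number of outgoing links
nofollow = 3       # links marked with rel="nofollow"
followed = outgoing - nofollow

share_before_2009 = PR / followed    # nofollow links excluded from the division
share_after_2009 = PR / outgoing     # division by all links, nofollow share is lost

print(round(share_before_2009, 3), round(share_after_2009, 3))   # 0.167 0.111
print(round(followed * share_after_2009, 3))   # only 0.667 of PR is passed on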


Product Search

Various shopping and price comparison sites
- import product data
- some of them are free, for others one has to pay
Google Product Search
- started as Froogle, became Google Products and now Google Product Search
- product data uploaded to Google Base
- very effective vertical search


Non-Organic Search

In addition to the so-called organic search, websites can also participate in non-organic web search
- cost per impression (CPI)
- cost per click (CPC)
The non-organic web search should be treated independently from the organic web search
Quality of the landing page can have an impact on the non-organic web search performance!
The Google AdWords programme is an example of a commercial non-organic web search service
- other services include Yahoo! Advertising Solutions, Facebook Ads, ...


Google AdWords

Pay-per-click (PPC) or cost-per-thousand (CPM)
Campaigns and ad groups
Two types of advertising
- search
- content network
  - Google AdSense
Highly customisable ads
- region
- language
- daytime
- ...


Google AdWords ...

Excellent control and monitoring for AdWords users
- cost per conversion
In 2013 Google's total advertising revenues were 51 billion USD


Conclusions

Web information retrieval techniques have to deal with the specific characteristics of the Web
PageRank algorithm
- absolute quality of a page based on incoming links
- based on random surfer model
- computed as eigenvector of the Google matrix G
PageRank is just one (important) factor
Various implications for website development and SEO


Exercise 10

Web Search and PageRank


References

L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, January 1998

S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7), April 1998

A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, July 2006

PageRank Calculator, http://www.webworkshop.net/pagerank_calculator.php


References …

B. Sparrow, J. Liu and D.M. Wegner, Google Effects on Memory: Cognitive Consequences of Having Information at Our Fingertips, Science, July 2011

Google Webmaster Tools, http://www.google.com/webmasters/

The W3C Markup Validation Service, http://validator.w3.org

SEOmoz, http://moz.com


2 December 2005

Next Lecture: Security, Privacy and Trust