Web Document Ranking

Sérgio Nunes

DEI, Faculdade de Engenharia, Universidade do Porto

SSIIM, MIEIC, 2015/16

Overview of concepts and techniques for ranking web documents

The World Wide Web

The Web

The World Wide Web is a distributed information system unprecedented in many ways: in size, in lack of central coordination, and in the diversity of users’ backgrounds.

The first published vision of a large-scale distributed hypertext system can be traced back to Vannevar Bush’s seminal article “As We May Think” (1945).

Web Growth

Web pages >> web hosts.

AltaVista reported an index of 30 million web pages in 1995. By 2005, there were at least 11.5 billion indexable web pages [Gulli et al.].

How can we estimate the size of the web?

Authority Problem

Several factors have led to the mass adoption of the web as a publishing medium by everyone from anonymous individuals to professional organizations.

The lack of a central authority or coordination, the simplicity of the underlying technology, and the easy access to free web publishing tools mean that anybody can publish anything.

How can we assess the reliability of content found on the web?

Which pages can we trust?

Web Directories

A web directory is a hierarchical structure, organized by topics, containing selected web sites (e.g. dmoz.org).

In the early days of the web, these directories were very popular: human editors selected the highest-quality pages for each category.

This approach quickly became unfeasible at web scale. Additionally, it required strong semantic agreement between the directory’s editors and its users.

Search Engines

First-generation search engines were based on classic keyword matching techniques developed for text search. The main challenge was dealing with the size of the web.

While classic text search techniques provided sufficient results, the overall quality was questionable due to the nature of web content.

Most notably, the web has no central editorial control, there is a complete lack of publishing standards, there is a high degree of content duplication, and some content is published with malicious intent (i.e. spam).

Web’s Size

Estimating the size of the web is not a trivial problem: for example, the number of dynamic web pages is technically infinite.

The deep web is estimated to be several orders of magnitude bigger than the surface web.

The size of the surface web was estimated at 170 TB in 2003. The deep web was roughly 500 times bigger, at approximately 90,000 TB.

“How Much Information? 2003” http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/

SPAM

On the web, spam is an issue of major importance.

At its root, spam exists due to commercial motivations, e.g. to achieve better rankings in search engines. There is a wide range of techniques for web spam, from simple to highly sophisticated.

- Keyword stuffing: repetition of high-value keywords in content.
- Cloaking (mask): show different content to search engines.
- Link spam: artificial links created using hidden links, link farms, etc.

Web search engines operate in an adversarial information retrieval environment (research topic).

SPAM Example

1. Scrape content from real web documents: blogs, Wikipedia, news sites, etc.

2. Mix and generate synthetic content to avoid duplicate detection.

3. Insert key words and phrases.

4. Replace or insert links to sites being “promoted”.

5. Publish content on the web using free publishing platforms (e.g. WordPress, Blogspot, comments, etc.).

The Web Graph

The web is usually modeled as a directed graph, where each web page is a node and each link is a directed edge.

[Figure: example graph with nodes A, B and C]

The hyperlinks that point to a page are called in-links and those originating in the page are called out-links. The number of in-links to a page is called in-degree.
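The graph model and the degree definitions above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the edge list `links` and the helper `degrees` are hypothetical names, and the three edges are invented for the example.

```python
from collections import defaultdict

# Hypothetical edge list: each pair is (source page, target page).
links = [("A", "B"), ("A", "C"), ("B", "C")]


def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1   # one more out-link for the source page
        in_degree[target] += 1    # one more in-link for the target page
    return dict(in_degree), dict(out_degree)


in_deg, out_deg = degrees(links)
print(in_deg)   # in-links per page
print(out_deg)  # out-links per page
```

Pages that never appear as a target (here, A) are simply absent from `in_deg`, which corresponds to an in-degree of zero.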

The Bowtie Model

[Figure: bowtie structure of the web, with regions IN, SCC, OUT, TUBE, TENDRILS and DC]

A web surfer can pass from any page in IN to any page in SCC by following hyperlinks, and likewise from any page in SCC to any page in OUT. SCC is a strongly connected core.

“Graph structure in the Web” (2000) http://dx.doi.org/10.1016/S1389-1286(00)00083-9
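The reachability relations that define IN, SCC and OUT can be sketched with two breadth-first searches. This is an illustrative sketch under stated assumptions, not the method of the cited paper: it assumes one page already known to lie in the core (`core_node`), and the functions `reachable` and `bowtie` and the tiny edge list are hypothetical.

```python
from collections import defaultdict, deque


def reachable(adjacency, start):
    """Breadth-first search: all nodes reachable from start (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def bowtie(edges, core_node):
    """Split pages into IN, SCC and OUT relative to a page assumed to be in the core."""
    forward = defaultdict(list)   # follows links in their own direction
    backward = defaultdict(list)  # follows links in reverse
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)
    downstream = reachable(forward, core_node)   # pages the core can reach
    upstream = reachable(backward, core_node)    # pages that can reach the core
    scc = downstream & upstream
    return {"IN": upstream - scc, "SCC": scc, "OUT": downstream - scc}


# Hypothetical graph: i -> a <-> b -> o
regions = bowtie([("i", "a"), ("a", "b"), ("b", "a"), ("b", "o")], "a")
print(regions)
```

Pages in none of the three sets would belong to the remaining regions (tubes, tendrils or disconnected components), which this sketch does not separate.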

Web Ranking

Web Document Ranking

Web documents can be ranked in a static, absolute way or ranked in a given context.

The static ranking of a document is typically called query-independent, i.e. documents have a weight regardless of a query or a context. E.g.: most important document on the World Wide Web.

In query-dependent ranking, each document has a different weight depending on the query or context being analyzed. E.g.: best document for learning how to cook.

Signals

Documents are scored (i.e. ranked) using various sources of information, usually called features or, more generically, signals. A multitude of signals can be identified:

Query-independent signals:

- Length of document
- Age of document
- Number of incoming links
- Number of outgoing links
- Document’s host domain
- Document’s language

Query-dependent signals:

- Number of query terms
- Time of query
- Query terms in document
- Query terms in collection
- Query terms in document title
- Query’s language

Google reportedly uses more than 200 signals in their ranking.

Types of Signals

The signals available in a collection of web documents can be divided into two groups depending on their origins.

The signals obtained directly from the document are named document-based signals. E.g.: term frequency, doc length, etc.

Signals obtained from the Web are named web-based signals. E.g.: number of citations, anchor text, etc.

Web search engines have access to other sources of signals: click data, external collections, etc.

Document-based Signals

Term Frequency

The number of occurrences of a term in a document is a signal typically used in text retrieval. However, the web is an adversarial information retrieval environment, so raw term counts can be inflated by keyword stuffing.

Quasi architecto

Sed ut perspiciatis unde omnis iste

natus error sit flowers accusantium

doloremque laudantium, totam rem

aperiam, eaque ipsa quae ab illo

flowers veritatis et quasi architecto

beatae vitae dicta sunt explicabo.

Nemo enim flowers voluptatem quia

voluptas sit aspernatur aut odit aut

fugit, sed quia consequuntur magni

dolores eos qui ratione voluptatem

sequi nesciunt.

TF("flowers") = 3

Quasi architecto

Sed ut flowers unde omnis flowers

natus error sit flowers accusantium

flowers laudantium, totam rem

aperiam, eaque ipsa quae ab illo

flowers veritatis et quasi flowers

beatae vitae dicta sunt explicabo.

Nemo enim flowers voluptatem quia

voluptas sit aspernatur aut flowers aut

fugit, sed quia flowers magni dolores

eos qui ratione voluptatem sequi

flowers.

TF("flowers") = 10

Quasi architecto

flowers ut flowers flowers omnis

flowers flowers flowers sit flowers

flowers flowers flowers, totam

flowers aperiam, flowers ipsa flowers

ab flowers flowers flowers et quasi

flowers flowers flowers dicta flowers.

flowers enim flowers flowers quia

flowers flowers flowers aut flowers

aut flowers, flowers quia flowers

flowers dolores flowers qui flowers

flowers sequi flowers.

TF("flowers") = ∞
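The counts in the examples above can be reproduced with a simple tokenizer. This is a minimal sketch, assuming whitespace/punctuation tokenization and case-insensitive matching; the function name `term_frequency` is hypothetical, and `document` copies the text of the first example.

```python
import re


def term_frequency(term, text):
    """Count case-insensitive whole-word occurrences of term in text."""
    tokens = re.findall(r"\w+", text.lower())
    return tokens.count(term.lower())


# Text of the first example document above.
document = (
    "Sed ut perspiciatis unde omnis iste natus error sit flowers accusantium "
    "doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo "
    "flowers veritatis et quasi architecto beatae vitae dicta sunt explicabo. "
    "Nemo enim flowers voluptatem quia voluptas sit aspernatur aut odit aut "
    "fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem "
    "sequi nesciunt."
)

print(term_frequency("flowers", document))  # 3, as in the first example
```

Because the count grows with every repetition, a page author can raise it at will, which is why raw term frequency is rarely used on its own.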

Inverse Document Frequency

Terms that appear in fewer documents of a collection have more discriminative power and are thus given a higher weight.

$$\mathrm{IDF}(\mathit{term}) = \frac{|\text{Documents in collection}|}{|\text{Documents containing term}|}$$

Measures the general importance of a term. Combined with term frequency, results in the classic tf.idf measure.
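The ratio above and its combination with term frequency can be sketched as follows. This is an illustration under assumptions: documents are already tokenized into lists, the names `idf` and `tf_idf` and the toy `collection` are hypothetical, and the plain ratio is used as written above (many formulations take the logarithm of this ratio instead).

```python
def idf(term, collection):
    """Plain ratio: documents in collection / documents containing the term."""
    containing = sum(1 for tokens in collection if term in tokens)
    if containing == 0:
        return 0.0  # assumption: a term absent from the collection gets no weight
    return len(collection) / containing


def tf_idf(term, tokens, collection):
    """Term frequency in one document multiplied by the term's IDF."""
    return tokens.count(term) * idf(term, collection)


# Hypothetical collection of four tokenized documents.
collection = [
    ["flowers", "porto", "flowers"],
    ["porto", "domingo"],
    ["porto"],
    ["domingo", "estranho"],
]

print(idf("flowers", collection))                    # rare term, high weight
print(idf("porto", collection))                      # common term, low weight
print(tf_idf("flowers", collection[0], collection))  # 2 occurrences * IDF
```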

Term Position

The position of a term within an HTML file has an impact on its meaning and importance. Terms within the title or strong tags are highlighted differently.

Quasi architecto


natus flowers sit olucap accusantium


aperiam, eaque ipsa quae ab illo sumo

veritatis et quasi flowers beatae vitae

dicta sunt explicabo.

Nemo enim etupm voluptatem quia

flowers sit aspernatur aut odit aut fugit,

sed quia consequuntur flowers dolores

eos qui ratione voluptatem sequi

nesciunt.

Quasi flowers


natus error sit olucap accusantium

doloremque flowers, totam rem


veritatis et quasi architecto beatae vitae






sequi nesciunt.
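One way to turn markup position into a signal is to weight each occurrence by the tag that encloses it. The sketch below uses Python's standard `html.parser`; the weights in `ZONE_WEIGHTS`, the class `ZoneScorer` and the function `zone_weighted_tf` are hypothetical and chosen only for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical zone weights; real systems tune or learn such values.
ZONE_WEIGHTS = {"title": 3.0, "strong": 2.0, "body": 1.0}


class ZoneScorer(HTMLParser):
    """Accumulates a weighted count of one term, depending on the enclosing tag."""

    def __init__(self, term):
        super().__init__()
        self.term = term.lower()
        self.open_zones = []  # stack of currently open weighted tags
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "strong"):
            self.open_zones.append(tag)

    def handle_endtag(self, tag):
        if self.open_zones and self.open_zones[-1] == tag:
            self.open_zones.pop()

    def handle_data(self, data):
        # Text outside any weighted tag counts as plain body text.
        zone = self.open_zones[-1] if self.open_zones else "body"
        hits = re.findall(r"\w+", data.lower()).count(self.term)
        self.score += hits * ZONE_WEIGHTS[zone]


def zone_weighted_tf(term, html):
    scorer = ZoneScorer(term)
    scorer.feed(html)
    scorer.close()
    return scorer.score


page = (
    "<html><head><title>Quasi flowers</title></head>"
    "<body><p>natus <strong>flowers</strong> sit flowers</p></body></html>"
)
print(zone_weighted_tf("flowers", page))  # 3.0 (title) + 2.0 (strong) + 1.0 (body)
```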

Term Position

Regardless of the HTML structure, should terms in different positions have different weights?

Quasi architecto

Sed ut flowers unde flowers iste natus

flowers sit olucap flowers doloremque

flowers, totam rem aperiam, eaque

ipsa quae ab illo sumo veritatis et quasi

architecto beatae vitae dicta sunt

explicabo.





sequi nesciunt.

Quasi architecto


natus error sit olucap accusantium



veritatis et quasi architecto beatae vitae




fugit, flowers quia flowers magni

dolores flowers qui ratione flowers

flowers nesciunt.

Web-based Signals

Host Structure

Web documents in the same host are related to each other.

A document in a high-value host like www.bbc.co.uk should be valued higher than www.besttopnews.com.

The location of a document in a site structure is an important signal. Documents that are closer to the root of a site are typically more important.
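Both host and distance from the site root can be read directly from a URL. The sketch below uses the standard `urllib.parse` module; the function `host_and_depth` is a hypothetical name, and counting path segments is only one possible proxy for the distance from the root.

```python
from urllib.parse import urlparse


def host_and_depth(url):
    """Return (host, number of path segments below the site root)."""
    parsed = urlparse(url)
    segments = [segment for segment in parsed.path.split("/") if segment]
    return parsed.netloc, len(segments)


print(host_and_depth("http://www.bbc.co.uk/"))                   # root page, depth 0
print(host_and_depth("http://www.bbc.co.uk/news/world/europe"))  # three levels down
```

A ranking function could then, for instance, combine a per-host prior with a penalty that grows with depth; how these are weighted is a design choice not specified here.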

Anchor Text

A citation between web documents is defined by an HTML anchor tag, which requires content. The text used in anchor tags is one of the most valuable signals.

<a href="http://www.amazon.com">amazon</a>

[Figure: links pointing to www.amazon.com, labeled with the anchor texts “amazon”, “books”, “books” and “sucks”]
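Collecting the anchor text that other pages use for a target URL can be sketched with the standard `html.parser` module. This is a minimal illustration: the class `AnchorCollector`, the function `collect_anchor_text` and the two sample pages are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Records the text enclosed by each <a href="..."> tag of one page."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)  # target URL -> list of anchor texts
        self.current_href = None
        self.current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.anchors[self.current_href].append(text)
            self.current_href = None


def collect_anchor_text(pages):
    """Merge the anchor texts found in several pages, keyed by target URL."""
    merged = defaultdict(list)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, texts in collector.anchors.items():
            merged[href].extend(texts)
    return dict(merged)


pages = [
    '<a href="http://www.amazon.com">amazon</a>',
    '<p>Buy <a href="http://www.amazon.com">books</a> here</p>',
]
print(collect_anchor_text(pages))
```

The collected texts describe the target page in the words of other authors, which is why they can be indexed as if they were part of the target document.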

Link Analysis

Link analysis has many aspects in common with the field of bibliometrics, more specifically citation analysis.

Central assumption → a link is an endorsement. A hyperlink from page A to page B represents a vote for page B by the creator of page A.

Simply using the in-degree of a page as a measure of its importance would be easy to manipulate (e.g. link spam).

PageRank

Originated at Stanford and is used by Google.

The PageRank algorithm depends on the link structure of the web graph and assigns a score between 0 and 1 to each page.

The PageRank weight is a query-independent score.

“The PageRank Citation Ranking: Bringing Order to the Web”, Larry Page, Sergey Brin, Rajeev Motwani and Terry Winograd (1998)

PageRank Random Surfer

[Figure: example web graph with nodes labeled 0, 2, 1, 3, 2, 1 and 1]

1. Consider a random surfer visiting web pages and following the out-links in a random fashion at each point.

2. Eventually, the nodes with a higher in-degree will be visited more often.

3. The idea behind PageRank is that pages that have more visits are more important.
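The random surfer can be simulated directly. The sketch below is illustrative only: the five-page `graph` is hypothetical, every page is assumed to have at least one out-link, and the function `random_surfer` is an invented name. It counts how often each page is visited during a long random walk.

```python
import random
from collections import Counter


def random_surfer(out_links, start, steps, seed=0):
    """Follow random out-links for a number of steps; return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = Counter()
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(out_links[page])  # assumes every page has an out-link
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph: page -> pages it links to.
graph = {
    "A": ["D"],
    "B": ["A", "E"],
    "C": ["A"],
    "D": ["B"],
    "E": ["A", "C", "D"],
}

frequencies = random_surfer(graph, "A", 100_000)
for page, share in sorted(frequencies.items()):
    print(page, round(share, 3))
```

The visit shares depend not only on how many in-links a page has but also on how often the linking pages are themselves visited, which is the intuition PageRank formalizes.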

PageRank Calculation

$$\mathrm{PR}(A) = (1 - d) + d \times \sum_{p \in \mathrm{In}(A)} \frac{\mathrm{PR}(p)}{|\mathrm{Out}(p)|}$$

[Figure: example graph annotated with PageRank values (including 0.6, 0.35, 0.2 and 0.15), computed with d = 1]

Computation is performed iteratively until the scores change by less than a minimum threshold.

PageRank Example

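[Figure: example graph with nodes A, B, C, D and E]

$$
\begin{aligned}
\mathrm{PR}(A) &= \frac{\mathrm{PR}(B)}{2} + \frac{\mathrm{PR}(C)}{1} + \frac{\mathrm{PR}(E)}{3} \\
\mathrm{PR}(B) &= \frac{\mathrm{PR}(D)}{1} \\
\mathrm{PR}(C) &= \frac{\mathrm{PR}(E)}{3} \\
\mathrm{PR}(D) &= \frac{\mathrm{PR}(A)}{1} + \frac{\mathrm{PR}(E)}{3} \\
\mathrm{PR}(E) &= \frac{\mathrm{PR}(B)}{2}
\end{aligned}
$$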
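The iterative computation can be sketched for this example. The link structure in `graph` is inferred from the terms and denominators of the equations above (an assumption, since the figure itself is not reproduced here), and the function name `pagerank`, the starting value 1/N and the stopping rule are illustrative choices rather than the authors' implementation.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(A) = (1 - d) + d * sum(PR(p) / |Out(p)|) over the in-links of A."""
    pages = list(out_links)
    in_links = {page: [] for page in pages}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    rank = {page: 1.0 / len(pages) for page in pages}  # assumed starting values
    for _ in range(max_iter):
        new_rank = {
            page: (1 - d)
            + d * sum(rank[p] / len(out_links[p]) for p in in_links[page])
            for page in pages
        }
        change = max(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # stop once the scores are stable
            break
    return rank


# Links inferred from the example equations: page -> pages it links to.
graph = {
    "A": ["D"],
    "B": ["A", "E"],
    "C": ["A"],
    "D": ["B"],
    "E": ["A", "C", "D"],
}

ranks = pagerank(graph, d=1.0)  # d = 1 matches the equations, which have no (1 - d) term
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

With d = 1 and starting values of 1/N, the scores for this graph approach A = 5/21, B = D = 6/21, C = 1/21 and E = 3/21. With d < 1, the formula as written yields scores that sum to the number of pages when every page has out-links.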

HITS

The Hyperlink-Induced Topic Search (HITS) algorithm was proposed by Jon Kleinberg in 1999.

HITS is an algorithm that uses the link structure of the web to produce two query-dependent scores: an authority score and a hub score.

An authority is a page with many citations from hubs. A hub is a page that cites a large number of authorities.

Three major differences from PageRank: (1) it is computed at query time (!); (2) it produces two values for each page; (3) it is applied to subsets of the web.

HITS Calculation

1. Select a collection of documents related to a query.

2. Iteratively calculate authority and hub values for each document.

$$\mathrm{Authority}(A) = \sum_{p \in \mathrm{In}(A)} \mathrm{Hub}(p)$$

$$\mathrm{Hub}(A) = \sum_{p \in \mathrm{Out}(A)} \mathrm{Authority}(p)$$
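The two update rules can be iterated as sketched below. This is a minimal illustration: the normalization after each round is an added assumption (it keeps the values from growing without bound and is not shown in the formulas), and the function `hits`, the fixed iteration count and the four-page `subset` are hypothetical.

```python
import math


def hits(out_links, iterations=50):
    """Alternate the authority and hub updates on a query-specific subgraph."""
    pages = list(out_links)
    in_links = {page: [] for page in pages}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority(A) = sum of Hub(p) over pages p linking to A.
        authority = {page: sum(hub[p] for p in in_links[page]) for page in pages}
        # Hub(A) = sum of Authority(p) over pages p that A links to.
        hub = {page: sum(authority[p] for p in out_links[page]) for page in pages}
        # Assumption: rescale both vectors to unit length after every round.
        a_norm = math.sqrt(sum(v * v for v in authority.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        authority = {p: v / a_norm for p, v in authority.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return authority, hub


# Hypothetical query subset: two hub-like pages citing two authority-like pages.
subset = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
authority, hub = hits(subset)
print(authority)
print(hub)
```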

Scoring

With so many signals, how to obtain a single ranking score?

$$\mathrm{Score}(P) = \alpha \times \mathrm{Signal}_1(P) + \beta \times \mathrm{Signal}_2(P) + \gamma \times \mathrm{Signal}_3(P) + \dots$$

1. Manual tuning by experts based on real-data measurements.

2. Use machine-learning methods to automatically build ranking formulas: learning to rank / machine-learned relevance.
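The weighted sum above amounts to a dot product between signal values and weights. The sketch below is illustrative: the signal names, the values in `documents` and the hand-picked `weights` are all hypothetical; in a learning-to-rank setting the weights would be fitted from data rather than chosen by hand.

```python
def score(signals, weights):
    """Weighted sum of signal values: sum of weight * signal over all signals."""
    return sum(weights[name] * value for name, value in signals.items())


# Hypothetical weights and per-document signal values.
weights = {"tf_idf": 0.5, "pagerank": 2.0}
documents = {
    "d1": {"tf_idf": 2.0, "pagerank": 0.5},
    "d2": {"tf_idf": 1.0, "pagerank": 2.0},
}

ranking = sorted(documents, key=lambda doc: score(documents[doc], weights), reverse=True)
print(ranking)  # documents ordered from highest to lowest score
```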

Search Engines

Discovering Information

There are two broad categories of services for facilitating the discovery of information on the web.

Full-Text Search Engines
Generically known as web search engines, these services crawl the web, index its contents and rank the documents.

Web Directories
Topic-oriented collections, maintained by human editors.

Search Engine Architecture

[Figure: search engine architecture, with the components WEB, CRAWLER, INDEXER, Disk (×3), RANKING, SEARCH and USER]

Crawler

Includes the software that finds and fetches web pages. Multiple distributed crawlers operate simultaneously.

First-generation search engines ran a scheduled, periodic crawl of the web. In current search engines, crawlers operate continuously: for example, very popular and dynamic documents are crawled multiple times a day.

There is an infinite number of pages on the Web, so the crawler must decide which will be crawled and which won’t.

A crawler must be robust and polite. A crawler should be distributed, scalable, efficient, fresh, quality-targeted and extensible.

robots.txt

User-agent: *

Disallow: /ADS/

Disallow: /banners/

Disallow: /bartoon/

Disallow: /bdt/

Disallow: /bin/

Disallow: /calvin_and_hobbes/

Disallow: /cinecartaz/

Disallow: /desportohtml/

Disallow: /emprego/

Disallow: /especial/

Disallow: /img/

Disallow: /includeKimus/

Disallow: /lazer/

Disallow: /mail/

Disallow: /static/

Disallow: /xsl/

www.publico.pt/robots.txt

User-agent: *

Disallow: /search

Disallow: /groups

Disallow: /images

Disallow: /catalogs

Disallow: /catalogues

Disallow: /news

Allow: /news/directory

Disallow: /nwshp

Disallow: /setnewsprefs?

Disallow: /index.html?

Disallow: /?

Disallow: /addurl/image?

Disallow: /pagead/

Disallow: /relpage/

Disallow: /relcontent

Disallow: /imgres

Disallow: /imglanding

Disallow: /keyword/

Disallow: /u/

Disallow: /univ/

Disallow: /cobrand

...

www.google.com/robots.txt
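A polite crawler checks rules like the ones above before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; it parses a few of the rules listed for www.publico.pt from a string instead of downloading the file, and the user agent name `ExampleBot`, the helper `may_crawl` and the sample URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A few of the rules listed above for www.publico.pt, supplied as text.
rules = """\
User-agent: *
Disallow: /img/
Disallow: /mail/
Disallow: /static/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def may_crawl(url, user_agent="ExampleBot"):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(may_crawl("http://www.publico.pt/img/logo.png"))  # under a disallowed prefix
print(may_crawl("http://www.publico.pt/politica/"))     # not covered by any rule
```

In a live crawler the rules would normally be fetched from the site itself (for example with `set_url` and `read`) and refreshed periodically.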

Indexer

Indices are data structures designed for fast reading. The index is the biggest component of a search engine.

Web documents are parsed and separated into tokens. This is a very challenging task due to the diversity of the web: file formats, language ambiguity, word boundaries, etc.

a —› d1...

domingo —› d1,d17,d30

estranho —› d2

flores —› d1,d3,d5

porto —› d4,d18

...

Research challenges in: size optimization, parallelism, maintenance, lookup speed, etc.
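A term-to-documents mapping like the one illustrated above can be built as follows. This is a minimal in-memory sketch: the tokenizer is a simple regular expression, the functions `build_index` and `lookup` and the four short documents are hypothetical, and none of the research challenges just listed (compression, distribution, updates) is addressed.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


def lookup(index, query):
    """Return the ids of documents containing every term of the query."""
    postings = [set(index.get(term, ())) for term in re.findall(r"\w+", query.lower())]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical documents.
documents = {
    "d1": "flores a domingo",
    "d2": "estranho",
    "d3": "flores",
    "d4": "porto",
}

index = build_index(documents)
print(index["flores"])
print(lookup(index, "flores domingo"))
```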

Ranking and Presentation

[Figure: QUERY → MAGIC (in x milliseconds) → 10 DOCS]

For a given query, documents are ordered by combining hundreds of signals. Additionally, ads are selected ($) and snippets are produced for each document. All in a few milliseconds.

Business

“1% of the web search market is worth over $1 billion”

Search engines’ business model is based on advertising.

The first business models were based on small per-view charges. Ads were published indiscriminately, resulting in low conversion rates.

The use of targeted advertising (ads are related to searches) resulted in much higher conversion rates. Advertisers bid on query terms and pay per click.

Search engines operate complex systems that try to maximize revenue by selecting which ads to display.
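One simple way to think about ad selection under pay-per-click is to rank ads by expected revenue per impression. The sketch below is only an illustration under that assumption (expected revenue = bid × estimated click-through rate); real systems are far more complex, and the function `select_ads`, the field names and all numbers are hypothetical.

```python
def select_ads(candidates, slots):
    """Pick the ads with the highest expected revenue per impression."""
    ranked = sorted(
        candidates,
        key=lambda ad: ad["bid"] * ad["ctr"],  # assumed: bid * click probability
        reverse=True,
    )
    return ranked[:slots]


# Hypothetical ads bidding on the same query term.
candidates = [
    {"name": "ad1", "bid": 2.0, "ctr": 0.01},
    {"name": "ad2", "bid": 0.5, "ctr": 0.10},
    {"name": "ad3", "bid": 1.0, "ctr": 0.03},
]

shown = select_ads(candidates, slots=2)
print([ad["name"] for ad in shown])
```

Under this assumption the highest bid does not necessarily win: an ad that is clicked more often can earn more per impression even with a lower bid.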

Summary

The World Wide Web didn’t exist 20 years ago.

The Web is scientifically young and combines research from many different fields, not just technology.

There are many open problems, and many more yet to be discovered.

Some currently hot topics: learning to rank, wisdom of the crowds, social media, real-time, contextual, HCIR.

Thank You

http://www.fe.up.pt/~ssn

Some Ideas for SSIIM

- ANT: evaluation of entity-oriented search
  Queries in entity search: relation, attribute, entity, type, keyword

- State of the art report on DB ranking

- Web template extraction - Web meta-search - Web crawling

- Measuring diversity in search results

- Social Networks characterization

References

- An Introduction to Information Retrieval (2009). Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. http://www.informationretrieval.org

- Web Information Retrieval (2009). Nick Craswell and David Hawking.