Web Archive Information Retrieval Miguel Costa, Daniel Gomes (speaker) Portuguese Web Archive



Web Archive Information Retrieval

Miguel Costa, Daniel Gomes (speaker), Portuguese Web Archive

"Information Retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources."

Wikipedia, en.wikipedia.org/wiki/Information_retrieval

What are the information needs of web archive users?

Web Archive Information Retrieval

Collecting information needs (methods)

[03/02/2012 21:16:11] QUERY fcul

[03/02/2012 21:16:19] CLICK RANK=1

• Laboratory studies
• Online questionnaires
• Search log mining
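The search-log mining method can be illustrated with a small sketch that parses log lines like the QUERY/CLICK examples above; any log format details beyond those two example lines are assumptions:

```python
import re
from collections import Counter

# Minimal sketch of search-log mining, assuming the log format shown
# on the slide: "[dd/mm/yyyy hh:mm:ss] QUERY <terms>" and
# "[dd/mm/yyyy hh:mm:ss] CLICK RANK=<n>".
LINE = re.compile(r"\[(?P<ts>[^\]]+)\] (?P<event>QUERY|CLICK) ?(?P<rest>.*)")

def mine(log_lines):
    """Count query strings and the distribution of clicked result ranks."""
    queries, click_ranks = Counter(), Counter()
    for line in log_lines:
        m = LINE.match(line)
        if not m:
            continue
        if m.group("event") == "QUERY":
            queries[m.group("rest").strip()] += 1
        else:  # CLICK
            rank = int(m.group("rest").split("=")[1])
            click_ranks[rank] += 1
    return queries, click_ranks

log = [
    "[03/02/2012 21:16:11] QUERY fcul",
    "[03/02/2012 21:16:19] CLICK RANK=1",
]
q, c = mine(log)
print(q["fcul"], c[1])  # 1 1
```

Aggregating these counters over a full log yields the query and click-rank distributions that the user studies are based on.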

Classes of web archive information needs

• Navigational – 53% to 81%
  – Seeing a web page in the past or how it evolved
  – How was the home page of the Library of Congress in 1996?

• Informational – 14% to 38%
  – Collecting information about a topic written in the past
  – How many people worked at the Library of Congress in 1996?

• Transactional – 5% to 16%
  – Downloading an old file or recovering a site from the past
  – Make a copy of the site of the Library of Congress in 1996.

What are the information resources provided by web archives?

Web Archive Information Retrieval

Conducted 2 surveys in 2010 and 2014

Questionnaires and public information.

List of Web archiving initiatives, Wikipedia

Data: significant amount archived by web archiving initiatives (2014)

>68 initiatives across 33 countries
>534 billion web-archived files since 1996 (17 PB)

Access method: URL search

• Technology based on the Wayback Machine.
• URLs from the past that contain the desired information may be unknown to the users.
• The URLs for the same site change across time: <lcweb.loc.gov, 1996> vs. <www.loc.gov, 2015>.

Access method: Full-text search

• Technology based on Lucene extensions (NutchWAX & Solr).
• Lucene was not designed for Web Archive Information Retrieval.
• Low relevance of search results.

How to obtain relevant information resources through full-text search?

Our challenge!

Web Archive Information Retrieval

A full-text index is like a huge book glossary:

Term ……….. Archived Web Pages (millions)
"Library" ……….. <www.loc.gov, 2009>, <www.bl.uk, 2008>, <www.nlm.nih.gov, 2010>, <www.ala.org, 2005>
"Congress" ……….. <lcweb.loc.gov, 2009>, <www.greencarcongress.com, 2011>, <www.loc.gov, 2009>, <eawop-congress.iscte.pt, 2003>

Information Retrieval
• Search for archived pages that contain given words.
• Millions of web pages contain the searched words.

Web Archive Information Retrieval
• Archived pages are spread across time.
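The "book glossary" analogy amounts to an inverted index whose postings carry a (site, year) pair per archived page. A minimal sketch, with toy data mirroring the slide's examples:

```python
from collections import defaultdict

# Minimal sketch of an inverted index for a web archive: each term
# maps to the (site, year) pairs of archived pages containing it.
# The pages and texts below are illustrative toy data.
index = defaultdict(list)

def add_page(site, year, text):
    """Index one archived page under every distinct term it contains."""
    for term in set(text.lower().split()):
        index[term].append((site, year))

add_page("www.loc.gov", 2009, "Library of Congress")
add_page("www.bl.uk", 2008, "British Library")
add_page("lcweb.loc.gov", 2009, "Congress")

print(index["library"])
# [('www.loc.gov', 2009), ('www.bl.uk', 2008)]
```

At web-archive scale the same structure holds billions of postings, and the same page appears once per archived version, which is exactly why results are spread across time.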

Which archived pages should be presented on the top results?

How to rank the relevance of temporal search results?

Previous studies showed that temporal information:
• has been exploited to improve IR systems.
• can be extracted from web archives.

Hypothesis: WAIR systems can be improved by exploiting temporal information intrinsic to web archives through Machine Learning (Learning to Rank).

Ranking relevance in WAIR

Methodology: combine ranking features into a ranking model using learning-to-rank (L2R) methods

Ranking features:
• f(q,d): frequency of query term q in the text of document d
• i(d): number of inlinks to document d
• …

Test collection:
• Documents
• Topics
• Relevance judgments

L2R method:
• L2R algorithms

Best ranking model:
• r(f(q,d), i(d)) = 0.6 × f(q,d) + 0.4 × i(d)

Including temporal features.

Automatically assigned weights for each ranking feature.
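The example model above, r(f(q,d), i(d)) = 0.6 × f(q,d) + 0.4 × i(d), can be sketched in a few lines; the weights are the ones shown on the slide, while the document structure and toy data are illustrative:

```python
# Minimal sketch of a learned linear ranking model. In L2R the
# weights (here 0.6 and 0.4, from the slide) are assigned
# automatically by the learning algorithm, not hand-tuned.
def f(query, doc):
    """f(q,d): frequency of the query term in the document text."""
    return doc["text"].lower().split().count(query.lower())

def r(query, doc, w_f=0.6, w_i=0.4):
    """r(f(q,d), i(d)) = w_f * f(q,d) + w_i * i(d)."""
    return w_f * f(query, doc) + w_i * doc["inlinks"]

doc = {"text": "Library of Congress library", "inlinks": 5}
print(r("library", doc))  # 0.6*2 + 0.4*5 = 3.2
```

Documents are then sorted by this score; a real system combines dozens of such features, as shown later with the 68-feature setup.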

Created a test collection following the Cranfield paradigm

• Documents
  – 255 million documents (8.9 TB)
  – 6 archived-web collections: broad crawls, selective crawls, integrated collections
  – 14-year time span (1996–2010)

• Topics
  – 50 navigational needs (with date range)
  – e.g. the page of the Publico newspaper before 2000

• Relevance judgments
  – 3 judges, 3-level relevance scale, 267,822 versions assessed
  – 5-fold cross-validation: 3 folds for training, 1 for validation, 1 for testing

Relevance judgments (qrels: TREC format)

Test collection sample

Topics (navigational needs)
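The TREC qrels format used for the relevance judgments can be read with a short sketch; the sample judgment lines and document identifiers below are hypothetical:

```python
# Minimal sketch of parsing relevance judgments in TREC qrels
# format: "topic_id iteration doc_id relevance" per line.
# The sample lines and document identifiers are made up.
def read_qrels(lines):
    """Return {topic_id: {doc_id: relevance_level}}."""
    qrels = {}
    for line in lines:
        topic, _iteration, doc, rel = line.split()
        qrels.setdefault(topic, {})[doc] = int(rel)
    return qrels

sample = [
    "1 0 www.loc.gov/19961015 2",
    "1 0 www.loc.gov/20090101 0",
]
qrels = read_qrels(sample)
print(qrels["1"]["www.loc.gov/19961015"])  # 2
```

These per-topic judgment maps are what the relevance metrics are computed against.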

L2R method

• 3 L2R algorithms
  – AdaRank, Random Forests, RankSVM

• 6 relevance metrics
  – NDCG@k and Precision@k, for k = 1, 5, 10

• 68 ranking features
  – IR ranking features (e.g. Lucene, BM25, TF-IDF)
  – New temporal features
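The two metric families used in the evaluation, Precision@k and NDCG@k, can be sketched directly from their standard definitions (graded relevance as input, binary relevance meaning gain > 0):

```python
import math

# Minimal sketch of the evaluation metrics: Precision@k over binary
# relevance and NDCG@k over graded relevance. The input is the list
# of relevance gains of the ranked results, in rank order.
def precision_at_k(gains, k):
    """Fraction of the top-k results that are relevant (gain > 0)."""
    return sum(1 for g in gains[:k] if g > 0) / k

def ndcg_at_k(gains, k):
    """DCG@k normalized by the ideal (best possible) DCG@k."""
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

ranked_gains = [2, 0, 1]  # graded relevance of results at ranks 1..3
print(precision_at_k(ranked_gains, 3))  # 2/3 ≈ 0.667
```

NDCG rewards placing the highest-gain documents first, which is why it is paired with Precision@k in the result tables.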

Temporal ranking features

Intuition:

Persistent documents are more relevant for navigational queries.

Lifespan is correlated with relevance: documents with higher relevance tend to have a longer lifespan.

14 years of web snapshots analyzed

[Chart: fraction of documents with a lifespan longer than 1 year (0–100%), by relevance level]

Number of versions is correlated with relevance: documents with higher relevance tend to have more versions.

[Chart: fraction of documents with more than 10 versions (0–100%), by relevance level]

14 years of web snapshots analyzed
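The two temporal features discussed above, lifespan and number of versions, fall out directly from a document's capture dates. A minimal sketch (the capture dates and feature names are illustrative):

```python
from datetime import date

# Minimal sketch of the two temporal ranking features: lifespan
# (time between first and last capture) and number of archived
# versions. The capture dates below are illustrative.
def temporal_features(capture_dates):
    """Compute temporal features from a document's capture dates."""
    dates = sorted(capture_dates)
    lifespan_days = (dates[-1] - dates[0]).days
    return {"lifespan_days": lifespan_days, "num_versions": len(dates)}

captures = [date(1996, 10, 15), date(2003, 6, 1), date(2010, 1, 9)]
feats = temporal_features(captures)
print(feats["num_versions"], feats["lifespan_days"] // 365)  # 3 13
```

Both values are available at indexing time from the archive's capture metadata, so they are cheap features to add to an L2R model.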

L2R with temporal features improves search relevance

Relevance metric | Lucene | NutchWAX | L2R without temporal features (Random Forests) | L2R with temporal features (Random Forests) | Precision gain of L2R with temporal features over Lucene
Precision@1  | 0.28 | 0.32 | 0.64 | 0.76 | 270%
Precision@5  | 0.16 | 0.24 | 0.39 | 0.40 | 250%
Precision@10 | 0.13 | 0.17 | 0.24 | 0.24 | 184%

Most important ranking features (initial selection)

We do not use them in Arquivo.pt yet. We use:
• 4 ranking features derived from the TREC collection.
• 4 fields (URL, title, text body and anchor text of incoming links).

Temporal ranking model

Intuition:

Several ranking models learned for specific periods of time are more effective than a single ranking model.

Why several ranking models for WAIR?

• The web presents different characteristics across time
  – Technology
  – Structure and size of pages
  – Structure and size of sites
  – Language
  – Web-graph structure
  – Web 1996 vs. Web 2015?

• Should we use a single model that fits all data?
  – No: [Kang & Kim 2003; Geng et al. 2008; Bian et al. 2010]

Several ranking models yield more relevant results than a single ranking model with temporal features

• Best results with 4 ranking models for 14 years

[Chart: Relevance (NDCG@10, 0.51–0.61) vs. number of ranking models (1, 2, 4, 7, 14) for the 14-year timespan of archived collections; annotation: +3.3%]

Conclusions

1. Web archive users
   – First navigational needs, then informational needs.
   – Search as in web search engines.
   – Prefer full-text search and older documents.

2. WAIR systems
   – WAIR must be optimized for the specific needs of web archive users.
   – Some needs are not supported by current Information Retrieval technology.

3. Improving WAIR systems
   – Relevant documents tend to have a longer lifespan and more versions.
   – WAIR systems can be improved by exploiting ranking features intrinsic to web archives using Machine Learning (L2R).
   – Ranking models per time period outperform a single ranking model.

Future (hard) work

• Feature selection
  – We do not have our best ranking function in production.
  – Must choose features that are feasible to implement given the required resources (CPU, memory, processing time, impact on response speed).

• Create a test collection for informational needs
  – Who was the president of the USA in 2007?

• Image search
  – A "must have" from users

• We need your collaboration!

All our resources are free open source.

• All code available under the LGPL license:
  – https://code.google.com/p/pwa-technologies/
• Test collection to support evaluation:
  – https://code.google.com/p/pwa-technologies/wiki/TestCollection
• L2R dataset for WAIR research:
  – http://code.google.com/p/pwa-technologies/wiki/L2R4WAIR
• Search service since 2010:
  – http://archive.pt
• OpenSearch API:
  – http://code.google.com/p/pwa-technologies/wiki/OpenSearch
• The Google Code project pwa-technologies will migrate to GitHub (FullPast project).

Publications• A Survey on Web Archiving Initiatives, Gomes et al., 2011.

• Understanding the Information Needs of Web Archive Users, Costa & Silva, 2010.

• Characterizing Search Behavior in Web Archives, Costa & Silva, 2011.

• A Search Log Analysis of a Portuguese Web Search Engine, Costa & Silva, 2010.

• Evaluating Web Archive Search Systems, Costa & Silva, 2012.

• Towards Information Retrieval Evaluation over Web Archives, Costa & Silva, 2009.

• Learning Temporal-Dependent Ranking Models, Costa et al., 2014.

• Creating a Billion-Scale Searchable Web Archive, Gomes et al., 2013.

• Information Search in Web Archives, Miguel Costa, PhD Thesis, 2014.

Thank you.