Engineering Web Search Applications

Engineering Web Search Applications. Alessandro Bozzon, Marco Brambilla. Vienna, July 5, 2010

Alessandro Bozzon

Post-doc @Politecnico di Milano

http://home.dei.polimi.it/bozzon

Marco Brambilla

Assistant Professor @Politecnico di Milano

http://home.dei.polimi.it/mbrambil

Research background and interests

Web engineering and model-driven development

WebML and WebRatio

Complex enterprise application design

BPM, SOA and integration with Web application devel.

Search engine and complex search application development

Search Computing: multidomain search

Pharos: multimedia search framework

Information Retrieval is a >40y old discipline tackled from a myriad of viewpoints

This tutorial is:

Breadth-oriented

Development process driven

using real-world case studies as examples

The tutorial is necessarily shallow

But we provide references and links

Introduction

What are Web search applications?

Requirements

What are their requirements?

Design

How to design them?

Implementation

How to implement them?

Validation

How to measure their success?

Search is an integral part of people's online life

Web search has become a standard (and often preferred) way of finding information

... 92% of Internet users say the Internet is a good place to go for getting everyday information ... - 2004 Pew Internet Survey

Web search engines are now the second most frequently used online computer application, after email

Search is fully integrated into operating systems and is viewed as an essential part of most information systems

Estimated size: ~60 billion pages (22/06/2010)

http://www.worldwidewebsize.com/

> 9.3 billion queries just in the U.S. in May 2010

http://blog.nielsen.com/nielsenwire/online_mobile/top-u-s-search-sites-for-may-2010/

and growing

Twitter

# of new tweets per day: 55 million

# of search queries per day: 600 million

Facebook

400 Million Global Users (and growing)

The average Facebook User Spends 55 Minutes Per Day

IDC Digital Universe report estimates:

digital data grew by 62% between 2008 and 2009

~ 800,000 petabytes (PB)

>1.2 million PB in 2010

reach 35 ZB (zettabytes) by 2020.

Information retrieval (IR) deals with the representation, storage, organization of, and access to information items.

Old discipline

As an academic field of study:

Information retrieval (IR) is devoted to finding relevant documents, not to finding simple matches to patterns.

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

[Manning et al., 2007]

Search (ad hoc retrieval)

Static document collection

Dynamic queries

Filtering

Queries are static

Document collection constantly changing

Example: corporate mails routed by predefined queries to different parts of the organization

retrieving all objects which might be useful or relevant to the user information need

Usually unstructured queries (no formal semantics)

The IR system interprets the contents of the information items

Examples: keyword-based queries, context queries, proximity, phrases, natural language queries

Also structural queries and, in recent systems, structured query languages are supported (but with different semantics)

Errors in the results are tolerated

Core concept: relevance

Relevance Ranking (according to the user need)

It is not clear what degree of relevance the user is happy with

The user starts from the top of the ranked list and explores down until satisfied

Data Retrieval (RDBMS, XML DB)

retrieving all objects which satisfy clearly defined conditions expressed through a query language.

Data has a well-defined structure and semantics

Formal query languages

Regular expressions, relational algebra expressions, etc.

Results are EXACT matches: errors are not tolerated

No ranking w.r.t. the user information need

Binary retrieval: does not allow the user to control the magnitude of the output

For a given query, the system may return:

Under-dimensioned output

Over-dimensioned output

Search Engine

data management system which uses information retrieval algorithms to retrieve information items from one or more sources upon the submission of a query

Web Search Application

data management system where search engines are a piece of a more complex puzzle, that includes:

data source integration (e.g. databases, legacy systems, the Web)

content analysis technologies orchestration

user interfaces

Web-mediated social interactions, etc.

It is not a simple problem:

Blurred goals

Sensory Gap

Gap between the object in the world and the information in a (computational) description

Semantic Gap

Lack of coincidence between the (computational) description of the information and its interpretation

Precision: fraction of retrieved docs that are relevant

P(relevant|retrieved)

degree of soundness of the system

not considering the total number of documents

Recall: fraction of relevant docs that are retrieved

P(retrieved|relevant)

degree of completeness of the system

Public Web search engines are the ones known to the general public

But there is also a huge need (and market share!) for professional search over enterprise repositories

Enterprise search is covered by

Packaged suites

Microsoft FAST

Autonomy IDOL

IBM OmniFind

Exalead

Frameworks

Apache UIMA (ex IBM)

Textual Search

YaGoBi

Multi-media Search

The PHAROS Project

Multi-domain Search

The Search Computing project

Example of Web Search Application

Chansonnier

THE Web Search

92% of market share in the U.S.

Searching on

Web pages, Blogs, News, Books, Scientific Publications, Emails

Images and Videos (but only through textual descriptions)

Tweets

FP6 IP, 3 years, 12 partners, ~15 M budget

Mission : develop an SOA-compliant, open and distributed technology platform for the development of information access solutions for audiovisual content

www.pharos-audiovisual-search.eu

European Research Council (ERC), 2008 Call for "IDEAS Advanced Grants", 5 years (started in 2009)

Mission : provide the abstractions, foundations, methods, and tools required to answer multi-domain queries by interacting with a constellation of cooperating search services, using ranking and joining of results as the dominant factors for service composition

www.search-computing.org

BSc thesis project

Mission : graduate

Open source video analysis application based on open frameworks (SMILA / SOLR)

Crawling of Web video

Download of song lyrics

Analysis on lyrics text

Language, emotion

Keyframe extraction for video snippets

http://github.com/giorgiosironi/Chansonnier

Data Source

User Behavior

Query Format

User Interface

Security

Data Analysis

Performance

Data Format

Social Interactions

Search Engine

Databases

File systems

Intranet / Extranets

Legacy systems

Sensors (in a wide sense) and streams

Unstructured data

Textual Documents

Blog Posts

(Semi) Structured data

Software Code

Models

XML Files

Pictures

Textual Analysis

Deals with basic language units (morphemes, roots, stems, words, phrases, sentences, etc.)

Media Analysis

Deals with media contents

Transcoding

Classification

Feature Extraction

An activity performed for the purpose of providing a representation of a content item suited for the application

Textual

Textual contents represented as collection of unstructured text terms

Fielded

Textual contents structured in fields (e.g., metadata)

Semi-structured

Textual contents organized in a complex (possibly heterogeneous) structure (e.g., XML, HTML)

Content-based

Media contents described by low-level features

Geographic and other special dimensions

Content featuring geo-spatial features

Streaming content searched by temporal features (e.g., recency)

Representation of the user information need

Natural Language

For instance through vocal interfaces

Keyword

Set of text items, plus Boolean (AND/OR/NOT), proximity (lexical nearness) and/or wildcard conditions

Fielded Keyword

Text items defined on one or more fields

Queries to semi-structured search engines and faceted queries

Content-based

Query by example (text, image, video, audio, etc.)

Geographic and other special dimensions

Geographic coordinates plus spatial operator terms (near, north of, within X kilometers from, etc.)

Timestamps plus temporal operator terms (recent, near, interval, etc.)
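
The keyword query format listed above (text items plus Boolean AND/OR/NOT conditions) can be illustrated with a toy inverted index. This is only a minimal sketch: the sample documents and the helper names `build_index` and `boolean_query` are hypothetical and do not come from any of the systems discussed in this tutorial.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_query(index, all_ids, must=(), should=(), must_not=()):
    """AND over `must`, OR over `should`, NOT over `must_not`."""
    result = set(all_ids)
    for term in must:
        result &= index.get(term, set())
    if should:
        any_match = set()
        for term in should:
            any_match |= index.get(term, set())
        result &= any_match
    for term in must_not:
        result -= index.get(term, set())
    return result

# Hypothetical three-document collection.
docs = {
    1: "jazz concert in Vienna",
    2: "rock concert in Milan",
    3: "jazz festival in Milan",
}
index = build_index(docs)
hits = boolean_query(index, docs.keys(), must=["concert"], must_not=["rock"])
print(sorted(hits))
```

Proximity and wildcard conditions would need positional information and a term dictionary lookup, which this sketch leaves out.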

Data Sources

Web : crawling of Web resources

Users : comments, preferences, relationships

Data Types

Unstructured data : Web pages

Documents : PDF, PPT, DOC, etc.

Data Analysis

Textual : for content, document, and user generated comments

Media : some basic image analysis for color, faces, size

Search Engine

Fielded: filetype, page title, site, page content

Content-based: image similarity in Google

Query Format:

Fielded keyword

Geographic

Data Sources

Web : crawling of audio/video files

File System : NAS and content provider media archives

Users : comments, preferences, relationships

Data Types

Structured data : content provider description metadata

Media : hi-quality video and audio files

Semi-structured data : MPEG-7 description of processed media files and user annotations

Data Analysis

Textual : for content metadata and user generated comments

Media : for audio and video

Audio/Video Mood classification, Image concept classification, Music Genre, Danceability classification, face recognition and identification, speech to text

Search Engine

Semi-structured : XML search engine for MPEG-7 content description

Plus geographic annotations and geo-based ranking

3 content-based engines :

one CB for music,

one for images (shots of the video)

one for face similarity

Query Format

Fielded-keyword : XQuery for XML search engine

Query by example : for image, music and faces

MPQF: high level query language

AND/OR/AND THEN for fielded keyword and by-example queries

Search is evolving

Content Vs. Intent

People don't want to search

People want to get tasks done and get answers

Moving towards identifying a user's task

Enabling means for task completion

Search as a Process

Search applications must

Support the user in the search process

(try to) Infer the user intent to help them accomplish their task

Information foraging applies the ideas from optimal foraging theory to understand how human users search for information.

Assumption: humans use "built-in" foraging mechanisms that evolved to help our animal ancestors find food.

Some References

Fu, Wai-Tat; Pirolli, Peter (2007), "SNIF-ACT: a cognitive model of user navigation on the world wide web", Human-Computer Interaction: 335-412

Jason Withrow, "Do your links stink?," American Society for Information Science Bulletin, June 1, 2002

Pirolli, Peter (2009), "An elementary social information foraging model", Proceedings of the 27th international conference on Human factors in computing systems: 605-614

Patches of information = websites

Problem: should I continue foraging in the current patch or look for another patch?

Expected gain from continuing in current patch vs. moving to another

Wandering: the user does not have an information-seeking goal in mind.

Exploring: the user has a general goal but not a plan for how to achieve it.

Seeking: the user has started to identify information needs that must be satisfied, but the needs are open-ended.

Asking: the user has a very specific information need that corresponds to a closed-class question

Information needs change during interactions

M.J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407-431, 1989.

Orienteering [Teevan et al., CHI 2004]: the searcher issues a quick, imprecise query to get to approximately the right region of the information space, and then follows known paths in small steps that move them closer to their goal. Easy! (perfect query not needed)

Teleporting: expert searchers issue longer queries to jump directly to the target. Requires more effort and experience.

Exploratory Search: the user's intent is primarily to learn more about a topic of interest, by exploring various directions and sources

Exploratory search blends querying and browsing strategies and is different from retrieval, which is best served by analytical strategies

Marchionini, G. Exploratory search: from finding to understanding. Communications of the ACM 49(4): 41-46 (2006)

Some references

Definition and analysis of the problem

White, R. W., and Drucker, S. M. Investigating behavioral variability in web search. 16th WWW Conf. (Banff, Canada, 2007)

Complex Search and Exploratory Search

Aula, A., and Russell, D.M. Complex and Exploratory Web Search. ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)

search for upcoming concerts close to an attractive location (like a beach, lake, mountain, natural park, and so on), considering also availability of good, close-by hotels

Current approach the user can adopt:

Independently explore search services

Manually combine findings

expand the search to get information about available restaurants near the candidate concert locations, news associated with the event, and possible options to combine further events scheduled on the same days and located close to the first one

Topic based search : instance of exploratory search centered on the goal of collecting information on a subject matter of interest from multiple sources

Kosmix : topic discovery engine, keyword search, a topic page summarizes the most relevant information on the subject

Hakia : resume pages for topics associated with users' queries, natural language processing techniques

Structured Object Search : process queries and present results that address entities or real world objects described in Web pages

Google Squared: keyword search, results collected in a table (called a square) featuring all the attributes relevant to the result items as column headers

Google Fusion Tables: upload data tables (e.g., spreadsheet files) and join (or fuse) the data in some column with other tables

There is a limit after which the found options need to be marked down.

A new paradigm allowing users to formulate and get responses to multi-domain queries through an exploratory information seeking approach, based upon structured information sources exposed as software services

Composite answers obtained by aggregating search results from various domains

Highlight the contribution of each search service

Join of results based on the structural information afforded by the search service interfaces

Refine the user query

Re-shape the result list

Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web. WWW 2010, Raleigh, USA

Template-based approach

It consists of subsetting and parametrizing the resource graph...

And then characterizing the user interaction

Parametrization of global ranking

Data visualization options

.. and so on

If the current set of combinations is not satisfactory, the user may ask for more values for a service (more one) or for all services (more all)

More concerts, more hotels, or more combinations

Add new information about further domains for selected combinations (expand)

Find close-by restaurants or co-located events

Aggregate information to ease analysis and readability (clustering, grouping)

Group events by venue

Reduce the number of shown items through filtering

Total walked distance for the night

Re-order (ranking or sorting)

Calculate derived values from existing ones

Total walked distance for the night

Alternative data visualization

Map, parallel coordinates,

DEMO :

http://demo.search-computing.org

Understand the user information need

User intent taxonomy (Broder 2002)

Informational: want to learn about something (~40% / 65%)

Navigational: want to go to a given page (~25% / 15%)

Transactional: want to do something (web-mediated) (~35% / 20%)

Grey Areas

Find a good hub

Exploratory search

Context Vs. Personalization

Trigger the right search depending on the context

Location

User Engagement

Not interested in your personal profile

Your favorite restaurant?

It depends on where you are!

Relevance of the results with respect to the request is the main expectation for search engine users

Top-k relevant items : quickly retrieve a number (k) of highest-ranking tuples in the presence of monotone ranking functions defined on the attributes of the underlying relations

Some References

R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1):83-99, 1999.

F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization. In SIGMOD Conference, pages 203-214, 2004

D. Martinenghi and M. Tagliasacchi: Proximity Rank Join, to appear in PVLDB
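
To make the top-k idea concrete, here is a naive baseline, assuming a weighted sum with non-negative weights as the monotone ranking function. It scores every tuple and keeps the k best with a heap. It is not one of the algorithms from the references above, which are designed to avoid scanning all tuples. The `hotels` data and the attribute names are hypothetical.

```python
import heapq

def top_k(tuples, weights, k):
    """Return the k tuples with the highest weighted-sum score.

    A weighted sum with non-negative weights is monotone: raising any
    attribute value never lowers the overall score.
    """
    def score(item):
        return sum(w * item[attr] for attr, w in weights.items())
    return heapq.nlargest(k, tuples, key=score)

# Hypothetical relation with two ranking attributes in [0, 1].
hotels = [
    {"name": "A", "rating": 0.9, "closeness": 0.2},
    {"name": "B", "rating": 0.6, "closeness": 0.8},
    {"name": "C", "rating": 0.4, "closeness": 0.5},
]
best = top_k(hotels, {"rating": 0.5, "closeness": 0.5}, k=2)
print([h["name"] for h in best])
```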

Relevance is not the only success factor for a result set

User satisfaction is increased if the first items cover a good spectrum of options

If user intent is ambiguous, diversification tries to cover the most likely intents

If several top-k items are very similar, they can be clustered together

Thus: an optimization problem

Objective: find the set of k elements that contains the most relevant and diverse items

Maximal Marginal Relevance [Carbonell and Goldstein 1998]
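
A minimal sketch of greedy Maximal Marginal Relevance selection follows. At each step it picks the candidate that maximizes lam * relevance minus (1 - lam) * (maximum similarity to the items already selected). The relevance scores, the `topics` labels, and the `same_topic` similarity function are made-up assumptions for illustration; a real system would use its own relevance and similarity measures.

```python
def mmr(relevance, similarity, k, lam=0.5):
    """Greedy Maximal Marginal Relevance selection.

    relevance:  dict item -> relevance to the query (higher is better)
    similarity: function (item, item) -> similarity in [0, 1]
    lam:        trade-off; 1.0 is pure relevance, 0.0 is pure diversity
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def marginal(item):
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=marginal)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical results for the ambiguous query "jaguar".
topics = {
    "jaguar car review": "car",
    "jaguar car price": "car",
    "jaguar animal facts": "animal",
}
relevance = {
    "jaguar car review": 0.9,
    "jaguar car price": 0.85,
    "jaguar animal facts": 0.6,
}

def same_topic(a, b):
    return 1.0 if topics[a] == topics[b] else 0.0

print(mmr(relevance, same_topic, k=2))
```

With `lam=0.5` the second pick is the less relevant but different "animal" result, while `lam=1.0` falls back to plain relevance ranking.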

More Complete information on one search

Optimization of the result set layout (and of page space)

Users don't want to waste their time waiting for a search result

User satisfaction

Performance is the leading factor for the evaluation of Web search applications

Queries per second (QPS)

Time to Index

Scalability

Content

Queries

Distribution

Service-oriented computing

Content Delivery Networks

But intellectual property may be a concern

More in section (ARCHITECTURE)

Social Interaction

Content evaluation

User relationships and actions as additional content description

Security & Privacy

Access policies

Collection Vs. Item level

Anonymity

Who I am = What I like + What I do + Where I am ?

A search process tells a lot about who is doing it

Alessandro Bozzon, Tereza Iofciu, Wolfgang Nejdl, Antonio V. Taddeo, Sascha Tönnies. Role Based Access Control for the interaction with Search Engines. (COOPER) 2007, Crete, Greece.

Reference architecture

Reference execution processes

Set of design dimensions

Development methodology

Tools supporting the methodology

Hewlett-Packard -> Hewlett and Packard as two tokens?

San Francisco: one token or two? How do you decide it is one token?

Language issues (normalization)

Accents: résumé vs. resume.

L'ensemble -> one token or two?

L ? L' ? Le ?

How are your users likely to write their queries for these words? Use locale?

Punctuation (e.g., U.S.A. vs. USA)

Numbers (100.45 vs. 100,45 vs. 1.0045 E+2)

Dates (e.g. March 1st 2009 vs. 03/01/09 vs. 1/03/2009)

Case folding.

It depends on the addressed language

E.g., in Chinese spaces do not separate words

(tokenization based on vocabulary)
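
The tokenization and normalization choices above can be sketched in a few lines. This is one possible set of decisions (case folding, accent stripping, dropping the periods in acronyms, optional splitting on hyphens), not a recommendation. The function names and the `split_hyphens` flag are hypothetical, and the approach assumes a space-delimited language, so it does not cover the Chinese case.

```python
import re
import unicodedata

def strip_accents(text):
    """Remove diacritics, e.g. 'résumé' -> 'resume'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text, split_hyphens=True):
    """Case-fold, strip accents, collapse dotted acronyms, then extract tokens."""
    text = strip_accents(text).casefold()
    # Turn dotted acronyms such as 'u.s.a.' into 'usa'.
    text = re.sub(r"\b(?:[a-z]\.){2,}", lambda m: m.group(0).replace(".", ""), text)
    # Either split on hyphens or keep hyphenated words as one token.
    pattern = r"[a-z0-9]+" if split_hyphens else r"[a-z0-9]+(?:-[a-z0-9]+)*"
    return re.findall(pattern, text)

print(tokenize("Hewlett-Packard résumé U.S.A."))
print(tokenize("Hewlett-Packard", split_hyphens=False))
```

Whatever choices are made, the same function should be applied to both documents and queries, otherwise index terms and query terms will not match.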

Removal of high-frequency words, which carry less information

Strategies

Statistical analysis on the indexed collection

Functional terms (articles, conjunctions, auxiliary verbs)

A-priori knowledge, based on the IR system domain

Creation of a stop-list with all the terms to remove

English stop list is about 200-300 terms (e.g., been, a, about, otherwise, the, etc.)

http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

< 30% - 50% of tokens (smaller dictionary)

It can decrease recall (e.g. "to be or not to be", "let it be")

Most Web search engines do not remove stop words [Manning IR]
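
As a small illustration of two of the strategies above, the sketch below filters tokens against a fixed stop list and also derives a stop list statistically from the most frequent terms of a collection. The tiny `STOP_WORDS` set and the sample documents are assumptions made for the example; real lists are much longer.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"a", "about", "be", "been", "it", "not", "or", "the", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]

def build_stop_list(documents, size):
    """Statistical strategy: take the `size` most frequent terms of the collection."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return {term for term, _ in counts.most_common(size)}

print(remove_stop_words("the history of the web".split()))
print(remove_stop_words("to be or not to be".split()))
print(build_stop_list(["the cat and the dog", "the bird and the fish", "the end"], 2))
```

The second call returns an empty list, which is the recall problem mentioned above: a query made only of stop words can no longer match anything.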

Phrases capture the meaning behind the bag of words and result in multi-term phrases

Uses of phrases:

Added to the query: a query New York should be modified to search for the phrase "New York" (> 10% in precision and recall)

Replace terms in index: empirically considered not as good as query rewriting

Simple Phrases

Many systems identify phrases as any pairs of terms not separated by:

stop term

punctuation mark

special character

Phrases occurring fewer than 25 times are removed (decrease in memory requirements)

Part Of Speech and Word Sense tagging

statistical or rule-based methods to identify the part of speech (noun, verb, adjective) of each token

Syntactic parsing

Identify the key syntactic components of a sentence usually by tagging according to POS and then applying a grammar (FSA and NFSA)

Thesauri

A thesaurus is a classification scheme composed of words and phrases whose organization aims at facilitating the expression of ideas in written text

E.g.: synonyms and homonyms

Example entry from Roget's thesaurus: cowardly (adjective)

Ignobly lacking in courage: cowardly turncoats.

Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered

A thesaurus can be

Thematic: specific to the IR system's domain of application (most frequent case)

E.g.: Thesaurus of Engineering and Scientific Terms

Generic

A thesaurus can be used to

Help users formulate queries

Modification of queries by the system

Select index terms

Many kinds of thesauri have been developed for IR systems

Hierarchical: synonyms (RT: related terms, UF: use for), generalization (BT: broader term), specialization (NT: narrower term)

ISO and ANSI standards, almost always thematic

Manually built and updated by domain experts

Clustered: cluster (or synset) of words

Non-typed semantic relationships among clusters

Each cluster is a set of words having a strong semantic relationship (usually UF)

WORDNET

Clustered Thesauri can be automatically generated if no distinction is made among semantic relationships

Associative: graph of words, where nodes represent words and edges represent semantic similarity among words

Edges can be oriented or not, according to the symmetry of the similarity relationship

Edges can be weighted (fuzzy pseudo-thesauri)

Can be automatically generated from a collection of documents using co-occurrence relationships
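
One simple way to generate such an associative thesaurus automatically is to count how often two words occur in the same document and keep the pairs above a threshold as weighted, undirected edges. The sketch below assumes raw document-level co-occurrence counts as the similarity measure; the `min_count` threshold and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(documents, min_count=2):
    """Weighted undirected graph of words.

    Each edge is a sorted pair of words; its weight is the number of
    documents in which both words occur.
    """
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

docs = ["search engine index", "search engine query", "query index"]
print(cooccurrence_graph(docs))
```

Normalizing the counts by the individual word frequencies would give an asymmetric (oriented) similarity, which corresponds to the oriented-edge variant mentioned above.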

Reduce terms to their roots before indexing

Reduce inflectional/variant forms to base form

language dependent

am, are, is -> be

car, cars, car's, cars' -> car

the boy's cars are different colors -> the boy car be different color

Stemming : heuristic process that chops off the ends of words in the hope of achieving the goal correctly most of the time

Stemming collapses derivationally related words

Lemmatization : NLP tool. It uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word

Lemmatization collapses the different inflectional forms of a lemma

Not widely used because it harms performance

Many different algorithms :

Porter's algorithm

Commonest algorithm for stemming English

Porter, Martin F. 1980. An algorithm for suffix stripping. Program 14:130-137.

http://www.tartarus.org/martin/PorterStemmer/

One-pass Lovins stemmer

Lovins, Julie Beth. 1968. Development of a stemming algorithm. Translation and

Lancaster

http://www.comp.lancs.ac.uk/computing/research/stemming/

Paice, Chris D. 1990. Another stemmer. SIGIR Forum 24:56-61

http://snowball.tartarus.org/demo.php

Stemming increases recall while harming precision
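
The effect of stemming on recall and precision can be seen with a toy suffix stripper. This is deliberately not Porter's, Lovins', or the Lancaster algorithm: the suffix list, the `min_stem` length rule, and the function name `toy_stem` are assumptions chosen only to keep the example short.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Recall goes up: all these variants now match the same index term.
print([toy_stem(w) for w in ["cars", "searching", "searched", "searches"]])
# Precision goes down: 'news' is conflated with the unrelated word 'new'.
print(toy_stem("news"), toy_stem("new"))
```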

Lucene and Solr contain many text analyzers for several languages

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

CharFilters, Tokenizers, TokenFilters

Apache Tika

http://tika.apache.org/

toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries

GATE (General Architecture for Text Engineering)

http://gate.ac.uk/

ANNIE (A Nearly-New Information Extraction System)

tokenizer, gazetteer, sentence splitter, part of speech tagger,

named entities transducer, coreference tagger

Support for English, Spanish, Chinese, Arabic, French, German,

Hindi, Italian, Cebuano, Romanian, Russian

MALLET (Machine Learning for Language Toolkit)

http://mallet.cs.umass.edu/index.php

Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

OpenNLP

http://opennlp.sourceforge.net/projects.html

open source projects related to natural language processing

Cognitive Computation Group, University of Illinois

http://l2r.cs.uiuc.edu/~cogcomp/software.php

Chunker, Part of Speech tagger, String similarity, Semantic Role Labeler, Named Entity Extractor, etc.

Supersense Tagger

http://medialab.di.unipi.it/wiki/SuperSense_Tagger

tool for assigning to each noun, verb, adjective and adverb of a sentence one of the 45 standard WordNet supersenses

WordNet Domains

http://wndomains.fbk.eu/hierarchy.html

Synesketch

http://www.synesketch.krcadinac.com/

Open source textual emotion recognition

Computers are not able to capture the underlying meaning of multimedia content. Annotation is needed.

Manual annotation

Expensive

It can take up to 10x the duration of the video

Problems in scaling to millions of contents

Incomplete or inaccurate

People might not be able to holistically catch all the meanings associated with a multimedia object

Difficult

Some contents are tedious to describe with words

E.g., a melody without lyrics

Automatic annotation

Reasonably good quality

Some technologies have a ~90% precision

Low cost

GOAL: split an audio track according to contained information

Speech

Additional usage

Identification and removal of ads

Keyframe segmentation:

segment a video track according to its keyframes

fixed-length temporal segments

Shot detection:

automated detection of transitions between shots

a shot is a series of consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space.

Speaker Identification : identify people participating in a discussion

Additional usage:

Vocal command execution

Speech To Text : automatically recognize spoken words belonging to an open dictionary

GOAL: automatically classify the genre and mood of a song

Rock, pop, Jazz, Blues, etc.

Happy, aggressive, sad, melancholic,

Additional usage:

Automatic selection of songs for playlist composition

Tutorial from PHAROS Summer School

http://www.pharos-audiovisual-search.eu/res/files/SummerSchool/Programme_Summer_School_file.zip

GOAL: extract implicit characteristics of a picture

luminosity

orientations

textures

Color distribution

GOAL: recognize and identify faces in an image

Usage examples:

People counting

Security applications

GOAL: recognize context/ concepts of an image

E.g., playground, seaside, road, ...

Extraction of low level features from raw data

color histograms, color correlograms, color moments, co-occurrence texture matrices, edge direction histograms, etc.

Features can be used to build discrete classifiers, which may associate semantic concepts to images or regions thereof

The MediaMill semantic search engine defines 491 semantic concepts

http://www.science.uva.nl/research/mediamill/demo

Concepts can be detected also from text (e.g., from manual or automatic metadata) using NLP techniques

GOAL: identify objects appearing in a picture

Basket ball, cars, planes, players, etc.

OpenCV

http://opencv.willowgarage.com/wiki/

Framework for image analysis

Octave

http://www.gnu.org/software/octave/

high-level language, primarily intended for numerical computations; it works well with Matlab

Marsyas (Music Analysis, Retrieval and Synthesis for Audio Signals)

http://marsyas.sness.net/

Framework for music analysis and retrieval

TINA (TINA Is No Acronym)

http://www.tina-vision.net/

is an open source environment developed to accelerate the process of image analysis research.

Sphinx

http://cmusphinx.sourceforge.net/sphinx4/

speech recognition system written entirely in Java

Weka

http://www.cs.waikato.ac.nz/ml/weka/

A collection of machine learning algorithms for data mining

This section is inspired by the tutorial by Dasdan, Tsioutsiouliklis, and Velipasaoglu at WWW 2010

Web Search Engine Metrics for Measuring User Satisfaction

http://analytics.ncsu.edu/reports/wsmt.pdf

Measurable properties

How fast does it process (index) documents?

Number of documents/hour

Average document size

How fast does it search?

Latency as a function of index size

Expressiveness of query language

Speed on complex queries

The key measure: user happiness

What is this?

Speed of response/size of index are factors

But blindingly fast, useless answers won't make a user happy

How do we quantify user happiness?

Who is the user we are trying to make happy?

Depends on the setting

Web engine: user finds what they want and returns to the engine

Can measure rate of return users

eCommerce site: user finds what they want and makes a purchase

Is it the end-user, or the eCommerce site, whose happiness we measure?

Measure time to purchase, or fraction of searchers who become buyers?

Enterprise (company/govt/academic): Care about user productivity

How much time do my users save when looking for information?

Many other criteria having to do with breadth of access, secure access

Relevance

Of search results

Coverage

Presence of content of interest in a catalog

Diversity

Of result set

Discovery and Latency

How many new resources (in the collection) are in the catalog?

How long did it take to get the new resources into the catalog?

Time to first click

Freshness

How do you measure relevance?

In order to assess the performance of an IR system you need a test collection composed of:

A benchmark document collection

A benchmark suite of queries

A binary assessment of either Relevant or Irrelevant for each query-doc pair (gold standard, or ground truth)

Test collection must be of a reasonable size

Need to average performance since results are very variable over different documents and information needs

Set-based evaluation

Rank-based evaluation with explicit judgment

Absolute judgment

Preference judgment

Rank-based evaluation with implicit judgment

Direct and indirect evaluation by clicks

Model-based evaluation

Browsing models

User satisfaction

Relevance is assessed relative to the need, not to the query

E.g., Information need:

I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

Query: wine red white heart attack effective

A document is relevant if it addresses the stated information need, not just because it contains all the words in the query

The two most frequent and basic measures for IR effectiveness are precision and recall

Precision: fraction of retrieved docs that are relevant

P(relevant|retrieved)

Provides a measure of the degree of soundness of the system

This does not consider the total number of documents

Recall: fraction of relevant docs that are retrieved

P(retrieved|relevant)

Provides a measure of the degree of completeness of the system
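
Both measures follow directly from the sets of retrieved and relevant documents. The sketch below is a minimal set-based computation; the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 3 docs relevant overall, 2 in common.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"], relevant=["d1", "d3", "d7"])
print(p, r)
```

Here precision is 2/4 and recall is 2/3: half of what was returned is useful, and one relevant document was missed.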

Can get high recall (but low precision) by retrieving all docs for all queries!

Recall is a non-decreasing function of the number of docs retrieved

Precision usually decreases as more docs are retrieved (in a good system)

Precision can be computed at different levels of recall

Perhaps most appropriate for web search: all people want are good matches on the first one or two results pages

Precision-oriented users

Web surfers

Recall-oriented users

Professional searchers, paralegals, intelligence analysts

Combined measure that assesses the tradeoff between precision and recall (weighted harmonic mean):

F = 1 / (α/P + (1 − α)/R) = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α

Values of β > 1 emphasize recall

People usually use the balanced F1 measure

i.e., with β = 1 or α = 1/2

Harmonic mean is a conservative average

[C.J. van Rijsbergen, Information Retrieval]
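
A short sketch of the F measure, assuming the usual weighted harmonic mean F = (β² + 1)PR / (β²P + R). The function name `f_measure` and the sample values are illustrative only.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 1.0))          # balanced F1
print(f_measure(0.5, 1.0, beta=2))  # beta > 1 weights recall more
```

For P = 0.5 and R = 1.0 the arithmetic mean would be 0.75, while F1 is 2/3: the harmonic mean stays closer to the smaller of the two values, which is why it is called a conservative average.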

Average over large corpus/query

Need human relevance assessments

People aren't reliable assessors

Assessments have to be binary

Nuanced assessments?

Heavily skewed by corpus/authorship

Results may not translate from one domain to another

The relevance of one document is treated asindependentof the relevance of other document

This is also an assumption in most retrieval system

In ranked retrieval systems, P and R are values relative to a rank position

Evaluation performed by computing precision as a function of recall

Function computed at each rank position in which a relevant document has been retrieved

Resulting values are interpolated, yielding a precision/recall plot
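The procedure just described can be sketched as follows. The interpolation rule used here (precision at recall level r is the highest precision observed at any recall level at or above r) is the common textbook choice and is an assumption, since the slides do not state which rule is meant; function names and document ids are hypothetical.

```python
def pr_points(ranking, relevant):
    """(recall, precision) at each rank where a relevant doc is retrieved."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolate(points):
    """Replace each precision by the max precision at that recall or higher."""
    out, best = [], 0.0
    for recall, precision in reversed(points):
        best = max(best, precision)
        out.append((recall, best))
    return out[::-1]


# Hypothetical ranked list with relevant docs at ranks 2, 3 and 5.
points = pr_points(["d2", "d1", "d3", "d4", "d5"], {"d1", "d3", "d5"})
curve = interpolate(points)
```

The raw precision at the first relevant hit is 1/2, but after interpolation it is lifted to 2/3, which removes the sawtooth shape from the plot.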

Mean average precision (MAP)

Measure of quality at all recall levels
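A minimal sketch of MAP, assuming binary judgments: average precision for one query is the mean of the precision values taken at each rank where a relevant document appears (relevant documents never retrieved contribute zero), and MAP is the mean of that value over queries. The names and the two sample queries are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average of precision values at the ranks of the relevant docs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], {"d5"}),              # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```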

Precision@K

Not all queries will have more than K relevant results

Even a perfect system may have a score less than 1.0 for some queries

R-Precision [Allan 2005]

Use a variable result set cut-off for each query based on number of its relevant results
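The difference between a fixed cut-off K and the per-query cut-off of R-Precision can be illustrated with a short sketch. It assumes binary judgments; the function names and sample data are hypothetical.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    relevant = set(relevant)
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking, relevant):
    """Precision at cut-off R, where R is the number of relevant docs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return precision_at_k(ranking, relevant, len(relevant))


# A perfect ranking for a query with only two relevant documents.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d2"}
p_at_5 = precision_at_k(ranking, relevant, 5)
r_prec = r_precision(ranking, relevant)
```

Even though the ranking is perfect, precision at 5 is only 0.4 because the query has just two relevant documents, while R-Precision is 1.0.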

Mean Reciprocal Rank (MRR) [Voorhees 1999]

Reciprocal of the rank of the first relevant result, averaged over a population of queries
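A minimal sketch of MRR follows. Scoring a query with no relevant result as 0 is a common convention and is an assumption here; the function names and sample queries are hypothetical.

```python
def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranking, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], {"a"}),  # first relevant at rank 1
    (["a", "b", "c"], {"b"}),  # first relevant at rank 2
    (["a", "b", "c"], {"z"}),  # no relevant result retrieved
]
mrr = mean_reciprocal_rank(runs)
```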

Discounted Cumulative Gain (DCG) [Järvelin and Kekäläinen 2002]

Gain adjustable for importance of different relevance grades for user satisfaction

Discounting desirable for web ranking

Most users don't browse deep

Search engines truncate the list of results returned.

DCG yields unbounded scores

For each query, divide the DCG by the best attainable DCG for that query

Normalized Discounted Cumulative Gain (nDCG)

Example:

Very Useful: 3

Somewhat useful: 1

Not Useful: 0
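Using the gain values of this example, DCG and nDCG can be sketched as below. Two assumptions are made: the discount is 1 / log2(rank + 1), which is one widely used variant and not necessarily the one on the original slide, and the "best attainable DCG" is approximated by re-sorting the gains of the returned list rather than by looking at every judged document for the query. The sample ranking is hypothetical.

```python
import math

# Gain values taken from the example above.
GAIN = {"very useful": 3, "somewhat useful": 1, "not useful": 0}


def dcg(gains):
    """Sum of gains, each discounted by log2(rank + 1) (assumed discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the same gains in ideal (descending) order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical result list: the most useful documents are not ranked first.
labels = ["somewhat useful", "very useful", "not useful", "very useful"]
gains = [GAIN[label] for label in labels]
score = ndcg(gains)
```

Because the two "very useful" results sit at ranks 2 and 4 instead of 1 and 2, the score falls below 1.0; a list already sorted by gain scores exactly 1.0.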

Kendall tau coefficient

Based on counts of preferences

Range in [-1, 1]

Robust for incomplete judgments
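The preference-counting idea behind Kendall tau can be sketched as follows: every pair of items is a preference, and the coefficient is the number of pairs ordered the same way in both rankings minus the number ordered differently, divided by the total number of pairs. The sketch assumes both rankings contain the same items with no ties or duplicates; names and data are hypothetical.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall tau between two orderings of the same items (no ties)."""
    pos_a = {doc: i for i, doc in enumerate(order_a)}
    pos_b = {doc: i for i, doc in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # Same sign means both rankings agree on which of x, y comes first.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


reference = ["d1", "d2", "d3", "d4"]
tau_swapped = kendall_tau(reference, ["d1", "d3", "d2", "d4"])
```

Identical orderings give 1, fully reversed orderings give -1, and the single swap above gives (5 - 1) / 6.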

Binary Preference (bpref)

Buckley and Voorhees (2004)

Designed for incomplete judgments

Generalized to graded judgments

De Beer and Moens (2006)
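A rough sketch of bpref is given below. It follows one common reading of the Buckley and Voorhees formulation: for each judged relevant document retrieved, count how many judged non-relevant documents are ranked above it (capped at R, the number of judged relevant documents), and ignore unjudged documents entirely. The exact normalization differs between implementations, so treat it as an assumption; names and data are hypothetical.

```python
def bpref(ranking, relevant, nonrelevant):
    """bpref = (1/R) * sum over retrieved relevant docs of
    (1 - min(judged non-relevant ranked above, R) / R)."""
    relevant, nonrelevant = set(relevant), set(nonrelevant)
    num_relevant = len(relevant)
    if num_relevant == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            score += 1.0 - min(nonrel_seen, num_relevant) / num_relevant
        # Unjudged documents fall through and do not affect the score.
    return score / num_relevant


relevant, nonrelevant = {"d1", "d2"}, {"n1"}
with_unjudged = bpref(["d1", "u1", "n1", "d2", "u2"], relevant, nonrelevant)
judged_only = bpref(["d1", "n1", "d2"], relevant, nonrelevant)
```

Both calls return the same value, which illustrates why the measure is suited to incomplete judgments: adding or removing unjudged documents leaves the score unchanged.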

How to present information?

Which information?

Where should it be displayed?

Which presentation elements should be used?

Font, colors, design elements, interaction design

Generalization

How to measure success?

User studies

On-line, on-home, usability, eye tracking, focus group, surveys

Log analysis

Editorial

Comparative, Perceived vs. actual

Golden Triangle

The first result is always considered more trusted and more relevant by default

The user spends less time reading the lower part of the page

[Marti A. Hearst, Search User Interfaces, Cambridge University Press, 2009]

Questions?

Modern Information Retrieval

Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Addison Wesley Longman Publishing Co. Inc., 2010

[ManningIR] Introduction to Information Retrieval

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008

Information Retrieval: Algorithms and Heuristics.

D.A. Grossman, O. Frieder. Springer, 2004

Managing Gigabytes.

I.H. Witten, A. Moffat, T.C. Bell. Morgan Kaufmann, 1999

Mining the Web: Analysis of Hypertext and Semi Structured Data.

S. Chakrabarti. Morgan Kaufmann, 2002

Search User Interfaces

Marti A. Hearst. Cambridge University Press, 2009

Search Computing: Challenges and Directions

Stefano Ceri, Marco Brambilla (eds.). Springer LNCS, vol. 5950, 2010

Web Search Engine Metrics: Direct Metrics to Measure User Satisfaction

Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu (Yahoo!)

WWW 2010

Recent Progress on Inferring Web Searcher Intent

Eugene Agichtein (Emory University)

WWW 2010

Applications of Open Search Tools

Rosie Jones, Ted Drake (Yahoo!)

WWW 2010

[BAEZASeco2010] New Frontiers for Search

Ricardo Baeza-Yates

WWW 2010

Web Mining for Search

Ricardo Baeza-Yates and Rosie Jones (Yahoo!)

SIGIR 2008

[Ramakrishnan and Tomkins 2007] Raghu Ramakrishnan, Andrew Tomkins: Toward a PeopleWeb

IEEE Computer 40(8): 63-72 (2007)

[Broder2002] A. Broder. A taxonomy of web search

SIGIR Forum, 36(2):3-10, 2002.

[BATES2002] Bates, Marcia J. Toward an integrated model for information seeking and searching

In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts, 2002

[FU2007] Fu, Wai-Tat; Pirolli, Peter. SNIF-ACT: a cognitive model of user navigation on the world wide web

Human-Computer Interaction: 335-412, 2007

[Withrow2002] Jason Withrow. Do your links stink?

American Society for Information Science Bulletin, June 1, 2002

[Pirolli2009] Pirolli, Peter. An elementary social information foraging model

Proceedings of the 27th international conference on Human factors in computing systems: 605-614, 2009

[D. Rose, 2008]

[BATES1989] M.J. Bates. The design of browsing and berrypicking techniques for the online search interface

Online Review, 13(5):407-431, 1989.

[Teevan et al., CHI 2004] Teevan, J., Alvarado, C., Ackerman, M. and Karger, D. The perfect Search Engine is not Enough: A Study of Orienteering Behavior in Directed Search

Proceedings of ACM CHI 2004, pp. 415-422.

[MARCHIONINI2006] Marchionini, G. Exploratory search: from finding to understanding.

Communications of the ACM 49(4): 41-46 (2006)

[WHITE2007] White, R. W., and Drucker, S. M. Investigating behavioral variability in web search

16th WWW Conf. (Banff, Canada, 2007)

[AULA2008] Aula, A., and Russell, D.M. Complex and Exploratory Web Search

ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)

[BozzonEtAL2010] Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web

WWW 2010, Raleigh, USA

[FAGIN1999] R. Fagin. Combining fuzzy information from multiple systems

J. Comput. Syst. Sci., 58(1):83-99, 1999.

[ILYAS1999] F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization

In SIGMOD Conference, pages 203-214, 2004.

[MARTINENGHI2010] D. Martinenghi and M. Tagliasacchi: Proximity Rank Join

to appear in PVLDB

[Carbonell and Goldstein 1998] J. Goldstein and J. Carbonell (1998), Summarization: Using MMR for Diversity-based Reranking

SIGIR 98

[BozzonEtAl2007] Alessandro Bozzon et al. Role Based Access Control for the interaction with Search Engines

International Workshop on Collaborative Open Environments for Project-Centered Learning (COOPER) 2007, Crete, Greece.

[BozzonEtAl2009] Alessandro Bozzon, Marco Brambilla, Piero Fraternali. Conceptual Modeling of Multimedia Search Applications using Rich Process Models

ICWE 2009, June 24-26, 2009, San Sebastian, Spain

[BozzonThesis2009] Alessandro Bozzon, Model-driven development of Search Based Web Applications

Ph.D Thesis, Politecnico di Milano, April 2009.

[BragaEtAl2010] D. Braga, S. Ceri, F. Corcoglioniti, M. Grossniklaus, and S. Vadacca: Panta Rhei: An Execution Model for Queries over Web Information Sources

http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf

[Allan 2005] J. Allan (2005), HARD track overview in TREC 2005: High accuracy retrieval from documents.

[Voorhees 1999] E.M. Voorhees (1999), TREC-8 question answering track report

[Järvelin and Kekäläinen 2002] K. Järvelin and J. Kekäläinen, Cumulated gain-based evaluation of IR techniques

ACM Trans. IS, 20(4): 422-446, 2002

[Buckley and Voorhees (2004)] C. Buckley and E.M. Voorhees, Retrieval evaluation with incomplete information

SIGIR 04.

[De Beer and Moens (2006)] De Beer, Jan; Moens, Marie-Francine. Rpref: a generalization of Bpref towards graded relevance judgments

SIGIR 2006, Seattle, USA, 6-11 August 2006, pages 637-638, ACM

Search Computing Course Lecture Notes

http://www.search-computing.it/course

Fabio Aiolli, Università di Padova, http://www.math.unipd.it/~aiolli/corsi/0809/IR/IR.html

http://www.ir.disco.unimib.it/
