Captain Nemo: A Metasearch Engine with Personalized Hierarchical Search Space
(http://www.dblab.ntua.gr/~stef/nemo)
Stefanos Souldatos, Theodore Dalamagas, Timos Sellis
(National Technical University of Athens, Greece)
INTRODUCTION
Metasearching
[Figure: a metasearch engine forwards the user's query to Search Engine 1, Search Engine 2 and Search Engine 3.]
Metasearch engines can reach a large part of the web.
Personalization
Personalization has become a key need on the Web.
Personalization in Metasearching
Result Retrieval → Result Presentation → Result Administration
Personalization can be applied in all 3 stages of metasearching:
Personalization in Metasearching
Personal Retrieval Model
search engines, #pages, timeout
Personalization in Metasearching
Personal Presentation Style
grouping, ranking, appearance
Personalization in Metasearching
Thematic Classification of Results
k-Nearest Neighbor, Support Vector Machines, Naive Bayes, Neural Networks, Decision Trees, Regression Models
Hierarchical Classification
Flat Model: ROOT → CINEMA (movie, film, actor), PAINTING (painter, canvas, gallery), BASKETBALL (basket, nba, game), FOOTBALL (ground, ball, match)
Hierarchical Model: ROOT → ART (fine arts) → CINEMA, PAINTING; ROOT → SPORTS (athlete, score, referee) → BASKETBALL, FOOTBALL
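The two models above can be sketched as nested dictionaries; the flat model is simply the set of leaves of the hierarchical one. The structure below uses the slide's example categories; the function name `leaves` is illustrative, not Captain Nemo's code:

```python
# The example topic hierarchy as a nested dict (names from the slide;
# structure only -- this is an illustration, not Nemo's implementation).
HIERARCHY = {
    "ROOT": {
        "ART": {"CINEMA": {}, "PAINTING": {}},
        "SPORTS": {"BASKETBALL": {}, "FOOTBALL": {}},
    }
}

def leaves(tree):
    """Return all leaf category names under a (sub)tree."""
    out = []
    for name, sub in tree.items():
        if sub:
            out.extend(leaves(sub))   # descend into inner categories
        else:
            out.append(name)          # leaf category
    return out

# The flat model's categories are exactly the hierarchical model's leaves.
print(leaves(HIERARCHY["ROOT"]))  # ['CINEMA', 'PAINTING', 'BASKETBALL', 'FOOTBALL']
```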
RELATED WORK
Personalization in Metasearch Engines
Personalization in Metasearch Engines
The user defines the search engines to be used: Search, Ixquick, Infogrid, Mamma, Profusion, WebCrawler, Query Server.
The user defines the timeout option (i.e. max time to wait for search results): Infogrid, Mamma, Profusion, Query Server.
The user defines the number of pages to be retrieved by each search engine: Profusion, Query Server.
Results can be grouped by the search engine that retrieved them: Dogpile, WebCrawler, MetaCrawler.
Northern Light: organizes search results into dynamic custom folders.
Inquirus2: recognises thematic categories and improves queries towards a category.
Buntine et al. (2004): topic-based open-source search engine.
CAPTAIN NEMO
Personal Retrieval Model
Personal Retrieval Model
Search Engines
Number of Results
Search Engine Timeout
Search Engine Weight
[Figure: example per-engine settings (number of results, timeout, weight) for three search engines.]
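The four settings listed above can be sketched as a small per-engine profile. The class name and the concrete numbers below are illustrative; only the weights 7, 10 and 5 echo the ranking example later in the deck:

```python
# Hypothetical sketch of Captain Nemo's personal retrieval model settings;
# names and values are illustrative, not taken from the system's code.
from dataclasses import dataclass

@dataclass
class EngineSettings:
    name: str
    num_results: int   # pages to request from this engine
    timeout_s: int     # max seconds to wait for its results
    weight: int        # reliability weight, used later in ranking

profile = [
    EngineSettings("SearchEngine1", num_results=20, timeout_s=6, weight=7),
    EngineSettings("SearchEngine2", num_results=30, timeout_s=8, weight=10),
    EngineSettings("SearchEngine3", num_results=10, timeout_s=4, weight=5),
]
```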
Personal Presentation Style
Result Grouping: merged in a single list; grouped by search engine; grouped by relevant topic of interest.
Result Content: title only; title and URL; title, URL and description.
Look ‘n’ Feel: color themes (XSL stylesheets), page layout, font size.
Topics of Personal Interest
Administration of topics of personal interest
The user defines a hierarchy of topics of personal interest (i.e. thematic categories).
Each thematic category has a name and a description of 10-20 words.
The system offers an environment for the administration of the thematic categories and their content.
Hierarchical classification of results
The system proposes the most appropriate thematic category for each result (Nearest Neighbor).
The user can save the results in the proposed or other category.
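The suggestion step can be sketched as a nearest-neighbour match on term overlap between a result snippet and each category's short description. Captain Nemo's actual features and distance metric are not specified on the slide, so treat this as an illustration:

```python
# Minimal nearest-neighbour category suggestion: propose the category
# whose description shares the most terms with the result snippet.
# (Illustrative sketch; not Captain Nemo's actual classifier.)
def suggest_category(snippet, categories):
    """categories: dict of category name -> short description string."""
    words = set(snippet.lower().split())
    def overlap(desc):
        return len(words & set(desc.lower().split()))
    return max(categories, key=lambda name: overlap(categories[name]))

cats = {
    "BASKETBALL": "basket nba game",
    "CINEMA": "movie film actor",
}
print(suggest_category("nba game recap: bulls win", cats))  # BASKETBALL
```

The user would then confirm the proposed category or pick another one, as the slide describes.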
Classification Example
[Figure: the sample topic hierarchy again: ROOT → ART (CINEMA, PAINTING) and SPORTS (BASKETBALL, FOOTBALL), with the feature terms shown earlier.]
Query: “Michael Jordan”
Results in user’s topics of interest:
[Figure: number of results classified under each of the user's topics of interest.]
METASEARCH RANKING
Two Ranking Approaches
1. Using the initial scores of the search engines
2. Not using the initial scores of the search engines
Using Initial Scores
Rasolofo et al. (2001) believe that the initial scores of the search engines can be exploited.
Normalization is required in order to achieve a common measure of comparison.
A weight factor incorporates the reliability of each search engine: engines that return more Web pages receive a higher weight, on the assumption that the number of relevant pages retrieved is proportional to the total number of pages retrieved as relevant.
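Under these assumptions, score fusion can be sketched as min-max normalization per engine followed by multiplication with the engine's weight. The function name and values below are illustrative, not from any cited system:

```python
# Normalize one engine's raw scores to [0, 1], then scale by its weight,
# so scores from different engines become comparable. (Illustrative sketch.)
def weighted_normalized(scores, weight):
    lo, hi = min(scores), max(scores)
    if hi == lo:                 # all scores equal: treat each as 1.0
        return [weight * 1.0 for _ in scores]
    return [weight * (s - lo) / (hi - lo) for s in scores]

print(weighted_normalized([100, 80, 60], weight=2.0))  # [2.0, 1.0, 0.0]
```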
Not Using Initial Scores
The scores of various search engines are not compatible and comparable even when normalized.
Towell et al. (1995) note that the same document receives different scores in various search engines.
Gravano and Papakonstantinou (1998) point out that the comparison is not feasible even among engines using the same ranking algorithm.
Dumais (1994) concludes that scores depend on the document collection used by a search engine.
Aslam and Montague (2001)
Bayes-fuse uses probabilistic theory to calculate the probability of a result to be relevant to a query.
Borda-fuse is based on democratic voting: each search engine casts votes for the results it returns (N votes for the first result, N-1 for the second, etc.). The metasearch engine sums the votes, and the ranking is determined by the totals.
Weighted borda-fuse: weighted alternative of borda-fuse, in which search engines are not treated equally, but their votes are considered with weights depending on the reliability of each search engine.
Weighted Borda-Fuse
V(r_i,j) = w_j * (max_k(r_k) - i + 1)

where V(r_i,j) are the votes of the i-th result of search engine j, w_j is the weight of search engine j (set by the user), and max_k(r_k) is the maximum number of results returned by any engine.

Example (w_1 = 7, w_2 = 10, w_3 = 5; max results = 5):
SE1 (4 results): votes 5, 4, 3, 2 → weighted 35, 28, 21, 14
SE2 (3 results): votes 5, 4, 3 → weighted 50, 40, 30
SE3 (5 results): votes 5, 4, 3, 2, 1 → weighted 25, 20, 15, 10, 5
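The formula can be sketched directly; as in borda-fuse, votes for a result returned by several engines are summed. Engine names and result ids below are illustrative:

```python
# Weighted borda-fuse: engine j gives its i-th result
# w_j * (max_k(r_k) - i + 1) votes; votes per result id are summed
# across engines and results are ranked by total votes.
from collections import defaultdict

def weighted_borda_fuse(rankings, weights):
    """rankings: dict engine -> ordered list of result ids."""
    max_r = max(len(r) for r in rankings.values())
    votes = defaultdict(float)
    for engine, results in rankings.items():
        w = weights[engine]
        for i, doc in enumerate(results, start=1):
            votes[doc] += w * (max_r - i + 1)
    return sorted(votes, key=votes.get, reverse=True)

rankings = {
    "SE1": ["a", "b", "c", "d"],
    "SE2": ["e", "a", "f"],
    "SE3": ["g", "b", "h", "i", "j"],
}
weights = {"SE1": 7, "SE2": 10, "SE3": 5}
print(weighted_borda_fuse(rankings, weights)[:3])  # ['a', 'e', 'b']
```

Here result "a" wins because it collects 35 votes from SE1 plus 40 from SE2, overtaking SE2's top result "e" (50 votes).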
Captain Nemo
http://www.dblab.ntua.gr/~stef/nemo
Links
Introduction
Related work
Captain Nemo
Metasearch Ranking