Conceptual structures in modern information retrieval Claudio Carpineto Fondazione Ugo Bordoni [email protected]

Overview

• Keyword-based IR and early conceptual approaches

• Context and concepts in modern topical IR

• Emerging IR tasks requiring knowledge structures

• Research at FUB

• Conclusions

Vector-based IR

[Figure: documents and the query are each represented as a vector of weighted keywords; matching the two yields the retrieved documents.]
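The matching step in the diagram can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the matching function (the slide does not name one); the toy vectors, the document ids, and the function names `cosine` and `retrieve` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {keyword: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query, docs):
    """Return (doc_id, score) pairs with a nonzero score, best match first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)

# Toy collection: each document is a vector of weighted keywords (assumed weights).
docs = {
    "d1": {"bank": 1.0, "finance": 2.0},
    "d2": {"bank": 1.0, "river": 2.0},
    "d3": {"waters": 1.0},
}
query = {"finance": 1.0, "bank": 1.0}
ranking = retrieve(query, docs)
```

Here `d3` shares no keyword with the query and is not retrieved at all, which is the vocabulary problem in miniature.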

Term weighting

• tf.idf and the vector space model (Salton) were very popular in the 70’s and 80’s

• BM25 (Robertson) was the state of the art in the 90’s

• Several recent term-weighting functions based on statistical language modeling (Ponte, Lafferty)

• A new weighting framework based on deviation from randomness + information gain (FUB + UG)

W = Inf1 · Inf2

tf · log[(N + 1) / (n + 0.5)], ...

tf / (tf + 1), ...

tfn = tf · log(1 + K · avg_l / l)
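The expressions above can be evaluated directly, as in this minimal sketch. It assumes natural logarithms and the usual reading of the symbols (N documents in the collection, n of them containing the term, l the document length, avg_l the average document length, K a free parameter); none of this is stated on the slide. The slide also does not say which expression instantiates Inf1 or Inf2, so the sketch leaves that choice to the caller, and the function names are hypothetical.

```python
import math

def first_expression(tf, N, n):
    """tf * log((N + 1) / (n + 0.5)), as written on the slide (natural log assumed)."""
    return tf * math.log((N + 1) / (n + 0.5))

def second_expression(tf):
    """tf / (tf + 1), as written on the slide."""
    return tf / (tf + 1)

def normalized_tf(tf, l, avg_l, K=1.0):
    """tfn = tf * log(1 + K * avg_l / l): term frequency adjusted for document length."""
    return tf * math.log(1 + K * avg_l / l)

def weight(inf1, inf2):
    """W = Inf1 * Inf2; which expression supplies each factor is left to the caller."""
    return inf1 * inf2

# A term occurring 3 times in a document of exactly average length, with K = 1.
tfn = normalized_tf(tf=3, l=100, avg_l=100)
```

For a document of average length and K = 1 the normalization multiplies tf by log 2; longer documents get a smaller factor and shorter ones a larger factor.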

Inherent limitations of keyword-based IR

• Vocabulary problem

• Relations are ignored

Early approaches to conceptual IR

• n-grams (Salton 1975, Maarek 1989)

• parse tree (Dillon 1983, Metzler 1989)

• case relations (Fillmore 1968, Somers 1987)

• conceptual graphs (Dick 1991)

Why early conceptual IR was not successful

• No best representation scheme

• Manual coding too costly

• Automated coding too hard

• Training required both for the indexer and the user

• Effectiveness not clearly demonstrated

• Retrieval task often not appropriate

Evolution of topical IR

• Very short queries

• Heterogeneous collections

• Unreliable sources

• Interactive sessions

Model of modern topical IR

[Figure: diagram with the components Docs, Query, Context, Indexing, Ranking, Visualization, Interaction, and Use.]

Query Expansion

[Figure: flowchart with the elements Query, Ranking, Inverted File, Form. Docs, Select top D docs, Compute σ(w), Select top E terms, and Weighted Query (+norm).]
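The feedback loop named in the flowchart (rank, select top D docs, compute σ(w), select top E terms, reweight the query) can be sketched as below. This is a toy illustration under stated assumptions: the slide does not define σ(w), so the score used here (relative frequency in the top documents minus relative frequency in the collection) is a hypothetical stand-in, as are the first-pass scorer, the max-normalization step, and the names `rank` and `expand`.

```python
from collections import Counter

def rank(query_weights, docs):
    """First-pass ranking: sum of query weight times term frequency (stand-in scorer)."""
    scores = {d: sum(w * toks.count(t) for t, w in query_weights.items())
              for d, toks in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def expand(query_weights, docs, D=2, E=2):
    """Add the E best-scoring terms from the top D documents to the query."""
    top = rank(query_weights, docs)[:D]
    fb = Counter(t for d in top for t in docs[d])                  # top-D term counts
    coll = Counter(t for toks in docs.values() for t in toks)      # collection counts
    fb_total, coll_total = sum(fb.values()), sum(coll.values())
    # Hypothetical sigma(w): how over-represented w is in the top documents.
    sigma = {t: fb[t] / fb_total - coll[t] / coll_total
             for t in fb if t not in query_weights}
    new_terms = sorted(sigma, key=sigma.get, reverse=True)[:E]
    expanded = dict(query_weights)
    expanded.update({t: sigma[t] for t in new_terms})
    norm = max(expanded.values())                                   # assumed normalization
    return {t: w / norm for t, w in expanded.items()}

docs = {
    "d1": ["bank", "finance", "credit", "credit"],
    "d2": ["bank", "finance", "credit", "account"],
    "d3": ["bank", "river", "waters"],
    "d4": ["river", "waters", "boat"],
}
expanded = expand({"finance": 1.0}, docs)
```

With this data the query gains "credit" and "account", while "bank" is passed over because it is almost as frequent in the whole collection as in the top documents.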

Performance of retrieval feedback versus query difficulty

Ranking based on interdocument similarity

Cluster hypothesis (van Rijsbergen 1978)

Approaches

- Matching the query against document clusters (Willett 1988)

- Matching the query against transformed document representations (GVSM, Wong 1987; LSI, Deerwester 1990)

- Computing the conceptual distance between query and documents (order-theoretical ranking, Carpineto 2000)

Order-theoretical ranking

[Figure: lattice over the query and documents D1 to D7; each node is labeled with a number and a set of terms.]

0  NNS FINANCE (Query)
1  NNS FINANCE CREDIT KBS (D7)
1  NNS FINANCE BANK ACCOUNT (D1)
1  NNS
1  FINANCE
2  NNS BANK
2  NNS BANK ACCOUNT (D3)
2  FINANCE CREDIT KBS (D4)
3  CREDIT KBS (D5)
3  NNS BANK RIVER (D2)
3  BANK
4  KBS
4  BANK KBS WATERS (D6)
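The node numbers in the figure can be reproduced with a small sketch, under assumptions that the slide does not state: nodes are the non-empty intersections of the term sets of the query and documents, edges link a node to its immediate supersets, and a document's score is the length of the shortest path from the query node. Function names are hypothetical, and this is not presented as the published algorithm.

```python
from collections import deque
from itertools import combinations

docs = {
    "D1": {"NNS", "FINANCE", "BANK", "ACCOUNT"},
    "D2": {"NNS", "BANK", "RIVER"},
    "D3": {"NNS", "BANK", "ACCOUNT"},
    "D4": {"FINANCE", "CREDIT", "KBS"},
    "D5": {"CREDIT", "KBS"},
    "D6": {"BANK", "KBS", "WATERS"},
    "D7": {"NNS", "FINANCE", "CREDIT", "KBS"},
}
query = {"NNS", "FINANCE"}

def intents(term_sets):
    """Close the term sets under intersection, dropping the empty set (assumption)."""
    closed = {frozenset(s) for s in term_sets}
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(closed), 2):
            c = a & b
            if c and c not in closed:
                closed.add(c)
                changed = True
    return closed

def cover_edges(nodes):
    """Undirected edges a-b where a is a proper subset of b with no node in between."""
    adj = {n: set() for n in nodes}
    for a in nodes:
        for b in nodes:
            if a < b and not any(a < c < b for c in nodes):
                adj[a].add(b)
                adj[b].add(a)
    return adj

def lattice_distances(docs, query):
    """Shortest-path distance from the query node to each document node (BFS)."""
    q = frozenset(query)
    adj = cover_edges(intents(list(docs.values()) + [query]))
    dist, queue = {q: 0}, deque([q])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {d: dist.get(frozenset(terms)) for d, terms in docs.items()}

distances = lattice_distances(docs, query)
ranking = sorted(distances, key=distances.get)
```

D3 shares only NNS with the query yet is placed at distance 2, level with D4, because it is one step away from D1. Paths through an empty-term top node are deliberately excluded; including them would shorten some distances and no longer match the figure.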

Performance of order-theoretical ranking

• Better than hierarchic clustering and comparable to best matching on the whole collection

• Markedly better than both hierarchic clustering and best matching on non-matching relevant documents

• Order-theoretical ranking does not scale up well but it is synergistic with best matching document ranking

Question Answering

Task:

Closed-class questions in unrestricted domains, with no guarantee of an answer and results possibly scattered over multiple documents

Approach:

1. Recognize the type of query

2. Retrieve relevant documents

3. Find sought entities near question words

4. Fall back to best-matching passage retrieval in case of failure
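The four steps above can be sketched as a toy pipeline. Everything concrete here is an assumption made for illustration: the query type is read off the first word, the answer patterns are two hypothetical regular expressions, retrieval is plain word overlap, and "near" means within a fixed character window.

```python
import re

# Hypothetical answer-type patterns, keyed by the leading question word.
ANSWER_PATTERNS = {
    "when": re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b"),    # a four-digit year
    "who": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),     # two capitalized words
}

def tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def answer(question, passages, window=40):
    q_tokens = tokens(question)
    if not q_tokens or not passages:
        return None
    # Step 1: recognize the type of query from its first word.
    q_type, content = q_tokens[0], set(q_tokens[1:])
    # Step 2: retrieve, here by ranking passages on word overlap with the question.
    ranked = sorted(passages, key=lambda p: len(content & set(tokens(p))), reverse=True)
    # Step 3: look for an entity of the expected type near question words.
    pattern = ANSWER_PATTERNS.get(q_type)
    if pattern is not None:
        for passage in ranked:
            for m in pattern.finditer(passage):
                context = passage[max(0, m.start() - window): m.end() + window]
                if content & set(tokens(context)):
                    return m.group(0)
    # Step 4: fall back to the best-matching passage.
    return ranked[0]

passages = [
    "The river bank floods every spring.",
    "The old library opened in 1901.",
]
found = answer("When did the library open?", passages)
fallback = answer("Where is the library?", passages)
```

The second question has no pattern for its type, so the sketch returns the best-matching passage instead of an entity.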

Web Information Retrieval

Current tasks:

• named-entity finding task

• topic distillation task

Approach:

1. Use of multiple methods

2. Combination of results via interpolation and normalization schemes
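Step 2 of the approach can be sketched as score fusion. This is a minimal illustration, assuming min-max normalization and linear interpolation with fixed weights; the slide does not say which schemes were used, and the two runs ("content" and "link" scores), their values, and the weights are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc: score} run to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def interpolate(runs, weights):
    """Weighted sum of normalized scores; a document missing from a run contributes 0."""
    fused = {}
    for run, w in zip(runs, weights):
        for d, s in min_max(run).items():
            fused[d] = fused.get(d, 0.0) + w * s
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Two hypothetical methods scoring the same three pages on different scales.
content_run = {"a": 12.0, "b": 8.0, "c": 4.0}
link_run = {"a": 1.0, "b": 5.0, "c": 3.0}
fused = interpolate([content_run, link_run], weights=[0.7, 0.3])
```

Normalizing first keeps the method with the larger raw scores from dominating the combination regardless of its weight.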

XML document retrieval

Goal:

Use document structure to improve precision and recall of unstructured queries

“concerts this weekend at Sofia under 20 euros”

Approaches:

• Automatic inference of query structure

• Semi-automatic query annotation

• Hybrid query languages

Recommender systems

“Related keyword” feature

versus

Context-dependent query reformulation

[Figure: diagram with the components Docs, Query, Document Ranking, and Term ranking 1, combined (+).]

Combining text retrieval and text mining with concept lattices

Goal

Integration of multiple search strategies (querying, browsing, thesaurus climbing, bounding) into a single Web interface

Conclusions

The use of conceptual structures surfaces in traditional topic relevance retrieval and is at the heart of many non-topical retrieval tasks

Towards conceptual search

• Understand term meaning

• Adapt to the user

• Can translate between applications

• Explainable

• Capable of filtering and summarization