Thesaurus Mapping

Martin Doerr

Foundation for Research and Technology - Hellas
Institute of Computer Science

Bath, UK, January 11, 2000

Centre for Cultural Informatics and Documentation Systems

ICS-FORTH, January 11, 2000


Thesaurus Mapping: The Problem

Logical aspects
- Semantics of involved entities
- Notions of translation
- Objectives and logics of mapping

Production of mappings
- Human
- Language engineering, cluster analysis

Architecture
- Mapping management
- Mapping service
- Integration in IT environment


Thesaurus Mapping: Why do we need mapping?

Thesauri for information retrieval depend on:
- Viewpoint (e.g. functional, morphological, social, special database fields etc.)
- Language or social group (experts, common people etc.)
- Size and distribution of target material (effective partitioning)

Therefore:
- Concepts differ
- Use of concepts differs
- Semantic embedding differs

Even if we agree on the same world.

Research topic: Formalisation of views and context


Thesaurus Mapping: Semantics of entities

Concepts are defined by agreement, e.g. orange (colour)

Concepts identify sets of real world objects

Concepts are identified by scope notes, literature references, examples, images

Concepts should not be changed:
- they should be created or abandoned
- they should be understood, accepted or rejected

A Descriptor is a concept identifier


Thesaurus Mapping: Semantics of entities

Links should express opinions and differences:
- about set relations between concepts (subsumption, disjointness etc.)
- about derived concepts
- about term usage
Opinions may be human or computational!

Terms (noun phrases):
- should be used by social groups to refer to (multiple) concepts
- without direct linguistic meaning
- one term is selected as concept identifier


Thesaurus Mapping: Semantics of entities

Concept - concept relations:
- set semantics: BT, between thesauri/versions - for query expansion, users
- associative: RTs, BTP, etc. - for user guidance

Concept - term relations:
- authoritative: preferred, used for - for cataloguers, users
- statistical, possible synonyms - for information retrieval

Term - term relations:
- dictionary entries - limited precision, within LE tools
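The distinction between link types can be made concrete with a small data model. The following is a minimal sketch, not the author's system: the class names, the `narrower_closure` helper and the example terms are all hypothetical. It only illustrates that links with set semantics (BT) are safe to follow for query expansion, while associative links (RT) are kept for user guidance and are not followed.

```python
from dataclasses import dataclass, field
from enum import Enum


class LinkKind(Enum):
    BROADER = "BT"                  # concept - concept, set semantics
    RELATED = "RT"                  # concept - concept, associative
    PREFERRED = "preferred"         # concept - term, authoritative
    USED_FOR = "used for"           # concept - term, authoritative
    POSSIBLE_SYNONYM = "synonym?"   # concept - term, statistical


@dataclass(frozen=True)
class Link:
    source: str      # Link(x, BROADER, y) reads "x has broader term y"
    kind: LinkKind
    target: str


@dataclass
class Thesaurus:
    links: list = field(default_factory=list)

    def add(self, source, kind, target):
        self.links.append(Link(source, kind, target))

    def narrower_closure(self, concept):
        """Concept plus everything below it, following BT links only."""
        found = {concept}
        frontier = [concept]
        while frontier:
            current = frontier.pop()
            for link in self.links:
                if (link.kind is LinkKind.BROADER
                        and link.target == current
                        and link.source not in found):
                    found.add(link.source)
                    frontier.append(link.source)
        return found


# Hypothetical example data, not taken from any real thesaurus.
demo = Thesaurus()
demo.add("watermills", LinkKind.BROADER, "mills")
demo.add("mills", LinkKind.BROADER, "industrial buildings")
demo.add("mills", LinkKind.RELATED, "millstones")   # guidance only

expanded = demo.narrower_closure("industrial buildings")
print(sorted(expanded))
```

The related term "millstones" stays out of the expansion because an associative link says nothing about set inclusion.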


Thesaurus Mapping: What is a Multilingual Thesaurus?

A translated thesaurus: for comprehension
- Established concepts and terms from one user group
- Optimally interpreted in words of one or more other languages
- Translations are not established terms

Mapped thesauri (ISO 5964): for transition
- Independent thesauri, each one from another user group
- Established concepts and terms
- Links declare “overlap” between concepts

Interlingua: for communication and knowledge sharing
- Compromise to share concepts between many user groups
- Optimally interpreted in words of another language


Thesaurus Mapping: Functionality of Mapping

Transparent query transformation (Z39.50!)

Replace a Boolean term combination from thesaurus A with the optimal term combination from thesaurus B to retrieve equivalent results

Guaranteed transition needed (possibly to higher concepts)

Need controlled loss of precision or recall (research!)

Combinatorial explosion: need cascading Thes A => Thes B => Thes C
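Query transformation and cascading can be sketched as term substitution over a Boolean query tree. This is an illustrative sketch under strong assumptions: queries are nested tuples, every mapping entry is an exact equivalence (so substitution under NOT is safe), and the descriptor names `A1`, `B1`, `C1` etc. are placeholders. A missing entry raises `KeyError`, which stands in for the requirement that a transition must always exist.

```python
def translate(query, mapping):
    """Replace each descriptor in a Boolean query by its mapped expression.

    A query is either a descriptor (str) or a tuple (operator, operand, ...).
    """
    if isinstance(query, str):
        return mapping[query]   # KeyError: no guaranteed transition
    op, *args = query
    return (op, *[translate(arg, mapping) for arg in args])


# Hypothetical exact-equivalence tables between three thesauri.
a_to_b = {"A1": ("OR", "B1", "B2"), "A2": "B3"}
b_to_c = {"B1": "C1", "B2": "C2", "B3": ("AND", "C3", "C4")}

query_a = ("AND", "A1", ("NOT", "A2"))
query_b = translate(query_a, a_to_b)
query_c = translate(query_b, b_to_c)   # cascade: Thes A => Thes B => Thes C
print(query_c)
```

With cascading, only the tables along the chain are needed; no direct A-to-C table has to be produced. Broader or narrower equivalences would need extra care, since NOT reverses the direction of the approximation.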


Thesaurus Mapping: Logics of Mapping

Interthesaurus relations (ISO 5964)
(from Descriptor of Thes. A to Descriptor of Thes. B)

• partial equivalence
  Better: broader equivalence, narrower equivalence
• exact equivalence
• inexact equivalence (“+/-”)
  good for FTR only
• single to multiple equivalence
  Better: exact equivalence to BOOLEAN combination of target terms:
  “AND” (intersection), “OR” (union), “NOT” (complement)
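Reading AND, OR and NOT as intersection, union and complement means a Boolean combination of target descriptors can be evaluated directly on the sets of indexed items. The sketch below assumes a hypothetical `extension` table from descriptor to item identifiers and a finite `universe` of items; neither comes from the slides.

```python
def evaluate(expr, extension, universe):
    """Evaluate a Boolean combination of descriptors to a set of items."""
    if isinstance(expr, str):
        return extension[expr]
    op, *args = expr
    sets = [evaluate(arg, extension, universe) for arg in args]
    if op == "AND":
        return set.intersection(*sets)
    if op == "OR":
        return set.union(*sets)
    if op == "NOT":
        (only,) = sets              # NOT takes exactly one operand
        return universe - only
    raise ValueError(f"unknown operator: {op}")


# Hypothetical item identifiers.
universe = {1, 2, 3, 4, 5, 6}
extension = {"B": {1, 2, 3}, "C": {3, 4}}

# Source descriptor A declared exactly equivalent to "B AND NOT C".
a_items = evaluate(("AND", "B", ("NOT", "C")), extension, universe)
print(sorted(a_items))
```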


Thesaurus Mapping: Translation and Mapping

[Diagram. Labels: English Heritage Thesaurus, Merimee Thesaurus, English Vocabulary, French Vocabulary, Interlingua; link labels: interthesaurus relations, linguistic translation, “AND”, “+/-”]


Thesaurus Mapping: Boolean OR-Combinations

[Diagram. Labels: A, B, C, BT, “B OR C”, Exact equivalence, Boolean Compound]

• Combines instances of B and C
• Uses properties of either B or C
• Is BT of B, C and NT of their common broader terms.


Thesaurus Mapping: Boolean AND-Combinations

[Diagram. Labels: A, B, C, BT, “B AND C”, Exact equivalence, Boolean Compound]

• Uses instances that belong to both B and C
• Combines properties of B and C
• Is NT of B, C and BT of their common narrower terms.
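The BT/NT claims for the OR and AND compounds follow from set inclusion and can be checked on toy data. This is only a sketch: the item names are hypothetical, and "X is NT of Y" is read as "the items of X are a subset of the items of Y".

```python
def is_nt_of(narrower, broader):
    """Set reading of NT: every item of `narrower` is also in `broader`."""
    return narrower <= broader


# Hypothetical instance sets.
B = {"b1", "b2", "shared"}
C = {"c1", "shared"}
common_broader = B | C | {"other"}   # some term broader than both B and C
common_narrower = {"shared"}         # some term narrower than both B and C

b_or_c = B | C    # OR compound
b_and_c = B & C   # AND compound

# OR compound: BT of B and C, NT of their common broader terms.
print(is_nt_of(B, b_or_c), is_nt_of(C, b_or_c), is_nt_of(b_or_c, common_broader))
# AND compound: NT of B and C, BT of their common narrower terms.
print(is_nt_of(b_and_c, B), is_nt_of(b_and_c, C), is_nt_of(common_narrower, b_and_c))
```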


Thesaurus Mapping: Approximation by Inclusion

[Diagram. Labels: A, B, C, BT, Broader equivalence, Narrower equivalences]
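When no exact equivalent exists, the two approximations trade precision against recall in opposite directions. The sketch below uses hypothetical item sets to show this: a broader equivalence contains A, so nothing is missed but extra items come back; narrower equivalences lie inside A, so everything returned is relevant but some items are missed.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical item identifiers indexed under each concept.
A = {1, 2, 3, 4}                            # source concept, no exact target
broader_target = {1, 2, 3, 4, 5, 6, 7, 8}   # broader equivalence: contains A
narrower_targets = [{1, 2}, {3}]            # narrower equivalences: inside A

via_broader = precision_recall(broader_target, A)
via_narrower = precision_recall(set().union(*narrower_targets), A)
print(via_broader)    # (0.5, 1.0)
print(via_narrower)   # (1.0, 0.75)
```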


Thesaurus Mapping: Avoid redundant linking!

[Diagram. Labels: A, B, BT, Broader equivalence, Narrower equivalences, Exact equivalence]
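The diagram is not recoverable here, so the following is only one plausible reading of the advice: if A has an exact equivalent B, then broader equivalences to ancestors of B and narrower equivalences to descendants of B already follow from B's own BT hierarchy and need not be stored. The link representation, the `broader` table and the term names are all assumptions made for illustration.

```python
def ancestors(term, broader):
    """All terms reachable by following BT links upward from `term`."""
    seen = set()
    stack = list(broader.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, ()))
    return seen


def prune_redundant(links, broader):
    """Drop mapping links that are implied by an exact equivalence."""
    exact = [(s, t) for s, rel, t in links if rel == "exact"]
    kept = []
    for source, rel, target in links:
        implied = False
        for exact_source, exact_target in exact:
            if exact_source != source:
                continue
            if rel == "broader" and target in ancestors(exact_target, broader):
                implied = True
            if rel == "narrower" and exact_target in ancestors(target, broader):
                implied = True
        if not implied:
            kept.append((source, rel, target))
    return kept


# Hypothetical BT hierarchy of the target thesaurus: term -> broader terms.
broader = {"B": ["B_parent"], "B_child1": ["B"], "B_child2": ["B"]}
links = [
    ("A", "exact", "B"),
    ("A", "broader", "B_parent"),
    ("A", "narrower", "B_child1"),
    ("A", "narrower", "B_child2"),
]
pruned = prune_redundant(links, broader)
print(pruned)
```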


Thesaurus Mapping: Problems of Mapping

Consistency and reasoning (Description Logics!)

Optimal substitution of combined query terms

Protocol to propagate recall/precision control

Inverse reading of one-to-many links

Postcoordination: unclear semantics!
e.g. “grinding & factories”, solution by DL?
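For the inverse reading of one-to-many links, at least the simple cases follow from set semantics. The sketch below is an assumption about what can safely be derived, not a statement of the author's solution. If A is exactly equivalent to "B OR C", each target is contained in A, so from the target's side A is a broader equivalent. If A is exactly equivalent to "B AND C", A is contained in each target, so A is a narrower equivalent. The triple format (from, relation, to) is hypothetical.

```python
def inverse_links(source, op, targets):
    """Read a one-to-many exact equivalence from the side of each target.

    Returns triples (target, relation, source), meaning that `source` is a
    broader or narrower equivalent of `target`.
    """
    relation = {"OR": "broader", "AND": "narrower"}[op]
    return [(target, relation, source) for target in targets]


from_or = inverse_links("A", "OR", ["B", "C"])
from_and = inverse_links("A", "AND", ["B", "C"])
print(from_or)
print(from_and)
```

In both cases the exactness is lost in the inverse direction, which is part of why these links are harder to use backwards.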


Thesaurus Mapping: Production of Mappings

Human assessment needs (see Term-IT):
- CSCW, work flow, decentralised management tools
- Excellent comparative presentation of thesaurus contents

Language engineering (see Term-IT):
- termhood recognition, automatic translation by parallel texts, filtering by occurrence in target indexing language
- Excellent for preprocessing!

Analysis of use:
- Cluster analysis with doubly indexed entries
- Libraries: problem to identify the same “work”!
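Analysis of doubly indexed entries can be illustrated with a simple co-occurrence score. This sketch is not a full cluster analysis: it swaps in the Jaccard coefficient between the item sets of a term from thesaurus A and a term from thesaurus B, and proposes pairs above a threshold as candidates for human assessment. The record format, the threshold and the term names are assumptions.

```python
from collections import defaultdict
from itertools import product


def candidate_mappings(records, threshold=0.5):
    """Score term pairs by overlap of the items they index.

    `records` is a list of (terms_from_A, terms_from_B) pairs, one per item
    that was indexed with both thesauri.
    """
    items_a = defaultdict(set)
    items_b = defaultdict(set)
    for item_id, (terms_a, terms_b) in enumerate(records):
        for term in terms_a:
            items_a[term].add(item_id)
        for term in terms_b:
            items_b[term].add(item_id)

    scores = {}
    for a, b in product(items_a, items_b):
        both = len(items_a[a] & items_b[b])
        either = len(items_a[a] | items_b[b])
        score = both / either       # Jaccard coefficient
        if score >= threshold:
            scores[(a, b)] = score
    return scores


# Hypothetical doubly indexed items.
records = [
    ({"A1"}, {"B1"}),
    ({"A1"}, {"B1"}),
    ({"A1", "A2"}, {"B2"}),
    ({"A2"}, {"B2"}),
]
candidates = candidate_mappings(records)
print(candidates)
```

The approach depends on knowing that two records describe the same item, which is the identification problem noted for libraries.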


SIS - Thesaurus Management System: Co-operative linking

[Diagram. Labels: BT, Version 0, Version 1, Version 2, New Workspace, Group 1, Group 2, obsolete term, links of group 1, links of group 2]


Thesaurus Mapping: Users Environment

[Diagram. Labels: User’s Authorities, Target Authorities, CMS Collections, Distributed Retrieval, Local Term, Agreed-on Term, old version, specialized, foreign language, “??”]


Thesaurus Mapping: Three-level Architecture

[Diagram. Labels: Search Aid Tool, End User, Cascaded mapping service, National Authority Providers, Local TMS (twice), CMS and CMS Maintainer (twice), concept proposal, Thesaurus initialization, Update term use]


Thesaurus Mapping: Architectural Considerations

We propose to distinguish:
- Collection Management Systems with local term management
- National authority providers
- Mapping service

Mapping service:
- Co-operative mapping production environment and system, for few languages (3?), domain specific?
- Large scale mapping tables detached from production system, accessible as replicated Web resource

Integration:
- Access engines connect to mapping resources on demand
- Provision of suitable metadata for CMS capabilities


Thesaurus Mapping: Conclusions

Thesaurus mapping is feasible and the best means to coherently access multiple CMS with controlled vocabulary

Thesaurus mapping is a major investment in human resources and IT environment

Targeted research can much improve the currently feasible
- quality of mapping
- quality of service
- and production cost