Preserving meanings in (multilingual) text mining for Cultural Heritage Michel Généreux David B. Arnold University of Brighton, UK


Preserving meanings in (multilingual) text mining for

Cultural Heritage

Michel Généreux David B. Arnold

University of Brighton, UK


Overview

Date 24.10.06 ICS-FORTH, Greece Slide 2

•The extraction tool for English:

•Natural Language & Semantic Processing

•Experiment 1: triple extraction from the CIDOC-CRM documentation

•Experiment 2: triple extraction from free text

•Extension to other languages (French and German)

•Use of the tool in EPOCH

•(Cultural Heritage corpus creation)

•Discussion


Motivation for the extraction tool


• Extract and facilitate the mapping of (unstructured) textual information to CIDOC-CRM

• Assist more generic mapping tools (AMA)

• Develop a first prototype able to extract triples from English texts

• Evaluate the tool, using data from the CIDOC-CRM documentation


On the propositional nature of CIDOC-CRM

“The domain class is analogous to the grammatical subject of the phrase for which the property is analogous to the verb. Property names in the CRM are designed to be semantically meaningful and grammatically correct when read from domain to range. In addition, the inverse property name, normally given in parentheses, is also designed to be semantically meaningful and grammatically correct when read from range to domain.”

Definition of the CIDOC Conceptual Reference Model, June 2005, page vi


Assisting generic tools: EPOCH AMA

[Diagram: Mapping Tools sit between a de facto standard model (labels: ER CDM, XML Schema) and the CIDOC model (labels: CIDOC XML Schema, XML CIDOC Compliant Schema).]

The idea is to represent in the simplest way the CIDOC model as well as the original model and to let the domain expert insert his/her knowledge in the mapping.


Text mining for CH

The creation of a CIDOC-CRM compatible database for CH is partially automated by extracting triples after analyzing free texts in natural language.

TEXT → Text analysis → Triple extraction → Database creation → CIDOC-CRM Database

CRM ~ semantic representation


The problem

We have a painting of John Constable entitled “A country lane” and identified by “203-1888”. It is a watercolour which is 21.3 cm high and 18.3 cm wide. It was painted in England.

• The capability of the CRM to describe so many formats with so few properties is due to the fact that most actions and events are not encoded as properties, but as paths with the event as node in the middle. So "Van Gogh painted this .." translates to two triples with a Production in the centre. I.e. hundreds of action verbs have to be recognized and mapped. If this is not understood no useful matching can be done (anonymous reviewer)
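The reviewer's point can be illustrated with a minimal sketch that expands an action verb into a two-triple path around an event node. The class E12 Production and the properties P14 "carried out by" and P108 "has produced" come from the CIDOC-CRM definition; the `Triple` type, the tiny verb lexicon, the event identifier and the function names are hypothetical and only serve the illustration.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    prop: str
    obj: str


# Hypothetical lexicon: verb lemmas that should map to an E12 Production event.
PRODUCTION_VERBS = {"paint", "draw", "sculpt", "build"}


def production_path(actor: str, artefact: str, event_id: str = "E12_Production_1") -> List[Triple]:
    """Expand 'actor made artefact' into two triples that share an event node."""
    return [
        Triple(event_id, "P14 carried out by", actor),
        Triple(event_id, "P108 has produced", artefact),
    ]


def map_action(subject: str, verb_lemma: str, obj: str) -> List[Triple]:
    """Return the event-centred path for a known verb, or nothing if the verb is unmapped."""
    if verb_lemma in PRODUCTION_VERBS:
        return production_path(subject, obj)
    return []


if __name__ == "__main__":
    for triple in map_action("Van Gogh", "paint", "this painting"):
        print(triple)
```

A verb that is missing from the lexicon yields no triples at all, which is why the reviewer stresses that hundreds of action verbs have to be recognized and mapped.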


Some terms

• semantic association: measure of co-occurrence of terms
• lemma: base form of a word
• POS: part of speech
• clause: a coherent whole of POSs
• hypernym: a word more generic than a given word
• chunking: breaking sentences into clauses
• WordNet: structured lexical database


POS tagging and Phrase chunking


Getting Semantics through Syntax

“CIDOC-CRM is great”

Grammar rules:
Sentence → NP VP
VP → V AdjP

Semantics attached to each node of the parse tree:
NP: CIDOC-CRM
V: be(x,y)
AdjP: great
VP: be(x,great)
Sentence: be(CIDOC-CRM,great)
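The bottom-up composition shown on this slide can be sketched in a few lines. This is a toy illustration for the single three-word sentence pattern NP V AdjP, not the tool's parser; the `LEXICON` table and the `interpret` function are hypothetical names.

```python
# Toy lexicon: each word has a syntactic category and a meaning.
# The verb meaning is a two-place predicate be(x, y) with open slots.
LEXICON = {
    "CIDOC-CRM": ("NP", "CIDOC-CRM"),
    "is": ("V", lambda x, y: f"be({x},{y})"),
    "great": ("AdjP", "great"),
}


def interpret(sentence: str) -> str:
    """Compose the meaning of an 'NP V AdjP' sentence bottom-up."""
    words = sentence.split()
    if len(words) != 3:
        raise ValueError("this sketch only handles 'NP V AdjP' sentences")
    (np_cat, np_sem), (v_cat, v_sem), (adj_cat, adj_sem) = (LEXICON[w] for w in words)
    if (np_cat, v_cat, adj_cat) != ("NP", "V", "AdjP"):
        raise ValueError("sentence does not match Sentence -> NP VP, VP -> V AdjP")

    # VP -> V AdjP: fill the second slot, giving be(x, great).
    def vp_sem(x: str) -> str:
        return v_sem(x, adj_sem)

    # Sentence -> NP VP: fill the remaining slot with the subject.
    return vp_sem(np_sem)


if __name__ == "__main__":
    print(interpret("CIDOC-CRM is great"))
```

Each grammar rule is paired with one composition step, so the sentence meaning is built from the meanings of its parts.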


Getting Semantics directly through a Semantic Grammar

“CIDOC-CRM is great”

Grammar rules:
Assertion → Brand Action_Info
Action_Info → Action Info

Semantics attached to each node of the parse tree:
Brand: CIDOC-CRM
Action: be(x,y)
Info: great
Action_Info: be(x,great)
Assertion: be(CIDOC-CRM,great)


Getting Semantics from keywords and pattern-matching

<assertion> → <CIDOC-CRM> “is” <info>

<opinion> → “I think” <CIDOC-CRM> “is” <info>

Other noisy input is simply skipped.
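A minimal sketch of keyword and pattern matching in this spirit, using regular expressions. The two patterns mirror the assertion and opinion templates on the slide; the single-word `info` slot, the pattern order and the function name `match_pattern` are assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# More specific patterns come first, so that "I think ..." is labelled as an
# opinion rather than as a plain assertion.
PATTERNS = [
    ("opinion", re.compile(r"\bI think (CIDOC-CRM) is (\w+)")),
    ("assertion", re.compile(r"\b(CIDOC-CRM) is (\w+)")),
]


def match_pattern(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (label, topic, info) for the first pattern found anywhere in the text."""
    for label, pattern in PATTERNS:
        found = pattern.search(text)
        if found:
            return label, found.group(1), found.group(2)
    return None  # nothing matched: the input is skipped


if __name__ == "__main__":
    print(match_pattern("Well, you know, I think CIDOC-CRM is great, honestly"))
    print(match_pattern("CIDOC-CRM is great"))
    print(match_pattern("Nothing relevant here"))
```

Because `search` scans the whole string, surrounding words that do not belong to a pattern are ignored rather than causing a failure.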

What about words not in WordNet (proper nouns)? Semantic Orientation


SO-A in practice

We compute Association using PMI:

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}$$

which is positive when the words co-occur more often than chance and negative otherwise.

We compute PMI in IR using hit counts over a large corpus:

N is the total number of documents in the corpus, smoothing makes PMI-IR 0 for words not in the corpus, and NEAR means a distance of at most 20 words (Turney and Littman, 2003).
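The PMI-IR formula itself did not survive in this transcript. The sketch below shows one way to estimate PMI from hit counts so that the stated property holds (the score is 0 when none of the words occurs in the corpus): add 1 to each individual hit count and 1/N to the NEAR count. This particular smoothing is an assumption, as are the function and parameter names.

```python
import math


def pmi_ir(hits_w1: int, hits_w2: int, hits_near: int, n_docs: int) -> float:
    """Estimate PMI from document hit counts with a simple smoothing (assumed).

    Equivalent to log2( N * (hits_near + 1/N) / ((hits_w1 + 1) * (hits_w2 + 1)) ),
    written as (N * hits_near + 1) / (...) to avoid rounding in the numerator.
    """
    numerator = n_docs * hits_near + 1
    denominator = (hits_w1 + 1) * (hits_w2 + 1)
    return math.log2(numerator / denominator)


if __name__ == "__main__":
    n = 10_000
    print(pmi_ir(100, 100, 50, n))  # words that often appear near each other
    print(pmi_ir(100, 100, 0, n))   # words that never appear near each other
    print(pmi_ir(0, 0, 0, n))       # words absent from the corpus
```

With these hypothetical counts the first score is positive, the second negative, and the third exactly zero, matching the behaviour described on the slide.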


Approach

1. Text cleaning
2. Tokenization and POS tagging
3. Clause chunking and pruning
4. NC regrouping
5. Intermediate triple (IT) creation
6. Referent resolution
7. Final triple (FT) creation


Experiments (1)


•144 sentences

•184 final triples

•at least one final triple for 46 sentences


Experiments (2)

Lange Herzogstrasse is Wolfenbüttel's main shopping area. The street's particular charm lies in its broad-face half-timbered buildings, historic merchant's houses; their central gables still retain the distinctive hatches through which goods could be hoisted up to the attics for storage.

•A text of 3922 words and 173 sentences.

•197 intermediate triples and 79 final triples extracted.


A shallower approach for English

We have a painting of John Constable entitled “A country lane” and identified by “203-1888”. It is a watercolour which is 21.3 cm high and 18.3 cm wide. It was painted in England.


We PP we

have VHP have

a DT a

painting NN painting

of IN of

John NP John

Constable NP Constable

entitled VVD entitle

" `` "

A DT a

country NN country

lane NN lane

" '' "

and CC and

identified VVD identify

by IN by

" `` "

203-1888 JJ @card@

" '' "

. SENT .

It PP it

is VBZ be

a DT a

watercolour NN <unknown>

which WDT which

is VBZ be

21.3 CD @card@

centimetres NNS centimetre

high JJ high

and CC and

18.3 CD @card@

centimetres NNS centimetre

wide JJ wide

. SENT .

It PP it

was VBD be

painted VVN paint

in IN in

England NP England

. SENT .


Shallow extraction (English)
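A minimal sketch of how tagged output in the three-column form shown above (token, POS tag, lemma) might be read and mined for simple facts. The format resembles TreeTagger output, although the slides do not name the tagger; the `Token` type, the helper names and the shortened sample are hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagged(text: str) -> List[Token]:
    """Read one 'word POS lemma' entry per non-empty line."""
    tokens = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            tokens.append(Token(*parts))
        elif len(parts) == 2:  # lemma column missing
            tokens.append(Token(parts[0], parts[1], "<unknown>"))
    return tokens


def proper_names(tokens: List[Token]) -> List[str]:
    """Group consecutive tokens tagged NP (proper noun) into names."""
    names, current = [], []
    for token in tokens:
        if token.pos == "NP":
            current.append(token.word)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


def verb_lemmas(tokens: List[Token]) -> List[str]:
    """Collect the lemmas of all tokens whose tag starts with V."""
    return [token.lemma for token in tokens if token.pos.startswith("V")]


SAMPLE = """\
We PP we
have VHP have
a DT a
painting NN painting
of IN of
John NP John
Constable NP Constable
It PP it
was VBD be
painted VVN paint
in IN in
England NP England
"""

if __name__ == "__main__":
    tagged = parse_tagged(SAMPLE)
    print(proper_names(tagged))
    print(verb_lemmas(tagged))
```

Working from tags and lemmas alone, without clause chunking, is what makes this approach shallow: it only needs a tagger for the target language.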


French

Nous avons une toile de John Constable intitulée «Une route de campagne» et identifiée par «203-1888». C'est une aquarelle qui a une largeur de 21.3 centimètres et une hauteur de 18.3 centimètres. Elle fût peinte en Angleterre.


Nous PRO:PER nous

avons VER:pres avoir

une DET:ART un

toile NOM toile

de PRP de

John NAM John

Constable NOM constable

intitulée ADJ <unknown>

" PUN:cit "

Une DET:ART un

route NOM route

de PRP de

campagne NOM campagne

" PUN:cit "

et KON et

identifiée VER:subp <unknown>

par PRP par

" PUN:cit "

203-1888 ABR @card@

" PUN:cit "

. SENT .

C' PRO:DEM ce

est VER:pres être

une DET:ART un

aquarelle NOM aquarelle

qui PRO:REL qui

a VER:pres avoir

une DET:ART un

largeur NOM largeur

de PRP de

21.3 NUM @card@

centimètres NOM <unknown>

et KON et

une DET:ART un

hauteur NOM hauteur

de PRP de

18.3 NUM @card@

centimètres NOM <unknown>

. SENT .

Elle PRO:PER elle

fût VER:futu <unknown>

peinte VER:pper peindre

en PRP en

Angleterre NAM Angleterre

. SENT .


Shallow extraction (French)


German

Wir haben einen Anstrich des John Constable erlauben „Einen Landweg” und durch „203-1888” gekennzeichnet werden. Es ist ein Aquarell, das 21.3 Zentimeter hoch und 18.3 Zentimeter breit ist. Es würde in England gemält.


Wir PPER wir

haben VAFIN haben

einen ART ein

Anstrich NN Anstrich

des ART d

John NE John

Constable NE <unknown>

erlauben VVFIN erlauben

" $( "

Einen NN Eine

Landweg NN Landweg

" $( "

und KON und

durch APPR durch

" $( "

203-1888 CARD @card@

" $( "

gekennzeichnet VVPP kennzeichnen

werden VAINF werden

. $. .

Es PPER es

ist VAFIN sein

ein ART ein

Aquarell NN Aquarell

, $, ,

das PRELS d

21.3 CARD @card@

Zentimeter NN Zentimeter

hoch ADJD hoch

und KON und

18.3 CARD @card@

Zentimeter NN Zentimeter

breit ADJD breit

ist VAFIN sein

. $. .

Es PPER es

würde VVFIN <unknown>

in APPR in

England NE England

gemält VVFIN <unknown>

. $. .


Shallow extraction (German)


Application: EPOCH CHARACTERISE

Question Answering for CH

User

Natural Language Interaction

NLP / NLG

Through Natural Language Processing (NLP) and Generation (NLG), users can interact and query databases in CIDOC-CRM. Resources are available for a wide range of languages (multilingualism). By combining the mining and interactive tools, language technology automates the structuring and querying of heterogeneous and semi-structured information within the framework of CIDOC-CRM.

CIDOC-CRM

Database


Resources (1)


• Towards a Semantic Web for Heritage Resources: http://www.digicult.info/downloads/html/1054648757/1054648757.html
• Art & Architecture Thesaurus Online: http://www.getty.edu/research/conducting_research/vocabularies/aat/about.html
• National Monuments Record Thesauri: http://thesaurus.english-heritage.org.uk/
• Networked Knowledge Organization Systems/Services: http://nkos.slis.kent.edu/
• AGROVOC Multilingual Thesaurus: http://www.fao.org/aims/ag_intro.htm
• Controlled vocabulary for the applied life sciences: http://www.cabi-publishing.org/DatabaseSearchTools.asp?PID=277
• National Library of Medicine: http://www.nlm.nih.gov/mesh/
• MDA Archaeological Objects Thesaurus: http://www.mda.org.uk/archobj/archcon.htm
• EuroWordnet: http://nipadio.lsi.upc.es/wei.html


Resources (2)


• Multimatch: http://www.multimatch.org/contact.html
• HEREIN: http://www.european-heritage.net/sdx/herein/thesaurus/introduction.xsp
• MUSEUM: http://www.science.uva.nl/~kamps/museum/
• IKEM: http://www.ikem.be/
• Cultural Heritage: http://www.culturalheritage.net/
• Michael: http://www.michael-culture.org/project.html
• EUROVOC: http://europa.eu/eurovoc/
• AMA tools: http://www.epoch-net.org/index.php?option=com_content&task=view&id=74&Itemid=120
• MultiWordNet: http://multiwordnet.itc.it/english/home.php


Resources (3): Corpus and multiword term extraction

[Diagram labels: Abstract, Seeds, Triplets, URLs, Terms, Corpus, Corpus]

First seeds and second seeds, in the order they appear on the slide:

archeology, art, historic, landscape, monument, museum, conservative, aic, amerisuites, arboretum, archaeology, archeological, archeology, archive, artifacts, blm, browse, cemeteries, conservation, copyright, crm, database, email, fax, fpi, hrs, iii, internet, interns, internship, metropark, minh, mini, minigrants, monuments, museums, ncptt, nlcs, nps, nsu, online, overview, parks, projects, ptt, services, website
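The slide only names the stages (Seeds, Triplets, URLs, Corpus, Terms). As a minimal sketch of the first step, assuming the triplets are random three-word combinations of seeds that are then sent to a search engine as queries, the code below builds such queries. The seed subset, the function names and the query format are hypothetical, and no search engine is called.

```python
import random
from itertools import combinations
from typing import List, Sequence, Tuple

# A few seed words taken from the slide, used here purely as example input.
SEEDS = ["archeology", "art", "historic", "landscape", "monument", "museum"]


def seed_triplets(seeds: Sequence[str], k: int, rng: random.Random) -> List[Tuple[str, ...]]:
    """Draw up to k distinct three-word combinations of the seeds."""
    all_triplets = list(combinations(sorted(set(seeds)), 3))
    return rng.sample(all_triplets, min(k, len(all_triplets)))


def to_query(triplet: Tuple[str, ...]) -> str:
    """Turn a triplet into a plain keyword query string."""
    return " ".join(triplet)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    for triplet in seed_triplets(SEEDS, 5, rng):
        print(to_query(triplet))
```

The pages returned for such queries would form the corpus from which new terms, and possibly a second round of seeds, are extracted.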


Archaeological Sites

Image for Archeology

Mailing lists and discussion

North Carolina

Back Issues

archaeological research

Financial Corp

Summer Institute

NCPTT Partners

Ministry of Culture

via e-mail

University Art

Hire Date

Preschooler Programs

Next page

de los

plot keywords

natural resources

Quick Links

Sculpture Conservation Studio

Museum Shop

Dot Net Nuke

Cast overview

megalithic tombs

Archaeological Institute of America

landscape painting

Sound Mix

Mc Culley

Department of Animal

Historical Center

historic sites

Find surnames

Previous page

Golf Course

Genealogy CDs

Children's and Juvenile Books

list archives

Historic Materials

Nine Inch Nails

favorite player

Dennis Montagna

Disaster Recovery by the Book

Preservation Officers

historic properties

U Minh

non-profit organization

Web Notify

Internet Service

Southwest Parks

Arts Council

archeology nationwide

download protocol

Science Center

web site

site provides

African Studies

Mini Grants

Travel Guides

programming language

Landscape Restoration

click the Order button

Ancient Near

Grant Programs

Personal Profile

Development Center

Break w

Remembering the Way

Kuala Lumpur

Guns N Roses

National Monuments

materials conservation

Hidden Treasures

para o

Command History

Architectural History

writing skills

x cm

Preservation Pioneer

Powered by LISTSERV

Remembering Nelson Hall

Please send

Financial Group

Local History

Art Center

Grant Recipients

Site Map

Grand Staircase-Escalante National

Blue Book

Can Tho

Resource guides

Park Net Accessibility FOIA Privacy

Fine art

Contact Lorine

South Carolina

Up the Heat on Research

Road Safety

web pages

22AM PDT

Total Transfers

Common Ground

100 random multiword terms out of 1567


Discussion

•Benchmark for evaluation: CIDOC-CRM human-annotated texts

•Overgeneration: properties expressed with ‘be’ and ‘have’

•How do we combine triples to form paths? (see *)

•Given modest results for English, how do we extend the previous method to other languages?


*The capability of the CRM to describe so many formats with so few properties is due to the fact that most actions and events are not encoded as properties, but as paths with the event as node in the middle. So "Van Gogh painted this .." translates to two triples with a Production in the centre. I.e. hundreds of action verbs have to be recognized and mapped. If this is not understood no useful matching can be done (anonymous reviewer)


Thank you!

Questions ?
