Language Technology and the Semantic Web Thierry Declerck & Paul Buitelaar (Saarland University & DFKI GmbH)


We present collaborative research on combining language technology (LT) with technologies for encoding (domain) knowledge in ontologies, in support of the emerging Semantic Web (SW), or perhaps more appropriately: Semantic Webs. The work was carried out in the following projects:

MUMIS (multimedia content indexing and searching in the soccer domain; finished in December 2002)

MuchMore (cross-lingual information retrieval in the medical domain; finished in May 2003)

Esperonto (developing a Semantic Annotation Service for upgrading the current Web to the Semantic Web; Sept. 2002 - May 2005)


Semantic Web Applications of LT

Supporting accurate ontology-based semantic annotation of multilingual web documents (Knowledge Markup)

Supporting Ontology Learning/Construction from linguistically/semantically annotated multilingual text (Knowledge Extraction)

See also the Special Interest Group (SIG-5) OntoWeb-lt on Language Technology in Ontology Development and Use: http://ontoweb-lt.dfki.de


Knowledge Markup and Knowledge Extraction

[Diagram. Labels: Text/Speech; Text/Speech Mining; Concepts, Relations, Events; Linguistic and Semantic Annotations]

Linguistic Analysis: Morpho-Syntactic Analysis and Tagging, Semantic Class Tagging, Term/NE Recognition, Grammatical Function Tagging, Dependency Structure Analysis


Knowledge Markup and Knowledge Extraction (2)

[Diagram. Labels: Text/Speech/Image-Video; Text/Speech/Media Mining; Concepts, Relations, Events; Linguistic and Media Analysis; Linguistic, Low-level Image and Semantic Annotations]


Integration of Language Technology and Domain Knowledge


Linguistic Analysis

Language technology tools are needed to support the upgrade of the current web to the Semantic Web (SW) by providing an automatic analysis of the linguistic structure of textual documents. Free-text documents that undergo linguistic analysis become available as semi-structured documents, from which meaningful units can be extracted automatically (information extraction) and organized through clustering or classification (text mining). Here we focus on the linguistic analysis steps that underlie the extraction tasks: morphological analysis, part-of-speech tagging, chunking, dependency structure analysis, and semantic tagging.


Morphological Analysis

Morphological analysis is concerned with the inflectional, derivational, and compounding processes in word formation in order to determine properties such as stem and inflectional information. Together with part-of-speech (PoS) information this process delivers the morpho-syntactic properties of a word.

When processing the German word Häusern (houses), the analysis should deliver the following morphological information:

[PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]
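The analysis above can be illustrated with a minimal sketch. It assumes a tiny hand-written full-form lexicon (the `FULL_FORMS` table and the function names are hypothetical); a real morphological analyser would derive the analyses from stems and inflection rules rather than list every word form.

```python
from typing import Dict, List

Analysis = Dict[str, str]

# Hypothetical full-form lexicon: each word form maps to its possible analyses.
FULL_FORMS: Dict[str, List[Analysis]] = {
    "Häusern": [
        {"PoS": "N", "NUM": "PL", "CASE": "DAT", "GEN": "NEUT", "STEM": "HAUS"},
    ],
    "Haus": [
        {"PoS": "N", "NUM": "SG", "CASE": "NOM", "GEN": "NEUT", "STEM": "HAUS"},
        {"PoS": "N", "NUM": "SG", "CASE": "ACC", "GEN": "NEUT", "STEM": "HAUS"},
        {"PoS": "N", "NUM": "SG", "CASE": "DAT", "GEN": "NEUT", "STEM": "HAUS"},
    ],
}


def analyse(word: str) -> List[Analysis]:
    """Return all morphological analyses listed for a word form (empty if unknown)."""
    return FULL_FORMS.get(word, [])


def format_analysis(analysis: Analysis) -> str:
    """Render an analysis in the bracketed feature notation used on the slide."""
    return "[" + " ".join(f"{key}={value}" for key, value in analysis.items()) + "]"


if __name__ == "__main__":
    for a in analyse("Häusern"):
        print(format_analysis(a))
```

Note that an ambiguous form such as "Haus" yields several analyses; choosing among them is left to later, context-aware steps.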


Part-of-Speech Tagging

Part-of-Speech (PoS) tagging is the process of determining the correct syntactic class (a part of speech, e.g. noun or verb) for a particular word given its current context. The word “works” can be either a verb or a noun; in the following sentences it is a verb in the first and a noun in the second:

He works [N,V] the whole day for nothing.
His works [N,V] have all been sold abroad.

PoS tagging involves disambiguating between multiple candidate part-of-speech tags, as well as guessing the correct tag for unknown words on the basis of context information.
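A minimal sketch of contextual disambiguation for the “works” example. The lexicon, the tag names and the two context rules are illustrative assumptions, not the tagger used in the projects; practical taggers learn such preferences from annotated corpora.

```python
from typing import Dict, List, Set

# Hypothetical lexicon of candidate tags per word form.
LEXICON: Dict[str, Set[str]] = {
    "he": {"PRON"},
    "his": {"POSS"},
    "the": {"DET"},
    "have": {"V"},
    "works": {"N", "V"},
}


def tag(tokens: List[str]) -> List[str]:
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tags: List[str] = []
    for i, token in enumerate(tokens):
        candidates = LEXICON.get(token.lower(), {"N"})  # crude guess for unknown words
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
        elif i > 0 and tags[-1] == "PRON" and "V" in candidates:
            tags.append("V")  # personal pronoun + ambiguous word: prefer verb
        elif i > 0 and tags[-1] in {"POSS", "DET"} and "N" in candidates:
            tags.append("N")  # possessive/determiner + ambiguous word: prefer noun
        else:
            tags.append(sorted(candidates)[0])  # no rule applies: arbitrary fallback
    return tags


if __name__ == "__main__":
    print(tag("He works the whole day".split()))
    print(tag("His works have all been sold".split()))
```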


Chunking

Following Abney, chunks are the non-recursive parts of core phrases, such as nominal, prepositional, adjectival and adverbial phrases and verb groups.

Chunk parsing is an important step towards making natural language processing robust: its goal is not to deliver a full analysis of sentences, but to extract just the linguistic fragments that can be identified reliably. Even if this strategy fails to produce an analysis for the whole sentence, the partial linguistic information gained is still useful for many applications, such as information extraction and text mining.
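The idea of extracting only reliably identifiable fragments can be sketched as a pattern over PoS tags. The tag set and the noun-phrase pattern (optional determiner, any adjectives, one or more nouns) are simplifying assumptions for illustration only.

```python
import re
from typing import List, Tuple

# One-letter codes so that a regular expression can run over the tag sequence.
TAG_CODE = {"DET": "D", "ADJ": "A", "N": "N"}
NP_PATTERN = re.compile(r"D?A*N+")  # non-recursive noun chunk


def chunk_nps(tagged: List[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Return (first, last) token indices of each noun chunk found."""
    codes = "".join(TAG_CODE.get(tag, "x") for _, tag in tagged)
    return [(m.start(), m.end() - 1) for m in NP_PATTERN.finditer(codes)]


if __name__ == "__main__":
    sentence = [
        ("The", "DET"), ("shot", "N"), ("by", "PREP"), ("Christian", "N"),
        ("Ziege", "N"), ("goes", "V"), ("over", "PREP"), ("the", "DET"), ("goal", "N"),
    ]
    for first, last in chunk_nps(sentence):
        print(" ".join(word for word, _ in sentence[first:last + 1]))
```

Tokens that match no pattern are simply left unchunked, which is what keeps the approach robust.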


Named Entity Detection

Related to chunking is the recognition of so-called named entities (names of institutions and companies, date expressions, etc.). The extraction of named entities is mostly based on a strategy that combines lookup in gazetteers (lists of companies, cities, etc.) with regular expression patterns. Named entity recognition can be included as part of the linguistic chunking procedure. The sentence fragment

“…the secretary-general of the United Nations, Kofi Annan,…”

will then be annotated as a nominal phrase that includes two named entities: United Nations with named entity class organization, and Kofi Annan with named entity class person.
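A minimal sketch of the combined strategy: gazetteer lookup for names plus a regular expression for date expressions. The gazetteer content, the date pattern and the function name are assumptions made for this example.

```python
import re
from typing import List, Tuple

# Hypothetical gazetteer: known names mapped to their named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_PATTERN = re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")


def find_entities(text: str) -> List[Tuple[int, str, str]]:
    """Return (start offset, matched string, class) triples, sorted by offset."""
    found = []
    for name, ne_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), ne_class))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return sorted(found)


if __name__ == "__main__":
    sample = "the secretary-general of the United Nations, Kofi Annan, spoke on 12 May 2003"
    for _, string, ne_class in find_entities(sample):
        print(f"{string}: {ne_class}")
```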


Dependency Structure Analysis

A dependency structure consists of two or more linguistic units linked by immediate dominance in a syntax tree. The detection of such structures is generally not provided by chunking but builds on top of it. Two main types of dependencies are relevant for our purposes: on the one hand, the internal dependency structure of phrasal units or chunks, and on the other hand, the so-called grammatical functions (like subject and direct object).


Internal Dependency Structure


To describe the internal structure of a phrase (chunk), linguistic analysis uses the terms head, complements and modifiers: the head is the dominating node in the syntax tree of the phrase, complements are necessary qualifiers of the head, and modifiers are optional qualifiers. Consider the following example:

“The shot by Christian Ziege goes over the goal.”

The prepositional phrase “by Christian Ziege” (containing the named entity Christian Ziege) depends on (and modifies) the head noun “shot”.


Grammatical Functions

Grammatical functions determine the role (function) of each linguistic chunk in the sentence and make it possible to identify the actors involved in certain events. For example, in the following sentence the syntactic (and also the semantic) subject is the NP constituent “The shot by Christian Ziege”:

“The shot by Christian Ziege goes over the goal.”

This nominal phrase depends on (and complements) the verb “goes”, whereas the noun “shot” is the head of the NP (it is the shot that goes over the goal, not Christian Ziege!).
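The two kinds of dependency can be captured in one small data structure. The `Chunk` class and its field names are hypothetical and serve only to illustrate how the head of the subject, rather than the embedded named entity, is retrieved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    type: str                                   # e.g. "NP", "PP", "VG"
    head: str                                   # head word of the chunk
    modifiers: List["Chunk"] = field(default_factory=list)    # optional qualifiers
    complements: List["Chunk"] = field(default_factory=list)  # necessary qualifiers
    function: Optional[str] = None              # grammatical function, e.g. "SUBJ"


# "The shot by Christian Ziege goes ..."
pp_by = Chunk("PP", "by", complements=[Chunk("NP", "Christian Ziege")])
subject_np = Chunk("NP", "shot", modifiers=[pp_by], function="SUBJ")
verb_group = Chunk("VG", "goes", complements=[subject_np])


def subject_head(verb: Chunk) -> Optional[str]:
    """Return the head word of the chunk that functions as subject of the verb."""
    for complement in verb.complements:
        if complement.function == "SUBJ":
            return complement.head
    return None


if __name__ == "__main__":
    print(subject_head(verb_group))
```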


Semantic Tagging

Automatic semantic annotation has developed within language technology in recent years in connection with more integrated tasks like information extraction, which require a certain level of semantic analysis. Semantic tagging annotates each content word in a document with a semantic category. Semantic categories are assigned on the basis of a semantic resource such as WordNet for English, or EuroWordNet, which links words across many European languages through a common inter-lingua of concepts.


Semantic Resources

Semantic resources are captured in dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine. They can be roughly divided into the following three groups:

Thesauri: Semantic resources that group together similar words or terms according to a standard set of relations, including broader term, narrower term, sibling, etc. (like Roget)

Semantic Lexicons: Semantic resources that group together words (or more complex lexical items) according to lexical semantic relations like synonymy, hyponymy, meronymy, and antonymy (like WordNet)

Semantic Networks: Semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (like UMLS in the medical domain)


The MeSH Thesaurus

MeSH (Medical Subject Headings) is a thesaurus for indexing articles and books in the medical domain, which may then be used for searching MeSH-indexed databases. MeSH provides for each term a number of term variants that refer to the same concept. It currently includes a vocabulary of over 250,000 terms. The following is a sample entry for the term gene library (MH is the term itself, ENTRY are term variants):

MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
etc.
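A minimal sketch of how such an entry can be used for indexing: every term variant is mapped back to its main heading, so that different surface forms retrieve the same concept. The function names are hypothetical, and only the `MH` and `ENTRY` fields shown above are handled.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Record = Tuple[Optional[str], List[str]]


def parse_mesh_record(lines: Iterable[str]) -> Record:
    """Read 'KEY = value' lines and return (main heading, term variants)."""
    heading, variants = None, []
    for line in lines:
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a field line
        key, value = key.strip(), value.strip()
        if key == "MH":
            heading = value
        elif key == "ENTRY":
            variants.append(value)
    return heading, variants


def build_index(records: Iterable[Record]) -> Dict[str, str]:
    """Map every lower-cased heading and variant to its main heading."""
    index: Dict[str, str] = {}
    for heading, variants in records:
        if heading is None:
            continue
        for term in [heading, *variants]:
            index[term.lower()] = heading
    return index


if __name__ == "__main__":
    record = parse_mesh_record([
        "MH = Gene Library",
        "ENTRY = Bank, Gene",
        "ENTRY = Banks, Gene",
        "ENTRY = DNA Libraries",
        "ENTRY = Gene Bank",
    ])
    print(build_index([record])["gene bank"])
```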


The WordNet Semantic Lexicon

WordNet has primarily been designed as a computational account of the human capacity for linguistic categorization and covers an extensive set of semantic classes (called synsets). Synsets are collections of synonyms, grouping together lexical items according to meaning similarity. Strictly speaking, synsets are not made up of lexical items, but rather of lexical meanings (i.e. senses).


WordNet: An example

The word 'tree' has two meanings that roughly correspond to the classes of plants and that of diagrams, each with their own hierarchy of classes that are included in more general super-classes:

09396070 tree 0
  09395329 woody_plant 0 ligneous_plant 0
    09378438 vascular_plant 0 tracheophyte 0
      00008864 plant 0 flora 0 plant_life 0
        00002086 life_form 0 organism 0 being 0 living_thing 0
          00001740 entity 0 something 0

10025462 tree 0 tree_diagram 0
  09987563 plane_figure 0 two-dimensional_figure 0
    09987377 figure 0
      00015185 shape 0 form 0
        00018604 attribute 0
          00013018 abstraction 0
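The two hierarchies can be stored as child-to-parent links and walked upwards. This is a minimal sketch over the two chains shown above only (first lemma of each synset); it does not use the WordNet database or any WordNet API.

```python
from typing import Dict, List

# The two hypernym chains from the example, most specific synset first.
CHAINS = [
    [("09396070", "tree"), ("09395329", "woody_plant"), ("09378438", "vascular_plant"),
     ("00008864", "plant"), ("00002086", "life_form"), ("00001740", "entity")],
    [("10025462", "tree"), ("09987563", "plane_figure"), ("09987377", "figure"),
     ("00015185", "shape"), ("00018604", "attribute"), ("00013018", "abstraction")],
]

# Synset offset -> offset of its immediate super-class.
HYPERNYM: Dict[str, str] = {
    child[0]: parent[0] for chain in CHAINS for child, parent in zip(chain, chain[1:])
}
# Synset offset -> first lemma of the synset.
LABEL: Dict[str, str] = {offset: lemma for chain in CHAINS for offset, lemma in chain}


def senses(word: str) -> List[str]:
    """Return the offsets of all synsets whose first lemma is the given word."""
    return [offset for offset, lemma in LABEL.items() if lemma == word]


def hypernym_labels(offset: str) -> List[str]:
    """Return the labels from the given synset up to its top-level class."""
    path = [offset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return [LABEL[o] for o in path]


if __name__ == "__main__":
    for sense in senses("tree"):
        print(sense, " > ".join(hypernym_labels(sense)))
```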


CYC: A Semantic Network

CYC is a semantic network of over 1,000,000 manually defined rules that cover a large part of common-sense knowledge about the world. For example, CYC knows that trees are usually outdoors, or that people who have died stop buying things. Each concept in this semantic network is defined as a constant, which can represent a collection (e.g. the set of all people), an individual object (e.g. a particular person), a word (e.g. the English word person), a quantifier (e.g. there exist), or a relation (e.g. a predicate, function, slot, attribute). The entry for the predicate #$mother:

#$mother :
(#$mother ANIM FEM)
isa: #$FamilyRelationSlot #$BinaryPredicate

This says that the predicate #$mother takes two arguments, the first of which must be an element of the collection #$Animal, and the second of which must be an element of the collection #$FemaleAnimal.


Word Sense Disambiguation

Most words have more than one interpretation, or sense. If natural language were completely unambiguous, there would be a one-to-one relationship between words and senses. In fact, things are much more complicated, because for most words not even a fixed number of senses can be given. Therefore, only in certain circumstances, and depending on what exactly we mean by sense, can we give restricted solutions to the problem of Word Sense Disambiguation (WSD).


A simplified Example of a Domain Ontology

 

Ontology_1: Movies
  Title: String
  Date: mm/dd/yyyy
  Duration: minutes
  Type: (action, drama,..)
  Director: String
  Main Actors:
    Name_1:  Role:
    Name_2:  Role:
    Name_3:  Role:
    ……

Instances

Ontology_1: Movies
  Title: Lord of the Rings
  Date:
  Duration:
  Type:
  Director: Peter Jackson
  Main Actors:
    Name_1:  Role:
    Name_2:  Role:
    Name_3:  Role:
    ……
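The relation between the ontology (schema) and its partially filled instance can be sketched with a data class. The class and field names are assumptions chosen to mirror the slots above; the sketch only shows which slots still have to be filled by extraction.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Movie:
    title: str
    date: Optional[str] = None                # mm/dd/yyyy
    duration_minutes: Optional[int] = None
    type: Optional[str] = None                # e.g. "action", "drama"
    director: Optional[str] = None
    main_actors: List[Tuple[str, str]] = field(default_factory=list)  # (name, role)

    def unfilled_slots(self) -> List[str]:
        """Names of the slots that have no value yet."""
        return [name for name, value in vars(self).items() if value in (None, [])]


if __name__ == "__main__":
    instance = Movie(title="Lord of the Rings", director="Peter Jackson")
    print(instance.unfilled_slots())
```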


Example of RDF Schema for the Movie Ontology

<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'
         xmlns:NS0='http://webode.dia.fi.upm.es/RDFS/MovieOntology#'>
  <rdf:Description rdf:about='http://webode.dia.fi.upm.es/RDFS/MovieOntology#SpecialEffectsCompanyActing'>
    <rdf:type rdf:resource='http://www.w3.org/2000/01/rdf-schema#Class'/>
    <rdfs:comment>Details of company that created special effects in this movie</rdfs:comment>
    <rdfs:subClassOf rdf:resource='http://webode.dia.fi.upm.es/RDFS/MovieOntology#CompanyActing'/>
  </rdf:Description>
  <rdf:Description rdf:about='http://webode.dia.fi.upm.es/RDFS/MovieOntology#Police'>
    <rdf:type rdf:resource='http://www.w3.org/2000/01/rdf-schema#Class'/>
    <rdfs:comment>Films that deal solely with police activity</rdfs:comment>
    <rdfs:subClassOf rdf:resource='http://webode.dia.fi.upm.es/RDFS/MovieOntology#Crime'/>
  </rdf:Description>
  etc…


Integration of Ontology and Semantic Lexicon

Standardization

Format: Web-based Standards for Lexical Semantic Representation will Increase their Uptake (Easy Plug-and-Play, Remote Access, etc.)

Content: Widely Used (Lexical) Semantic Resources will lead to (Further) Semantic Standardization

An example of a Semantic Lexicon is WordNet (sometimes also referred to as a ‘Linguistic Ontology’).

Ontologies are domain-specific models, usually lacking linguistic information (PoS, Morphology, Syntax etc.)

To be Integrated in One Resource or Kept/Accessed Separately?


Defining a Linguistic Ontology for the Art World (Tentative)

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:xsd="http://www.daml.org/2000/10/XMLSchema#"
         xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
         xmlns:art="http://www.art-world.org/art-world#">

<daml:Ontology rdf:about="Concepts in the Art World">
  <daml:imports rdf:resource="http://www.daml.org/2001/03/daml+oil#"/>
</daml:Ontology>


Defining Art World Concepts (Classes, “Synsets”) (Tentative)

<daml:Class rdf:ID="art-world.01">
  <rdfs:label>art-world.01</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://www.art-world.org/art-world.00#"/>
</daml:Class>

<art-world.01 rdf:ID="work"/>
<art-world.01 rdf:ID="painting"/>

<daml:Class rdf:ID="art-world.02"/>
<art-world.02 rdf:ID="beautiful"/>
<art-world.02 rdf:ID="colourful"/>

<daml:Class rdf:ID="art-world.03"/>
<art-world.03 rdf:ID="paper"/>
<art-world.03 rdf:ID="canvas"/>


Defining Properties (“Selection Restrictions”) (Tentative)

<daml:ObjectProperty rdf:ID="manner">
  <rdfs:range rdf:resource="#art-world.02"/>
  <rdfs:domain rdf:resource="#art-world.01"/>
</daml:ObjectProperty>
<daml:ObjectProperty rdf:ID="medium">
  <rdfs:range rdf:resource="#art-world.03"/>
  <rdfs:domain rdf:resource="#art-world.01"/>
</daml:ObjectProperty>
</rdf:RDF>
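A minimal sketch of how such a linguistic ontology could be read and used to check a selection restriction, using only Python's standard XML parser. The embedded document is an abridged copy of the tentative markup above; the function names and the interpretation of class instances as lexical members of a "synset" are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DAML = "http://www.daml.org/2001/03/daml+oil#"

DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:daml="{DAML}">
  <daml:Class rdf:ID="art-world.01"/>
  <art-world.01 rdf:ID="work"/>
  <art-world.01 rdf:ID="painting"/>
  <daml:Class rdf:ID="art-world.02"/>
  <art-world.02 rdf:ID="beautiful"/>
  <art-world.02 rdf:ID="colourful"/>
  <daml:ObjectProperty rdf:ID="manner">
    <rdfs:range rdf:resource="#art-world.02"/>
    <rdfs:domain rdf:resource="#art-world.01"/>
  </daml:ObjectProperty>
</rdf:RDF>"""


def load(doc: str) -> Tuple[Dict[str, List[str]], Dict[str, Tuple[str, str]]]:
    """Return (class -> member words, property -> (domain class, range class))."""
    members: Dict[str, List[str]] = {}
    properties: Dict[str, Tuple[str, str]] = {}
    for element in ET.fromstring(doc):
        rdf_id = element.get(f"{{{RDF}}}ID")
        if element.tag == f"{{{DAML}}}Class":
            members.setdefault(rdf_id, [])
        elif element.tag == f"{{{DAML}}}ObjectProperty":
            domain = element.find(f"{{{RDFS}}}domain").get(f"{{{RDF}}}resource")
            range_ = element.find(f"{{{RDFS}}}range").get(f"{{{RDF}}}resource")
            properties[rdf_id] = (domain.lstrip("#"), range_.lstrip("#"))
        else:
            # Any other element is an instance typed by its tag name.
            members.setdefault(element.tag, []).append(rdf_id)
    return members, properties


def allowed(prop: str, head: str, value: str, members, properties) -> bool:
    """Check that head is in the property's domain class and value in its range class."""
    domain, range_ = properties[prop]
    return head in members.get(domain, []) and value in members.get(range_, [])


if __name__ == "__main__":
    classes, props = load(DOC)
    print(allowed("manner", "painting", "colourful", classes, props))
```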


(Semantic) Lexicons will be…

…an Important Part of the Semantic Web

…Represented Using Markup Languages (RDF)

…Accessible in a Remote, Distributed Fashion

…Central to Further Semantic Standardization


Multilingual terminological lexicon, attached to a domain ontology (MUMIS)

<lex-element id="ID" concept="Shot-on-goal">
  <... lang="DE" type="main">Torschuss</term>
  <... lang="EN" type="main">shot on goal</term>
  <... lang="NL" type="main">schot op doel</term>
  <definition>ein Angriffsspieler kickt den Ball zu den gegnerischen Tor</definition>
  <... lang="DE" type="synonym">Distanzschuss</term>
  <... lang="DE" type="synonym">Nachschuss</term>
  <... lang="DE" type="synonym">Schuss</term>
  <... lang="DE" type="synonym">abzieh</term>
</lex-element>

(The German definition reads: an attacking player kicks the ball towards the opposing goal.)
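A minimal sketch of how such a lexicon links terms in several languages to one ontology concept. The opening tags are elided in the entry above; the sketch assumes they are `<term>` elements and wraps the entry in a hypothetical `<lexicon>` root so that it parses as XML. Function names are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple

# Assumption: elided opening tags are <term>; <lexicon> is a made-up wrapper.
LEXICON_XML = """<lexicon>
  <lex-element id="ID" concept="Shot-on-goal">
    <term lang="DE" type="main">Torschuss</term>
    <term lang="EN" type="main">shot on goal</term>
    <term lang="NL" type="main">schot op doel</term>
    <term lang="DE" type="synonym">Distanzschuss</term>
    <term lang="DE" type="synonym">Schuss</term>
  </lex-element>
</lexicon>"""


def build_index(xml_text: str) -> Dict[Tuple[str, str], str]:
    """Map (language, lower-cased term) to the ontology concept it expresses."""
    index: Dict[Tuple[str, str], str] = {}
    for entry in ET.fromstring(xml_text).iter("lex-element"):
        for term in entry.iter("term"):
            index[(term.get("lang"), term.text.lower())] = entry.get("concept")
    return index


def main_term(xml_text: str, concept: str, lang: str) -> Optional[str]:
    """Return the main term for a concept in the requested language, if any."""
    for entry in ET.fromstring(xml_text).iter("lex-element"):
        if entry.get("concept") != concept:
            continue
        for term in entry.iter("term"):
            if term.get("lang") == lang and term.get("type") == "main":
                return term.text
    return None


if __name__ == "__main__":
    concept = build_index(LEXICON_XML)[("DE", "schuss")]
    print(concept, "->", main_term(LEXICON_XML, concept, "NL"))
```

Going from a German synonym to the concept and back out to the Dutch main term is the kind of lookup that supports cross-lingual search over annotated material.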


Extension and Formalization of the multilingual terminological lexicon, including syncategorematic information. Supporting WSD.

<lex-element id="ID" concept="Shot-on-goal">
  <... lang="DE" type="main" pos="N" mod={"von" concept="Player" | concept="player" gender="gen" | pos="posspron"}>Torschuss</term>
  <... lang="DE" type="synonym" pos="V" comp={"SUBJ" concept="Player"}>abzieh</term>
  <definition>URL: DFB home page/glossary</definition>
</lex-element>


Integrating Syntactic and Domain Knowledge

 

Including Syntactic Analysis for a more accurate tagging of domain specific semantic annotation


Abstraction over Syntactic Annotation

 

Ontology_1: NP
  Head: N
  Mod: {Adj*, PP?, GenNP}
  Spec: {Det? PossPron?}
  Type: {RefNP, ProNP, DateNP, etc.}

Ontology_2: PP
  Head: Prep
  Type: {LocPP, DatePP, etc.}
  Comp: NP

Ontology_3: Dependencies
  Head, Comp, Mod, Spec

Ontology_4: Grammatical Functions
  Subject, Object, Ind. Object, NP Adjunct, PP Adjunct, etc.


Merging of Syntactic and Domain Knowledge

 

Example of a possible rule for conceptual annotation:

If (Head of Subj_NP of Verb[type=soccer::shot-on-goal] is a person) => { annotate head of NP with semantic class “soccer::player”; …}

Example of a rule for Instance Filling:

If (term annotated with concept “soccer::player”) =>{ try to find information about relations “Team”, “Age” etc.} (Template Filling in Information Extraction).
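The two rules can be sketched as follows. The classes `Token` and `Clause`, their fields, and the slot names of the template are hypothetical stand-ins for the output of linguistic analysis; the sketch only shows how syntactic information (subject head) and domain knowledge (event type, concept) are combined.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Token:
    text: str
    ne_class: Optional[str] = None   # e.g. "person", from named entity detection
    concept: Optional[str] = None    # domain concept, filled in by the rule


@dataclass
class Clause:
    verb_type: str                   # domain type of the verb, e.g. "soccer::shot-on-goal"
    subject_head: Token              # head of the subject NP


def annotate_concept(clause: Clause) -> None:
    """Rule 1: a person heading the subject of a shot-on-goal verb is a player."""
    head = clause.subject_head
    if clause.verb_type == "soccer::shot-on-goal" and head.ne_class == "person":
        head.concept = "soccer::player"


def instance_template(token: Token) -> Optional[Dict[str, Optional[str]]]:
    """Rule 2: open a template whose empty slots are to be filled by further extraction."""
    if token.concept != "soccer::player":
        return None
    return {"Name": token.text, "Team": None, "Age": None}


if __name__ == "__main__":
    ziege = Token("Ziege", ne_class="person")
    annotate_concept(Clause("soccer::shot-on-goal", ziege))
    print(ziege.concept, instance_template(ziege))
```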


NLP-based knowledge markup


MuchMore: DTD for Annotation

[Diagram of the annotation DTD. Labels: document, sentence; umlsterms, xrceterms, ewnterms, semrels, gramrels, chunks, text; umlsterm, xrceterm, ewnterm, semrel, gramrel, chunk, token; cui, sense, msh; id, from, to, offset, code, type, term1, term2, pref, tui, pos, lemma]


MuchMore: Linguistic Annotation (Lemmatization, POS, Basic Chunking)

Balint syndrom is a combination of symptoms including simultanagnosia, a disorder of spatial and object-based attention, disturbed spatial perception and representation, and optic ataxia resulting from bilateral parieto-occipital lesions.

<text>
  <token id="w1" pos="NN">Balint</token>
  <token id="w2" pos="NN">syndrom</token>
  <token id="w3" pos="VBZ" lemma="be">is</token>
  <token id="w4" pos="DT" lemma="a">a</token>
  <token id="w5" pos="NN" lemma="combination">combination</token>
  <token id="w6" pos="IN" lemma="of">of</token>
  <token id="w7" pos="NNS" lemma="symptom">symptoms</token>
  ...
  <token id="w20" pos="JJ" lemma="spatial">spatial</token>
  <token id="w21" pos="NN" lemma="perception">perception</token>
  <token id="w22" pos="CC" lemma="and">and</token>
  <token id="w23" pos="NN" lemma="representation">representation</token>
  ...
</text>

<chunks>
  <chunk id="c1" from="w1" to="w2" type="NP"/>
  <chunk id="c7" from="w20" to="w23" type="NP"/>
</chunks>
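A minimal sketch of how the stand-off chunk annotation refers back to tokens by id. The sample is a shortened, gap-free copy of the annotation above, and the enclosing `<sentence>` element is an assumption based on the DTD diagram; the function name is illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

SAMPLE = """<sentence>
  <text>
    <token id="w1" pos="NN">Balint</token>
    <token id="w2" pos="NN">syndrom</token>
    <token id="w3" pos="VBZ" lemma="be">is</token>
    <token id="w4" pos="DT" lemma="a">a</token>
    <token id="w5" pos="NN" lemma="combination">combination</token>
  </text>
  <chunks>
    <chunk id="c1" from="w1" to="w2" type="NP"/>
  </chunks>
</sentence>"""


def chunk_texts(xml_text: str) -> List[Tuple[str, str]]:
    """Resolve each chunk's from/to token ids into (chunk type, surface string)."""
    root = ET.fromstring(xml_text)
    tokens = root.findall("./text/token")
    position = {token.get("id"): i for i, token in enumerate(tokens)}
    resolved = []
    for chunk in root.findall("./chunks/chunk"):
        first = position[chunk.get("from")]
        last = position[chunk.get("to")]
        words = " ".join(token.text for token in tokens[first:last + 1])
        resolved.append((chunk.get("type"), words))
    return resolved


if __name__ == "__main__":
    print(chunk_texts(SAMPLE))
```

Because chunks only point at token ids, further annotation layers (terms, relations) can be added without touching the token layer.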


MuchMore: Semantic Annotation (UMLS, EuroWordNet)

Balint syndrom is a combination of symptoms including simultanagnosia, a disorder of spatial and object-based attention, disturbed spatial perception and representation, and optic ataxia resulting from bilateral parieto-occipital lesions.

<umlsterm id="t7" from="w20" to="w21">
  <concept id="t7.1" cui="C0037744" preferred="Space Perception" tui="T041">
    <msh code="F2.463.593.778"/>
    <msh code="F2.463.593.932.869"/>
  </concept>
</umlsterm>

<umlsterm id="t8" from="w26" to="w26">
  <concept id="t8.1" cui="C0029144" preferred="Optics" tui="T090">
    <msh code="H1.671.606"/>
  </concept>
</umlsterm>

<semrel id="r7" term1="t7.1" term2="t8.1" reltype="issue_in"/>

<ewnterm id="e2" from="w21" to="w21">
  <sense offset="0487490"/>
  <sense offset="3955418"/>
  <sense offset="4002483"/>
</ewnterm>


MUMIS: DTD for Linguistic Annotation

[Diagram of the top-level DTD. Labels: Document, Paragraph, Sentence; PP, VG, NP, NE, AP, AdvP, Subord-Clause]


MUMIS: DTD for Linguistic Annotation

[Diagram of the AP element. Labels: AP; TYPE, STRUK, AP_AGR, STRING, AP_HEADW]


MUMIS: DTD for Linguistic Annotation

[Diagram of the VG element. Labels: VG; TYPE, VG_SUBCAT_STEM, STRING, KLAMMER, VG_STRG, SENT_STRING, VG_TYPE, VG_AGR, STRUK, VG_HEAD, ...; W]


MUMIS: DTD for Linguistic Annotation

[Diagram of the W and CLAUSE elements. Labels: W, CLAUSE, TC; INFL, STRING, STEM, TYPE, POS, SENT_STRING; CLAUSE_TYPE, CLAUSE_SUBJ, CLAUSE_PRED_STRG, CLAUSE_PRED_SUBCAT, CLAUSE_PRED_AGR, CLAUSE_VG_LIST, CLAUSE_NP_LIST, CLAUSE_PP_LIST, CLAUSE_PP_ADJUNKT, ...]


MUMIS: Linguistic Annotation (Lemmatization … Dependency Structure)

Industrie, Handel und Dienstleistungen werden in der ersten Liste aufgeführt, wobei die in Klammern gesetzten Zahlen auf die Mutterfirmen hinweisen.

(Industry, trade and services are mentioned in the first list, in which numbers within brackets point to parent companies.)

<chunks>
  <chunk id="c1" from="w1" to="w5" type="NP" head="w1,w3,w5"/>
  <chunk id="c2" from="w6" to="w6" type="VG"/>
  <chunk id="c3" from="w7" to="w10" type="PP" head="w7" complement="w8,w9,w10"/>
  <chunk id="c4" from="w11" to="w1" type="VG"/>
  ….
</chunks>
<clauses>
  <clause id="cl1" from="c1" to="c4" pred_struct="c2 c4" GF_Subj="c1"/>
  <clause id="cl2" from="c6" to="c9" pred_struct="c9" GF_Subj="c6"/>
</clauses>


MUMIS: Semantic Annotation (Events)

7. Ein Freistoss von Christian Ziege aus 25 Metern geht über das Tor.

(7th minute: A free kick by Christian Ziege from 25 meters goes over the goal.)

<chunks>
  <chunk id="c1" from="w1" to="w5" type="NP" head="w2" pp modifier="w3 w4 w5"/>
  <chunk id="c2" from="w6" to="8" type="PP" head="w6" complement="w7 w8"/>
  <chunk id="c3" from="w9" to="9" type="VG"/>
  <chunk id="c4" from="w10" to="w12" type="PP" head="w10" complement="w11 w12"/>
</chunks>

<clauses>
  <clause id="cls1" from="c1" to="c4" pred_struct="c3" GF_Subj="c1"/>
</clauses>

<events>
  <event id="e1" clause="cls1" event-name="free-kick">
    <arguments>
      <argument id="arg1" name="player" value="w4, w5"/>
      <argument id="arg2" name="location" value="25-meter"/>
      <argument id="arg3" name="time" value="07:00"/>
    </arguments>
  </event>
  <event id="e2" clause="cls1" event-name="goal-scene-fail">
    <arguments>
      <argument id="arg1" name="player" value="w4, w5"/>
      <argument id="arg2" name="location" value="25-meter"/>
      <argument id="arg3" name="time" value="07:00"/>
    </arguments>
  </event>
</events>


Conceptual Annotations for Multimedia Indexing and Retrieval: A multilingual cross-document and incremental IE approach (MUMIS)

Technology development to automatically index (with formal annotations) lengthy multimedia recordings (off-line process): Find and annotate relevant entities, relations and events

Technology development to exploit indexed multimedia archives (on-line process): Search for interesting scenes and play them via Internet

Test Domain: Soccer Games / UEFA Tournament 2000


Off-line Task

Indexing by...

Automatic Speech Recognition (Radio/TV Broadcasts): automatically transforms the speech signals into texts (for 3 languages: Dutch, English and German)

Natural Language Processing (Information Extraction): analyse all available textual documents (newspapers, speech transcripts, tickers, formal texts ...), identify and extract interesting entities, relations and events

Merging all the annotations produced so far

Create a database with formal annotations

Use video processing to adjust time marks


Information Extraction

Information Extraction (IE) is the task of identifying, collecting and normalizing relevant information for a specific application or user. The relevant information is typically represented in the form of predefined “templates”, which are filled by means of Natural Language (NL) analysis. IE combines pattern matching mechanisms, (shallow) NLP and domain knowledge (terminology and ontology).


Information Extraction (2)

IE is generally subdivided into the following tasks:

- Named Entity task (NE)

- Template Element task (TE)

- Template Relation task (TR)

- Scenario Template task (ST)

- Co-reference task (CO)


Subtasks of IE

Named Entity task (NE): Mark in the text each string that represents a person, organization, or location name, a date or time, or a currency or percentage figure.

Template Element task (TE): Extract basic information related to organization, person, and artifact entities, drawing evidence from everywhere in the text.


Subtasks of IE (2)

Template Relation task (TR): Extract relational information on employee_of, manufacture_of, location_of relations etc. (TR expresses domain-independent relationships).

Scenario Template task (ST): Extract pre-specified event information and relate the event information to particular organization, person, or artifact entities (ST identifies domain- and task-specific entities and relations).

Co-reference task (CO): Capture information on co-referring expressions, i.e. all mentions of a given entity, including those marked in NE and TE.


IE applied to soccer

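Terms as descriptors for the NE task (German source phrases, English glosses in square brackets)

Team: Titelverteidiger Brasilien [title holders Brazil], den respektlosen Außenseiter Schottland [the irreverent underdog Scotland]

Player: Superstar Ronaldo, von Bewacher Calderwood noch von Abwehrchef Hendry [by marker Calderwood nor by defence chief Hendry], von Jackson als drittem Stürmer [by Jackson as third striker], Torschütze Cesar [goal scorer Cesar], von Roberto Carlos (16.) [by Roberto Carlos (16th minute)],

Referee: vom spanischen Schiedsrichter Garcia Aranda [by the Spanish referee Garcia Aranda]

Trainer: Schottlands Trainer Brown [Scotland's coach Brown], Kapitän Hendry seinen Keeper Leighton [captain Hendry his keeper Leighton]

Location: im Stade de France von St. Denis [in the Stade de France of St. Denis] (more fine-grained location detection would be: Stadion [stadium]: im Stade de France and City: von St. Denis)

Attendance: Vor 80000 Zuschauern [in front of 80,000 spectators]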


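IE applied to soccer (2)

Terms for NE Task (German source phrases, English glosses in square brackets)

Time: in der 73. Minute [in the 73rd minute], nach gerade einmal 3:50 Minuten [after just 3:50 minutes], von Roberto Carlos (16.) [by Roberto Carlos (16th minute)], nach einer knappen halben Stunde [after just under half an hour], scheiterte Rivaldo (49./52.) jeweils nur knapp [Rivaldo (49th/52nd minute) only narrowly failed each time], das vor der Pause Versäumte versuchten die Brasilianer nach Wiederbeginn [what they had missed before the break, the Brazilians attempted after the restart], ...

Date: am Mittwoch [on Wednesday], der Turnierstart (?) [the start of the tournament], im WM-Eröffnungsspiel (?) [in the World Cup opening match]

Score/Result: Brasilien besiegt Schottland 2:1 [Brazil beats Scotland 2:1], einen 2:1 (1:1)-Sieg [a 2:1 (1:1) victory], der zwischenzeitliche Ausgleich [the interim equalizer], in der 4. Minute in Führung gebracht [put into the lead in the 4th minute], köpfte zum 1:0 ein [headed in to make it 1:0]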

IE applied to soccer (3)

Relations for TR Task:

Opponents: Brasilien besiegt Schottland, feierte der Top-Favorit ... einen glücklichen 2:1 (1:1)-Sieg über den respektlosen Außenseiter Schottland,

Player_of: hatte Cesar Sampaio den vierfachen Weltmeister ... in Führung gebracht, Collins gelang ... der zwischenzeitliche Ausgleich für die Schotten, der Keeper des FC Aberdeen, Brasiliens Keeper Taffarel

Trainer_of: Schottlands Trainer Brown

...
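Relations such as Trainer_of and Player_of are often signalled by a genitive team name followed by a role noun ("Schottlands Trainer Brown", "Brasiliens Keeper Taffarel"). A minimal sketch of that single pattern follows; the role-to-relation table and all names in the code are assumptions for this example, not the rule set used in MUMIS.

```python
import re

# Hypothetical mapping from role nouns to relation names.
ROLE_TO_RELATION = {"Trainer": "Trainer_of", "Keeper": "Player_of"}

# "<Team>s <Role> <Person>", e.g. "Schottlands Trainer Brown".
GENITIVE_ROLE = re.compile(
    r"\b(?P<team>[A-ZÄÖÜ]\w+?)s (?P<role>Trainer|Keeper) (?P<person>[A-ZÄÖÜ]\w+)"
)


def extract_relations(text):
    """Return (relation, person, team) triples found in text."""
    return [
        (ROLE_TO_RELATION[m.group("role")], m.group("person"), m.group("team"))
        for m in GENITIVE_ROLE.finditer(text)
    ]


print(extract_relations("Schottlands Trainer Brown und Brasiliens Keeper Taffarel"))
```

Cases such as "der Keeper des FC Aberdeen" need a different pattern (post-nominal genitive), which is why the relation task benefits from syntactic annotation rather than string patterns alone.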

IE applied to soccer (4)

Events for ST task:

Goal: in der 4. Minute in Führung gebracht, das schnellste Tor ... markiert, Cesar Sampaio köpfte zum 1:0 ein, Collins (38.) verwandelte den Strafstoß, hätte Kapitän Hendry seinen Keeper Leighton um ein Haar zum zweiten Mal bezwungen, von dem der Ball ins Tor prallte

Foul: als er den durchlaufenden Gallacher im Strafraum allzu energisch am Trikot zog

Substitution: und mußte in der 59. Minute für Crespo Platz machen...

NL Processing and Knowledge Markup of (German) soccer texts with the SCHUG system

A multilingual ontological lexicon

• Formal Text1

• Formal Text2

• XML Soccer Annotation for Text1

• XML Soccer Annotation for Text2

• Merging of Annotations for Formal Texts

• Semi-Formal Text

• Semi-Formal Text annotated with Soccer Information (XML)

Multilingual Extension: Spanish (Esperonto)

Ontology:

<lex-element id="ID" concept="Second-half">
  <... lang="DE" type="main">zweite Halbzeit</term>
  <... lang="EN" type="main">second half</term>
  <... lang="ES" type="main">reanudacion</term>
</lex-element>
…
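To show how such a lexicon entry could be consumed, here is a small sketch that loads entries into a concept-to-terms table. Two assumptions are made and should be checked against the real resource: the elided opening tag is taken to be `term` (matching the closing tag), and the entries are wrapped in a hypothetical `lexicon` root element.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed version of the entry; "lexicon" and "term" are guesses.
LEXICON_XML = """
<lexicon>
  <lex-element id="ID" concept="Second-half">
    <term lang="DE" type="main">zweite Halbzeit</term>
    <term lang="EN" type="main">second half</term>
    <term lang="ES" type="main">reanudacion</term>
  </lex-element>
</lexicon>
"""


def load_lexicon(xml_text):
    """Map each concept name to a {language: term} dictionary."""
    root = ET.fromstring(xml_text)
    lexicon = {}
    for element in root.iter("lex-element"):
        terms = {term.get("lang"): term.text for term in element.iter("term")}
        lexicon[element.get("concept")] = terms
    return lexicon


lexicon = load_lexicon(LEXICON_XML)
print(lexicon["Second-half"]["ES"])
```

Keying the table by concept means that adding a language, as in the Spanish extension, only adds one more term per entry and leaves the ontology untouched.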

Processing with the SCHUG system: Example

Conceptual Annotations for Multimedia Indexing and Retrieval: MUMIS

[Figure: MUMIS architecture diagram. Recoverable labels: Domain Modeling (DM); Soccer Texts; Ontology; Domain Lexicon; Automatic Speech Recognition (ASR); Speech Signals; Transcripts; Formal Text; Free Text; Information Extraction (IE); Annotated formal text; Annotations; Merging; User Interface (UI); Query; languages DE, EN, NL; Legend.]

The first user interface of MUMIS

Esperonto: Partners

Intelligent Software Components (Coord): Semantic Web, Annotation Services.

UPM: ontology development and evaluation.

University of Innsbruck: Semantic Web languages.

Saarland University: multilingual Annotation services, using Information Extraction.

UNILIV: Semantic indexation of Semantic Web content. Routing solutions. Visualization and navigation to make content presentation more user-friendly.

Residencia de Estudiantes: Content provider. Cultural tour test case. Evaluation.

CIDEM: Content provider. Fund finder test case. Evaluation.

BioVista: Content provider. Scientific Discovery test case.

Aim

Application Service Provision of Semantic Annotation, Aggregation, Indexing and Routing of Textual, Multimedia, and Multilingual Web Content

The project aims at bridging the gap between the current World Wide Web and the Semantic Web by providing a service to "upgrade" existing content to Semantic Web content.

Ontologies play a key role in this effort, together with multilingual Natural Language Analysis of textual documents currently on the web as free or HTML-encoded texts.

Main Goals

To bridge the gap between the current web and the Semantic Web: SemASP

• Ontology-based annotation

• Sources: static pages, pages dynamically generated from DB, textual and multimedia information, Web services

Added-value knowledge-based services on top of the constructed Semantic Web

• Routing based on P2P communication

• Semantic aggregation

• Meaning negotiation

Support multilinguality in ontology construction, ...

[Figure: Esperonto / SemASP architecture diagram. Recoverable labels: Applications; Portal Agent; Agent; Router; XML, DAML, OIL, RDF(S); Certificate; Workbench; Maintenance; Multilinguality; Reengineering; Mapping; Ontology Repository; Service; Tagger/Wrapper; Web Server Provider; Dynamic Information Provider; Static Information Provider; Multimedia Data Provider; Multilingual NL Understanding; Multilingual NL Generation; Visualization Service Provider; World Wide Web; Semantic Web; SemASP; Semantic indices, Concept instances.]

Ontology-based Annotation

Accurately annotate documents with concepts and terms described in various semantic resources: EuroWordNet, UMLS, Soccer ontology, etc.

Annotate documents with relations defined in the ontology
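Concept annotation can be pictured, in its simplest form, as a lookup of known terms in the text. The sketch below is a deliberately naive illustration: the mini-lexicon, the concept names, and the function `annotate` are hypothetical, and no morphological analysis or disambiguation is performed.

```python
import re

# Hypothetical term-to-concept table, e.g. derived from a domain lexicon.
TERM_TO_CONCEPT = {
    "second half": "Second-half",
    "rheumatoid arthritis": "Rheumatoid_Arthritis",
}


def annotate(text, term_to_concept):
    """Return sorted (start, end, concept) spans for every known term in text."""
    annotations = []
    for term, concept in term_to_concept.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append((match.start(), match.end(), concept))
    return sorted(annotations)


text = "Brazil scored early in the second half."
for start, end, concept in annotate(text, TERM_TO_CONCEPT):
    print(text[start:end], "->", concept)
```

Stand-off spans (character offsets plus a concept) keep the original document unchanged, so several annotation layers can coexist.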

Ontology construction from Text

Various methodologies are under investigation for extracting/learning knowledge from text and encoding it in an ontology (see Ontology Learning Overview - OntoWeb D1.5, http://www.ontoweb.org). Many are based on Machine Learning techniques.

We discuss here the possibility of a rule-based approach for partial and shallow ontology construction from text, based on various levels of syntactic patterns annotated in the documents.

Ontology construction from Text: A starting experiment: Medicine

Document Set: 65 sample phrases that link symptoms with Rheumatoid Arthritis (RA).

Ontology construction from Text: Apposition and Parenthesis (1)

“The effects of rheumatoid arthritis on bone include structural joint damage (erosions) and osteoporosis “

Linguistic Structure:

[[The effects of rheumatoid arthritis] [on bone]] [include] [[structural joint damage ( erosions )] [ and] [osteoporosis]]

=> The apposition (2 syntactic heads, “joint” and “erosions”, in one NP), including a parenthesis construction, suggests a synonymy relation or a definition.

Heuristic: Establish semantic relations on top of linguistic “head-modifier” constructions.
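The heuristic can be sketched as a rule over NP chunks: if a chunk ends in a parenthesised expression, propose the pair as a candidate for synonymy or definition. The chunk list is assumed to come from an upstream linguistic annotation step; the pattern and the function name are illustrative only.

```python
import re

# An NP chunk of the form "<phrase> ( <inner> )".
PAREN_IN_NP = re.compile(
    r"^(?P<phrase>[^()]+?)\s*\(\s*(?P<inner>[^()]+?)\s*\)$"
)


def apposition_candidates(noun_phrases):
    """Return (phrase, parenthesised expression) pairs from NP chunks."""
    pairs = []
    for chunk in noun_phrases:
        match = PAREN_IN_NP.match(chunk.strip())
        if match:
            pairs.append((match.group("phrase"), match.group("inner")))
    return pairs


chunks = [
    "The effects of rheumatoid arthritis",
    "structural joint damage ( erosions )",
    "osteoporosis",
]
print(apposition_candidates(chunks))
```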

Ontology construction from Text: Apposition with Parenthesis (2)

“For symptoms of rheumatoid arthritis (pain, joint stiffness), the reference treatment is a nonsteroidal antiinflammatory drug (NSAID) such as diclofenac or ibuprofen.”

Linguistic Structure

[For symptoms of rheumatoid arthritis ( pain , joint stiffness )] , [the reference treatment] [is] [a nonsteroidal antiinflammatory drug ( NSAID)]

This suggests a semantic relation between “pain” and “joint stiffness”.

Classify “pain” and “joint stiffness” as symptoms of RA. The word “symptoms” is linguistically annotated as the head of the Compl-NP of the PP starting with “For”.
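A minimal sketch of this classification step: the head noun before "of" names the relation, the NP after "of" names the entity, and each item in the parenthesis becomes an instance. The pattern, the naive singularisation, and the relation naming (`symptom_of`) are assumptions made for this example.

```python
import re

# "<category> of <entity> ( item , item , ... )"
CATEGORY_LIST = re.compile(
    r"(?P<category>\w+) of (?P<entity>[^()]+?)\s*\(\s*(?P<items>[^()]+?)\s*\)"
)


def classify_parenthesized(phrase):
    """Return (item, relation, entity) triples for a parenthesised list."""
    match = CATEGORY_LIST.search(phrase)
    if not match:
        return []
    # Naive singularisation: "symptoms" -> "symptom".
    relation = match.group("category").rstrip("s") + "_of"
    items = [item.strip() for item in match.group("items").split(",")]
    return [(item, relation, match.group("entity")) for item in items]


phrase = "For symptoms of rheumatoid arthritis ( pain , joint stiffness )"
for triple in classify_parenthesized(phrase):
    print(triple)
```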

Ontology construction from Text: Apposition with Parenthesis (3)

But there is a need for constraining the hypothesis:

“In patients with rheumatoid arthritis (RA)” => RA is an abbreviation of rheumatoid arthritis

And in the sentence:

“Fourteen consecutive elbows have been treated for rheumatoid arthritis (9 elbows) and for post-traumatic osteoarthrosis (5 elbows) by total elbow replacement with the GSB III implant.”

the parentheses (9 elbows) and (5 elbows) have no semantic relation to the preceding head nouns!
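These constraints can be expressed as a small filter applied before any relation is proposed. The sketch below uses two hypothetical tests, an initial-letter check for abbreviations and a leading-digit check for quantities; the category labels are invented for illustration and the checks are far from complete.

```python
import re


def classify_parenthetical(phrase, inner):
    """Label the parenthesised expression that follows a phrase."""
    inner = inner.strip()
    # "(9 elbows)": a count, not a semantic relation to the head noun.
    if re.match(r"\d", inner):
        return "quantity"
    # "(RA)": upper-case string matching the initials of the phrase.
    initials = "".join(word[0] for word in phrase.split()).upper()
    if inner.isupper() and initials.endswith(inner):
        return "abbreviation"
    # Everything else stays a candidate for synonymy or definition.
    return "candidate_relation"


print(classify_parenthetical("rheumatoid arthritis", "RA"))
print(classify_parenthetical("rheumatoid arthritis", "9 elbows"))
print(classify_parenthetical("structural joint damage", "erosions"))
```

The initials test is too strict for cases like "nonsteroidal antiinflammatory drug (NSAID)", where the abbreviation draws letters from inside words; such cases fall through to the candidate class here.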

Ontology construction from Text: Apposition with commas

“Etoricoxib, a selective COX2 inhibitor, has been shown to be as effective as non-selective non-steroidal anti-inflammatory drugs in the management of chronic pain in rheumatoid arthritis and osteoarthritis, …”

Linguistic Structure:

[Etoricoxib, a selective COX2 inhibitor,] [has been shown]…

The same hypothesis as in the former examples: a semantic relation between “Etoricoxib” and “selective COX2 inhibitor”, probably an “isa” relation.
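As an illustration, a comma-delimited apposition introduced by an article can be mapped to an “isa” candidate. The surface pattern below stands in for the NP structure that the linguistic annotation would provide; the pattern and the function name are assumptions.

```python
import re

# "<Name>, a|an|the <category>, ..."
COMMA_APPOSITION = re.compile(
    r"\b(?P<instance>[A-Z][\w-]+), (?:an?|the) (?P<category>[^,]+),"
)


def isa_candidates(sentence):
    """Return (instance, 'isa', category) triples from comma appositions."""
    return [
        (m.group("instance"), "isa", m.group("category"))
        for m in COMMA_APPOSITION.finditer(sentence)
    ]


sentence = "Etoricoxib, a selective COX2 inhibitor, has been shown to be effective"
print(isa_candidates(sentence))
```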

Ontology construction from Text: Compound Analysis

“Joint destruction”, “joint damage”, “joint disease”, “joint stiffness”, but “joint cartilage”.

“Knee joints” vs. “tender joints”

What can happen to joints, and where are joints located? Use of synsets to detect relations? “Joint cartilage” is not a disease.
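A first step of compound analysis is to separate compounds in which the noun is the modifier ("joint damage") from those in which it is the head ("knee joints"). The sketch below does only this grouping for two-word compounds; the group labels and the function are hypothetical.

```python
from collections import defaultdict


def group_compounds(compounds, noun="joint"):
    """Group two-word compounds by the position of the given noun."""
    groups = defaultdict(list)
    for compound in compounds:
        words = compound.lower().split()
        if len(words) != 2:
            continue
        first, second = words
        # rstrip("s") is a naive way to treat "joints" like "joint".
        if first.rstrip("s") == noun:
            groups["noun_as_modifier"].append(second)
        elif second.rstrip("s") == noun:
            groups["noun_as_head"].append(first)
    return dict(groups)


compounds = [
    "joint damage", "joint disease", "joint stiffness",
    "joint cartilage", "knee joints", "tender joints",
]
print(group_compounds(compounds))
```

Position alone cannot tell that "cartilage" is a body part while "damage" and "disease" are disorders; that distinction needs a lexical semantic resource, which is where the synsets mentioned above come in.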

Ontology construction from Text: PP post-modification

“inflammation of joints”, “synovial lining of joints”

Here: use of synsets for grouping what can happen to joints?

Ontology construction from Text: Phrase Internal Coordination

“The effects of rheumatoid arthritis on bone include structural joint damage (erosions) and osteoporosis”

Linguistic Structure:

[[The effects of rheumatoid arthritis] [on bone]] [include] [[structural joint damage ( erosions )] [ and] [osteoporosis]]

RA causes structural joint damage AND osteoporosis (interpreting the head noun “effects” as a causation).

Hypothesis: The two heads of an NP coordination are somehow related.

Ontology construction from Text: Phrase Internal Coordination (2)

“A study was conducted to determine the incidence of ulnar and peripheral neuropathy “

Linguistic Structure:

… [The incidence of [[ulnar and peripheral] neuropathy]]

The AP “ulnar and peripheral” modifies the head noun “neuropathy”. The AP is a coordinated one, having two adjectival heads. Hypothesis: They correspond to two types of neuropathy.
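The hypothesis can be sketched as a distribution rule: a coordinated AP in front of a head noun is expanded into one subtype per adjective. The input is assumed to be an NP already identified as "Adj and Adj Noun" by the linguistic annotation; the pattern and the function name are illustrative.

```python
import re

# "<adj1> and <adj2> <head>", e.g. "ulnar and peripheral neuropathy".
COORDINATED_AP = re.compile(r"^(?P<adj1>\w+) and (?P<adj2>\w+) (?P<head>\w+)$")


def expand_coordination(noun_phrase):
    """Return (subtype, 'isa', head) triples for a coordinated AP."""
    match = COORDINATED_AP.match(noun_phrase.strip())
    if not match:
        return []
    head = match.group("head")
    return [
        (f"{match.group(name)} {head}", "isa", head)
        for name in ("adj1", "adj2")
    ]


print(expand_coordination("ulnar and peripheral neuropathy"))
```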

Ontology construction from Text: Subject Verb Objects (Ind. Obj. etc.)

[Rheumatoid arthritis is an immunologically mediated inflammation of joints of unknown aetiology] and [often leads to disability]

=> RA leads to disability (effect of ellipsis resolution: RA is detected as the subject of the verb “leads”, even if not realised in the text. Reference resolution is very important for knowledge extraction)

=> Lexical semantic info: collect all objects of “RA leads to …”

=> Suggests causality (verb “lead” + “to”)
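The three steps above (ellipsis resolution, collecting the objects of "leads to", suggesting causality) can be put together in a small sketch. Clauses are assumed to arrive as (subject, predicate) pairs from the parser, with `None` for an elided subject; the carry-over rule and the relation name `causes` are simplifications chosen for this example.

```python
import re

# Optional adverb, then "lead(s) to <effect>".
LEADS_TO = re.compile(r"^(?:\w+ )?leads? to (?P<effect>.+)$")


def causal_relations(clauses):
    """Return (cause, 'causes', effect) triples from (subject, predicate) pairs."""
    relations = []
    last_subject = None
    for subject, predicate in clauses:
        # Naive ellipsis resolution: reuse the subject of the previous clause.
        subject = subject or last_subject
        last_subject = subject
        match = LEADS_TO.match(predicate)
        if match and subject:
            relations.append((subject, "causes", match.group("effect")))
    return relations


clauses = [
    ("Rheumatoid arthritis",
     "is an immunologically mediated inflammation of joints of unknown aetiology"),
    (None, "often leads to disability"),
]
print(causal_relations(clauses))
```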

Ontology construction from Text: Subject Verb Objects (Ind. Obj. etc.)

“These changes constitute hallmarks of synovial cell activation and contribute to both chronic inflammation and hyperplasia”

On line exercise!

Future Work

We still have to accurately identify the subset of linguistic tags describing syntactic/semantic patterns that are relevant for ontology extraction (or even ontology mark-up).

First Conclusions

Construction of partial and shallow ontologies from (complex) syntactic patterns seems feasible. It might seem “expensive” in the sense that documents should first be (automatically) linguistically annotated.

But Machine Learning methods also need a lot of semi-automatically annotated data for training.

There is a need to conduct a comparative evaluation taking into account as many parameters as possible.

Practical Sessions (Adrian Raschip)

• Exercise 1 : Semi-automatic terminological extension for Romanian and other languages, on the basis of the TMX-encoded MUMIS multilingual terminology

• Exercise 2 : (Manual) linguistic annotation of English and Romanian Text on Soccer

• Exercise 3 : Define a soccer ontology in Protégé

• Exercise 4 : Search for possible mapping rules between linguistic annotations and relations that might be relevant for extraction