TEI FOR LANGUAGE RESOURCES: A MISSED CHANCE OR A COMING OPPORTUNITY?
Tomaž Erjavec
Dept. of Knowledge Technologies
Jožef Stefan Institute
Ljubljana, Slovenia



TEI for Language Resources 2/36

Overview
1. Some history
2. Why TEI isn't used for LRs (as much as expected)
3. MULTEXT-East and other case studies
4. Conclusions


History
At its inception, TEI was meant to cover CL/NLP LRs, esp. corpora:
• ACL was one of the supporting associations
• Modules for corpora, linguistic analysis, feature structures, graphs
• BNC encoded in TEI
• At the time, CL/NLP did not use SGML: a clear playing field


The age of XML and LRs
The release of XML corresponds (more or less) to the beginning of the era of language resources:
• 1998: XML 1.0, first LREC conference

But the LRs that were developed (mostly) did not use TEI. Why?


Reason 1: (X)CES
• EAGLES Corpus Encoding Standard
• "constraining or simplifying the TEI specifications in order to ensure interoperability" (Ide 1998)
• So, more compact and easier to apply than TEI
• Almost TEI, but not quite
• No methods for extension


Reason 2: Comp Sci attitude
• I don't care about the data format, I want to develop algorithms! (... I even hate XML...)
• If I use XML, I will roll my own schema, optimal for my experiments / application (...that's what 'X' means...)
• I won't spend weeks (months, years) just getting to know TEI (...I need only 4 different elements anyway...)


Reason 3: General gripes
• Missing modules for syntactic analysis & lexical databases
• Not prescriptive / precise enough
• Elements too general
• Too book-oriented


Result
• Project-local proposals:
  • TIGER treebank format
  • Concede lexical database format
  • GENIA NER format
  • ...
• Semantic Web: DC, RDF, OWL
• ISO TC 37 SC4:
  • LMF, isoCat,
  • LAF, MAF, SynAF, ...


MyTEI
• MULTEXT-East: multilingual corpora and lexica
• Fida(PLUS): Slovene Reference Corpus
• IJS-ELAN, SVEZ-IJS: en-sl parallel corpora
• jaSlo: Japanese-Slovene L2 dictionary
• eZISS: Scholarly Digital Editions of Slovene Literature
• JRC-ACQUIS: Parallel corpus of EC laws
• SDT: Slovene Dependency Treebank
• SBL: Slovene Biographic Lexicon
• AHLib: DL/corpus of 19th century Slovene books
• JOS: Slovene gold-standard corpus for HLT
• MULTEXT-East...


MULTEXT-East
• EU project 1995-97: MULTEXT sequel
• Development of standardised language resources for Central and Eastern European languages + English hub
• Corpora, lexica, morphosyn. specifications
• V1: 1998, 7 languages, LaTeX + CES/SGML
• V4: 2010, 16 languages, TEI P5
• http://nl.ijs.si/ME/


MULTEXT-East Version 4 by language and resource type


Why TEI for MTE?
• Because I like TEI
• Varied resources:
  • Metadata / Documentation
  • "Document" corpus: rich annotation structure
  • Linguistically annotated "1984" corpus
  • Sentence alignments: stand-off markup
  • Morphosyntactic specifications: book-like

Either choose several (moving target) schemas or use TEI.


Documentation


TEI Header
[Figure: TEI header, with parts labelled v4, v3, v2, v1, eci, ota, soas]


Annotated 1984
<text xml:id="Osl." xml:lang="sl">
  <body>
    <div type="part" xml:id="Osl.1">
      <div type="chapter" xml:id="Osl.1.2">
        <p xml:id="Osl.1.2.2">
          <s xml:id="Osl.1.2.2.1">
            <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
            <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
            <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
            <c xml:id="Osl.1.2.2.1.4">,</c> ← sorry!
            <w xml:id="Osl.1.2.2.1.5" lemma="mrzel" ana="#Agpmsnn">mrzel</w>
            <w xml:id="Osl.1.2.2.1.6" lemma="aprilski" ana="#Agpmsny">aprilski</w>
            <w xml:id="Osl.1.2.2.1.7" lemma="dan" ana="#Ncmsn">dan</w>
            <w xml:id="Osl.1.2.2.1.8" lemma="in" ana="#Cc">in</w>
            <w xml:id="Osl.1.2.2.1.9" lemma="ura" ana="#Ncfpn">ure</w>
            <w xml:id="Osl.1.2.2.1.10" lemma="biti" ana="#Va-r3p-n">so</w>
            <w xml:id="Osl.1.2.2.1.11" lemma="biti" ana="#Va-p-pf">bile</w>
            <w xml:id="Osl.1.2.2.1.12" lemma="trinajst" ana="#Mlc-pa">trinajst</w>
            <c xml:id="Osl.1.2.2.1.13">.</c>
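To show how little code such in-place annotation needs on the consumer side, here is a minimal Python sketch that reads word and punctuation tokens with their lemma and MSD. The sample string is a hand-closed copy of the first four tokens of the sentence above, and the function name `read_tokens` is made up for this illustration. The sample carries no TEI namespace; a real TEI P5 file would normally declare one, and the tag tests would then have to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Hand-closed copy of the first tokens of the sentence on the slide.
SAMPLE = """
<s xml:id="Osl.1.2.2.1">
  <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
  <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
  <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
  <c xml:id="Osl.1.2.2.1.4">,</c>
</s>
"""

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_tokens(xml_text):
    """Return (id, form, lemma, msd) for every <w> and <c>, in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for el in root.iter():
        if el.tag not in ("w", "c"):
            continue
        # ana is a pointer ("#Va-p-sm"); drop the leading '#' to get the MSD.
        msd = el.get("ana", "").lstrip("#") or None
        tokens.append((el.get(XML_ID), el.text, el.get("lemma"), msd))
    return tokens


tokens = read_tokens(SAMPLE)
for token in tokens:
    print(token)
```

Punctuation tokens come back with `None` for lemma and MSD, since `<c>` carries neither attribute in the sample.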


Whitespace
• A long time ago "1984" lost its spaces
• Whitespace is brittle but important:
  • Retokenisation
  • Reading
• TEI <space> no good!
• So <mte:space> </mte:space>, 24:1?
• Fence-sitting JOS solution: <S/>
• <mte:g/>?
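Why the lost spaces matter for reading can be sketched in a few lines of Python. The example assumes an empty marker element between tokens, written here as `<S/>`; the exact element name and the two function names are assumptions made for this illustration, not a description of any released corpus format.

```python
import xml.etree.ElementTree as ET

# Tokens with an explicit (assumed) whitespace marker between them.
SAMPLE = (
    "<s><w>Bil</w><S/><w>je</w><S/><w>jasen</w><c>,</c><S/><w>mrzel</w></s>"
)


def detokenise(xml_text, marker="S"):
    """Rebuild the running text, emitting a space only where a marker occurs."""
    root = ET.fromstring(xml_text)
    parts = []
    for el in root:
        if el.tag == marker:
            parts.append(" ")
        elif el.tag in ("w", "c"):
            parts.append(el.text or "")
    return "".join(parts)


def naive_join(xml_text):
    """What is left when the spacing was not recorded: one space per token."""
    root = ET.fromstring(xml_text)
    return " ".join(el.text or "" for el in root if el.tag in ("w", "c"))


print(detokenise(SAMPLE))
print(naive_join(SAMPLE))
```

Without the marker, the comma ends up detached from the preceding word, which is the kind of damage that makes retokenisation and plain reading awkward.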


Sentence alignments

In MTE V3:
<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE cesAlign SYSTEM "xcesAlign.dtd">
<cesAlign version="4.1">
  <linkList id="Oruen">
    <linkGrp type="body" targType="s" domains="Oru Oen">
      <link xtargets="Oru.1.1.1.1 ; Oen.1.1.1.1"/>
      <link xtargets="Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6"/>
      <link xtargets="Oru.1.3.4.1 ; Oen.1.3.4.1 Oen.1.3.4.2"/>
      <link xtargets=" ; Oen.1.3.4.3"/>
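The `xtargets` convention in the V3 sample (sentence IDs of one language, a semicolon, sentence IDs of the other) can be unpacked as in the Python sketch below. The function names are hypothetical, and the handling of an empty side is an assumption based on the last link in the sample.

```python
def parse_xtargets(xtargets):
    """Split 'source ids ; target ids' into two lists; an empty side gives []."""
    left, sep, right = xtargets.partition(";")
    if not sep:
        raise ValueError(f"no ';' separator in {xtargets!r}")
    return left.split(), right.split()


def alignment_type(xtargets):
    """Return the alignment as 'n:m', e.g. '2:1' for two source sentences to one."""
    source, target = parse_xtargets(xtargets)
    return f"{len(source)}:{len(target)}"


# The four links from the sample above.
links = [
    "Oru.1.1.1.1 ; Oen.1.1.1.1",
    "Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6",
    "Oru.1.3.4.1 ; Oen.1.3.4.1 Oen.1.3.4.2",
    " ; Oen.1.3.4.3",
]
for link in links:
    print(alignment_type(link), parse_xtargets(link))
```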


TEI P5 Alignments
• The TEI way uses two-level indirection: first grouping, then alignment
• Too complicated, esp. as 98% of alignments are 1-1
• Chose a fence-sitting one-level encoding:

<linkGrp type="alignment" corresp="oana-mk.xml oana-sl.xml">
  <link n="1:1" targets="oana-mk.xml#Omk.1.1.1.1 oana-sl.xml#Osl.1.2.2.1"/>
  <link n="2:1" targets="oana-mk.xml#Omk.1.1.2.6 oana-mk.xml#Omk.1.1.2.7 oana-sl.xml#Osl.1.2.3.6"/>
  <link n="1:2" targets="oana-mk.xml#Omk.1.1.2.8 oana-sl.xml#Osl.1.2.3.7 oana-sl.xml#Osl.1.2.3.8"/>
  <!--link n="0:1" targets="oana-sl.xml#Osl.4.12.2"/-->
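In the one-level encoding above, the two sides of a link are no longer separated by a delimiter; they can be recovered from the file part of each pointer. The Python sketch below groups the pointers by document, in the order listed in `corresp`, and recomputes the `n` value. The function names are made up for this illustration, and treating `corresp` as the authoritative document order is an assumption.

```python
def split_targets(targets, corresp):
    """Group 'file#id' pointers by file, following the file order in corresp."""
    docs = corresp.split()
    grouped = {doc: [] for doc in docs}
    for pointer in targets.split():
        doc, _, ident = pointer.partition("#")
        if doc not in grouped:
            raise ValueError(f"{pointer!r} points outside {docs}")
        grouped[doc].append(ident)
    return [grouped[doc] for doc in docs]


def link_type(targets, corresp):
    """Recompute the 'n:m' label from the pointers themselves."""
    return ":".join(str(len(ids)) for ids in split_targets(targets, corresp))


CORRESP = "oana-mk.xml oana-sl.xml"
two_to_one = (
    "oana-mk.xml#Omk.1.1.2.6 oana-mk.xml#Omk.1.1.2.7 oana-sl.xml#Osl.1.2.3.6"
)
print(link_type(two_to_one, CORRESP))
print(link_type("oana-sl.xml#Osl.4.12.2", CORRESP))
```

A check of this kind could be used to validate the redundant `n` attribute against `targets`; a 0:1 link remains expressible because the empty side simply contributes no pointers.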


Morphosyntactic specifications
• Define categories (PoS) and their features
• Map feature structures to morphosyntactic descriptions (MSD tagsets)
• Specify which languages have which features and tagsets
• E.g. [Category=Adverb, Type=general, Degree=superlative] ≡ Rgs, Tagset ∈ sl
• Complex morphology → complex specifications
• MSD tagsets are grounded in lexicon and corpus
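The mapping between a feature structure and an MSD such as Rgs is positional: the first character encodes the category, each following character one attribute. A minimal Python sketch follows; the `SPEC` table is an illustrative fragment that covers only the adverb example above, and the use of `-` for a non-applicable position is an assumption, made in line with tags such as `Va-p-sm` in the annotated sample.

```python
# Illustrative fragment only: just enough to cover the "Rgs" example.
SPEC = {
    "R": ("Adverb", [
        ("Type", {"g": "general"}),
        ("Degree", {"s": "superlative"}),
    ]),
}


def decode_msd(msd, spec=SPEC):
    """Expand an MSD string into a feature structure (a plain dict)."""
    category, attributes = spec[msd[0]]
    features = {"Category": category}
    for (name, values), code in zip(attributes, msd[1:]):
        if code == "-":  # assumed marker for "attribute not applicable"
            continue
        features[name] = values[code]
    return features


def encode_msd(features, spec=SPEC):
    """Collapse a feature structure back into its MSD string."""
    for cat_code, (category, attributes) in spec.items():
        if category == features["Category"]:
            break
    else:
        raise KeyError(features["Category"])
    codes = []
    for name, values in attributes:
        code_of = {value: code for code, value in values.items()}
        codes.append(code_of[features[name]] if name in features else "-")
    # Trailing non-applicable positions are dropped.
    return (cat_code + "".join(codes)).rstrip("-")


print(decode_msd("Rgs"))
print(encode_msd({"Category": "Adverb", "Type": "general", "Degree": "superlative"}))
```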


Example: common specifications
<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
  <head>Common specifications for Particle</head>
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
    ...
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>
          ....
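Because the specifications are ordinary TEI tables with `role` attributes, they can be read into a lookup structure with a generic XML parser. The Python sketch below works on a hand-closed copy of the fragment above; the language lists are therefore only as complete as the slide, and the function names and the shape of the resulting dictionary are choices made for this illustration. As before, the sample omits the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Hand-closed copy of the slide fragment; elided cells are simply absent.
SAMPLE = """
<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
  <head>Common specifications for Particle</head>
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>
        </row>
      </table>
    </cell>
  </row>
</table>
"""


def cell_texts(row, role):
    """Text of all direct <cell> children of a row that carry the given role."""
    return [c.text for c in row.findall("cell") if c.get("role") == role]


def read_spec(xml_text):
    """Turn one category table into a nested dict keyed by attribute and code."""
    table = ET.fromstring(xml_text)
    type_row = table.find("row[@role='type']")
    spec = {
        "category": cell_texts(type_row, "value")[0],
        "code": cell_texts(type_row, "code")[0],
        "langs": cell_texts(type_row, "lang"),
        "attributes": {},
    }
    for attr_row in table.findall("row[@role='attribute']"):
        name = cell_texts(attr_row, "name")[0]
        values = {}
        for value_row in attr_row.findall("cell/table/row[@role='value']"):
            values[cell_texts(value_row, "code")[0]] = {
                "name": cell_texts(value_row, "name")[0],
                "langs": cell_texts(value_row, "lang"),
            }
        spec["attributes"][name] = values
    return spec


spec = read_spec(SAMPLE)
print(spec["category"], spec["code"], spec["attributes"]["Type"])
```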


Language particular specifications
<div type="section" select="sl" xml:id="msd.Q-sl">
  <head>Slovene Particle</head>
  <table n="msd.cat" select="sl" xml:id="msd.cat.Q-sl">
    <head>Slovene Specification for Particle</head>
    <row role="type">
      <cell role="position">0</cell>
      <cell role="name" xml:lang="sl">besedna_vrsta</cell>
      <cell role="value" xml:lang="sl">členek</cell>
      <cell role="code" xml:lang="sl">L</cell>
      <cell role="name" xml:lang="en">CATEGORY</cell>
      <cell role="value" xml:lang="en">Particle</cell>
      <cell role="code" xml:lang="en">Q</cell>
    </row>
  </table>
  <p xml:lang="sl">Opombe:
    <list>
      <item>kot členki so označene le pojavnice, ki so navedene v leksikonu</item>
    </list>
  </p>
  <divGen xml:id="msd.Q-sl.lexicon" type="msd.lex" select="sl"/>
</div>

MTEsl = JOS


Encoding
• TEI provides needed elements, also for commentary, bibliography, ...
• TEI XSLT used to render as HTML
• Tables retained from MULTEXT
• Several XSLT scripts for MSD conversions, e.g. to collating sequence, to fvLib and fsLib
• Interesting challenge: conversion to isoCat (Adam P. for Polish tagset), OWL


MTE specifications in OWL (by Christian Chiarcos)


Morals, 1
• TEI good for in-place markup of richly annotated resources with varied structure:
  • Readable
  • Updatable (validation)
• Not good for huge datasets with shallow annotation:
  • Processable
  • Read only

→ useful for (small, medium size) gold standard hand-corrected language resources / "new" languages → localisation /


IMPACT @ JSI
• EU IP "Improving Access to Text"
• Make better OCR and IR for historical texts
• JSI: Developing a lemmatisation (+ modernisation) module for XIX century Slovene
• Background: Lexicon, Tagging and Lemmatisation for modern Slovene + FSA rewrite patterns
• Current dataset: AHLib (~100 books)
• AHLib marked up in TEI


AHLib Digital Library


IMPACT Lexicon


Mark-up challenges
• Text-critical apparatus vs. linguistic annotation
• "Parallel" corpora of transcriptions and modernisations
• Layered linguistic annotations: tokenisation, tagsets
• Lexicon (+dictionary) encoding


Morals, 2
• Text-critical editions use TEI anyway
• Ditto for DLs of historical texts
• HLT increasingly applied also to such texts
• TEI provides a good basis to join the two views


Current EU Projects: FlareNet
• Fostering Language Resources Network (2008-11)
• WG4 - Harmonisation of Formats and Standards
• D4.1 Identification of problems in the use of LR standards and of standardisation needs (M12):
  • "For academic purposes the TEI Guidelines (current version P5) has been a well established and widely used resource of LR-specific standards mainly for corpus analysis, markup and annotation. But TEI is hardly known in industrial communities (with a few exceptions) and completely foreign to professional groups such as localizers and translators. We see great potential in using TEI Guidelines in industrial contexts." /underlined by T.E./
• D4.2 Proposal of a European Language Resource Standards Framework (M24 / 2010-09-01)


Research Infrastructures for the Humanities
• DG Research funded RIs; pilot phase, 2008-2010
• DARIAH: ask Lou...
• EU RI CLARIN: Common Language Resources and Technology Infrastructure
  • WP5 Language Resources and Technologies Overview
  • D5C-3: Interoperability & Standards: "Due to the versatile nature of TEI, most of the following chapters include details on encoding digital text by following the P5 guidelines and conversion methods."


Morals, 3
• TEI is firmly acknowledged in current work on LR encoding standardisation
• But it is not prescriptive enough and lacks modules for many types of LRs

→ Need for constrained solutions & linkages to ISO/W3C standards:
  • Cross-walks
  • Roma & Schema "namespace" catalogue to DC, LMF, MAF, ...


TEI for LR: SWOT
• Strengths: Universality, Maturity, Community, Extensibility (compare ISO)
• Weaknesses: Vagueness, Learning curve, ISO/W3C linkage
• Opportunities: HLT (Humanities Language Technologies), New languages
• Threats: Marginalisation, Technical obsolescence


Conclusions
• Frontiers: DL+HLT, Gold standard LRs
• Priority: Instantiated connections to other standards and languages
• Connection with linguistics? SIG will tell...