Relish: Rendering Endangered Languages Lexicons Interoperable through Standards Harmonization
Marc [email protected]
Max Planck Institute for Psycholinguistics
SaLTMIL Workshop: Speech and Language Technology for Minority Languages
May 23rd 2010, LREC, Malta
Increase interoperability between endangered language lexica created on both sides of the Atlantic
Background
Lexica constitute an important record of endangered languages
Diverging European and American standards for data formatting and markup
LIFT/LLIFT vs. LMF
GOLD vs. ISOcat
Significant effort in tool support by all parties
Structural differences
Differences in terms and abbreviations
Differences in interchange formats
European and American Projects and Standards
MPI
ILIT
Dobes
Intera
DAM-LR
ECHO
CLARIN
LEGO
EMELD
Data Driven Ontology
GOLD Community
SIL
Lexicons of endangered languages
Standards for Terminology
DCR
GOLD
Standards for Lexicons
LMF
LIFT
ISO IS 12620:2009 DCR
ISO FDIS 24613:2008 LMF
UF
Methodology: Bottom-up approach
Analyze existing lexica to identify commonalities and differences in lexical structure and content
Tofa, Udi, Archi, Iwaidja, Mocovi, Salar, Kayardild
LLIFT example
<entry id="_123">
  <!-- here we've inserted the underscore so the id conforms to xml datatype ID, which cannot begin with a number. -->
  <trait name="original-id" value="123"/>
  <!-- this is where we'll keep the original id, since we may need it and we have to put an underscore
       in front of the entry id, so that it conforms to the datatype id format. We considered using <field>
       as it seemed more semantically appropriate, but <field> would require <form> inside it, which would
       in turn require a language attribute, and we don't want that. <field> has no appropriate attributes
       we could use, either. -->
  <lexical-unit> <!-- optional --> <!-- the headword -->
    <form lang="fuh">
      <!-- regarding the lang attribute: The format (based on RFC 4646bis or superseding document):
           ISO language code-script type-ISO country code. Only the ISO language code is really necessary, though.
           Q: What to do if we need more than one language code to cover a given form, though? For instance,
           in Tamashek, where what Heath calls 'dialects' have separate ISO codes?
           A: Use the private use 'x-' format, i.e.: taq-x-ttq-thz. NB everything following the x- is considered
           private use, so put anything conforming to the standard first.
           OR: x-qta (use a temp code, and map it in a URI to all three required codes). Not sure if this would
           work if we're trying to map individuals to their different possible combinations of dialects, though. -->
      <text>cow</text>
    </form>
  </lexical-unit>
  <variant> <!-- optional -->
    <!-- alternate spellings or forms - these can't have any different meaning or grammatical info,
         as variant can't have <sense> under it. -->
    <form lang="fuh">
      <text>dabere</text>
    </form>
  </variant>
  <variant> <!-- a second variant is possible -->
    <form lang="fuh">
      <text>dabbere</text>
    </form>
  </variant>
</entry>
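The comment on the lang attribute describes putting standard subtags first and private-use subtags after 'x-'. A minimal Python sketch of splitting such a tag is given here; the function name split_private_use is hypothetical and not part of LIFT or any RELISH tool, and the sketch does no validation of the standard part.

```python
from typing import List, Tuple


def split_private_use(tag: str) -> Tuple[str, List[str]]:
    """Split a language tag into its standard part and its private-use subtags.

    Assumption for this sketch: the first subtag that is exactly "x"
    starts the private-use section, as in taq-x-ttq-thz.
    """
    subtags = tag.split("-")
    lowered = [subtag.lower() for subtag in subtags]
    if "x" in lowered:
        index = lowered.index("x")
        return "-".join(subtags[:index]), subtags[index + 1:]
    return tag, []


standard, private = split_private_use("taq-x-ttq-thz")
```

For the tag from the comment, the standard part is the ISO code taq and the private-use subtags carry the two further dialect codes; a fully private tag such as x-qta has an empty standard part.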
Shoebox example
\_sh v3.0 400 Iwaidja
\_DateStampHasFourDigitYear

\lx a
\lc Lexical citation ((R) => root)
\ps Part of speech
\de Definition
\ge Gloss-English
\re Reversal
\xv Example vernacular
\xe Example English
\rf Reference for example
\dt 11/Jul/2007

\lx a-
\lc a-
\a a-
\ps v. prefix
\de third person plural intransitive subject prefix
\ge 3pl
\re they
\ng This is the neutral form; the 'towards' form is |fv{ayuwu-}, 'away' form is |fv{ijb-} ~ |fv{ijuwu-}
\sd verb prefix
\sd inflectional prefix
\rf PL93
\xv Amalkban.
\xe They move outside.
\dt 15/Jul/2007

\lx a-
\lc a-
\a a-
\ps n. pref.
\de their (with possessed body parts)
\ge 3pl
\re their (with possessed body parts)
\sd noun prefix
\sd inflectional prefix
\dt 29/Nov/2006
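A bottom-up analysis of a Shoebox lexicon can start from the inventory of markers it actually uses. The Python sketch below is only an illustration, not a RELISH tool: the function names are hypothetical, it assumes \lx is the record marker, and it treats lines starting with \_ as file header lines.

```python
from typing import Dict, List, Tuple

# Ordered (marker, value) pairs; markers such as \sd may repeat within a record.
Record = List[Tuple[str, str]]


def parse_standard_format(text: str, record_marker: str = "lx") -> List[Record]:
    """Split backslash-coded text into records, each starting at record_marker."""
    records: List[Record] = []
    current: Record = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker.startswith("_"):
                continue  # header lines such as \_sh belong to no record
            if marker == record_marker and current:
                records.append(current)
                current = []
            current.append((marker, value.strip()))
        elif current:
            # A line without a marker continues the previous field.
            last_marker, last_value = current[-1]
            current[-1] = (last_marker, (last_value + " " + line.strip()).strip())
    if current:
        records.append(current)
    return records


def marker_inventory(records: List[Record]) -> Dict[str, int]:
    """Count how often each marker occurs across all records."""
    counts: Dict[str, int] = {}
    for record in records:
        for marker, _ in record:
            counts[marker] = counts.get(marker, 0) + 1
    return counts


sample = r"""\_sh v3.0 400 Iwaidja
\lx a-
\ps v. prefix
\ge 3pl
\sd verb prefix
\sd inflectional prefix
\dt 15/Jul/2007
\lx a-
\ps n. pref.
\ge 3pl
\dt 29/Nov/2006
"""

records = parse_standard_format(sample)
inventory = marker_inventory(records)
```

Comparing such inventories across lexica gives a first view of which fields they share and where they differ.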
Lexus example
<lexicalEntry>
  <headword_x0020_group>
    <date_x0020__x0028_last_x0020_entered_x0029_>11/Jul/2007</date_x0020__x0028_last_x0020_entered_x0029_>
    <headword>a</headword>
    <citation_x0020_form>Lexical citation ((R) => root)</citation_x0020_form>
    <part_x0020_of_x0020_speech_x0020_group>
      <part_x0020_of_x0020_speech/>
      <sense_x0020_number_x0020_group>
        <contextualized_x0020_example_x0020_group>
          <example_x0020__x0028_free_x0020_translation_x0029_/>
          <contextualized_x0020_example/>
        </contextualized_x0020_example_x0020_group>
        <definition_x0020_group>
          <English_x0020_reversal/>
          <English_x0020_gloss/>
          <definition/>
        </definition_x0020_group>
        <reference_x0020_group>
          <reference/>
        </reference_x0020_group>
      </sense_x0020_number_x0020_group>
    </part_x0020_of_x0020_speech_x0020_group>
  </headword_x0020_group>
</lexicalEntry>
<lexicalEntry>
  <headword_x0020_group>
    <date_x0020__x0028_last_x0020_entered_x0029_>12/Jul/2007</date_x0020__x0028_last_x0020_entered_x0029_>
    <headword>^(d)angkarranaka</headword>
    <citation_x0020_form>angkarranaka</citation_x0020_form>
    <part_x0020_of_x0020_speech_x0020_group>
      <part_x0020_of_x0020_speech>?</part_x0020_of_x0020_speech>
      <sense_x0020_number_x0020_group>
        <reference_x0020_group>
          <reference>IwNo05:19Ap</reference>
        </reference_x0020_group>
        <contextualized_x0020_example_x0020_group>
        ...</reference>
        </reference_x0020_group>
        <_x0032_D_x0020_group>
          <grammatical_x0020_note>The d-initial form is found after prefixes ending in K-; elsewhere the root begins with |fv{a}. The citation form is |fv{dangkarranaka}.</grammatical_x0020_note>
        </_x0032_D_x0020_group>
        ...
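The element names in the Lexus export carry characters that are not allowed in XML names as _xHHHH_ escapes, where HHHH is a hexadecimal code point (_x0020_ is a space, _x0028_ and _x0029_ are parentheses). A small Python sketch for reading them back follows; decode_xml_name is a hypothetical helper, not a Lexus function.

```python
import re

# One escape: "_x", four hexadecimal digits, "_".
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def decode_xml_name(name: str) -> str:
    """Replace each _xHHHH_ escape with the character at that code point."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), name)


label = decode_xml_name("date_x0020__x0028_last_x0020_entered_x0029_")
```

Decoded this way, the element names become the human-readable field labels of the lexicon schema, which are easier to compare with Shoebox marker names.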
Methodology: Top-down approach
Analyze existing standards for lexical resources (GOLD/LIFT and LMF/DCR) to identify commonalities and differences at the conceptual level.
Harmonize concepts using the ISO 12620 Data Category Registry
Harmonize model approaches
Harmonize interchange formats
Harmonizing 12620 data categories
All linguistic concepts will be registered in the ISO 12620 Data Category Registry (ISOcat)
Analysis of existing ISOcat data categories vs. GOLD vs. MDF
ISOcat 12620 Data Category Registry
GOLD Community
\+DatabaseType MDF 4.0
\ver 5.0
\desc Standard Format markers defined in _Making Dictionaries: A guide to lexicography and the Multi-Dictionary Formatter_. David F. Coward, Charles E. Grimes, and Mark R. Pedrotti. Waxhaw, NC: SIL, 1998. (2nd edition)
\+mkrset
\lngDefault English
\mkrRecord lx

\+mkr an
\nam Antonym
\desc Used to reference an antonym of the lexeme, but using the \lf (lexical function) field for this is better practice.
\lng vernacular
\mkrOverThis sn
\CharStyle
\-mkr

\+mkr bw
\nam Borrowed word (loan)
\desc Used for denoting the source language of a borrowed word.
\lng English
\mkrOverThis se
\CharStyle
\-mkr

\+mkr ce
\nam Cross-ref. gloss (E)
\desc Gives the English gloss(es) for the vernacular lexeme referenced by the preceding \cf field.
\lng English
\mkrOverThis cf
\CharStyle
\-mkr
MDF type file
Harmonizing 12620 data categories
Example: part of speech
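[Figure: part of speech in the ISOcat MorphoSyntax Profile compared with the GOLD ontology. Node labels: PartOfSpeech, Determiner, article, Definite article, Indefinite article, linked by "Is a" relations; data category types: Complex, Closed, Simple]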
\+mkr ps
\nam Part of speech
\desc Classifies the part of speech. This must reflect the part of speech of the vernacular lexeme (not the national or English gloss). Consistent labeling is important; use the Range Set feature. Sense numbers are beneath \ps in this hierarchy; don't mark different \ps fields with sense numbers.
\lng English
\rngset adj adv …… n num pn post prtcl v
\mkrOverThis se
\mkrFollowingThis va
\CharStyle
\-mkr
MDF (Multi-Dictionary Formatter)
Harmonizing 12620 data categories: GOLD example 2
In some cases GOLD contains additional information
Additional extensions to the conceptual domain
isA relations between GOLD concepts
GOLD ontology
Harmonizing 12620 data categories: Relation Registries
Relation Registries describe relations not handled through the ISO 12620 model
Simple relations, e.g. MDF /PartOfSpeech/ ‘equals’ MorphoSyntax /PartOfSpeech/
GOLD relations (the GOLD ontology is a Relation Registry)
Compositional Relations (DC is composed of multiple more granular DCs)
e.g. UDI MDF \1d (First dual) person:firstPerson, grammaticalNumber: dual, value:…
Model-specific relations, e.g. TBX model
[Figure: tbx:hasPartOfSpeech (property) with domain and range; tbx:termNoteType (class); datcat:partOfSpeech (class); datcat:Verb, datcat:Noun, datcat:properNoun (instances)]
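The simple and compositional relations above can be pictured as a lookup table kept outside the data categories themselves. The Python sketch below is a toy model only: the class RelationRegistry, its methods, and the string identifiers are assumptions made for illustration and do not reflect the actual Relation Registry design. The elided "value:…" part of the \1d example is left out.

```python
from typing import Dict, List


class RelationRegistry:
    """Toy registry mapping a source data category to its target categories."""

    def __init__(self) -> None:
        self._targets: Dict[str, List[str]] = {}

    def add_equals(self, source: str, target: str) -> None:
        # Simple relation: one category corresponds to exactly one other.
        self._targets[source] = [target]

    def add_composed_of(self, source: str, parts: List[str]) -> None:
        # Compositional relation: one category is built from several finer ones.
        self._targets[source] = list(parts)

    def resolve(self, source: str) -> List[str]:
        # Unregistered categories resolve to themselves.
        return list(self._targets.get(source, [source]))


registry = RelationRegistry()
registry.add_equals("MDF:/PartOfSpeech/", "MorphoSyntax:/PartOfSpeech/")
registry.add_composed_of(
    r"UDI-MDF:\1d",
    ["person:firstPerson", "grammaticalNumber:dual"],
)

first_dual = registry.resolve(r"UDI-MDF:\1d")
```

Keeping the relations in a separate structure means the data categories can stay unchanged while different communities register their own mappings.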
Harmonizing 12620 data categories: Relation Registries
Relation registries
Data Category registries
resource registries
Harmonizing interchange formats: Possibility to use TEI?
Can TEI serve as an interchange format for LMF and be accepted by the CLARIN community?
A decision needs to be made before the end of 2010 to be useful for RELISH
ODD (One Document Does it all)
Documentation
Schema information
Schema documents validate the XML data structure
A workshop will be held in August to discuss the possibility of using TEI as an interchange format, with representatives from ISO, CLARIN, TEI and the endangered languages community
Adapting the tools
The Relish project will result in tools adapted to support the interoperability aspects and interchange formats
Conclusions and remarks
Minority and less-resourced languages and tools
are starting to actively participate in the standards discussions
are becoming part of the e-infrastructure landscape
have the opportunity to play a mature role in the area of language resources
We need organizations and individuals who are actively involved and represent the position of less-resourced languages in these discussions
Results from the Relish project may be useful for other less-resourced language resources as well
Thank you for your attention
Relish was made possible through the DFG/NEH Bilateral Digital Humanities Program