Closing the Gap: Data Models for Documentary Linguistics

Baden Hughes
Department of Computer Science and Software Engineering
The University of Melbourne
[email protected]


Talk at Latrobe University (May 2005, Melbourne)



Latrobe Uni - Linguistics Seminar - 20050505 2

Overview

• Overall Context
• The Electronic Data Format Challenge
• Common Problems
• Data Encoding Models
  • Lexicons, interlinear texts, paradigms, syntactic trees, annotation standards, query languages
• Linguistic Motivations vs Computational Interests
• New Types of Data Exploration
• Effects on Linguistic Analysis
• New Tools
• Conclusions


Overall Context

• Large amounts of human language data continue to be managed in electronic form and analysed in fieldwork-driven linguistic documentation
• An increasing focus on acquisition-centric methodologies has vastly increased the rate of growth of linguistic data
• Basic linguistic data structures remain reasonably static and are largely grounded in the print domain


The Electronic Data Format Challenge

• The methods used for the digital encoding of linguistic data are often disparate
  • Often reduced, at best, to the native formats supported by widely used tools such as Shoebox
  • Conversion is typically complex and lossy
    • Sometimes the loss can’t be predicted in advance
  • Many utility manipulation functions are required to move data between analytical applications and outputs
  • These functions are largely external to analytical environments, with some notable exceptions (e.g. regular expression manipulation)
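The point about lossy conversion can be made concrete with a minimal sketch. It assumes Shoebox-style "standard format" input, in which each field starts with a backslash marker, and uses the common `\lx`, `\ps` and `\ge` markers. The function name `parse_standard_format` and the toy entries are invented for illustration and do not come from the talk.

```python
def parse_standard_format(text, record_marker="lx"):
    """Parse backslash-marked fields into records.

    Each record is a list of (marker, value) pairs, so field order and
    repeated markers survive the conversion.
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate fields or records
        if not line.startswith("\\"):
            # A line without a marker continues the previous field.
            if current:
                marker, value = current[-1]
                current[-1] = (marker, f"{value} {line}".strip())
            continue
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = []  # the record marker opens a new record
            records.append(current)
        if current is not None:  # skip header fields before the first record
            current.append((marker, value.strip()))
    return records


# Invented toy entries, not real language data.
sample = r"""
\lx kata
\ps n
\ge house

\lx miru
\ps v
\ge see
\ge look at
"""

records = parse_standard_format(sample)
print(records[1])
# Flattening a record into a dict keeps only one value per marker,
# so the first \ge field of the second entry is silently dropped.
print(dict(records[1]))
```

Keeping each record as an ordered list of pairs is the design choice that avoids the loss: a structure that allows only one value per marker discards repeated fields without any warning.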


Common Problems

• Despite diversity of language and analytical approach, many documentary and descriptive linguists face a common challenge: the interoperability and longevity of electronic data generated in fieldwork settings.
  • Repurposing data
  • Publishing data on the web
  • Publishing in papers
  • New analysis tools
  • New generation formats


The Emergence of Abstract Language Data Encoding Models

• Recently, a number of formal data encoding models for linguistic data types have emerged from projects investigating "best practice" methods for preserving linguistic data.
• We will briefly consider models for
  • lexicons
  • interlinear texts
  • paradigms
  • syntactic trees
  • annotation standards
  • query languages


Data Models (1)

• Lexicons
  • Bell & Bird (2001)
• Interlinear Text
  • Bow, Hughes & Bird (2003)
  • Hughes, Bird & Bow (2003)
• Linguistic Paradigms
  • Penton, Bow, Bird & Hughes (2004)
  • Penton & Bird (2004)


Data Models (2)

• Syntactic Trees
  • Lai & Bird (2004)
• Annotation Standards
  • Farrar, Lewis & Langendoen (2002)
  • Farrar & Langendoen (2003)
• Query Languages
  • Bird, Chen, Davidson, Lee & Zheng (2005)
  • Cassidy & Bird (2000)
  • Taylor (2004)
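To give a feel for what querying a syntactic tree involves, here is a rough sketch using only the limited XPath subset in Python's standard `xml.etree.ElementTree`. The XML layout, the element names (`S`, `NP`, `VP`, `W`), the `pos` attribute and the helper names are assumptions made for illustration. They are not the encodings or query languages proposed in the works cited above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of a small constituency tree.
TREE_XML = """
<S>
  <NP><W pos="PRP">I</W></NP>
  <VP>
    <W pos="VBD">saw</W>
    <NP><W pos="DT">the</W><W pos="NN">dog</W></NP>
  </VP>
</S>
"""


def words(node):
    """Return the words dominated by a node, in document order."""
    return [w.text for w in node.iter("W")]


def object_nps(root):
    """Return the word sequences of NPs that are direct children of a VP."""
    return [words(np) for np in root.findall(".//VP/NP")]


root = ET.fromstring(TREE_XML)
print(object_nps(root))                                   # [['the', 'dog']]
print([w.text for w in root.findall(".//W[@pos='NN']")])  # ['dog']
```

Dominance queries such as "an NP inside a VP" map naturally onto path expressions. Queries about linear order between constituents are harder to express this way, which is one reason dedicated linguistic query languages are of interest.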


Linguistic Motivations

• Data models – so what?
  • It is the combined utility of these models that makes them attractive to documentary linguists
  • The challenge is to lower the barrier to use of these technologies in fieldwork and analytical contexts
  • Linguists (mostly) don’t care about the technology, they just want to do linguistics!
  • Computer scientists are generally not interested in linguistics …


Computational Interests

• The development of such models may be inherently interesting to computationally inclined researchers
  • Human language data encoding and annotation is genuinely interesting in computer science terms; unfortunately basic data modelling isn't
• Technologists have a bad habit of providing advice which is well intended but lacks traction for non-technical communities (e.g. “use XML”)
• Many of the solutions are XML-based, but contain many more components than just XML-encoded data


New Types of Data Exploration (1)

• Open implemented solutions for a range of manipulations are available
  • Lexicons
    • Generation of different types of lexicons
  • Interlinear Text (see following examples …)
    • Generation of different types of interlinear text
    • Induction of morphosyntactic glossing from lexicons
    • Generation of lexicons from interlinear text
    • Enrichment of lexicons from interlinear text
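Two of the interlinear-text manipulations listed above, generating a lexicon from interlinear text and inducing glosses from a lexicon, can be sketched in a few lines. The sketch assumes a simple nested structure (phrase, word, morpheme) and uses a segmented English sentence as stand-in data. The class and function names are hypothetical and do not reproduce any published model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Morpheme:
    form: str
    gloss: str


@dataclass
class Word:
    morphemes: list


@dataclass
class Phrase:
    words: list
    translation: str = ""


def build_lexicon(phrases):
    """Collect every gloss seen for each morpheme form."""
    lexicon = defaultdict(set)
    for phrase in phrases:
        for word in phrase.words:
            for m in word.morphemes:
                lexicon[m.form].add(m.gloss)
    return dict(lexicon)


def gloss_word(forms, lexicon):
    """Propose a gloss for a segmented word.

    Unknown forms are glossed '?'; competing glosses are joined with '/'.
    """
    return "-".join("/".join(sorted(lexicon.get(f, {"?"}))) for f in forms)


# Stand-in data: a segmented English phrase, used only for illustration.
text = [
    Phrase(
        words=[
            Word([Morpheme("the", "DET")]),
            Word([Morpheme("dog", "dog"), Morpheme("s", "PL")]),
            Word([Morpheme("bark", "bark"), Morpheme("ed", "PST")]),
        ],
        translation="the dogs barked",
    )
]

lexicon = build_lexicon(text)
print(gloss_word(["dog", "s"], lexicon))  # dog-PL
print(gloss_word(["cat", "s"], lexicon))  # ?-PL
```

Enriching an existing lexicon works the same way: merge the sets returned by `build_lexicon` into the entries already present, so that new glosses are added without overwriting old ones.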


Nenets Interlinear (1)


Nenets Interlinear (2)


New Types of Data Exploration (2)

• Open implemented solutions for a range of manipulations are available
  • Syntactic Trees
    • Induction of trees from interlinear text
    • Creation of interlinear text from syntactic tree drawing
    • Creation of lexicons from syntactic trees
  • Paradigms (see following examples …)
    • Generation of different types of paradigms
    • Induction of paradigms from interlinear text
    • Annotation of interlinear text from paradigms
    • Enrichment of lexicons from paradigms
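"Generation of different types of paradigms" can be illustrated by storing a paradigm as a flat list of feature-annotated forms and pivoting it on whichever attributes are wanted as rows and columns. This is only a sketch under that assumption. The function names `pivot` and `render` are hypothetical, and the data is the present tense of English "be".

```python
def pivot(cells, row_attr, col_attr, value_attr="form"):
    """Arrange flat cells into a nested row -> column -> [forms] table."""
    table = {}
    for cell in cells:
        row = table.setdefault(cell[row_attr], {})
        row.setdefault(cell[col_attr], []).append(cell[value_attr])
    return table


def render(table, columns):
    """Render a pivoted table as tab-separated text; '-' marks empty cells."""
    lines = ["\t".join([""] + list(columns))]
    for row_label, row in table.items():
        cells = ["/".join(row.get(col, ["-"])) for col in columns]
        lines.append("\t".join([row_label] + cells))
    return "\n".join(lines)


# Present tense of English "be", one dict per paradigm cell.
BE_PRESENT = [
    {"person": "1", "number": "sg", "form": "am"},
    {"person": "2", "number": "sg", "form": "are"},
    {"person": "3", "number": "sg", "form": "is"},
    {"person": "1", "number": "pl", "form": "are"},
    {"person": "2", "number": "pl", "form": "are"},
    {"person": "3", "number": "pl", "form": "are"},
]

# The same data laid out two ways: persons as rows, then numbers as rows.
print(render(pivot(BE_PRESENT, "person", "number"), ["sg", "pl"]))
print(render(pivot(BE_PRESENT, "number", "person"), ["1", "2", "3"]))
```

Because the layout is computed from the flat cells rather than stored, the same data can feed a printed table, a web page, or an annotation step that looks up the features of a given form.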


Kanarese Paradigm (1)


Kanarese Paradigm (2)


Effects on Linguistic Analysis

• Integrated encoding standards for linguistic data affect the practice of linguistic analysis
  • Some analysis types are now easier
  • New possibilities emerge
  • New analytical challenges are discovered
  • Data linkage/integration is certainly one of the improvements


New Tools

• The next generation of tools which support these data models natively is emerging, e.g. FIELD, ELAN, Toolbox (almost)
• “Middleware” which allows the translation of legacy formats to and from these models is reasonably widely available
• Analytical tools are increasingly being implemented with web-grounded technologies and using web-derived models
• Open source/open data approaches are becoming pervasive


Conclusion

• The aim is to reduce the gap between computationally tractable representations, on which a high degree of functionality can be built, and the simple underlying formats driven by fieldwork-oriented tools
• Reducing the intermediate data-munging steps, which require technical rather than linguistic knowledge, is advantageous to all parties
• While we are not quite “there yet”, the light at the end of the tunnel is definitely visible
• There is a growing community of philosophically aligned computer scientists and linguists


References

• Bell & Bird, 2001. A Preliminary Study of the Structure of Lexicon Entries. Proceedings of the Workshop on Web-Based Language Documentation and Description.
• Bow, Hughes & Bird, 2003. Towards a General Model for Interlinear Text. Proceedings of EMELD 2003.
• Farrar, Lewis & Langendoen, 2002. A Common Ontology for Linguistic Concepts. Proceedings of the Knowledge Technologies Conference.
• Farrar & Langendoen, 2003. A linguistic ontology for the Semantic Web. GLOT International 7(3).
• Hughes, Bird & Bow, 2003. Encoding and Presenting Interlinear Text Using XML Technologies. Proceedings of ALTW 2003.
• Lai & Bird, 2004. Querying and Updating Treebanks: A Critical Survey and Requirements Analysis. Proceedings of ALTW 2004.
• Penton, Bow, Bird & Hughes, 2004. Towards a General Model for Linguistic Paradigms. Proceedings of EMELD 2004.
• Penton & Bird, 2004. Representing and Rendering Linguistic Paradigms. Proceedings of ALTW 2004.
• Bird, Chen, Davidson, Lee & Zheng, 2005. Extending XPath to Support Linguistic Queries. Proceedings of PLANX 2005.
• Cassidy & Bird, 2000. Querying databases of annotated speech. Proceedings of the Eleventh Australasian Database Conference.
• Taylor, 2004. XSLT as a Linguistic Query Language. BSc(Hons) Thesis, University of Melbourne.


Questions? Comments?