OLAC: The Open Language Archives Community

Gary F. Simons
SIL International and Graduate Institute of Applied Linguistics

DRIVER Summit, Goettingen, 16-17 Jan 2008

What is OLAC?

www.language-archives.org

► OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by:
    Developing consensus on best current practice for the digital archiving of language resources
    Developing a network of interoperating repositories and services for housing and accessing such resources

► Founded in December 2000
    Now has 34 participating archives
    12 European participants (bolded on next slide)

Who’s involved?

► Aboriginal Studies Electronic Data Archive
► Academia Sinica
► Alaska Native Language Center
► Archive of Indigenous Languages of Latin America
► ATILF Resources
► Berkeley Language Center
► Centre de Ressources pour la Description de l'Oral
► CHILDES Data Repository
► Comparative Corpus of Spoken Portuguese
► Cornell Language Acquisition Laboratory
► Dictionnaire Universel Boiste 1812
► DOBES catalogue (MPI, Nijmegen)
► Ethnologue: Languages of the World
► European Language Resources Association
► Laboratoire Parole et Langage
► Linguistic Data Consortium Corpus Catalog
► LINGUIST List Language Resources
► Natural Language Software Registry
► Online Database of Interlinear Text (ODIN)
► Oxford Text Archive
► PARADISEC
► Perseus Digital Library
► Research Papers in Computational Linguistics
► Rosetta Project 1000 Language Archive
► SIL Language and Culture Archives
► Surrey Morphology Group Databases
► Survey for California and Other Indian Languages
► TalkBank
► Tibetan and Himalayan Digital Library
► TRACTOR
► Typological Database Project
► University of Bielefeld Language Archive
► University of Queensland Flint Archive
► Virtual Kayardild Archive (Melbourne)

How does it work?

► Based on OAI Protocol for Metadata Harvesting
    Adds a community-specific archive description to the Identify response
    Defines a new olac metadata format
    We operate a static repository gateway for participants with small collections (needs olac format only)
        http://www.language-archives.org/sr
    We operate an aggregator that harvests all participants and crosswalks them to oai_dc format
        http://www.language-archives.org/cgi-bin/olaca3.pl
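
As a minimal sketch of what a harvester might send to the aggregator, the Python below only builds OAI-PMH request URLs and does not contact any server. The verbs (Identify, ListRecords) and the metadataPrefix argument come from the OAI-PMH specification, and the base URL is the aggregator address given above. The helper name `oai_request_url` is hypothetical.

```python
from urllib.parse import urlencode

# Aggregator base URL as given on the slide.
OLAC_AGGREGATOR = "http://www.language-archives.org/cgi-bin/olaca3.pl"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL (hypothetical helper, no network access)."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Identify returns the repository description, including the OLAC archive description.
identify_url = oai_request_url(OLAC_AGGREGATOR, "Identify")

# The same records can be requested in the olac format or crosswalked to oai_dc.
olac_url = oai_request_url(OLAC_AGGREGATOR, "ListRecords", metadataPrefix="olac")
dc_url = oai_request_url(OLAC_AGGREGATOR, "ListRecords", metadataPrefix="oai_dc")

for url in (identify_url, olac_url, dc_url):
    print(url)
```

A generic OAI-PMH harvester that only understands oai_dc would use the third URL; an OLAC-aware service would use the second to keep the community-specific qualifiers.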

OLAC metadata format

► Based on the Dublin Core metadata set
    Record format follows the DC guidelines for implementing Qualified DC in XML
    Adds community-specific controlled vocabularies:
        Linguistic Data Type to qualify Type
        Linguistic Field to qualify Subject
        Participant Role to qualify Creator and Contributor
        ISO 639-3 to qualify Language and Subject
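
To make the qualification pattern concrete, here is a rough sketch that builds one record with Python's standard ElementTree module. Several things are assumptions rather than facts from this talk: the OLAC namespace URI, the `xsi:type` / `olac:code` attribute pattern, and the code values `language_description`, `syntax`, and `author` are written from recollection of OLAC's published schema and vocabularies and should be checked against them. The title, the author name, and the helper `qualified` are hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
# Assumption: namespace URI of the OLAC metadata schema; verify before use.
OLAC_NS = "http://www.language-archives.org/OLAC/1.0/"

for prefix, uri in (("dc", DC_NS), ("xsi", XSI_NS), ("olac", OLAC_NS)):
    ET.register_namespace(prefix, uri)


def qualified(parent, element, vocabulary, code, text=None):
    """Add a DC element qualified by an OLAC vocabulary (hypothetical helper)."""
    child = ET.SubElement(parent, f"{{{DC_NS}}}{element}")
    child.set(f"{{{XSI_NS}}}type", f"olac:{vocabulary}")
    child.set(f"{{{OLAC_NS}}}code", code)
    if text:
        child.text = text
    return child


record = ET.Element(f"{{{OLAC_NS}}}olac")
title = ET.SubElement(record, f"{{{DC_NS}}}title")
title.text = "A grammar of Lushootseed"  # hypothetical resource

qualified(record, "language", "language", "eng")                      # written in English
qualified(record, "subject", "language", "lut", "Lushootseed")        # about Lushootseed
qualified(record, "type", "linguistic-type", "language_description")  # assumed code value
qualified(record, "subject", "linguistic-field", "syntax")            # assumed code value
qualified(record, "creator", "role", "author", "A. N. Author")        # assumed code value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The design point is that each qualifier rides on a plain Dublin Core element, so a consumer that ignores the OLAC attributes still sees a valid DC record.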

Controlled vocabularies for language identification

► Situation: 6,912 living languages are used throughout the world
    Source: Ethnologue, 15th edition
    http://www.ethnologue.com

► Problem: The standard used in the library community (MARC language codes, or ISO 639-2)
    Has codes for fewer than 400 languages
    Uses 66 “collective” codes to handle the other 6,500, e.g.
        South American Indian (Other) [sai] covers 421 languages
        Bantu (Other) [bnt] covers 612 languages
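
The size of the problem can be worked out from the figures on this slide. The short Python sketch below does only that arithmetic; the even-spread assumption in `chance_of_intended_language` is a deliberate simplification for illustration, and all names are hypothetical.

```python
# Figures quoted on the slide.
LIVING_LANGUAGES = 6912            # Ethnologue, 15th edition
LANGUAGES_WITHOUT_OWN_CODE = 6500  # "the other 6,500"
COLLECTIVE_CODES = 66
COLLECTIVE_COVERAGE = {"sai": 421, "bnt": 612}  # languages per collective code

# On average, how many languages share one collective code?
average_per_collective_code = LANGUAGES_WITHOUT_OWN_CODE / COLLECTIVE_CODES

# What share of living languages has no individual code in ISO 639-2?
share_without_own_code = LANGUAGES_WITHOUT_OWN_CODE / LIVING_LANGUAGES


def chance_of_intended_language(collective_code):
    """Chance that a record tagged with a collective code concerns one given
    language, assuming records are spread evenly over the covered languages."""
    return 1 / COLLECTIVE_COVERAGE[collective_code]


print(f"{average_per_collective_code:.1f} languages per collective code on average")
print(f"{share_without_own_code:.0%} of living languages lack an individual code")
print(f"{chance_of_intended_language('sai'):.2%} chance for a single [sai] language")
```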

ISO 639-3

► In 2002, ISO TC37 invited SIL to propose a comprehensive standard compatible with 639-2

► Result: ISO 639-3, Alpha-3 code for comprehensive coverage of languages (published 2007-02-05)
    Codes for ~6,900 living languages
    Codes for ~600 extinct, historical, ancient, and constructed languages
    RA site: http://www.sil.org/639-3/

► OLAC uses this controlled vocabulary for identifying the languages a resource is in or about

What is the current coverage of OLAC?

                                 All archives   Excluding Ethnologue
Items in catalog                    30,591            23,292
ISO 639-3 languages included         7,299             3,134
Items with online open access       16,018             8,719
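
The two columns can be compared directly. The sketch below subtracts them to estimate what Ethnologue alone contributes and computes the open-access share in each column; it uses only the six figures from the table, and the variable and function names are hypothetical.

```python
# Coverage figures from the table above.
coverage = {
    "items": {"all": 30591, "excluding_ethnologue": 23292},
    "languages": {"all": 7299, "excluding_ethnologue": 3134},
    "open_access_items": {"all": 16018, "excluding_ethnologue": 8719},
}


def ethnologue_contribution(row):
    """Difference between the two columns for one row of the table."""
    return row["all"] - row["excluding_ethnologue"]


ethnologue_items = ethnologue_contribution(coverage["items"])
ethnologue_open_items = ethnologue_contribution(coverage["open_access_items"])
languages_only_in_ethnologue = ethnologue_contribution(coverage["languages"])

open_share_all = coverage["open_access_items"]["all"] / coverage["items"]["all"]
open_share_excluding = (
    coverage["open_access_items"]["excluding_ethnologue"]
    / coverage["items"]["excluding_ethnologue"]
)

print(ethnologue_items, ethnologue_open_items, languages_only_in_ethnologue)
print(f"{open_share_all:.1%} open access overall")
print(f"{open_share_excluding:.1%} open access excluding Ethnologue")
```

Both item differences equal the number of languages covered in the first column. One plausible reading, not stated on the slide, is that Ethnologue contributes one openly accessible record per language.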

Current developments

► In first year of a 3-year NSF-sponsored grant to increase use and coverage by an order of magnitude
    1. Develop guidelines and services that encourage best common practices among language archives that will facilitate language resource discovery with precision through OLAC (and attract more archives to join).
    2. Develop services to bridge the resource catalogs of the repository, library, and web domains (e.g. OAI, MARC, Google) to facilitate language resource discovery with precision through OLAC.
        E.g. User searches OLAC aggregator for a specific 639-3 code and finds hits in external aggregators
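
The search scenario in item 2 could look roughly like the following Python sketch. It is an illustration only: the catalogs, records, and function names are all hypothetical, and it assumes the external catalog's language information has already been mapped to ISO 639-3 codes, which is the bridging work the grant is meant to deliver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogRecord:
    catalog: str           # which catalog the record came from
    title: str
    language_codes: tuple  # ISO 639-3 codes the resource is in or about


def search_by_language(code, *catalogs):
    """Return records from every catalog that carry the given ISO 639-3 code."""
    return [record
            for catalog in catalogs
            for record in catalog
            if code in record.language_codes]


# Hypothetical sample data.
olac_records = [
    CatalogRecord("OLAC", "Hypothetical Lushootseed lexicon", ("lut",)),
    CatalogRecord("OLAC", "Hypothetical English corpus", ("eng",)),
]
external_records = [
    CatalogRecord("external", "Hypothetical grammar of Lushootseed", ("lut", "eng")),
    CatalogRecord("external", "Hypothetical French dictionary", ("fra",)),
]

hits = search_by_language("lut", olac_records, external_records)
for hit in hits:
    print(hit.catalog, "|", hit.title)
```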

External interoperation

► Strategy 1: Use existing cataloging information to identify languages with precision
    639-2 codes for individual languages
    Language names in LC subject headings, Call numbers

► Strategy 2: Promote use of 639-3 in cataloging
    ISO639-3 is now an encoding scheme in DC Terms
    iso639-3 has been added to the MARC standard as a recognized identifier for a source in Field 041
    E.g., these are valid 041 fields for a grammar in English of Lushootseed [lut] of the Salishan [sal] family
        041 1_$asal$aeng (using 639-2 by default)
        041 17$alut$aeng$2iso639-3 (using 639-3)
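
A service bridging MARC catalogs would need to read fields like the two above. The Python sketch below parses the display form used on this slide (tag, two indicator characters, `$`-delimited subfields). It is a simplified illustration, not a general MARC parser: it assumes a second indicator of 7 means the code source is named in subfield $2, and it labels the default source `iso639-2` following the slide's wording. The function name `parse_041` is hypothetical.

```python
def parse_041(field):
    """Parse a display-form MARC 041 field, e.g. '041 17$alut$aeng$2iso639-3'."""
    tag, rest = field.split(" ", 1)
    if tag != "041":
        raise ValueError(f"not an 041 field: {field!r}")

    indicators, subfield_text = rest[:2], rest[2:]

    # Each chunk after a '$' is one subfield: a one-character code plus its value.
    subfields = [(chunk[0], chunk[1:])
                 for chunk in subfield_text.split("$") if chunk]

    codes = [value for code, value in subfields if code == "a"]
    declared_source = next((value for code, value in subfields if code == "2"), None)

    # Use the $2 source only when the second indicator is 7; otherwise fall
    # back to the default code list (639-2, per the slide).
    if indicators[1] == "7" and declared_source:
        source = declared_source
    else:
        source = "iso639-2"

    return {"indicators": indicators, "codes": codes, "source": source}


default_field = parse_041("041 1_$asal$aeng")
precise_field = parse_041("041 17$alut$aeng$2iso639-3")
print(default_field)
print(precise_field)
```

With the first field a search service can say only that the grammar concerns some Salishan language; with the second it can match Lushootseed exactly.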

Conclusion

► OLAC would like to establish interoperation with the DRIVER infrastructure. We could:
    Implement a driver set on OLAC aggregator
    Harvest language resources from DRIVER aggregator

► OLAC is pleased that DRIVER already recommends ISO 639-3 as best practice with Language element
    We are available to advise institutions who need help implementing this
    We are looking for partners who will help advocate adoption of 639-3 in other guidelines and standards so as to broaden the base for language-related interoperation
