OLAC: The Open Language Archives Community
Gary F. Simons
SIL International and Graduate Institute of Applied Linguistics
DRIVER Summit, Goettingen, 16-17 Jan 2008
What is OLAC?
www.language-archives.org
►OLAC is an international partnership of institutions and individuals who are creating a world-wide virtual library of language resources by:
  ►Developing consensus on best current practice for the digital archiving of language resources
  ►Developing a network of interoperating repositories and services for housing and accessing such resources
►Founded in December 2000
  ►Now has 34 participating archives
  ►12 European participants (bolded on next slide)
Who’s involved?
►Aboriginal Studies Electronic Data Archive
►Academia Sinica
►Alaska Native Language Center
►Archive of Indigenous Languages of Latin America
►ATILF Resources
►Berkeley Language Center
►Centre de Ressources pour la Description de l'Oral
►CHILDES Data Repository
►Comparative Corpus of Spoken Portuguese
►Cornell Language Acquisition Laboratory
►Dictionnaire Universel Boiste 1812
►DOBES catalogue (MPI, Nijmegen)
►Ethnologue: Languages of the World
►European Language Resources Association
►Laboratoire Parole et Langage
►Linguistic Data Consortium Corpus Catalog
►LINGUIST List Language Resources
►Natural Language Software Registry
►Online Database of Interlinear Text (ODIN)
►Oxford Text Archive
►PARADISEC
►Perseus Digital Library
►Research Papers in Computational Linguistics
►Rosetta Project 1000 Language Archive
►SIL Language and Culture Archives
►Surrey Morphology Group Databases
►Survey for California and Other Indian Languages
►TalkBank
►Tibetan and Himalayan Digital Library
►TRACTOR
►Typological Database Project
►University of Bielefeld Language Archive
►University of Queensland Flint Archive
►Virtual Kayardild Archive (Melbourne)
How does it work?
►Based on OAI Protocol for Metadata Harvesting
  ►Adds a community-specific archive description to the Identify response
  ►Defines a new olac metadata format
  ►We operate a static repository gateway for participants with small collections (needs olac format only): http://www.language-archives.org/sr
  ►We operate an aggregator that harvests all participants and crosswalks their records to oai_dc format: http://www.language-archives.org/cgi-bin/olaca3.pl
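As a rough illustration of the harvesting step, the sketch below builds an OAI-PMH request against the aggregator URL given above and pulls identifiers and titles out of an oai_dc response. The verb and metadataPrefix arguments and the two namespaces follow the OAI-PMH specification; the sample response, its identifier and title, and the helper names are invented for illustration, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"
AGGREGATOR = "http://www.language-archives.org/cgi-bin/olaca3.pl"  # from the slide


def build_request(base_url, verb, **params):
    """Return an OAI-PMH GET URL for the given verb and arguments."""
    return base_url + "?" + urlencode({"verb": verb, **params})


def list_records(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = record.findtext(f".//{{{DC_NS}}}title")
        pairs.append((identifier, title))
    return pairs


# Hypothetical, hand-written response; a real one would be fetched from the URL.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A grammar of Lushootseed</dc:title>
          <dc:language>eng</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = build_request(AGGREGATOR, "ListRecords", metadataPrefix="oai_dc")
records = list_records(SAMPLE_RESPONSE)
```

Requesting metadataPrefix="olac" instead would ask a participating repository for the community-specific format rather than the crosswalked oai_dc records.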
OLAC metadata format
►Based on the Dublin Core metadata set
  ►Record format follows the DC guidelines for implementing Qualified DC in XML
  ►Adds community-specific controlled vocabularies:
    ►Linguistic Data Type to qualify Type
    ►Linguistic Field to qualify Subject
    ►Participant Role to qualify Creator and Contributor
    ►ISO 639-3 to qualify Language and Subject
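To make the qualification pattern concrete, here is a minimal sketch that builds one record in which each Dublin Core element carries a vocabulary name and a code. It assumes the Qualified DC convention of an xsi:type attribute plus a code attribute in the OLAC namespace; the namespace URI, the scheme names (linguistic-type, linguistic-field, role, language), the code values other than the language codes, and the title and author are assumptions for illustration and should be checked against the published OLAC schema.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"  # assumed namespace URI

# Register prefixes so the serialized record uses dc:, xsi: and olac:.
for prefix, uri in (("dc", DC_NS), ("xsi", XSI_NS), ("olac", OLAC_NS)):
    ET.register_namespace(prefix, uri)


def add_qualified(record, element, scheme, code, text=None):
    """Append a DC element qualified by an OLAC vocabulary and code."""
    el = ET.SubElement(record, f"{{{DC_NS}}}{element}")
    el.set(f"{{{XSI_NS}}}type", f"olac:{scheme}")  # which vocabulary applies
    el.set(f"{{{OLAC_NS}}}code", code)             # the term from that vocabulary
    el.text = text
    return el


def build_record():
    """Describe a grammar of Lushootseed [lut] written in English [eng]."""
    record = ET.Element(f"{{{OLAC_NS}}}olac")
    ET.SubElement(record, f"{{{DC_NS}}}title").text = "A grammar of Lushootseed"
    add_qualified(record, "type", "linguistic-type", "language_description")
    add_qualified(record, "subject", "linguistic-field", "syntax")
    add_qualified(record, "creator", "role", "author", text="Example Author")
    add_qualified(record, "language", "language", "eng")  # language the resource is in
    add_qualified(record, "subject", "language", "lut")   # language the resource is about
    return record


record = build_record()
xml_text = ET.tostring(record, encoding="unicode")
```

Using Language for the language a resource is written in and Subject for the language it is about lets a search distinguish materials in Lushootseed from materials about it.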
Controlled vocabularies for language identification
►Situation: 6,912 living languages are used throughout the world
  ►Source: Ethnologue, 15th edition, http://www.ethnologue.com
►Problem: The standard used in the library community (MARC language codes, or ISO 639-2)
  ►Has codes for fewer than 400 languages
  ►Uses 66 “collective” codes to handle the other 6,500, e.g.
    ►South American Indian (Other) [sai] covers 421 languages
    ►Bantu (Other) [bnt] covers 612 languages
ISO 639-3
► In 2002, ISO TC37 invited SIL to propose a comprehensive standard compatible with 639-2
►Result: ISO 639-3, Alpha-3 code for comprehensive coverage of languages (published 2007-02-05)
  ►Codes for ~6,900 living languages
  ►Codes for ~600 extinct, historical, ancient, and constructed languages
  ►RA site: http://www.sil.org/639-3/
►OLAC uses this controlled vocabulary for identifying the languages a resource is in or about
What is the current coverage of OLAC?
                                 All archives   Excluding Ethnologue
Items in catalog                 30,591         23,292
ISO 639-3 languages included     7,299          3,134
Items with online open access    16,018         8,719
Current developments
►In first year of a 3-year NSF-sponsored grant to increase use and coverage by an order of magnitude
  1. Develop guidelines and services that encourage best common practices among language archives that will facilitate language resource discovery with precision through OLAC (and attract more archives to join).
  2. Develop services to bridge the resource catalogs of the repository, library, and web domains (e.g. OAI, MARC, Google) to facilitate language resource discovery with precision through OLAC.
     ►E.g. User searches OLAC aggregator for a specific 639-3 code and finds hits in external aggregators
External interoperation
►Strategy 1: Use existing cataloging information to identify languages with precision
  ►639-2 codes for individual languages
  ►Language names in LC subject headings, Call numbers
►Strategy 2: Promote use of 639-3 in cataloging
  ►ISO639-3 is now an encoding scheme in DC Terms
  ►iso639-3 has been added to the MARC standard as a recognized identifier for a source in Field 041
  ►E.g., these are valid 041 fields for a grammar in English of Lushootseed [lut] of the Salishan [sal] family
    041 1_$asal$aeng (using 639-2 by default)
    041 17$alut$aeng$2iso639-3 (using 639-3)
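The two example fields can be read mechanically, which is what a bridging service would need to do. The sketch below parses the flattened notation used on this slide (tag, two indicator characters, then $-prefixed subfields); it is not a parser for real MARC records, and the function name and the "iso639-2" label for the default source are choices made here for illustration.

```python
def parse_041(field):
    """Split a flattened MARC 041 field into indicators, language codes, and code source."""
    tag, rest = field.split(" ", 1)
    if tag != "041":
        raise ValueError(f"not a 041 field: {field!r}")
    indicators, _, subfield_text = rest.partition("$")
    # Each chunk is a one-character subfield code followed by its value.
    subfields = [(chunk[0], chunk[1:]) for chunk in subfield_text.split("$") if chunk]
    languages = [value for code, value in subfields if code == "a"]
    # $2 names the source of the codes; without it the MARC (639-2) list applies.
    source = next((value for code, value in subfields if code == "2"), "iso639-2")
    return {"indicators": indicators, "languages": languages, "source": source}


default_field = parse_041("041 1_$asal$aeng")
precise_field = parse_041("041 17$alut$aeng$2iso639-3")
```

With the default source, the grammar can only be found under the collective Salishan code [sal]; with $2iso639-3 it is findable under Lushootseed [lut] itself.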
Conclusion
►OLAC would like to establish interoperation with the DRIVER infrastructure. We could:
  ►Implement a driver set on OLAC aggregator
  ►Harvest language resources from DRIVER aggregator
►OLAC is pleased that DRIVER already recommends ISO 639-3 as best practice with Language element
  ►We are available to advise institutions who need help implementing this
  ►We are looking for partners who will help advocate adoption of 639-3 in other guidelines and standards so as to broaden the base for language-related interoperation