16 March 2003, Allwood & Grönqvist
Spoken Language Corpora for the Official African Languages of South Africa
Jens Allwood
Göteborg University, Department of Linguistics
Leif Grönqvist
Växjö University, School of Mathematics and Systems Engineering
Göteborg University, Department of Linguistics
Background
Corpus work in Gothenburg
A project cooperation with UNISA (University of South Africa) in Pretoria
Financed by SIDA and NRF
– Travel money for Göteborg
– Some more money for Pretoria covering practical corpus work
Why?
Creating support for survival of endangered languages
Linguistic corpora are very important resources for a language
Spoken language corpora
– Unexplored
– speech recognition/synthesis
– language learning
– standardization
Who?
African Languages: Ncedile, Mmemesi
Linguistics (UNISA): Rusandré Hendrikse, Mvuyesi
Linguistics (Göteborg): Jens, Leif
THE ASMARA DECLARATION – 2000 (UNESCO)
Dialogue among African languages is essential: African languages must use the instrument of translation to advance communication among all people, including the disabled.
All African children have the inalienable right to attend school and learn in their mother tongues. All effort should be made to develop African languages at all levels of education.
THE ASMARA DECLARATION – 2000 (UNESCO), cont’d
Promoting research on African languages is vital for their development, while the advancement of African research and documentation will be best served by the use of African languages.
The effective and rapid development of science and technology in Africa depends on the use of African languages and modern technology must be used for the development of African languages.
OBJECTIVES
To develop a platform of computer-supported basic linguistic resources for the previously disadvantaged languages of SA
The resources will be in the form of
• Archived audio-visual recordings of activity-based natural language use
• Machine-readable transcriptions of recordings for corpus-driven searches
• Morphologically tagged corpora for corpus-based searches
• Other kinds of analysis – manual or automatic
Spoken language corpora for:
Xhosa
Zulu
Ndebele
Siswati
Southern Sotho
Tswana, Tsonga, Venda
Northern Sotho (Pedi)
Afrikaans
English
PROJECT MANAGEMENT
[Organisation chart. Labels: Goteborg/Unisa; Nguni, Sotho, Venda, Tsonga; Rhodes, Fort Hare, UPE/Vista, Natal, Unizul]
PROJECT PHASES: 2002-2004
1. Ongoing audio-video recordings of activity-based spoken language use (min. 200 hrs p/l).
2. Transcriptions (enriched with comment lines) of recordings in machine-readable text format.
3. Checking and editing of transcriptions.
4. Manual morphological tagging of corpora.
5. Automated tagging of corpora.
6. Research outputs.
Workshop overview
The Asmara Declaration – Ncedile
What’s the point of spoken language corpora? – Jens
Overview of the project and its phases – Rusandré
The recording phase – Jens/Mmemesi
The transcription phase – Jens/Mvuyesi
The checking phase – Jens/Ncedile
The tagging phase – Leif/Rusandré
Research output – Jens
The workshops, etc.
Seminars at UNISA, Pretoria
Rhodes University, Grahamstown
University of the Transkei, Umtata
Natal University, Durban
Other places
Contacts from the workshops
Durban
– Isizulu Programme, University of Durban: NN Gumede, CT Gumede, NP Ndimande, NN Mathonsi
– Isizulu Programme, University of Natal: NS Turner, S Naidoo, CNT Ntshangase, MP Kufa, SE Ximba
Grahamstown
– African Languages, Rhodes University: Bulelwa Nosilela, John Claughton, Ntosh Mazwi
– ISEA, Rhodes University: Prof Laurence Wright, Ms Cossie Rasana
– Vista, Port Edward: Prof BB Mkonto
– SAUL, Fort Hare: Mr Zandisile Wilberforce
– Dept. Sport, Arts & Culture, Grahamstown: Vaugham Japtha
Umtata
– UNITRA, African Languages: RM Nakin, N Vapi
The transcription header
@ Recorded activity ID: V010501
@ Activity type: Informal conversation
@ Recorded activity title: Getting to know each other
@ Recorded activity date: 20020725
@ Recorder: Britta Zawada
@ Participant: A = F2 (Lunga)
@ Participant: B = F1 (Bukiwe)
@ Transcriber: Mvuyisi Siwisa
@ Transcription date: 20020805
@ Checker: Rusandre Hendrikse
@ Checking date: 20020912
@ Anonymised: No
@ Activity Medium: face-to-face
@ Activity duration: 00:44:30
@ Other time coding: Each section
@ Tape: V0105
@ Section: Family affairs
@ Section: Crime
@ Section: Unemployment
@ Section: Closing
@ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
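Header lines of the form "@ Field: value" lend themselves to simple machine reading. The following is a minimal Python sketch, not the project's own tooling: the function names parse_header and duration_seconds are hypothetical, and the sketch assumes that each header field sits on its own line and that fields such as Participant and Section may repeat.

```python
from collections import defaultdict


def parse_header(lines):
    """Collect '@ Field: value' lines into a dict mapping field name -> list of values.

    Assumption: one field per line; repeated fields (Participant, Section) are kept in order.
    """
    fields = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue  # not a header line
        # Split on the first colon only, so values like 00:44:30 stay intact.
        name, sep, value = line[1:].partition(":")
        if not sep:
            continue  # no colon, so no field/value pair
        fields[name.strip()].append(value.strip())
    return dict(fields)


def duration_seconds(hhmmss):
    """Convert an 'hh:mm:ss' duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


header = parse_header([
    "@ Recorded activity ID: V010501",
    "@ Activity type: Informal conversation",
    "@ Participant: A = F2 (Lunga)",
    "@ Participant: B = F1 (Bukiwe)",
    "@ Activity duration: 00:44:30",
    "@ Section: Family affairs",
    "@ Section: Crime",
])
total_seconds = duration_seconds(header["Activity duration"][0])
```

Keeping every field as a list avoids silently dropping the second Participant or Section line.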
Contrastive stress, pauses and lengthening
$B: abanye ke bazihlalele njenje: / abanye ABAZANGE bafune sikolo // uyayiqonda ke la meko yokungabikho mzali uqhubayo / uthi aba baza emva kwam bobabini ABAZANGE bafunde kuyaphi // kodwa ke // andigxeki nto kuba ke / ndibakhona ngethuba le ngxaki nobhuti ke [2 abeyinkxaso kakhulu ]2
$A: [2 ya / m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu bushuman bam ndiseza kutshata ndiseza kutshata
Overlaps
§ Religion
$B: uyakhonza kanene
$A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
$B: [4 nantso ke sisi // e: e: ]4
@ < name >
Comment Lines
$A: kunetha imvula sinemithwalo engaka < yebhegi > < yho yho yho > nako sisa
@ < loan English: bag >
@ < gesture: hand wipes >
$B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu < wakwandungwana > ukuba wayengekho ngesasitheni na asazi mntu < >
@ < name: clan name >
@ < comment: A drops her book >
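The line types in the transcription examples ("$" speaker lines, "@" comment lines, "§" section markers, and numbered overlap brackets such as "[4 ... ]4") can be told apart mechanically. The following is a minimal Python sketch under that reading of the examples; the names classify and overlap_ids are hypothetical, and the patterns are inferred from the slides rather than taken from a formal specification.

```python
import re

# "$B: text" -> speaker label and contribution (pattern inferred from the slide examples).
SPEAKER = re.compile(r"^\$(\w+):\s*(.*)$")
# "[4" opens overlap number 4; the matching "]4" closes it.
OVERLAP_OPEN = re.compile(r"\[(\d+)")


def classify(line):
    """Return a tuple describing the line type and its content."""
    line = line.strip()
    if line.startswith("§"):
        return ("section", line[1:].strip())
    if line.startswith("@"):
        return ("comment", line[1:].strip())
    match = SPEAKER.match(line)
    if match:
        return ("utterance", match.group(1), match.group(2))
    return ("other", line)


def overlap_ids(text):
    """List the overlap numbers opened in a contribution."""
    return [int(number) for number in OVERLAP_OPEN.findall(text)]


sample = [
    "§ Religion",
    "$B: uyakhonza kanene",
    "$B: [4 nantso ke sisi // e: e: ]4",
    "@ < name >",
]
parsed = [classify(line) for line in sample]
overlaps = overlap_ids(parsed[2][2])
```

Pairing overlap numbers across speakers (here overlap 4 in the contributions of A and B) would be the next step for anyone studying simultaneous speech.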
Current status
20 hours of Xhosa recordings and transcriptions
A preliminary coding scheme for morphology
Ongoing work on recording, transcription and manual coding of morphology
Things to do
Make transcription standards with examples for each of the nine languages
Hand tag some transcriptions for morphology for training of an experimental tagger
A frequency dictionary and/or a thesaurus for Xhosa
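A first approximation of a frequency dictionary can be computed directly from the transcriptions. The following is a rough Python sketch, not the project's method: the name word_frequencies is hypothetical, tokens are split on whitespace, only "$" speaker lines are counted, and pause marks, overlap brackets and angle brackets are treated as markup. Because it counts word forms rather than lemmas, it is only a starting point for a morphologically rich language such as Xhosa.

```python
import re
from collections import Counter

# Tokens treated as transcription markup rather than words (an assumption based on the examples):
# pause marks (/, //), overlap brackets ([4, ]4) and angle brackets.
MARKUP = re.compile(r"^(/+|\[\d+|\]\d+|<|>)$")


def word_frequencies(lines):
    """Count word forms in speaker contributions, ignoring comment lines and markup."""
    counts = Counter()
    for line in lines:
        if not line.startswith("$"):
            continue  # skip @ comment lines and § section lines
        _, _, text = line.partition(":")
        for token in text.split():
            if MARKUP.match(token):
                continue
            # Lowercase so stressed (capitalised) forms merge with unstressed ones;
            # drop a trailing ":" used to mark lengthening.
            word = token.lower().rstrip(":")
            if word:
                counts[word] += 1
    return counts


frequencies = word_frequencies([
    "$B: uyakhonza kanene",
    "$A: kunetha imvula sinemithwalo engaka < yebhegi > < yho yho yho > nako sisa",
    "@ < loan English: bag >",
])
most_common = frequencies.most_common(1)
```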
Last slide
Summary
Long-term plans