Spoken Language Corpora for the Official African Languages of South Africa

Jens Allwood (Göteborg University, Department of Linguistics) and Leif Grönqvist (Växjö University, School of Mathematics and Systems Engineering; Göteborg University, Department of Linguistics).

Page 1: Spoken Language Corpora for the Official African Languages of South Africa

16 March 2003, Allwood & Grönqvist

Spoken Language Corpora for the Official African Languages of South Africa

Jens Allwood
Göteborg University, Department of Linguistics

Leif Grönqvist
Växjö University, School of Mathematics and Systems Engineering
Göteborg University, Department of Linguistics

Page 2: Spoken Language Corpora for the Official African Languages of South Africa

Background

  • Corpus work in Gothenburg
  • A project cooperation with UNISA (University of South Africa) in Pretoria
  • Financed by SIDA and NRF
    – Travel money for Göteborg
    – Some more money for Pretoria covering practical corpus work

Page 3: Spoken Language Corpora for the Official African Languages of South Africa

Why?

  • Creating support for survival of endangered languages
  • Linguistic corpora are very important resources for a language
  • Spoken language corpora
    – Unexplored
    – Speech recognition/synthesis
    – Language learning
    – Standardization

Page 4: Spoken Language Corpora for the Official African Languages of South Africa

Who?

  • African Languages: Ncedile, Mmemesi
  • Linguistics (UNISA): Rusandré Hendrikse, Mvuyesi
  • Linguistics (Göteborg): Jens, Leif

Page 5: Spoken Language Corpora for the Official African Languages of South Africa

Page 6: Spoken Language Corpora for the Official African Languages of South Africa

THE ASMARA DECLARATION – 2000 (UNESCO)

  • Dialogue among African languages is essential: African languages must use the instrument of translation to advance communication among all people, including the disabled.
  • All African children have the inalienable right to attend school and learn in their mother tongues. All effort should be made to develop African languages at all levels of education.

Page 7: Spoken Language Corpora for the Official African Languages of South Africa

THE ASMARA DECLARATION – 2000 (UNESCO), cont’d

  • Promoting research on African languages is vital for their development, while the advancement of African research and documentation will be best served by the use of African languages.
  • The effective and rapid development of science and technology in Africa depends on the use of African languages and modern technology must be used for the development of African languages.

Page 8: Spoken Language Corpora for the Official African Languages of South Africa

OBJECTIVES

To develop a platform of computer supported basic linguistic resources for the previously disadvantaged languages of SA

The resources will be in the form of
  • Archived audio-visual recordings of activity-based natural language use
  • Machine-readable transcriptions of recordings for corpus-driven searches
  • Morphologically tagged corpora for corpus-based searches
  • Other kinds of analysis – manual or automatic

Page 9: Spoken Language Corpora for the Official African Languages of South Africa

Spoken language corpora for:

  • Xhosa
  • Zulu
  • Ndebele
  • Siswati
  • Southern Sotho
  • Tswana, Tsonga, Venda
  • Northern Sotho (Pedi)
  • Afrikaans
  • English

Page 10: Spoken Language Corpora for the Official African Languages of South Africa

PROJECT MANAGEMENT

[Organisation chart; recoverable labels: Goteborg/Unisa; language groups Nguni, Sotho, Venda, Tsonga; institutions Rhodes, Fort Hare, UPE/Vista, Natal, Unizul]

Page 11: Spoken Language Corpora for the Official African Languages of South Africa

PROJECT PHASES: 2002-2004

1. Ongoing audio-video recordings of activity-based spoken language use (min. 200 hrs p/l).
2. Transcriptions (enriched with comment lines) of recordings in machine-readable text format.
3. Checking and editing of transcriptions.
4. Manual morphological tagging of corpora.
5. Automated tagging of corpora.
6. Research outputs.

Page 12: Spoken Language Corpora for the Official African Languages of South Africa

Workshop overview

  • The Asmara Declaration – Ncedile
  • What’s the point of spoken language corpora? – Jens
  • Overview of the project and its phases – Rusandré
  • The recording phase – Jens/Mmemesi
  • The transcription phase – Jens/Mvuyesi
  • The checking phase – Jens/Ncedile
  • The tagging phase – Leif/Rusandré
  • Research output – Jens

Page 13: Spoken Language Corpora for the Official African Languages of South Africa

The workshops, etc.

  • Seminars at UNISA, Pretoria
  • Rhodes University, Grahamstown
  • University of the Transkei, Umtata
  • Natal University, Durban
  • Other places

Page 14: Spoken Language Corpora for the Official African Languages of South Africa

Contacts from the workshops

  • Durban
    – Isizulu Programme, University of Durban: NN Gumede, CT Gumede, NP Ndimande, NN Mathonsi
    – Isizulu Programme, University of Natal: NS Turner, S Naidoo, CNT Ntshangase, MP Kufa, SE Ximba
  • Grahamstown
    – African Languages, Rhodes University: Bulelwa Nosilela, John Claughton, Ntosh Mazwi
    – ISEA, Rhodes University: Prof Laurence Wright, Ms Cossie Rasana
    – Vista, Port Edward: Prof BB Mkonto
    – SAUL, Fort Hare: Mr Zandisile Wilberforce
    – Dept. Sport, Arts & Culture, Grahamstown: Vaugham Japtha
  • Umtata
    – UNITRA, African Languages: RM Nakin, N Vapi

Page 15: Spoken Language Corpora for the Official African Languages of South Africa

The transcription header

@ Recorded activity ID: V010501
@ Activity type: Informal conversation
@ Recorded activity title: Getting to know each other
@ Recorded activity date: 20020725
@ Recorder: Britta Zawada
@ Participant: A = F2 (Lunga)
@ Participant: B = F1 (Bukiwe)
@ Transcriber: Mvuyisi Siwisa
@ Transcription date: 20020805
@ Checker: Rusandre Hendrikse
@ Checking date: 20020912
@ Anonymised: No
@ Activity Medium: face-to-face
@ Activity duration: 00:44:30
@ Other time coding: Each section
@ Tape: V0105
@ Section: Family affairs
@ Section: Crime
@ Section: Unemployment
@ Section: Closing
@ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
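Because every header line has the shape "@ key: value", the header can be read mechanically. The following is a minimal Python sketch, not the project's own tooling: the function name `parse_header` is hypothetical, and it assumes that repeated keys (Participant, Section) should be collected in order and that only the first colon separates key from value.

```python
from collections import defaultdict


def parse_header(lines):
    """Collect '@ key: value' header lines into a dict mapping key -> list of values."""
    fields = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue  # not a header line
        # Split on the first colon only, so values such as 00:44:30 stay intact.
        key, sep, value = line[1:].partition(":")
        if not sep:
            continue  # no 'key: value' structure on this line
        fields[key.strip()].append(value.strip())
    return dict(fields)


# A few header lines copied from the slide.
header = parse_header([
    "@ Recorded activity ID: V010501",
    "@ Activity duration: 00:44:30",
    "@ Participant: A = F2 (Lunga)",
    "@ Participant: B = F1 (Bukiwe)",
    "@ Section: Family affairs",
    "@ Section: Crime",
])
print(header["Section"])
```

Keeping every value in a list, even for keys that occur once, avoids special cases when a recording has several participants or sections.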

Page 16: Spoken Language Corpora for the Official African Languages of South Africa

Contrastive stress, pauses and lengthening

$B: abanye ke bazihlalele nje: / abanye ABAZANGE bafune sikolo // uyayiqonda ke la meko yokungabikho mzali uqhubayo / uthi aba baza emva kwam bobabini ABAZANGE bafunde kuyaphi // kodwa ke // andigxeki nto kuba ke / ndibakhona ngethuba le ngxaki nobhuti ke [2 abeyinkxaso kakhulu ]2

$A: [2 ya / m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu bushuman bam ndiseza kutshata ndiseza kutshata
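The three phenomena named in the slide title can be picked out of a contribution line token by token. The sketch below is an illustration only and rests on assumptions read off the example rather than a stated rule set: upper-case words mark contrastive stress, runs of "/" mark pauses, and a colon inside a word marks lengthening. The function name `annotate` is hypothetical, and the speaker prefix is assumed to have been removed already.

```python
import re


def annotate(utterance):
    """Label each whitespace-separated token of a contribution (speaker prefix removed)."""
    labelled = []
    for tok in utterance.split():
        if re.fullmatch(r"/+", tok):
            labelled.append((tok, "pause"))        # '/', '//' ...
        elif re.fullmatch(r"\[\d+|\]\d+", tok):
            labelled.append((tok, "overlap"))      # '[2' or ']2'
        elif tok.isalpha() and tok.isupper():
            labelled.append((tok, "stress"))       # upper case = contrastive stress (assumed)
        elif ":" in tok:
            labelled.append((tok, "lengthening"))  # colon = lengthened sound (assumed)
        else:
            labelled.append((tok, "word"))
    return labelled


# Start of speaker B's contribution from the slide.
tags = annotate("abanye ke bazihlalele nje: / abanye ABAZANGE bafune sikolo // uyayiqonda ke")
for token, label in tags:
    print(token, label)
```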

Page 17: Spoken Language Corpora for the Official African Languages of South Africa

Overlaps

§ Religion
$B: uyakhonza kanene
$A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
$B: [4 nantso ke sisi // e: e: ]4
@ < name >
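In the example, the stretch of speech that two speakers produce simultaneously is wrapped in "[4 ... ]4" in both contributions, so matching indices identify the overlapping material. A minimal Python sketch of that pairing follows; the name `find_overlaps` and the returned structure are hypothetical, and it assumes an overlap opens and closes within a single contribution line.

```python
import re

# '[n text ]n' where the closing index repeats the opening one.
OVERLAP = re.compile(r"\[(\d+)\s+(.*?)\s*\]\1(?!\d)")


def find_overlaps(contributions):
    """Map each overlap index to {speaker: overlapped text}."""
    found = {}
    for line in contributions:
        speaker, _, text = line.partition(":")  # '$A' / rest of the line
        speaker = speaker.lstrip("$").strip()
        for match in OVERLAP.finditer(text):
            index = int(match.group(1))
            found.setdefault(index, {})[speaker] = match.group(2)
    return found


# Contribution lines from the slide.
result = find_overlaps([
    "$B: uyakhonza kanene",
    "$A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze ndimlahle "
    "undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela",
    "$B: [4 nantso ke sisi // e: e: ]4",
])
print(result[4]["B"])
```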

Page 18: Spoken Language Corpora for the Official African Languages of South Africa

Comment Lines

$A: kunetha imvula sinemithwalo engaka < yebhegi > < yho yho yho > nako sisa
@ < loan English: bag >
@ < gesture: hand wipes >

$B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu < wakwandungwana > ukuba wayengekho ngesasitheni na asazi mntu < >
@ < name: clan name >
@ < comment: A drops her book >
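In both examples the number of "< ... >" spans in a contribution equals the number of "@ < ... >" comment lines that follow it, which suggests the comments apply to the spans in order. The sketch below pairs them on that assumption; this is a reading of the examples rather than a documented rule, and the name `pair_comments` is hypothetical.

```python
import re

SPAN = re.compile(r"<\s*(.*?)\s*>")


def pair_comments(lines):
    """Pair each < ... > span of a contribution with the comment lines that follow, in order."""
    pairs, pending = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("$"):
            _, _, text = line.partition(":")  # drop the speaker prefix
            pending = SPAN.findall(text)      # spans still waiting for a comment
        elif line.startswith("@") and pending:
            comment = SPAN.search(line)
            if comment:
                pairs.append((pending.pop(0), comment.group(1)))
    return pairs


# Lines from the slide; an empty span '< >' marks a point rather than a stretch of speech.
pairs = pair_comments([
    "$A: kunetha imvula sinemithwalo engaka < yebhegi > < yho yho yho > nako sisa",
    "@ < loan English: bag >",
    "@ < gesture: hand wipes >",
    "$B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu < wakwandungwana > "
    "ukuba wayengekho ngesasitheni na asazi mntu < >",
    "@ < name: clan name >",
    "@ < comment: A drops her book >",
])
for span, comment in pairs:
    print(repr(span), "->", comment)
```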

Page 19: Spoken Language Corpora for the Official African Languages of South Africa

Current status

  • 20 hours of Xhosa recordings and transcriptions
  • A preliminary coding scheme for morphology
  • Ongoing work on recording, transcription and manual coding of morphology

Page 20: Spoken Language Corpora for the Official African Languages of South Africa

Things to do

  • Make transcription standards with examples for each of the nine languages
  • Hand tag some transcriptions for morphology for training of an experimental tagger
  • A frequency dictionary and/or a thesaurus for Xhosa
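A first approximation of a frequency dictionary can be computed directly from the contribution lines of a transcription. The sketch below is one possible approach under stated assumptions, not the project's method: the function name `word_frequencies` is hypothetical, and it assumes that overlap brackets, pause marks and angle brackets are markup to be discarded, and that lengthening colons and stress capitals should be normalised away before counting.

```python
import re
from collections import Counter


def word_frequencies(contributions):
    """Count word forms in '$X: ...' contribution lines, ignoring transcription markup."""
    counts = Counter()
    for line in contributions:
        _, _, text = line.partition(":")                # drop the '$A:' speaker prefix
        text = re.sub(r"\[\d+|\]\d+|[<>/]", " ", text)  # drop overlap, comment and pause marks
        for tok in text.split():
            word = tok.replace(":", "").lower()         # undo lengthening and stress marking
            if word.isalpha():
                counts[word] += 1
    return counts


# Two contribution lines taken from the transcription examples.
freq = word_frequencies([
    "$B: uyakhonza kanene",
    "$A: [2 ya / m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu bushuman bam "
    "ndiseza kutshata ndiseza kutshata",
])
print(freq.most_common(3))
```

Counting surface word forms is only a starting point for a language like Xhosa, where much of the grammar is carried by prefixes and suffixes; counts per stem would need the morphological tagging listed above.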

Page 21: Spoken Language Corpora for the Official African Languages of South Africa

Last slide

  • Summary
  • Long-term plans