An HLT profile of the official South African languages

Aditi Sharma Grover (1,2), Gerhard B van Huyssteen (1,3) & Marthinus W. Pretorius (2)

(1) HLT Research Group, CSIR, South Africa
(2) Graduate School of Technology Management, University of Pretoria, South Africa
(3) Centre for Text Technology (CTexT), North-West University, South Africa

© Aditi Sharma Grover, Gerhard B van Huyssteen, Marthinus W. Pretorius

Overview
- Background
- Process
- Results
- Conclusion

Background: South African HLT landscape
- 11 official languages
- HLT community
  - R&D community (universities & science councils)
  - Very few private sector companies
- Various government initiatives
  - DST: HLT road-mapping process, NHN
  - DAC: HLT strategy, National Centre for HLT
  - NRF: research funding

Background: Challenge
- SA has not yet capitalised on opportunities to create a thriving HLT industry
- Lack of awareness within the local HLT community
- Perpetuated by perceived fragmentation of South African R&D activities
- Lack of a unified technological profile of HLT activities across the 11 languages
- 2009: a technology audit for the South African HLT landscape (SAHLTA)
  - Align R&D activities and stimulate cooperation
  - Similar to Dutch (BLaRK), EuroMap

Process: SAHLTA Process
[Process diagram with the stages: Terminology, Inventory criteria, Cursory inventory, Audit workshop, Questionnaire]

Process: SAHLTA Process, Phase 1
- Establish lingua franca
- Consolidate prior knowledge regarding data, modules, applications, and platforms/tools

Process: SAHLTA Process, Phase 2
- Priorities

Results: Preliminary HLT Priorities, Prioritisation
- Based on international trends, local needs, and feasibility
- Priority 1: Basic & robust core HLT technology applications, modules and data
- Priority 2, 3: LRs that further enhance and complement core LRs (priority 1), and base their development on a strong foundation of core HLT LRs
- Many advanced HLT applications are priority 2, 3
- Verification by larger SA HLT community
- Need to be updated regularly

Results: Preliminary HLT Priorities, Priority 1: Applications
Speech:
- Accessibility
- Telephony applications
- Computer-assisted language learning
- Voice search
- Audio management
Text:
- Proofing tools
- Information Extraction
- Information Retrieval
- Human-aided machine translation
- Machine-aided human translation
Results: Preliminary HLT Priorities, Priority 2: Applications
Speech:
- Access control
- Embedded speech recognition
- Speaking devices
- Computer-assisted training
Text:
- OCR/ICR
- Multilingual comprehension assistants
- CALL
- Authorship identification

Results: Preliminary HLT Priorities, Priority 3: Applications
Speech:
- Transcription/dictation
- Multimodal information access
- Command & Control
- Announcement systems
- Audio books
- S2S translation
Text:
- Text generation
- Document classification
- Summarisation
- QA
- Dialogue systems
- Reference works

Results: Preliminary HLT Priorities, Priority 1: Modules
Speech:
- Complete ASR
- Non-native ASR
- Complete TTS
- Confidence measures
- Speaker ID
- Diarisation
- Language/dialect ID
Text:
- G2P
- Text pre-processing
- Normalisation
- Morphological analysis
- POS tagging
- Chunking
- WSD
- Language ID

Results: Preliminary HLT Priorities, Priority 1: Data
Speech:
- Annotated monolingual corpora
- Domain-/Application-specific corpora
- Test suites and corpora
- Pronunciation resources (e.g. phone sets, dictionaries, etc.)
Text:
- Monolingual corpora
- Multilingual corpora
- Test suites and corpora
- Lexica (incl. named-entity lists)
- Domain-/Application-specific corpora

Process: SAHLTA Process, Phase 3
- Indexes
- Detailed inventory
- Gap analysis

Process: Response rate

Results: Maturity Index
- Maturity stages: Under development (UD), Alpha version (AV), Beta version (BV), Released (RV)
- Maturity Index
  - Measure of the maturity of HLT components in a language
  - Considers the maturity stage of an item against the relative importance of each maturity stage
  - MaturityInd = (1·UD + 2·AV + 4·BV + 8·RV) / (weights of maturity stages)

Results: Maturity Index
[Figure: bar chart of the Maturity Index per language, y-axis 0 to 40. Languages shown: Afr, SAE, Zul, Xho, Sts, Sep, Ses, Tsv, Ssw, Ndb, Xit, L.I.]
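The Maturity Index formula can be illustrated with a short Python sketch. It rests on two assumptions that the slides do not state explicitly: that UD, AV, BV and RV stand for the number of HLT items at each maturity stage in a language, and that "weights of maturity stages" in the denominator means the sum of the four weights (1 + 2 + 4 + 8 = 15). The function name and the example counts are hypothetical.

```python
# Weights per maturity stage, taken from the formula on the slide.
MATURITY_WEIGHTS = {"UD": 1, "AV": 2, "BV": 4, "RV": 8}


def maturity_index(stage_counts):
    """Weighted maturity score for one language.

    stage_counts maps a stage code (UD, AV, BV, RV) to the number of
    HLT items at that stage. ASSUMPTION: the denominator is the sum of
    the stage weights (15); the slides only say "weights of maturity stages".
    """
    unknown = set(stage_counts) - set(MATURITY_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown maturity stage(s): {sorted(unknown)}")
    weighted = sum(MATURITY_WEIGHTS[stage] * count
                   for stage, count in stage_counts.items())
    return weighted / sum(MATURITY_WEIGHTS.values())


# Hypothetical counts for one language, not taken from the audit data.
example_counts = {"UD": 3, "AV": 1, "BV": 2, "RV": 4}
print(maturity_index(example_counts))  # (3 + 2 + 8 + 32) / 15 = 3.0
```

Because the weights double from stage to stage, one released item contributes as much as eight items still under development, which is how the index rewards mature components over sheer quantity.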
Results: Accessibility Index
- Accessibility stages: Unspecified (UN), Not available (NA) (proprietary or contract R&D), Research and education (RE), Available for commercial purposes (CO), Available for commercial purposes and R&E (CRE)
- Accessibility Index
  - Measure of the accessibility of HLT components in a language
  - Considers the accessibility stage of an item against the relative importance of each accessibility stage
  - AccessInd = (1·UN + 2·NA + 4·RE + 8·CO + 12·CRE) / (weights of accessibility stages)

Results: Accessibility Index
[Figure: bar chart of the Accessibility Index per language, y-axis 0 to 40. Languages shown: Afr, SAE, Zul, Xho, Sep, Sts, Ses, Tsv, Ssw, Ndb, Xit, L.I.]

Results: HLT Language Index
- Impressionistic index that relatively ranks languages based on the total quantity of HLT activity per language
- Considers the stage of maturity and accessibility of all the HLT components
- HLT Language Index = Maturity Index (per language, all components) + Accessibility Index

Results: HLT Language Index
[Figure: bar chart of the HLT Language Index per language, y-axis 0 to 80. Languages shown: Afr, SAE, Zul, Xho, Sep, Sts, Ses, Tsv, Ssw, Ndb, Xit, L.I.]

Results: HLT Component Indexes
- Alternative perspective: quantity of activity taking place within each of the data, modules, and applications on a HLT component grouping level (e.g. pronunciation resources)

Results: HLT Component Indexes: Modules

Results: HLT Detailed Inventory
[Table: detailed inventory per language. Legend categories:
- Item exists, is accessible, released & of fairly adequate quality
- Item may exist but available for restricted use or not released/limited quality
- Items do not exist
- Category is not applicable to the language]

Results: Gap Analysis (speech)
[Table: gap analysis for speech. Legend categories:
- Item exists, is accessible, released & of fairly adequate quality
- Item may exist but available for restricted use or not released/limited quality
- Items do not exist
- Category not applicable to the language]

Results: SAHLTA Outcomes
- A SAHLTA online database of LRs and applications (alpha)
- www.meraka.org.za/nhnaudit

Results: SAHLTA Outcomes
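The Accessibility Index and the HLT Language Index defined in the Results slides can be sketched together in Python. As with any reading of the slide formulas, this is a sketch under stated assumptions: the stage codes are taken to be counts of HLT items at each stage, and each denominator is taken to be the sum of the respective weights (15 for maturity, 27 for accessibility). The function names and example counts are hypothetical.

```python
# Stage weights as given in the two index formulas on the slides.
MATURITY_WEIGHTS = {"UD": 1, "AV": 2, "BV": 4, "RV": 8}
ACCESS_WEIGHTS = {"UN": 1, "NA": 2, "RE": 4, "CO": 8, "CRE": 12}


def weighted_index(stage_counts, weights):
    """Weighted sum of item counts per stage, divided by the sum of weights.

    ASSUMPTION: "weights of ... stages" in the slides means the sum of
    the stage weights. An unknown stage code raises KeyError.
    """
    weighted = sum(weights[stage] * count
                   for stage, count in stage_counts.items())
    return weighted / sum(weights.values())


def hlt_language_index(maturity_counts, access_counts):
    """HLT Language Index = Maturity Index + Accessibility Index."""
    return (weighted_index(maturity_counts, MATURITY_WEIGHTS)
            + weighted_index(access_counts, ACCESS_WEIGHTS))


# Hypothetical counts for ten items in one language (not audit data).
maturity_counts = {"UD": 3, "AV": 1, "BV": 2, "RV": 4}
access_counts = {"UN": 2, "NA": 0, "RE": 5, "CO": 1, "CRE": 2}

print(weighted_index(access_counts, ACCESS_WEIGHTS))       # 54 / 27 = 2.0
print(hlt_language_index(maturity_counts, access_counts))  # 3.0 + 2.0 = 5.0
```

Since both indexes grow with the number of items as well as with their stage, the combined index reflects the total quantity of activity per language, which matches the slides' description of it as an impressionistic, relative ranking.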
Conclusion: Summary
- Few resources available, of basic nature
- Several factors influence this:
  - HLT expert knowledge and interests
  - Availability of data resources
  - Market needs of a language
  - Relatedness to other world languages

Conclusion: Recommendations
- Further resource development based on gap analysis
  - Also of more advanced LRs
- Availability and distribution of existing LRs
  - To enable usage, licensing agreements need to be in place
- Funding: support by government in formative years
  - Also industry stimulation programmes (e.g. support for R&D consortia)
- Collaborations: across SA and internationally, also based on gap analysis
- Human capital development (HCD): scientific & technical, cross silos of academic disciplines, especially for lesser-resourced languages

Conclusion: Acknowledgments
- DST: project sponsorship
- Prof. Sonja Bosch & Prof. Laurette Pretorius: results of the 2008 BLaRK survey
- Audit mini-workshop contributors: Prof. Danie Prinsloo (UP), Prof. Sonja Bosch (UNISA), Mr. Martin Puttkammer (NWU), Prof. Gerhard van Huyssteen (CSIR), Prof. Etienne Barnard (CSIR), Dr. Febe de Wet (US), Dr. Marelie Davel (CSIR)
- Numerous audit participants
- Various HLT RG members: guidance and support

www.meraka.org.za/nhnaudit