Status and Challenges of Local Language Computing and BRAC University’s Initiative
Naushad UzZaman
Research Programmer
Center for Research on Bangla Language Processing
BRAC University
D. Net’s 5th Anniversary Seminar Series: Youth and ICTs: ICT and Localization
29th January, 2006
N. UzZaman, BRAC University
ICT and Localization, 29/1/06 2
Outline
• Statistics of Bangla language speakers
• Localization and local language computing
• BRAC University’s Initiative
• Local and Regional Initiatives
Statistics of Bangla language speakers
• Spoken by 245 million people
• 7th most widely spoken language
• Spoken mainly in Bangladesh and the Indian state of West Bengal
• More than 144 million people from Bangladesh
Why localization?
• The masses can harness the power of information
• National Interest: digital divide, governance, language preservation, …
Localization
• Internationalized software in local languages
• Few groups are working actively
  – Ankur, Ekushey, D.Net (content development)
• Active projects
  – Linux, Mozilla, Open Office
Larger picture
• Good start, but a long way to go!
• Local language computing: advanced applications
  – Optical character recognition
  – Machine translation
  – Speech synthesis
  – Speech recognition
  – Dialog systems
Challenges
• Language Resources
  – Fonts
  – Lexicon (word list)
  – Corpus (collection of texts)
  – Tag the lexicon and corpus
Challenges for next few years!
• Language processing research
  – Document authoring (desktop, web (blogs, forums, emails), etc.)
  – Morphological analyzer
  – Speech processing
  – Information Retrieval (web searching, name searching, spelling checker)
  – OCR (Optical Character Recognition)
  – Syntactic analysis (can be used in MT)
  – Machine Translation
  – And many more…
Status of Bangla Computing
• Scattered work done, very little unification
• Scarcity of free and open-source software
• Little or no attention paid to computational linguistics - the backbone
• Many individuals are working, resulting in a few good publications at ICCIT, IUB’s ICCPB, and other conferences
BRAC University’s Initiative
• Research Lab (Center for Research on Bangla Language Processing)
  – 9 full-time research staff (6 CS background, 3 linguistics background)
  – Seed funding from PAN Localization project of IDRC
  – Students working part-time, doing internships
  – Software/documents all OPEN SOURCE
• Academics
  – Course on Natural Language Processing
  – Student projects and theses on NLP
Status of BU Research lab’s work
• Publications
  – ICCIT 2004: 3 (Morphology 2, spelling checker)
  – BU Journal: 1 (Morphological parsing)
  – IASTED CI: 1 (Name searching)
  – IEEE NLP KE 05: 1 (Spelling checker)
  – ICCIT 2005: 1 (Morphology)
  – Undergraduate Thesis: 3 (Phonetic encoding, OCR, Bangla text input in mobile)
  – Total: 10
• 4 more research papers submitted
• Ongoing theses: 4
Status of BU Research lab’s work
• Invited talks:
  – University of Toronto CS Seminar
  – Stanford University NLP group (May 2005)
  – IDRC Partners Conference in Cambodia (June 2005)
  – IJCNLP 2005, Jeju Island, Korea (October 2005)
Language Resources
• Fonts: Good open-source fonts available
• Lexicon:
  – List of 80,000+ words; expected to reach 110,000 in the next release
  – Tagging and annotation are underway; this is a large and significant project
• Corpus:
  – Yet to begin
Language processing research
• Document authoring
  – Editor, Banglapad:
    • Open source, platform independent, rich text editor (supports Bangla spell checking, export to HTML)
    • Status: Version 1, Release candidate 1
    • http://sourceforge.net/projects/banglapad
  – Transliteration, pata:
    • Type phonetically in English and get a similar-sounding dictionary word
    • Desktop application: http://sourceforge.net/projects/pata; Status: Complete
    • Web based transliteration: Status: Expected by June 2006
  – Community network tools:
    • Set of tools for community networking (blogs, forums, etc.) in Bangla
    • Not only content authoring but also web services such as a spelling checker
    • Status: Expected by early 2007
Language processing research
• Morphology:
  – Verb morphology is reasonably complete
  – Noun morphology is somewhat usable, but much more needs to be done
  – Statistical methods for dealing with Bangla compound words and blends are being worked on
• Grapheme To Phoneme (G2P):
  – Digital pronunciation dictionary
  – Useful step for speech processing
  – Status: Expected by June 2006
Language processing research
• Speech Processing
  – Text-to-speech:
    • Voice for Festival
    • Status: First demo expected by May 2006
  – Automatic Speech Recognition:
    • Limited vocabulary segmented speech recognition
    • Status: First demo expected by August 2006
Language processing research
• Information Retrieval:
  – Spelling checker:
    • Gives phonetic suggestions and ranks them phonetically
    • http://sourceforge.net/projects/puspaspeller/
    • Integrated with other text editors and Banglapad
    • Status: Complete
  – Searching
    • Phonetic web searching for Bangla
    • Input can be English or Bangla
    • Status: Expected by June 2006
  – Name searching
    • Can be used in hospitals, institutes, censuses, etc.
    • Status: Expected by October 2006
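To make "gives phonetic suggestions and ranks them phonetically" concrete, here is a minimal sketch of the general idea, not the lab's actual spelling checker. It assumes romanized input, a hypothetical and deliberately tiny sound-folding table (FOLD), and plain edit distance between phonetic keys as the ranking score; a real Bangla encoder would work on Bangla script with far richer rules.

```python
# Hypothetical folding table: romanized spellings that sound alike
# are collapsed to one symbol. Illustrative only.
FOLD = [("sh", "s"), ("ph", "f"), ("kh", "k"), ("z", "j"), ("v", "b")]


def phonetic_key(word):
    """Reduce a word to a rough sound-alike key."""
    key = word.lower()
    for src, dst in FOLD:
        key = key.replace(src, dst)
    # Collapse runs of the same letter ("ss" -> "s").
    out = []
    for ch in key:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suggest(word, lexicon, limit=3):
    """Rank lexicon entries by distance between phonetic keys."""
    target = phonetic_key(word)
    ranked = sorted(lexicon,
                    key=lambda w: (edit_distance(target, phonetic_key(w)), w))
    return ranked[:limit]


lexicon = ["bhasha", "desh", "bangla", "kotha"]
print(suggest("basa", lexicon, limit=1))
```

Comparing keys rather than raw spellings means two spellings of the same sound cost nothing, so the ranking reflects how a word sounds rather than how it was typed. The same key-then-compare step could serve phonetic searching and name searching, with a different lexicon.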
Language processing research
• Pattern recognition/image processing/document processing:
  – Document skew correction: Bangla document skew corrector based on Radon transform. Complete.
  – Segmentation:
    • Bangla line segmentation: Complete
    • Bangla word segmentation: Complete
    • Bangla character segmentation: Work in progress. The large number of combinations (consonant clusters and the non-spacing marks) complicates this task. The system is omnifont, so it must work with any typeface.
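As an illustration of how a Radon-transform-style skew estimate can work, here is a minimal sketch, not the lab's implementation. It assumes a binary page given as a list of ink-pixel coordinates, a hypothetical synthetic test page (skewed_lines), and a coarse search over whole-degree angles. For each candidate angle, the ink is projected onto the axis perpendicular to the presumed text lines; the projection is sharpest (scored here by the sum of squared bin counts) when the angle matches the skew.

```python
import math
from collections import Counter


def skewed_lines(angle_deg, n_lines=4, spacing=10, width=200):
    """Synthetic page: horizontal strokes rotated by angle_deg,
    returned as integer (x, y) ink-pixel coordinates."""
    t = math.radians(angle_deg)
    pixels = []
    for k in range(n_lines):
        y0 = k * spacing
        for x in range(width):
            pixels.append((round(x * math.cos(t) - y0 * math.sin(t)),
                           round(x * math.sin(t) + y0 * math.cos(t))))
    return pixels


def projection_score(pixels, angle_deg):
    """Project ink along lines tilted by angle_deg (one slice of a
    discrete Radon transform) and score how peaked the profile is."""
    t = math.radians(angle_deg)
    bins = Counter(round(-x * math.sin(t) + y * math.cos(t))
                   for x, y in pixels)
    return sum(c * c for c in bins.values())


def estimate_skew(pixels, max_angle=10):
    """Return the whole-degree angle whose projection is sharpest."""
    return max(range(-max_angle, max_angle + 1),
               key=lambda a: projection_score(pixels, a))


page = skewed_lines(5)
print(estimate_skew(page))
```

Rotating the page by the negative of the estimated angle would then straighten the text lines before line, word, and character segmentation. A production system would search finer angle steps and work on real scanned images rather than synthetic strokes.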
Language processing research
  – Pattern recognition:
    • Neural net based recognizer: Fairly complete for the basic alphabet and a subset of the consonant clusters. The non-spacing marks pose a significant challenge.
    • Hidden Markov Model (HMM) based recognizer: Just started, first implementation expected in May 2006.
• Syntax:
  – Very preliminary work on Bangla syntax using the Lexical Functional Grammar (LFG) formalism
  – Also a parallel effort using the Head-driven Phrase Structure Grammar (HPSG) formalism
Local and Regional Initiatives
IDRC Pan Localization Network (PanL10n)
• Phase I 2004-2006: 7 country collaboration
  1. BRAC University, Bangladesh
  2. Department of IT, Bhutan
  3. National ICT Development Agency, Cambodia
  4. Science Tech and Environment Agency, Laos
  5. Madan Puraskar Pustakalaya, Nepal
  6. University of Colombo School of Computing, Sri Lanka
  7. Afghanistan
• Phase II proposed for 2007-2010
Local and Regional Initiatives
IDRC Pan Localization Network Phase II (2007-2010):
• Further development of user-end local language technology
• Development of user-end training for using the local language technology
• Conducting this training
• Local language content development
• Measuring effects of using local language technology
Summary
• Local language computing
• Significant challenges, from language resources to human resources
• 30+ years of work for English and Western languages; just beginning for Bangla
• Include students from CS and linguistics
• OPEN SOURCE a must for knowledge sharing!
• Other universities should also come forward