
Status and Challenges of Local Language Computing and BRAC University’s Initiative

Naushad UzZaman

Research Programmer

Center for Research on Bangla Language Processing

BRAC University

D.Net’s 5th Anniversary Seminar Series: Youth and ICTs: ICT and Localization

29th January, 2006

N. UzZaman, BRAC University

ICT and Localization, 29/1/06 2

Outline

• Statistics of Bangla language speakers

• Localization and local language computing

• BRAC University’s Initiative

• Local and Regional Initiatives


Statistics of Bangla language speakers

• Spoken by 245 million people

• 7th most widely spoken language

• Spoken mainly in Bangladesh and the Indian state of West Bengal

• More than 144 million speakers are in Bangladesh


Why localization?

• The masses can harness the power of information

• National Interest: digital divide, governance, language preservation, …


Localization

• Internationalized software in local languages
• Few groups are working actively
  – Ankur, Ekushey, D.Net (content development)
• Active projects
  – Linux, Mozilla, OpenOffice.org


Larger picture

• Good start, but a long way to go!
• Local language computing: advanced applications
  – Optical character recognition
  – Machine translation
  – Speech synthesis
  – Speech recognition
  – Dialog systems


Challenges

• Language Resources
  – Fonts
  – Lexicon (word list)
  – Corpus (collection of texts)
  – Tagging of the lexicon and corpus


Challenges for the next few years!

• Language processing research
  – Document authoring (desktop, web (blogs, forums, emails), etc.)
  – Morphological analyzer
  – Speech processing
  – Information Retrieval (web searching, name searching, spelling checker)
  – OCR (Optical Character Recognition)
  – Syntactic analysis (can be used in MT)
  – Machine Translation
  – And many more…


Status of Bangla Computing

• Scattered work done, very little unification

• Scarcity of free and open-source software

• Little or no attention paid to computational linguistics, the backbone

• Many individuals are working, resulting in a few good publications at ICCIT, IUB’s ICCPB, and other conferences


BRAC University’s Initiative

• Research Lab (Center for Research on Bangla Language Processing)
  – 9 full-time research staff (6 with a CS background, 3 with a linguistics background)
  – Seed funding from the PAN Localization project of IDRC
  – Students working part-time and doing internships
  – All software and documents are OPEN SOURCE
• Academics
  – Course on Natural Language Processing
  – Student projects and theses on NLP


Status of BU Research lab’s work

• Publications
  – ICCIT 2004: 3 (Morphology: 2, spelling checker: 1)
  – BU Journal: 1 (Morphological parsing)
  – IASTED CI: 1 (Name searching)
  – IEEE NLP KE 05: 1 (Spelling checker)
  – ICCIT 2005: 1 (Morphology)
  – Undergraduate theses: 3 (Phonetic encoding, OCR, Bangla text input on mobile)
  – Total: 10
• 4 more research papers submitted
• Ongoing theses: 4


Status of BU Research lab’s work

• Invited talks:
  – University of Toronto CS Seminar
  – Stanford University NLP group (May 2005)
  – IDRC Partners Conference in Cambodia (June 2005)
  – IJCNLP 2005, Jeju Island, Korea (October 2005)


Language Resources

• Fonts: Good open-source fonts available
• Lexicon:
  – Word list of 80,000+ entries; expected to reach 110,000 in the next release
  – Tagging and annotation are underway; this is a large and significant project
• Corpus:
  – Yet to begin


Language processing research

• Document authoring
  – Editor, Banglapad:
    • Open source, platform independent, rich text editor (supports Bangla spell checking, export to HTML)
    • Status: Version 1, Release candidate 1
    • http://sourceforge.net/projects/banglapad
  – Transliteration, pata:
    • Type phonetically in English and get a similar-sounding dictionary word
    • Desktop application: http://sourceforge.net/projects/pata; Status: Complete
    • Web-based transliteration: Status: Expected by June 2006
  – Community network tools:
    • A set of tools for community networking (blogs, forums, etc.) in Bangla
    • Not only content authoring but also web services such as a spelling checker
    • Status: Expected by early 2007
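The idea behind phonetic transliteration (type in English letters, get a similar-sounding dictionary word) can be illustrated with a small sketch. This is not pata's actual algorithm; the sound classes, the `phonetic_key` function, and the five-word lexicon with its romanizations are all hypothetical, chosen only to show how different spellings of the same sound can land on one dictionary entry.

```python
import re
from collections import defaultdict

# Hypothetical sound classes: Latin letters a typist may use interchangeably.
SIMILAR = str.maketrans("vwzcq", "bbjkk")
# Vowels and the aspiration marker "h" are ignored after the first letter.
SKIPPED = set("aeiouh")

def phonetic_key(text):
    """Reduce a romanized word to a rough sound key (illustrative only)."""
    letters = re.sub(r"[^a-z]", "", text.lower()).translate(SIMILAR)
    if not letters:
        return ""
    key = letters[0]
    for ch in letters[1:]:
        if ch in SKIPPED:
            continue
        # Collapse a consonant that repeats the last kept letter,
        # even when a skipped vowel sits between the two.
        if ch != key[-1]:
            key += ch
    return key

# Toy lexicon: Bangla word -> one assumed romanization.
LEXICON = {"বাংলা": "bangla", "আমি": "ami", "তুমি": "tumi", "ভালো": "bhalo", "বই": "boi"}

# Index the dictionary by sound key so lookup is a single dictionary access.
INDEX = defaultdict(list)
for bangla_word, roman in LEXICON.items():
    INDEX[phonetic_key(roman)].append(bangla_word)

def transliterate(typed):
    """Return dictionary words whose sound key matches the typed text."""
    return INDEX.get(phonetic_key(typed), [])

print(transliterate("valo"))
print(transliterate("baangla"))
```

Because lookup goes through the dictionary index, the output is always a real lexicon word; a key this coarse will merge many unrelated words in a full-size lexicon, so a practical system would need a finer encoding and some ranking.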


Language processing research

• Morphology:
  – Verb morphology is reasonably complete
  – Noun morphology is somewhat usable, but much more needs to be done
  – Statistical methods for dealing with Bangla compound words and blends are being worked on
• Grapheme To Phoneme (G2P):
  – Digital pronunciation dictionary
  – Useful step for speech processing
  – Status: Expected by June 2006


Language processing research

• Speech Processing
  – Text-to-speech:
    • Voice for Festival
    • Status: First demo expected by May 2006
  – Automatic Speech Recognition:
    • Limited vocabulary segmented speech recognition
    • Status: First demo expected by August 2006


Language processing research

• Information Retrieval:
  – Spelling checker:
    • Gives phonetic suggestions and ranks them phonetically
    • http://sourceforge.net/projects/puspaspeller/
    • Integrated with text editors such as Banglapad
    • Status: Complete
  – Searching:
    • Phonetic web searching for Bangla
    • Input can be English or Bangla
    • Status: Expected by June 2006
  – Name searching:
    • Can be used in hospitals, institutes, censuses, etc.
    • Status: Expected by October 2006
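Phonetic ranking of spelling suggestions can be sketched as follows. This is an illustration under stated assumptions, not the lab's spelling checker: the `HOMOPHONES` table (a handful of Bangla letters and vowel signs that are often pronounced alike), the function names, and the tiny lexicon are hypothetical. Each word is reduced to a phonetic code, and candidates are ranked by edit distance between codes, so a misspelling that sounds the same as a dictionary word ranks first.

```python
# Hypothetical, incomplete table of Bangla letters that are often pronounced alike.
HOMOPHONES = str.maketrans({
    "ণ": "ন",  # both nasal "n"
    "ষ": "শ",  # sibilants commonly merged in speech
    "স": "শ",
    "ী": "ি",  # long/short "i" vowel signs
    "ূ": "ু",  # long/short "u" vowel signs
    "য": "জ",  # both commonly pronounced "j"
})

def phonetic_code(word):
    """Map sound-alike letters to one representative (illustrative only)."""
    return word.translate(HOMOPHONES)

def edit_distance(a, b):
    """Standard Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon, max_distance=1):
    """Rank lexicon words by distance between phonetic codes, closest first."""
    code = phonetic_code(word)
    scored = [(edit_distance(code, phonetic_code(w)), w) for w in lexicon]
    return [w for d, w in sorted(scored) if d <= max_distance]

LEXICON = ["বাণী", "বালি", "বাকি", "কলম"]  # toy word list
print(suggest("বানি", LEXICON))
```

Here the misspelled input has the same phonetic code as "বাণী", so that word comes first; words one edit away follow, and unrelated words are dropped. The same code-and-compare idea applies to name searching, where spelling variants of one name should match.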


Language processing research

• Pattern recognition/image processing/document processing:
  – Document skew correction: Bangla document skew corrector based on the Radon transform. Complete.
  – Segmentation:
    • Bangla line segmentation: Complete
    • Bangla word segmentation: Complete
    • Bangla character segmentation: Work in progress. The large number of combinations (consonant clusters and the non-spacing marks) complicates this task. The system is omnifont, so it must work with any typeface.
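Skew estimation with a Radon-style projection can be sketched in a few lines. This is a simplified illustration, not the lab's corrector: it projects ink pixels at each candidate angle (one discrete Radon projection per angle) and picks the angle whose profile is sharpest, since text lines collapse into narrow peaks when projected along their own direction. The function names, the whole-degree search range, and the synthetic page are assumptions made for the example.

```python
import math
from collections import Counter

def profile_sharpness(pixels, angle_deg):
    """Sum of squared bin counts of the projection taken at angle_deg."""
    t = math.radians(angle_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Project each ink pixel onto the axis perpendicular to the assumed
    # text direction; aligned text lines fall into very few bins.
    bins = Counter(round(y * cos_t - x * sin_t) for x, y in pixels)
    return sum(count * count for count in bins.values())

def estimate_skew(pixels, max_angle=10):
    """Return the whole-degree angle in [-max_angle, max_angle] with the sharpest profile."""
    return max(range(-max_angle, max_angle + 1),
               key=lambda angle: profile_sharpness(pixels, angle))

def synthetic_page(angle_deg, baselines=(0, 10, 20), width=100):
    """Hypothetical test page: thin text lines following y = y0 + x * tan(angle)."""
    slope = math.tan(math.radians(angle_deg))
    return [(x, round(y0 + x * slope)) for y0 in baselines for x in range(width)]

page = synthetic_page(5)
print(estimate_skew(page))
```

Once the angle is known, rotating the page by its negative deskews the text; on a deskewed page, the same horizontal projection profile has empty bins between text lines, which is one common basis for line segmentation.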


Language processing research

  – Pattern recognition:
    • Neural net based recognizer: Fairly complete for the basic alphabet and a subset of the consonant clusters. The non-spacing marks pose a significant challenge.
    • Hidden Markov Model (HMM) based recognizer: Just started, first implementation expected in May 2006.
• Syntax:
  – Very preliminary work on Bangla syntax using the Lexical Functional Grammar (LFG) formalism
  – Also a parallel effort using the Head-driven Phrase Structure Grammar (HPSG) formalism


Local and Regional Initiatives

IDRC Pan Localization Network (PanL10n)

• Phase I 2004-2006: 7 country collaboration
  1. BRAC University, Bangladesh
  2. Department of IT, Bhutan
  3. National ICT Development Agency, Cambodia
  4. Science Tech and Environment Agency, Laos
  5. Madan Puraskar Pustakalaya, Nepal
  6. University of Colombo School of Computing, Sri Lanka
  7. Afghanistan
• Phase II proposed for 2007-2010


Local and Regional Initiatives

IDRC Pan Localization Network Phase II (2007-2010):

• Further development of user-end local language technology
• Development of user-end training for using the local language technology
• Delivery of this training
• Local language content development
• Measuring the effects of using local language technology


D.Net’s Initiative


Summary

• Local language computing
• Significant challenges, from language resources to human resources
• 30+ years of work for English and Western languages; just beginning for Bangla
• Include students from CS and linguistics
• OPEN SOURCE is a must for knowledge sharing!
• Other universities should also come forward