Computing in Vietnamese: Progress & Challenges

{James} ĐỖ Bá Phước 杜伯福
IMUG 2005-05-19

Copyright©2005 by JDo. All rights reserved.

Overview

Vietnamese writing
  Latin: Quốc ngữ
  Ideographic: Chữ Nôm
Considerations
  Repertoire
  Character encoding
  Input methods
  Fonts

Quốc ngữ {National script}

Orthographic units

Vowels
  a ă â e ê i o ô ơ u ư y
Consonants
  b c d đ g h k l m n p q r s t v x
Tone marks
  ò ỏ õ ó ọ

A vowel can combine with one tone mark
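The rule above can be illustrated with a minimal Python sketch. It attaches each of the five tone marks, as a Unicode combining character, to one vowel and composes the pair. The Vietnamese tone names and the helper name `with_tone` are additions for illustration and do not come from the slides.

```python
import unicodedata

# Combining characters for the five tone marks, in the order shown on the slide.
# The tone names are added here for readability; they are not on the slide.
TONE_MARKS = {
    "huyền (grave)": "\u0300",
    "hỏi (hook above)": "\u0309",
    "ngã (tilde)": "\u0303",
    "sắc (acute)": "\u0301",
    "nặng (dot below)": "\u0323",
}

def with_tone(vowel, mark):
    """Attach one combining tone mark to a vowel and compose the pair (NFC)."""
    return unicodedata.normalize("NFC", vowel + mark)

# Show the vowel ê carrying each of the five tone marks in turn.
for name, mark in TONE_MARKS.items():
    print(name, with_tone("\u00ea", mark))
```

Each result is a single pre-composed character, so the same helper should work for any of the twelve vowels.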

192 characters, 6 “too many”

A total of 192 upper- and lower-case “pre-composed” characters
  6 characters beyond 8-bit character set
6 characters missing from original ISO 10646 repertoire (in 1988)
  Restored through Unicode-10646 merger
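One way to see where the “6 too many” comes from is to count the Vietnamese letters that fall outside 7-bit ASCII and compare that with the 128 code positions in the upper half of an 8-bit set. The sketch below does this under that assumption. It does not try to reproduce the slide's total of 192, which is taken as given, and the function name is hypothetical.

```python
import unicodedata

# The twelve vowels from the "Orthographic units" slide, forced to NFC so that
# iterating over the string yields one vowel per step.
VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
# No tone mark, plus the five combining tone marks.
TONES = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]

def vietnamese_letters_outside_ascii():
    """Collect every pre-composed Vietnamese letter that is not plain ASCII."""
    letters = set()
    for vowel in VOWELS:
        for tone in TONES:
            lower = unicodedata.normalize("NFC", vowel + tone)
            letters.update({lower, lower.upper()})
    letters.update({"đ", "Đ"})
    return {ch for ch in letters if ord(ch) > 0x7F}

extra = vietnamese_letters_outside_ascii()
# Letters needed beyond ASCII, and how many exceed the 128 upper-half slots.
print(len(extra), len(extra) - 128)
```

Under this way of counting, the non-ASCII letters exceed the 128 available positions by six.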

43 8-bit character sets

Pre-composed
  Dual fonts
    TCVN 5712:1995, aka “ABC” (TCVN = Tiêu chuẩn Việt Nam {Vietnam Standard})
Glyph overlap
  VNI
Combining
  Windows Vietnamese (cp-1258)
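To make the “combining” approach concrete, here is a hedged Python sketch that encodes text for cp-1258 using Python's built-in `cp1258` codec. The code page has the base vowels (ă, â, ê, ô, ơ, ư) and five combining tone marks, but not the fully pre-composed letters, so the tone mark has to be split off before encoding. The helper names `to_cp1258` and `from_cp1258` are illustrative.

```python
import unicodedata

# The five combining tone marks that cp-1258 carries as separate characters.
TONE_MARKS = {"\u0300", "\u0309", "\u0303", "\u0301", "\u0323"}

def to_cp1258(text):
    """Encode text as cp-1258: base vowel kept composed, tone mark kept combining."""
    pieces = []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        tones = "".join(c for c in decomposed if c in TONE_MARKS)
        base = "".join(c for c in decomposed if c not in TONE_MARKS)
        # Recompose the base (e.g. e + circumflex -> ê), then append the tone mark.
        pieces.append(unicodedata.normalize("NFC", base) + tones)
    return "".join(pieces).encode("cp1258")

def from_cp1258(data):
    """Decode cp-1258 bytes and compose the result into pre-composed characters."""
    return unicodedata.normalize("NFC", data.decode("cp1258"))

word = "vi\u1ebft"  # viết, with a pre-composed ế
print(len(word), len(to_cp1258(word)))
print(from_cp1258(to_cp1258(word)) == word)
```

Encoding the pre-composed string directly with `word.encode("cp1258")` raises `UnicodeEncodeError`, which is the practical consequence of the code page being a combining one.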

Unicode

Encodes both:
  Combining characters (from Unicode)
  Pre-composed characters (from ISO)
Getting more widely supported
  Wide acceptance after and for the Web
TCVN 6909:2001
  Pre-composed characters only
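Because Unicode encodes both forms, the same word can arrive as two different code point sequences. A small Python sketch of comparing them safely, assuming the application normalizes to NFC before comparison (the helper name `same_text` is hypothetical):

```python
import unicodedata

precomposed = "vi\u1ebft"        # v, i, ế, t
combining = "vie\u0302\u0301t"   # v, i, e, combining circumflex, combining acute, t

def same_text(a, b):
    """Compare two strings after bringing both to the pre-composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(precomposed), len(combining))  # different lengths in code points
print(precomposed == combining)          # raw comparison fails
print(same_text(precomposed, combining)) # normalized comparison succeeds
```

Normalizing to NFC on input is also one plausible way to satisfy a “pre-composed characters only” requirement.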

Vietnamese writing

Handwriting
  v i e ^ t ´ → viết {write}
  v i e ^ ´ t → viết {write}
Telex
Typewriter
  Dead-key
    Carriage stops at combination of diacritics, then
    Base letter

Computer

Telex convention

a  ă  â  e  ê  i  o  ô  ơ  u  ư  y
   aw aa    ee       oo ow    uw

b  c  d  đ  g  h  k  l  m  n  p  q  r  s  t  v  x
         dd

ò  ỏ  õ  ó  ọ
f  r  x  s  j

Example: vieets → viết

Computer input methods

Telex
VNI
VIQR (VIetnamese Quoted-Readable)
  Mnemonic
  Internet RFC 1456
TCVN 6064:1995
  Orthographic units

VIQR, aka VietNet

a  ă  â  e  ê  i  o  ô  ơ  u  ư  y
   a( a^    e^       o^ o+    u+

b  c  d  đ  g  h  k  l  m  n  p  q  r  s  t  v  x
         dd

ò  ỏ  õ  ó  ọ
`  ?  ~  '  .

Example: vie^'t → viết
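Since VIQR is also used to write Vietnamese in plain 7-bit text, the table above can be run in the other direction. The following Python sketch converts Unicode text to VIQR by decomposing each letter. It is an illustration only: it ignores the escaping rules a full implementation would need for literal punctuation, and it silently drops marks outside the two tables. The function name `unicode_to_viqr` is hypothetical.

```python
import unicodedata

# Marks that change the vowel itself: breve, circumflex, horn.
SHAPES = {"\u0306": "(", "\u0302": "^", "\u031b": "+"}
# The five tone marks.
TONES = {"\u0300": "`", "\u0309": "?", "\u0303": "~", "\u0301": "'", "\u0323": "."}

def unicode_to_viqr(text):
    """Spell Vietnamese text in VIQR: base letter, then vowel mark, then tone."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch == "đ":
            out.append("dd")
            continue
        if ch == "Đ":
            out.append("DD")
            continue
        base, *marks = unicodedata.normalize("NFD", ch)
        # Canonical order can put the dot below before the circumflex, so the
        # vowel mark is always emitted first and the tone mark second.
        shape = "".join(SHAPES[m] for m in marks if m in SHAPES)
        tone = "".join(TONES[m] for m in marks if m in TONES)
        out.append(base + shape + tone)
    return "".join(out)

print(unicode_to_viqr("vi\u1ebft"))
```

The output for viết matches the example on the slide.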

Are diacritics necessary?

Ma    ghost
Mà    but
Mả    tomb
Mã    code
Má    mother
Mạ    rice seedling
Mua   to buy
Mưa   rain
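The word list above can be turned into a quick Python check of what is lost when diacritics are stripped. The helper name `strip_diacritics` is hypothetical, and the approach (decompose, then drop combining marks) is one common technique rather than anything prescribed by the slides. It does not touch đ, which has no decomposition.

```python
import unicodedata
from collections import defaultdict

WORDS = {
    "ma": "ghost", "mà": "but", "mả": "tomb", "mã": "code",
    "má": "mother", "mạ": "rice seedling", "mua": "to buy", "mưa": "rain",
}

def strip_diacritics(word):
    """Decompose the word and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Group the meanings by what remains once the diacritics are gone.
collisions = defaultdict(list)
for word, meaning in WORDS.items():
    collisions[strip_diacritics(word)].append(meaning)

for plain, meanings in collisions.items():
    print(plain, meanings)
```

Six distinct words collapse into “ma” and two into “mua”, which is the point of the slide.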

TCVN 6064:1995

Corresponds to orthographic units
Closer to handwriting
Native to:
  Windows XP
    Emits combining character sequences
  Mac OS X
    Emits pre-composed characters in OS X 10.4 Tiger
    Was emitting combining character sequences

TCVN 6064:1995

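Base layout:
`  1  2  3  4  5  6  7  8  9  0  -  =  BkSp
Tab  q  w  e  r  t  y  u  i  o  p  [  ]
CapsLock  a  s  d  f  g  h  j  k  l  ;  '  \  Enter
Shift  z  x  c  v  b  n  m  ,  .  /  Shift
Control  Alt  AltGr  Control

TCVN 6064:1995 layout:
`  ă  â  ê  ô  ◌̀  ◌̉  ◌̃  ◌́  ◌̣  đ  -  ₫  BkSp
Tab  q  w  e  r  t  y  u  i  o  p  ư  ơ
CapsLock  a  s  d  f  g  h  j  k  l  ;  '  \  Enter
Shift  z  x  c  v  b  n  m  ,  .  /  Shift
Control  Alt  AltGr  Control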
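A minimal Python sketch of how such a layout behaves: each key maps directly to a letter or to a combining tone mark typed after the vowel. The key assignments are read off the layout above. The order of the five tone-mark keys is an interpretation of that slide and should be checked against the standard itself. Only unshifted keys are covered, and the function name `type_keys` is hypothetical.

```python
import unicodedata

# Unshifted keys that differ from the base layout (assumed from the slide).
TCVN6064_KEYS = {
    "1": "ă", "2": "â", "3": "ê", "4": "ô",
    "5": "\u0300", "6": "\u0309", "7": "\u0303", "8": "\u0301", "9": "\u0323",
    "0": "đ", "=": "₫", "[": "ư", "]": "ơ",
}

def type_keys(keys, precompose=False):
    """Map keystrokes to text; optionally compose the result as Tiger does."""
    text = "".join(TCVN6064_KEYS.get(k, k) for k in keys)
    return unicodedata.normalize("NFC", text) if precompose else text

as_typed = type_keys("vi38t")                   # ê followed by a combining acute
composed = type_keys("vi38t", precompose=True)  # single pre-composed ế
print(len(as_typed), len(composed))
```

The two results display identically but differ in code points, which mirrors the difference between the Windows XP and OS X 10.4 behaviour described on the previous slide.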

Input software

http://unikey.sourceforge.net
  Different input conventions
  Multiple character encodings
  Clipboard converter
    Extremely convenient for handling documents in legacy encodings
  Free, lightweight, powerful
    Use from your USB memory device!

Localization

Locale
Translation of computer terminology
Windows XP and Office 2003 SE
  Using LIP (Language Interface Pack)

Progress

Vietnamese web sites are universally in Unicode
  Exception: http://www.vietmercury.com
    This site never shows up in Vietnamese web searches!
Search
  Web: Google, Yahoo!, MSN
  Desktop: Google, MSN
Blogs, wikis, ...
Desktop & server applications

Progress & Challenges

Unicode-savvy?
  Yahoo! Mail, AOL, AIM Mail: charset “iso8859-1”
  Eudora
Unicode-savvy!
  Outlook Express, Outlook, Thunderbird
  Gmail, Netscape Mail
  Yahoo! Messenger, MSN Messenger, Skype
Not enough Unicode fonts with Vietnamese
  Example: Trebuchet MS (which reverts to Arial)
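The charset “iso8859-1” problem noted above can be reproduced in a few lines of Python. This sketch assumes the message body is really UTF-8 but is labelled, and therefore decoded, as ISO 8859-1. The sample text is chosen for illustration.

```python
text = unicodedata_free_sample = "Ti\u1ebfng Vi\u1ec7t"  # Tiếng Việt

# ISO 8859-1 has no code positions for ế or ệ, so a direct encode fails.
try:
    text.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

# UTF-8 bytes read back under the wrong label: every byte decodes, but wrongly.
garbled = text.encode("utf-8").decode("iso-8859-1")

# The damage is reversible only while the bytes are still intact.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(fits_in_latin1, len(text), len(garbled), repaired == text)
```

The garbled string is longer than the original because each Vietnamese letter occupies several UTF-8 bytes, and each byte is shown as a separate Latin-1 character.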

Challenges

User education
  GIGO (garbage in, garbage out)
    Any non-Unicode string
Legacy
  Encodings
    Remove all non-Unicode fonts

No physical standard keyboard

Chữ Nôm {Demotic script}

Chữ Nôm

Started to appear in the Xth century, after a thousand years of Chinese rule
Based on Chinese characters
In use for the next thousand years
Now replaced by Quốc ngữ

Nôm example

English: pillar
Vietnamese
  Quốc ngữ (Latin script): cột
  Nôm (ideographic script)
    Sound: cốt {sound}
    Meaning: mộc {tree}

Nôm dictionaries

1971 ~ Tự điển chữ Nôm {Nôm Dictionary}, Nguyễn Quang Xỹ & Vũ Văn Kính
1988 ~ Chu-Nomu Jiten 字 字, Takeuchi Yonosuke
1999 ~ Đại từ điển chữ Nôm {Nôm Super-Dictionary}, Vũ Văn Kính
2004 ~ Giúp đọc Nôm và Hán-Việt {Nôm & Hán-Việt Reading Guide}, Father Anthony Trần Văn Kiệm
Soon ~ Từ điển chữ Nôm tiếng Việt {Vietnamese Nôm Dictionary}, Nguyễn Quang Hồng

Nôm standard encoding

First proposed in 1992
Unicode 3.1
  9,299 characters, of which
    5,067 characters in BMP (CJKV Extension A)
    4,232 “Nôm proper” in Plane 2 (CJKV Extension B)
IRG: CJKV Extension C
  About 2,200 additional characters
(IRG = Ideographic Rapporteur Group)
(CJKV = Chinese, Japanese, Korean, Vietnamese)
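The split between the BMP and Plane 2 can be made concrete with a short Python sketch. It checks that the two counts on the slide add up to the stated total and shows how to tell which plane a character lives in. The dictionary and function names are hypothetical, and the two sample characters are arbitrary ideographs, not specifically Nôm ones.

```python
# Counts as given on the slide.
NOM_IN_UNICODE_3_1 = {"BMP (plane 0)": 5067, "Plane 2": 4232}

def plane(ch):
    """Return the Unicode plane number of a single character."""
    return ord(ch) >> 16

total = sum(NOM_IN_UNICODE_3_1.values())
print(total)

# One BMP ideograph and the first code point of CJK Extension B.
for ch in ("\u5b57", "\U00020000"):
    print(hex(ord(ch)), plane(ch))
```

Any character for which `plane` returns a value above 0 needs special handling in UTF-16, which is the “Unicode surrogates” challenge raised later in the talk.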

Nôm input methods

http://www.viethoc.com/hannom/bango_intro.php
  Source
    6 dictionaries
    Other databases
  Currently available for input
    16,638 Chinese characters (from 22,975 possible)
    11,600 Nôm characters (from 20,732 possible)
HanoKey
  HanoSoft

Nôm fonts

9,299 characters
  Mojikyo Institute, Tokyo
  Dynalab, Taipei: DFSong Light Vietnam
30,000+ characters
  Đạo Uyển, Viên Chiếu Monastery: HanNom A & B
17,000+ characters
  Nôm Na Group, Hà Nội: Nôm Na Tong Light
  4,415 basic Hán-Nôm components

Nôm online

http://www.viethoc.com/hannom/tdnom_beta.php
  Nôm Annotated Dictionary
  Uses HanNom fonts, Java applet
http://nomfoundation.org/nomdb/lookup.php
  Nôm Lookup Tool
  Uses .GIF images (was SVG)
http://www.huesoft.com.vn/hannom/
http://sager-pc.cs.nyu.edu/~huesoft/
  Việt-Hán-Nôm Dictionary

Challenges

Repertoire
  Newly discovered characters
  No coordination between active groups
Character encodings
  Slow international standardization process
  Private Use Area
  No coordination between active groups

Challenges

Input methods
  Large character repertoire
Fonts
  Unicode surrogates
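The surrogate issue comes from UTF-16, where every character beyond the BMP is stored as a pair of 16-bit units. The Python sketch below computes the pair by hand and checks it against the built-in codec. The function name is hypothetical, and U+20000 is used only because it is the first code point of Plane 2.

```python
def utf16_surrogates(code_point):
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP use surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = utf16_surrogates(0x20000)
encoded = chr(0x20000).encode("utf-16-be")
print(hex(high), hex(low), encoded.hex())
```

Fonts and applications that assume one 16-bit unit per character will mishandle such text, which is why Plane 2 support was a real obstacle for Nôm.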

Other options

Presentation
  SVG (Scalable Vector Graphics)
  CDL (Character Description Language)
    http://www.wenlin.com/cdl/

Wrapup

Brief history

Xth century ~ first Nôm writing
1651 ~ first Latin-based dictionary
1910 ~ Quốc ngữ adopted nationally
1991 ~ Quốc ngữ orthographic units in Unicode 1.0
1993 ~ RFC 1456 (VIQR)
1993 ~ Quốc ngữ pre-composed in Unicode 1.1, ISO/IEC 10646-1
1995 ~ TCVN 5712 (8-bit), 5773 (Chữ Nôm)
1995 ~ TCVN 6064 (keyboard)
2000 ~ Chữ Nôm in Unicode 3.1
2001 ~ TCVN 6909 (Unicode)
2004 ~ First International Nôm Conference
2005 ~ Vietnamese Windows XP and Office 2003 SE

Dichotomies (& synergy?)

Latin, ideographic
Different encodings
Combining, pre-composed
Telex, VNI, TCVN 6064
Fonts
Active working groups

Challenges

Standardization      Qn   Nôm
  Repertoire ☺
  Character encoding ☺
  Input methods
  Fonts ☺

Challenges

Usage      Qn   Nôm
  Legacy ☺
  Application support ☺
  User education

Ultimately

To make Vietnamese like any other language (such as English) in computers
Goal: an ordinary user of Vietnamese on computers should not have to know about UTF-8 or character encodings at all

Thanks!

vietual.blogspot.com

Acknowledgements

With thanks to:
  Roger Sherman
  David Murphy
  James Turley
    for enabling the presentation at IMUG and on the web
  Hồ Văn Tiến
  Ngô Thanh Nhàn
  Ken Lunde
  Tex Texin
    for comments and corrections
  The IMUG (International Macintosh Users Group) audience
    for interesting questions and a very lively exchange

Q & A