Internationalization, Localization & Unicode

Karunesh Arora, Vijay Gugnani

C-DAC Noida (www.cdacnoida.in)

“Everyone has the right... to seek, receive and impart information and ideas through any media regardless of frontiers” -- Universal Declaration of Human Rights

Internationalization

Internationalization, often referred to as i18n, is the practice of designing and developing an application, product or document in a way that makes it easily localizable for target audiences that vary in culture, region, or language.

Why Internationalization?

• To remove barriers to local and international access

• To adapt to local, regional, linguistic or cultural needs

• To provide global reach

• To improve ROI and generate revenue

Internationalization Vs. Localization

Localization is the actual adaptation of a product to meet the language, cultural, and other requirements of a specific target audience.

While internationalization gives us the technology and tools to target a given audience, it is the act of localization that makes the product accessible to that audience.

What goes with localization?

• Localization is much more than translation.

Specifically, localization refers to adaptation to another language, which involves appropriate:

– Language translation

– Locale transformation and cultural aspects

Language Translation: Languages and Countries

• Most languages are used in many countries, not just those where they are dominant or “official”

• People migrate and take their languages with them

• Over enough time, most languages evolve differently in different locations

Language Translation: Scripts and Languages

• A “script” may be defined as a collection of related characters

– It is common for several languages to share most, but not all, characters from a given script

– Scripts are often given the same name as one of the languages that use them

  • Arabic script, but Arabic, Farsi, Urdu, … languages

– Scripts are also given a common name for a group of languages

  • Devanagari script for Hindi, Marathi, Nepali, Konkani, etc.

Language Translation

Some points to consider:

• Identify ‘translatable’ and ‘non-translatable’ strings

• Gender and number agreement, and the ordering of segments in a sentence, differ between languages

e.g. Page number ->

e.g. Number of pages ->

• Many languages can take at least 30% more space than the source text

e.g. Tool (EN) – उपकरण (HI); customer (EN) – ग्राहक (HI)

– The design should accommodate this, or else the UI may have to be redesigned

– Narrow columns often cannot accommodate long target-language equivalents

Language Translation

Some points to consider (contd.):

• Avoid ambiguous phrases

– e.g. ‘Display options’ can mean:

  • Options of the display (noun + noun)

  • Show the options, all of them (verb + noun)

• Proverbs and metaphors may not have equivalents in the target language

• Keep Web pages and paragraphs short

• Avoid text in graphics

• Use simple grammatical structures

• Use everyday language

• Provide clues

Language Translation

Some points to consider (contd.):

• Follow source language conventions

• Avoid acronyms

• Abbreviations may have to be expanded when translated

• Check spelling and grammar

• The more compact the source writing, the longer the translation

• Brief translators about the purpose and target audience

• All items in a menu or set of check boxes should have the same grammatical structure

Locale

• A locale is the set of parameters that defines the user’s language, country and cultural preferences

Different aspects of locale

• Names & Titles

• Calendars

• Numeric, Date and Time formats, Addresses

• Currencies, Paper size, Weights & measures

• Input Mechanism

• Language Selection

• Oral Pronunciation

Titles and Names

• In India, it is required to specify a title (… etc.)

– These titles do not necessarily translate

• The family name is not always last (e.g. in the South and West of the country)

• Sorting can be based on the last name or the first

• Salutations in letters (e.g. Dear) differ between locales

Titles and Names

Source: Delhi Press Prakashan

Calendars

• The Gregorian calendar should not always be assumed

– Proper localization of some software requires the use (at least as an option) of calendars distinct to a culture

  • e.g. the Vikram Samvat, Saka or Hijri calendars in India

  • Calendars of various religions, where year 0 was not 2006 years ago

– Fiscal-year based calendars vary widely

  • Some have 13 months of 28 days (364 days) or 53 weeks

Date formats

• Date separators depend on the locale: ‘/’, ‘-’, ‘.’

• ‘am’ and ‘pm’ are not used universally (many cultures use a 24-hour clock)

• ISO standard dates are unambiguous: yyyy-mm-dd hh:mm:ss

• A non-ISO date such as 01-03-02 means different things in different locales

• If not using ISO, display dates in the locale of the user

• Preferably use a ‘long’ form with the month spelled out (in the correct language)
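
The ambiguity of a date such as 01-03-02 can be sketched in a few lines of Python. The three format strings and the names `READINGS` and `interpret` are illustrative assumptions, not an exhaustive list of locale conventions; the point is only that the same digits parse to different ISO dates.

```python
from datetime import datetime

# Three plausible readings of the same digits. These format strings are
# illustrative assumptions, not a complete list of locale conventions.
READINGS = {
    "day-month-year": "%d-%m-%y",
    "month-day-year": "%m-%d-%y",
    "year-month-day": "%y-%m-%d",
}


def interpret(text: str) -> dict:
    """Parse `text` under each reading and return unambiguous ISO dates."""
    return {
        name: datetime.strptime(text, fmt).date().isoformat()
        for name, fmt in READINGS.items()
    }


if __name__ == "__main__":
    for name, iso in interpret("01-03-02").items():
        print(f"{name:>15}: {iso}")
```

Storing and exchanging the ISO form, and converting to a locale-specific display form only at the user interface, avoids having to guess which reading was intended.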

Formatting Numbers

• Number formats depend on the user’s locale, not on the language of the application

• Group separation

– Number of digits in a group

  • In English and ISO it is 3, while for Indic languages it is different: 1,23,456, i.e. ##,##,##,###

– Group separator

  • In English ‘,’, but ISO uses a space, and some locales use ‘.’ or none

• Decimal separator: ‘.’ or ‘,’

• Negative symbol: ‘-’, ‘~’, ‘(…)’
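
The two grouping patterns can be compared with a small Python sketch. The function names `group_indian` and `group_western` are hypothetical helpers written for this illustration; a production system would normally take such patterns from locale data rather than hard-coding them.

```python
def group_indian(n: int) -> str:
    """Group digits as ##,##,##,### (last three digits, then pairs)."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    head, tail = digits[:-3], digits[-3:]
    pairs = []
    while head:
        # Peel two digits at a time off the right-hand end of the head.
        pairs.insert(0, head[-2:])
        head = head[:-2]
    return sign + ",".join(pairs + [tail])


def group_western(n: int) -> str:
    """Group digits in threes, as in English-language conventions."""
    return f"{n:,}"


if __name__ == "__main__":
    for value in (999, 1000, 123456, 12345678):
        print(group_western(value), "vs", group_indian(value))
```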

Currency

• Use the currency symbol of the data

– i.e. INR doesn’t automatically translate to £ or $ when the locale changes

• The format depends on the user’s locale, not on the currency

– Differences in formats:

  • Symbol

  • Position (before or after the amount)

  • Blanks separating the symbol from the amount

Currency contd…

Different ways of expressing Rs. 1000:

• Rs.1000

• Rs. 1000/-

• Rs.1,000/-

• Rs. 1000.00

• INR 1000

• 1000 Rupees

• 1000 रुपये

Strong currencies such as the Indian rupee need decimal precision (e.g. 2 digits after the decimal point for paise)

Language selection

• Avoid using national flags to choose the preferred language

– Multiple countries use the same language

• In what order should the languages be listed?

• In which language should the language names be displayed?

– In the language itself, or with a translation in the default language of the operating system

Pronunciation

• Important for speech-based systems

– Higher recognition accuracy can be obtained by tailoring voice input to regional dialects

– Voice output in the wrong dialect can make an application sound ‘foreign’

– Applications that support regional dialects have greater impact

Culture

• Culture is a complex collection of experiences which condition daily life

• It includes:

– history

– social structure

– geographical effects

– religion

– traditional customs and everyday usage

Cultural issues

• Icons, symbols and images

• Colors, myths, beliefs and feelings

• Humour

• Geographical & environmental effects

• Customs & traditions

• Social Security Numbers

Icons & Symbols

• Icons that are a play on words do not translate, e.g.

– A dust bin for dumping files

– A rocket for launching an application

– Scissors for cutting in edit operations

– “B”, “I”, “U”

• Some concepts have been found extremely hard to represent as an icon

– e.g. sorting (‘A->Z’ is not universal)

• Images of people or body parts such as hands

– Considered inappropriate in some cultures

– What skin color do you use?

– Images of people need to be localized for each country

Colors & Humour

• The color white may represent purity and green prosperity in the Indian context, but it may not be the same in another culture.

• Humour generally does not translate

• People are sensitive to different things in different cultures

• Jokes/cartoons can be offensive

Customs & Traditions

• In Indian culture, people show respect to their elders and to renowned personalities by addressing them in the plural.

e.g. Dr. Manmohan Singh is the prime minister of India.

डॉ. मनमोहन सिंह भारत के प्रधानमंत्री हैं।

Similarly, in social relationships, there are several words for one relation

e.g. for ‘uncle’: चाचा, ताऊ, मामा

Unicode?

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Source: http://unicode.org

Universal Character Encoding

• Unique number for every character

Unifies all Languages

• 96 thousand characters, so far

• All characters accessible at the same time, in the same document:

क, க, ಔ,…
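
As a minimal sketch of "a unique number for every character", the Python snippet below prints the code point and the character name for the three letters shown above, all held in one ordinary string. The helper name `describe` is an assumption made for this example; `ord` and `unicodedata.name` are standard-library calls.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return the code point (as U+XXXX) and the Unicode name of one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))


if __name__ == "__main__":
    # Devanagari, Tamil and Kannada letters side by side in a single string.
    for ch in "कகಔ":
        print(ch, *describe(ch))
```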

Widespread Support

• Developed & supported by industry leaders:

– Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, …

• Supported in standards:

– XML, HTML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, Perl, etc.

• Implemented in:

– All modern operating systems, browsers, and other products

IDN (Internationalized Domain Names)

– http://भाषा.in
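
An internationalized domain name is carried through the DNS as ASCII. The sketch below shows only the core idea, using Python's built-in `punycode` codec and the `xn--` prefix; the helper names `to_ascii_label`, `to_unicode_label` and `to_ascii_domain` are assumptions for this example, and the normalization and validation steps that a full IDNA implementation performs are deliberately left out.

```python
def to_ascii_label(label: str) -> str:
    """Encode one domain label with Punycode and add the 'xn--' prefix.

    Simplified: real IDNA processing also normalizes and validates the label.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def to_unicode_label(label: str) -> str:
    """Reverse of to_ascii_label for labels that carry the 'xn--' prefix."""
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def to_ascii_domain(domain: str) -> str:
    """Apply to_ascii_label to every dot-separated label of a domain name."""
    return ".".join(to_ascii_label(part) for part in domain.split("."))


if __name__ == "__main__":
    ascii_form = to_ascii_domain("भाषा.in")
    print(ascii_form)
    print(".".join(to_unicode_label(p) for p in ascii_form.split(".")))
```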

Information about Unicode

• www.unicode.org

– Online Standard

– Technical Reports

– FAQs

– General Information

– Discussion Forums, Conferences

Resources Availability

• System APIs:

– Windows, Java, Unix, Oracle, DB2, Sybase, Mac, Linux, …

• Languages

– Java, JavaScript, C#, Perl 5.6.0, C, C++, SQL, …

• Cross-platform libraries:

– ICU, Rosette, …

Indic Support in Unicode

• ISCII is the basis for the characters and their allocation

• DIT is a member of the Unicode Consortium

• Reports have been submitted on missing characters, clarifications or corrections of usage

ISCII and Unicode: Similarities

• Within script, layout and contents nearly identical

• Independent + dependent vowels

• Halant model for representing conjuncts

– conjuncts / half-forms not directly encoded

– represented by sequences instead

• Phonetic sequence – order in syllables
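
The halant model can be seen directly in the stored code points. In this sketch the conjunct क्ष is written with explicit escapes so that the sequence is visible: consonant, halant (named VIRAMA in Unicode), consonant. The names `KSSA` and `code_points` are assumptions for this example.

```python
import unicodedata

# The conjunct "क्ष" is not a single encoded character. It is stored as the
# sequence KA + VIRAMA (halant) + SSA and joined by the rendering engine.
KSSA = "\u0915\u094D\u0937"


def code_points(text: str) -> list:
    """List (code point, Unicode name) for every character in `text`."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]


if __name__ == "__main__":
    print(KSSA, "is stored as", len(KSSA), "code points:")
    for cp, name in code_points(KSSA):
        print(" ", cp, name)
```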

ISCII and Unicode: Differences

• Unicode is stateless:

– No shifting to get different scripts

– Each character has a unique number

• Unicode is uniform:

– No extension bytes necessary

– All characters coded in the same space

Advantages

• Accessible Information across the globe

• Seamless multilingual documents

• Opens up software export market, beyond English

• Connects India to the world

The Future

• The world is moving rapidly to Unicode

• Unicode makes India open to the world

– The world comes to you, and

– You go to the world

Multiple Forms

• UTF-8: maximal compatibility with 8-bit systems

• UTF-16: good storage, interoperability with Windows/Java

• UTF-32: simplest processing

• Fast, lossless conversion
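
The trade-offs between the three forms can be sketched by encoding the same text each way and checking that decoding restores it. The helper names `encoded_sizes` and `round_trips` are assumptions for this example; the little-endian codec variants are used so that no byte-order mark is counted in the sizes.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict:
    """Number of bytes `text` occupies in each encoding form."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


def round_trips(text: str) -> bool:
    """True if encoding then decoding restores the text in every form."""
    return all(text.encode(enc).decode(enc) == text for enc in ENCODINGS)


if __name__ == "__main__":
    for sample in ("A", "\u0915"):  # Latin A and Devanagari KA
        print(repr(sample), encoded_sizes(sample), round_trips(sample))
```

For Devanagari text, UTF-16 is the most compact of the three, while UTF-8 keeps ASCII text at one byte per character.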

W3C Internationalization Activity

Some Issues under discussion in Indian Languages (IL)

• Presentation / Styling issues

– Styling of first character

If a styling feature is to be applied to the first character, should it be applied to a single character, a conjunct, a syllable or a grapheme cluster?

e.g.

स्थिति (Position)

प्रस्थान (Departure)

स्वर (Vowel)

कोश (Dictionary)

हिंदी (Hindi)

हिन्दी (Hindi)

क्षेत्रीय (Regional)
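
The first-character question can be made concrete with a deliberately simplified Python sketch. The rule used here is an assumption made for illustration (a base letter, plus any combining marks, plus any consonant joined through a halant); it is not the Unicode grapheme-cluster algorithm and it ignores joiner characters. The function name `first_syllable` is hypothetical.

```python
import unicodedata

VIRAMA_CLASS = 9  # canonical combining class of the halant (virama) signs


def first_syllable(word: str) -> str:
    """Return the leading run of `word` that a first-letter style might cover.

    Simplified rule (an assumption, not a standard): keep extending while the
    next character is a combining mark, or is a letter that directly follows
    a halant.
    """
    if not word:
        return ""
    end = 1
    while end < len(word):
        ch, prev = word[end], word[end - 1]
        if unicodedata.category(ch).startswith("M"):
            end += 1  # matra, anusvara, halant, ...
        elif (unicodedata.combining(prev) == VIRAMA_CLASS
              and unicodedata.category(ch).startswith("L")):
            end += 1  # consonant joined to the previous one by a halant
        else:
            break
    return word[:end]


if __name__ == "__main__":
    for word in ("\u0938\u094D\u0925\u093F\u0924\u093F",   # sthiti
                 "\u0939\u093F\u0902\u0926\u0940",         # hindi
                 "\u0915\u094B\u0936"):                    # kosh
        print(word, "->", first_syllable(word), "vs single character", word[0])
```

Styling only `word[0]` would split a conjunct or separate a consonant from its vowel sign, which is why the choice of unit matters for Indic scripts.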

Some Issues under discussion in IL

• Presentation / Styling issues

– In cursive text, as in Arabic and Urdu, the styling is applied to the whole word

e.g. Saabiq -> Former (Urdu)

Source: Rashtriya Sahara

Some Issues under discussion in IL

• Presentation / Styling issues

– Vertical arrangement of characters

If a string is written in vertical mode, writing each character on a new line may not be suitable

http://www.w3.org/International/notes/firstletter.html

Some Issues under discussion in IL

• Presentation / Styling issues

– Horizontal spacing

Some Issues under discussion in IL

• Presentation / Styling issues

– Bullets and numbers

Numbering schemes in Indian languages should also be supported.

Some Issues under discussion in IL

• Presentation / Styling issues

– Collation

A means to search and order data in a way that makes sense to users in their particular culture

Myths:

– “One collation is good enough”

– “The system is Unicode enabled, so sorting is already covered”
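
The second myth is easy to demonstrate: sorting Unicode strings by code point is not the same as sorting them the way users expect. In this sketch the word list and the helper names are assumptions, and `str.casefold` is only a stand-in for real, culture-specific collation rules such as those provided by a collation library.

```python
def code_point_order(words: list) -> list:
    """Default sort: compares strings code point by code point."""
    return sorted(words)


def case_insensitive_order(words: list) -> list:
    """A simple tailored order; real collation rules are far richer."""
    return sorted(words, key=str.casefold)


if __name__ == "__main__":
    words = ["banana", "Zebra", "apple"]
    print(code_point_order(words))        # capital letters sort before small ones
    print(case_insensitive_order(words))  # closer to dictionary order
```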

Some Issues under discussion in IL

• Presentation issues

– Underlining of the characters

अन्य भाषाओं में भी अनुवाद

Some Issues

• Searching issues

– Searching is difficult across languages that share a script, because some words are spelled the same but mean different things

Issues on presentation on other devices

• Addressing input mechanisms and predictive input for vernacular languages

• Handling display issues on handheld devices with smaller screens, especially when text is translated

• Standardizing encodings in communication to account for the cost of bandwidth (ISCII / Unicode / compressed Unicode), connectivity, and on-the-fly conversion of encodings

References and acknowledgements

• http://www.w3.org/international

• Articles by Richard Ishida, Felix Sasaki, W3C

• http://macchiato.com/slides/UnicodeAndIndia.ppt , Presentation by Mark Davis

• www.site.uottawa.ca/ftppub/courses/Winter/csi5122/coursenotes/5122Internationalization.ppt

Thank you