What’s New in Globalization? Mark Davis, President & Cofounder, The Unicode Consortium

What’s New in Globalization?

Mark Davis, President & Cofounder

The Unicode Consortium

The Unicode Standard, Version 5.0

“Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years.”

— Donald E. Knuth

“For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users.”

— Bill Gates

“The path W3C follows to making text on the Web truly global is Unicode.”

— Sir Tim Berners-Lee, KBE

“Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world.”

— James Gosling

The Unicode Standard, Version 5.0

Obsoletes previous versions

Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few.

Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes

Systematic framework for improved text processing

Improvements to the Unicode Encoding Model for UTF-8, …

Rigorous stability of case folding and identifiers

Improved interoperability and backward compatibility

Enabling additional new ways to optimize code

U5.0 Unicode Character Database

Unicode: far more than a list of characters

Properties: key to how characters function

Changes in 5.0

Scripts: Unassigned code points → Zzzz

Casing Stability: Upper → folded

BIDI: Consistent Bidi_Mirrored

Now Normative: kIICore

Line Break: SE Asian → Complex_Context

New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties
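As a rough illustration of what "properties" means in practice, the sketch below queries a few Unicode Character Database properties through Python's standard `unicodedata` module. The data version is whatever ships with the interpreter (not necessarily 5.0), and the helper name `describe` and the sample characters are arbitrary choices made for this example.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a handful of UCD properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "bidi_mirrored": bool(unicodedata.mirrored(ch)),
        "combining_class": unicodedata.combining(ch),
    }

if __name__ == "__main__":
    # The UCD version depends on the Python build, not on this deck.
    print("UCD version:", unicodedata.unidata_version)
    # Latin capital, mirrored punctuation, Hebrew letter, combining accent.
    for ch in ["A", "(", "\u05D0", "\u0301"]:
        print(f"U+{ord(ch):04X}", describe(ch))
```

Code that branches on properties such as the general category or bidi class, rather than on hard-coded character ranges, tends to keep working as new characters are added.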

General 99,089

Private Use 137,468

Surrogate 2,048

Noncharacter 66

Reserved 875,441
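The allocation figures in this chart can be cross-checked with a little arithmetic. The sketch below reads the chart's numbers as General 99,089 and Reserved 875,441 (the digits are split across lines in the extracted text) and confirms that the five categories add up to the full code space of 17 planes. The derived counts for private use, surrogates, and noncharacters follow from the well-known code point ranges.

```python
# Code space: 17 planes of 65,536 code points each (U+0000..U+10FFFF).
TOTAL = 17 * 0x10000

# Figures as read from the chart for Unicode 5.0.
allocation = {
    "General": 99_089,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 875_441,
}

# Counts derived from the structure of the code space.
# BMP private use area U+E000..U+F8FF, plus planes 15 and 16
# (each plane minus its last two code points, which are noncharacters).
private_use = (0xF8FF - 0xE000 + 1) + 2 * (0xFFFD + 1)
# Surrogates U+D800..U+DFFF.
surrogates = 0xDFFF - 0xD800 + 1
# Noncharacters: U+FDD0..U+FDEF, plus the last two code points of each plane.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17

if __name__ == "__main__":
    print("sum of chart figures:", sum(allocation.values()))
    print("size of code space:  ", TOTAL)
```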

U5.0 Conformance

Stable Case-Folded ≈ Upper → Lower

Much clearer encoding / property model

Stable Approved Named Character Sequences

Bengali, Gurmukhi, Tamil changes

Combining grapheme joiner clarified

Disunification of Diacritics
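To make the case-folding item concrete, here is a minimal sketch using Python's built-in `str.casefold`, which applies full Unicode case folding from the interpreter's own copy of the character database. The helper `upper_then_lower` is a hypothetical name for one reading of the slide's "Upper → Lower" approximation; it is not an API defined by the standard, and a complete caseless match would normally also normalize the strings.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full case folding."""
    return a.casefold() == b.casefold()

def upper_then_lower(s: str) -> str:
    """Approximate folding by uppercasing, then lowercasing."""
    return s.upper().lower()

if __name__ == "__main__":
    # Sharp s, long s, final sigma, and a mixed-case word.
    for sample in ["\u00DF", "\u017F", "\u03C2", "Stra\u00DFe"]:
        print(repr(sample), repr(sample.casefold()), repr(upper_then_lower(sample)))
    # Plain lowercasing is not enough for caseless matching:
    print("Stra\u00DFe".lower() == "STRASSE".lower())   # lowercase forms differ
    print(caseless_equal("Stra\u00DFe", "STRASSE"))     # folded forms match
```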

5.0 Annexes: Core

UAX #9: Bidirectional Algorithm

Tightened conformance requirements

UAX #15: Unicode Normalization Forms

New Stream-Safe Text Format

Appendix of characters requiring special handling

Expanded info on stability guarantees

Additional detailed figures, guidelines

UAX #31: Identifier and Pattern Syntax

Added profiles & information on usage
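The normalization forms covered by UAX #15 can be tried out with Python's standard `unicodedata.normalize`. This is only a small sketch of NFC, NFD, and NFKC on hand-picked sample strings; it does not implement the Stream-Safe Text Format mentioned above, and the helper name `same_text` is made up for this example.

```python
import unicodedata

composed = "\u00E9"       # e with acute as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
ligature = "\uFB01"       # LATIN SMALL LIGATURE FI (a compatibility character)

def same_text(a: str, b: str) -> bool:
    """Treat canonically equivalent strings as equal by comparing NFC forms."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

if __name__ == "__main__":
    print(composed == decomposed)                          # raw code points differ
    print(same_text(composed, decomposed))                 # canonically equivalent
    print(len(unicodedata.normalize("NFD", composed)))     # base letter plus mark
    print(unicodedata.normalize("NFKC", ligature))         # compatibility mapping
```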

U5.0 Annexes: Boundaries

UAX #14: Line Breaking Properties

Rules modified to improve behavior

Now Normative (conformance clauses reorganized)

UAX #29: Text Boundaries

Edge cases improved

Tailorings for text boundaries now in Unicode CLDR

Format of the rules changed to ease implementation

Additional guidelines on regex, identifiers,…

U5.0 Characters by Script

[Chart: U5.0 characters by script. Labels: Phags Pa, Phoenician, Devanagari, Hebrew, Greek, Kannada, Nko, Common, Latin, Inherited, Cyrillic, Cuneiform, Balinese]

Unicode Character Timeline

[Chart: character counts on a logarithmic axis (1 to 1,000,000) for Unicode versions 2.0.0, 2.1.2, 3.0.0, 3.1.0, 3.2.0, 4.0.0, 4.1.0, and 5.0.0. Series: Letter, Symbol, Mark, Number, Punctuation, Control/Format, Separator]

Unicode Guide for Programmers

Adjunct to Standard

Concise Guide for Software Globalization

Crucial Concepts

Key “Gotchas”: Recognize and Avoid

Details on

Encoding & conversions: UTF-8, 16, 32 & BOM

Using character properties

Text Operations
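As a small illustration of the encoding topics listed above, the sketch below encodes one sample string in UTF-8, UTF-16, and UTF-32 and shows how a byte order mark (BOM) is handled by Python's standard codecs. The sample text is an arbitrary choice: an ASCII letter, the euro sign, and one supplementary-plane letter.

```python
import codecs

# ASCII letter, U+20AC EURO SIGN, U+10400 (outside the BMP).
text = "A\u20AC\U00010400"

utf8 = text.encode("utf-8")        # 1 + 3 + 4 bytes
utf16 = text.encode("utf-16-be")   # 2 + 2 + 4 bytes (surrogate pair for U+10400)
utf32 = text.encode("utf-32-be")   # 4 bytes per code point

# A UTF-8 BOM survives a plain "utf-8" decode as U+FEFF,
# while the "utf-8-sig" codec strips it.
with_bom = codecs.BOM_UTF8 + "hi".encode("utf-8")
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")

# The generic "utf-16" codec uses the BOM to pick the byte order.
le_bytes = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
from_le = le_bytes.decode("utf-16")

if __name__ == "__main__":
    print(len(utf8), len(utf16), len(utf32))
    print(repr(kept), repr(stripped), repr(from_le))
```

One common gotcha this shows: string length depends on the encoding form, so byte counts, UTF-16 code unit counts, and code point counts should not be used interchangeably.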

Unicode Common Locale Data Repository: CLDR

Key locale data for world languages

Most extensive standard repository of locale data

XML format

Δευτέρα, 05 Σεπτεμβρίου 2005

Montag, 5. September 2005

￥ 1,234.57

1 234,57 руб.

Arabic – arabski
Bulgarian – bułgarski
Czech – czeski
…

Africa – 非洲
Central America – 中美洲
Eastern Africa – 东非
Northern Africa – 北非
…

AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…

Z < Å

Unicode CLDR 1.4

121 languages and 142 territories – 360 locales in all

25% more locale data; over 17,000 new/modified items

Repository separated into language vs locale data

Language-specific segmentation (word/line breaks…)

Transliterations (e.g. Ελληνικά ↔ Ellēniká)

Data for lenient date/time formatting and parsing

Programmer asks for “numeric day” + “abbreviated month”

Best format pattern returned, e.g. “dd.MMM”

+ Quarters in dates (e.g. 2006Q1)

BCP 47 compatibility + extensions

BCP 47 Language Tags

Usage: HTTP, HTML, XML; CLDR Locale IDs…

RFC 4646; Obsoletes RFCs 1766, 3066

Addresses problems in RFC 3066

ISO standards: stability / accessibility / ambiguity

Parseability, Extensibility; Registration speed

Identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
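The parseability point can be illustrated with a deliberately simplified parser. The sketch below handles only the leading language, optional script, and optional region subtags, and returns everything else unparsed; it is not the full RFC 4646 grammar (no extlang, variant, extension, or private-use handling, and no registry validation). The names `LangTag` and `parse_tag` are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple

class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    rest: Tuple[str, ...]   # any subtags this sketch does not interpret

def _alpha(s: str) -> bool:
    """True for non-empty ASCII-letter-only subtags."""
    return s.isascii() and s.isalpha()

def parse_tag(tag: str) -> LangTag:
    subtags = tag.split("-")
    language = subtags[0]
    if not (_alpha(language) and 2 <= len(language) <= 8):
        raise ValueError(f"not a language subtag: {language!r}")
    i = 1
    script = region = None
    # Script: exactly four letters, conventionally title-cased (Hant, Latn).
    if i < len(subtags) and len(subtags[i]) == 4 and _alpha(subtags[i]):
        script = subtags[i].title()
        i += 1
    # Region: two letters (upper-cased by convention) or three digits.
    if i < len(subtags):
        s = subtags[i]
        if (len(s) == 2 and _alpha(s)) or (len(s) == 3 and s.isascii() and s.isdigit()):
            region = s.upper()
            i += 1
    return LangTag(language.lower(), script, region, tuple(subtags[i:]))

if __name__ == "__main__":
    for tag in ["zh-Hant", "sr-Latn", "az-Cyrl", "sr-latn-rs", "es-419"]:
        print(tag, "->", parse_tag(tag))
```

Because each subtag type has a fixed length and character class, the script and region can be told apart by shape alone, which is what makes tags of this form parseable without a lookup table.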

Unicode Security

Examples: Visual Confusables: “paypal.com” with Cyrillic ‘a’…

Non-visual problems: buffer overflows, non-shortest form,…

UTR #36: Unicode Security Considerations

Guidelines & Recommendations

UTS #39: Unicode Security Mechanisms

Algorithms & Data

Limitations on Repertoire

Testing for Confusables

http://www.unicode.org/reports/tr36/

http://www.unicode.org/reports/tr39/
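Both kinds of problem named above can be demonstrated in a few lines. The sketch below is a crude heuristic, not the UTS #39 confusable-detection algorithm: it guesses a letter's script from the first word of its character name (Python's standard library does not expose the Script property), and it relies on the strict built-in UTF-8 decoder to reject non-shortest-form byte sequences. The function names are made up for this example.

```python
import unicodedata

def script_hints(label: str) -> set:
    """First word of each letter's character name, e.g. LATIN or CYRILLIC.

    A rough stand-in for the Script property, used here for illustration only.
    """
    hints = set()
    for ch in label:
        if ch.isalpha():
            hints.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return hints

def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters appear to come from more than one script."""
    return len(script_hints(label)) > 1

def is_well_formed_utf8(data: bytes) -> bool:
    """Strict decoding rejects overlong (non-shortest form) sequences."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

if __name__ == "__main__":
    genuine = "paypal"
    spoofed = "p\u0430yp\u0430l"          # U+0430 CYRILLIC SMALL LETTER A
    print(genuine == spoofed, looks_mixed_script(genuine), looks_mixed_script(spoofed))
    print(is_well_formed_utf8(b"/"), is_well_formed_utf8(b"\xc0\xaf"))  # overlong "/"
```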

Internationalized Domain Names

One instance of broad problem

Many RFCs use Nameprep – limited to Unicode 3.2

Unicode recommendations

Narrow the repertoire: exclude symbols, punctuation

Expand the coverage: currently only Unicode 3.2.

IETF idn-nextsteps published

Some positive developments, but misreads Unicode, needs more work
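For a feel of how Nameprep-based IDN processing behaves, Python's built-in `idna` codec can serve as a convenient example; it implements the original IDNA with Nameprep, and the standard library keeps a frozen Unicode 3.2 database (`unicodedata.ucd_3_2_0`) for that purpose. The domain name below is a made-up sample, and this sketch says nothing about how any particular registry or browser handles names.

```python
import unicodedata

# A sample internationalized name; "example" is a reserved test domain label.
name = "b\u00FCcher.example"

# ToASCII: Nameprep, then Punycode with the "xn--" prefix, label by label.
ace = name.encode("idna")

# ToUnicode: back to the Unicode form.
round_trip = ace.decode("idna")

# Nameprep case-folds non-ASCII labels before encoding.
folded = "B\u00DCCHER.example".encode("idna")

if __name__ == "__main__":
    print(ace, round_trip, folded)
    # The database version pinned for Nameprep in this Python build.
    print(unicodedata.ucd_3_2_0.unidata_version)
```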

URL → IRI

International Resource Identifier (IRI)

UTF-8, %-escaped

Example:

http://w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html

http://w3.org/International/articles/idn-and-iri/JP%E7%B4%8D%E8%B1%86/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A%E7%B4%8D%E8%B1%86.html

See http://ietf.org/rfc/rfc3987.txt
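The mapping from an IRI to its %-escaped form can be sketched with Python's standard `urllib.parse`. This minimal version, with the hypothetical name `iri_to_uri`, encodes non-ASCII characters in the path, query, and fragment as UTF-8 and percent-escapes each byte; it leaves the host untouched (a non-ASCII host would need IDN processing instead) and skips other details of RFC 3987.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Percent-escape non-ASCII characters in path, query, and fragment."""
    parts = urlsplit(iri)
    # Keep "%" safe so already-escaped input is not escaped twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&/?%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))

if __name__ == "__main__":
    iri = "http://w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html"
    uri = iri_to_uri(iri)
    print(uri)
    # Unescaping as UTF-8 recovers the original IRI.
    print(unquote(uri) == iri)
```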

Ideographic Variation Database

U+82A6 ashi: multiple forms

The first occurrence – any glyph

Second occurrence is in the name of the town Ashiya – customarily displayed with form #4

Registration for variants

Ideographic Variation Database

Variation Selector

Identifies a restriction on the appearance of a character

Character + Variation Selector = Variation Sequence

Han ideographs

Impossible to build a single collection for everyone: requirements from scholars, governments and publishers…

Instead, registration of multiple independent collections

Unicode Ideographic Variation Database

A given variation sequence is used in at most one collection

Makes interchange of variation sequences reliable.

Registration, not Assessment
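In code, a variation sequence is simply a base character immediately followed by a variation selector. The sketch below groups such pairs in a string, using U+82A6 from the earlier slide. The choice of VARIATION SELECTOR-17 (U+E0100) is an assumption made purely for illustration; it is not a claim about which sequence any registered collection assigns to a particular glyph form, and the function names are hypothetical.

```python
import unicodedata

BASE = "\u82A6"        # the ideograph discussed on the earlier slide
VS17 = "\U000E0100"    # a supplementary variation selector, picked arbitrarily

def is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def split_variation_sequences(text: str) -> list:
    """Group each base character with a directly following variation selector."""
    out = []
    for ch in text:
        attach = (
            is_variation_selector(ch)
            and out
            and len(out[-1]) == 1
            and not is_variation_selector(out[-1])
        )
        if attach:
            out[-1] += ch
        else:
            out.append(ch)
    return out

if __name__ == "__main__":
    # First occurrence unconstrained, second occurrence with a selector.
    text = BASE + "、" + BASE + VS17 + "屋"
    for unit in split_variation_sequences(text):
        print([f"U+{ord(c):04X}" for c in unit])
    print(unicodedata.name(VS17))
```

A renderer that does not know a given sequence can ignore the selector and fall back to any glyph for the base character, so plain-text interchange still works.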

ICU 3.6

Mature, portable C/C++/Java int’l libraries

Unicode 5.0, UCA 5.0, CLDR 1.4

ICU4C: Charset Detection

Improved: Time Zones, Thai word break, UText (64 bit), Performance, Data Management,…

ICU4J: Globalization Preferences

Flexible date/time formats*, Charset conversion*

Near-Term Issues

Unicode 5.0.1, Unicode 5.1

CLDR / BCP 47bis

LDAP

Collation Registry

IANA Charset Registry

Unicode 5.1 - possibilities

Characters

CJK Unified Ideographs Extension C

Minority Scripts: Cham and Lanna

Malayalam chillu

…

Properties/Behavior

Normalization process for stable strings

…

CLDR 1.5 / BCP 47bis

CLDR 1.5

Data Submission Starting November

New structures / data

BCP 47

Adding ~7,000 (!) new language subtags

Possibly other changes…

LDAP

Now has definitive comparison (good)

Stuck at Unicode 3.2 (bad)

http://www.ietf.org/rfc/rfc4518.txt


Collation Registry

Nearing approval

Adds ability to register comparisons

Workable for basic cases

http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-14.txt


IANA Charset registry

Currently limited usefulness

Ill-defined

Missing mapping tables

Incomplete

Inaccurate

Regime Change

Hope for future improvements!
