What’s New in Globalization?
Mark Davis
President & Cofounder
The Unicode Consortium
The Unicode Standard, Version 5.0
“Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years.”
- Donald E. Knuth
“For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users.”
- Bill Gates
“The path W3C follows to making text on the Web truly global is Unicode.”
- Sir Tim Berners-Lee, KBE
“Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world.”
- James Gosling
The Unicode Standard, Version 5.0
Obsoletes previous versions
Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few.
Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes
Systematic framework for improved text processing
Improvements to the Unicode Encoding Model for UTF-8, …
Rigorous stability of case folding and identifiers
Improved interoperability and backward compatibility
Enabling additional new ways to optimize code
U5.0 Unicode Character Database
Unicode: far more than a list of characters
Properties: key to how characters function
Changes in 5.0
Scripts: Unassigned code points → Zzzz
Casing Stability: Upper → folded
BIDI: Consistent Bidi_Mirrored
Now Normative: kIICore
Line Break: SE Asian → Complex_Context
New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties
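As a rough illustration of what "properties" means in practice, the sketch below queries a few of them with Python's standard `unicodedata` module. Two assumptions to keep in mind: that module bundles a newer Unicode Character Database than 5.0, and it does not expose the Script property, so the Zzzz change itself cannot be shown this way.

```python
import unicodedata

# General_Category: two-letter codes from the Unicode Character Database.
print(unicodedata.category("A"))        # an uppercase letter
print(unicodedata.category("\uFFFF"))   # a noncharacter, reported as unassigned

# Bidi_Mirrored: 1 if the glyph is mirrored in right-to-left text, else 0.
print(unicodedata.mirrored("("))
print(unicodedata.mirrored("A"))

# Bidi_Class: HEBREW LETTER ALEF is a strong right-to-left character.
print(unicodedata.bidirectional("\u05D0"))
```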
[Chart: allocation of the 1,114,112 Unicode code points in 5.0]
General: 99,089
Private Use: 137,468
Surrogate: 2,048
Noncharacter: 66
Reserved: 875,441
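The allocation figures can be cross-checked with a little arithmetic. The sketch below derives the surrogate, private-use, and noncharacter counts from their fixed code point ranges, and takes the general and reserved counts from the chart, read as 99,089 and 875,441. The constant names are made up for this example.

```python
# 17 planes of 65,536 code points each: U+0000..U+10FFFF.
TOTAL_CODE_POINTS = 0x110000

# Surrogates: U+D800..U+DFFF.
surrogates = 0xDFFF - 0xD800 + 1

# Private use: U+E000..U+F8FF in the BMP, plus planes 15 and 16
# (U+F0000..U+FFFFD and U+100000..U+10FFFD, which are the same size).
private_use = (0xF8FF - 0xE000 + 1) + 2 * (0xFFFFD - 0xF0000 + 1)

# Noncharacters: U+FDD0..U+FDEF, plus the last two code points of each plane.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17

# Counts specific to Unicode 5.0, as read from the chart (not derived here).
GENERAL_5_0 = 99_089
RESERVED_5_0 = 875_441

total = GENERAL_5_0 + private_use + surrogates + noncharacters + RESERVED_5_0
print(surrogates, private_use, noncharacters)
print(total == TOTAL_CODE_POINTS)
```

The five categories add up to exactly 0x110000, which is a useful sanity check on the chart's numbers.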
U5.0 Conformance
Stable Case-Folded ≈ Upper → Lower
Much clearer encoding / property model
Stable Approved Named Character Sequences
Bengali, Gurmukhi, Tamil changes
Combining grapheme joiner clarified
Disunification of Diacritics
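To make the case-folding item concrete, here is a minimal sketch of caseless matching using Python's built-in `str.casefold()`. It assumes whatever Unicode version the Python runtime bundles, not 5.0 specifically, and the helper name `caseless_equal` is invented for this example.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full Unicode case folding."""
    return a.casefold() == b.casefold()


# Case folding maps the German sharp s to "ss", so these match.
print(caseless_equal("Straße", "STRASSE"))

# Plain lowercasing leaves the sharp s alone, so these do not match.
print("Straße".lower() == "STRASSE".lower())
```

Folding rather than lowercasing is what makes caseless comparison behave predictably for characters whose case mappings are not one-to-one.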
5.0 Annexes: Core
UAX #9: Bidirectional Algorithm
Tightened conformance requirements
UAX #15: Unicode Normalization Forms
New Stream-Safe Text Format
Appendix of characters requiring special handling
Expanded info on stability guarantees
Additional detailed figures, guidelines
UAX #31: Identifier and Pattern Syntax
Added profiles & information on usage
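For readers who have not used the normalization forms from UAX #15, the sketch below shows the basic idea with Python's standard `unicodedata.normalize`. It does not implement the Stream-Safe Text Format, and it relies on the Unicode data bundled with the Python runtime rather than 5.0.

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE

# The two strings look identical but differ code point by code point.
print(decomposed == composed)

# NFC composes, NFD decomposes; after normalizing, the strings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# The compatibility forms also fold presentation variants such as ligatures.
print(unicodedata.normalize("NFKC", "\uFB01"))   # LATIN SMALL LIGATURE FI
```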
U5.0 Annexes: Boundaries
UAX #14: Line Breaking Properties
Rules modified to improve behavior
Now Normative (conformance clauses reorganized)
UAX #29: Text Boundaries
Edge cases improved
Tailorings for text boundaries now in Unicode CLDR
Format of the rules changed to ease implementation
Additional guidelines on regex, identifiers,…
U5.0 Characters by Script
Phags Pa
Phoenician
Devanagari
Hebrew
Greek
Kannada
Nko
Common
Latin
Inherited
Cyrillic
Cuneiform
Balinese
Unicode Character Timeline
[Chart: number of characters by Unicode version (2.0.0, 2.1.2, 3.0.0, 3.1.0, 3.2.0, 4.0.0, 4.1.0, 5.0.0), on a logarithmic axis from 1 to 1,000,000, broken down by category: Letter, Symbol, Mark, Number, Punctuation, Control/Format, Separator]
Unicode Guide for Programmers
Adjunct to Standard
Concise Guide for Software Globalization
Crucial Concepts
Key “Gotchas”: Recognize and Avoid
Details on
Encoding & conversions:
UTF-8, 16, 32 & BOM
Using character properties
Text Operations
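As a small sketch of the encoding topics listed above, the following Python snippet encodes one string in UTF-8, UTF-16, and UTF-32 and shows how a UTF-8 byte order mark (BOM) can be handled. The sample string is an arbitrary choice for illustration.

```python
import codecs

# One ASCII letter, one BMP character (EURO SIGN), and one supplementary
# character (MUSICAL SYMBOL G CLEF): three code points in total.
text = "A\u20AC\U0001D11E"

utf8 = text.encode("utf-8")        # 1 + 3 + 4 bytes
utf16 = text.encode("utf-16-le")   # 2 + 2 + 4 bytes (surrogate pair for the clef)
utf32 = text.encode("utf-32-le")   # 4 bytes per code point
print(len(text), len(utf8), len(utf16), len(utf32))

# A UTF-8 BOM is the three bytes EF BB BF at the start of the data.
with_bom = codecs.BOM_UTF8 + utf8

# The "utf-8-sig" codec strips a leading BOM; plain "utf-8" keeps it as U+FEFF.
print(with_bom.decode("utf-8-sig") == text)
print(with_bom.decode("utf-8") == "\uFEFF" + text)
```

Naming the byte order explicitly ("-le" or "-be") avoids the BOM that Python's generic "utf-16" and "utf-32" codecs prepend when encoding.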
Unicode Common Locale Data Repository: CLDR
Key locale data for world languages
Most extensive standard repository of locale data
XML format
Δευτέρα, 05 Σεπτεμβρίου 2005
Montag, 5. September 2005
¥ 1,234.57
1 234,57 руб.
Arabic – arabski
Bulgarian – bułgarski
Czech – czeski
…
Africa – 非洲
Central America – 中美洲
Eastern Africa – 东非
Northern Africa – 北非
…
AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…
Z < Å
Unicode CLDR 1.4
121 languages and 142 territories – 360 locales in all
25% more locale data; over 17,000 new/modified items
Repository separated into language vs locale data
Language-specific segmentation (word/line breaks…)
Transliterations (eg Ελληνικά ↔ Ellēniká)
Data for lenient date/time formatting and parsing
Programmer asks for “numeric day” + “abbreviated month”
Best format pattern returned, eg “dd.MMM”
+ Quarters in dates (eg 2006Q1)
BCP 47 compatibility + extensions
BCP 47 Language Tags
Usage: HTTP, HTML, XML; CLDR Locale IDs…
RFC 4646; Obsoletes RFCs 1766, 3066
Addresses problems in RFC 3066
ISO standards: stability / accessibility / ambiguity
Parseability, Extensibility; Registration speed
Identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin script (sr-Latn), Azerbaijani in Cyrillic script (az-Cyrl), etc.
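To show how the script subtag fits into a tag, here is a deliberately simplified sketch of a language tag splitter. It handles only the language, script, and region subtags. Extended language subtags, variants, extensions, private use, and grandfathered tags from RFC 4646 are not interpreted. The function name and the returned dictionary layout are assumptions made for this example.

```python
def parse_language_tag(tag: str) -> dict:
    """Split a tag into language, optional script, and optional region.

    Simplified: anything after the region is returned untouched in "rest".
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    rest = parts[1:]

    # Script subtags are exactly four ASCII letters, conventionally title case.
    if rest and len(rest[0]) == 4 and rest[0].isascii() and rest[0].isalpha():
        result["script"] = rest.pop(0).title()

    # Region subtags are two ASCII letters or three digits.
    if rest and rest[0].isascii() and (
        (len(rest[0]) == 2 and rest[0].isalpha())
        or (len(rest[0]) == 3 and rest[0].isdigit())
    ):
        result["region"] = rest.pop(0).upper()

    result["rest"] = rest
    return result


for example in ("zh-Hant", "sr-Latn", "az-Cyrl", "de-DE", "sr-latn-rs"):
    print(example, parse_language_tag(example))
```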
Unicode Security
Examples: Visual Confusables: “paypal.com” with Cyrillic ‘a’…
Non visual problems: buffer overflows, non-shortest form,…
UTR #36: Unicode Security Considerations
Guidelines & Recommendations
UTS #39: Unicode Security Mechanisms
Algorithms & Data
Limitations on Repertoire
Testing for Confusables
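The "paypal.com" example can be illustrated with a crude mixed-script check. This is only a heuristic sketch, not the confusable-detection algorithm or data from UTS #39: Python's standard library does not expose the Script property, so the code assumes the first word of each character name is a usable stand-in for the script. The function name `letter_scripts` is invented here.

```python
import unicodedata


def letter_scripts(text: str) -> set:
    """Return the set of script-like name prefixes for the letters in text."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER P" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


genuine = "paypal.com"
spoof = "p\u0430yp\u0430l.com"   # U+0430 CYRILLIC SMALL LETTER A replaces both "a"s

print(letter_scripts(genuine))
print(letter_scripts(spoof))
print(len(letter_scripts(spoof)) > 1)   # more than one script: worth flagging
```

A real implementation would use the Script property and the confusables data directly, since same-script lookalikes and legitimately mixed-script text both defeat this shortcut.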
Internationalized Domain Names
One instance of a broad problem
Many RFCs use Nameprep – limited to Unicode 3.2
Unicode recommendations
Narrow the repertoire: exclude symbols, punctuation
Expand the coverage: currently only Unicode 3.2.
IETF idn-nextsteps published
Some positive developments, but misreads Unicode, needs more work
URL → IRI
International Resource Identifier (IRI)
UTF-8, %-escaped
Example:
http://w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html
http://w3.org/International/articles/idn-and-iri/JP%E7%B4%8D... %E8%B1%86.html
See http://ietf.org/rfc/rfc3987.txt
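The "UTF-8, %-escaped" step can be sketched with Python's standard `urllib.parse.quote`, which encodes non-ASCII characters as UTF-8 and percent-escapes each byte. Only the path is converted here. Mapping a non-ASCII host name is a separate IDN step that this sketch does not cover.

```python
from urllib.parse import quote, unquote

# Path portion of the IRI from the example.
iri_path = "/International/articles/idn-and-iri/JP納豆/引き割り納豆.html"

# quote() uses UTF-8 by default and leaves "/" unescaped.
uri_path = quote(iri_path)
print("http://w3.org" + uri_path)

# The first two ideographs become the escapes seen in the example.
print(quote("納豆"))

# The conversion is reversible.
print(unquote(uri_path) == iri_path)
```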
Ideographic Variation Database
U+82A6 ashi: multiple forms
The first occurrence – any glyph
Second occurrence is in the name of the town Ashiya – customarily displayed with form #4
Registration for variants
Ideographic Variation Database
Variation Selector
Identifies a restriction on the appearance of a character
Character + Variation Selector = Variation Sequence
Han ideographs
Impossible to build a single collection for everyone: requirements from scholars, governments and publishers…
Instead, registration of multiple independent collections
Unicode Ideographic Variation Database
A given variation sequence is used in at most one collection
Makes interchange of variation sequences reliable.
Registration, not Assessment
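At the code point level, a variation sequence is simply the base character followed by a variation selector. The sketch below builds one for U+82A6. The choice of VARIATION SELECTOR-17 is an assumption for illustration only: which selector corresponds to which glyph form is determined by the registered collection, not by this code. The helper `strip_variation_selectors` is likewise hypothetical.

```python
import unicodedata

BASE = "\u82A6"          # the ideograph "ashi" from the example
VS17 = "\U000E0100"      # first selector in the supplementary range

sequence = BASE + VS17   # Character + Variation Selector = Variation Sequence
print(len(sequence))     # two code points, displayed as one character
print(unicodedata.name(BASE))
print(unicodedata.name(VS17))


def strip_variation_selectors(text: str) -> str:
    """Drop variation selectors, e.g. before a comparison that ignores glyph form."""
    return "".join(
        ch for ch in text
        if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF)
    )


print(strip_variation_selectors(sequence) == BASE)
```

A renderer that does not know the sequence can ignore the selector and show any glyph for the base character, which is why the text stays legible when a collection is not supported.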
ICU 3.6
Mature, portable C/C++/Java int’l libraries
Unicode 5.0, UCA 5.0, CLDR 1.4
ICU4C
Charset Detection
Improved: Time Zones, Thai word break, UText (64 bit), Performance, Data Management,…
ICU4J Globalization Preferences
Flexible date/time formats*, Charset conversion*
Near-Term Issues
Unicode 5.0.1, Unicode 5.1
CLDR / BCP 47bis
LDAP
Collation Registry
IANA Charset Registry
Unicode 5.1 - possibilities
Characters
CJK Unified Ideographs Extension C
Minority Scripts: Cham and Lanna
Malayalam chillu
…
Properties/Behavior
Normalization process for stable strings
…
CLDR 1.5 / BCP 47bis
CLDR 1.5
Data Submission Starting November
New structures / data
BCP 47
Adding ~7,000 (!) new language subtags
Possibly other changes…
LDAP
Now has definitive comparison
(good)
Stuck at Unicode 3.2
(bad)
http://www.ietf.org/rfc/rfc4518.txt
Collation Registry
Nearing approval
Adds ability to register comparisons
Workable for basic cases
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-14.txt
IANA Charset registry
Currently limited usefulness
Ill-defined
Missing mapping tables
Incomplete
Inaccurate
Regime Change
Hope for future improvements!