New in Unicode Mark Davis, John Jenkins

New in Unicode

Mark Davis, John Jenkins

Agenda

Unicode 4.1.0
UCA 4.1.0
Regular Expressions
Security Considerations
Character Mapping
Common Locale Data Repository
Expanded Role for Consortium

Unicode 4.1.0

Released 2005 March 31
New Characters
New Unicode Character Database
New Specifications

1,273 New Characters

Roundtripping for HKSCS and GB 18030

Five new currency signs

Additional characters for Indic and Korean

Eight new scripts

Changes in the Standard

Conformance Changes
  Modifications to Default Case Operations
  Clarification of Decomposition Mappings

Other Changes
  SPACE not recommended as base for nonspacing marks
  Use of CGJ to prevent reordering, prevent contractions in sorting/matching (UCA)
  Positioning of Meteg
  Rendering of Thai Combining Marks
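
The CGJ point can be illustrated with a small sketch. It assumes Python's standard `unicodedata` module as a stand-in for a normalization implementation; the base letter and the two dots are arbitrary choices made for this illustration, not taken from the slides. Canonical ordering sorts adjacent nonspacing marks by combining class, and a COMBINING GRAPHEME JOINER (class 0) placed between them stops that reordering.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, canonical combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220
CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, canonical combining class 0


def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def code_points(text: str) -> list[str]:
    """Show a string as a list of hexadecimal code points."""
    return [f"{ord(ch):04X}" for ch in text]


# Without CGJ, the lower combining class (dot below) is moved first.
reordered = nfd("q" + DOT_ABOVE + DOT_BELOW)

# With CGJ in between, the original order of the marks is preserved.
protected = nfd("q" + DOT_ABOVE + CGJ + DOT_BELOW)

print(code_points(reordered))
print(code_points(protected))
```

The same blocking behavior is what lets CGJ keep two characters from being treated as a contraction in collation, although that part is not shown here.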

Unicode Character Database

Determines the behavior of characters in modern software:
  Alphabetics, Letters, Numbers, Identifiers, Scripts, …

New properties
  Grapheme_Cluster_Break, Sentence_Break, Word_Break, Pattern_Syntax, and Pattern_White_Space

Revised Property Values
  E.g. Alphabetic ⊃ (Lowercase ∪ Uppercase)

Expanded documentation

Each release now complete, not delta
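
As a rough illustration of how software consults the Unicode Character Database, the sketch below reads a few properties through Python's standard `unicodedata` module. The helper name `describe` and the sample characters are assumptions made for this example. The module reflects whatever UCD version ships with the Python build, and it does not expose the newer segmentation properties such as Word_Break.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD-derived properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),    # General_Category
        "combining": unicodedata.combining(ch),  # Canonical_Combining_Class
    }


# A Latin capital, a Latin small letter, an ASCII digit,
# a Bengali digit, and a combining diaeresis.
for ch in "Aa5\u09ea\u0308":
    print(describe(ch))
```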

New Specifications

UAX #31: Identifier and Pattern Syntax
  Basis for Backwards-Compatible Identifiers
    Programming Languages
    Resources and Services
  Basis for Stable Syntax characters
    Whitespace
    Operators

UAX #34: Unicode Named Character Sequences
  Mechanism for identifying/naming significant sequences
  Standardized list
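
One programming language that bases its identifier syntax on UAX #31 is Python 3, so its built-in `str.isidentifier()` gives a quick way to see the idea. This is only a sketch of one language's rules (Python adds its own small modifications); the sample strings and the helper name `classify` are chosen for illustration.

```python
def classify(candidates: list[str]) -> dict[str, bool]:
    """Map each candidate string to whether Python accepts it as an identifier."""
    return {text: text.isidentifier() for text in candidates}


samples = [
    "count",       # plain ASCII identifier
    "naïve_count", # non-ASCII letter inside an identifier
    "мир",         # identifier written entirely in Cyrillic
    "x\u09ea",     # a digit may continue an identifier...
    "\u09eax",     # ...but may not start one
    "a-b",         # hyphen is syntax, not an identifier character
]

for text, ok in classify(samples).items():
    print(f"{text!r}: {ok}")
```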

Major Revisions in Annexes

UAX #15: Unicode Normalization Forms
  Correction for Idempotency Problem
  Enhanced discussion of Hangul

UAX #14: Line Breaking Properties
  Modifications for Hangul
  Changes because SPACE not recommended as base for nonspacing marks
  Separated all suggested tailorings into separate section

UAX #29: Text Boundaries
  Using new properties, adding Joiner/Non-Joiner
  Modifications to Word-Break
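
A minimal sketch of what the normalization forms in UAX #15 do, assuming Python's standard `unicodedata` module. It shows composition and decomposition of a Latin letter, the algorithmic decomposition of a Hangul syllable, and the idempotency property that the correction is meant to guarantee (normalizing an already normalized string changes nothing). The sample characters are arbitrary.

```python
import unicodedata


def nfc(text: str) -> str:
    """Canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)


def nfd(text: str) -> str:
    """Canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", text)


precomposed = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"  # a + COMBINING DIAERESIS

# The two spellings are canonically equivalent.
print(nfc(decomposed) == precomposed)
print(nfd(precomposed) == decomposed)

# A Hangul syllable decomposes into its conjoining jamo.
syllable = "\ud55c"  # HANGUL SYLLABLE HAN
jamo = nfd(syllable)
print([f"{ord(ch):04X}" for ch in jamo])

# Idempotency: applying a form twice gives the same result as applying it once.
print(nfc(nfc(decomposed)) == nfc(decomposed))
```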

UTS #10: Unicode Collation Algorithm

Basis for language-sensitive sorting, searching, and matching

Synchronized with Unicode 4.1.0

New:
  Characters
  Revised Weights
  Specification: matching, ignorables, Thai, …

UTS #18: Unicode Regular Expressions

Regular expressions are used widely in programs for matching patterns (e.g. wildcards)

Unicode expands the scope drastically

Explicit Conformance Clauses

POSIX-Conformance
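
To give a feel for how Unicode widens the scope of familiar regex classes, here is a small sketch using Python's standard `re` module. This is not a claim that `re` conforms to UTS #18 (it does not support property syntax such as `\p{...}`); it only shows that `\w` and `\d` already reach beyond ASCII unless told otherwise. The sample text is made up for this example.

```python
import re

# Latin word, Cyrillic word, ASCII digits, Bengali digits.
text = "Montag мир 05 \u09ea\u09e8"

# By default, \w and \d are Unicode-aware for str patterns.
words = re.findall(r"\w+", text)
digits = re.findall(r"\d+", text)

# With re.ASCII the same patterns only see the ASCII repertoire.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

print(words)
print(digits)
print(ascii_words)
print(ascii_digits)
```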

UTR #36: Unicode Security

Incorrect usage of Unicode can expose programs or systems to possible security attacks!

Examples:

Numbers: ৪୨ = 42!
  Bengali { ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯ }, Oriya { ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯ }

Domain Names:

      String     UTF-16                                     Internal - IDNA
  1a  ät.com    0061 0308 0074 002E 0063 006F 006D         xn--t-zfa.com
  1b  ät.com     00E4 0074 002E 0063 006F 006D              xn--t-zfa.com
  2a  tοp.com    0074 03BF 0070 002E 0063 006F 006D         xn--tp-jbc.com
  2b  top.com    0074 006F 0070 002E 0063 006F 006D         top.com
  4a  so̷s.com   0073 006F 0337 0073 002E 0063 006F 006D    xn--sos-rjc.com
  4b  søs.com    0073 00F8 0073 002E 0063 006F 006D         xn--ss-lka.com
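
Both examples can be reproduced with a short sketch, assuming Python's built-in `int()` and its standard `idna` codec (which implements the original IDNA 2003 rules). The helper name `to_ace` is hypothetical. The first part shows mixed-script digits silently parsing as 42; the second shows two visually identical spellings collapsing to one encoded name, while a Greek omicron produces a different name from the Latin lookalike.

```python
# Mixed-script digits: BENGALI DIGIT FOUR followed by ORIYA DIGIT TWO.
mixed_digits = "\u09ea\u0b68"
value = int(mixed_digits)  # int() accepts any decimal digits, even mixed scripts
print(value)


def to_ace(domain: str) -> str:
    """Convert a domain name to its ASCII-compatible (xn--) form."""
    return domain.encode("idna").decode("ascii")


# Rows 1a/1b: decomposed and precomposed spellings map to the same name.
decomposed = to_ace("a\u0308t.com")
precomposed = to_ace("\u00e4t.com")

# Rows 2a/2b: Greek omicron versus Latin o look alike but encode differently.
greek = to_ace("t\u03bfp.com")
latin = to_ace("top.com")

print(decomposed, precomposed)
print(greek, latin)
```

A program that displays the Unicode form but compares the encoded form (or the reverse) can be misled in either direction, which is why the slide pairs the two columns.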

Character Mapping ML

XML format for the interchange of mapping data for character encodings and aliases.

Promoted to Unicode Technical Standard; with new Conformance section (2).

Added explicit text about multi-character mappings.

Common Locale Data Repository

Common, necessary software locale data for world languages

XML format for effective interchange

Δευτέρα, 05 Σεπτεμβρίου 2005

Montag, 5. September 2005

¥ 1,234.57

1 234,57руб.

Arabic – arabski
Bulgarian – bułgarski
Czech – czeski
…

Africa – 非洲
Central America – 中美洲
Eastern Africa – 东非
Northern Africa – 北非
…

AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…

Z < Å

Typical Locale Data

Dates/time formats
Number/Currency formats
Measurement Systems
Collation Specifications (UCA-based)
  Used for sorting, searching, matching
Tailorings of translated names for language, territory, script, timezones, currencies, …

...
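
To make the "Number/Currency formats" item concrete, the sketch below formats one value with two different separator conventions. The `SEPARATORS` table is hypothetical and hand-written for illustration only; a real application would read such values from the CLDR XML data (or a library built on it) rather than hard-coding them.

```python
# Hypothetical, hand-written separator table; NOT actual CLDR data files.
SEPARATORS = {
    "en": {"group": ",", "decimal": "."},
    "ru": {"group": "\u00a0", "decimal": ","},  # no-break space as group separator
}


def format_number(value: float, locale: str, digits: int = 2) -> str:
    """Format value with the grouping and decimal separators of the given locale."""
    sep = SEPARATORS[locale]
    # Start from the "1,234.57" shape, then swap both separators in one pass
    # so the comma and the period cannot overwrite each other.
    text = f"{value:,.{digits}f}"
    table = str.maketrans({",": sep["group"], ".": sep["decimal"]})
    return text.translate(table)


print(format_number(1234.57, "en"))
print(format_number(1234.57, "ru"))
```

Keeping the separators in data rather than in code is the design point: adding a locale means adding a table entry, not another branch.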

Latest Release: CLDR 1.3

296 locales: 96 languages, 130 territories
  Languages: Afar [Qafar]; Afrikaans; Albanian [shqipe]; Amharic [አማርኛ]; Arabic [العربية]; Armenian [Հայերէն]; …
  Territories: Afghanistan [افغانستان]; Albania [Shqipëria]; Algeria [الجزائر]; Argentina; Armenia [Հայաստանի Հանրապետութիւն]; Australia; Austria [Österreich]; Azerbaijan [Azərbaycan, Азәрбајҹан]; …

Complete set of generated POSIX-format data
  Plus tool to generate versions tuned for different platforms.

Expanded locale data
  Timezone localizations
  Including UN M.49 continents and regions
  Many other revisions and additions of data

New Tests & Tools

Expanded Role for Consortium

Dedicated to the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.

Providing the fundamental specifications for full software globalization, full interoperability

Institutional & Supporting Members (New Membership Categories)

Liaison Members

Center of Computer and Information Development (CCID), Beijing, China

High Council of Informatics (HCI), Iran

Information and Communication Technology Agency of Sri Lanka (ICTA)

The International Forum for Information Technology in Tamil (INFITT)

The Internet Engineering Task Force (IETF)

ISO/IEC JTC1/SC2 and WG2

Linguistic Society of America (LSA)

National Endowment for the Humanities (NEH)

National Information Standards Organization (NISO)

NSAI/ICTSCC/SC4: Irish standardization: Codes, Character Sets, and Internationalization

OpenI18N.org: The Free Standards Group Open Internationalization Initiative

Research Institute for ILCAA, Tokyo University of Foreign Studies

Research Institute for the Languages of Finland (RILF)

Special Libraries Association (SLA)

Technical Committee on Information Technology (TCVN/TC1), Hanoi, Viet Nam

United Nations Group of Experts on Geographical Names (UNGEGN)

World Wide Web Consortium - W3C I18N Core Working Group

Unicode Technical Committee

Multiple Globalization Standards
  The Unicode Standard, including UAXes
  Unicode Technical Standards: Collation, …
  Unicode Technical Notes: Best Practices, Background Information

Quarterly F2F Meetings

Email Discussion

CLDR Technical Committee

Meetings
  Short, frequent: Telecon + Instant Messaging

Email Discussion

Data
  All additions / revisions in bug database
  Anyone can file; committee assesses, vets

Why Join?

Support the technology
  That enables your success in international, technical, and emerging markets.

Protect your investment
  The stability you need
  The extensions you require
  The developments you call for: security, …

Demonstrate your leadership
  For the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.