21st International Unicode Conference, Dublin, Ireland, May 2002
ISO/IEC 10646 & The Unicode® Standard
Mike Ksar, Senior Program Manager
International Standards Strategy, Microsoft Corporation
JTC1/SC2/WG2 Convener
Screenplay by Asmus Freytag
2 20th International Unicode Conference Washington, DC, January 2002
Outline
• Background
• Relation between Unicode and ISO/IEC 10646
    • What is the same
    • What is different
    • What is being merged
• Synchronization
    • Shared Process and Policies
    • Aligned Program of Work
    • Common publication resources
• Beyond character coding
    • Character properties & Collation
    • Internationalization
    • Products and Standards
• Summary
The Internet
• The Internet pushes the envelope on internationalization
• Users have easy access to documents worldwide, in any character set
• Servers can be accessed by users from anywhere, speaking any language
• Software can no longer be targeted to a single national market
• The need for a single character set standard was never greater. Why do we have two?
Common Charter
Develop a standard of graphic character repertoire and coding for an international graphic character set ... of the written form of the languages of the world.
Organizations
[Diagram: organization chart; the connecting lines are labeled "Member" and "Liaison"]
• ISO/IEC
    • JTC 1: Information Technology
        • SC 2: Codes and Character Sets
            • WG 2: ISO/IEC 10646
            • IRG: Ideographic Rapporteur Group
        • SC 22: Programming Languages..
            • WG 20: Internationalization
• NB: ANSI (US) … and other National Bodies
    • INCITS: Information Technology
        • L2: Codes, Character Sets, and Internationalization
• The Unicode Consortium
    • UTC: Unicode Technical Committee
        • Bidi and other subcommittees
ISO Framework
Basis for other standards: ISO, JTC1, ECMA, IETF, CEN/TC304 & W3C
Well established and recognized ISO development process of standardization
Worldwide expertise through national standards bodies, industry and liaison organizations
Identified as one of the standards for procurement requirements by major organizations and agencies
Unicode Framework
• Consortium with open membership
• Industry backing
• Direct support from key implementers
• Open to academic and user input
• Cooperation with ISO, JTC1, ECMA, IETF, CEN/TC304 & W3C
• Unicode Technical Committee (UTC)
Development Path
[Diagram: ASCII / ISO 646, ISO 8859-x (Part-1, Part-2, ...), industry 8-bit codes (Windows, IBM, other) and national/industry multibyte codes lead to one universal code: 10646 & Unicode]
Sources of Characters
• International standards
    • JTC1/SC2 coded character sets
    • JTC1/SC18 text formatting and presentation
    • ISO TC46 bibliographic community
• National standards and committees
    • China (GB2312), Japan (JIS 208), Korea (KSC 5601) and many others
• Widely supported vendor character sets
• Regional standards committees
    • ASMO, ECMA ATG & Bidi & SC2/WG2/IRG
• Liaison organizations:
    • Unicode, Inc., ECMA, ITU-TS, AFII, TCA, W3C, CEN/TC304 and others
• User communities
    • STIX
ISO/IEC 10646 Milestones
• 1984: ISO starts developing
• 1991: Convergence with Unicode
• 1993: ISO/IEC 10646-1, First edition
    • Architecture & Basic Multilingual Plane
    • Equivalent to Unicode 1.1
• 1998: ISO/IEC TR 15285, An operational model for characters and glyphs
• 1995 – 1999: Technical amendments
    • UTF-8, UTF-16, Korean, Tibetan, Braille, etc.
    • Unicode 2.0 is equivalent through amendment 7
• 2000: ISO/IEC 10646-1, Second edition
    • 3 technical corrigenda
    • 31 amendments since 10646-1: 1993 first edition
    • Equivalent to Unicode 3.0
• 2001: ISO/IEC 10646-2 for Planes 1, 2 & 14
    • Unicode 3.1 includes repertoires of both 10646-1 and 10646-2 plus two additional characters
• 2002: Amd-1 to part 1
    • Equivalent to Unicode 3.2
Unicode: 14 Years (1988-2002)
• 1988: First use of name Unicode
• 1991: Unicode Consortium founded
• 1991: Unicode, Version 1
• 1991: First Implementers' Workshop
• 1991: Convergence with ISO/IEC 10646
    • Liaison to ISO/IEC 10646 Working Group
• 1992: First Unicode Technical Reports
• 1993: Unicode, Version 1.1
• 1996: Version 2.0 published
• 2000: Version 3.0 published
    • Dramatic increase in number and scope of Unicode-based implementations
• 2001: Version 3.1 published
• 2002: Version 3.2
• 2002: 20th International Unicode Conference
Code Space & Structure
[Diagram: stack of planes, Plane 00 (BMP), Plane 01, Plane 02, . . ., Plane 14, Plane 15 (Private Use), Plane 16 (Private Use)]
ISO/IEC 10646 Parts 1 and 2
• Only use code space in planes 0 to 16
• Define characters only in planes 0 (BMP), 1, 2 & 14 so far
• Reserve planes 15, 16 for private use
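The plane rules on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `plane_of` and `plane_status` are made up for this example, and the status strings reflect the slide's snapshot (10646 Parts 1 and 2), not any later edition.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside planes 0 to 16")
    return code_point >> 16  # each plane holds 0x10000 code positions


def plane_status(code_point: int) -> str:
    """Classify the plane of a code point as described on the slide."""
    plane = plane_of(code_point)
    if plane in (15, 16):
        return "private use"
    if plane in (0, 1, 2, 14):
        return "characters defined"
    return "no characters defined so far"
```

For example, `plane_of(0x1D11E)` gives 1, and any code point from 0xF0000 upward is reported as private use.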
A Plane in 10646
[Diagram: a plane (16 bits) divided into rows and cells; 65,536 characters]
• A plane is the basic division of code-space in ISO/IEC 10646
• The first plane (Plane 0) is the Basic Multilingual Plane (BMP)
• Unicode 3.1 matches planes 0-16
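As a rough illustration of the plane / row / cell structure, the sketch below splits a code point into those three parts, treating the row as the upper 8 bits and the cell as the lower 8 bits of the 16-bit position within a plane. The `Position` type and `split_code_point` name are assumptions made for this example.

```python
from typing import NamedTuple


class Position(NamedTuple):
    plane: int  # which 65,536-position plane
    row: int    # upper 8 bits of the position within the plane
    cell: int   # lower 8 bits of the position within the plane


def split_code_point(code_point: int) -> Position:
    """Split a code point into its plane, row and cell numbers."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside planes 0 to 16")
    return Position(code_point >> 16, (code_point >> 8) & 0xFF, code_point & 0xFF)
```

With 256 rows of 256 cells, each plane has 256 * 256 = 65,536 positions, which matches the figure on the slide.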
Basic Multilingual Plane
[Diagram: layout of the BMP, with these labeled areas]
• C0 Controls, C1 Controls
• Alphabets, Symbols, CJK Auxiliary, Hangul, . . .
• Unified Chinese, Japanese, Korean Ideographs
• Reserved for accessing code points outside BMP (2048)
• Private Use (6K), Compatibility Area, Arabic Presentation Forms, . . . (8190)
Adopted Form
• ISO/IEC 10646 is a 16-bit or 32-bit code
    • UCS-2: for accessing code points in BMP, 2 bytes (16 bits)
    • UCS-4: canonical form for accessing any code point using 4 bytes (32 bits)
• Transformation formats
    • UTF-8: for use in 8-bit environments (e.g. HTML, XML) (variable length code, 1 to 6 bytes/character)
    • UTF-16: for use with UCS-2 to access sixteen additional planes beyond the BMP
• Note: Unicode 3.2 supports UTF-8, UTF-16 and UTF-32. UTF-32 is equivalent to UCS-4, with an upper limit of 10FFFF (hexadecimal).
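To make the encoding forms concrete, here is a small Python sketch that encodes one Plane 1 character in all three forms and computes its UTF-16 surrogate pair by hand. The helper names `to_surrogate_pair` and `encoded_forms` are illustrative, not taken from either standard. Note that within the 0 to 10FFFF range a UTF-8 sequence is at most 4 bytes long; the 5- and 6-byte forms would only be needed for the larger UCS-4 code space.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point from planes 1 to 16 into two 16-bit UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def encoded_forms(ch: str) -> dict:
    """Return the big-endian byte sequence of one character in each encoding form."""
    return {
        "UTF-8": ch.encode("utf-8"),
        "UTF-16": ch.encode("utf-16-be"),
        "UTF-32": ch.encode("utf-32-be"),
    }


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a Plane 1 character
forms = encoded_forms(clef)
```

Here `forms["UTF-16"]` holds the same two code units that `to_surrogate_pair(0x1D11E)` computes, and `forms["UTF-32"]` is simply the code point as a 4-byte integer.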
Implementation Levels
• Implementation level for combining sequences
    • Level 1: only precomposed characters
    • Level 2: restricted combining sequences
    • Level 3: unrestricted combining sequences
• Unicode has no formal restrictions on combining sequences
    • An implementation may choose to support a subset of characters which does not contain any or all combining characters
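The difference between a precomposed character and a combining sequence can be seen with Python's standard `unicodedata` module. The sketch below is only an approximation of a level 1 check: it treats any character whose general category starts with "M" (a mark) as combining, which is an assumption made for this example and not the standard's own list of combining characters.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
sequence = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT, two code points


def has_combining_marks(text: str) -> bool:
    """Return True if any character in the text is a mark (general category M*)."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)


# Normalization converts between the two representations of the same text.
composed = unicodedata.normalize("NFC", sequence)
decomposed = unicodedata.normalize("NFD", precomposed)
```

Text for which `has_combining_marks` returns False uses only characters that a level 1 style implementation, under the assumption above, would accept.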
Collections for Subsets
• The Unicode declared subset is the whole of the BMP plus planes 1-16 accessible through UTF-16
Collections of coded graphic characters
The collections listed below are ordered by collection number. An * in the “positions” column indicates that the collection is a fixed collection.
Collection number and name Positions
1 BASIC LATIN 0020 - 007E *
2 LATIN-1 SUPPLEMENT 00A0 - 00FF *
3 LATIN EXTENDED-A 0100 - 017F *
4 LATIN EXTENDED-B 0180 - 024F
5 IPA EXTENSIONS 0250 - 02AF
6 SPACING MODIFIER LETTERS 02B0 - 02FF
Etc.
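A subset built from collections amounts to a table of code position ranges. The sketch below hard-codes only the six collections listed above and looks up which one a character belongs to; the `COLLECTIONS` table layout and the `collection_of` name are illustrative choices, and a real implementation would need the full list from the standard.

```python
from typing import Optional

# Collection number -> (name, first position, last position), from the table above.
COLLECTIONS = {
    1: ("BASIC LATIN", 0x0020, 0x007E),
    2: ("LATIN-1 SUPPLEMENT", 0x00A0, 0x00FF),
    3: ("LATIN EXTENDED-A", 0x0100, 0x017F),
    4: ("LATIN EXTENDED-B", 0x0180, 0x024F),
    5: ("IPA EXTENSIONS", 0x0250, 0x02AF),
    6: ("SPACING MODIFIER LETTERS", 0x02B0, 0x02FF),
}


def collection_of(ch: str) -> Optional[int]:
    """Return the number of the listed collection containing ch, or None."""
    code_point = ord(ch)
    for number, (_name, first, last) in COLLECTIONS.items():
        if first <= code_point <= last:
            return number
    return None
```

A subset check is then a matter of testing that every character of a text maps to one of the collection numbers an implementation claims to support.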
Unicode
• Implements BMP plus next 16 planes
• Three encoding forms
    • UTF-8
    • UTF-16
    • UTF-32 (0 to 10FFFF)
• Implementation level 3
• No subsets
• Unicode encourages transparency so that implementations can at least retransmit every character undamaged, but the level of support is otherwise explicitly left to the implementation
Unicode - 10646 Relationship
• ISO/IEC 10646 is a character encoding standard
• Unicode is code-for-code compatible with ISO/IEC 10646
• Unicode defines additional specifications about behavior and use of characters such as bidi algorithm, ordering, mappings, equivalence algorithm and other semantics
• Conformant implementations of Unicode are conformant implementations of ISO/IEC 10646
Unicode: Beyond 10646
In addition to character codes Unicode specifies:
• Behavior and use of characters
• A complete bidi algorithm
• An equivalence algorithm
• Normalization
• Additional character properties and semantics for spacing, zero-width space, combining characters, numeric, case and casing, directionality, letters, math operators etc.
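Several of the properties named here can be inspected through Python's standard `unicodedata` module, which ships a copy of the Unicode Character Database. The sketch below is one possible way to look them up; the `describe` helper is hypothetical, and the values returned depend on the Unicode version bundled with the Python interpreter, not on Unicode 3.2.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few character properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),           # e.g. Lu = uppercase letter
        "directionality": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),   # 0 for non-combining characters
        "numeric": unicodedata.numeric(ch, None),       # None if no numeric value
        "lowercase": ch.lower(),
    }


latin_a = describe("A")
hebrew_alef = describe("\u05D0")
acute = describe("\u0301")
one_half = describe("\u00BD")
```

For instance, `hebrew_alef["directionality"]` is the right-to-left class that the bidi algorithm consumes, and `one_half["numeric"]` is the numeric value of the fraction character.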
Unicode: Beyond 10646 (Cont.)
• Which combining marks are non-spacing marks
• Order and use of double-diacritic non-spacing marks
• A mapping for compatibility characters
• Default shaping behavior of cursive scripts
• Default mapping tables for conversion to and from other character set standards
• Rendering for Indic characters
• Line breaking
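The mapping for compatibility characters can be illustrated with the standard `unicodedata` module. In this sketch a character counts as a compatibility character if its decomposition carries a formatting tag such as `<compat>`; the helper name `is_compatibility_character` is made up for this example.

```python
import unicodedata


def is_compatibility_character(ch: str) -> bool:
    """Return True if ch has a compatibility (tagged) decomposition mapping."""
    # Canonical decompositions have no tag; compatibility ones start with "<tag>".
    return unicodedata.decomposition(ch).startswith("<")


ligature = "\uFB01"        # LATIN SMALL LIGATURE FI
circled_one = "\u2460"     # CIRCLED DIGIT ONE

# Compatibility normalization (NFKC) applies the mapping; NFC leaves it alone.
mapped = unicodedata.normalize("NFKC", ligature)
unmapped = unicodedata.normalize("NFC", ligature)
```

After the mapping, `mapped` is the plain two-letter string "fi", while `unmapped` is still the single ligature character.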
Continued Cooperation
• Architecture changes:
    • UTF-32 (Proposed Amendment)
    • Restricts UCS-4 to planes 0 to 16
• Future editorial and technical corrigenda to second edition of ISO/IEC 10646-1: 2000 (will be part of Unicode 3.2)
• Repertoire extensions (included in Unicode 3.2)
    • ISO/IEC 10646-2 (planes 1, 2 & 14)
        • Plane 1, mathematics, hieroglyphs, music symbols, etc.
        • Plane 2, CJKV ideographic extensions
        • Plane 14, language tags
• Support current and future implementers
    • Increase awareness and provide technical help
• Continued synchronization of future editions of ISO/IEC 10646 and the Unicode Standard
Going in the Same Direction
• One standard
    • No dialects
    • Common usage
• Common Encoding Forms
    • UTF-8
    • UTF-16
    • UTF-32/UCS-4
• Cooperation with ISO
    • Examples: UTF-8, UTF-16, UTF-32, EURO, collation, tags
• Incorporation into other standards
    • IETF
    • WWW Consortium (W3C)
• Shared expertise for lesser-used and obscure scripts
WG2 Program of Work
• 1st Amendment 10646-1: 2000, March 2002
• 2nd Amendment 10646-1: 2000, December 2002
• 1st Amendment 10646-2: 2001, 2003
• WG2 future meetings:
    • Meeting 42 – Dublin, Ireland, May 2002
    • Meeting 43 – Tokyo, Japan, December 2002
Collation & Character Properties
• ISO/IEC 14651 Collation Standard
    • Produced by SC22/WG20 Internationalization
    • Matches Unicode Collation Algorithm, Unicode Technical Standard (UTS) #10
• Unicode Character Database
    • Collection of character classification and properties
    • Geared towards the needs of implementers
    • Supports Internationalization
    • http://www.unicode.org/Public/UNIDATA
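As a rough idea of what an implementer does with the Unicode Character Database, the sketch below parses one record in the semicolon-separated layout used by the database's main file, UnicodeData.txt. The file name, the field positions and the sample line are assumptions brought in for this example (none of them appear on the slide), the line is hard-coded rather than downloaded, and only a few of the fields are kept.

```python
from typing import NamedTuple, Optional


class UcdRecord(NamedTuple):
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    lowercase: Optional[int]  # simple lowercase mapping, if any


# Assumed sample record for U+0041 in UnicodeData.txt layout.
SAMPLE_LINE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"


def parse_unicode_data_line(line: str) -> UcdRecord:
    """Parse one semicolon-separated record into a UcdRecord."""
    fields = line.strip().split(";")
    lower = fields[13]  # field 13 holds the simple lowercase mapping
    return UcdRecord(
        code_point=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),
        bidi_class=fields[4],
        lowercase=int(lower, 16) if lower else None,
    )


record = parse_unicode_data_line(SAMPLE_LINE)
```

Reading the whole file this way would give a lookup table keyed by code point, which is roughly what libraries build from the database.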
Language Innovation
SOURCE:        C / C++                            JAVA / C#
Identifiers    ASCII                              Unicode
Comments       Local charset, byte oriented       Unicode
Literals       L“ ” converts local charset        Unicode
Data Types:
char           Byte oriented                      Unicode
wchar_t        Unicode on some implementations    N/A
Products Are Here!
[Chart: types of products and increased function of products over time, from 1993 through 2000 and beyond]
• Phase 1: Deliver a full set of products: Browsers, Development Tools, Fonts, Word Processors, etc.
• Phase 2: Increased Functionality: More Scripts, Combining Characters, etc.
Products Are HERE!
• Microsoft: Windows XP, Office XP, Internet Explorer 6.0, ECMAScript, C#/CLI
• Compaq: Tru64 Unix
• HP: HP-UX & Printers
• Netscape: Communicator 6.0, JavaScript, ECMAScript
• Sun: Solaris & Java
• Apple: Cyberdog, Mac OS X
• Lotus: Lotus Suite
• Asian solutions: JustSystems (Ichitaro) and Star+Globe (MASS)
• Databases: Software AG, Sybase, Oracle, DB2, NCR Teradata, Progress Software
• SAP platform
• Fonts: Adobe, Agfa/Monotype, Apple Advanced Typography, Bitstream, OpenType
• Tools and libraries: several vendors
Version 3.2 Is Here!
• Version 3.2 is in sync with both parts of ISO/IEC 10646 and 1st amendment to 10646-1
• Total repertoire of 95156 characters
• Completed math repertoire for MathML and other uses
• Further restriction on ill-formed UTF-8
• http://www.unicode.org/unicode/reports/tr28
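What rejecting ill-formed UTF-8 looks like in practice can be sketched with Python's strict UTF-8 decoder. This is an illustration under assumptions: Python's decoder follows the definition in the Unicode version it was built against, and the slide does not say which sequences version 3.2 newly excluded, so the examples below (non-shortest forms, an encoded surrogate, a value above 10FFFF) are simply common cases of ill-formed input. The helper name `is_well_formed_utf8` is made up for this example.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


euro = b"\xe2\x82\xac"             # shortest form of U+20AC EURO SIGN
overlong_nul = b"\xc0\x80"         # two-byte (non-shortest) form of U+0000
overlong_euro = b"\xf0\x82\x82\xac"  # four-byte (non-shortest) form of U+20AC
encoded_surrogate = b"\xed\xa0\x80"  # U+D800 encoded directly
beyond_range = b"\xf4\x90\x80\x80"   # would decode to a value above 10FFFF
```

Only `euro` passes the check; accepting the other sequences would let two different byte strings stand for the same character, which is the kind of ambiguity the restriction is meant to rule out.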
Common Repertoire
• The character repertoires of Unicode and ISO/IEC 10646 are exactly identical
• Three matching encoding forms
• There are minor differences in
    • Terminology
    • Publication format
• Any conformant Unicode implementation conforms to ISO/IEC 10646
Unicode Extends...
• Character semantics
    • “Discover and catalogue”
• Canonical and compatibility equivalence
    • Relate characters to their established use
• Technical reports with implementation guidelines
    • Normalization
    • Script behavior such as bi-directional algorithm
• Active promotion of the standard
What Do 10646 and Unicode Do for You?
Global interoperability - write once run everywhere; One source code one binary with user installable/callable locales
Simplified software - one application with one code set versus multiple applications and managing different code sets
Data stability - A single common and widely adopted format
Reduced costs - development, maintenance, training
Great Expectations
• Enhance global interoperability
• Enhance data interchange
• Permit easier development of localizable products
• Reduce development cost of localized application software
• Replace retrofitting with concurrent development
Recommendations
• Buy the international standard (including all published amendments) as well as the Unicode standard
• Watch for updates on the web including Unicode technical reports and ISO amendments
• Join the Unicode Consortium, W3C, your national body standards committee or other organization to influence standards development processes
• Define your needs and communicate them to your vendors
• Build products that support ISO/IEC 10646 and The Unicode Standard
Thank You!