The CLDR Tutorial Presented by Steven R. Loomis ( @srl295 ) Senior Software Engineer IBM Corporation (Content: John Emmons, IBM, CLDR-TC Chair) Internationalization and Unicode Conference 40 Santa Clara, California USA – November, 2016

The CLDR Tutorial - Unicode® Conference · History IBM ICIR ⇒ ICU data ⇒ original CLDR data Arose out of the realization that in today's world, computer systems and applications


The CLDR Tutorial

Presented by Steven R. Loomis ( @srl295 ) Senior Software Engineer

IBM Corporation

(Content: John Emmons, IBM, CLDR-TC Chair)

Internationalization and Unicode Conference 40
Santa Clara, California USA – November, 2016


CLDR Tutorial Outline

➲ Goals of the Project, Brief History
➲ What Does the CLDR Project Deliver?
➲ Publishing Schedule
➲ Technical Committee and Project Contributors
➲ Terminology and Definitions
➲ Data Submission and Vetting Procedures
➲ Main Data Types
➲ Supplemental Data Types


What is Locale Data?

➲ Any data that is or can be used to define the behavior of computer software or hardware where the culturally accepted behavior is DIFFERENT depending on the written language being used, or the region of the world in which it is being used, or both.

➲ The Unicode Standard is NOT locale data.

➲ CLDR defines two broad categories of locale data: “main” data and “supplemental” data.


Goals of the Project

➲ Provide the most comprehensive, useful and accurate repository of locale data in the computer industry.

➲ Facilitate sharing of data, so that computer systems and applications can talk to one another and expect to get consistent results.

➲ Locale data is a moving target – Keeping it always fresh and up to date is critical.

➲ Stability over time – Want to minimize gratuitous changes to the data where customary usage hasn't changed.

➲ We want to REFLECT, and not DIRECT.


History

➲ IBM ICIR ⇒ ICU data ⇒ original CLDR data
    ● Arose out of the realization that in today's world, computer systems and applications are heterogeneous and diverse, and will contain multiple types of hardware and software from multiple vendors.

➲ Early CLDR data came from a comparison of locale-specific behavior as then currently implemented by major OS and software companies.

➲ 2003: First release (1.0), hosted by openi18n.org

➲ 2004: First release (1.1) under Unicode


Changes Over the Years

➲ More types of data, both main and supplemental.
➲ Many more languages represented.
➲ Many more organizations contributing.
➲ Refinement of the submission/vetting process.
➲ Online data submission/vetting tool.
➲ Move from CVS to SVN (Subversion) for source code control.
➲ Tools, tools, tools – most based on Java 8.
    – Conversion to other formats: POSIX, ICU, JSON
➲ JSON data on GitHub.
➲ Release numbering change from x.y to x.0 for major releases.
    CLDR 30 is actually the 21st major release of CLDR.


What Does the CLDR Project Deliver?

➲ DATA of course! – XML is the “official” mechanism for publication of the data. A JSON equivalent is now published on GitHub.
➲ DTD – Defines the structural details.
➲ LDML specification – Also known as Unicode Technical Standard #35 (UTS #35) – Defines exactly how the XML data above is to be interpreted.
➲ Tools – Java source code for CLDR tools, including conversion utilities and CLDR utility classes.


Publishing Schedule

➲ Two major releases each year
➲ “Big” release publishes in September.
    – Full data submission/vetting using survey tool.
    – Most active contributions from linguists in participating organizations
➲ “Small” release publishes in March.
    – Survey tool is not typically used, unless we are targeting specific languages for data submission.
    – All changes handled via CLDR Trac/SVN ticketing process.


Technical Committee and Project Contributors

➲ Any Unicode member organization is welcome to participate in CLDR TC meetings.

➲ Meetings are held weekly via teleconference on Wednesdays at 8:00 AM Pacific Time.

➲ Additional meetings held on Fridays as necessary (also 8:00 AM Pacific), usually to review/assign CLDR Trac tickets.

➲ Much more informal than the UTC. (Thank heavens for that!)


CLDR TC Officers

➲ John Emmons – IBM
➲ Dr. Mark Davis – Google


Who are the “Big Players”?

➲ Apple
➲ Google
➲ IBM
➲ Microsoft
➲ Wikimedia
➲ If you're not a “Big Player” (yet!), never fear! Your organization's vote counts just as much as the “Big Players”.
➲ More than 300 contributors to the latest release.


Terminology and Definitions


Main Locale Data

➲ Most of the data differs on a per-language basis, rather than on a per-country basis.

➲ Not overly complex – Can be understood by a translator or linguist without being a programmer.

➲ Can use survey tool for data submission / vetting.

➲ Uses BCP 47-compliant language tags for naming, and the locale inheritance model.


Examples of Main Locale Data

➲ Exemplar Characters
➲ Locale display names
    – Names of languages, scripts, and regions
➲ Date and Time Formatting Elements
    – Includes support for many different calendar types
    – Time zone names and time zone city names
➲ Number and Currency Formatting Elements
➲ Unit Formatting Elements


Supplemental Locale Data

➲ Any data that we publish that doesn't meet all of the criteria for “main” locale data

➲ Uses CLDR Trac/SVN Ticketing Process for data changes


Examples of Supplemental Locale Data

➲ First day of the week
➲ Preferred clock – 12 hour vs. 24 hour
➲ Preferred measurement system
➲ Collation data
➲ Rule-based number formatting (number spellout)
➲ Keyboards
➲ Country / Language / Population data


Locale Inheritance Model

➲ To “inherit” in CLDR means “get the data” from a different locale file if you can't find it in the requested one. Allows us to keep data sizes smaller and more consistent.
➲ “Normal” inheritance means “truncate fields between the underscores until you get to root”.
➲ Normal inheritance can be explicitly overridden. Secondary script usage is one such case.

root
    German (de)
        German (Austria) (de_AT)
    Portuguese (pt)
        Portuguese (Portugal) (pt_PT)
            Portuguese (Angola) (pt_AO)
    S. Chinese (zh)
    T. Chinese (zh_Hant)
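
The lookup described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not CLDR's or ICU's actual implementation: the function names (`parent_of`, `inheritance_chain`, `lookup`) are hypothetical, and `PARENT_OVERRIDES` holds only two illustrative entries standing in for the explicit parent locale data that CLDR publishes as supplemental data.

```python
# Illustrative subset of explicit parent overrides (assumption: the full
# list comes from CLDR's supplemental "parent locales" data).
PARENT_OVERRIDES = {
    "zh_Hant": "root",  # secondary script: does not fall back to zh
    "pt_AO": "pt_PT",   # regional variant that falls back to pt_PT first
}


def parent_of(locale, overrides=PARENT_OVERRIDES):
    """Return the parent of a locale ID, or None for root."""
    if locale == "root":
        return None
    if locale in overrides:
        return overrides[locale]
    if "_" in locale:
        # "Normal" inheritance: drop the last underscore-separated field.
        return locale.rsplit("_", 1)[0]
    return "root"


def inheritance_chain(locale, overrides=PARENT_OVERRIDES):
    """List the locales consulted, from the requested one down to root."""
    chain = []
    current = locale
    while current is not None:
        chain.append(current)
        current = parent_of(current, overrides)
    return chain


def lookup(key, locale, data, overrides=PARENT_OVERRIDES):
    """Return the first value found for key along the inheritance chain."""
    for candidate in inheritance_chain(locale, overrides):
        if key in data.get(candidate, {}):
            return data[candidate][key]
    return None
```

With these assumptions, `inheritance_chain("de_AT")` walks `de_AT`, `de`, `root`, while `inheritance_chain("zh_Hant")` goes straight from `zh_Hant` to `root`.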


Draft Status

➲ CLDR uses a “draft” attribute on every data element, which is intended to be an indicator of the level of vetting that has been done on that item.

➲ Applications and libraries often choose to filter out data that doesn't meet a particular threshold. ICU only takes contributed or higher.

➲ Possible values: approved, contributed, provisional, unconfirmed.
    <language type="de" draft="contributed">德文</language>

➲ For main data, draft status is determined by the established voting rules.
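
As a rough illustration of threshold filtering, the following Python sketch keeps only elements at or above a chosen draft level. It is not how ICU performs the filtering. `SAMPLE_XML`, `DRAFT_RANK`, and `filter_by_draft` are hypothetical names, the `fr` and `en` entries are made-up sample data, and treating a missing `draft` attribute as approved is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Draft levels ordered from least to most vetted.
DRAFT_RANK = {"unconfirmed": 0, "provisional": 1, "contributed": 2, "approved": 3}

# Hypothetical fragment; only the "de" line comes from the slide.
SAMPLE_XML = """<languages>
  <language type="de" draft="contributed">德文</language>
  <language type="fr" draft="provisional">法文</language>
  <language type="en">英文</language>
</languages>"""


def filter_by_draft(xml_text, minimum="contributed"):
    """Return {type: text} for <language> elements at or above `minimum`."""
    threshold = DRAFT_RANK[minimum]
    kept = {}
    for element in ET.fromstring(xml_text).iter("language"):
        # Assumption: an element without a draft attribute counts as approved.
        status = element.get("draft", "approved")
        if DRAFT_RANK[status] >= threshold:
            kept[element.get("type")] = element.text
    return kept
```

Calling `filter_by_draft(SAMPLE_XML)` with the default threshold drops the provisional `fr` entry and keeps `de` and `en`.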


Coverage Levels

➲ Broad groupings of data items that provide a hierarchy of data fields in terms of importance.

➲ Core, basic, moderate, modern, comprehensive

➲ Organizations can set coverage targets on a per-language basis.

➲ Most set “modern” as the coverage target, which provides a fairly complete set of data for most applications.
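
One way to picture the hierarchy is as an ordered enumeration, where a coverage target includes every item at or below that level. The Python sketch below is only an illustration: the `Coverage` enum, `items_within`, and the item-to-level assignments in `SAMPLE_ITEMS` are hypothetical and do not reflect CLDR's real coverage data.

```python
from enum import IntEnum


class Coverage(IntEnum):
    """Coverage levels in increasing order, as listed on the slide."""
    CORE = 1
    BASIC = 2
    MODERATE = 3
    MODERN = 4
    COMPREHENSIVE = 5


# Hypothetical assignments, for illustration only.
SAMPLE_ITEMS = [
    ("exemplar characters", Coverage.CORE),
    ("month names", Coverage.BASIC),
    ("common currency names", Coverage.MODERN),
    ("rarely used calendar names", Coverage.COMPREHENSIVE),
]


def items_within(items, target):
    """Return the names of items whose level is at or below the target."""
    return [name for name, level in items if level <= target]
```

Here `items_within(SAMPLE_ITEMS, Coverage.MODERN)` returns the first three sample items and leaves out the comprehensive-only one.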


CLDR Trac/SVN Ticket Resolution Process

➲ Ticket is opened in CLDR Trac
➲ CLDR TC reviews each new ticket during weekly call
    – Assign for fix
    – Return for more info or reject
    – Assign for design (to be assigned for fix once design is approved by the TC)
➲ Owner checks correction into SVN, assigns reviewer and requests review.
➲ Reviewer closes ticket, or returns with feedback.


Some Guiding Principles

➲ Reflect, don't direct.
➲ Require references if reasonable, but also trust people whenever possible.
➲ Common usage can differ from the “official” or “standard” position. Reflect common usage in case of discrepancies.
➲ Be aware of political sensitivities.
➲ Promote stability whenever possible.


Survey Tool Data Submission - Phases

➲ Preparation
➲ Beta
➲ Submission
➲ Vetting
➲ Dispute Resolution
➲ Testing
➲ Prepare for Publish


Survey Tool – Voting Rules

➲ Each organization can assign each of its contributing members to one of the following categories:
    – TC member (20 points)
    – Expert (8 points)
    – Regular vetter (4 points)
    – Guest (1 point)

➲ Liaison or associate members can only assign guest access. Interested individuals (non-Unicode members) can also request a guest account.

➲ For each data item, the organization's vote is the highest point value from an individual contributor. For example, 3 regular vetters from the same organization all agreeing on the same value receive a collective vote of 4 points for the organization, not 12 points.
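
The "highest individual vote per organization" rule can be sketched as follows. This is a simplified Python illustration, not the Survey Tool's actual code. The names `POINTS` and `tally` are hypothetical, and two behaviors are assumptions of this sketch rather than statements from the slide: organization votes are summed across organizations, and members of one organization who back different values are tallied separately per value.

```python
from collections import defaultdict

# Point values per contributor category, as listed on the slide.
POINTS = {"tc": 20, "expert": 8, "vetter": 4, "guest": 1}


def tally(votes):
    """Tally votes given as (organization, category, value) tuples.

    Returns {value: total points}. Within one organization only the
    highest-ranked contributor backing a value counts.
    """
    best = defaultdict(int)  # (organization, value) -> highest points seen
    for organization, category, value in votes:
        key = (organization, value)
        best[key] = max(best[key], POINTS[category])

    totals = defaultdict(int)
    for (_organization, value), points in best.items():
        # Assumption: points from different organizations add up.
        totals[value] += points
    return dict(totals)
```

Three regular vetters from one organization agreeing on a value yield 4 points for that value, matching the example on the slide.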


Voting Rules – Part 2

➲ Value with the highest number of votes goes into CLDR, as long as it has at least as many votes as the previous release's value.

➲ Number of votes determines the published “draft status” of the item.

➲ Two sets of rules – standard and “high bar”.

➲ Standard rules:
    – 4+ points → draft="approved"
    – Conflicting 4 point votes → draft="provisional"
    – 2-3 points → draft="contributed"
    – 1 point → draft="unconfirmed"


Voting Rules – Part 3

➲ Some locales are designated as “high bar” locales due to their frequent usage and high visibility. These locales have a higher threshold for approval, as follows:
    – 8+ points → draft="approved"
    – Conflicting 4 point votes → draft="provisional"
    – 5-7 points → draft="contributed"
    – 1-3 points → draft="unconfirmed"

➲ Individual data fields can be designated by the TC as “needs TC approval”. This is normally reserved for high visibility fields that could cause significant problems if they were to be changed. Thus, a 20 point vote is required for approval, and these items can be “flagged” for TC review during the survey process.
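
The standard and “high bar” thresholds from the voting rules slides can be written down as a small Python function. This is a sketch of the slide text only, not the Survey Tool's resolver. The name `draft_status` and its parameters are hypothetical, and "conflicting 4 point votes" is read here as the winning value and the runner-up both holding exactly 4 points, which is an assumption. The “needs TC approval” case is not modeled.

```python
def draft_status(points, runner_up_points=0, high_bar=False):
    """Map a winning value's points to a draft status per the slides.

    Returns None when the slides do not specify an outcome.
    """
    if points <= 0:
        return None
    # Assumed reading of "conflicting 4 point votes".
    if points == 4 and runner_up_points == 4:
        return "provisional"
    if high_bar:
        if points >= 8:
            return "approved"
        if points >= 5:
            return "contributed"
        if points <= 3:
            return "unconfirmed"
        # Exactly 4 points without a conflict is not covered by the slide.
        return None
    if points >= 4:
        return "approved"
    if points >= 2:
        return "contributed"
    return "unconfirmed"
```

For example, a single regular vetter's 4 points reach "approved" under the standard rules, while the same vote in a high bar locale falls into the range the slide leaves unspecified.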


Survey Tool - Preparation

➲ New structure is added to DTD and to English
➲ Functional enhancements to the survey tool are implemented and tested.
➲ Coverage Levels are determined
➲ Organizations create accounts for their participants


Survey Tool – Beta

➲ Survey tool is opened for people to try, but any votes submitted get tossed once data submission phase begins.

➲ Beta hasn't worked well for us up until now, since not enough people do things during the beta period. Considering an “early voting” phase where a smaller set of experienced vetters could vote for real, and help us shake out any bugs.


Survey Tool – Submission

➲ “Live” voting begins. Contributors are allowed to enter new values, change existing values, and vote to confirm values that already exist.

➲ ST Forums can be used to ask questions, and discuss specifics of any given data item.

➲ 6-8 weeks in duration.


Survey Tool – Vetting

➲ First step in slowing down the flow of changes.
➲ New values can't be entered, except by a TC member, or unless needed to eliminate an error or warning.
➲ Existing values can be voted upon, votes can be changed.
➲ Goal is to drive consensus and find the best value.
➲ Duration 2-3 weeks.


Survey Tool – Dispute Resolution

➲ Guests, regular vetters, and experts can no longer vote.
➲ Only TC members can vote, in an effort to eliminate errors and resolve disputes across organizations.
➲ Goal is to get the data tests to pass so that data can be moved from ST to our Subversion repository.


Survey Tool - Testing

➲ Once data is moved from ST to Subversion, all remaining work for the release is done via the CLDR Trac/SVN Ticket Resolution Process.

➲ All changes must pass CLDR's automated build process, which consists of:
    – Data testing, same tests that run in the ST context.
    – Unit testing – additional tests for consistency and correctness.

➲ Additional tests defined as part of the BRS.


Prepare for Publish

➲ BRS = The Big Red Switch
    – Contains all the tasks that we need to perform in order to make the data good enough to publish.
    – Many tasks and tests, highlights include:
        ● CLDRModify – Program that puts data in consistent structure and order, minimizes country data, removes deprecated items.
        ● Work items pertaining to supplemental data.
        ● ICU Testing and Integration
        ● TR35 updates


CLDR Main Data Types

➲ Exemplar Characters
➲ Characters and Delimiters
➲ Context Transforms
➲ Locale Display Names
➲ Dates
➲ Numbers
➲ Units
➲ List Patterns
➲ POSIX yes/no


CLDR Numbers Data – Key Concepts

➲ Numbering system – A mechanism through which numeric values are represented. There are two types:
    – Numeric – Numbers are represented with a set of 10 digits, and each digit represents place value. Most sets of digits are associated with a single script in Unicode, so the 4-letter script identifier usually serves as the numbering system identifier.
    – Algorithmic – Cannot be represented with a simple set of digits, but requires a more complex set of rules. These rules are defined in CLDR's RBNF data (rule-based number formatting).
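
For the numeric type, formatting reduces to substituting each decimal digit with the corresponding digit of the chosen system. The Python sketch below illustrates that idea only. `DIGIT_ZERO` and `format_integer` are hypothetical names, the table covers just three systems, and the plain ASCII minus sign is a simplification, since a real formatter would take the sign from locale data.

```python
# Code point of the zero digit for a few numeric numbering systems.
# The other nine digits follow consecutively in Unicode.
DIGIT_ZERO = {
    "latn": 0x0030,  # Western digits 0-9
    "arab": 0x0660,  # Arabic-Indic digits
    "deva": 0x0966,  # Devanagari digits
}


def format_integer(value, numbering_system="latn"):
    """Render an integer using the digits of a numeric numbering system."""
    zero = DIGIT_ZERO[numbering_system]
    digits = "".join(chr(zero + int(ch)) for ch in str(abs(value)))
    # Simplification: always use an ASCII minus sign.
    return ("-" if value < 0 else "") + digits
```

Algorithmic systems cannot be handled this way; they need rule-based data such as the RBNF rules mentioned above.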


CLDR Numbers Data – Key Concepts

➲ Numbers can have different symbols and different formats as they are applied to different numbering systems.
    – Example: Use Arabic percent symbol for numbers written using Arabic digits, but regular percent symbol when using western digits.

➲ Special numbering system identifiers: native, traditional, finance.

➲ Rules for standard number formatting, as well as compact decimals and currency values.
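
The percent example can be illustrated by keying symbols on the numbering system rather than on the locale alone. This Python sketch is a simplification with hypothetical names (`SYMBOLS`, `format_percent`). It assumes a non-negative integer and a fixed "number followed by percent sign" layout, whereas real CLDR data supplies a locale-specific pattern.

```python
# Per-numbering-system symbols; only the zero digit and percent sign
# are modeled here.
SYMBOLS = {
    "latn": {"zero": 0x0030, "percent": "%"},
    "arab": {"zero": 0x0660, "percent": "\u066A"},  # ARABIC PERCENT SIGN
}


def format_percent(value, numbering_system):
    """Format a non-negative integer percentage for a numbering system."""
    symbols = SYMBOLS[numbering_system]
    digits = "".join(chr(symbols["zero"] + int(ch)) for ch in str(value))
    # Assumption: the sign simply follows the digits.
    return digits + symbols["percent"]
```

The same value therefore picks up a different percent sign depending on which digits it is written with.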


CLDR Dates Data – Key Concepts

➲ Many different calendar types are supported, but aliases for many fields are provided in order to reduce the overall size of the data. Gregorian calendar is the base.
➲ Two contexts: format vs. stand-alone.
➲ 3 widths for month names, 4 for weekday names.
➲ Stock formats – 4 widths each for date, time, and combo.
➲ Available formats – Allows more flexible formatting by matching requested skeletons and retrieving available patterns.
➲ Interval formatting, relative date/time, and durations also supported.
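
The skeleton-to-pattern idea behind available formats can be shown with the simplest possible case, an exact-match lookup. This Python sketch is deliberately minimal: `AVAILABLE_FORMATS` contains illustrative English-style sample entries, `pattern_for` is a hypothetical name, and the closest-match logic that real implementations apply when a skeleton is not listed is not modeled.

```python
# Illustrative skeleton -> pattern entries (sample data, not a full set).
AVAILABLE_FORMATS = {
    "yMMMd": "MMM d, y",
    "MEd": "E, M/d",
    "Hm": "HH:mm",
}


def pattern_for(skeleton, available=AVAILABLE_FORMATS):
    """Return the pattern stored for a skeleton, or None if not listed."""
    return available.get(skeleton)
```

A caller asks for the fields it wants, such as `yMMMd`, and receives a pattern with locale-appropriate ordering and punctuation instead of hard-coding the layout.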


CLDR Supplemental Data Types

➲ Collation
➲ RBNF – Rule Based Number Formatting
➲ Keyboards
➲ Transforms
➲ Character Fallbacks
➲ Day Periods
➲ Gender
➲ Likely Subtags
➲ Language Matching
➲ Metazones
➲ Plurals
➲ Telephone Codes
➲ Subdivisions
➲ Windows Zones to CLDR TimeZone Mappings
➲ Time Data (12 vs. 24 hour clock)
➲ First Day of Week
➲ Weekends
➲ Currency Usage by Date
➲ Currency Digits and Rounding
➲ Territory Containment
➲ Language and Script Data
➲ Country / Language / Population / Literacy
➲ Calendar Era Start and End Dates
➲ Calendar Preferences by Country
➲ Measurement System Usage
➲ Territory and Currency Code Mappings
➲ Inheritance Overrides (explicit Parent Locales)


Useful Links and Resources

➲ CLDR Main Web Site
    – http://unicode.org/cldr

➲ CLDR Trac – Bug tracking and feature requests
    – http://unicode.org/cldr/trac

➲ CLDR Survey Tool
    – http://st.unicode.org/cldr-apps/

➲ Unicode Technical Standard #35 – LDML Specification
    – http://www.unicode.org/reports/tr35/


More Resources

➲ JSON Data for CLDR
    – https://github.com/unicode-cldr/cldr-json

➲ “CLDR Users” Mailing List
    – http://unicode.org/consortium/distlist.html