Using Unicode for Linguistic Data

Using the Unicode Standard for Linguistic Data: Preliminary Guidelines

Deborah Anderson, Researcher, Dept. of Linguistics, UC Berkeley

Introduction

E-MELD and its mission

What is the situation for character encoding?

The role of this presentation

Background: What is Unicode?

Core Concepts

Practical Issues: How do I get Unicode to work?

Organization of the Unicode Standard

Finding the character you need

Other practical issues

Further recommendations

Background: What is Unicode?

Unicode is the international character encoding standard

It assigns a unique number to every character, and this number stays the same “no matter what the platform, no matter what the program, no matter what the language”


Example:

the Unicode character code for Latin capital letter A is: U+0041

Unicode format: U+xxxx (xxxx is the character's number in hexadecimal, four to six digits)

Used for “plain text” representation (e.g., 0045 002D 004D 0045 004C 0044 = E-MELD)
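The mapping between characters and their numbers can be checked directly. The following is a minimal Python sketch, not part of the original slides; the helper names `to_u_notation` and `from_hex_codes` are made up for illustration.

```python
def to_u_notation(text: str) -> list[str]:
    """Return the U+xxxx notation for each character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]


def from_hex_codes(codes: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(code, 16)) for code in codes.split())


# Latin capital letter A
print(to_u_notation("A"))
# The plain-text sequence quoted on the slide
print(from_hex_codes("0045 002D 004D 0045 004C 0044"))
```

The first call gives the code for Latin capital letter A, and the second rebuilds the text from the sequence of codes shown above.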

Different from “rich text,” which is plain text with additional information (including formatting information, such as font size, styles, etc.)


Example: Superscripts

(a) Plain text: use Unicode characters, e.g., use U+02B0 for superscript “h”

(b) Rich text: apply a superscript style to a base character to get the superscript “h”, e.g., <sup>h</sup> (This can be done in MS Word by selecting the “superscript” formatting feature on the “font” menu.)
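The difference between the two approaches shows up as soon as the text is treated as data. A small Python sketch, assuming only the standard `unicodedata` module; the variable names are illustrative:

```python
import unicodedata

superscript_h = "\u02B0"       # plain text: a single Unicode character
rich_text_h = "<sup>h</sup>"   # rich text: an ordinary "h" wrapped in markup

# The plain-text version carries its identity in the character itself
print(unicodedata.name(superscript_h))

# The rich-text version is a plain "h" plus formatting instructions
print(len(superscript_h), len(rich_text_h))
```

If the markup is stripped, version (b) becomes an ordinary “h”, while version (a) stays a superscript in any plain-text environment.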

Widely supported by computer companies and national bodies: many current fonts, keyboards, and software are based on Unicode

But… the process to get characters incorporated can be lengthy (2+ years), so there can be lag-time before they appear in fonts, etc.

Core Concepts

1. Characters, not glyphs.

Characters are “the smallest components of written language that have semantic value” (TUS, p. 13)

Glyphs: the surface representation of abstract characters; what appears on the page or on your monitor


1. Characters, not glyphs. Example:

Abstract character: a (small Latin letter a): Unicode’s domain

Glyphs: a, a, a, a, a (the same letter in different typefaces): the font’s domain



Don’t take the glyphs in the Unicode Standard code charts as definitive; they are representative shapes only.



Characters aren’t necessarily the same as graphemes:

Spanish: ch (one grapheme)

Unicode: c + h (two characters)



There is not always a 1-1 relationship between a character and glyph:

(a) Arabic: one character can have different glyphs depending upon position in a word

(b) Devanagari: the glyph for ksha is made up of 3 characters: ka + virama + sha (U+0915 + U+094D + U+0937; the third is named DEVANAGARI LETTER SSA in Unicode)
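Both mismatches between characters and what a reader perceives can be seen by counting characters. A short Python sketch, assuming the standard `unicodedata` module; the variable name `ksha` is illustrative:

```python
import unicodedata

# ka + virama + ssa: normally drawn as a single conjunct glyph
ksha = "\u0915\u094D\u0937"

for ch in ksha:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(len(ksha))   # counts characters (code points), not glyphs
print(len("ch"))   # Spanish "ch": one grapheme, two characters
```

String length in most programming languages counts characters (or code units), so any processing that needs graphemes or glyphs has to be handled above the character level.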


2. No new precomposed forms or digraphs

Example: base character + combining diacritic
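Because new letter-plus-diacritic combinations are built from a base character and a combining mark, the same visible letter can sometimes be stored in two ways. The Python sketch below uses the standard `unicodedata.normalize` function; the choice of example letters is an assumption made for illustration, not taken from the slides.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE (an older precomposed form)
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

# The two spellings look alike but are different character sequences
print(precomposed == decomposed)

# Normalization converts between them
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)

# Schwa with acute has no precomposed form, so it stays a two-character sequence
schwa_acute = "\u0259\u0301"
print(len(unicodedata.normalize("NFC", schwa_acute)))
```

Normalizing data to one form before searching or comparing avoids mismatches between the two spellings.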


3. No variants

4. No idiosyncratic characters


5. Unify, wherever possible

Greek letter beta is unified with IPA beta (voiced bilabial fricative)

Compare two characters that look alike but are encoded separately:

U+0283 LATIN SMALL LETTER ESH (voiceless post-alveolar fricative)

U+222B INTEGRAL (mathematical symbol)
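The identity of each of these characters can be read from its name and general category. A brief Python sketch, assuming the standard `unicodedata` module; the helper name `identity` is hypothetical:

```python
import unicodedata


def identity(ch: str) -> tuple[str, str, str]:
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


# Beta as used in both Greek text and IPA, then esh and the integral sign
for ch in ("\u03B2", "\u0283", "\u222B"):
    print(*identity(ch))
```

Esh is a lowercase letter (category `Ll`) and the integral sign is a math symbol (category `Sm`), so software treats them differently even though they look alike.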

Practical Issues: Getting Unicode to Work

A recent operating system (Mac OS 9.2, X, Windows CE, NT, 2000, XP, GNU/Linux with glibc 2.2.2+)

A recent browser (IE, Safari, OmniWeb, Mozilla/Netscape)

A Unicode text editor (Word 2000, 2002, Unipad, Apple “TextEdit”)

An input mechanism (“insert symbol,” keyboard, Keyman)

A Unicode-enabled font (Code2000, Lucida Sans Unicode, SIL’s Doulos, Gentium, Arial Unicode MS)

Note: Be wary of “Unicode” fonts; they may only be partially Unicode-compliant.

Organization of the Unicode Standard

Unicode Code Charts

Code Chart (Phonetic Extensions block)
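When the chart itself is not at hand, the contents of a block can be listed from the character database that ships with Python. This is a rough sketch: the range U+1D00 to U+1D7F is the Phonetic Extensions block, the function name `list_block` is made up, and the output depends on the Unicode version bundled with the Python installation.

```python
import unicodedata

# Code point range of the Phonetic Extensions block
PHONETIC_EXTENSIONS = range(0x1D00, 0x1D80)


def list_block(block: range) -> list[tuple[str, str]]:
    """Return (code point, name) for every assigned character in the range."""
    rows = []
    for cp in block:
        name = unicodedata.name(chr(cp), "")  # "" for unassigned code points
        if name:
            rows.append((f"U+{cp:04X}", name))
    return rows


rows = list_block(PHONETIC_EXTENSIONS)
for code, name in rows[:5]:
    print(code, name)
```

A list like this gives the names and code points only; the code charts remain the place to see representative glyphs and annotations.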



Steps to using Unicode: Finding the character you need

1. See if it is in Unicode:

Check the IPA blocks (etc.) on the Unicode website

Check Appendix 2 of the IPA Handbook or a Web version of the IPA symbols
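If the likely Unicode name of a character is known, a quick check is also possible from Python. A minimal sketch using the standard `unicodedata.lookup` function; the wrapper name `find_by_name` is hypothetical, and a miss only means the name is not in the Unicode version bundled with that Python installation.

```python
import unicodedata
from typing import Optional


def find_by_name(name: str) -> Optional[str]:
    """Return the U+xxxx code for an exact character name, or None if not found."""
    try:
        ch = unicodedata.lookup(name)
    except KeyError:
        return None
    return f"U+{ord(ch):04X}"


print(find_by_name("LATIN SMALL LETTER ESH"))
print(find_by_name("LATIN SMALL LETTER SCHWA"))
print(find_by_name("LATIN SMALL LETTER NO SUCH LETTER"))
```

The lookup needs the exact formal name, so the code charts and the IPA Handbook listing are still the better starting point when the name is unknown.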

Note: In looking through Unicode and using “Insert Symbol”/font charts, be careful of “spoof buddies” (look-alike characters).
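One way to guard against look-alikes is to print the code point and name of what was actually typed. The pairs below are illustrative choices that are common in phonetic transcription, not the examples from the original slide, and the function name `compare` is hypothetical.

```python
import unicodedata


def compare(a: str, b: str) -> str:
    """Describe two single characters side by side."""
    left = f"U+{ord(a):04X} {unicodedata.name(a)}"
    right = f"U+{ord(b):04X} {unicodedata.name(b)}"
    return f"{left}  vs  {right}"


# Keyboard character versus the similar-looking phonetic character
look_alikes = [("g", "\u0261"), (":", "\u02D0"), ("'", "\u02BC")]

for typed, phonetic in look_alikes:
    print(compare(typed, phonetic))
```

Two characters that render almost identically will not match in a search or sort, so it is worth checking before a convention is fixed for a corpus.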

2. See if it is in the process of being proposed:

Check on Unicode’s Proposed New Characters page

Ask on the Transcription email list

Ask on the Unicode email list

Verify the character you need is a true character, and not a variant

If you find a character that is missing

Work with Peter Constable to get it proposed. A proposal is composed of:

the character’s name

a representative glyph

information on the character’s properties

a representative sample of the character in context

a short bibliography with references

How can I use a character not yet in Unicode?

Use FontLab or work with a font foundry to create a font in the interim, using the Private Use Area (PUA); fully document PUA characters.

Use markup / entities

Use Scalable Vector Graphics.

TEI is preparing guidelines, but nothing has yet been finalized.
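Documenting PUA characters is easier if they can be located automatically. The Python sketch below relies on the fact that private-use characters have the general category `Co`; the function name and the sample string are made up for illustration.

```python
import unicodedata


def find_private_use(text: str) -> list[tuple[int, str]]:
    """Return (position, code point) for every Private Use Area character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # Co = private use
    ]


# Hypothetical data in which U+E000 stands for a character not yet in Unicode
sample = "ka\uE000ta"
print(find_private_use(sample))
```

A report like this can be kept alongside the data so that PUA characters can be replaced once the real character is encoded.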

For those languages without an orthography

Use Unicode characters if possible

Verify character properties are similar

Stay away from certain characters:

Presentation forms

Letterlike symbols

Number forms
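A draft orthography can be screened for characters from these groups. The sketch below hard-codes the ranges of the corresponding Unicode blocks, because Python's `unicodedata` module does not expose block names; the dictionary and function names are hypothetical.

```python
# Code point ranges of the blocks named on the slide
DISCOURAGED_BLOCKS = {
    "Letterlike Symbols": range(0x2100, 0x2150),
    "Number Forms": range(0x2150, 0x2190),
    "Alphabetic Presentation Forms": range(0xFB00, 0xFB50),
    "Arabic Presentation Forms-A": range(0xFB50, 0xFE00),
    "Arabic Presentation Forms-B": range(0xFE70, 0xFF00),
}


def discouraged(text: str) -> list[tuple[str, str]]:
    """Return (code point, block name) for characters in the listed blocks."""
    hits = []
    for ch in text:
        for block, code_points in DISCOURAGED_BLOCKS.items():
            if ord(ch) in code_points:
                hits.append((f"U+{ord(ch):04X}", block))
    return hits


# KELVIN SIGN looks like K; the fi ligature looks like f + i
print(discouraged("\u212A \uFB01 K fi"))
```

Ordinary letters pass through untouched, while the Kelvin sign and the ligature are flagged even though they look like plain Latin letters.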

How do I tell if my font is Unicode-compliant?

Set your font as the default for your browser, then look at a test page, such as Alan Wood’s IPA Extensions page.

Use font utilities to check the fonts on your system (see Alan Wood’s website)

What about my data that is in a non-Unicode font?

If possible, upgrade your documents to Unicode, converting to a Unicode font.

Use a converter

If the font you use isn’t included, create a converter and have it hosted on a publicly available website
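At its simplest, a converter is a table from the characters a legacy font reuses to the Unicode characters they were meant to show. The Python sketch below uses an invented mapping for an imaginary legacy font; a real converter needs the full, verified table for the actual font.

```python
# Hypothetical mapping: what was typed in the legacy font -> intended Unicode character
LEGACY_TO_UNICODE = {
    "S": "\u0283",  # LATIN SMALL LETTER ESH
    "N": "\u014B",  # LATIN SMALL LETTER ENG
    "@": "\u0259",  # LATIN SMALL LETTER SCHWA
    "?": "\u0294",  # LATIN LETTER GLOTTAL STOP
}

_TABLE = str.maketrans(LEGACY_TO_UNICODE)


def convert(text: str) -> str:
    """Replace legacy-font characters with their Unicode equivalents."""
    return text.translate(_TABLE)


print(convert("S@N"))
```

A plain character-for-character table only works when every stretch of text being converted is known to be in the legacy font; mixed documents need the font information to decide which runs to convert.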

Encoding Forms

Different ways to represent a character’s number as a sequence of code units:

One to four 8-bit values (UTF-8)

One or two 16-bit values (UTF-16)

A single 32-bit value (UTF-32)

Reason for different forms: different implementation needs

Some tradeoffs for storage/processing

Suggestion: Use UTF-8 or UTF-16
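The storage tradeoff between the encoding forms can be seen by encoding the same text three ways. A small Python sketch; the function name `encoded_sizes` is hypothetical, and the big-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes the text occupies in each encoding form."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


esh = "\u0283"
# The same character, U+0283, as bytes in each form
print(esh.encode("utf-8").hex(" "))
print(esh.encode("utf-16-be").hex(" "))
print(esh.encode("utf-32-be").hex(" "))

# ASCII-only text is smallest in UTF-8
print(encoded_sizes("E-MELD"))

# A character beyond U+FFFF (GOTHIC LETTER AHSA) takes four bytes in every form
print(encoded_sizes("\U00010330"))
```

Text that is mostly ASCII is most compact in UTF-8, while UTF-32 always spends four bytes per character in exchange for fixed-width access.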

Further recommendations

Groups of users (e.g., Athabaskanists) should publicly document Unicode values for the orthography and give font recommendations.

Provide feedback on missing characters to Peter Constable.
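Public documentation of an orthography's Unicode values can be generated rather than typed by hand. The sketch below is one possible approach: the inventory is a made-up example, not an actual Athabaskan orthography, and the function name `document_orthography` is hypothetical.

```python
import unicodedata


def document_orthography(graphemes: list[str]) -> list[str]:
    """Return one tab-separated line per grapheme: text, code points, character names."""
    lines = []
    for g in graphemes:
        codes = " ".join(f"U+{ord(ch):04X}" for ch in g)
        names = " + ".join(unicodedata.name(ch) for ch in g)
        lines.append(f"{g}\t{codes}\t{names}")
    return lines


# Hypothetical inventory: a single letter, a digraph, and a letter with a combining mark
inventory = ["\u0142", "t\u0142", "\u0105\u0301"]

for line in document_orthography(inventory):
    print(line)
```

Listing every code point in multi-character graphemes makes it clear to other users which sequence to type, including the order of base letters and combining marks.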

Appendices

1: Linguistic letters and symbols in Unicode

2: Characters known to be missing

3: Normalization
