Using the Unicode Standard for Linguistic Data: Preliminary Guidelines Deborah Anderson Researcher Dept. of Linguistics, UC

Using Unicode for Linguistic Data



Page 1: Using Unicode for Linguistic Data

Using the Unicode Standard for Linguistic Data: Preliminary Guidelines

Deborah Anderson
Researcher
Dept. of Linguistics, UC Berkeley

Page 2: Using Unicode for Linguistic Data

Introduction:
• E-MELD and its mission
• What is the situation for character encoding?
• The role of this presentation

Page 3: Using Unicode for Linguistic Data

Background: What is Unicode?

Core Concepts

Practical Issues: How do I get Unicode to work?

Organization of the Unicode Standard

Finding the character you need

Other practical issues

Further recommendations

Page 4: Using Unicode for Linguistic Data

Background: What is Unicode?

• Unicode is the international character encoding standard.
• It assigns a unique number to every character, and this number stays the same “no matter what the platform, no matter what the program, no matter what the language.”

Page 5: Using Unicode for Linguistic Data

Background: What is Unicode?

Example:

the Unicode character code for Latin capital letter A is: U+0041

Unicode format: U+xxxx (xxxx is in hex)
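The U+xxxx notation can be reproduced in a few lines of code. This is a minimal sketch in Python; the helper names `to_u_notation` and `from_u_notation` are made up for illustration and are not part of any standard library.

```python
def to_u_notation(ch: str) -> str:
    """Return the U+xxxx label for a single character (hex, at least 4 digits)."""
    return f"U+{ord(ch):04X}"


def from_u_notation(label: str) -> str:
    """Turn a label such as 'U+0041' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a U+ label: {label!r}")
    return chr(int(label[2:], 16))


print(to_u_notation("A"))         # U+0041
print(from_u_notation("U+0041"))  # A
```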

Page 6: Using Unicode for Linguistic Data

Background: What is Unicode?

• Used for “plain text” representation (e.g., 0045 002D 004D 0045 004C 0044 = E-MELD)
• Different from “rich text,” which is plain text with additional information (including formatting information, such as font size, styles, etc.)
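The plain-text example can be checked by converting each hex value to its character and back. A small Python sketch; the variable names are illustrative only.

```python
# The six code points from the slide, written as hex strings.
code_points = "0045 002D 004D 0045 004C 0044"

# Hex string -> integer -> character, then join the characters.
text = "".join(chr(int(cp, 16)) for cp in code_points.split())
print(text)  # E-MELD

# And the reverse direction: characters back to 4-digit hex values.
round_trip = " ".join(f"{ord(ch):04X}" for ch in text)
print(round_trip == code_points)  # True
```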

Page 7: Using Unicode for Linguistic Data

Using Unicode for Linguistic DataBackground: What is Unicode?

Example: Superscripts

(a) Plain text: use Unicode characters
e.g., use U+02B0 for superscript “h”

(b) Rich text: apply superscript style to a base character to get the superscript “h”
e.g., <sup>h</sup> (In MS Word, this can be done by selecting the “superscript” formatting feature on the “font” menu.)
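The difference between the two approaches shows up when the data is processed. A rough Python sketch, assuming an aspirated “p” as the sample data (the sample strings are illustrative, not taken from the slides):

```python
import unicodedata

# (a) Plain text: the superscript is its own character, U+02B0.
plain = "p\u02B0"

# (b) Rich text: an ordinary "h" wrapped in markup; the superscript
# exists only as formatting and is lost if the markup is stripped.
rich = "p<sup>h</sup>"
stripped = rich.replace("<sup>", "").replace("</sup>", "")

print(unicodedata.name(plain[1]))  # MODIFIER LETTER SMALL H
print(len(plain))                  # 2
print(stripped)                    # ph
```

In the plain-text version the aspiration survives copy-and-paste into any Unicode-aware program; in the rich-text version it depends on the markup being preserved.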

Page 8: Using Unicode for Linguistic Data

Background: What is Unicode?

• Widely supported by computer companies and national bodies: many current fonts, keyboards, and software are based on Unicode
• But… the process to get characters incorporated can be lengthy (2+ years), so there can be lag-time before they appear in fonts, etc.

Page 9: Using Unicode for Linguistic Data

Core Concepts:

1. Characters, not glyphs.

Characters are “the smallest components of written language that have semantic value” (TUS, p. 13)

Glyphs: the surface representation of abstract characters; what appears on the page or on your monitor

Page 10: Using Unicode for Linguistic Data

Core Concepts:

1. Characters, not glyphs.

Example:

Abstract character (Unicode’s domain): a (small Latin letter a)

Glyphs (the font’s domain): a, a, a, a, a

Page 11: Using Unicode for Linguistic Data

Core Concepts:

1. Characters, not glyphs.

Don’t take glyphs in the Unicode Standard charts as definitive:

Page 12: Using Unicode for Linguistic Data

Core Concepts:

1. Characters, not glyphs.

Characters aren’t necessarily the same as graphemes:

Spanish grapheme: ch

Unicode characters: c + h

Page 13: Using Unicode for Linguistic Data

Core Concepts:

1. Characters, not glyphs.

There is not always a 1-1 relationship between a character and a glyph:

(a) Arabic: one character can have different glyphs depending upon its position in a word

(b) Devanagari: the glyph for ksha is made up of 3 characters: ka + virama + ṣa (U+0915 U+094D U+0937)
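Both points can be seen by listing the characters behind what the reader perceives as one unit. A minimal Python sketch; `describe` is a helper invented for this example. Note that the Unicode name of the third Devanagari character is “SSA”.

```python
import unicodedata


def describe(text: str) -> list:
    """List (U+xxxx, Unicode name) for every character in the string."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


spanish_ch = "ch"              # one traditional grapheme, two characters
ksha = "\u0915\u094D\u0937"    # one conjunct glyph, three characters

for label, name in describe(spanish_ch) + describe(ksha):
    print(label, name)
# U+0063 LATIN SMALL LETTER C
# U+0068 LATIN SMALL LETTER H
# U+0915 DEVANAGARI LETTER KA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0937 DEVANAGARI LETTER SSA
```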

Page 14: Using Unicode for Linguistic Data

Core Concepts:

2. No new precomposed forms or digraphs

Example:

+
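Because new precomposed forms are not added, an accented letter that lacks one is written as a base character followed by a combining mark. The Python sketch below is an illustration chosen for this write-up (the original slide’s example is not recoverable): “é” has a legacy precomposed form, while open e (ɛ) with acute does not, so it stays a two-character sequence even after normalization.

```python
import unicodedata

precomposed = "\u00E9"        # é as a single (legacy) precomposed character
combining = "e\u0301"         # e + COMBINING ACUTE ACCENT

# The two spellings are different code point sequences...
print(precomposed == combining)                                 # False
# ...but are canonically equivalent: NFC composes, NFD decomposes.
print(unicodedata.normalize("NFC", combining) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == combining)   # True

# Open e with acute has no precomposed form, so NFC leaves two characters.
open_e_acute = "\u025B\u0301"
print(len(unicodedata.normalize("NFC", open_e_acute)))          # 2
```

When comparing or searching linguistic data, normalizing both sides to the same form first avoids false mismatches between the two spellings.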

Page 15: Using Unicode for Linguistic Data

Core Concepts:

3. No variants

4. No idiosyncratic characters

Page 16: Using Unicode for Linguistic Data

Core Concepts:

5. Unify, wherever possible

Greek letter beta is unified with IPA beta (voiced bilabial fricative)

Page 17: Using Unicode for Linguistic Data

Core Concepts:

5. Unify, wherever possible

U+0283 LATIN SMALL LETTER ESH (voiceless post-alveolar fricative)

U+222B INTEGRAL (symbol)
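The character names and general categories make the unification decisions visible: the beta used in IPA transcription is the Greek letter itself, while esh and the integral sign, despite their similar shapes, are separate characters with different properties. A short Python sketch using the standard `unicodedata` module:

```python
import unicodedata

for cp in (0x03B2, 0x0283, 0x222B):
    ch = chr(cp)
    # category: Ll = lowercase letter, Sm = math symbol
    print(f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))
# U+03B2 GREEK SMALL LETTER BETA Ll
# U+0283 LATIN SMALL LETTER ESH Ll
# U+222B INTEGRAL Sm
```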

Page 18: Using Unicode for Linguistic Data

Using Unicode for Linguistic Data

Practical Issues: Getting Unicode to Work

Page 19: Using Unicode for Linguistic Data

Practical Issues: Getting Unicode to Work

• A recent operating system (Mac OS 9.2, X, Windows CE, NT, 2000, XP, GNU/Linux with glibc 2.2.2+)
• A recent browser (IE, Safari, OmniWeb, Mozilla/Netscape)
• A Unicode text editor (Word 2000, 2002, Unipad, Apple “TextEdit”)
• An input mechanism (“insert symbol,” keyboard, Keyman)

Page 20: Using Unicode for Linguistic Data

Getting Unicode to Work

• A Unicode-enabled font (Code2000, Lucida Sans Unicode, SIL’s Doulos, Gentium, Arial Unicode MS)

Note: Be wary of “Unicode” fonts; they may only be partially Unicode-compliant.

Page 21: Using Unicode for Linguistic Data

Organization of the Unicode Standard

Page 22: Using Unicode for Linguistic Data

Organization of the Unicode Standard: Unicode Code Charts

Page 23: Using Unicode for Linguistic Data

Unicode Code Charts

Page 24: Using Unicode for Linguistic Data

Code Chart (Phonetic Extensions block)

Page 25: Using Unicode for Linguistic Data

Code Chart (Phonetic Extensions block)

Page 26: Using Unicode for Linguistic Data

Unicode Code Charts

Page 27: Using Unicode for Linguistic Data

Unicode Code Charts

Page 28: Using Unicode for Linguistic Data

Steps to using Unicode: Finding the character you need

1. See if it is in Unicode:
• Check the IPA blocks (etc.) on the Unicode website
• Check Appendix 2 of the IPA Handbook or a Web version of the IPA symbols

Page 29: Using Unicode for Linguistic Data

Steps to using Unicode: Finding the character you need

Note: In looking through Unicode and using “insert Symbol”/font charts, be careful of “spoof buddies” (characters that look alike but have different code points):
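Lookalike characters are easiest to tell apart by name and code point rather than by eye. The pairs below were chosen for this sketch and are not necessarily the ones shown on the original slide.

```python
import unicodedata

# (everyday character, phonetic lookalike): illustrative pairs only
lookalikes = [
    ("g", "\u0261"),       # ASCII g vs. IPA script g
    (":", "\u02D0"),       # colon vs. IPA length mark
    ("\u03B1", "\u0251"),  # Greek alpha vs. Latin alpha
    ("!", "\u01C3"),       # exclamation mark vs. click letter
]

for plain, phonetic in lookalikes:
    print(f"U+{ord(plain):04X} {unicodedata.name(plain)}"
          f"  vs.  U+{ord(phonetic):04X} {unicodedata.name(phonetic)}")
```

A search for one member of a pair will not find the other, so mixing them silently corrupts searches and sorting.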

Page 30: Using Unicode for Linguistic Data

Steps to using Unicode: Finding the character you need

2. See if it is in the process of being proposed:
• Check on Unicode’s Proposed New Characters page
• Ask on the Transcription email list
• Ask on the Unicode email list
• Verify the character you need is a true character, and not a variant

Page 31: Using Unicode for Linguistic Data

Steps to using Unicode: If you find a character that is missing

Work with Peter Constable to get it proposed.

A proposal is composed of:
• the character’s name
• a representative glyph
• information on the character’s properties
• a representative sample of the character in context
• a short bibliography with references

Page 32: Using Unicode for Linguistic Data

Steps to using Unicode: How can I use a character not yet in Unicode?

• Use FontLab or work with a font foundry to create a font in the interim, using the Private Use Area (PUA); fully document PUA characters.
• Use markup / entities
• Use Scalable Vector Graphics.

TEI is preparing guidelines, but nothing has yet been finalized.
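Documenting PUA characters is easier if they can be located automatically. A minimal Python sketch: `find_private_use` is a hypothetical helper that reports every character whose general category is “Co” (private use).

```python
import unicodedata


def find_private_use(text: str) -> list:
    """Return (index, U+xxxx) for each Private Use character in the text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"   # Co = private use
    ]


sample = "a\uE000b"               # U+E000 is the first BMP PUA code point
print(find_private_use(sample))   # [(1, 'U+E000')]
```

The resulting list can serve as a starting point for the documentation that must accompany any data using the PUA.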

Page 33: Using Unicode for Linguistic Data

Steps to using Unicode: For those languages without an orthography

• Use Unicode characters if possible
• Verify character properties are similar
• Stay away from certain characters:
  • Presentation forms
  • Letterlike symbols
  • Number forms
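A draft orthography can be screened for characters from these groups by checking code points against the corresponding block ranges. The sketch below is one possible approach; the ranges are those of the Unicode blocks named in the comments, and `flag_discouraged` is a name invented for this example.

```python
# Block ranges (inclusive) for the groups named above.
DISCOURAGED_BLOCKS = {
    "Letterlike Symbols": (0x2100, 0x214F),
    "Number Forms": (0x2150, 0x218F),
    "Alphabetic Presentation Forms": (0xFB00, 0xFB4F),
    "Arabic Presentation Forms-A": (0xFB50, 0xFDFF),
    "Arabic Presentation Forms-B": (0xFE70, 0xFEFF),
}


def flag_discouraged(text: str) -> list:
    """Return (character, block name) for each character in a listed block."""
    hits = []
    for ch in text:
        for block, (start, end) in DISCOURAGED_BLOCKS.items():
            if start <= ord(ch) <= end:
                hits.append((ch, block))
    return hits


# U+FB01 (fi ligature), U+212A (Kelvin sign), U+2167 (Roman numeral eight)
print(flag_discouraged("a\uFB01\u212A\u2167"))
```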

Page 34: Using Unicode for Linguistic Data

Steps to using Unicode: How do I tell if my font is Unicode-compliant?

• Set your font as the default for your browser, then look at a test page, such as Alan Wood’s IPA Extensions page.
• Use font utilities to check the fonts on your system (see Alan Wood’s website)

Page 35: Using Unicode for Linguistic Data

Steps to using Unicode: What about my data that is in a non-Unicode font?

• If possible, upgrade your documents to Unicode, converting to a Unicode font.
• Use a converter
• If the font you use isn’t included, create a converter and have it hosted on a publicly available website
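At its simplest, a converter is a table that maps each position in the legacy font to the Unicode character that the font displayed there. The Python sketch below assumes a purely hypothetical legacy font in which “S” was drawn as esh and “N” as eng; a real converter needs the full, documented mapping for the actual font and must be applied only to text that was set in that font.

```python
# Hypothetical mapping: legacy-font keystroke -> intended Unicode character.
LEGACY_TO_UNICODE = str.maketrans({
    "S": "\u0283",   # displayed as esh in the (assumed) legacy font
    "N": "\u014B",   # displayed as eng in the (assumed) legacy font
})


def convert_legacy(text: str) -> str:
    """Replace legacy-font code positions with real Unicode characters."""
    return text.translate(LEGACY_TO_UNICODE)


print(convert_legacy("SiN"))   # ʃiŋ
```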

Page 36: Using Unicode for Linguistic Data

Steps to using Unicode: Encoding Forms

Different ways to represent a character’s number (code point) as a series of bytes:
• One to four 8-bit values (UTF-8)
• One or two 16-bit values (UTF-16)
• One 32-bit value (UTF-32)
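The same character yields different byte sequences in each encoding form. A short Python sketch using the built-in codecs (big-endian variants are used so that no byte order mark is added); `show_encodings` is a helper named for this example.

```python
def show_encodings(ch: str) -> dict:
    """Return the bytes of one character, as hex, in three encoding forms."""
    return {enc: ch.encode(enc).hex() for enc in ("utf-8", "utf-16-be", "utf-32-be")}


print(show_encodings("A"))
# {'utf-8': '41', 'utf-16-be': '0041', 'utf-32-be': '00000041'}

print(show_encodings("\u02B0"))       # superscript h, U+02B0
# {'utf-8': 'cab0', 'utf-16-be': '02b0', 'utf-32-be': '000002b0'}

print(show_encodings("\U0001D11E"))   # a character beyond U+FFFF
# {'utf-8': 'f09d849e', 'utf-16-be': 'd834dd1e', 'utf-32-be': '0001d11e'}
```

The last case shows why UTF-16 is not strictly “one 16-bit value per character”: code points above U+FFFF take two 16-bit units.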

Page 37: Using Unicode for Linguistic Data

Steps to using Unicode: Encoding Forms

• Reason for different forms: different implementation needs
• Some tradeoffs for storage/processing
• Suggestion: Use UTF-8 or UTF-16

Page 38: Using Unicode for Linguistic Data

Steps to using Unicode: Further recommendations

• Groups of users (e.g., Athabaskanists) should publicly document Unicode values for the orthography and give font recommendations.
• Provide feedback on missing characters to Peter Constable.

Page 39: Using Unicode for Linguistic Data

Appendices

1: Linguistic Letters and Symbols in Unicode
2: Characters known to be missing
3: Normalization

Page 40: Using Unicode for Linguistic Data

end