Joan M. Aliprand, Senior Analyst, RLG
Issues to Consider in a Multilingual, Multiscript World
HKCAN Seminar & Opening, October 4, 2002
Overview
• Technical issues
• Transcription issues
• Retrieval issues
Technical Issues
• Alternative MARC record models
• Plain text of MARC
• Scripts available in system
Technical Issues
MARC Record Models
• Only original script in main fields
  • UNIMARC and those modelled on it
  • Also an option with MARC 21
• Romanized record with supplementary “alternate graphic representation”
  • LC MARC 21
    • Field pairing issue
Script Support
• Not just fonts for display
  • Involves input, display formatting, retrieval
• Size of fonts for Chinese language support
Hanzi in Character Sets
Character Set    First Level    Total
EACC             -              13,468
Big 5            5,401          13,107
Big5 Plus        -              21,114
CCCII            42,423         53,940
Unicode *        27,484         70,207
* First Level count = Unified Han & Extension A
Features of Plain Text
Plain text is a pure sequence of character codes.
The simplicity of plain text gives it a natural role as a major structural element of fancy text.
Plain text must contain enough information to permit the text to be rendered legibly, and nothing more.
Source: The Unicode Standard, Version 3.0, pp. 15-16.
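As a rough illustration of "a pure sequence of character codes", the following Python sketch (not part of the original slides) lists the code points that make up a short mixed-script string. The helper name `code_points` and the sample string are assumptions chosen for the example.

```python
import unicodedata


def code_points(text):
    """Return a (code point label, character name) pair for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Sample: Latin capital A followed by the hanzi at U+6F22.
sample = "A\u6f22"
for label, name in code_points(sample):
    print(label, name)
```

Nothing in the sequence says how the characters should look; font, size, and layout belong to the "fancy text" layer built on top of it.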
No Limits on Transcription?
• Limits on Unicode character repertoire
• Implementing every character in Unicode simultaneously pushes technical limits
• Even now, we don’t always transcribe everything that appears on the cataloging source.
Transcription Issues
• What needs to be transcribed?
• How faithful do we have to be?
• What does not need to be transcribed?
• What cannot be transcribed?
• What do we do?
• Making distinctions
What needs to be transcribed?
• “Language and script” in bibliographic records
  • AACR2 Rule 1.0E
• Specific elements of the description
Descriptive Elements for Books
• Title and Statement of Responsibility Area
• Edition Area
• Publication, Distribution, etc. Area
• Physical Description Area
• Series Area
• Notes Area (Required)
Source: The IFLA Functional Requirements for Bibliographic Records, by Olivia M. A. Madison – Figure 2. (LRTS, 44:3, July 2000)
“Language and Script” in Authority Records
• Non-English headings
  • Uniform, controlled heading established under applicable source of authority
  • Variant and related headings (tracings for cross-references, etc.)
• Recording the source
  • Authority heading in any script/language
  • Example: Romanized heading from non-Roman source
• Links to other authority files
How faithful do we have to be?
• Deliberate omission
  • Instructions in rules
  • Conventionally ignored
• Inability to transcribe
  • Variety of approximation techniques
• Same but different
  • Added distinction required
AACR (1968) Rule 133
A. General rule. … Chinese characters considered archaic, decorative, etc. are represented by the corresponding forms found in the K‘ang hsi tzu tien or Ueda’s Daijiten, if possible, but simplified characters are transcribed as such. ...
Deliberate Omission
• Instruction in rules
  • Omit if not needed for legibility
    • Examples: ©, ®, * on date of birth, † on date of death
    • Ignore a symbol if not “integral or essential” part of title
  • Substitution of characters (even if available)
    • MARC 21 Greek symbols character set has α, β, γ but LCRI dictates [alpha], [beta], [gamma]
    • Superscript and subscript digits transcribed as 0-9 if not essential for legibility
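The omission and substitution rules above can be sketched as a single character-mapping pass. This is a minimal Python illustration, not a statement of how any cataloging system implements the rules: the function name `transcribe` is hypothetical, the table covers only the cases named on the slide, and the "essential for legibility" judgment is left to the cataloger rather than modeled.

```python
# Bracketed names follow the slide's [alpha], [beta], [gamma] examples.
GREEK_NAMES = {"\u03b1": "[alpha]", "\u03b2": "[beta]", "\u03b3": "[gamma]"}

# Superscript and subscript digits 0-9, in digit order.
SUPERSCRIPTS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUBSCRIPTS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"

# Symbols omitted outright: copyright sign and registered sign.
OMITTED = "\u00a9\u00ae"


def build_table():
    """Build a str.translate table for the omissions and substitutions."""
    table = {}
    for digit, (sup, sub) in enumerate(zip(SUPERSCRIPTS, SUBSCRIPTS)):
        table[ord(sup)] = str(digit)
        table[ord(sub)] = str(digit)
    for ch in OMITTED:
        table[ord(ch)] = None  # None deletes the character
    for ch, name in GREEK_NAMES.items():
        table[ord(ch)] = name
    return table


TABLE = build_table()


def transcribe(text):
    """Apply the omission and substitution table to a transcribed string."""
    return text.translate(TABLE)


print(transcribe("\u03b2-carotene"))  # Greek beta replaced by its bracketed name
print(transcribe("H\u2082O"))         # subscript two replaced by the digit 2
```

A table-driven pass keeps the policy (what to omit or substitute) separate from the mechanism, so a different rule interpretation only changes the table.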
Example courtesy of Heidi Lerner, Stanford University Libraries
Deliberate Omission
• Instruction in rules
  • Example: Trimming of lengthy title
  • Example: Hebrew vocalization in title
• Conventionally ignored
  • Example: Title annotated with furigana
  • Example: Printing flourishes
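For the Hebrew vocalization example, one possible way to drop the points from a title is to remove nonspacing marks in the Hebrew block. The Python sketch below is an assumption about how this could be done in software, not a description of any cataloging rule or system; the function names are hypothetical.

```python
import unicodedata


def is_hebrew_point(ch):
    """True for nonspacing marks in the Hebrew block (points and accents)."""
    return "\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn"


def strip_vocalization(title):
    """Remove Hebrew points and accents, leaving the consonantal text."""
    # NFD splits precomposed presentation forms into base letter + points.
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(ch for ch in decomposed if not is_hebrew_point(ch))
    # NFC restores the composed form of any non-Hebrew text in the title.
    return unicodedata.normalize("NFC", stripped)


# shin + qamats + shin dot, lamed, vav + holam, final mem
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
print(strip_vocalization(pointed))
```

Restricting the test to the Hebrew block keeps diacritics in other scripts (for example, a Latin-script parallel title) untouched.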
Example courtesy of Edward A. Jajko, Hoover Institution, Stanford University
Inability to Transcribe
• Language text: Romanization
  • Instruction in rules
• Symbols: Cataloger’s description
• Ideographs
  • Variety of options
Inability to Transcribe Ideographs
• Placeholders
  • The geta
Two ideographs unavailable in EACC
Inability to Transcribe Ideographs
• Placeholders
  • The geta, [romanization]
Source of illustration: Cataloging Guidelines for Creating Chinese Rare Book Records in Machine-Readable Form. (RLG, 2001).
Inability to Transcribe Ideographs
• Placeholders
  • The geta, [romanization]
• Best approximation?
  • Chinese Rare Books union catalogue practice
Source of illustration: Cataloging Guidelines for Creating Chinese Rare Book Records in Machine-Readable Form. (RLG, 2001).
Inability to Transcribe Ideographs
• Placeholders
  • The geta, [romanization]
• Best approximation?
  • Chinese Rare Books union catalogue practice
    • cf. U+303E IDEOGRAPHIC VARIATION INDICATOR in Unicode™
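The two techniques named above can be shown side by side in a short Python sketch. The geta is encoded in Unicode as U+3013 GETA MARK, and U+303E is placed before a character to signal that the intended character is similar to, but not the same as, the one that follows. The function names and the exact "geta plus bracketed romanization" layout are assumptions made for illustration, not a record of any catalogue's actual practice.

```python
import unicodedata

GETA_MARK = "\u3013"
VARIATION_INDICATOR = "\u303e"


def with_geta(romanization):
    """Placeholder for an unavailable ideograph, followed by its romanization."""
    return f"{GETA_MARK}[{romanization}]"


def approximate(nearest_char):
    """Mark an available character as only an approximation of the intended one."""
    return f"{VARIATION_INDICATOR}{nearest_char}"


print(unicodedata.name(GETA_MARK))
print(unicodedata.name(VARIATION_INDICATOR))
print(with_geta("jian"))       # hypothetical romanization
print(approximate("\u5973"))   # hypothetical nearest available character
```

The geta records only that something is missing, while the variation indicator preserves the nearest available form, which matters later when deciding what to index.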
Inability to Transcribe Ideographs
• Placeholders
  • The geta, [romanization]
• Best approximation?
  • Chinese Rare Books union catalogue practice
    • cf. U+303E IDEOGRAPHIC VARIATION INDICATOR in Unicode
• Other alternatives
  • Ideographic description sequence
Alternative Ideographic Description Sequences
Source: John Jenkins, Apple Computer
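The slide's illustration is not reproduced here. As a stand-in, the following Python sketch shows how an ideographic description sequence is formed: an ideographic description character (U+2FF0 through U+2FFB in Unicode 3.0) is followed by the components it arranges. The helper name `describe` is an assumption, and the example uses an encoded character only so the result can be checked by eye.

```python
import unicodedata

IDC_LEFT_TO_RIGHT = "\u2ff0"
IDC_ABOVE_TO_BELOW = "\u2ff1"


def describe(operator, first, second):
    """Build a two-component ideographic description sequence."""
    return operator + first + second


# U+5973 and U+5B50 side by side, the structure of U+597D.
ids = describe(IDC_LEFT_TO_RIGHT, "\u5973", "\u5b50")

# Sequences nest: a component may itself be a description sequence.
# This nested sequence is purely illustrative and names no real character.
nested = describe(IDC_ABOVE_TO_BELOW, "\u8279", ids)

print(unicodedata.name(IDC_LEFT_TO_RIGHT))
print(ids, nested)
```

Because the same character can often be decomposed in more than one way, two catalogers may produce different sequences for it, which is one reason a description sequence is hard to use as a match key.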
Categories of Variants
• Official Simplifications
• Regional and minority characters and variants
• Time-specific variants
• National typographic variants
• Typeface-specific variants
• Misprints
• Unofficial simplifications
• Calligraphic and handwriting variants
Source: Martin J. Dürst. Exploring the Potentials of Web Technologies for the Handling of Rare Ideographs and Ideograph Variants. http://www.w3.org/People/Dürst/Talks/19980409/overview.htm
Variant Forms: Unicode & EACC
Separately encoded in both Unicode and EACC
[Figure: variant forms labeled, in order: 215122; 455122; Not in EACC; 275122; Not in EACC]
Where a character is a distinct form in simplified Chinese from the one used in traditional Chinese, Unicode has generally encoded two distinct characters. This is partly Unicode’s inheritance from the character sets on which it’s based, but it’s mostly a matter of recognizing that there is not a one-to-one relationship between the characters used in the two varieties of written Chinese. …
The itaiji [variant form] problem has to do with personal names. In the West we have Mark and Marc, John and Jon, Jeff and Geoff. (We won’t even attempt the number of ways William Shakespeare spelled his name, none of which were “Shakespeare.”) East Asia has the same problem: the same character can be used in personal names with many different glyphs, and people can become quite insistent that their name be written with the right glyph. ...
And there are cases where there’s just more than one way to write something, and one authority will say the entities involved are two subtly different characters, and another will say they’re just glyphic variants of each other. …
Source: John Jenkins. New Features in Unicode 3.2. (Multilingual Computing, Issue #49)
Inability to Transcribe Ideographs
• Placeholders
  • The geta, [romanization]
• Best approximation?
  • Chinese Rare Books union catalogue practice
    • cf. U+303E IDEOGRAPHIC VARIATION INDICATOR in Unicode
• Other alternatives
  • Ideographic description sequence
  • Variation selection??
    • Encroaches on markup (e.g. XML)
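To make the variation selection idea concrete, the Python sketch below appends a variation selector (U+FE00 through U+FE0F, introduced in Unicode 3.2) to a base character and shows one way a retrieval layer might ignore it. The base character and the pairing with VARIATION SELECTOR-1 are arbitrary assumptions; whether any particular sequence is defined by the standard is outside the scope of this sketch.

```python
import unicodedata

# The sixteen variation selectors in the Basic Multilingual Plane.
VARIATION_SELECTORS = {chr(cp) for cp in range(0xFE00, 0xFE10)}


def strip_variation_selectors(text):
    """Drop variation selectors so variant-marked text matches the base form."""
    return "".join(ch for ch in text if ch not in VARIATION_SELECTORS)


base = "\u8fba"             # an arbitrary hanzi used as the base character
marked = base + "\ufe00"    # the same character plus VARIATION SELECTOR-1

print(unicodedata.name("\ufe00"))
print(marked == base)                              # False: the code sequences differ
print(strip_variation_selectors(marked) == base)   # True: equal once selectors are dropped
```

The selector travels inside the plain text yet carries what is arguably presentation information, which is the sense in which it encroaches on markup.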
Making Distinctions
• Same graphically, different semantically
  • The motive for authority control
• Qualification
  • Different needs in different environments
    • Different scripts (ideographs vs. pinyin)
Retrieval Issues
• Relationship between transcription and retrieval
• Transcription and Z39.50
• Differences between EACC and Unicode
Retrieval Issues: Impact of Transcription
• Think of implications for retrieval when making transcription decisions
  • Should the geta be indexed?
• Authority control is the key to resolving alternative “spellings” in manifestations
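The geta question can be framed as an indexing option. The Python sketch below is a deliberately simple illustration: the function `index_keys` and its `index_geta` flag are hypothetical, and splitting on spaces stands in for whatever segmentation a real system would use.

```python
GETA_MARK = "\u3013"


def index_keys(heading, index_geta=False):
    """Split a space-delimited heading into index keys.

    With index_geta=False, geta placeholders are dropped before indexing,
    so a heading consisting only of placeholders yields no keys.
    """
    keys = []
    for word in heading.split():
        if not index_geta:
            word = word.replace(GETA_MARK, "")
        if word:
            keys.append(word)
    return keys


heading = "\u3013\u5c71 \u6c34"
print(index_keys(heading))                   # geta dropped from the first key
print(index_keys(heading, index_geta=True))  # geta kept as part of the key
```

Dropping the geta lets a search on the available characters succeed, at the cost of also matching headings where the missing character differs; keeping it does the reverse.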
Retrieval Issues: Transcription and Z39.50
• Need to have “common ground”
  • Script support
  • Character repertoire for scripts
  • Alternative record models
    • Romanized retrieval an option or not
Retrieval Issues: EACC, Unicode
• Differences
  • EACC character codes associate related characters
    • Traditional and simplified
    • Variants preferred for language or location
  • No relationships via Unicode encoding
    • Requires separate methods to identify character relationships
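The EACC codes shown earlier in this presentation (215122, 455122, 275122) suggest how the encoding itself can associate related forms. The Python sketch below groups codes on that basis. It rests on an assumption about the EACC layout that is not stated in the slides: that the first byte selects a plane, planes come in layers of six starting at 0x21, and related forms sit at the same position in corresponding planes of different layers. The function names are hypothetical.

```python
from collections import defaultdict

FIRST_PLANE = 0x21
PLANES_PER_LAYER = 6  # assumed layer size, see the lead-in


def eacc_family(code):
    """Key shared by codes assumed to be related forms of one character."""
    plane = code >> 16          # first byte of the three-byte code
    position = code & 0xFFFF    # second and third bytes
    return ((plane - FIRST_PLANE) % PLANES_PER_LAYER, position)


def group_related(codes):
    """Group EACC codes by family key."""
    groups = defaultdict(list)
    for code in codes:
        groups[eacc_family(code)].append(code)
    return dict(groups)


codes = [0x215122, 0x455122, 0x275122]
for family, members in group_related(codes).items():
    print(family, [f"{code:06X}" for code in members])
```

A Unicode code point offers no comparable arithmetic, so after migration the same grouping has to come from an external table of character relationships.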
Retrieval Issues: EACC, Unicode
• Differences
  • Traditional versus simplified
    • How to retrieve either?
    • The many-to-one issue
  • Variant forms
    • How to retrieve all?
    • How to retrieve a particular form?
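One way to approach "how to retrieve either" under Unicode is query expansion from an explicit mapping table. The Python sketch below uses a two-entry table written by hand for illustration; a production system would more plausibly load such relationships from a maintained source such as the Unihan database's variant fields. The table contents, the function name, and the expansion strategy are all assumptions.

```python
from itertools import product

# Hand-made illustrative table: simplified form -> traditional forms.
# U+53D1 corresponds to both U+767C and U+9AEE (the many-to-one issue).
SIMPLIFIED_TO_TRADITIONAL = {
    "\u53d1": ["\u767c", "\u9aee"],
    "\u540e": ["\u5f8c"],
}


def expand_query(query):
    """Return every string obtained by swapping in traditional equivalents."""
    alternatives = [
        [ch] + SIMPLIFIED_TO_TRADITIONAL.get(ch, [])
        for ch in query
    ]
    return ["".join(combo) for combo in product(*alternatives)]


for variant in expand_query("\u53d1"):
    print(variant)
```

Expansion grows multiplicatively with query length (two such characters already give nine candidates), which is one reason to consider folding variants to a single key at index time instead. Retrieving one particular form then needs a second, unfolded index.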
In Conclusion
Transcription is not as straightforward as it appears.
Consider how transcription decisions might affect retrieval.
Re-assess indexing needs when migrating from existing character sets to Unicode.