
Joan M. Aliprand
Senior Analyst, RLG

Issues to Consider in a Multilingual, Multiscript World

HKCAN Seminar & Opening
October 4, 2002

October 4, 2002 HKCAN Seminar & Opening 2

Overview

• Technical issues
• Transcription issues
• Retrieval issues


Technical Issues

• Alternative MARC record models
• Plain text of MARC
• Scripts available in system


Technical Issues

MARC Record Models

• Only original script in main fields
    • UNIMARC and those modelled on it
    • Also an option with MARC 21
• Romanized record with supplementary “alternate graphic representation”
    • LC MARC 21
        • Field pairing issue
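
The field pairing issue can be sketched in code. In MARC 21, a regular field and its 880 (alternate graphic representation) field point at each other through subfield $6. The sketch below is illustrative only: the record layout (a list of tag and subfield-dictionary tuples), the function name `pair_fields`, and the sample data are assumptions, not the API of any MARC library.

```python
import re

# $6 in a regular field looks like "880-01"; in the 880 field it looks like
# "245-01/$1" (linking tag, occurrence number, optional script code).
LINK = re.compile(r"^(\d{3})-(\d{2})")

def pair_fields(record):
    """Pair each regular field with its 880 field using subfield $6.

    `record` is a hypothetical, simplified layout: a list of
    (tag, {subfield_code: value}) tuples.
    """
    regular, alternate = {}, {}
    for tag, subfields in record:
        match = LINK.match(subfields.get("6", ""))
        if not match:
            continue
        linked_tag, occurrence = match.groups()
        if occurrence == "00":  # occurrence 00 means "no linked field"
            continue
        if tag == "880":
            # The 880 field names the tag of the regular field it mirrors.
            alternate[(linked_tag, occurrence)] = subfields
        else:
            regular[(tag, occurrence)] = subfields
    # Keep only complete pairs; anything left over is a pairing problem.
    return {key: (regular[key], alternate[key])
            for key in regular if key in alternate}

record = [
    ("245", {"6": "880-01", "a": "Hong lou meng"}),
    ("880", {"6": "245-01/$1", "a": "紅樓夢"}),
    ("260", {"a": "Beijing"}),
]
pairs = pair_fields(record)
```

Here `pairs` holds one entry, keyed by `("245", "01")`, containing the romanized field and its original-script counterpart. Fields without a usable $6 are left unpaired.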


Script Support

• Not just fonts for display
• Involves input, display formatting, retrieval
• Size of fonts for Chinese language support


Hanzi in Character Sets

Character Set    First Level    Total
EACC                            13,468
Big 5            5,401          13,107
Big5 Plus                       21,114
CCCII            42,423         53,940
Unicode *        27,484         70,207

* First Level count = Unified Han & Extension A
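
The Unicode row can be checked by arithmetic on the block ranges. The sketch below assumes the code point ranges published for Unicode 3.2 (current at the time of the talk) and reads the Total as adding Extension B and the 12 unified ideographs that sit in the compatibility block. That breakdown is an interpretation, not something stated on the slide.

```python
# Code point ranges as published for Unicode 3.2 (an assumption here;
# later versions of the standard extended these blocks).
BLOCKS = {
    "Unified Han": (0x4E00, 0x9FA5),
    "Extension A": (0x3400, 0x4DB5),
    "Extension B": (0x20000, 0x2A6D6),
}
# Unified ideographs located in the CJK Compatibility Ideographs block.
COMPATIBILITY_UNIFIED = 12

sizes = {name: last - first + 1 for name, (first, last) in BLOCKS.items()}
first_level = sizes["Unified Han"] + sizes["Extension A"]
total = sum(sizes.values()) + COMPATIBILITY_UNIFIED

print(first_level)  # 27484
print(total)        # 70207
```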


Features of Plain Text

Plain text is a pure sequence of character codes.

The simplicity of plain text gives it a natural role as a major structural element of fancy text.

Plain text must contain enough information to permit the text to be rendered legibly, and nothing more.

Source: The Unicode Standard, Version 3.0, pp. 15-16.
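
As a small illustration of "a pure sequence of character codes", the sketch below lists the code points behind a string. The function name `code_points` is made up for this example.

```python
def code_points(text):
    """Return the character codes of a plain-text string in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text]

# One Latin letter followed by one ideograph, built from its code point.
sample = "A" + chr(0x9C72)
print(code_points(sample))  # ['U+0041', 'U+9C72']
```

Nothing about font, size, or layout is present in the sequence; that information belongs to fancy text.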


No Limits on Transcription?

• Limits on Unicode character repertoire
• Implementing every character in Unicode simultaneously pushes technical limits
• Even now, we don’t always transcribe everything that appears on the cataloging source.


Transcription Issues

• What needs to be transcribed?
• How faithful do we have to be?
• What does not need to be transcribed?
• What cannot be transcribed?
• What do we do?
• Making distinctions


What needs to be transcribed?

• “Language and script” in bibliographic records
    • AACR2 Rule 1.0E
• Specific elements of the description


Descriptive Elements for Books

• Title and Statement of Responsibility Area
• Edition Area
• Publication, Distribution, etc. Area
• Physical Description Area
• Series Area
• Notes Area

(Required)

Source: The IFLA Functional Requirements for Bibliographic Records, by Olivia M. A. Madison – Figure 2. (LRTS, 44:3, July 2000)


“Language and Script” in Authority Records

• Non-English headings
    • Uniform, controlled heading established under applicable source of authority
    • Variant and related headings (tracings for cross-references, etc.)
• Recording the source
    • Authority heading in any script/language
    • Example: Romanized heading from non-Roman source
• Links to other authority files


How faithful do we have to be?

• Deliberate omission
    • Instructions in rules
    • Conventionally ignored
• Inability to transcribe
    • Variety of approximation techniques
• Same but different
    • Added distinction required


AACR (1968) Rule 133

A. General rule. … Chinese characters considered archaic, decorative, etc. are represented by the corresponding forms found in the K‘ang hsi tzu tien or Ueda’s Daijiten, if possible, but simplified characters are transcribed as such. ...


Deliberate Omission

• Instruction in rules
    • Omit if not needed for legibility
        • Examples: ©, ®, * on date of birth, † on date of death
        • Ignore a symbol if not “integral or essential” part of title
    • Substitution of characters (even if available)
        • MARC 21 Greek symbols character set has α, β, γ but LCRI dictates [alpha], [beta], [gamma]
        • Superscript and subscript digits transcribed as 0-9 if not essential for legibility
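
The omission and substitution conventions above can be sketched as a small text filter. This is an illustration under stated assumptions, not the LCRI text: the function name `transcribe` and the tables are invented for the example, only the symbols named on the slide are covered, and the cataloger's judgment about what is "essential for legibility" is not modelled.

```python
# Bracketed names substituted for the three Greek letters named on the slide.
GREEK_NAMES = {"α": "[alpha]", "β": "[beta]", "γ": "[gamma]"}
# Superscript and subscript digits mapped to plain 0-9.
DIGITS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉", "01234567890123456789")
# Symbols omitted outright.
OMITTED = "©®"

def transcribe(text):
    """Apply omission, digit substitution, then Greek-letter substitution."""
    text = "".join(ch for ch in text if ch not in OMITTED)
    text = text.translate(DIGITS)
    return "".join(GREEK_NAMES.get(ch, ch) for ch in text)

print(transcribe("β-carotene in H₂O ©2002"))  # [beta]-carotene in H2O 2002
```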


Example courtesy of Stanford University Libraries


Example courtesy of Heidi Lerner, Stanford University Libraries


Deliberate Omission

• Instruction in rules
    • Example: Trimming of lengthy title
    • Example: Hebrew vocalization in title
• Conventionally ignored
    • Example: Title annotated with furigana
    • Example: Printing flourishes
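
Omitting Hebrew vocalization can be sketched with the Unicode character database: the points and accents are nonspacing marks in the Hebrew block, so filtering them leaves the consonantal text. The function name `strip_hebrew_points` and the sample word are assumptions chosen for illustration.

```python
import unicodedata

def strip_hebrew_points(title):
    """Drop Hebrew vowel points and accents, keeping the consonantal text."""
    # NFD splits precomposed presentation forms into base letter plus marks.
    decomposed = unicodedata.normalize("NFD", title)
    return "".join(
        ch for ch in decomposed
        if not ("\u0591" <= ch <= "\u05c7"
                and unicodedata.category(ch) == "Mn")
    )

pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"  # "shalom" with points
plain = strip_hebrew_points(pointed)                    # four letters remain
```

Punctuation in the same range, such as the maqaf, is not a nonspacing mark and is kept.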


Example courtesy of Edward A. Jajko, Hoover Institution, Stanford University


Inability to Transcribe

• Language text: Romanization
    • Instruction in rules
• Symbols: Cataloger’s description
• Ideographs
    • Variety of options


Thanks to: Beth Hoffman, Innovative Interfaces

U+9C72


Inability to Transcribe Ideographs

• Placeholders
    • The geta

Two ideographs unavailable in EACC


Inability to Transcribe Ideographs

• Placeholders
    • The geta
    • [romanization]
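
The two placeholder options can be sketched as follows. Unicode encodes the geta as U+3013 GETA MARK. The `repertoire` set (the characters a system can store) and the `romanizations` lookup are hypothetical inputs supplied by the caller. No real EACC table is used here.

```python
GETA = "\u3013"  # U+3013 GETA MARK

def transcribe_with_placeholders(text, repertoire, romanizations=None):
    """Replace characters outside `repertoire` with a placeholder.

    `repertoire` is a hypothetical set of characters the system can store;
    `romanizations` optionally maps a missing character to its romanization.
    """
    romanizations = romanizations or {}
    out = []
    for ch in text:
        if ch in repertoire or ch.isspace():
            out.append(ch)
        elif ch in romanizations:
            out.append(f"[{romanizations[ch]}]")  # bracketed romanization
        else:
            out.append(GETA)                      # geta as last resort
    return "".join(out)

repertoire = set("紅樓")
with_geta = transcribe_with_placeholders("紅樓夢", repertoire)
with_roman = transcribe_with_placeholders("紅樓夢", repertoire, {"夢": "meng"})
```

The first call yields the two available characters followed by a geta; the second yields them followed by `[meng]`. Either way the record no longer carries the original character, which matters later for retrieval.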


Source of illustration: Cataloging Guidelines for Creating Chinese Rare Book Records in Machine-Readable Form. (RLG, 2001).

Inability to Transcribe Ideographs

• Placeholders
    • The geta, [romanization]
• Best approximation?
    • Chinese Rare Books union catalogue practice


• cf. U+303E IDEOGRAPHIC VARIATION INDICATOR in Unicode™


• Other alternatives
    • Ideographic description sequence
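
For the best-approximation option, Unicode's U+303E IDEOGRAPHIC VARIATION INDICATOR signals that the character following it is similar to, but not the same as, the intended one. The sketch below shows that convention only; it is not a description of the Chinese Rare Books practice itself, and the helper names are invented.

```python
VARIATION_INDICATOR = "\u303e"  # U+303E IDEOGRAPHIC VARIATION INDICATOR

def mark_approximation(closest_available):
    """Flag `closest_available` as a stand-in for an unencodable variant."""
    return VARIATION_INDICATOR + closest_available

def is_approximation(text, index):
    """True if the character at `index` is preceded by the indicator."""
    return index > 0 and text[index - 1] == VARIATION_INDICATOR

title = "紅樓" + mark_approximation("夢")
```

In `title`, the final character is marked as an approximation while the first two are transcribed as found.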


Alternative Ideographic Description Sequences

Source: John Jenkins, Apple Computer
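
An ideographic description sequence is a prefix expression: an ideographic description character (U+2FF0 to U+2FFB in the Unicode of this period) followed by its components, each of which may itself be a sequence. A minimal well-formedness check, with invented function names, might look like this. It assumes U+2FF2 and U+2FF3 take three components and the rest take two, and it does not enforce the length limits in the standard.

```python
# Operators taking three components: U+2FF2 and U+2FF3.
TERNARY = {"\u2ff2", "\u2ff3"}
# The remaining description characters in U+2FF0..U+2FFB take two.
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY

def parse(ids, pos=0):
    """Return the index just past one well-formed description at `pos`."""
    if pos >= len(ids):
        raise ValueError("description ends too early")
    ch = ids[pos]
    pos += 1
    if ch in BINARY or ch in TERNARY:
        for _ in range(3 if ch in TERNARY else 2):
            pos = parse(ids, pos)  # each component may itself be nested
    return pos

def is_well_formed(ids):
    """True if `ids` is exactly one complete description sequence."""
    try:
        return parse(ids) == len(ids)
    except ValueError:
        return False

print(is_well_formed("⿰氵每"))      # True: left-right, two components
print(is_well_formed("⿱艹⿰氵每"))  # True: nested description
print(is_well_formed("⿰氵"))        # False: second component missing
```

Because the same ideograph can be described in more than one way, two well-formed sequences do not compare equal as strings even when they describe the same character.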


Categories of Variants

• Official Simplifications
• Regional and minority characters and variants
• Time-specific variants
• National typographic variants
• Typeface-specific variants
• Misprints
• Unofficial simplifications
• Calligraphic and handwriting variants

Source: Martin J. Dürst. Exploring the Potentials of Web Technologies for the Handling of Rare Ideographs and Ideograph Variants. http://www.w3.org/People/Dürst/Talks/19980409/overview.htm


Variant Forms: Unicode & EACC

Separately encoded in both Unicode and EACC

215122

455122

Not in EACC

275122

Not in EACC


Where a character is a distinct form in simplified Chinese from the one used in traditional Chinese, Unicode has generally encoded two distinct characters. This is partly Unicode’s inheritance from the character sets on which it’s based, but it’s mostly a matter of recognizing that there is not a one-to-one relationship between the characters used in the two varieties of written Chinese. …

The itaiji [variant form] problem has to do with personal names. In the West we have Mark and Marc, John and Jon, Jeff and Geoff. (We won’t even attempt the number of ways William Shakespeare spelled his name, none of which were “Shakespeare.”) East Asia has the same problem: the same character can be used in personal names with many different glyphs, and people can become quite insistent that their name be written with the right glyph. ...

And there are cases where there’s just more than one way to write something, and one authority will say the entities involved are two subtly different characters, and another will say they’re just glyphic variants of each other. …

Source: John Jenkins. New Features in Unicode 3.2. (Multilingual Computing, Issue #49)


Inability to Transcribe Ideographs

• Placeholders
    • The geta, [romanization]
• Best approximation?
    • Chinese Rare Books union catalogue practice
        • cf. U+303E IDEOGRAPHIC VARIATION INDICATOR in Unicode
• Other alternatives
    • Ideographic description sequence
    • Variation selection??
        • Encroaches on markup (e.g. XML)


Making Distinctions

• Same graphically, different semantically
    • The motive for authority control
• Qualification
    • Different needs in different environments
        • Different scripts (ideographs vs. pinyin)


Retrieval Issues

• Relationship between transcription and retrieval
• Transcription and Z39.50
• Differences between EACC and Unicode


Retrieval Issues: Impact of Transcription

• Think of implications for retrieval when making transcription decisions
    • Should the geta be indexed?
• Authority control is the key to resolving alternative “spellings” in manifestations


Retrieval Issues: Transcription and Z39.50

• Need to have “common ground”
    • Script support
    • Character repertoire for scripts
    • Alternative record models
        • Romanized retrieval an option or not


Retrieval Issues: EACC, Unicode

• Differences
    • EACC character codes associate related characters
        • Traditional and simplified
        • Variants preferred for language or location
    • No relationships via Unicode encoding
        • Requires separate methods to identify character relationships
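
How EACC codes associate related characters can be sketched by grouping codes that differ only in their first byte, as the codes 215122, 455122, and 275122 on the variant-forms slide do. The sketch assumes related forms sit in layers six planes apart starting at plane 0x21. That layering rule is an assumption made for illustration and should be checked against the EACC specification. The code 215123 is an invented contrast case.

```python
from collections import defaultdict

def relation_key(eacc_hex):
    """Key shared by EACC codes that the layer structure treats as related.

    Assumption: related forms differ only in the first byte, in steps of
    six planes starting at 0x21.
    """
    first, rest = int(eacc_hex[:2], 16), eacc_hex[2:]
    return ((first - 0x21) % 6, rest.upper())

def group_related(codes):
    """Group six-hex-digit EACC codes by their relation key."""
    groups = defaultdict(list)
    for code in codes:
        groups[relation_key(code)].append(code)
    return dict(groups)

groups = group_related(["215122", "455122", "275122", "215123"])
```

The three codes from the slide fall into one group and 215123 into another. Unicode code points carry no such structure, so after migration the same grouping has to come from a separate mapping table.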


Retrieval Issues: EACC, Unicode

• Differences
    • Traditional versus simplified
        • How to retrieve either?
        • The many-to-one issue
    • Variant forms
        • How to retrieve all?
        • How to retrieve a particular form?
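
The many-to-one issue can be made concrete with a query-expansion sketch. The mapping below is a tiny hand-made sample, not an authoritative table, and the function name `expand` is invented. It relies on the fact that simplified 发 corresponds to both traditional 發 and 髮.

```python
# Hypothetical, hand-made mapping for illustration only; a real system
# needs a complete table from an authoritative source.
SIMPLIFIED_TO_TRADITIONAL = {
    "发": ["發", "髮"],  # one simplified form, two traditional forms
    "书": ["書"],
}

def expand(char):
    """Forms a search on `char` should also match ("retrieve either")."""
    forms = {char}
    forms.update(SIMPLIFIED_TO_TRADITIONAL.get(char, []))
    for simplified, traditional in SIMPLIFIED_TO_TRADITIONAL.items():
        if char in traditional:
            forms.add(simplified)
    return forms

print(expand("发") == {"发", "發", "髮"})  # True
print(expand("髮") == {"髮", "发"})        # True
```

A search on 髮 expanded this way also returns records containing 发, some of which stand for 發. Retrieving one particular form means skipping the expansion and matching the character exactly.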


In Conclusion

Transcription is not as straightforward as it appears.

Consider how transcription decisions might affect retrieval.

Re-assess indexing needs when migrating from existing character sets to Unicode.


Thank You

Unicode is a trademark of Unicode, Inc., and may be registered in some jurisdictions.
© 2002 RLG.