Multilingual Issues of Open Source ILS Jason (Qing) Zou Systems Librarian Lakehead University Grace (Guoying) Liu Systems Librarian University of Windsor January 31 2009

accessola2.com/superconference2009/sat/1808/zou_liu.pdf


  • Multilingual Issues of Open Source ILS

    Jason (Qing) Zou

    Systems Librarian

    Lakehead University

    Grace (Guoying) Liu

    Systems Librarian

    University of Windsor

    January 31 2009

  • Outline

    – Writing Systems
    – Romanization
    – Unicode
    – Open Source ILS
    – Simplified Chinese Version of Evergreen
    – Conclusions

  • Boroditsky, L., 2002

  • Writing Systems

    Represent the sounds of a language by written or printed symbols (WordNet)

    Requirements:
    – Defined base elements/symbols (scripts)
    – Rules and conventions
    – A language (generally a spoken language)

  • Types of Writing Systems

    – Logographic (e.g. Chinese characters: CJK)
    – Alphabetic (e.g. Latin and Cyrillic alphabets)
    – Abjad (e.g. Arabic and Hebrew scripts)
    – Abugida (e.g. Devanagari in India; Canadian Aboriginal Syllabics)

  • Scripts

    10 major scripts, to write ~95% of all languages

    – Roman abcdeéèêœ …
    – Greek αβγδε …
    – Cyrillic авгдеж …
    – Hebrew אבגדה …
    – Arabic ا ب ت ث ج ح …
    – Indic (11) अआइईउऊ …
    – Thai กขฃคฅฆ …
    – Japanese
      – Hiragana あいうえお…
      – Katakana アイウエオ…
    – Korean 가각갂갃간…
    – Chinese 甲乙丙丁…

    Arsenault, C., 2003

  • Multilingual Information Systems

    – Contains records in more than one language
    – The system interface is in more than one language
    – The system is able to display text in more than one script
    – The system allows the end user to build queries in more than one script

    Arsenault, C., 2003

  • Multilingual Information Systems

    Two models for multi-script records in MARC21:

    Model A
    – Original scripts in 880 fields
    – Primary descriptive fields using Romanized form

    Model B
    – Transcribe data directly into regularly tagged fields

  • Model A example

    000 02333cam a22004454a 450
    245 00 |6 880-01 |a Zhong guo wen hua jing dian / |c [Zhu Xi deng zhu ; zhu bian Ren Jiyu ; zhi xing zhu bian Pan Yuan].
    250 __ |6 880-02 |a Di 1 ban
    260 __ |6 880-03 |a Hangzhou : |b Xi leng yin she chu ban she, |c 2007.
    300 __ |a 14 v. ; |c 29 cm.
    …
    880 00 |6 245-01/$1 |a 中國文化經典 / |c [朱熹等著 ; 主編任繼愈 ; 執行主編潘淵].
    880 __ |6 250-02/$1 |a 第1版
    880 __ |6 260-03/$1 |a 杭州: |b 西冷印社出版社, |c 2007.
    880 1_ |6 700-04/$1 |a 朱熹, |d 1130-1200.
    880 1_ |6 700-05/$1 |a 任繼愈, |d 1916-
    880 1_ |6 700-06/$1 |a 潘淵.

    From Yale University Catalogue
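The Yale record links each Romanized field to its original-script 880 field through subfield |6: the 245 field carries `880-01`, and the matching 880 field carries `245-01/$1`. The following Python sketch shows one way such pairs could be matched. The `(tag, subfields)` tuples are a simplified, hypothetical stand-in for a parsed MARC record, and `pair_scripts` is an illustrative helper, not part of any MARC library.

```python
import re

# Hypothetical, simplified field representation: (tag, {subfield_code: value}).
# Values are abbreviated from the Yale record shown on the slide.
record = [
    ("245", {"6": "880-01", "a": "Zhong guo wen hua jing dian /"}),
    ("250", {"6": "880-02", "a": "Di 1 ban"}),
    ("880", {"6": "245-01/$1", "a": "中國文化經典 /"}),
    ("880", {"6": "250-02/$1", "a": "第1版"}),
]

# Subfield 6 starts with "<linked tag>-<occurrence number>".
LINK = re.compile(r"^(\d{3})-(\d{2})")

def pair_scripts(fields):
    """Pair each Romanized field with its 880 field via the |6 linkage."""
    vernacular = {}
    for tag, subfields in fields:
        m = LINK.match(subfields.get("6", ""))
        if tag == "880" and m:
            # Key is (linked tag, occurrence number), e.g. ("245", "01").
            vernacular[(m.group(1), m.group(2))] = subfields["a"]
    pairs = []
    for tag, subfields in fields:
        m = LINK.match(subfields.get("6", ""))
        if tag != "880" and m and m.group(1) == "880":
            pairs.append((tag, subfields["a"], vernacular.get((tag, m.group(2)))))
    return pairs

for tag, roman, original in pair_scripts(record):
    print(tag, roman, "<->", original)
```

A system that ignores this linkage can still store and index the Romanized fields, but it cannot display the two forms side by side.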

  • Model B example

    100 1_ |6 01 |a Chen, Xiyong.
    100 1_ |6 01 |a 陳錫勇.
    245 10 |6 02 |a Guodian Chu jian Laozi lun zheng / |c zuo zhe Chen Xiyong.
    245 10 |6 02 |a 郭店楚簡老子論證 / |c 作者陳錫勇.
    250 __ |6 03 |a Chu ban.
    250 __ |6 03 |a 初版.
    260 __ |6 04 |a Taibei shi : |b Li ren shu ju, |c 2005.
    260 __ |6 04 |a 台北市 : |b 里仁書局, |c 2005.
    …
    700 02 |6 06 |a Laozi. |t Dao de jing.
    700 02 |6 06 |a 老子. |t 道德經.

    From McGill University Catalogue

  • Multilingual Information Systems

    Non-Roman data in North American OPACs

    [Diagram: Romanization and Vernacular data compared across Cataloguing and Retrieval, with three yes/no decision points: Stored? Displayed? Indexed?]

    Arsenault, C., 2003

  • Romanization

    Representation of a written word or spoken speech with the Roman alphabet

    Methods
    – Transliteration: written text (e.g. Russian)
    – Transcription: spoken word (e.g. CJK)

  • ☆ Transliteration

    Language and Library

  • Transcription

    新年好 xin nian hao

    温莎大学 wen sha da xue

    刘国英 liu guo ying

    明清小说比较研究 Ming Qing xiao shuo bi jiao yan jiu

    李白和他的诗歌 Li Bai he ta de shi ge

  • Chinese Romanization

    Only transcription is possible

    Two Romanization systems for bibliographic control in North America:
    – Wade-Giles (through October 2000)
    – Pinyin (after October 2000)

    唐宋全诗
    – Wade-Giles: T‘ang2 Sung4 ch‘üan2 shih1
    – Pinyin: Táng Sòng quán shī

  • Pinyin 拼音

    – Literally “spell the sound”
    – Based on Hanyu Pinyin (Chinese Phonetics), which was adopted in 1958 by mainland China
    – Used for many years in libraries in Europe and Australia
    – On Oct. 1, 2000, LC and other libraries in the US adopted Pinyin

  • Issues of Pinyin

    Chinese characters
    – 47,043 in 1716 (康熙字典)
    – ~60,000 in 1990 (漢語大字典)
    – Sharing only around 1,300 syllables in spoken Chinese (Arsenault, 2001)

    High level of homophonic ambiguity
    – liu: 刘 六 流 留 柳 … (over 30 possibilities)

  • Issues of Pinyin

    Word division (syllable integration)

    下午我去图书馆了. (I went to the library this afternoon.)
    – Pinyin: Xia wu wo qu tu shu guan le.
    – or: Xiawu wo qu tushuguan le.

  • Issues of Pinyin

    No consistent rules on syllable integration

    中国话
    – zhong guo hua
    – zhong-guo hua
    – zhongguo hua
    – zhongguohua

    More difficult to form queries
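To see why inconsistent syllable integration makes queries harder to form, the short Python sketch below enumerates every way of joining adjacent syllables with either a space or nothing. `division_variants` is a hypothetical helper written for this illustration; it ignores the hyphenated form, which would add further variants.

```python
from itertools import product

def division_variants(syllables):
    """Return every way to join adjacent syllables with ' ' or ''."""
    if not syllables:
        return []
    variants = []
    # One separator choice per gap between syllables: 2 ** (n - 1) variants.
    for seps in product([" ", ""], repeat=len(syllables) - 1):
        text = syllables[0]
        for sep, syllable in zip(seps, syllables[1:]):
            text += sep + syllable
        variants.append(text)
    return variants

for variant in division_variants(["zhong", "guo", "hua"]):
    print(variant)
```

Three syllables already give four space-based spellings, and the count doubles with each additional syllable, so a searcher has to guess which division the cataloguer used unless the system normalizes it.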

  • Issues of Pinyin

    – A standard based on Mandarin
    – In cataloguing, impossible to maintain consistency
    – Infrequently used characters may be impossible to access by phonetic scripts

  • ☆ASCII

  • Language and Unicode

    What is Unicode?

    Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

    http://www.unicode.org/standard/WhatIsUnicode.html

  • Unicode

    Support by large software companies

    Support by organizations

    Support by countries

  • UTF-8 (Unicode Transformation Format)

    Code point range             Byte sequence
    U-00000000 – U-0000007F      0xxxxxxx
    U-00000080 – U-000007FF      110xxxxx 10xxxxxx
    U-00000800 – U-0000FFFF      1110xxxx 10xxxxxx 10xxxxxx
    U-00010000 – U-001FFFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
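The byte patterns on the UTF-8 slide can be turned into a few lines of bit arithmetic. The Python sketch below is a hand-written illustration (`utf8_encode` is a hypothetical name); it follows the four ranges as listed on the slide and does not reject surrogates or values above U+10FFFF the way a production encoder would.

```python
def utf8_encode(code_point):
    """Encode one code point using the four UTF-8 bit patterns."""
    if code_point < 0x80:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x200000:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the ranges on the slide")

# 家 is U+5BB6, which falls in the three-byte range.
print(utf8_encode(0x5BB6).hex(" ").upper())
```

Most CJK characters fall in the three-byte range, which is why Chinese text takes roughly three bytes per character in UTF-8.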

  • Unicode: examples

    Character      UTF-8        UTF-16     UTF-32
    家 (home)      E5 AE B6     5B B6      00 00 5B B6
    書 (book)      E6 9B B8     66 F8      00 00 66 F8
    书 (book)      E4 B9 A6     4E 66      00 00 4E 66
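The byte values on this slide can be checked with Python's built-in codecs. This is a small verification sketch; `encodings` and `hex_bytes` are helper names chosen here, and the big-endian codec variants are used so that no byte-order mark is prepended.

```python
def hex_bytes(data):
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)

def encodings(character):
    """Return the UTF-8, UTF-16 and UTF-32 bytes of one character."""
    return {
        "UTF-8": hex_bytes(character.encode("utf-8")),
        "UTF-16": hex_bytes(character.encode("utf-16-be")),
        "UTF-32": hex_bytes(character.encode("utf-32-be")),
    }

for character in "家書书":
    print(character, encodings(character))
```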

  • Commercial ILS

    Major commercial ILS products
    – Voyager
    – Aleph
    – Symphony
    – Millennium

  • Open Source ILS

    Major Open Source ILS Products

    Koha

    Evergreen

  • Open Source ILS

    What is open source?

  • Open Source Software

    Category               Open source                                  Proprietary
    Programming language   Perl, PHP, Python                            Microsoft C++, Visual Basic, C#
    Office                 OpenOffice                                   Microsoft Office
    Browser                Firefox                                      Internet Explorer
    Web server             Apache                                       Internet Information Server
    Database               MySQL, PostgreSQL                            Oracle, DB2, SQL Server
    Operating system       Linux variants (Red Hat, Debian, Ubuntu)     Windows

    Library Technology Reports 2008, vol. 44, no. 8

  • Integrated Library Systems

    Regular modules
    – Circulation
    – Acquisition
    – Cataloguing
    – System Administration
    – OPAC

  • Language Issues in ILS

    – Store
    – Display
    – Index/Search
    – Sort

  • Display

    Language Issues in ILS

  • Language Issues in ILS

    Index/Search

    Sort

  • Koha

    Features
    – Debian/Linux, Windows
    – Perl
    – MySQL/Zebra
    – LibLime (company)

  • http://koha.wikispaces.com/

  • Koha

    – Debian/Linux
    – Perl
    – MySQL/Zebra
    – Large community
    – Fully developed ILS
    – Full support for standards
    – Language support

  • Evergreen

    – Debian/Linux, Windows
    – OpenSRF
    – PostgreSQL
    – C, Perl
    – Equinox Software (company)

  • Evergreen in Canada

    – BC public consortium (Sitka)
    – UPEI
    – Project Conifer

  • Chinese Localization

    Introduction

    Goals

  • Chinese Localization

    Features crucial to localization

    Supports Unicode

    Indexing/Searching: PostgreSQL Tsearch2

  • Simplified Chinese Version

    – Interface localization
    – Indexing
    – Searching
    – Sorting
    – Others

  • Simplified Chinese Version

    Interface
    – Utilizes Pootle
    – Uses gettext tools to convert Portable Object (PO) files to Document Type Definition (DTD) files
    – Updates DTD files hourly

  • Indexing

    Tsearch2 (PostgreSQL full text search engine)
    – Configure Tsearch2 to handle Chinese records
    – Use Chinese word segmentation algorithms

  • Indexing

    Tsearch2 (to_tsvector, to_tsquery)

    # SELECT to_tsvector('Social history of China') ;

    to_tsvector

    ------------------------------------

    'china':4 'histori':2 'social':1

    (1 row)

  • Indexing

    Tsearch2 (to_tsvector, to_tsquery)

    # SELECT to_tsvector('中国社会历史');

    to_tsvector

    ------------------------------------

    '中国社会历史':1

    (1 row)

  • Indexing

    Tsearch2 (to_tsvector, to_tsquery)

    # SELECT olis_cn_index('中国社会历史') AS title;

    title

    ------------------------------------

    '中':1 '史':6 '国':2 '会':4 '历':5 '社':3 '中国':7 '国社':8 '会历':10 '历史':11 '社会':9
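The output of `olis_cn_index` suggests a simple scheme: every single character is indexed first, followed by every pair of adjacent characters (bigrams). The Python sketch below reproduces that pattern. It is an assumption drawn from the slide output, not the authors' implementation, and `cjk_index` is a hypothetical name.

```python
def cjk_index(text):
    """Map each token to its positions: single characters, then bigrams."""
    chars = list(text)
    # Overlapping pairs of adjacent characters, e.g. 中国, 国社, 社会 ...
    bigrams = [first + second for first, second in zip(chars, chars[1:])]
    index = {}
    for position, token in enumerate(chars + bigrams, start=1):
        # A token that occurs more than once keeps all of its positions.
        index.setdefault(token, []).append(position)
    return index

for token, positions in cjk_index("中国社会历史").items():
    print(token, positions)
```

Indexing bigrams as well as single characters lets two-character words such as 中国 and 历史 match as units, at the cost of also storing non-words such as 国社.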

  • Searching

    Searching is the opposite process of indexing

    Use Chinese word segmentation algorithms to divide the input and form queries that the system can understand

  • Searching

    The search phrase “中国社会历史” is rewritten as “中 & 国 & 社 & 会 & 历 & 史” and then fed into the system
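The query rewriting on this slide amounts to splitting the phrase into characters and joining them with the `&` (AND) operator that `to_tsquery` accepts. A minimal Python sketch, assuming the input is pure Chinese text with no operators or punctuation that would need escaping; `to_query` is a hypothetical helper name.

```python
def to_query(phrase):
    """Join the characters of a phrase with ' & ' for use with to_tsquery."""
    # Drop whitespace so that "中国 历史" and "中国历史" form the same query.
    characters = [ch for ch in phrase if not ch.isspace()]
    return " & ".join(characters)

print(to_query("中国社会历史"))
```

Because every character must match, this query finds records containing all six characters in any order; bigram tokens could be added to the query to favour records where the characters are adjacent.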

  • Sorting

    Romanize Chinese characters with tone information

    Sort the corresponding Pinyin of Chinese records to obtain A-Z order

  • Sorting

    # SELECT utf8_pinyin('中国社会历史') AS pinyin;

    pinyin

    -----------------------------

    zhong4guo2she4hui4li4shi3
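Sorting by Romanized form can be sketched as a sort key that replaces each character with its Pinyin syllable and tone number. The mapping below is a tiny hand-made table for illustration only, with one reading chosen per character (an assumption, since many characters have several readings); a real system such as the `utf8_pinyin` function on the slide needs a full character-to-Pinyin table.

```python
# Illustrative mapping only: one assumed reading per character.
PINYIN = {
    "中": "zhong1", "国": "guo2", "社": "she4", "会": "hui4",
    "历": "li4", "史": "shi3", "家": "jia1", "书": "shu1",
}

def pinyin_key(text):
    """Build a sort key by Romanizing each character, tone number included."""
    # Characters missing from the table are kept unchanged.
    return "".join(PINYIN.get(ch, ch) for ch in text)

titles = ["中国社会历史", "书", "家", "历史"]
for title in sorted(titles, key=pinyin_key):
    print(pinyin_key(title), title)
```

Keeping the tone number in the key separates syllables that differ only by tone, which gives a stable order among the many homophones noted earlier.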

  • Import Issues

    – CMARC
    – CNMARC
    – 880 fields

  • Conclusions

    – Unicode is not the answer for everything
    – More practical to tweak a system language by language

  • Thank you for your interest and attention!

    Any Questions?

  • Contact Information

    Jason Zou

    [email protected]

    Guoying Liu

    [email protected]