COMP323 Foundations of Chinese Computing

Page 1: COMP323 Foundations  of Chinese Computing

COMP323 Foundations of Chinese Computing

Page 2: COMP323 Foundations  of Chinese Computing

Course Introduction
Lecturer
  Qin LU, [email protected], Room PQ814, Tel. 2766 7247
Teaching Assistant (responsible for some labs and project assignments)
  Chen Yirong, [email protected], Room QT416, Tel. 2766 7326

Page 3: COMP323 Foundations  of Chinese Computing

Course Introduction
COMP323 Reference Books
  CJKV Information Processing: Chinese, Japanese, Korean and Vietnamese Computing (PL1074.5 .L86)
  An Introduction to Chinese, Japanese and Korean Computing (QA76.H7795)
  計算機中文信息處理 (PL1074.5.C42)
  and others
Tutorials and labs: PQ604A
  Tuesday Group: 9:30 – 10:30 Tuesdays
  Thursday Group: 9:30 – 10:30 Thursdays
  Try to finish the labs and the online assignment/QA during lab hours

Page 4: COMP323 Foundations  of Chinese Computing

Course Introduction
COMP323 Website
  WebCT
  Lecture notes available Wed. by 5pm
  Print as NotePage
Method of Assessment
  Course Work 55%
    2 Programming Assignments 20%
    2 online quizzes 20%
    1 online homework 5%
    4 online QA (labs) 8%
    Class attendance (punctuality) 2%
  Final Examination 45%

Page 5: COMP323 Foundations  of Chinese Computing

Course Introduction
Introduction to Chinese Computing

Chinese computing is the computer processing of data related to Chinese. It covers any human-computer interaction in which communication takes place in the Chinese language.

About one-fifth of the people in the world speak some form of Chinese as their native language, making it the language with the most native speakers.

Page 6: COMP323 Foundations  of Chinese Computing

Course Introduction
Fundamental Problems with Chinese Computing
  At the Chinese Character Level
    Large and not Closed Character Set
    Computer Representation, Input and Output
  At the Chinese Language Level
    Lack of Morphological Variation
    Lack of Grammar
      Very Arbitrary and Flexible
      Superimposed Grammar
    Texts are Running Together (no delimiters between words)
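Because texts run together, a program has to find word boundaries on its own. The following is a minimal sketch of one dictionary-based approach, forward maximum matching, which is chosen here only for illustration (the slides do not name a method). The function name `segment`, the `max_len` parameter and the tiny lexicon are all assumptions.

```python
def segment(text, lexicon, max_len=4):
    """Split text into words by greedy longest match against a lexicon.

    At each position the longest lexicon entry (up to max_len characters)
    is taken; an unknown character becomes a one-character word.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real system would load a large word list.
lexicon = {"計算機", "中文", "信息", "處理"}
print(segment("計算機中文信息處理", lexicon))
```

Greedy matching is easy to implement but can pick the wrong boundary when a long dictionary word overlaps the correct shorter ones, which is one reason segmentation is treated as a topic of its own.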

Page 7: COMP323 Foundations  of Chinese Computing

Course Introduction
Fundamental Problems with Chinese Computing

Page 8: COMP323 Foundations  of Chinese Computing

Course Introduction
Fundamental Problems with Chinese Language
  Bi-lingual, Tri-lingual and Multi-lingual Computing
    Question: Is Hong Kong a multi-lingual society?
    How can a system be designed so that it can be used for different languages with minimal changes?
    How can a system be designed so that it can be used for multiple languages?
    Distinguish Chinese and English Characters
      Chinese Text, English Text, or Chinese Text Mixed Together with English Text?
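The question of telling Chinese text, English text and mixed text apart can be sketched with Unicode code point ranges. This is a rough illustration, not a complete solution: it assumes Unicode strings, treats only the basic CJK Unified Ideographs block and Extension A as "Chinese", and treats only ASCII letters as "English". The function names are made up for this example.

```python
def classify_char(ch: str) -> str:
    """Label one character as 'chinese', 'english', or 'other'."""
    code = ord(ch)
    # Basic CJK Unified Ideographs and Extension A only (an assumption;
    # later extension blocks are ignored in this sketch).
    if 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF:
        return "chinese"
    if ch.isascii() and ch.isalpha():
        return "english"
    return "other"  # digits, spaces, punctuation, other scripts


def classify_text(text: str) -> str:
    """Label a string as 'chinese', 'english', 'mixed', or 'neither'."""
    kinds = {classify_char(ch) for ch in text}
    has_chinese = "chinese" in kinds
    has_english = "english" in kinds
    if has_chinese and has_english:
        return "mixed"
    if has_chinese:
        return "chinese"
    if has_english:
        return "english"
    return "neither"


for sample in ("香港", "Hong Kong", "我愛Hong Kong"):
    print(sample, "->", classify_text(sample))
```

Note that the same ideographs are also used in Japanese and Korean text, so a code point test alone cannot prove that a string is Chinese.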

Page 9: COMP323 Foundations  of Chinese Computing

Course Introduction
Fundamental Problems with Chinese Language
  Bi-lingual, Tri-lingual and Multi-lingual Computing
    Example: Count the Number of (Chinese and/or English) Characters or Words
      Sample text: Multilingual Computing 多語言文字處理技術 (how many?)
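A minimal sketch of the counting exercise, applied to the sample text "Multilingual Computing 多語言文字處理技術". It assumes that a "Chinese character" is any code point in the basic CJK Unified Ideographs block and that an "English word" is a run of ASCII letters; the function name `count_units` is hypothetical.

```python
import re

# One Chinese character (basic CJK block only) or one run of ASCII letters.
TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+")


def count_units(text: str) -> dict:
    """Count Chinese characters, English words and English letters."""
    counts = {"chinese_chars": 0, "english_words": 0, "english_letters": 0}
    for token in TOKEN.findall(text):
        if "\u4e00" <= token[0] <= "\u9fff":
            counts["chinese_chars"] += 1
        else:
            counts["english_words"] += 1
            counts["english_letters"] += len(token)
    return counts


print(count_units("Multilingual Computing 多語言文字處理技術"))
```

Counting Chinese words, as opposed to characters, is not attempted here, because the text contains no spaces to mark where one word ends and the next begins.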

Page 10: COMP323 Foundations  of Chinese Computing

Tentative Teaching Content
Characteristics of Chinese Language
  Reading System (Pronunciation)
  Writing System (Look)
Computer Representation of Chinese Characters
  Character Set Standards (GB, Big5 and Unicode ...)
  Encoding Schemes (ISO and UTF …)
Chinese Character Input
  Chinese Input Processing by (Pen, Image, Speech and) Key Stroke
  Shape-based Keystroke Input Method
  Phonetic-based Keystroke Input Method
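To make the difference between character set standards and encoding schemes concrete, the sketch below encodes one character under several of the standards named above. It relies on the codec names shipped with Python ("gb2312", "big5", "utf-8", "utf-16-be"); the helper name `encodings_of` is an assumption for this example.

```python
def encodings_of(ch: str) -> dict:
    """Return the byte sequence (as hex) of one character in several encodings."""
    codecs = ("gb2312", "big5", "utf-8", "utf-16-be")
    return {name: ch.encode(name).hex() for name in codecs}


# The same character gets a different byte sequence under each standard.
for name, hex_bytes in encodings_of("中").items():
    print(f"{name:10s} {hex_bytes}")
```

The practical consequence is that a program must know which encoding a file uses before it can interpret the bytes as Chinese text.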

Page 11: COMP323 Foundations  of Chinese Computing

Tentative Teaching Content
Chinese Character Output
  Bitmap and Outline Font Representation
  Compression
  Scaling Problem
Software Development for Chinese
  Text Processing, such as Character Searching, Editing, and Deletion …
  Software Localization and Internationalization

Page 12: COMP323 Foundations  of Chinese Computing

Tentative Teaching Content
Chinese Language Processing
  Word Segmentation
  Part-of-Speech (POS) Tagging
  Syntactic Analysis (Grammatical Analysis)
Chinese Information (Document) Retrieval
  Document Retrieval Models
  Language-Related Issues
Advanced Topics (possibly)
  Information Extraction
  Text Summarization

Page 13: COMP323 Foundations  of Chinese Computing

Lecture 1: Characteristics of Chinese

Page 14: COMP323 Foundations  of Chinese Computing

The Chinese Language
General Characteristics
  The official language in China is Mandarin (普通話), but there are many dialects in spoken form (50+).
  Different Pronunciation across Different Dialects
  Relatively Unified Writing System
  Dialect-specific Characters and Variant Character Writing
    Cantonese-specific characters, for example: 叻吓吔呃咁咗咩哂哋唔唥唧啱啲喐喥喺嗰嘅嘜嘞嘢
  Different words express the same meaning, e.g. 係 and 是 (to be)
  Word order reversal, e.g. 找尋 and 尋找 (look for)

Page 15: COMP323 Foundations  of Chinese Computing


The Chinese Language

Page 16: COMP323 Foundations  of Chinese Computing

The Chinese Language
Characteristics of Chinese Characters
  Each Chinese character is associated with three features: its look (called graphemics), its pronunciation (called phonetics), and its meaning (called semantics).
  Diagram labels: Graphemics (The Look), Phonetics (The Sound), Semantics (The Meaning)

Page 17: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Radicals (部首)
  Chinese characters are composed of smaller units, called radicals.
  214+ radicals are used for indexing Chinese characters.
  The advantage of a radical is that one can look up a character in a dictionary without knowing its pronunciation.

一丨丶丿乙亅二亠人儿入八冂冖冫几凵刀力勹匕匚匸十卜卩厂厶又口囗土士夂夊夕大女子宀寸小尢尸屮山巛工己巾幺广廴廾弋弓彐彡彳心戈戶手支攴文斗斤方无日曰月木欠止歹殳毋比毛氏气水火爪父爻爿片牙犬玄玉瓜瓦甘生用田疋疒癶白皮目矛矢石示禸禾穴立竹米糸缶网羊羽老而耒耳聿肉臣自至臼舌舛舟艮色艸虍虫血行衣襾見角言谷豆豕豸貝赤走足身車辛辰辵邑酉釆里金長門阜隶隹雨靑非面革韋韭音頁風飛食首香馬骨高髟鬥鬯鬲鬼魚鳥鹵鹿麥麻黃黍黑黹黽鼎鼓鼠鼻齊
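The 214 radicals used for indexing also exist as a block of their own in Unicode (Kangxi Radicals, U+2F00 to U+2FD5). The sketch below lists them with Python's standard `unicodedata` module and maps each radical symbol to the ordinary character it corresponds to by compatibility normalization (NFKC). The function name `kangxi_radicals` is an assumption for this example.

```python
import unicodedata

KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5  # Unicode Kangxi Radicals block


def kangxi_radicals():
    """Return (radical number, ordinary character, Unicode name) triples."""
    table = []
    for number, code in enumerate(range(KANGXI_FIRST, KANGXI_LAST + 1), start=1):
        radical = chr(code)
        # NFKC folds the dedicated radical symbol to the unified ideograph.
        ideograph = unicodedata.normalize("NFKC", radical)
        table.append((number, ideograph, unicodedata.name(radical)))
    return table


radicals = kangxi_radicals()
print(len(radicals))
for entry in radicals[:5]:
    print(entry)
```

A radical index built on such a table lets a dictionary group characters under their radical, which is the lookup path that does not depend on pronunciation.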

Page 18: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Radicals
  Remark: Several radicals can stand alone as single and meaningful Chinese characters.

  Radical   Standalone   Examples
  木        Yes          本未术札朽朴朳杀杂机朵权
  火        Yes          炜炬炅炖炒炝炙炘炊炆炕炉
  心        Yes          伈芯志忐吣忘忍态忠念忿忽
  石        Yes          岩矾矿宕砀码研砆砌砂泵砍

Page 19: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Strokes (筆劃)
  Radicals in turn are composed of smaller units, called strokes.
  30+ strokes are the most basic elements of a character.
  The 5 basic strokes are “一” (横, a horizontal stroke), “丨” (竖, a vertical stroke), “丶” (点, a dot), “丿” (撇, a stroke curved to the left) and “乙” (折, a bent stroke).

Page 20: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Strokes
  Stroke Order (筆順)
    The strokes of each Chinese character are drawn in a defined order.
    Basic principles are: left to right, top to bottom, outside to inside, horizontal before vertical, left slant before right slant, center before two sides, etc.
    See animations here: http://www.chinawestexchange.com/Chinese/characters.htm
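Because every character has a defined stroke order, a character can be described as a sequence of basic stroke classes, which is the idea behind shape-based keystroke input. The sketch below is only an illustration: the numeric codes 1 to 5, the small hand-written stroke table and the function names are assumptions, and the right-falling stroke is filed under the dot class, as five-class schemes commonly do.

```python
# The five basic strokes from the slide, numbered in the order listed there
# except that 丿 (left-falling) is placed before 丶 (dot), an assumption.
STROKE_CODES = {"一": 1, "丨": 2, "丿": 3, "丶": 4, "乙": 5}

# Hypothetical hand-written sample; a real system would load a full table.
STROKE_ORDER = {
    "十": "一丨",
    "人": "丿丶",
    "大": "一丿丶",
    "木": "一丨丿丶",
    "口": "丨乙一",
}


def stroke_code(ch: str) -> str:
    """Encode a character as the digit string of its stroke classes."""
    return "".join(str(STROKE_CODES[s]) for s in STROKE_ORDER[ch])


def lookup(code_prefix: str) -> list:
    """Return the characters whose stroke code starts with the given prefix."""
    return [ch for ch in STROKE_ORDER if stroke_code(ch).startswith(code_prefix)]


print(stroke_code("木"))
print(lookup("12"))
```

Typing more strokes narrows the candidate list, which mirrors how a stroke-based input method offers fewer choices as the user keeps typing.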

Page 21: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Tree Structure of Chinese Characters

Page 22: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Character Classifications and Formation
  Type 1: Pictographs (Picture Characters) (象形)
    They look like the things they represent, e.g. 月 (moon).
    Does the character 月 really look like a moon to you? Centuries ago, it was written differently (image not preserved).
    Other examples are 日 (sun), 山 (mountain), 水 (water), 鸟 (bird), 火 (fire), 木 (tree), 車 (car, cart), and 口 (mouth, opening), etc.

Page 23: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Evolution of Chinese Characters

Page 24: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Character Classifications and Formation
  Type 2: (Simple) Ideographs (指事 or 表意)
    They represent abstract concepts or ideas, such as numbers and directions, e.g. 一 (one), 二 (two), 三 (three), 中 (center, middle), 上 (above), 下 (below), etc.

Page 25: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Character Classifications and Formation
  Type 3: Compound Ideographs (會意)
    Pictographs and ideographs can be combined to form more complex characters, whose meaning usually reflects the combined meaning of the parts.
    Examples:
      sun 日 + moon 月 = bright 明
      person 人 + person 人 = agree/follow 从
      sun 日 + tree 木 = east 東 (sun rising above the trees in the east)
      tree 木 + tree 木 = forest 林
      forest 林 + one more tree 木 = full of trees 森
    More interesting animations from the Internet: http://www.language.berkeley.edu/fanjian/compound_ideographs.html
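The compound examples (明, 从, 東, 林, 森) can be written down as a small decomposition table, which is one simple way a program might represent how characters are built from components. The table holds only the examples from the slide; the glosses are taken from the slide, and the names `COMPOUNDS`, `GLOSS`, `explain` and `find_by_components` are made up for this sketch.

```python
# Component lists for the compound ideographs named on the slide.
COMPOUNDS = {
    "明": ("日", "月"),
    "从": ("人", "人"),
    "東": ("日", "木"),
    "林": ("木", "木"),
    "森": ("木", "木", "木"),
}

GLOSS = {
    "日": "sun", "月": "moon", "人": "person", "木": "tree",
    "明": "bright", "从": "agree/follow", "東": "east",
    "林": "forest", "森": "full of trees",
}


def explain(ch: str) -> str:
    """Describe a compound as the sum of its glossed components."""
    parts = " + ".join(f"{p} ({GLOSS[p]})" for p in COMPOUNDS[ch])
    return f"{parts} = {ch} ({GLOSS[ch]})"


def find_by_components(*parts: str) -> list:
    """Find compounds built from exactly these components, in any order."""
    wanted = sorted(parts)
    return [ch for ch, comps in COMPOUNDS.items() if sorted(comps) == wanted]


print(explain("明"))
print(find_by_components("木", "木"))
```

The table records only which components occur, not where they sit inside the character, so it cannot distinguish two characters that use the same parts in different layouts.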

Page 26: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Character Classifications and Formation
  Type 3: Compound Ideographs

Page 27: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Character Classifications and Formation
  Type 3: Compound Ideographs

Page 28: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Character Classifications and Formation
  Type 4: Phonetic Ideographs (形聲)
    They usually have at least two component characters: one influences the sound and the other influences the meaning.
    They account for more than 90% of all Chinese characters in use today.
    For example, in the character “跳” (jump), the left part “足” means “foot”. The meanings of characters that contain “足” are related to “foot” in some way. The right part “兆” indicates the sound: 跳 (tiào) and 兆 (zhào) share the same vowel.

Page 29: COMP323 Foundations  of Chinese Computing

Chinese Writing System

Thought to be the oldest types of characters, pictographs were originally pictures of things. During the past 5,000 years or so they have become simplified and stylised.

Ideographs are graphical representations of abstract ideas.

Compound pictographs and ideographs combine two or more pictographs or ideographs to form new characters. Each component part contributes to the meaning of the compound character.

Page 30: COMP323 Foundations  of Chinese Computing


Chinese Writing System

Semantic-phonetic compounds represent around 90% of all existing characters and consist of two parts: a semantic component or radical which hints at the meaning of the character, and a phonetic component which gives a clue to the pronunciation of the character. Characters containing the same phonetic component may have the same sound and the same tone, the same sound but a different tone, the same initial or final sound, or a different sound and a different tone. Phonetic components are generally a more reliable indication of pronunciation than semantic components are of meaning.

Page 31: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Traditional and Simplified Characters
  Over time, frequently used and complex Chinese characters tend to be simplified.
  One simplification method: retain only one part of the traditional character.
  More about Pitfalls and Complexities of Chinese to Chinese Conversion: http://www.cjk.org/cjk/c2c/c2cbasis.htm

Page 32: COMP323 Foundations  of Chinese Computing

Chinese Writing System
Chinese Language (Chinese Text)
  Chinese characters are combined with other Chinese characters into words to express more complex ideas and concepts.
  Question: How many Chinese characters are there?

The Chinese writing system is open-ended, meaning that there is no upper limit to the number of characters. The largest Chinese dictionaries include about 56,000 characters, but most of them are archaic, obscure or rare variant forms. Knowledge of about 3,000 characters enables you to read about 99% of the characters in Chinese newspapers and magazines. To read Chinese literature, technical writings or classical Chinese, though, you need to be familiar with about 6,000 characters.
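The claim that about 3,000 characters cover about 99% of newspaper text is a statement about cumulative character frequency. The sketch below shows how such a figure could be measured on any sample text; it does not reproduce the 3,000 figure, which depends on the corpus. The function name `chars_needed` is an assumption, and only the basic CJK Unified Ideographs block is counted.

```python
from collections import Counter


def chars_needed(text: str, target: float = 0.99) -> int:
    """Return how many distinct characters, taken from most to least
    frequent, are needed to cover the target share of the Chinese
    characters in the text."""
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    covered = 0
    for needed, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= target * total:
            return needed
    return 0  # the text contains no Chinese characters


# Toy sample: 的 occurs 8 times and 一 occurs 2 times.
sample = "的的的的的的的的一一"
print(chars_needed(sample, target=0.5))
print(chars_needed(sample, target=1.0))
```

Run on a large newspaper corpus, the same function would give the number of characters a reader needs for a chosen coverage level.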

Page 33: COMP323 Foundations  of Chinese Computing

Chinese Reading System
Pronunciation
  The phonetic information is not explicit.
    Sometimes, you can guess the pronunciation from the component characters.
    Sometimes, the pronunciation has no relation to the components at all.
    This makes learning Chinese difficult without a phonetic transcription system.
  Phonetic transcription: writing down pronunciations
    Sufficiency: there are symbols to indicate all sounds in the language.
    Uniqueness: one sound is denoted by only one symbol.

Page 34: COMP323 Foundations  of Chinese Computing

Chinese Reading System
Pronunciation
  Pinyin: transcribing Mandarin Chinese
    Initial (a consonant, 輔音) and final (built around a vowel, 元音)
    In speech, Chinese words are created using just 21 beginning sounds called initials, and 37 ending sounds called finals. Initials and finals combine to create the basic sounds of Chinese.
    For example, consider Beijing:
      bei: b is an initial, and ei is a final
      jing: j is an initial, and ing is a final
    More about Pronunciation: http://www.chinese-outpost.com/language/pronunciation/mandarin-chinese-initials-and-finals-table-1.asp
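The split of a syllable into initial and final, as in the Beijing example, can be sketched as a prefix match against the 21 initials. The list below is the usual set of 21 pinyin initials; treating y and w as part of the final (so that "yi" has no initial) is an assumption of this sketch, and the function name `split_syllable` is made up.

```python
# The 21 pinyin initials; two-letter initials come first so that
# "zh", "ch" and "sh" are matched before "z", "c" and "s".
INITIALS = ("zh", "ch", "sh",
            "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(syllable: str) -> tuple:
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # syllables such as "an" have no initial


for s in ("bei", "jing", "zhong", "an"):
    print(s, "->", split_syllable(s))
```

A phonetic-based input method can use the same split to validate what the user types, since only certain initial and final combinations occur.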

Page 35: COMP323 Foundations  of Chinese Computing

Chinese Reading System
Pronunciation
  Pinyin

Page 36: COMP323 Foundations  of Chinese Computing

Chinese Reading System
Pronunciation
  Tones of Chinese
    Chinese is a tonal language.
    Mandarin has 4 (5) tones and Cantonese has 6 (9) tones, which makes Cantonese much harder to learn than Mandarin.

Page 37: COMP323 Foundations  of Chinese Computing

Chinese Reading System
Pronunciation
  Tones differentiate meanings.
  Everyone seems to know this one: just by saying “ma” in different tones, you can ask, “Did mother scold the horse?”
    妈骂马吗? (mā mà mǎ ma?)
  鞏俐 (Gǒng Lì, with third and fourth tones) is the name of the star of “Raise the Red Lantern” and other contemporary Chinese films. However, 公里 (gōng lǐ, with first and third tones) means kilometer.
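Since the tone is part of what distinguishes one syllable from another, software that handles pinyin usually has to store it explicitly. The sketch below converts tone-marked pinyin to the common numbered form by decomposing each letter with Python's standard `unicodedata` module. It assumes the four Mandarin tone marks are the macron, acute, caron and grave accents, and it writes the neutral tone as 5; the function names are made up for this example.

```python
import unicodedata

# Combining marks used for the four Mandarin tones.
TONE_MARKS = {"\u0304": 1,  # macron: first tone
              "\u0301": 2,  # acute: second tone
              "\u030c": 3,  # caron: third tone
              "\u0300": 4}  # grave: fourth tone


def tone_number(syllable: str) -> int:
    """Return the tone of one pinyin syllable (5 means neutral tone)."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return 5


def to_numbered(syllable: str) -> str:
    """Rewrite a tone-marked syllable in numbered form, e.g. ma with a grave accent becomes ma4."""
    decomposed = unicodedata.normalize("NFD", syllable)
    base = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", base) + str(tone_number(syllable))


print([to_numbered(s) for s in "mā mà mǎ ma".split()])
print([to_numbered(s) for s in "gǒng lì".split()])
print([to_numbered(s) for s in "gōng lǐ".split()])
```

In numbered form the two readings of "gong li" become different strings, so a lookup table keyed on them can tell the name from the unit of distance.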