
Page 1: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering
National Cheng Kung University
2014/02/17

Multilingual and Crosslingual Information System

Page 2: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Contact Information

• Room: 4261, Monday 09:10 - 12:00
• Instructor: Prof. Wen-Hsiang Lu (盧文祥)
  – Office: 4216
  – Office hours: Monday 12:10 - 2:10 PM
  – Phone: 62545
  – Web page: http://myweb.ncku.edu.tw/~whlu/mis.htm
  – Email: [email protected]
  – Teaching assistant: 王廷軒
    • Email: [email protected]

Page 3: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Course Grading

• Class participation/presentation: 30%
• Tests: 25%
• Project: 25%
• Homework: 20%

Page 4: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Source Textbooks

• Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999. (Available from Chuan Hwa Book Co. (全華科技圖書): 02-23717725)

• Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2000.

• James Allen, Natural Language Understanding, Benjamin/Cummings Publishing Co, 1995.

• Gregory Grefenstette, Cross-Language Information Retrieval, Kluwer, 1998.

• Jean Véronis, Parallel Text Processing: Alignment and Use of Translation Corpora, Kluwer, 2000.

Page 5: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Other Useful Sources (1)

• Reference Books
  – Charniak, E. Statistical Language Learning.
  – Cover, T. M., Thomas, J. A. Elements of Information Theory.
  – Jelinek, F. Statistical Methods for Speech Recognition.
• Major Conferences
  – ACL (Association for Computational Linguistics)
  – COLING (International Conference on Computational Linguistics)
  – HLT (Human Language Technology Conference)
  – IJCNLP (International Joint Conference on Natural Language Processing)
• Journals
  – Computational Linguistics
  – Natural Language Engineering
  – TALIP (ACM Transactions on Asian Language Information Processing)
  – TSLP (ACM Transactions on Speech and Language Processing)

Page 6: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Other Useful Sources (2)

• Resource URLs
  – http://www.aclclp.org.tw/res_other_c.php (中華民國計算語言學學會, the Association for Computational Linguistics and Chinese Language Processing)
  – http://nlp.stanford.edu/software/index.shtml (Stanford NLP Group)
  – http://www.phontron.com/nlptools.php (Graham Neubig)
• Tools/Software
  – Online Dictionary
    • WordNet: http://wordnet.princeton.edu/
    • HowNet: http://www.keenage.com/html/c_index.html
    • The Academia Sinica Bilingual Ontological Wordnet (BOW): http://bow.sinica.edu.tw/

Page 7: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

CKIP (Chinese Knowledge and Information Processing group, Academia Sinica; 中研院詞庫小組)

• Parser: http://140.109.19.112/main.exe?id=6833
• POS (part-of-speech) tagger: http://ckipsvr.iis.sinica.edu.tw/

Page 8: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


Eric Brill's POS Tagger

• Website: http://cst.dk/online/pos_tagger/uk/

This/DT is/VBZ a/DT book/NN ./.
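The tagger output above uses a `word/TAG` token format. As a rough illustration of how such output might be read back into a program, here is a minimal Python sketch; the function name `parse_tagged` is hypothetical and is not part of Brill's tagger or any other tool named in these slides.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so a token such as './.' or a word that
        # itself contains '/' still yields the tag as the final field.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged("This/DT is/VBZ a/DT book/NN ./.")
print(tagged)
# [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('book', 'NN'), ('.', '.')]
```

Splitting on the last slash rather than the first is a design choice that keeps punctuation tokens like `./.` intact.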

Page 9: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Stanford Parser

• Website
  – http://nlp.stanford.edu/software/lex-parser.shtml
• Tools
  – Online version
    • Stanford Parser version 1.5.1
    • English & Chinese
    • http://josie.stanford.edu:8080/parser/

Page 10: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


Stanford Parser

Page 11: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

[Homework 1]

• Use the CKIP POS (part-of-speech) tagger, Eric Brill’s POS tagger, and the Stanford Parser to tag and parse at least three sentences.

Page 12: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Course Topics

• Probability and Information Theory
  – basics: definitions, formulas, examples
• Language Modeling
  – n-gram models, parameter estimation
  – smoothing (EM algorithm)
• Some Linguistics
  – phonology, morphology, syntax, semantics, discourse
• Words and the Lexicon
  – word classes, mutual information, lexicography

Page 13: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Course Topics (cont.)

• Hidden Markov Models
  – background, algorithms, parameter estimation
• Tagging: methods, algorithms, evaluation
  – tag sets, HMM tagging, transformation-based, feature-based
• Grammars and Parsing: data, algorithms
  – statistical parsing: algorithms, parameterization, evaluation

Page 14: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Course Topics (cont.)

• Applications
  – Machine Translation (MT)
  – Automatic Speech Recognition (ASR)
  – Information Retrieval (IR)
  – Cross-Language Information Retrieval (CLIR)
  – Question Answering (QA)
  – Cross-Language Question Answering (CLQA)
  – Summarization
  – Information Extraction
  – …

Page 15: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Course Introduction

• Lecture 1: Introduction
• Lecture 2: Mathematical Foundations
• Lecture 3: Linguistics Essentials
• Lecture 4: Corpus-based Work
• Lecture 5: Collocations
• Lecture 6: Statistical Inference: n-gram Models over Sparse Data
• Lecture 7: Word Sense Disambiguation
• Lecture 8: Statistical Alignment and Machine Translation
• Lecture 9: Markov Models
• Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval
• Lecture 11: Statistical/Probabilistic Models for Word Alignment & CLIR
• Lecture 12: Part-of-Speech Tagging
• Lecture 13: Probabilistic Context Free Grammars
• Lecture 14: Question Answering

Page 16: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

The Ultimate Research Goal in Natural Language Processing (NLP)

• To develop an automated language understanding system
• Why is this important?
  – Easy for everyone to use language
  – Natural human interface for a variety of applications (e.g., database access, on-line tutor, robot control, etc.)
  – Language seems fundamental for developing an intelligent system
    • iPhone Siri
    • IBM's DeepQA project

Page 17: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


Natural Language is VERY Useful

Page 18: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


Page 19: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

OCR Problems


Page 20: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


Page 21: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


Aspects of Computational Linguistics

• Description of the Language: universals, cross-linguistic research

• Implementation of Computer Model: algorithms and data structures, formal models to represent knowledge, model of the reasoning process

• Psycho-Linguistic Aspect: humans are an existence proof of the computability of language comprehension; psychological research can be used to justify a computer model; obtain human processing parameters

Page 22: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

NLP Issues

• Why is NLP difficult?
  – Many “words”, many “phenomena”, many “rules”
    • OED (Oxford English Dictionary): 400k words; Finnish lexicon (of forms): ~2 × 10^7
    • sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more!
  – irregularity (exceptions, exceptions to the exceptions, ...)
    • potato → potatoes (tomato, hero, ...); photo → photos; and even both: mango → mangos or mangoes
    • Adjective / Noun order: new book, electrical engineering, general regulations, flower garden, garden flower
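To make the "exceptions to the exceptions" point concrete, the plural examples on this slide can be sketched as a rule plus two exception lists. This is only a toy illustration in Python: the names `PLAIN_S`, `EITHER`, and `plural_forms` are hypothetical, and the word lists contain only the nouns mentioned on the slide.

```python
# Hypothetical exception list: nouns ending in "-o" that take plain "-s".
PLAIN_S = {"photo"}
# Exceptions to the exception: nouns for which both spellings are accepted.
EITHER = {"mango"}


def plural_forms(noun):
    """Return the accepted plural spellings of a noun (toy rules only)."""
    if noun in EITHER:
        return [noun + "s", noun + "es"]
    if noun in PLAIN_S:
        return [noun + "s"]
    if noun.endswith("o"):
        # General "-o" rule: potato, tomato, hero.
        return [noun + "es"]
    # Default rule for everything else in this sketch.
    return [noun + "s"]


for word in ["potato", "hero", "photo", "mango"]:
    print(word, "->", " or ".join(plural_forms(word)))
```

Each new irregular noun forces another entry in a hand-built list, which is the maintenance burden the slide is pointing at.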

Page 23: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Difficulties in NLP (cont.)

– Ambiguity
  • books: NOUN or VERB?
    – you need many books vs. she books her flights online
  • Thank you for not smoking, drinking, eating or playing radios without earphones. (MTA bus)
    – Thank you for not eating without earphones??
    – Thank you for drinking?? …
  • Fred’s hat was blown off by the wind. He tried to catch it.
    – ...catch the wind or ...catch the hat?

Page 24: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Rules or Statistics?

• Preferences:
  – context clues: she books → books is a verb
  – rule: if an ambiguous word (verb/nonverb) is preceded by a matching personal pronoun → the word is a verb
  – pronoun reference:
    – she/he/it often refers to the most recent noun or pronoun (but there are certainly exceptions)
  – selectional restrictions:
    – catching hat is better than catching wind (but not always)
  – semantics:
    – We thank people for doing helpful things or not doing annoying things
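The pronoun rule above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the course's method: the word lists and the function `tag_ambiguous` are hypothetical, "matching" is read here as simple subject-verb agreement, and falling back to NOUN when the rule does not fire is a choice made only for this example.

```python
# Hypothetical mini-lexicon for the sketch.
THIRD_SINGULAR = {"she", "he", "it"}
OTHER_PRONOUNS = {"i", "you", "we", "they"}
AMBIGUOUS = {"book", "books"}  # forms that can be a noun or a verb


def tag_ambiguous(tokens):
    """Tag ambiguous words as VERB or NOUN; other tokens get None."""
    tags = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word not in AMBIGUOUS:
            tags.append(None)
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        # "Matching" pronoun: she/he/it + "-s" form, or I/you/we/they + bare form.
        agrees = (previous in THIRD_SINGULAR and word.endswith("s")) or (
            previous in OTHER_PRONOUNS and not word.endswith("s")
        )
        # Assumption: default to NOUN whenever the pronoun rule does not apply.
        tags.append("VERB" if agrees else "NOUN")
    return tags


print(tag_ambiguous("she books her flights online".split()))
print(tag_ambiguous("you need many books".split()))
```

The rule handles the slide's two example sentences, but it is easy to construct sentences where it fails, which is why the slide frames these as preferences rather than hard constraints.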

Page 25: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Solutions

• Don’t guess if you know:
  • morphology (inflections)
  • lexicons (word information)
  • unambiguous names
  • perhaps some (really) fixed phrases
  • syntactic rules?
• Use statistics (based on real-world data) for preferences (only?)
• No doubt: but this is an important question!

Page 26: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Types of Linguistic Knowledge

• Acoustic/Phonetic Knowledge: How words are related to their sounds. (transliteration)
  – Ericsson <=> 易利信 (pinyin: Yìlìxìn)
• Morphological Knowledge: How words are constructed out of basic meaning units.
  – un + friend + ly → unfriendly
  – love + past tense → loved
  – object + oriented → object-oriented

Page 27: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


More Types of Linguistic Knowledge

• Lexical Knowledge (or Dictionary): This should include information on parts of speech, features (e.g., number, case), typical usage, and word meaning.

• Syntactic Knowledge: How words are put together to make legal sentences (or constituents of sentences).

Page 28: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

More Types of Linguistic Knowledge

• Semantic Knowledge: Word meanings, how words combine into sentence meaning
  – e.g., Fred tossed the ball. (semantic roles)

Page 29: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

More Types of Linguistic Knowledge

• Pragmatic Knowledge: How context affects the interpretation of a sentence. Examples:
  – Louise loves him.
    [Context 1:] Who loves Fred?
    [Context 2:] Louise has a cat.
  – What time is it?
    [Context 1:] Fred is fidgeting (坐立不安) and staring at his watch.
    [Context 2:] Louise has no watch.

Page 30: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

More Types of Linguistic Knowledge

• World Knowledge: How other people’s minds work, what a listener knows or believes, the etiquette (成規) of language. Examples:
  – Will you pass the salt?
  – I read an article about the war in the paper.
  – Fred saw the bird with his binoculars.
  – Tim was invited to Tom's birthday party. He went to the store to buy him a present.

Page 31: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Multilingualism Issues in the Web Age

• Language barrier
  – There are about 6,700 languages listed in the Ethnologue (http://www.ethnologue.com/)
• Information overload
  – Scaling up of language resources
    • Webpages
    • News
    • Weblogs
    • Microblogs

Page 32: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


Multilingual Understanding??

Page 33: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


Multilingual Understanding??

Page 34: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering


Multilingual Understanding??

Page 35: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Real World Situation

• Use a statistical model based on REAL WORLD DATA and care about the best sentence only
• Imagine:
  – Each sentence W = { w1, w2, ..., wn } gets a probability P(W|X) in a context X
  – For every possible context X, sort all the imaginable sentences W according to P(W|X)
  – Ideal situation:

[Figure: P(W) plotted over sentences sorted from Wbest to Wworst; the best sentence is the most probable one in context X]
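The idea of sorting candidate sentences by probability and keeping only the best one can be sketched with a tiny bigram model in Python. Everything here is an illustrative assumption: the three-sentence corpus is toy data, the model uses add-one smoothing, the names `log_prob` and `rank` are hypothetical, and the context X is ignored, so the score approximates P(W) rather than P(W|X).

```python
import math
from collections import Counter

# Toy corpus standing in for "real world data" (assumption for this sketch).
corpus = [
    "she books her flights online",
    "you need many books",
    "she reads many books",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # <s> marks the sentence start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
vocab_size = len(unigrams)


def log_prob(sentence):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    words = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )


def rank(candidates):
    """Sort candidate sentences from most to least probable (W_best first)."""
    return sorted(candidates, key=log_prob, reverse=True)


candidates = ["books she many reads", "she reads many books"]
for sentence in rank(candidates):
    print(round(log_prob(sentence), 3), sentence)
```

Working in log-probabilities avoids numerical underflow when many small probabilities are multiplied, and it leaves the ranking unchanged because the logarithm is monotonic.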

Page 36: Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering

Real World Situation

• Unable to specify a set of grammatical sentences using fixed “categorical” rules
• (disregarding the “grammaticality” issue)

[Figure: P(W) plotted over sentences sorted from Wbest to Wworst; the best sentence is the most probable one in context X]