Words: Part-of-Speech Tagging (1)
• Parts of speech (POS) and POS tagging
• Tagsets
• Rule-based POS tagging
• POS tagging evaluation
POS Tagging Basics
• Motivation
• Part of speech (POS): also called word class, morphological class, or lexical tag
• Open-class words (e.g. noun, verb) vs closed-class words (e.g. auxiliary, conjunction)
• POS tagging is crucial for various NLP applications.
Raw text: 编辑这篇报道 ("edit this report")
Segmented: 编辑 这 篇 报道
Tagged: 编辑/v 这/r 篇/q 报道/n
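The word/tag notation used in this example can be read mechanically. The following is a minimal sketch, assuming tokens are separated by whitespace and the tag follows the last slash of each token; the function name `parse_tagged` is hypothetical.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a token such as './.'
        # is read as word '.' with tag '.'.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


print(parse_tagged("编辑/v 这/r 篇/q 报道/n"))
```

The same function reads Penn Treebank style output such as `The/DT grand/JJ jury/NN`.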
• Tagging Ambiguity
• One word may be labeled with different POS tags.
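book
I will book a flight to Chicago. (verb)
Please hand me that book. (noun)
报道
这篇报道写得很及时。 (noun: "This report was written in a timely manner.")
央视新闻报道了春晚的筹备进展。 (verb: "CCTV News reported on the progress of preparations for the Spring Festival Gala.")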
Fortunately, in both English and Chinese, many words have only one possible tag, so tagging them is trivial. For the remaining words, POS tagging is a non-trivial disambiguation task.
• Disambiguation
• For many words, their multiple POS tags are not equally likely.
• Context information is important for tag disambiguation. For example, English articles are often followed by nouns.
can
Auxiliary (be able to)
Noun (a metal container)
Verb (to put something in a metal container)
的
Particle (marks possession)
Noun (the center of a target)
• Tagging Methods
• Rule-based tagging
• Statistical (HMM-based) tagging
• Transformation-based tagging
• Memory-based tagging
• Hybrid tagging
Tagsets
• Definition
• A tagset is a list of all POS tags for a particular language.
• Tagsets are manually compiled and language-specific.
• There are different tagsets for languages like English and Chinese.
• Chinese Tagsets
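• 《新著国语文法》 (5 classes, 9 types)
• 《文法简论》 (4 classes, 9 types)
• 《信息处理用现代汉语词类及标记集规范》 (20 major classes, 24 subclasses, 8 sub-subclasses)
• 《现代汉语语法信息词典》 (4 classes, 9 types)
• Peking University 《人民日报》 (People's Daily) POS tagset (39 tags)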
In-Class Exercise
• Using the Penn Treebank tagset (page 8) and the tagging result below, state the POS of each word (e.g. The: determiner).
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Rule-based POS Tagging
• Rule-based POS Tagging for English
• Two-stage architecture
• Stage 1: use a dictionary to assign each word all of its possible POS tags
• Stage 2: use hand-written disambiguation rules to select a single POS tag for each word
• One of the most comprehensive rule-based approaches is the Constraint Grammar approach. The EngCG tagger is based on it.
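The two-stage architecture can be sketched as follows. This is a toy illustration, not the EngCG tagger: the lexicon entries, the two context rules, the noun default for unknown words, and all function names are assumptions made for the example. Tags follow Penn Treebank conventions.

```python
# Stage 1 resource: a toy lexicon listing ALL candidate tags per word.
# Entries are illustrative; the first tag is the assumed fallback choice.
LEXICON = {
    "i": ["PRP"],
    "me": ["PRP"],
    "will": ["MD", "NN"],
    "book": ["NN", "VB"],
    "hand": ["VB", "NN"],
    "a": ["DT"],
    "that": ["DT"],
    "flight": ["NN"],
}


def candidates(tokens):
    """Stage 1: look up every possible tag for each token.
    Unknown words default to noun (an assumption of this sketch)."""
    return [LEXICON.get(tok.lower(), ["NN"]) for tok in tokens]


def disambiguate(cands):
    """Stage 2: apply hand-written context rules from left to right."""
    tags = []
    for options in cands:
        prev = tags[-1] if tags else None
        if len(options) == 1:
            choice = options[0]          # unambiguous word
        elif prev == "DT" and "NN" in options:
            choice = "NN"                # after a determiner, prefer noun
        elif prev == "MD" and "VB" in options:
            choice = "VB"                # after a modal, prefer base verb
        else:
            choice = options[0]          # no rule fired: first-listed tag
        tags.append(choice)
    return tags


def tag(tokens):
    return list(zip(tokens, disambiguate(candidates(tokens))))


print(tag("I will book a flight".split()))
print(tag("hand me that book".split()))
```

With this lexicon, "book" comes out as VB after the modal "will" and as NN after the determiner "that", which mirrors the ambiguity example given earlier.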
• Rule-based POS Tagging for Chinese
• Rules based on collocation
• If an n/v word is coordinated with an n word (by a coordinating conjunction or the enumeration comma 、), label it n.
哲学的产生是人类思想 (n) 和认识 (n, v) 的伟大变革。
• If an n/v word and an n word occur in the same syntactic context, i.e., they share or are modified by the same word or structure, label the n/v word n.
生产产品时既要重视产出 (n, v) ,又要重视质量(n) 。
• If an n/v word follows a determiner that can only modify nouns, label it n.
她是一位研究自然语言处理的女教授 (n, v) 。
• If an n/v word follows an adjective that can only modify nouns, label it n.
中国为世界和平做出了伟大贡献 (n, v) 。
Similarly, there are rules based on pronouns (代词), numerals and quantifiers (数量词), nominal-object verbs (体宾动词), monosyllabic adjectives (单音节形容词), and signature characters (特征字).
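Two of the collocation rules (coordination with an n word, and following an adjective that can only modify nouns) can be sketched in code. The sketch assumes the sentence is already segmented and each word carries its set of candidate tags from a dictionary; the word lists `COORDINATORS` and `NOUN_ONLY_MODIFIERS` and the function name are hypothetical and deliberately tiny.

```python
COORDINATORS = {"和", "与", "及", "、"}   # assumed, partial list
NOUN_ONLY_MODIFIERS = {"伟大"}           # assumed, illustrative only


def apply_collocation_rules(tokens):
    """tokens: list of (word, set_of_candidate_tags).
    Narrows an n/v-ambiguous word to {'n'} when one of the rules fires."""
    result = [(word, set(tags)) for word, tags in tokens]
    for i, (word, tags) in enumerate(result):
        if tags != {"n", "v"}:
            continue
        # Coordination rule: "N 和 X" or "X 和 N" with N unambiguously a noun.
        left = (i >= 2 and result[i - 1][0] in COORDINATORS
                and result[i - 2][1] == {"n"})
        right = (i + 2 < len(result) and result[i + 1][0] in COORDINATORS
                 and result[i + 2][1] == {"n"})
        # Modifier rule: X directly follows a noun-only adjective.
        after_modifier = i >= 1 and result[i - 1][0] in NOUN_ONLY_MODIFIERS
        if left or right or after_modifier:
            result[i] = (word, {"n"})
    return result


phrase = [("思想", {"n"}), ("和", {"c"}), ("认识", {"n", "v"})]
print(apply_collocation_rules(phrase))
```

Words that no rule reaches keep both candidate tags, so a later rule set or a statistical model would still have to resolve them.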
• Rules derived from phrase structure
• Phrase structure rules are derived from grammatical phenomena actually observed in a corpus. There are generic rules and specific rules.
• Generic rules are designed for a certain POS and are function-driven.
• Specific rules are tailored for individual words and are word-driven.
• Affix rules (词缀规则)
• R1: Let K1 = { 金、银、红、黄、绿、蓝、白、灰、黑 } and length = 3. If X1 ∈ K1 and X2X3 has the form BB, then X = X1X2X3 is a (adjective).
• R2: Let K2 = { 一、几 } and length = 3. If X1 ∈ K2 and X2X3 has the form BB, then X is m (numeral-quantifier).
Examples: 金灿灿 (a), 绿油油 (a); 一件件 (m), 几次次 (m)
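R1 and R2 are purely form-based, so they translate almost directly into code. A minimal sketch, with the hypothetical function name `tag_abb`:

```python
K1 = set("金银红黄绿蓝白灰黑")
K2 = set("一几")


def tag_abb(word):
    """Return 'a' (R1), 'm' (R2), or None when neither rule applies."""
    # Both rules require length 3 with X2X3 in the form BB.
    if len(word) != 3 or word[1] != word[2]:
        return None
    if word[0] in K1:
        return "a"
    if word[0] in K2:
        return "m"
    return None


for w in ["金灿灿", "绿油油", "一件件", "几次次", "报道"]:
    print(w, tag_abb(w))
```

Because the function checks only surface form, it fires on every three-character word that matches the pattern, whether or not the resulting tag is correct for that word.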
• R3: Let K3 = { 老、大、小 }. If X1 ∈ K3 and X2…XL is a surname, then X is n (noun).
• R4: Let K4 = { 老、总、局 }. If XL ∈ K4 and X1…XL-1 is a surname, then X is n (noun).
Examples: 老陈 (n), 大张 (n), 小王 (n); 王老 (n), 孙总 (n), 赵局 (n)
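R3 and R4 additionally need to know what counts as a surname. The sketch below assumes a small hand-made set `SURNAMES`; a real system would need a full surname lexicon, and both that set and the function name are hypothetical.

```python
K3 = set("老大小")
K4 = set("老总局")
# Tiny illustrative surname list (an assumption of this sketch).
SURNAMES = {"陈", "张", "王", "孙", "赵", "欧阳"}


def tag_surname_affix(word):
    """Return 'n' if R3 or R4 applies, otherwise None."""
    if len(word) < 2:
        return None
    # R3: prefix from K3 followed by a surname, e.g. 老陈.
    if word[0] in K3 and word[1:] in SURNAMES:
        return "n"
    # R4: surname followed by a suffix from K4, e.g. 孙总.
    if word[-1] in K4 and word[:-1] in SURNAMES:
        return "n"
    return None


for w in ["老陈", "大张", "小王", "王老", "孙总", "赵局", "老师"]:
    print(w, tag_surname_affix(w))
```

Slicing with `word[1:]` and `word[:-1]` lets the rules cover two-character surnames such as 欧阳 as well.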
• R5: Let K5 = { 赛、酸、仪、家、学、色 }. If XL ∈ K5, then X is n (noun).
• R6: Let K6 = { 化 }. If XL ∈ K6, then X is v (verb).
• R7: Let K7 = { 然 }. If XL ∈ K7, then X is d (adverb).
Examples: 篮球赛 (n), 核苷酸 (n), 地震仪 (n), 音乐家 (n), 计算语言学 (n), 深红色 (n); 自动化 (v), 全球化 (v); 欣然 (d), 忽然 (d)
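The suffix rules R5 to R7 amount to a lookup on the last character. A minimal sketch follows; the requirement that the word have at least two characters is an assumption added here so that a bare character such as 学 is not treated as a suffixed word, and the names `SUFFIX_TAGS` and `tag_by_suffix` are hypothetical.

```python
# Map each suffix character to the tag its rule assigns.
SUFFIX_TAGS = {ch: "n" for ch in "赛酸仪家学色"}   # R5
SUFFIX_TAGS["化"] = "v"                            # R6
SUFFIX_TAGS["然"] = "d"                            # R7


def tag_by_suffix(word):
    """Return the tag implied by the last character, or None."""
    if len(word) < 2:          # assumption: a suffix needs a stem
        return None
    return SUFFIX_TAGS.get(word[-1])


for w in ["篮球赛", "计算语言学", "自动化", "忽然", "报道"]:
    print(w, tag_by_suffix(w))
```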
• Repetition rules (重叠词规则)
• R8: If X has the form AABB and AB is an adjective or a verb, then X is likewise a (adjective) or v (verb).
• R9: If X has the form AA and A is a verb or a quantifier, then X is likewise v (verb) or q (quantifier).
Examples: 打打闹闹 (v), 高高兴兴 (a); 朵朵 (q) 云彩, 让我听听 (v)
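The repetition rules depend on the tag of the base form (AB or A), so a sketch needs a base lexicon. The one below contains only the four base words from the examples and is an assumption of this illustration, as is the function name `tag_reduplication`.

```python
# Illustrative base lexicon: base word -> POS tag.
BASE_LEXICON = {"打闹": "v", "高兴": "a", "朵": "q", "听": "v"}


def tag_reduplication(word):
    """Return the tag inherited from the base form, or None."""
    # R8: AABB where AB is an adjective or a verb.
    if len(word) == 4 and word[0] == word[1] and word[2] == word[3]:
        base_tag = BASE_LEXICON.get(word[0] + word[2])
        if base_tag in ("a", "v"):
            return base_tag
    # R9: AA where A is a verb or a quantifier.
    if len(word) == 2 and word[0] == word[1]:
        base_tag = BASE_LEXICON.get(word[0])
        if base_tag in ("v", "q"):
            return base_tag
    return None


for w in ["打打闹闹", "高高兴兴", "朵朵", "听听", "妈妈"]:
    print(w, tag_reduplication(w))
```

A reduplicated word whose base is missing from the lexicon, such as 妈妈 here, is left untagged rather than guessed.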
• Exceptions to the rules
• Rules capture only the majority pattern; almost every rule has exceptions.
黑蛐蛐 ("black cricket", a noun) violates R1
敬老 ("to respect the elderly", a verb) violates R4
In-Class Exercise
• Give one more counterexample (反例) to any of the rules for Chinese POS tagging (R1 – R9). Explain which rule it violates.
POS Tagging Evaluation
• Word-Level vs Sentence-Level
• Generally, sentence-level accuracy is much lower than word-level accuracy.
• By default, "tagging accuracy" refers to word-level accuracy.
• Sentence-level accuracy makes more sense for syntactic analysis.
• Word-level accuracy makes more sense for semantic analysis.
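The two accuracy measures can be computed as in the sketch below. It assumes gold and predicted tags are given as parallel lists of sentences with identical tokenization; the toy tag sequences are invented for illustration and the function names are hypothetical.

```python
def word_accuracy(gold, pred):
    """Fraction of individual tags that match the gold standard."""
    total = sum(len(sentence) for sentence in gold)
    correct = sum(g == p
                  for gs, ps in zip(gold, pred)
                  for g, p in zip(gs, ps))
    return correct / total


def sentence_accuracy(gold, pred):
    """Fraction of sentences in which EVERY tag is correct."""
    correct = sum(gs == ps for gs, ps in zip(gold, pred))
    return correct / len(gold)


# Toy data: one wrong tag (VB tagged as NN) in the second sentence.
gold = [["DT", "NN", "VBD"], ["PRP", "MD", "VB", "DT", "NN"]]
pred = [["DT", "NN", "VBD"], ["PRP", "MD", "NN", "DT", "NN"]]

print(word_accuracy(gold, pred))      # 7 of 8 tags correct
print(sentence_accuracy(gold, pred))  # 1 of 2 sentences fully correct
```

A single wrong tag spoils the whole sentence, which is why sentence-level accuracy is lower. Under the simplifying assumption that errors are independent, a word-level accuracy of 0.97 would give roughly 0.97^20 ≈ 0.54 for 20-word sentences.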
• Error Analysis
• In word-level evaluation, a confusion matrix often helps us to better understand where we make mistakes.
[Figure: confusion matrix with axes "Correct POS" and "Tagged POS"]
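A confusion matrix for word-level evaluation can be built by counting (correct tag, tagged tag) pairs. The following is a minimal sketch using `collections.Counter`; the toy tag lists are invented for illustration and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(gold_tags, pred_tags):
    """Count (correct_tag, tagged_tag) pairs over aligned flat tag lists."""
    return Counter(zip(gold_tags, pred_tags))


def top_confusions(matrix, n=3):
    """Most frequent off-diagonal cells, i.e. the actual errors."""
    errors = Counter({pair: count for pair, count in matrix.items()
                      if pair[0] != pair[1]})
    return errors.most_common(n)


gold_tags = ["NN", "VB", "VB", "JJ", "NN", "VB"]
pred_tags = ["NN", "NN", "NN", "NN", "NN", "VB"]

matrix = confusion_matrix(gold_tags, pred_tags)
print(matrix[("VB", "NN")])      # how often a verb was tagged as a noun
print(top_confusions(matrix))
```

Sorting the off-diagonal cells by count points directly at the tag pairs that most need better rules.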