Tsinghua University 1
Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation
Wei Qiao, Maosong Sun and Wolfgang Menzel
State Key Lab of Intelligent Tech. & Sys., Tsinghua University
Department of Informatics, Hamburg University
Part Ⅰ
Background
Introduction
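Chinese word segmentation
Combination ambiguity: 火把 (torch) vs. 火 (fire) 把 (make)
Overlapping ambiguity:
a. 先解决其主要问题,再解决其次要问题 (First solve its main problem, then its subordinate problems): 其 次要 (the subordinate)
b. 首先要关注整体,其次要注意细节 (First attend to the whole, then to the details): 其次 要 (secondly we should)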
Related Terms
Overlapping ambiguity string (OAS): described by its length, order, intersection length, and structure
Maximal overlapping ambiguity string (MOAS)
True / pseudo ambiguity MOAS
e.g. 其次要 (TM): 其次 要 & 其 次要
e.g. 部长篇小说 (PM): 部 (measure word) 长篇小说
[Figure: order-1 and order-2 OAS examples; character positions 0 1 2 3, word spans 0-2 and 1-3]
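The terms above can be made concrete with a small sketch. It assumes (the slides do not spell this out) that an OAS is a chain of lexicon words in which neighbouring words partially overlap, and that a MOAS is the full extent of such a chain in its context. The function names, the toy lexicon, and the `max_word_len` parameter are illustrative, and chains nested inside a longer lexicon word are not treated specially.

```python
from itertools import combinations


def word_spans(sentence, lexicon, max_word_len=4):
    """Return (start, end) spans of every multi-character lexicon word."""
    spans = []
    n = len(sentence)
    for start in range(n):
        for end in range(start + 2, min(n, start + max_word_len) + 1):
            if sentence[start:end] in lexicon:
                spans.append((start, end))
    return spans


def crosses(a, b):
    """True if the two spans partially overlap (neither contains the other)."""
    (s1, e1), (s2, e2) = sorted((a, b))
    return s1 < s2 < e1 < e2


def find_moas(sentence, lexicon, max_word_len=4):
    """Return the strings covered by maximal chains of crossing words."""
    spans = word_spans(sentence, lexicon, max_word_len)
    parent = list(range(len(spans)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of crossing words into the same chain.
    for i, j in combinations(range(len(spans)), 2):
        if crosses(spans[i], spans[j]):
            parent[find(i)] = find(j)

    chains = {}
    for i, span in enumerate(spans):
        chains.setdefault(find(i), []).append(span)

    moas = []
    for members in chains.values():
        if len(members) < 2:  # a lone word is not ambiguous by itself
            continue
        start = min(s for s, _ in members)
        end = max(e for _, e in members)
        moas.append(sentence[start:end])
    return moas


# Toy lexicon, for illustration only.
toy_lexicon = {"解决", "其次", "次要", "问题", "部长", "长篇", "长篇小说", "小说"}
print(find_moas("再解决其次要问题", toy_lexicon))  # ['其次要']
print(find_moas("一部长篇小说", toy_lexicon))      # ['部长篇小说']
```

The pairwise check is quadratic in the number of word occurrences per sentence, which is acceptable for a sketch; a corpus-scale run would more likely use a single left-to-right sweep.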
Previous Work
[Sun et al., 1999]: 100 million characters. A core set of MOAS was found.
[Li et al., 2003]: 650 million characters. A similar method was used to improve segmenter performance.
Motivation
Two basic issues remain unsolved in the previous work:
The corpora include only news data, so the results need further validation.
The core of pseudo OA strings has yet to be determined, for both general-purpose and domain-specific text.
Part Ⅱ
Statistical Properties of MOAS
From General Corpus
From Domain-specific Corpus
From General Corpus
Data set: CBC, 929,963,468 characters. Rich in content (from the 1920s), covering categories such as novels, essays, and news.
Chinese word list: Peking University, 74,191 entries.
A total of 733,066 distinct MOAS types were found automatically in CBC.
Detailed Distribution
Perspective 1: Length
Perspective 2: Order
Perspective 3: Intersection Length
Perspective 4: Structure distribution
Top N Frequent MOAS: Core Candidates
N = 3,500: 50.78% coverage
N = 7,000: 60.43% coverage
N = 40,000: 80.39% coverage
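The coverage figures above can be read as the share of all MOAS occurrences accounted for by the N most frequent types. The following is a minimal sketch under that assumption; `top_n_coverage` and the toy counts are illustrative and not taken from the slides.

```python
from collections import Counter


def top_n_coverage(freqs: Counter, n: int) -> float:
    """Fraction of all MOAS tokens covered by the n most frequent types."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in freqs.most_common(n))
    return covered / total


# Toy frequency table (made-up counts, for illustration only).
toy_freqs = Counter({"其次要": 5, "部长篇小说": 3, "出国门": 2})
print(top_n_coverage(toy_freqs, 1))  # 0.5
print(top_n_coverage(toy_freqs, 2))  # 0.8
```

Evaluating the same function at several values of N on a real frequency table would give a coverage curve, from which a core size can be chosen.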
Stability vs. Corpus Size
[Figure panels: number of MOAS vs. corpus size; number of top N MOAS vs. corpus size (top 7,000)]
Pseudo MOAS Detection
Relaxed definition of “pseudo”
e.g. 出国门: 出 国门 (go abroad) in almost all cases; 出国 门 (the way to go abroad) with small probability
5,507 PM and 1,439 TM judged by hand
Token coverage of PM and TM over CBC
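The slides state that PM and TM were judged by hand. Purely as an illustration of the relaxed notion of "pseudo", the sketch below labels a MOAS as PM when one segmentation accounts for nearly all observed occurrences. The `threshold` value, the function name, and the counts are assumptions made up for this example, not the authors' procedure or data.

```python
def classify_moas(seg_counts, threshold=0.99):
    """Label a MOAS from counts of its observed segmentations.

    seg_counts maps a segmentation (words joined by spaces) to how often
    it was observed. Returns ("PM", dominant_segmentation) when the most
    frequent segmentation reaches the threshold share, else ("TM", None).
    """
    total = sum(seg_counts.values())
    if total == 0:
        raise ValueError("no observations for this MOAS")
    best, count = max(seg_counts.items(), key=lambda item: item[1])
    if count / total >= threshold:
        return ("PM", best)
    return ("TM", None)


# Invented counts, for illustration only.
print(classify_moas({"出 国门": 998, "出国 门": 2}))   # ('PM', '出 国门')
print(classify_moas({"其次 要": 60, "其 次要": 40}))   # ('TM', None)
```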
From Domain-specific Corpora
Domain-specific corpora:
Ency55: 90.02 million characters
Web55: 54.97 million characters
Common Parts
Frequent MOAS Coverage in Domain Specific Corpora (N=3,500)
Frequent MOAS Coverage in Domain Specific Corpora (N=7,000)
Frequent MOAS Coverage in Domain Specific Corpora (N=40,000)
PM and TM distribution over Domain Corpora
42% of overlapping ambiguities in any Chinese text can be resolved with 100% accuracy.
Part Ⅲ
Disambiguation
Disambiguation Method
Current performance on OA
Performance of ICTCLAS1.0 (http://www.nlp.org.cn) on OAs
e.g. 公安局 长 是 主管 这一 事故 的
The police chief (公安 局长) is the person in charge of this accident.
Performance of MSR-Seg1.0 (http://research.microsoft.com/-S-MSRSeg) on OAs
e.g. 核电站的特殊性 质
The special properties (特殊 性质) of the nuclear power station
Performance of CRF-based [Lafferty 2001] CWS on OAs
e.g. 这一 现状 先 天地 决定 了 他们 的 使命
This situation congenitally (先天 地) makes them take on the mission.
About 2% of OAS are mistakenly segmented: resolving these is a net gain.
Individual-based method
Simple table lookup: record the PMs and their correct segmentations in a table
Advantages:
Satisfactory token coverage of MOASs
Full correctness for segmentation of pseudo MOASs
Low cost in time and space
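The table lookup described above could be sketched as follows. This is a simplified illustration, not the authors' implementation: it matches table keys as plain substrings (longest first) instead of first detecting MOAS in context, and the names `apply_pm_table` and `toy_table` are hypothetical. Text not covered by the table is returned as "pending" for whatever segmenter handles the remainder.

```python
def apply_pm_table(sentence, table):
    """Split a sentence into ('resolved', words) and ('pending', text) pieces.

    table maps a pseudo MOAS string to its single correct segmentation.
    """
    max_len = max(map(len, table), default=0)
    pieces = []
    pending = []
    i = 0
    while i < len(sentence):
        match = None
        # Try the longest candidate first so longer entries win.
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + length]
            if candidate in table:
                match = candidate
                break
        if match is None:
            pending.append(sentence[i])
            i += 1
            continue
        if pending:
            pieces.append(("pending", "".join(pending)))
            pending = []
        pieces.append(("resolved", table[match]))
        i += len(match)
    if pending:
        pieces.append(("pending", "".join(pending)))
    return pieces


# Entries taken from the examples on the slides.
toy_table = {
    "部长篇小说": ["部", "长篇小说"],
    "出国门": ["出", "国门"],
}
print(apply_pm_table("他写了一部长篇小说", toy_table))
# [('pending', '他写了一'), ('resolved', ['部', '长篇小说'])]
```

Because each lookup is a dictionary access on a bounded-length substring, the cost per character is small and the table is the only extra memory, which matches the low time and space cost claimed for the method.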
Conclusion
An extension of [Sun et al., 1999]
Adjusts the existing results on large corpora
Further verifies the properties on domain-specific corpora
A disambiguation strategy is proposed
Over 42% of overlapping ambiguities can be resolved without any mistake
Will be more effective when facing running text
References
Lafferty J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 282-289.
Li R., S.H. Liu, S.W. Ye, and Z.Z. Shi. 2001. A method for resolving overlapping ambiguities in Chinese word segmentation based on SVM and k-NN. Journal of Chinese Information Processing, 15(6): 13-18. (In Chinese)
Li M., J.F. Gao, C.N. Huang, and J.F. Li. 2003. Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation. In Proceedings of SIGHAN’2003, pages 1-7.
Sun M.S. and Z.P. Zuo. 1998. Overlapping ambiguities in Chinese text. Quantitative and Computational Studies on the Chinese Language, pages 323-338.
Sun M.S., C.N. Huang, and B.K.Y. T’sou. 1997. Using character bigram for ambiguity resolution in Chinese word segmentation. Computer Research and Development, 34(5): 332-339. (In Chinese)
Sun M.S., Z.P. Zuo and B.K.Y. T’sou. 1999. The role of high frequent maximal crossing ambiguities in Chinese word segmentation. Journal of Chinese Information Processing, 13(1): 27-37. (In Chinese)
Thank you
Any comments?