LP&IIS2013 PPT. Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics

LP&IIS 2013, Springer LNCS Vol. 7912, pp. 57–68

Aaron L.-F. Han, Derek F. Wong, and Lidia S. Chao

[email protected], {derekfw, lidiasc}@umac.mo

June 17th-18th, 2013, Warsaw, Poland

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

Department of Computer and Information Science

University of Macau

Motivation and related work in NER (CNER)

Problem analysis and the aim of this work

A study of Chinese characteristics (in PER, LOC, and ORG)

The designed and optimized feature set

Employed CRF model

Experiments

Comparison with related work

Performance of different sub-features

Formal definitions of the problems in CNER

Conclusion

References

• Research areas that benefit from named entity recognition:

Information extraction

text mining

machine translation

knowledge management

information retrieval, etc.

• The rapid development of NLP also promotes NER research

• Advances in computer technology allow the analysis of big data

storage capacity

computational power

• Ratinov and Roth [1] perform NER on English

using unlabeled text and Wikipedia gazetteers

• Sang and Meulder [2] conduct NER research on German

• Special applications of NER:

geological text processing [3]

biomedical named entity detection [4]

• Chinese NER (CNER) is more difficult. Why?

no word boundaries in Chinese sentences

• International CNER shared tasks under SIGHAN (Special Interest Group on Chinese Language Processing) and CIPS (Chinese Information Processing Society)

before 2008 [5][6]

• Chinese personal name disambiguation

after 2008 by SIGHAN [7][8]

• Explored methods on CNER:

Maximum Entropy [9][10][16]

Hidden Markov Model [11]

Support Vector Machine [12]

Conditional Random Field [13][15]

• Combination with other research tasks:

Word segmentation, sentence chunking, word detection [14]

• Problems in the employed methods:

Maximum Entropy: locally optimal solutions, label bias

Hidden Markov Model: strong independence assumptions

Support Vector Machine: low performance

Conditional Random Field: challenges in feature selection

• Problems in the research work:

More discussion of the algorithms, less of the issues specific to CNER

Different features are used with little or no explanation or background

Little analysis of Chinese characteristics

• The aim of this work:

• An introduction of Chinese characteristics

• Feature optimization based on linguistic analysis

PER, LOC, ORG

• Comparisons of the performances by different algorithms

• Issues analysis and problem formalization in CNER

• Chinese personal names (PER):

clear format: Surname Given-name (we use x+y)

• Chinese surnames: 11,939 recorded by the Chinese Academy of Sciences [19][20]:

5313 of which consist of one character

4311 of two characters

1615 of three characters

571 of four characters, etc.

• Chinese Given-name:

usually contains one or two characters as shown in Table 1.

Abbreviations used in Table 1: Pl: place; Bud: building; Org: organization; Suf: suffix; Abbr: abbreviation

• Chinese location names (LOC):

• Commonly used suffixes:

路(road), 區(district), 縣(county),

市(city), 省(province), 洲(continent), etc.

• Some standard formats, as in Table 1:

use building names

place + building

place + organization

Mix + suffix

abbreviations

• Chinese organization names (ORG):

• Some ORG entities contain suffixes

but the suffixes take many different forms and are not standardized

• Others have no apparent suffix:

they are named freely by the owners of the organization

e.g. 笑開花 (XiaoKaiHua, a small art association)

• Table 2 lists some kinds of ORG entities:

including administrative units, companies, arts, public services, associations, education and culture, etc.

• This suggests that ORG may be one of the more difficult categories

• X: the random variable over observation sequences

• Y: the corresponding label sequence

• P(Y|X): the conditional model

• G=(V, E): a graph G with vertex (node) set V and edge set E

• $Y = \{Y_v \mid v \in V\}$: Y is indexed by the vertices of G

• (X, Y) is a conditional random field model [24]:

• $P_\theta(y \mid x) \propto \exp\left( \sum_{e \in E,\,k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\,k} \mu_k g_k(v, y|_v, x) \right)$

$f_k$ and $g_k$ are the feature functions,

$\lambda_k$ and $\mu_k$ are the parameters to be trained
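To make the formula above concrete, here is a minimal sketch for the linear-chain special case, where vertices are character positions and edges join adjacent positions. The label set and all weights are hypothetical toy values, not parameters from this work, and the normalization is done by brute-force enumeration, which is only feasible for very short sequences.

```python
import itertools
import math

LABELS = ["B-LOC", "I-LOC", "N"]  # toy label set (assumption)

# Hypothetical weights: mu_k for vertex features, lambda_k for edge features.
NODE_W = {("杭", "B-LOC"): 2.0, ("州", "I-LOC"): 1.5}
EDGE_W = {("B-LOC", "I-LOC"): 1.0, ("N", "I-LOC"): -2.0}


def score(x, y):
    """Unnormalized log-score: sum of weighted vertex and edge indicator features."""
    total = 0.0
    for i, (char, label) in enumerate(zip(x, y)):
        # Vertex feature g_k(v, y|_v, x): fires on a (character, label) pair.
        total += NODE_W.get((char, label), 0.0)
        if i > 0:
            # Edge feature f_k(e, y|_e, x): fires on a (previous label, label) pair.
            total += EDGE_W.get((y[i - 1], label), 0.0)
    return total


def conditional_probability(x, y):
    """P(y | x), normalized by enumerating every possible label sequence."""
    z = sum(math.exp(score(x, cand))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(x, y)) / z


def best_labeling(x):
    """Highest-scoring label sequence, again by exhaustive search."""
    return max(itertools.product(LABELS, repeat=len(x)),
               key=lambda y: score(x, y))


if __name__ == "__main__":
    x = ["杭", "州"]
    print(best_labeling(x), conditional_probability(x, best_labeling(x)))
```

Practical toolkits replace the enumeration with dynamic programming (forward-backward for the normalizer, Viterbi for decoding); the exhaustive version is used here only because it mirrors the formula directly.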

• Training methods for CRFs include:

Iterative scaling algorithms [24]

Non-preconditioned conjugate-gradient [25]

Voted perceptron training [26]

Quasi-Newton algorithm [27], used in this work

online tool: http://crfpp.googlecode.com/svn/trunk/doc/index.html


• Data intro:

• To cover a wide range of named entity types

• Using the SIGHAN Bakeoff-4 corpora [6]

• Containing three kinds of entities: PER, LOC, and ORG

• CityU (traditional Chinese) and MSRA (simplified Chinese)

• Experiments are performed on the closed track (without using external resources)

• Detailed information on the training and test data is given in Tables 4 and 5.

• NE denotes the total over the three kinds of named entities

• OOV denotes the entities of the test data that do not exist in the training data, and Roov denotes the OOV rate

• Samples of the training corpus are shown in Table 6.

• In the test data, there is only one column, containing the Chinese characters
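The OOV rate (Roov) defined above can be sketched as follows. This is an illustration only: it assumes the rate is counted per entity mention in the test data, which the slides do not state explicitly, and the entity lists are made-up toy data.

```python
def oov_rate(train_entities, test_entities):
    """Fraction of test entity mentions whose string never occurs in training."""
    seen = set(train_entities)
    if not test_entities:
        return 0.0
    unseen = sum(1 for entity in test_entities if entity not in seen)
    return unseen / len(test_entities)


if __name__ == "__main__":
    # Toy entity lists (hypothetical, not taken from the CityU or MSRA corpora).
    train = ["北京", "上海"]
    test = ["北京", "杭州", "澳門", "上海"]
    print(oov_rate(train, test))
```

If the rate were instead counted over distinct entity strings, the test list would be deduplicated before counting.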

• Recognition results:

• Evaluation metrics:

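• $\mathrm{Precision} = \dfrac{\text{number of correct output}}{\text{number of total output}}$

• $\mathrm{Recall} = \dfrac{\text{number of correct output}}{\text{number of truth}}$

• $F\text{-}\mathrm{score} = \mathrm{Harmonic}(\mathrm{Precision}, \mathrm{Recall}) = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$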

• Evaluation is performed at the NE level (not token by token).

– E.g., if a token is supposed to be B-LOC but is labeled I-LOC instead, then this is not counted as a correct labeling
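A minimal sketch of NE-level scoring is given below. It assumes BIO-style tags with "N" for non-entity characters and counts an entity as correct only when its boundaries and type both match exactly. It illustrates the metric definitions and is not the official bakeoff scorer.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a tag sequence; end is exclusive."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["N"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An I- tag with no open entity of the same type is ignored here.
    return entities


def precision_recall_f(gold_tags, pred_tags):
    """Entity-level precision, recall and F-score."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    gold = ["B-LOC", "I-LOC", "N", "B-PER", "I-PER"]
    pred = ["I-LOC", "I-LOC", "N", "B-PER", "I-PER"]  # B-LOC mislabeled as I-LOC
    print(precision_recall_f(gold, pred))
```

In the usage example the mislabeled first token causes the whole LOC entity to be missed, so only the PER entity counts as correct even though four of five tokens carry the right tag.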

• Evaluation scores:

• Several main conclusions can be drawn:

• 1. The experimental results corroborate our analysis of the Chinese characteristics: PER and LOC have simpler structures and expressions, which makes them easier to recognize than ORG.

– The Roov of LOC (in Table 5) is the lowest (0.1857 and 0.0861 for CityU and MSRA, respectively), and the recognition of LOC performs very well (F-scores of 0.8599 and 0.8988, respectively).

– In the MSRA corpus, the Roov of ORG (0.3533) is larger than that of PER (0.3026), and the corresponding F-score of ORG is lower.

– In the CityU corpus, however, the Roov of ORG (0.4884) is much lower than that of PER (0.7850), yet the recognition of ORG is still worse (F-scores of 0.6646 for ORG and 0.8036 for PER).

• 2. The recognition of OOV entities is the principal challenge for automatic systems.

– The overall OOV rate in CityU (0.4882) is larger than in MSRA (0.2142), and the final F-score on CityU (0.7955) is correspondingly lower than on MSRA (0.8833).

• Comparison with baselines in Table 9:

• The baselines are produced by a left-to-right maximum match algorithm applied to the test data, using named entity lists generated from the training data.

• Our experiments yield much higher F-scores than the baselines.

• The baseline scores are unstable across entity types, resulting in total F-scores of 0.5955 and 0.6105 for the CityU and MSRA corpora, respectively.

• In contrast, our results are generally high for all three kinds of entities, without large fluctuations.

• This indicates that the approaches employed in this research are reasonable and effective.

• The improvements on ORG and PER are especially large on both corpora, leading to relative increases in total F-score of 33.6% and 44.7%, respectively.
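The left-to-right maximum match baseline described above can be sketched as follows. This is an illustrative reimplementation, not the bakeoff's baseline script. The entity dictionary is a hypothetical toy stand-in for the lists generated from the training data, and the "N" tag for non-entity characters is assumed.

```python
def max_match_tag(chars, entity_dict):
    """Tag characters by greedily matching the longest known entity at each position.

    entity_dict maps an entity string to its type (e.g. "LOC").
    """
    max_len = max((len(entity) for entity in entity_dict), default=0)
    tags = ["N"] * len(chars)
    i = 0
    while i < len(chars):
        matched = False
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_len, len(chars) - i), 0, -1):
            candidate = "".join(chars[i:i + length])
            if candidate in entity_dict:
                etype = entity_dict[candidate]
                tags[i] = "B-" + etype
                for j in range(i + 1, i + length):
                    tags[j] = "I-" + etype
                i += length
                matched = True
                break
        if not matched:
            i += 1  # no known entity starts here; move on one character
    return tags


if __name__ == "__main__":
    # Hypothetical entity list; in the baseline it would come from the training data.
    entity_dict = {"杭州": "LOC", "杭州西湖": "LOC", "大山": "PER"}
    print(max_match_tag(list("大山去了杭州西湖"), entity_dict))
```

Because the matcher only knows strings seen in training, it cannot label OOV entities at all, which is consistent with the low and unstable baseline scores reported above.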

• Comparison with related work:

• Related work uses different features (various window sizes),

• algorithms (CRF, ME, SVM, etc.),

• and external resources (external vocabularies, POS tools, name lists, etc.)

• The comparison is made on the MSRA corpus, with some works summarized in Table 10.

– This is because most researchers test only on the MSRA corpus

• The number n denotes a character position:

– the |n|th previous character when n<0

– the nth following character when n>0

– and the current token when n=0

– E.g., B(-10, 01, 12) means the three bigram features (previous and current character, current and next character, next two characters).
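The offset notation above can be illustrated with a small feature extractor. This is a sketch under assumptions: the unigram offsets in the usage example, the feature string format, and the `<PAD>` placeholder for positions outside the sentence are hypothetical choices, not the feature set used in this work.

```python
PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence


def char_at(chars, index):
    """Character at an absolute index, or PAD when the index is out of range."""
    return chars[index] if 0 <= index < len(chars) else PAD


def window_features(chars, pos, unigram_offsets, bigram_offsets):
    """Build unigram and bigram features around position pos.

    unigram_offsets: iterable of n (n<0 previous, n=0 current, n>0 following).
    bigram_offsets: iterable of (a, b) pairs, e.g. (-1, 0) for B(-10).
    """
    feats = []
    for n in unigram_offsets:
        feats.append(f"U[{n}]={char_at(chars, pos + n)}")
    for a, b in bigram_offsets:
        feats.append(f"B[{a},{b}]={char_at(chars, pos + a)}{char_at(chars, pos + b)}")
    return feats


if __name__ == "__main__":
    chars = list("杭州西湖")
    # B(-10, 01, 12) corresponds to the pairs (-1, 0), (0, 1), (1, 2).
    print(window_features(chars, 1, (-1, 0, 1), [(-1, 0), (0, 1), (1, 2)]))
```

Widening the window means adding more offsets to either list, which grows the feature space and the training cost.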

• From Table 10:

• when the feature window is smaller, the performance is worse.

• too large a window does not ensure good results

– it brings in noise and also costs more running time.

• external resources do not necessarily ensure better performance

– combining segmentation and POS offers more information about the test set; however, the segmentation and POS accuracy also influence system quality.

• the experiment in this paper yields promising results by employing an optimized feature set and a concise model.

• The performance of different sub-features in our experiments

• The corresponding results are given in Table 11

• Table 11 shows:

• Generally speaking, more features lead to longer training time, and when the feature set is small this also holds for the number of iterations.

• However, this does not hold when the feature set gets larger

– e.g. on the MSRA corpus, feature set (FS) FS4 needs 314 iterations, fewer than the 318 needed by FS2, although FS4 is the larger feature set.

– This may be because FS2 needs more iterations to converge to a fixed point.

• With the CRF algorithm, FS4 is chosen as the optimized feature set

– if we continue to expand the features, the recognition accuracy decreases, as shown in Table 11

• Because of the changeable and complicated characteristics of Chinese,

• there are some special combinations of characters that can be labeled in different ways, all of which are reasonable in practice.

• This causes some confusion for researchers.

• How do we deal with these problems?

• To facilitate further research, we introduce some formal definitions of the existing issues in CNER

• First, the Function-overload problem:

– (also called metonymy elsewhere)

• One word bears two or more meanings in the same text.

– E.g., the word “大山” (DaShan) is an organization name in the chunk “大山國際銀行” (DaShan International Bank), where the whole chunk denotes a company

– while “大山” (DaShan) also represents a person name in the sequence “大山悄悄地走了” (DaShan quietly went away), where the whole sequence describes a person's action

• It is difficult for the computer to distinguish these meanings and assign the corresponding labels (ORG or PER)

– they must be recognized through analysis of context and semantics.

• Furthermore, the Multi-segmentation problem in CNER:

• one sequence can be kept whole or segmented into several fragments according to different meanings, and the labeling differs accordingly.

– For example, the sequence “中興實業” (ZhongXing Corporation) can be labeled as a whole chunk, "B-ORG I-ORG I-ORG I-ORG", which means it is an organization name

– It can also be divided as “中興 / 實業” and labeled “B-ORG I-ORG / N N”, meaning that “中興” (ZhongXing) represents the organization entity and “實業” (Corporation) is a common Chinese word; this usage is widespread in Chinese documents.

• Another example of the Multi-segmentation problem in CNER:

– the sequence “杭州西湖” (Hang Zhou Xi Hu) can be labeled "B-LOC I-LOC I-LOC I-LOC" as a single place name

– but it can also be labeled "B-LOC I-LOC B-LOC I-LOC", because “西湖” (XiHu) is itself a place that belongs to the city “杭州” (HangZhou).

• Which label sequence should we select? Both are reasonable. This is a difficult problem for human annotators, let alone for computers.

• The problems discussed above are only some of those existing in CNER. If we can deal with them well, performance will improve in the future.

• This paper studies CNER, a difficult issue in NLP.

• The characteristics of Chinese named entities are introduced for personal names, location names and organization names respectively.

• With the CRF algorithm, the optimized features show promising performance compared with related work that uses different feature sets and algorithms.

• Furthermore, to facilitate further research, this paper discusses the problems existing in CNER and puts forward some formal definitions combined with instructive solutions.

• The results can be further improved in the open test by employing other high-quality resources and tools

– e.g. externally generated word-frequency counts, common Chinese surnames and internet dictionaries

• 1. Ratinov, L., Roth, D.: Design Challenges and Misconceptions in Named Entity Recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009), pp. 147–155. Association for Computational Linguistics Press, Stroudsburg (2009)

• 2. Sang, E.F.T.K., Meulder, F.D.: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In: HLT-NAACL, pp. 142–147. ACL Press, USA (2003)

• 3. Sobhana, N., Mitra, P., Ghosh, S.: Conditional Random Field Based Named Entity Recognition in Geological text. J. IJCA 1(3), 143–147 (2010)

• 4. Settles, B.: Biomedical named entity recognition using conditional random fields and rich feature sets. In: Collier, N., Ruch, P., Nazarenko, A. (eds.) International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104–107. ACL Press, Stroudsburg (2004)

• 5. Levow, G.A.: The third international CLP bakeoff: Word segmentation and named entity recognition. In: Proceedings of the Fifth SIGHAN Workshop on CLP, pp. 122–131. ACL Press, Sydney (2006)

• 6. Jin, G., Chen, X.: The fourth international CLP bakeoff: Chinese word segmentation, named entity recognition and Chinese pos tagging. In: Sixth SIGHAN Workshop on CLP, pp. 83–95. ACL Press, Hyderabad (2008)

• 7. Chen, Y., Jin, P., Li, W., Huang, C.-R.: The Chinese Persons Name Disambiguation Evaluation: Exploration of Personal Name Disambiguation in Chinese News. In: CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 346–352. ACL Press, BeiJing (2010)

• 8. Sun, L., Zhang, Z., Dong, Q.: Overview of the Chinese Word Sense Induction Task at CLP2010. In: CIPS-SIGHAN Joint Conference on CLP (CLP2010), pp. 403–409. ACL Press, BeiJing (2010)

• 9. Jaynes, E.: The relation of Bayesian and maximum entropy methods. J. Maximum-entropy and Bayesian Methods in Science and Engineering 1, 25–29 (1988)

• 10. Wong, F., Chao, S., Hao, C.C., Leong, K.S.: A Maximum Entropy (ME) Based Translation Model for Chinese Characters Conversion. J. Advances in Computational Linguistics, Research in Computer Science. 41, 267–276 (2009)

• 11. Ekbal, A., Bandyopadhyay, S.: A hidden Markov model based named entity recognition system: Bengali and Hindi as case studies. In: Ghosh, A., De, R.K., Pal, S.K. (eds.) PReMI 2007. LNCS, vol. 4815, pp. 545–552. Springer, Heidelberg (2007)

• 12. Mansouri, A., Affendey, L., Mamat, A.: Named entity recognition using a new fuzzy support vector machine. J. IJCSNS 8(2), 320 (2008)

• 13. Putthividhya, D.P., Hu, J.: Bootstrapped named entity recognition for product attribute extraction. In: EMNLP 2011, pp. 1557–1567. ACL Press, Stroudsburg (2011)

• 14. Peng, F., Feng, F., McCallum, A.: Chinese segmentation and new word detection using conditional random fields. In: Proceedings of the 20th international conference on Computational Linguistics (COLING 2004), Article 562. Computational Linguistics Press, Stroudsburg (2004)

• 15. Chen, W., Zhang, Y., Isahara, H.: Chinese named entity recognition with conditional random fields. In: Fifth SIGHAN Workshop on Chinese Language Processing, pp. 118–121. ACL Press, Sydney (2006)

• 16. Zhu, F., Liu, Z., Yang, J., Zhu, P.: Chinese event place phrase recognition of emergency event using Maximum Entropy. In: Cloud Computing and Intelligence Systems (CCIS), pp. 614–618. IEEE, ShangHai (2011)

• 17. Qin, Y., Yuan, C., Sun, J., Wang, X.: BUPT Systems in the SIGHAN Bakeoff 2007. In: Sixth SIGHAN Workshop on CLP, pp. 94–97. ACL Press, Hyderabad (2008)

• 18. Feng, Y., Huang, R., Sun, L.: Two Step Chinese Named Entity Recognition Based on Conditional Random Fields Models. In: Sixth SIGHAN Workshop on CLP, pp. 120–123. ACL Press, Hyderabad (2008)

• 19. Yuan, Yida, Zhong, W.: Contemporary Surnames. Jiangxi people’s publishing house, China (2006)

• 20. Yuan, Yida, Qiu, J., Zhang, R.: 300 most common surname in Chinese surnames-population genetic and population distribution. East China Normal University Publishing House, China (2007)

• 21. Huang, D., Sun, X., Jiao, S., Li, L., Ding, Z., Wan, R.: HMM and CRF based hybrid model for chinese lexical analysis. In: Sixth SIGHAN Workshop on CLP,

• 22. Sun, G.-L., Sun, C.-J., Sun, K., Wang, X.-L.: A Study of Chinese Lexical Analysis Based on Discriminative Models. In: Sixth SIGHAN Workshop on CLP, pp. 147–150. ACL Press, Hyderabad (2008)

• 23. Yang, F., Zhao, J., Zou, B.: CRFs-Based Named Entity Recognition Incorporated with Heuristic Entity List Searching. In: Sixth SIGHAN Workshop on CLP,


• 24. Lafferty, J., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceeding of 18th International Conference on Machine Learning, pp. 282–289. DBLP, Massachusetts (2001)

• 25. Shewchuk, J.R.: An introduction to the conjugate gradient method without the agonizing pain. Technical Report CMUCS-TR-94-125, Carnegie Mellon University (1994)

• 26. Collins, M., Duffy, N.: New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL 2002), pp. 263–270. Association for Computational Linguistics Press, Stroudsburg (2002)

• 27. The Numerical Algorithms Group. E04 - Minimizing or Maximizing a Function, NAG Library Manual, Mark 23 (retrieved 2012)

• 28. Zhao, H., Liu, Q.: The CIPS-SIGHAN CLP2010 Chinese Word Segmentation Bakeoff. In: CIPS-SIGHAN Joint Conference on CLP, pp. 199–209. ACL Press, BeiJing (2010)

• 29. Zhou, Q., Zhu, J.: Chinese Syntactic Parsing Evaluation. In: CIPS-SIGHAN Joint Conference on CLP (CLP 2010), pp. 286–295. ACL Press, BeiJing (2010)

• 30. Xu, Z., Qian, X., Zhang, Y., Zhou, Y.: CRF-based Hybrid Model for Word Segmentation, NER and even POS Tagging. In: Sixth SIGHAN Workshop on CLP, pp. 167–170. ACL Press, India (2008)
