7
LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition Mingyi Liu Harbin Institute of Technology Harbin, China [email protected] Zhiying Tu Harbin Institute of Technology Harbin, China [email protected] Zhongjie Wang Harbin Institute of Technology Harbin, China [email protected] Xiaofei Xu Harbin Institute of Technology Harbin, China [email protected] ABSTRACT In recent years, deep learning has achieved great success in many natural language processing tasks including named entity recogni- tion. The shortcoming is that a large amount of manually-annotated data is usually required. Previous studies have demonstrated that both transfer learning and active learning could elaborately reduce the cost of data annotation in terms of their corresponding advan- tages, but there is still plenty of room for improvement. We assume that the convergence of the two methods can complement with each other, so that the model could be trained more accurately with less labelled data, and active learning method could enhance trans- fer learning method to accurately select the minimum data samples for iterative learning. However, in real applications we found this approach is challenging because the sample selection of traditional active learning strategy merely depends on the final probability value of its model output, and this makes it quite difficult to evalu- ate the quality of the selected data samples. In this paper, we first examine traditional active learning strategies in a specific case of BERT-CRF that has been widely used in named entity recognition. Then we propose an uncertainty-based active learning strategy called Lowest Token Probability (LTP) which considers not only the final output but also the intermediate results. We test LTP on multiple datasets, and the experiments show that LTP performs better than traditional strategies (incluing LC and NLC) on both token-level F 1 and sentence-level accuracy, especially in complex imbalanced datasets. KEYWORDS active learning, named entity recognition, transfer learning, CRF 1 INTRODUCTION Over the past few years, papers applying deep neural networks (DNNs) to the task of named entity recognition (NER) have achieved noteworthy success [3], [11],[13].However, under typical training procedures, the advantages of deep learning are established mostly relied on the huge amount of labeled data. When applying these methods on domain-related tasks, their main problem lies in their need for considerable human-annotated training corpus, which requires tedious and expensive work from domain experts. Thus, to make these methods more widely applicable and easier to adapt to various domains, the key is how to reduce the number of manually annotated training samples. Both transfer learning and active learning are designed to reduce the amount of data annotation. However, the two methods work differently. Transfer Learning is the migration of trained model parameters to new models to facilitate the new model training. We can share the learned model parameters into the new model in a certain way to accelerate and optimize the learning efficiency of the model, instead of learning from zero. So transfer learning could help to achieve better results on a small dataset. However, it should be noted that transfer learning works well only when the sample distributions of the source and target domain are similar. While significant distribution divergence might cause a negative transfer. Unlike the supervised learning setting, in which samples are selected and annotated at random, the process of Active Learning employs one or more human annotators by asking them to label new samples that are supposed to be the most informative in the creation of a new classifier. The greatest challenge in active learning is to determine which sample is more informative. The most common approach is uncertainty sampling, in which the model preferentially selects samples whose current prediction is least confident. Quite a lot of works have been done to reduce the amount of data annotation for NER tasks through either transfer learning or active learning, but few researches have combined these two techniques to reduce labeling cost and avoid negative transfer. In this work, we try to integrate a widely used transfer learning based NER model, called Bert-CRF, with active learning. When evaluating the effect of NER, most of the works only use the value of the token-level F 1 score or entity-level F 1 score. However, in some cases, this could be misleading, especially for languages that do not have a natural separator, such as Chinese. And the NER task is often used to support downstream tasks, which prefer that all entities in the sentence are correctly identified. Figure 1 shows an example where only one token gets the wrong label (the corresponding token-level and entity-level F 1 values are 0.947 and 0.857). When this wrong result is used in the user intention understanding, the phone will be considered as the user demand rather than the phone cases. So in this work, we not only evaluate the token-level F 1 score but also the sentence-level accuracy. arXiv:2001.02524v1 [cs.CL] 8 Jan 2020

LTP: A New Active Learning Strategy for Bert-CRF Based ... · LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition Mingyi Liu Harbin Institute of Technology

  • Upload
    others

  • View
    14

  • Download
    1

Embed Size (px)

Citation preview

Page 1: LTP: A New Active Learning Strategy for Bert-CRF Based ... · LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition Mingyi Liu Harbin Institute of Technology

LTP: A New Active Learning Strategy for Bert-CRF BasedNamed Entity Recognition

Mingyi LiuHarbin Institute of Technology

Harbin, [email protected]

Zhiying TuHarbin Institute of Technology

Harbin, [email protected]

Zhongjie WangHarbin Institute of Technology

Harbin, [email protected]

Xiaofei XuHarbin Institute of Technology

Harbin, [email protected]

ABSTRACTIn recent years, deep learning has achieved great success in manynatural language processing tasks including named entity recogni-tion. The shortcoming is that a large amount of manually-annotateddata is usually required. Previous studies have demonstrated thatboth transfer learning and active learning could elaborately reducethe cost of data annotation in terms of their corresponding advan-tages, but there is still plenty of room for improvement. We assumethat the convergence of the two methods can complement witheach other, so that the model could be trained more accurately withless labelled data, and active learning method could enhance trans-fer learning method to accurately select the minimum data samplesfor iterative learning. However, in real applications we found thisapproach is challenging because the sample selection of traditionalactive learning strategy merely depends on the final probabilityvalue of its model output, and this makes it quite difficult to evalu-ate the quality of the selected data samples. In this paper, we firstexamine traditional active learning strategies in a specific case ofBERT-CRF that has been widely used in named entity recognition.Then we propose an uncertainty-based active learning strategycalled Lowest Token Probability (LTP) which considers not onlythe final output but also the intermediate results. We test LTP onmultiple datasets, and the experiments show that LTP performsbetter than traditional strategies (incluing LC and NLC) on bothtoken-level F1 and sentence-level accuracy, especially in compleximbalanced datasets.

KEYWORDSactive learning, named entity recognition, transfer learning, CRF

1 INTRODUCTIONOver the past few years, papers applying deep neural networks(DNNs) to the task of named entity recognition (NER) have achievednoteworthy success [3], [11],[13].However, under typical trainingprocedures, the advantages of deep learning are established mostlyrelied on the huge amount of labeled data. When applying thesemethods on domain-related tasks, their main problem lies in theirneed for considerable human-annotated training corpus, whichrequires tedious and expensive work from domain experts. Thus, tomake these methods more widely applicable and easier to adapt to

various domains, the key is how to reduce the number of manuallyannotated training samples.

Both transfer learning and active learning are designed to reducethe amount of data annotation. However, the two methods workdifferently.

Transfer Learning is the migration of trained model parametersto new models to facilitate the new model training. We can sharethe learned model parameters into the new model in a certain wayto accelerate and optimize the learning efficiency of the model,instead of learning from zero. So transfer learning could help toachieve better results on a small dataset. However, it should benoted that transfer learning works well only when the sampledistributions of the source and target domain are similar. Whilesignificant distribution divergence might cause a negative transfer.

Unlike the supervised learning setting, in which samples areselected and annotated at random, the process of Active Learningemploys one ormore human annotators by asking them to label newsamples that are supposed to be the most informative in the creationof a new classifier. The greatest challenge in active learning is todetermine which sample is more informative. The most commonapproach is uncertainty sampling, in which the model preferentiallyselects samples whose current prediction is least confident.

Quite a lot of works have been done to reduce the amount of dataannotation for NER tasks through either transfer learning or activelearning, but few researches have combined these two techniques toreduce labeling cost and avoid negative transfer. In this work, wetry to integrate a widely used transfer learning based NER model,called Bert-CRF, with active learning.

When evaluating the effect of NER, most of the works onlyuse the value of the token-level F1 score or entity-level F1 score.However, in some cases, this could be misleading, especially forlanguages that do not have a natural separator, such as Chinese.And the NER task is often used to support downstream tasks, whichprefer that all entities in the sentence are correctly identified. Figure1 shows an example where only one token gets the wrong label(the corresponding token-level and entity-level F1 values are 0.947and 0.857). When this wrong result is used in the user intentionunderstanding, the phone will be considered as the user demandrather than the phone cases. So in this work, we not only evaluatethe token-level F1 score but also the sentence-level accuracy.

arX

iv:2

001.

0252

4v1

[cs

.CL

] 8

Jan

202

0

Page 2: LTP: A New Active Learning Strategy for Bert-CRF Based ... · LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition Mingyi Liu Harbin Institute of Technology

,,

Mingyi Liu and Zhiying Tu, et al.

Figure 1: An example of F1 score leading to misunderstand-ing.

We first experiment with the traditional uncertainty-based activelearning algorithms, and then we proposed our own active learn-ing strategy based on the lowest token probability with the bestlabeling sequence. Experiments show that our selection strategy issuperior to traditional uncertainty-based active selection strategiesin multiple Chinese datasets both in token-level F1 score and overallsentence-level accuracy. Especially in the case of a large number ofentity types.

Finally, wemake empirical analysis with different active selectionstrategies and give some suggestions for using them.

The remainder of this paper is organized as follows. In Section2 we summarize the related works in transfer learning and activelearning. In section 3 we introduce an NER model called Bert-CRF,and the active learning framework. Section 4 describes in detailsthe active learning strategies we propose. Section 5 describes theexperimental setting, the datasets, and discusses the empirical re-sults.

2 RELATEDWORK2.1 Named entity recognitionThe framework of NER using deep neural network can be regardedas a composition of encoder and decoder. For encoders, there aremany options. Collobert et al. [5] first used convolutional neu-ral network (CNN) as the encoder. Traditional CNN cannot solvethe problem of long-distance dependency. In order to solve thisproblem, RNN[17], BiLSTM[9] , Dilated CNN[24] and bidirectionalTransformers[8] are proposed to replace CNN as encoder. For de-coders, someworks used RNN for decoding tags [15], [17]. However,most competitive approaches relied on CRF as decoder[11],[27].

2.2 Transfer learningTransfer learning could help have satisfied results on small datasets.There are two methods to apply the pre-training language modelto downstream tasks. Feature-based approach (eg. Word2Vec[16],ELMO[18]) that includes pre-trained representations as additionalfeatures into embedding. Fine-tuning approach (eg. GPT[19], BERT[8])that fine-tunes the pre-trained parameters in the specific down-stream tasks. In this work, we use BERT (as encoder) in upstream todo pre-training and CRF (as decoder) in downstream to fine-tune.

2.3 Active learningActive learning strategies have beenwell studied [7], [1], [23]. Thesestrategies can be grouped into following categories: Uncertainty

sample [12][6][20][10], query-by-committee[22] [25], informationdensity[26], fisher information[21]. There were some works thatcompared the performance of different types of selection strategiesin NER/sequence labeling tasks with CRF model [2] [14] [21][4].These results show that, in most case, uncertainty-based methodsperform better and cost less time. However, we found that thesetraditional uncertainty-based strategies did not perform well withtransfer learning. So, we propose our own uncertainty-based strat-egy in this work.

3 NER MODEL3.1 Architecture

Figure 2: The overall framework

The entire learning processing is shown in Figure 2ãĂĆ Thereare two main stages in the process, model training (discussed indetail in this section) and sample selection (discussed in detail inSection 4). Through multiple iterations of two stages, we can getthe ideal results in quite low annotation cost.

The architecture of the NER network comprises of an inputlayer, a pre-training language model, a fully connected layer andfinally a CRF (conditional random field) layer, which simulates labeldependencies in the output. It is must be noted that while somework have been done with Bert-BiLSTM-CRF taht replace the fullconnectivity layer in the Bert-CRF with BiLSTM layer. However,we found in the experiment that there was no significant differencebetween the performance of Bert-BiLSTM-CRF and Bert-CRF, andthe network structure of Bert-BiLSTM-CRF is more complex thanBert-CRF and with slower training speed. So Bert-CRF was selectedin this paper.

3.2 Data RepresentationWe represent each input sentence following Bert format; Each to-ken in the sentence is marked with BIO scheme tags. Special [CLS]and [SEP] tokens are added at the beginning and the end of thetag sequence, respectively. [PAD] tokens are added at the end ofsequences to make their lengths uniform. The formatted sentencein length N is denoted as x =< x1,x2, . . . ,xN >, and the corre-sponding tag sequence is denoted as y =< y1,y2, . . . ,yN >.

Page 3: LTP: A New Active Learning Strategy for Bert-CRF Based ... · LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition Mingyi Liu Harbin Institute of Technology

LTP: A New Active Learning Strategy for Bert-CRF BasedNamed Entity Recognition

,,

Table 1: Example of data representation. [PAD] tag are not shown.

Sentence Trump was born in the United StatesTag [CLS] B-PER O O O B-LOC I-LOC I-LOC [SEP]

3.3 Bert & CRF LayerBert is one of the most successful pre-training language models,here we use it as the character-level encoder. For each character xiin the input sequence x, bert will convert it into a fixed-length vectorw. CRF are statistical graphical models which have demonstratedstate-of-art accuracy on virtually all of the sequence labeling tasksincluding NER task. Particularly, we use linear-chain CRF that is apopular choice for tag decoder, adopted by most DNNs for NER.

A linear-chain CRF model defines the posterior probability of ygiven x to be:

P(y|x;A) = 1Z (x) exp

(h1(y1; x) +

n−1∑k=1

hk+1(yk+1; x) +Ayk ,yk+1

)(1)

where Z (x) is a normalization factor over all possible tags of x, andhk (yk ; x) (#TODO: ambiguous?) indicates the probability of takingthe yk tag at position k which is the output of the previous softmaxlayer. A is a parameter called a transfer matrix, which can be setmanually or by model learning. In our experiment, we let the modellearn the parameter by itself. Ayk ,yk+1 means the probability of atransition from tag states yk to yk+1. We use y∗ to represent themost likely tag sequence of x:

y∗ = argmaxy

P(y|x) (2)

The parameters A are learnt through the maximum log-likelihoodestimation, that is to maximize the log-likelihood function ℓ oftraining set sequences in the labeled data set L:

ℓ(L;A) =L∑l=1

log P(y(l) |x(l);A) (3)

where L is the size of the tagged set L.

4 ACTIVE LEARNING STRATEGIESThe biggest challenge in active learning is how to select instancesthat need to be manually labeled. A good selection strategy ϕ(x),which is a function used to evaluate each instance x in the unlabeledpool U, will select the most informative instance x.

Algorithm 1 illustrate the entire pool-based active learning pro-cess. In the remainder of this section, we describe various querystrategy formulations of ϕ(·) in detail.

4.1 Least Confidence (LC)Culotta and McCallum employ a simple uncertainty-based strategyfor sequence models called least confidence(LC), which sort exam-ples in ascending order according to the probability assigned bythe model to the most likely sequence of tags:

ϕLC (x) = 1 − P(y∗ |x;A) (4)

This confidence can be calculated using the posterior probabilitygiven by Equation 1. Preliminary analysis revealed that the LC

Algorithm 1 Pool-based active learning frameworkRequire: Labeled data set L,

unlabeled data poolU,selection strategy ϕ(·),query batch size B

while not reach stop condition do// Train the model using labeled set Ltrain(L);for b = 1 to B do

//select the most informative instancex∗ = argmaxx∈U ϕ(x)L = L∪ < x∗, label(x∗) >U = U − x∗

end forend while

strategy prefer selects longer sentences:

P(y∗ |x;A) ∝ exp

(h1(y∗1 ; x) +

n−1∑k=1

hk+1(y∗k+1; x) +Ay∗k ,y

∗k+1

)(5)

Since Equation 5 contains summation over tokens, LC method nat-urally favors longer sentences. Although the LC method is verysimple and has some shortcomings, many works prove the effec-tiveness of the method in sequence labeling tasks.

4.2 Normalized Least Confidence (NLC)As mentioned in Section 4.1, LC favors longer sentences whichrequires more labor for annotation. To overcome this drawback, wenormalize the confidence as follow:

ϕNLC (x) = 1 − 1NP(y∗ |x;A) (6)

where N is the length of x.

4.3 Lowest Token Probability (LTP)Inspired by Minimum Token Probability (MTP) strategies thatselect the most informative tokens, regardless of the assignmentperformed by CRF. This strategy greedily samples the tokens whosehighest probability among the labels is lowest:

ϕMTP (x) = 1 −mini

maxj

hi (yi = j |x;A) (7)

hi (yi = j |x;A) is the probability that j is the label at position i inthe sequences.

Unlike MTP, we believe that the sequence selected by CRF isvaluable. We look for the most probable sequence assignment, andhope that each token in the sequence has a high probability.

ϕLT P (x) = 1 − miny∗i ∈y∗

hi (y∗i |x;A) (8)

Page 4: LTP: A New Active Learning Strategy for Bert-CRF Based ... · LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition Mingyi Liu Harbin Institute of Technology

,,

Mingyi Liu and Zhiying Tu, et al.

Table 2: Training(Testing) Data Statistics. #S is the number of total sentences in the dataset, #T is the number of tokens in thedataset, #E is the number of entity types, ASL is the average length of a sentence, ASE is the average number of entities in asentence, AEL is the average length of a entity, %PT is the percentage of tokens with positive label,%AC is the percentage of asentences with more than one entity, %DAC is the percentage of sentences that have two or more entities.

Corpus #S #T #E ASL ASE AEL %PT %AC %DAC

People’s Daily 38950(16608)

1651613(701518)

3(3)

42.4(42.2)

1.45(1.47)

3.24(3.22)

11.1%(11.2%)

57.8%(58.3%)

34.9%(36.0%)

Boson-NER 7348(3133)

378865(160693)

6(6)

51.5(51.2)

2.20(2.19)

3.99(4.00)

17.1%(17.1%)

73.2%(72.9%)

50.3%(50.0%)

OntoNotes-5.0(bn-zh)

7637(917)

368803(45601)

18(18)

48.2(49.7)

3.45(3.71)

3.14(3.07)

22.4%(23.0%)

87.5%(89.6%)

71.8%(75.02%)

We proposed our select strategy called Lowest Token Probability(LTP), which selects the tokens whose probability under the mostlikely tag sequence y∗ is lowest.

5 EXPERIMENTS5.1 DatasetsWe have experimented and evaluated the active learning strategiesof Section 4 on three Chinese datasets. People’s Daily is a collectionof newswire articles annotated with three entities: person, organi-zation, location. Boson-NER1 is a set of online news annotationspublished by bosonNLP, which contains 6 entities, such as per-son,product, time. OntoNotes-5.0 Chinese data (bn part) (#TODO:CITE) which contains 18 entities. All corpora are formatted in the“BIO" sequence representation(#TODO cite). Table 2 shows somestatistics of the datasets in terms of dimensions, number of entitytypes, distribution of the labels, etc.

5.2 Experimental SettingWe random chose an initial training set L1 of 99 sentences onPeople’s Daily dataset, 75 sentences on Boson-NER dataset, 76sentences on OntoNotes-5.0 dataset. The dimension of the batchupdate B has been seen as a trade-off between an ideal case in whichthe system is retrained after every single annotation and a practicalcase with higher B to limit the algorithmic complexity and improvemanual labeling efficiency in real-world. In our experiment, B issetting to 200. We fixed the number of active learning iterations at12 because of each algorithm does not improve obviously after 12iterations.

In the NER Model, we use BERT-Base-Chinese2 which has110M parameters. The training batch size is set to 32, and themax_seq_lenдth is set to 80. We set the learning rate to 0.00001.For each iteration, we train 30 epoch to ensure model convergence.And other parameters related to BERT are set to default values.In the fully connected layer, we set the dropout rate to be 0.9 toprevent overfitting. The transfer matrix in CRF is also left to themodel to learn by itself.

We empirically compare the selection strategy proposed in Sec-tion 4, as well as the uniformly random baseline (RAND). Weevaluate each selection strategy by constructing learning curvesthat plot the overall F1 (for tokens,O tag doesn’t in the metric) and

1https://bosonnlp.com/resources/BosonNLP_NER_6C.zip2https://github.com/google-research/bert

accuracy (for sentences). In order to prevent the contingency ofthe experiment, we have done 10 experiments for each selectionstrategy using different random initial training sets L1. All resultsare averaged across these experiments.

5.3 ResultsFrom the learning curves of Figure 3, it is clear that all active learn-ing algorithms perform better than the random baseline on People’sDaily and Boson-NER datasets. On OntoNote5.0 dataset, LC andNLC are better than RAND in the early iteration. Our approachLTP performs best on all datasets. Figure 3 also shows that com-bining transfer learning with appropriate active learning strategy,the amount of data that needs to be labeled can be greatly reduced.For example, LTP achieves 99% performance of FULL using only4.3% (5.0%, 5.4%) of training sentences (tokens, entities) on People’sDaily dataset, 22.8% (31.1%, 40.8%) of training sentences (tokens, en-tities) on Boson-NER dataset and 19.32% (26.1%, 30.2%) of trainingsentences (tokens, entities) on OntoNote5.0 dataset.

Figure 4 shows the results of sentence-level accuracy on threedatasets. The results exceeded our expectations and were very in-teresting. Firstly, the results confirm that the token-level F1 value issometimes misleading what we mentioned in Section 1. For exam-ple, on Boson-NER dataset, althought the token-level F1 scores ofLTP and LC are similar in the last few iterations, the sentence-levelaccuracy is 2% difference. Secondly, in the experiment, LTP is muchbetter than the rest of the methods, and can be close to the use ofall training data especially on complex datasets (eg. Boson-NRR,OntoNotes). Thirdly, LC andNLC perform poorly at sentence-levelaccuracy.

The two operations that are time-consuming in the actual se-quence labeling process are the amount of text that the labeler needsto read and the amount of text that the labeler needs to label. InFigure 5, we use the total number of tokens and the total number ofentities to represent these two factors. One can see that comparedwith LC, LTP can achieve better performance with less label effort.

6 DISCUSSION AND SUGGESTIONIn this section, we will discuss possible reasons for the gap betweendifferent selection strategies. Then we give some suggestions onhow to choose different active learning strategies in practical appli-cations.

Page 5: LTP: A New Active Learning Strategy for Bert-CRF Based ... · LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition Mingyi Liu Harbin Institute of Technology

LTP: A New Active Learning Strategy for Bert-CRF BasedNamed Entity Recognition

,,

(a) People’s Daily (b) Boson-NER (c) OntoNote5.0 Chinese

Figure 3: Token-level F1 results on three datasets

(a) People’s Daily (b) Boson-NER (c) OntoNote5.0 Chinese

Figure 4: Sentence-level accuracy on three corpus

(a) People’s Daily (b) Boson-NER (c) OntoNote5.0 Chinese

Figure 5: Total number of selected tokens and entities.

Table 3: Entity Distribution in People’s Daily Trainingdataset.

Person Location Organization23.85% 48.88% 27.27%

Table 4: Entity Distribution in Boson-NER Training dataset.

PER LOC ORG TIM PRODUCT COMPANY22.92% 21.46% 12.46% 18.79% 13.63% 10.75%

We first give the detailed distribution of the entities under differ-ent datasets as shown in Table 3 - 5. One can see that the biggest dif-ference between OntoNote dataset and the remaining two datasets

is that the entity distribution is extremely unbalanced. The numberof GPE is approximately 162 times the number of LANGUAGE inthe OntoNote5.0 dataset.

Due to page limitations, we are unable to present the analysisresults for all datasets here. So we select the first 6 iterations ofOntoNote dataset for explanation, for two reasons:

(1) On the OntoNote dataset, the difference in the effect of thedifferent strategies is most obvious.

(2) The first 6 iterations have large preformace changes.

Figure 7 shows the deviation between the sample distributionselected by different selection strategies and the overall sampledistribution. We can see two differences between LTP and otheractive learning strategies in sampling:

Page 6: LTP: A New Active Learning Strategy for Bert-CRF Based ... · LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition Mingyi Liu Harbin Institute of Technology

,,

Mingyi Liu and Zhiying Tu, et al.

Table 5: Entity Distribution in OntoNote5.0 Training dataset.

GPE QUANTITY DATE EVENT PERSON CARDINAL29.20% 1.11% 11.84% 1.10% 18.43% 10.21%NORP ORG FAC ORDINAL TIME MONEY4.07% 13.80% 1.86% 1.36% 2.83% 0.96%WORK_OF_ART LOC PRODUCT LAW PERCENT LANGUAGE0.83% 2.75% 0.36% 0.46% 0.62% 0.18%

Figure 6: Offset of adjacent two iterations of selection

(1) For entities with high overall proportion (such as GPE, PER-SON, ORG), LTP samples consistently below overall propor-tion.

(2) For entities with low overall proportion (such as LAW, LAN-GUAGE, PRODUCT ), LTP samples consistently above overallproportion.

This sampling strategy is consistent with our intuition, that is, toreduce the sampling frequency of the samples that are already welllearned, and increase the sampling frequency of the samples thatare not sufficiently learned.

Finally, we explore the stability of sampling with different ac-tive selection strategies. We use two adjacent sampling offsets toindicate stability:

o f f seti =∑e ∈E

|proi (e) − proi−1(e)| (9)

Where E is a set of entity types, proi (e) represents the proportion ofentity class e in iteration i . Figure 6 shows this result. We found thatLTP can obtain stable sampling distribution faster, which meansthat LTP is more stable than other active learning strategies.

Based on the discussion above, we give several suggestions here.

(1) If your dataset is simple, the distribution of different entitiesis relatively balanced (eg. People’s Daily), and you don’t needto worry about sentence-level accuracy. Transfer learningis sufficient (although adding active learning at this pointimproves performance slightly, but incurs the cost of modelretraining).

(2) If your data set is complex, there are many entity classesinvolved, and the distribution of different entities is uneven(eg. OntoNote5.0), then choose LTP.

(3) If you need to care about sentence-level accuracy, alwaysuse LTP.

7 CONCLUSIONWe proposed a new active learning strategy for Bert-CRF basednamed entity recognition. The experiment shows that comparedwith the traditional active selection strategies, our strategy hasbetter performance, especially in complex datasets. Further, weanalyze the different selection strategies and give some Suggestionson how to use them.

8 ACKNOWLEDGMENTSREFERENCES[1] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. 2014. The power of

localization for efficiently learning linear separators with noise. In Proceedings ofthe forty-sixth annual ACM symposium on Theory of computing. ACM, 449–458.

[2] Yukun Chen, Thomas A. Lasko, Qiaozhu Mei, Joshua C. Denny, and Hua Xu.2015. A study of active learning methods for named entity recognition in clinicaltext. Journal of Biomedical Informatics 58 (2015), 11 – 18. https://doi.org/10.1016/j.jbi.2015.09.010

[3] Jason PCChiu and Eric Nichols. 2016. Named entity recognitionwith bidirectionalLSTM-CNNs. Transactions of the Association for Computational Linguistics 4(2016), 357–370.

[4] Vincent Claveau and Ewa Kijak. 2018. Strategies to Select Examples for ActiveLearning with Conditional Random Fields. In Computational Linguistics and Intel-ligent Text Processing, Alexander Gelbukh (Ed.). Springer International Publishing,Cham, 30–43.

[5] Ronan Collobert, JasonWeston, Léon Bottou,Michael Karlen, Koray Kavukcuoglu,and Pavel Kuksa. 2011. Natural language processing (almost) from scratch.Journal of machine learning research 12, Aug (2011), 2493–2537.

[6] AronCulotta andAndrewMcCallum. 2005. Reducing labeling effort for structuredprediction tasks. In AAAI, Vol. 5. 746–751.

[7] Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. 2005. Analysis ofperceptron-based active learning. In International Conference on ComputationalLearning Theory. Springer, 249–263.

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert:Pre-training of deep bidirectional transformers for language understanding. arXivpreprint arXiv:1810.04805 (2018).

[9] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models forsequence tagging. arXiv preprint arXiv:1508.01991 (2015).

[10] Seokhwan Kim, Yu Song, Kyungduk Kim, Jeong-Won Cha, and Gary GeunbaeLee. 2006. Mmr-based active machine learning for bio named entity recognition.In Proceedings of the Human Language Technology Conference of the NAACL,Companion Volume: Short Papers. Association for Computational Linguistics,69–72.

[11] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami,and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. InProceedings of NAACL-HLT. 260–270.

[12] David D Lewis and Jason Catlett. 1994. Heterogeneous uncertainty sampling forsupervised learning. In Machine learning proceedings 1994. Elsevier, 148–156.

[13] Nut Limsopatham and Nigel Henry Collier. 2016. Bidirectional LSTM for namedentity recognition in Twitter messages. (2016).

[14] Diego Marcheggiani and Thierry Artières. 2014. An Experimental Comparisonof Active Learning Strategies for Partially Labeled Sequences. In EMNLP.

[15] Grégoire Mesnil, Xiaodong He, Li Deng, and Yoshua Bengio. 2013. Investigationof recurrent-neural-network architectures and learning methods for spokenlanguage understanding.. In Interspeech. 3771–3775.

[16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013.Distributed representations of words and phrases and their compositionality. InAdvances in neural information processing systems. 3111–3119.

[17] Thien Huu Nguyen, Avirup Sil, Georgiana Dinu, and Radu Florian. 2016. Towardmention detection robustness with recurrent neural networks. arXiv preprint

Page 7: LTP: A New Active Learning Strategy for Bert-CRF Based ... · LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition Mingyi Liu Harbin Institute of Technology

LTP: A New Active Learning Strategy for Bert-CRF BasedNamed Entity Recognition

,,

Figure 7: Comparison of selected entities distribution and overall entities distribution for first 6 iterations onOntoNote dataset.

arXiv:1602.07749 (2016).[18] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher

Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized wordrepresentations. arXiv preprint arXiv:1802.05365 (2018).

[19] Alec Radford, Karthik Narasimhan, Tim Salimans, and IlyaSutskever. 2018. Improving language understanding by genera-tive pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf (2018).

[20] Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hiddenmarkov models for information extraction. In International Symposium on Intelli-gent Data Analysis. Springer, 309–318.

[21] Burr Settles and Mark Craven. 2008. An analysis of active learning strategies forsequence labeling tasks. In Proceedings of the conference on empirical methods innatural language processing. Association for Computational Linguistics, 1070–1079.

[22] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. 1992. Query bycommittee. In Proceedings of the fifth annual workshop on Computational learningtheory. ACM, 287–294.

[23] Yanyao Shen, Hyokun Yun, Zachary C Lipton, Yakov Kronrod, and AnimashreeAnandkumar. 2017. Deep active learning for named entity recognition. arXivpreprint arXiv:1707.05928 (2017).

[24] Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fastand accurate entity recognition with iterated dilated convolutions. arXiv preprintarXiv:1702.02098 (2017).

[25] Jennifer Vandoni, Emanuel Aldea, and Sylvie Le Hégarat-Mascle. 2019. Evidentialquery-by-committee active learning for pedestrian detection in high-densitycrowds. International Journal of Approximate Reasoning 104 (2019), 166–184.

[26] Kai Wei, Rishabh Iyer, and Jeff Bilmes. 2015. Submodularity in data subsetselection and active learning. In International Conference on Machine Learning.1954–1963.

[27] Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270 (2016).