Chinese Verb Tense?
Using English Parallel Data to Map Tense onto Chinese and Subsequent Tense Classification.
Master’s Thesis
Presented to
The Faculty of the Graduate School of Arts and Sciences Brandeis University
Department of Computer Science Graduate Program in Computational Linguistics
Nianwen Xue, Advisor
In Partial Fulfillment of the Requirements for
Master’s Degree
by Elizabeth Baran
February 2013
Acknowledgements
I want to thank my advisor, Nianwen Xue, for all of the opportunities and support he has
provided over the past couple years. Thank you for giving me an outlet to grow and further
develop my passion for languages. What I have learned has been invaluable.
I would also like to thank my family for their constant love and support throughout this
process.
ABSTRACT
Chinese Verb Tense?
Using English Parallel Data to Map Tense onto Chinese and Subsequent Tense Classification.
A thesis presented to the Department of Computer Science Graduate School of Arts and Sciences
Brandeis University Waltham, Massachusetts
By Elizabeth Baran
We explore time in Chinese by mapping tense information from a manually-aligned English
parallel corpus onto Chinese verbs. We construct a detailed mapping procedure to accurately
convey tense in English through combinations of word tokens and parts-of-speech and then
transfer that information onto verbs in Chinese. We explore the resulting Chinese data set and
discuss the pros and cons of this mapping technique. Using this Chinese data set, augmented
with tense, we attempt to automatically predict the tense of each verb in Chinese using a
Conditional Random Fields algorithm along with a suite of linguistic features. We include an
algorithm for extracting and associating time expressions to verbs and integrate that as a
feature into our tense prediction algorithm. We achieve a 34% accuracy gain over our baseline
as well as a much deeper understanding of how tense can transfer between English and Chinese
in a translation environment.
Table of Contents
Introduction ................................................................................................................................. 1
Related Work ............................................................................................................................... 6
Data .............................................................................................................................................. 8
Tense Map Procedure .................................................................................................................. 9
Mapping Procedure ................................................................................................................... 15
The Problem with Translation .................................................................................................... 18
Comparison to Automatically Aligned Data ............................................................................... 20
Time Expressions........................................................................................................................ 22
Time Expression Recognition ................................................................................................. 22
Linking Time Expressions to Verbs ......................................................................................... 26
Tense Prediction ........................................................................................................................ 28
Features ................................................................................................................................. 28
Results .................................................................................................................................... 32
Conclusion and Future Work ..................................................................................................... 35
Bibliography ................................................................................................................................... 37
List of Tables
Table 1: Top 30 English Tags Aligned to Chinese Verbs ................................................................. 10
Table 2: English Tense Detailed Mapping Rules ............................................................................ 16
Table 3: Tag Distributions on Manually Aligned Data versus Automatically Aligned Data ........... 21
Table 4: Time Expression Link Data Files ....................................................................................... 26
Table 5: Results Compared to Baseline ......................................................................................... 33
Table 6: Feature Significance ......................................................................................................... 33
Table 7: Precision, Recall, and F1 Scores for Each Tag................................................................... 34
List of Illustrations/Figures
Figure 1 ............................................................................................................................................ 1
Figure 2 ............................................................................................................................................ 2
Figure 3 ............................................................................................................................................ 2
Figure 4: Main Data Files ................................................................................................................. 8
Figure 5: English POS Tags that Align to Chinese VA ..................................................................... 11
Figure 6: English POS Tags that Align to Chinese VC ...................................................................... 12
Figure 7: English POS Tags that Align to Chinese VE ...................................................................... 13
Figure 8: English POS Tags that Align to Chinese VV ..................................................................... 13
Figure 9: An Example Parallel Sentence with Verbs Highlighted ................................................... 18
Figure 10: Data Split ....................................................................................................................... 22
Figure 11: Description of TimeChar Characters ............................................................................. 24
Figure 12: Frequency Distribution of Normalized Time Expressions ............................................. 25
Figure 13: Example of Features in a Syntactic Tree ........................................................................ 32
INTRODUCTION
Understanding tense in a language that has none is challenging, and to some may seem futile. However, the attractiveness of a unifying grammatical theory makes it worth exploring from both a linguistic and a computational standpoint. Chinese is a tense-less language, meaning that verbs in Chinese do not inflect for temporal changes in context. Many other languages do mark tense, usually with inflections on the verb, and one of them is English. The examples below demonstrate the contrast between Chinese and English.
Figure 1
我 今天 早上 喝 了 咖啡 。 I drank coffee this morning.
我 正在 喝 咖啡 呢 。 I am drinking coffee.
我 上 床 前 , 要 再 喝 一 杯 咖啡 。 I will drink another cup of coffee before bed.
In Figure 1, the verb in English changes form in the three different temporal contexts of past, present, and future, but the verb in Chinese stays exactly the same. Tense is not solely limited to past, present, and future, even though these may be the most common. Although Chinese is tense-less, there are still strong motivations to understand what tense would mean in Chinese, since time, including past, present, and future, is very much conveyed and understood in Chinese.
One of these motivations falls within the domain of Machine Translation. When translating from a language like Chinese to a language like English, tense must be created: the Chinese representation of tense is virtually null, while the English representation has past, present, and future. Current MT models can handle this to some extent, but fail in obvious ways in other cases. When overt contextual clues exist in Chinese to signal a corresponding tense in English, models should be able to interpret them accurately. One state-of-the-art commercial translator, Google Translate1, translates the following sentences as shown in Figures 2 and 3.
Figure 2
我 下 个 月 参加 会议
I next M month participate meeting
Google: month to attend the meeting.
In this translation, the verb "to attend" was left in its base, or infinitive, form even though "next month" clearly signals a future time. The incoherence of the English output shows that this is incorrect, and the gloss supports this: "to attend" should have been translated into the English future tense, and this should have been clear from the time expression "next month".
Figure 3
她们 昨天 吃饭 了 。
They yesterday eat AS .
Google: They eat yesterday.
Again, the verb "to eat" was left in its base form even though the temporal word "yesterday", coupled with the aspectual marker "le", signals past action.
1 www.translate.google.com
Tense is also important in that it is a by-product of time in a language. Chinese may be tense-less, but it is certainly not void of temporal expression; time is merely represented differently, particularly through temporal adverbs, aspectual markers, and syntactic constructions. By studying tense in Chinese we are inevitably studying the way time is represented in Chinese. This can lead to better techniques for NLP tasks such as summarization and event ordering.
If we accept the assumption that tense is a by-product of time in languages that have it, we
begin to understand how a careful projection of tense onto Chinese can be informative not only
from a cross-linguistic perspective, but also as a tool to understand temporal reference in
Chinese, independently. We will be making these assumptions as we refer to tense in Chinese
throughout this thesis. Also, although we will often refer exclusively to the verb as a beacon for tense, it should be understood that temporal reference can apply to whole portions of text; the verb carries a main portion of it, but not all of it.
Aspect is another mechanism often employed together with tense. In English, tense and aspect go hand-in-hand in that they are usually realized as inflections on the verb. This thesis will consider aspect inasmuch as it makes sense to do so as we compare verb inflections in English with their counterparts in Chinese.
The motivation for this thesis is to explore tense and aspect for applied purposes. There are
many who devote themselves to defining temporal reference in a deeper, more comprehensive
way, as for TimeML (Pustejovsky et al., 2004), and ultimately this will be the ideal scenario with
Chinese too, but this is not the goal of this thesis. How we define tense here is not meant as an
end-all classification but more as an exploration into the features of the Chinese temporal
system. The results may indirectly serve as evidence for and/or against different temporal
categorization schemes in Chinese but this is not our primary goal. We are concerned more with
how we can specify and use temporal information in Chinese for many NLP tasks, which include
Machine Translation, document summarization, and event-ordering, just to name a few.
Time is represented in a myriad of ways throughout the world’s languages, which is why a
unifying theory between all of them may seem intimidating. If the ultimate goal is to find a
unified semantic representation of time that can easily map to all of the world's languages, the
goal of this thesis is to begin to explore the practical applications of this for Chinese. We ask
ourselves, to what extent can we use parallel information from other languages to uncover
temporal information in Chinese? How inter-operable are temporal systems cross-lingually? And
we do this under the experimental design of attempting to automatically predict tense in
Chinese using information from parallel English data.
We begin our exploration of Chinese tense by initially mapping English tense and aspect
information onto parallel Chinese data and attempting to automatically predict tense in Chinese.
We use a Conditional Random Fields algorithm with a suite of lexical, syntactic, and linguistic
features that we believe will inform the tagger on temporal context, which will in turn serve to
predict the tense that we have prescribed to Chinese.
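To make this setup concrete, the sketch below shows how per-token feature dictionaries of the kind a CRF-style tagger consumes might be built. The feature names, the window size, and the toy sentence are illustrative assumptions of ours, not the actual feature set used in this thesis (that set is described later in FEATURES).

```python
def verb_features(tokens, pos_tags, i, window=2):
    """Build a feature dictionary for the token at index i, using the
    token and POS tag plus a symmetric context window around it."""
    feats = {"word": tokens[i], "pos": pos_tags[i]}
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats["word[%+d]" % offset] = tokens[j]
            feats["pos[%+d]" % offset] = pos_tags[j]
    return feats

tokens = ["我", "昨天", "吃饭", "了"]      # "I yesterday eat AS"
pos_tags = ["PN", "NT", "VV", "AS"]
print(verb_features(tokens, pos_tags, 2))
```

A sequence of such dictionaries, one per verb instance, is the typical input format for CRF toolkits; here the aspect marker 了 surfacing as the `pos[+1]` feature is exactly the kind of local temporal clue the tagger can exploit.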
The rest of the thesis is organized as follows. In RELATED WORK we discuss work that has been
done on Chinese-English tense mapping and/or Chinese tense prediction and how this compares
with our research. In DATA we discuss the type of data we used to conduct the mappings from
English and to automatically predict tense in Chinese. In TENSE MAP PROCEDURE we discuss our
mapping schema for projecting English tense onto Chinese. We look at some of the issues with
using parallel data for this task in THE PROBLEM WITH TRANSLATION as well as why we insisted on
using manually aligned data in COMPARISON TO AUTOMATICALLY ALIGNED DATA. In TIME EXPRESSIONS,
we describe the process of extracting time expressions that are used as features in predicting
tense. We look at the rest of the features for our tense prediction algorithm in FEATURES. Finally
we discuss our results and conclude in RESULTS and CONCLUSION AND FUTURE WORK.
RELATED WORK

Some work has been done on automatic tense prediction in Chinese, and several have
attempted to use Chinese-English parallel data to do so.
Similar to our approach, Liu et al. (2011) use Chinese-English parallel data to expand the data set.
They consider the four basic tenses of present, past, future, and infinitive, focusing on absolute
time as opposed to reference time. Their data consists of POS-tagged and parsed English
sentences, POS-tagged Chinese sentences, and English-Chinese word-alignments. They consider
only verbs in Chinese that have non-conflicting verb mappings in English. If no mapping exists or
several mappings exist but are inconsistent with each other, the verb is no longer considered.
They justify capturing time expressions through local bigram features, but no explicit time expression recognition is performed. They use a suite of basic lexical and syntactic
features to train a Maximum Entropy classifier with a Gaussian prior of 0.1 and max iterations of
100. They perform an iterative bootstrapping algorithm on those results, and are able to achieve
significant improvement over the baseline accuracy of 56.52%, which was calculated by assigning
the majority tag for each verb. They point out the issue of error propagation through faulty
word tokenization and part-of-speech tagging. They also point out the need to construct a
deeper feature set.
Xue (2008) looked at tense in Chinese without the help of parallel data. Instead he performed
annotation on 5709 verb instances using a tag set, created in-house, that included past, present,
future, future-in-past, and none. He used a Maximum Entropy algorithm with a set of his own
lexical and linguistic features to train and test a tense prediction algorithm. He achieved an F1
score of 67.1 over a baseline of using the most frequent tag, which was 62.4. He did not
consider time expressions; however, he did have a feature that looked at NTs, a type of temporal
noun in Chinese.
Ye et al. (2006) use Chinese-English parallel data with manual alignments to obtain verb tenses.
They discount Chinese verbs that are not aligned to English verbs in their data set. Beyond that,
they do not detail their mapping procedure but have three different tags: past, present, and
future. They use telicity, punctuality, and temporal ordering features to increase accuracy for
tense prediction as well as the typical lexical and syntactic features. They achieve F1 scores
of .627, .896, and .572 for present, past, and future respectively.
We can see that many choose to exploit the availability of Chinese-English parallel data and the
fact that English has tense information to explore tense in Chinese. However, there is little discussion of how this tense information is actually mapped over to the Chinese side. Those
who are familiar with the Penn Treebank data set (Marcus, Santorini, & Marcinkiewicz, 1993)
know that there are no POS tags that directly refer to tense. The tags instead reflect the surface
form of the English verb, which in turn generally corresponds to a certain tense. This mapping,
however, is not one-to-one, and there needs to be some rule set that can interpret these tag
combinations to logically transform them into what we understand as tense (e.g. past, present,
future) and then tense and aspect combinations (e.g. past progressive, present perfect). We
hope to provide a more detailed and transparent account of our mapping process and show
how this may influence the validity of the data.
DATA

Our primary data set was a relatively small number of Chinese-English parallel data files from
the Chinese Treebank. The data set is small because we opted for manual alignments over
automatic alignments to decrease error propagation during this early stage. This was the
manually aligned data that was available to us. We will be using the parallel English POS
information to essentially create a gold data set for Chinese, so we wanted this to be as accurate
as possible, at least in this first attempt. The data files are listed in Figure 4 below.
Figure 4: Main Data Files
Chinese Treebank Files
8, 11-14, 17-20, 23-24, 26, 28, 30-33, 35-37, 43-44, 46-49, 51, 53-64, 66, 68, 71, 73-74, 76, 79, 81-84, 86-87, 89, 91, 93-95, 97-98, 101-104, 107-109, 111, 113, 115-116, 123, 126, 130-132, 134-138, 142-143, 146-150, 153-156, 159-169, 208-215, 217-218, 221-223, 229-230, 232-234, 236-242, 245-246, 249-251, 255-256, 258-259, 261, 263, 265, 267, 268-269, 301, 304, 306, 311-314, 316-318, 320, 323
TENSE MAP PROCEDURE

The motivation for using English parallel data to formulate tense in Chinese is first and foremost
due to the lack of tense data in Chinese. Therefore, our intent is to only use the English data
during this mapping stage to create a tense-labeled gold standard for Chinese. The English data
will not be used later for feature selection during the automatic tense prediction stage.
To create this gold standard, we first examine the English POS tag alignments to Chinese verbs.
As mentioned earlier, there is no English tense POS tag, so we must use a set of manually
constructed rules to transform the tags that do exist in English into what we understand as
English tense and aspect.
In Table 1 below, we show the top 30 English tag alignments to Chinese verbs, along with an example word token that corresponds to each part-of-speech combination. Note that often more than one English word is aligned to a single Chinese word. In those cases, tags are joined with hyphens.
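As an illustration of this hyphen-joining convention, the snippet below collapses a multi-word English alignment into a single tag pattern. The data structures (a list of aligned English token indices and a parallel list of POS tags) are assumptions made for the sake of the example.

```python
def joined_tag(aligned_indices, en_pos_tags):
    """Join the POS tags of all English tokens aligned to one Chinese
    word into a single hyphen-separated pattern, e.g. "TO-VB"."""
    return "-".join(en_pos_tags[i] for i in sorted(aligned_indices))

# "will allow" aligned to a single Chinese verb yields "MD-VB",
# one of the patterns listed in Table 1.
en_pos_tags = ["PRP", "MD", "VB", "NNS"]   # "It will allow changes"
print(joined_tag([1, 2], en_pos_tags))     # MD-VB
```

Sorting the indices keeps the pattern in English surface order regardless of the order in which the alignment pairs were recorded.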
Table 1: Top 30 English Tags Aligned to Chinese Verbs
English Tag Example Frequency
VBD reached 806
VBG increasing 587
JJ good 476
VB promote 449
TO-VB to be 417
VBZ is 363
NN development 345
VBN approved 308
VBZ-VBN has been 227
IN from 209
MD should 152
VBP are 146
DT-NN the establishment 123
VBP-VBN have been 122
VBD-IN served as 116
VBD-VBN had chosen 112
RB friendly 94
IN-VBG in implementing 86
NNS holds 76
MD-VB will allow 74
VB-VBN have become 70
IN-NN at maturation 56
VBG-IN participating in 56
VBD-RP asked about 43
VBZ-IN accounts for 43
IN-DT-NN in the implementation 39
VBP-IN belong to 33
TO-VB-VBN to be built 32
VBZ-VBG is diverging 32
JJR cheaper 24
From this table we can see how seemingly unintuitive part-of-speech alignments often make sense once we look at the corresponding word tokens. Grammatical categories in Chinese are much fuzzier than they are in English. A word that is a verb in one context can be a noun in another and vice versa. The same goes for adjectives, adverbs, etc. Context plays an important role
in distinguishing parts-of-speech in Chinese, more so than in English, because of a relative lack of morphology in Chinese.
When we create our mapping rule set, we must consider how all of these tag combinations will
translate into tense. A simple example is the most commonly aligned tag “VBD” which would
translate to simple past. VBD-VBN (i.e. the past form and the past participle) might be past
perfect if the auxiliary verb is “to have” as in the example above, e.g. “had chosen”, or just
simple past in the passive voice if the auxiliary verb is “to be”, e.g. “were held”. At this point we
can see that simple part-of-speech mappings alone will be insufficient for extracting tense information from the data. Instead, we need to consider a combination of features including not only part-of-speech tags but also a selection of word tokens that are relevant to tense.
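As a minimal sketch of this point, the function below distinguishes tenses that share the same POS pattern by inspecting the auxiliary word tokens. The rule subset and the returned labels are illustrative only; the full rule set is given later in Table 2.

```python
def map_past(aligned):
    """aligned: list of (token, POS tag) pairs on the English side.
    Disambiguates VBD/VBN patterns by looking at the auxiliary token."""
    tokens = [tok.lower() for tok, _ in aligned]
    tags = [tag for _, tag in aligned]
    if tags and tags[-1] in ("VBD", "VBN"):
        if "had" in tokens:
            return "past perfect"            # e.g. "had chosen"
        if "was" in tokens or "were" in tokens:
            return "simple past (passive)"   # e.g. "were held"
        if len(aligned) == 1:
            return "simple past"             # e.g. "reached"
    return "other"

print(map_past([("had", "VBD"), ("chosen", "VBN")]))
print(map_past([("were", "VBD"), ("held", "VBN")]))
```

Both inputs end in the same VBD/VBN tag pair, yet the auxiliary token alone decides between past perfect and passive simple past, which is exactly why the mapping rules must combine tags with tokens.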
To explore these alignments further, we looked at which individual English tags aligned to each
of the four verb types in Chinese, predicative adjectives (VA), verb copulas (VC), existential verbs
(VE), and regular verbs (VV). The results are displayed in the figures below.
Figure 5: English POS Tags that Align to Chinese VA
A VA in Chinese represents a predicative adjective, and this adjective quality is particularly
apparent with the alignments we see in our data. VAs rarely translate to verbs in English so
these results were expected. They are also significantly different from the other three verb tags
both intuitively and as evidenced by our data. This is the most important factor in our
consideration of what a verb is in Chinese, and what compels our decision to eventually leave
the VA tag out of the verb category when we conduct tense mappings. There are 557 VA tags
represented in the graph above and a total of 651 in the entire document. The discrepancy is
due to unspecified alignments, which could be either deliberate or in error.
Figure 6: English POS Tags that Align to Chinese VC
There are 374 VC tags in the data, of which 336 have alignments and are represented in the graph above. A VC is a verb copula, most notably 是 in Chinese, which can translate to "is", "was",
“be”, etc. in English.
(Bar chart omitted; x-axis tags in order of frequency: VBZ, VBD, VBP, IN, VB, VBN, MD, CD, NNP, TO, JJ, NNS, VBG, NN, POS, DT.)
Figure 7: English POS Tags that Align to Chinese VE
A VE is an existential verb, usually the verb 有, which translates to “to have” and can also serve
the same purpose as “there is” or “there are” in English.
Figure 8: English POS Tags that Align to Chinese VV
VV is all other verbs and is therefore the largest category of verbs.
(Figure 7 bar chart omitted; aligned tags in order of frequency: VBP, EX, VBZ, VBD, VBN, VB, JJ, IN, RB, DT, NN, MD, VBG, CD, TO, WDT.)

(Figure 8 bar chart omitted; the most frequently aligned tags include VB, VBN, VBD, IN, VBG, NN, VBZ, TO, VBP, DT, MD, JJ, NNS, RP, RB, JJR, NNP, and CD.)

We were surprised to see a number of Chinese verbs with no alignment to an English word, especially given that our data was manually aligned. Since verbs carry significant meaning, we were skeptical that these missing alignments were actually correct. We explore this issue further in subsequent sections.
MAPPING PROCEDURE

We map tense information onto Chinese using a manually constructed rule set whose input is a combination of aligned POS tags and word tokens from the English parallel corpus. Of all the verbs that had alignments, there were 471 unique tag combinations. When we take VAs out of the mix of verbs, we still have 463 unique tag combinations. This number is considerably high given the number of available tags and the types of intuitively legal combinations that can be made.
Our mapping rules for Chinese tense are based directly on what we understand about tense and aspect in English. The theoretical foundations for this approach can certainly be debated, but we maintain that this is at the least one of the perspectives to consider when regarding tense in an otherwise tense-less language. Given our understanding of the types of verb tags that exist in English, coupled with our understanding of how auxiliary verbs form tense, we create the following transformation rules.
Table 2: English Tense Detailed Mapping Rules

past (subtotal 1510)
  simple, output VPAST (1487): danced [VBD|VBN]; was/were cleaned ["was|were"-VBN|VBD]; there was/were [VEX-"was|were"]
  perfect, output VPAST1 (20): had danced ["had"-VBN|VBD]; had been cleaned ["had"-"been"-VBN|VBD]
  progressive, output VPAST2 (3): was/were dancing ["was|were"-VBG]; was/were being cleaned ["was|were"-"being"-VBN|VBD]
  perfect progressive, output VPAST3 (0): had been dancing ["had"-"been"-VBG]

present (subtotal 1201)
  simple, output VPRES (726): dances/dance [VBZ|VBP]; am/is/are watched ["am|is|are"-VBD|VBN]; there is/are [VEX="is|are"]
  perfect, output VPRES1 (412): has/have danced ["has|have"-VBD|VBN]; has/have been cleaned ["has|have"-"been"-VBD|VBN]
  progressive, output VPRES2 (51): is/are dancing ["am|is|are"-VBG]; is/are being cleaned ["am|is|are"-"being"-VBN|VBD]
  perfect progressive, output VPRES3 (12): has/have been dancing ["has"-"been"-VBG]

future (subtotal 260)
  simple, output VFUTR (258): will dance [MD="will"-VB]; will be cleaned [MD="will"-"be"-VBN|VBD]; am/is/are going to dance ["am|is|are"-"going"-TO-VB]; am/is/are going to be watched ["am|is|are"-"going"-TO-"be"-VBN|VBD]; there will be [VEX-MD="will"-be]
  perfect, output VFUTR1 (0): will have danced [MD="will"-"have"-VBN|VBD]; will have been watched [MD="will"-"have"-"been"-VBN|VBD]; am/is/are going to have danced ["am|is|are"-"going"-TO-"have"-VBN|VBD]
  progressive, output VFUTR2 (2): will be dancing [MD="will"-"be"-VBG]
  perfect progressive, output VFUTR3 (0): will have been dancing [MD="will"-"have"-"been"-VBG]

infinitive, output VINF (981): to dance [TO-VB]; dance [VB]
gerund, output VBG (722): walking [^VBG]
other, output VOTHER (1689): e.g. the improvement; any combination not taken care of above
no map, output VNOMAP (747): no alignment to English
predicative adjective, output VA (610): Chinese tag VA

ALL: 7720
We enumerate 12 different tense and aspect combinations for English, as well as the infinitive form. In order to classify all words that are originally considered verbs in Chinese, we use the tags that come at the end of the chart. These include the gerund (VBG), which is ambiguous in terms of tense when it occurs on its own; all mappings that are not taken care of in our tense rule set (e.g. verbs that map to nouns or prepositions); and the predicative adjective (VA), which is considered a verb in Chinese but which we factored out of the mapping process since it tends to display more adjective and adverb qualities than verb qualities (see Figure 5). There are a total of 7720 verbs given these tags.
There are several other things to note about this mapping. First off, the tags VBD and VBN for
past tense form and past participle, respectively, are often used interchangeably in our
mappings. This is mostly to reduce human annotation errors that may result from confusion
between the past tense and past participle form of a verb, which are the same for most regular
verbs in English. Also to reduce error, if a specific word token is a necessary part of the tag
pattern (e.g. “has” or “been”), we attempt to match only the word token and ignore the POS tag.
That way, in the case where the tag happens to be incorrect, we don’t miss out on catching the
right token. We also accounted for passive voice. So “he ate” is simple past in active voice, and
“it was eaten” is also simple past but in the passive voice. For future tense, we consider not only
the use of the modal “will” which is the traditional future tense indicator, but also the “is going
to” structure. You’ll note, however, that some of these combinations are so rarely, if ever, said in English that they were deliberately left out, e.g. “the move is going to have been being watched at that time.”
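The rules in Table 2 can be read as patterns over a normalized sequence of aligned English items, where a quoted token in the table is matched literally and a bare tag is matched as a POS tag. The sketch below applies a small illustrative subset of those rules; the regexes, the token list, and the normalization scheme are our own simplifications, not the exact implementation.

```python
import re

# Each rule is a regex over a hyphen-joined key in which tense-relevant
# word tokens are kept literally and all other items contribute only
# their POS tag. Only a handful of Table 2's patterns are shown here.
RULES = [
    (re.compile(r"^will-(be-)?(VB|VBN|VBD)$"), "VFUTR"),     # will dance / will be cleaned
    (re.compile(r"^had-(been-)?(VBN|VBD)$"), "VPAST1"),      # had danced / had been cleaned
    (re.compile(r"^(was|were)-(VBN|VBD)$"), "VPAST"),        # was/were cleaned
    (re.compile(r"^(VBD|VBN)$"), "VPAST"),                   # danced
    (re.compile(r"^TO-(VB|be)$"), "VINF"),                   # to dance / to be
]

TENSE_TOKENS = {"will", "had", "has", "have", "was", "were",
                "am", "is", "are", "been", "being", "be", "going"}

def normalize(aligned):
    """Keep the literal token when it is tense-relevant, else its POS tag."""
    return "-".join(tok.lower() if tok.lower() in TENSE_TOKENS else tag
                    for tok, tag in aligned)

def map_tense(aligned):
    key = normalize(aligned)
    for pattern, label in RULES:
        if pattern.match(key):
            return label
    return "VOTHER"   # any combination not handled above

print(map_tense([("will", "MD"), ("be", "VB"), ("cleaned", "VBN")]))  # VFUTR
print(map_tense([("had", "VBD"), ("chosen", "VBN")]))                 # VPAST1
```

Matching literal auxiliaries rather than their POS tags mirrors the error-reduction strategy described above: even if "had" were mistagged, the token itself still triggers the past perfect rule.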
THE PROBLEM WITH TRANSLATION
The problem with using parallel data for this task is the same problem we see when we try to
project any language onto another, which is that we tend to overlook the characteristics that
are unique to the language we are studying. Chinese and English are very different syntactically,
morphologically, phonetically, orthographically, and also historically and culturally. When we
use parallel English data, we are using a once-removed translation, and we must be conscious of the fact that it does not mean exactly the same thing as our source language. Furthermore,
translations can vary immensely between different translators and all could be considered
correct depending on the context. For our purposes, the best kind of translation would have
been one that is more literal and most likely less fluent in English. Unfortunately, this was not
exactly the type of data we had. Although the Chinese Treebank translations are more literal
than most published translations, there are many subtle inaccuracies that may only be
perceptible to someone who is doing a task like the one we are performing here. Below we
show an example of how translation can alter source meaning and how this may have caused
holes or inaccuracies in some of our mappings.
Figure 9: An Example Parallel Sentence with Verbs Highlighted
由 浙江 医科院 院长 、 中国 科学 院士 毛江森 主持 1 在 世界 上 率先 研究 2 成功 3 , 并 具
有 4 国际 先进 水平 的 甲肝 减毒 活 疫苗 , 去年 经 卫生部 批准 5 正式 投入 6 生产 和 使用 ,
目前 该 区 生产 7 此 疫苗 的 普康 公司 已 形成 8 年 产 9 五百万 人份 的 生产 规模 , 这 对
有效 10 地 控制 11 甲肝 流行 具有 重大 意义 。
The internationally advanced hepatitis " A " active toxin reducing vaccine successfully researched and produced for
the first time in the world and headed by Jiangsen Mao , President of Zhejiang Medical Institute and Chinese
Academy of Science member , was put into production and usage after official approval by the Ministry of Public
Health last year . At present , the Pu Kang Company , which produces the vaccine in this zone , has already formed a
production scale of 5 million doses per year , which has great significance in effectively controlling the hepatitis A
epidemic .
The interesting verbs in this sentence are verbs 3, 4, 5, and 9.
Verb 3 is a predicative adjective (VA) in Chinese, which we left out of our mapping precisely for
the reason demonstrated here; VAs tend to translate into adjectives and other modifiers, but
rarely into verbs. Here it modifies “researched” as an adverb. This is the same scenario as for
verb 10 which modifies the verb “controlling” as an adverb.
Verb 4 is not even represented in the English translation. It means “to have” or “to possess”,
and corresponds to the first segment of the passage translated as “the internationally advanced
hepatitis ‘A’ active toxin reducing vaccine”. A more literal translation, which would have better
suited our purposes, would have been something like “the active toxin-reducing hepatitis A
vaccine that possesses an internationally advanced level”. We can see how this translation is
awkward but encapsulates the verb “to possess” which would have been more accurately
tagged as VPRES.
Verb 5 also demonstrates a translation choice that resulted in a poor tense mapping. Verb 5
means “to approve”. The translation nominalizes the verb and therefore loses tense information.
A more literal and accurate translation would have been something like “last year, [the vaccine]
was approved by the Ministry of Public Health to formally enter production and usage”.
Nominalization was entirely avoidable for this verb, which may not always be the case, and
because of this translation choice, we lost valuable tense information.
Finally, verb 9 is not represented in the translation. It literally means “to produce”, but is used
somewhat idiosyncratically before the number amount of production for that year. Here this
verb is preceded by the word meaning “year” so together you could translate this as “[the
company], yearly, produces…”. The given translation seems to capture this with “per year” but
unfortunately doesn’t make the necessary alignment. In any case, it is not a verb alignment so it
would not provide any tense information. This is a case again where a more literal translation
would have been more helpful.
This sentence was chosen at random to demonstrate the types of problems we encounter even with manual alignments. The mapping procedure is by no means perfect, as the number of VNOMAP and VOTHER tags present in the data shows, but when mappings do exist they tend to be accurate.
COMPARISON TO AUTOMATICALLY ALIGNED DATA

To further explore our decision to use manual alignments over automatic alignments to create
our gold data set, we looked at 50,000 lines of automatically aligned Chinese Treebank data and
performed the same mapping procedure. We found that our mapping technique degrades drastically when applied to automatically aligned data: the number of VNOMAP and VOTHER tags is overwhelmingly higher. Table 3 shows the distribution.
Table 3: Tag Distributions on Manually Aligned Data versus Automatically Aligned Data

Tag       Manual Alignment   Automatic Alignment
VPAST     19.26%             0.35%
VPAST1    0.26%              0.08%
VPAST2    0.04%              0.41%
VPAST3    0.00%              0.00%
VPRES     9.40%              1.48%
VPRES1    5.34%              1.83%
VPRES2    0.66%              0.02%
VPRES3    0.16%              0.00%
VFUTR     3.34%              0.40%
VFUTR1    0.00%              0.00%
VFUTR2    0.03%              0.07%
VFUTR3    0.00%              0.00%
VINF      12.71%             0.08%
VBG       9.35%              0.25%
VOTHER    21.88%             77.79%
VNOMAP    9.68%              10.93%
VA        7.90%              6.31%

A color scale is used to mimic the distribution. The darker the green, the higher the
distribution. As you can see, there is much more green in the manual alignment column. The only green in the automatic alignment column is in the VOTHER, VNOMAP, and VA rows, which shows how much noisier the data is for this task. Furthermore, the proportion of VOTHER tags jumped drastically, from 22% to 78%. In other words, verbs that carry tense account for 38.5% of all verbs in the manually aligned data, but only 4.6% of all verbs in the automatically aligned data. Even assuming that the automatic alignments are accurate, this gives us very little to work with.
TIME EXPRESSIONS
Time expressions, which include temporal adverbs and phrases, are an important part of
interpreting the temporal location of verbs in Chinese. In order to make use of them we need to
first identify them and then link them to the correct verbal context.
TIME EXPRESSION RECOGNITION

For the recognition step, we followed an approach similar to the TIRSemZh method described
in Llorens et al. (2011). The primary difference was that we did not use semantic roles, and were
able to achieve slightly better accuracy.
Our data was taken from the TempEval-2 task2. The training and testing split is shown in
Figure 10 below.
Figure 10: Data Split
Chinese Treebank Training Files Chinese Treebank Testing Files
chtb_0031, chtb_0032, chtb_0033, chtb_0038, chtb_0040, chtb_0043, chtb_0049, chtb_0053, chtb_0059, chtb_0067, chtb_0071, chtb_0072, chtb_0073, chtb_0077, chtb_0080, chtb_0087, chtb_0088, chtb_0097, chtb_0112, chtb_0118, chtb_0128, chtb_0129, chtb_0130, chtb_0139, chtb_0143, chtb_0144, chtb_0147, chtb_0249, chtb_0251, chtb_0252, chtb_0259, chtb_0279, chtb_0291, chtb_0309
chtb_0544, chtb_0590, chtb_0592, chtb_0593, chtb_0594, chtb_0595, chtb_0596, chtb_0600, chtb_0604, chtb_0605, chtb_0615, chtb_0616, chtb_0618, chtb_0621, chtb_0628
2 http://semeval2.fbk.eu/semeval2.php?location=data
We framed this task as a simple IOB recognition task and trained a Conditional Random Fields
algorithm using the crfsuite package (Okazaki, 2007), which is a first-order Markov model
implementation. If a word began a time expression, it was given the label “B”. If a word was
inside of a time expression but was not the first word, it was given the label “I”. Any word
outside of a time expression was labeled “O”. We used the following features to train our
algorithm:
Features for Timex Extraction
1. WORD: The current word.
2. POS: The part-of-speech of the current word.
3. PREV_POS: The part-of-speech of the previous token.
4. NEXT_POS: The part-of-speech of the next token.
5. NORMALIZED: The character string of the word with all digits substituted with a D, so “2009 年” becomes “DDDD 年”.
6. TimeChar: True if any of the characters in Figure 11 are part of the word. A time character signals some sort of time or duration when used on its own or as part of another word. We compiled this list ourselves, using our own intuitions about the language; it is essentially a whitelist incorporated into the algorithm. The characters are limited to those that are unambiguously related to time, so we expect that this feature can only help the algorithm, even if the current data set may be too small to establish its significance.
Figure 11: Description of TimeChar Characters
Date Character Translation Example Contexts
今 now 如今 “up until now”, 今年 “this year”
明 tomorrow 明天 “tomorrow”
昨 yesterday 昨晚 “last night”
时 time; at that time; while 做功课时 “while doing homework”
候 period 小的时候 “when [I] was little”
纪 century; period 世纪 “century”
钟 hour 两个钟头 “two hours”
天 day 五天后 “after 5 days”
日 day 10 月 10 日 “October 10th”
月 month 下个月 “next month”
年 year 去年 “last year”
早 early 早上 “morning”
晚 late 昨晚 “last night”
期 period 星期日 “Sunday”
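The IOB labeling scheme and the token-level features above can be sketched as follows. This is a simplified reconstruction, not the thesis code: the feature names, the BOS/EOS sentinels, and the use of `str.isdigit` (which covers Arabic but not Chinese numerals) are assumptions for the example; TIME_CHARS is the Figure 11 inventory.

```python
TIME_CHARS = set("今明昨时候纪钟天日月年早晚期")  # Figure 11 characters

def iob_labels(n_tokens, timex_spans):
    """Map (start, end) timex token spans (end exclusive) to B/I/O labels."""
    labels = ["O"] * n_tokens
    for start, end in timex_spans:
        labels[start] = "B"          # first token of the time expression
        for j in range(start + 1, end):
            labels[j] = "I"          # later tokens of the same expression
    return labels

def normalize_digits(word):
    """NORMALIZED feature: every digit becomes 'D', e.g. '2009年' -> 'DDDD年'."""
    return "".join("D" if ch.isdigit() else ch for ch in word)

def token_features(words, pos, i):
    """Features 1-6 for the i-th token of a POS-tagged sentence."""
    return {
        "WORD": words[i],
        "POS": pos[i],
        "PREV_POS": pos[i - 1] if i > 0 else "BOS",
        "NEXT_POS": pos[i + 1] if i + 1 < len(pos) else "EOS",
        "NORMALIZED": normalize_digits(words[i]),
        "TimeChar": any(ch in TIME_CHARS for ch in words[i]),
    }
```

One feature dictionary of this shape per token is the kind of input a crfsuite-style sequence tagger consumes alongside the B/I/O label sequence.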
We achieved 0.95 precision, 0.85 recall, and 0.89 F1, macro-averaged across the three IOB categories, which is a fair improvement over the 0.94 precision, 0.74 recall, and 0.83 F1 achieved with TIRSemZh (Llorens, Saquete, Navarro, Li, & He, 2011). It is possible that a good portion of this gain was due to the default configuration of the crfsuite classifier, since our features overlapped for the most part. The TimeChar feature, which was unique to our algorithm, did not increase accuracy significantly, but the data set is too small to judge its usefulness. Either way, that does not explain the extra 3 percentage points gained over TIRSemZh, which had more features and also used semantic roles, so we must assume that the CRF algorithm we used was tuned in a way more beneficial for this task.
After we tested this model on the TempEval data, we constructed a final model using all of the
training and testing data combined. With this model, we extracted time expressions in our main
data set and created a parallel time file that would be used for features during the tense
prediction stage. An example of this file is shown in Appendix A. Time expressions are denoted
with brackets.
After we extracted time expressions in our main data set, we performed some simple analysis to understand their nature. Figure 12 is a frequency distribution of time expressions in the data, with digits normalized (i.e., numerical digits, Arabic and Chinese, are mapped to “D”). In the entire data set there were a total of 405 unique time expressions, which were condensed to 194 normalized time expressions. The distribution is consistent with Zipf’s Law, where the frequency of a word is inversely proportional to its rank, a phenomenon we see often with frequency distributions in natural language (Zipf, 1932). Not surprisingly, the most common normalized time expression is “DDDD 年”, the format for specifying a year. Following that is “目前”, which means “now”, and then “去年”, which means “last year”. An example of one of the many hapaxes is “白垩纪”, which means “Cretaceous Period”.
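The normalization and counting behind Figure 12 can be sketched as below; the toy input is illustrative, and `str.isdigit` covers Arabic digits only (Chinese numerals would need an extra lookup table).

```python
from collections import Counter

def normalized_counts(timexes):
    """Condense time expressions by mapping each digit to 'D', then count."""
    norm = ["".join("D" if ch.isdigit() else ch for ch in t) for t in timexes]
    return Counter(norm)

# e.g. normalized_counts(["2009年", "1995年", "去年"])
#      -> Counter({"DDDD年": 2, "去年": 1})
```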
Figure 12: Frequency Distribution of Normalized Time Expressions
We will revisit this data later on when we begin to resolve these time expressions to verbs and interpret their meaning with regard to tense.
LINKING TIME EXPRESSIONS TO VERBS
We use a rule-based approach to link time expressions to their verb counterparts. The rule is
based on the following assumption that we have found to often be the case in Chinese:
A time expression has jurisdiction over all verbs that are ancestors to its phrase node and ancestors to its sibling phrase nodes in a syntactic tree, unless obstructed by a CP or IP node.

Given this definition, we are able to associate time expressions with verbs by traversing the
syntactic tree. To test this method, we used data provided by Zhou et al. (2012), in which time expressions were manually associated with events (i.e., verbs) using Mechanical Turk. In their annotation scheme, a maximum of one time expression is associated with each event, whereas our method for extracting time expressions has no maximum.
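A minimal sketch of this traversal, under one reading of the rule above: climb from the time expression's phrase node, collect verb leaves under each sibling, and stop climbing once a CP or IP boundary is reached. The Node class and the VERB_TAGS inventory are illustrative assumptions, not the thesis's actual implementation.

```python
VERB_TAGS = {"VV", "VE", "VC", "VA"}  # Chinese Treebank verb POS tags
BARRIERS = {"CP", "IP"}               # nodes that block jurisdiction

class Node:
    def __init__(self, label, children=None, word=None):
        self.label, self.word = label, word
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self

def verb_leaves(node):
    """All verb terminals dominated by this node."""
    if node.word is not None:                        # terminal node
        return [node] if node.label in VERB_TAGS else []
    out = []
    for child in node.children:
        out.extend(verb_leaves(child))
    return out

def linked_verbs(timex_node):
    """Verbs under the siblings of each ancestor, stopping at CP/IP."""
    verbs, node = [], timex_node
    while node.parent is not None:
        for sibling in node.parent.children:
            if sibling is not node:
                verbs.extend(verb_leaves(sibling))
        if node.parent.label in BARRIERS:
            break
        node = node.parent
    return verbs
```

For instance, a temporal NP sitting next to a VP under the same IP is linked to the verbs inside that VP, but climbing stops there, so verbs in a higher clause are not captured.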
The annotated data came from the following 73 Chinese Treebank files. There were 2902 event
or verb instances in total.
Table 4: Time Expression Link Data Files
chtb_0031 chtb_0032 chtb_0033 chtb_0038 chtb_0040 chtb_0043 chtb_0049 chtb_0053 chtb_0059 chtb_0067 chtb_0071
chtb_0072 chtb_0073 chtb_0077 chtb_0080 chtb_0087 chtb_0088 chtb_0097 chtb_0112 chtb_0118 chtb_0128 chtb_0129
chtb_0130 chtb_0139 chtb_0143 chtb_0144 chtb_0147 chtb_0249 chtb_0251 chtb_0252 chtb_0259 chtb_0279 chtb_0291
chtb_0309 chtb_0310 chtb_0408 chtb_0427 chtb_0441 chtb_0450 chtb_0452 chtb_0453 chtb_0507 chtb_0510
chtb_0309 chtb_0544 chtb_0600 chtb_0604 chtb_0605 chtb_0615 chtb_0616 chtb_0618 chtb_0621 chtb_0628
chtb_0629 chtb_0633 chtb_0644 chtb_0646 chtb_0650 chtb_0651 chtb_0654 chtb_0658 chtb_0660 chtb_0664
chtb_0629 chtb_0670 chtb_0672 chtb_0702 chtb_0704 chtb_0707 chtb_0709 chtb_0714 chtb_0716 chtb_0717
Our rule-based approach achieved 64% accuracy if we consider a match to be an exact match
and 68% accuracy when we consider a match to be one in which the gold match is included in
the set of time expressions that the rule-based algorithm extracted for a given event. We
consider the 68% to be more representative of reality, since the annotated gold data was artificially constrained to a single time expression per event.
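The two accuracy definitions can be made precise with a small helper; the data format (one gold time expression per event, a predicted set per event) is assumed for illustration.

```python
def link_accuracies(gold, predicted):
    """gold: one gold time expression per event; predicted: a set of
    extracted time expressions per event. Returns (exact, inclusive)."""
    # Exact: the predicted set is exactly the single gold expression.
    exact = sum(p == {g} for g, p in zip(gold, predicted)) / len(gold)
    # Inclusive: the gold expression appears among the predictions.
    inclusive = sum(g in p for g, p in zip(gold, predicted)) / len(gold)
    return exact, inclusive
```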
Although further improvements could and eventually should be made to this time-verb linking algorithm, the next step would require significantly more effort and data that do not fall within the scope of this thesis. We therefore used this rule-based association method and proceeded to make the time associations for our main data set.
TENSE PREDICTION
We used a Conditional Random Fields algorithm that was part of the crfsuite package (Okazaki,
2007) to predict tense in Chinese. We looked at verbs only and attempted to tag them with
their correct tense. We consider verbs within a single sentence to be the basis for our sequence
modeling.
FEATURES

The following features were used to predict tense. These features were borrowed in part from Xue (2008). Some of the simpler lexical features were borrowed from feature sets traditionally used for Chinese POS tagging (Ng & Low, 2004).
1. Most Frequent Tense
For this feature, we used 50,000 lines of complementary Chinese Treebank parallel data that
was automatically parsed and aligned. We performed our tense and aspect mappings as we did
with our gold data. Then we found the most common tags associated with each verb, excluding
VNOMAP and VOTHER tags, if they existed for that verb. This feature was therefore the string of
the most common tag associated with the verb.
2. Time Expressions
These are the strings of all time expressions associated with the verb as determined by our
algorithm described in LINKING TIME EXPRESSIONS TO VERBS.
3. Time Expression Value
We used the PKU dictionary (Wang & Yu, 2003) for this feature; it lists time expressions along with a potential “tense” value, which can be 过 (past), 未 (future), or 否 (none). If any of the time expressions from the Time Expressions feature have tense values, these were used.
4. Verb Classes
We also used the PKU dictionary for this feature. If the verb is placed into one or more verb
classes, we use the numbers associated with all classes.
5. Position in Verb Compound
If the verb is part of a verb compound (VSB, VCD, VRD, VCP, VNV, VPT), its position in the
compound, either first or last.
6. Quotes
If the verb is in quotes, then this feature returns True.
7. Verb
The verb string.
8. Previous Word
The previous word token.
9. Verb POS
The POS of the verb based on the automatic parse.
10. Next POS
The POS of the next word in the sentence.
11. Previous and Current POS
The POS of the previous word plus the POS of the current word.
12. Current and Next POS
The POS of the current word plus the POS of the next word.
13. Next Next POS
The POS of the word following the next word.
14. Previous and Next POS
The POS of the previous word plus the POS of the next word.
15. Post-Verb Aspect Marker
The aspect marker that immediately follows the verb, if one exists.
16. Adverb
All adverbs that modify the verb.
17. Right DER
If the functional character 得 occurs after the verb, then this feature is True. This character is followed by a modifier that signals the manner or degree of the verb’s action.
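A few of the lexical features above (1, 7, 8, 15, and 17) can be sketched for a single verb token as follows; the ASPECT_MARKERS set, the BOS sentinel, and the input format are assumptions made for the example, not the exact thesis implementation.

```python
ASPECT_MARKERS = {"了", "着", "过"}  # common post-verbal aspect markers

def tense_features(words, i, most_freq_tense):
    """words: sentence tokens; i: index of the verb; most_freq_tense:
    verb -> most common mapped tag from the auxiliary parallel data."""
    nxt = words[i + 1] if i + 1 < len(words) else ""
    return {
        "VERB": words[i],                                      # feature 7
        "PREV_WORD": words[i - 1] if i > 0 else "BOS",         # feature 8
        "MOST_FREQ_TENSE": most_freq_tense.get(words[i], ""),  # feature 1
        "ASPECT": nxt if nxt in ASPECT_MARKERS else "",        # feature 15
        "RIGHT_DER": nxt == "得",                              # feature 17
    }
```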
Figure 13 is an example of a tree structure taken from our data with some features highlighted,
namely the current verb, an adverb, and a time expression.
Figure 13: Example of features in a syntactic tree
RESULTS

We established two baseline measures. The first measure used the same data as the Most
Frequent Tense feature and tagged each verb with its most frequent tense, if it had one. Since most verbs are most frequently tagged with VOTHER or VNOMAP, we excluded these when other options were available. This baseline came to 0.214. The second baseline measure was simply to take the most frequent tag overall, VOTHER, and tag all verbs as such. This was slightly higher, at 0.219.
Using the Conditional Random Fields algorithm provided by crfsuite and 10-fold cross validation
on our data set, we were able to achieve 0.552 accuracy – a 34% gain over our baselines. See
Table 5 for these figures.
Table 5: Results Compared to Baseline
Baseline   Final Accuracy
0.22       0.55
We looked at the removal of each individual feature to see how much they contributed to our
final score.
Table 6: Feature Significance
Feature #   Accuracy   Difference from best
All 0.552
7 0.514 -0.038
16 0.524 -0.028
13 0.534 -0.018
4 0.536 -0.016
8 0.538 -0.013
14 0.540 -0.012
1 0.545 -0.007
2 0.545 -0.007
11 0.545 -0.007
5 0.546 -0.005
9 0.548 -0.004
12 0.548 -0.004
3 0.549 -0.003
6 0.549 -0.003
10 0.549 -0.003
15 0.549 -0.003
17 0.552 0.000
The top 5 most important features were the verb itself, the adverbs, the POS of the word following the next word, the verb classes, and the previous word. Common adverbs like “已经”, meaning “already”, and “将”, meaning “in the future”, carry important temporal cues that strictly confine the options for tense on the modified verb, so it makes sense that this is an important feature. Our time expression features were not as significant as we expected; however, we believe this only shows that we have not yet found a way to capture the relevant information they provide. The Most Frequent Tense feature was less significant than we would have thought, which we consider a better scenario, since we would rather our algorithm not rely on pre-compiled static information.
In terms of precision and recall, the results for each tag are displayed in Table 7.
Table 7: Precision, Recall, and F1 Scores for Each Tag
Tag Precision Recall F1
VA       0.889   0.870   0.879
VPAST    0.595   0.677   0.633
VINF     0.522   0.645   0.577
VOTHER   0.527   0.598   0.560
VNOMAP   0.632   0.493   0.554
VPRES    0.479   0.333   0.393
VFUTR    0.370   0.417   0.392
VBG      0.383   0.349   0.365
VPRES2   0.500   0.250   0.333
VPRES1   0.467   0.200   0.280
VPAST1   0.000   0.000   0.000
VPAST2   0.000   0.000   0.000
VPRES3   -       -       -
VFUTR2   -       -       -
The VA tag was most accurately predicted, followed by VPAST, which was the most common type of tense tag in the data. VPAST1 and VPAST2 occurred fewer than 5 times in the data, so their scores of 0 do not tell us much.
CONCLUSION AND FUTURE WORK

Trying to understand tense in a language that has none is a difficult task that we have attempted
to explore in this thesis. There are many questions to consider that may change the direction of
the task considerably. For example, if the motivation is more theoretical, we may consider
defining a temporal annotation scheme that is organic to Chinese and that may or may not
overlap with other languages. If the motivation is for NLP tasks like Machine Translation, it
makes sense to consider other languages, namely the ones we may be translating into, when
considering a tense schema for Chinese. This is more the direction we have taken in this thesis.
We mapped English tense onto Chinese using parallel Chinese-English data. Our work differs
from others in that we explain in detail how this mapping is carried out and the type of data that
results. This understanding is fundamental to creating temporal information schemas and
continuing this type of work going forward.
We integrated time expressions into our feature selection process by constructing an algorithm
for automatically extracting time expressions and associating them with verbs. Although we
achieved considerable results, we found this area to be the one in which the most
improvements can be made. A better understanding of the network of time expressions in the data and how they associate with verbs is crucial to understanding tense in Chinese, since this is intuitively how Chinese speakers interpret time, given that no inflections on the verb exist. We have information like “yesterday” and “later”, but these words mean little without a larger temporal context, and our algorithm can only extract so much information from strings and naïve understandings of time expressions (e.g., “yesterday” is in the past). We hope to modify
the direction of this research in the future to focus on time expressions and the networks that
exist between them. Only then can we make more informed decisions about how they may
influence the tense of a verb.
Finally, we used a suite of lexical, syntactic, and other linguistically-informed features to train
and test a Conditional Random Fields algorithm. We achieved considerable improvement over
our baseline. The improvement shows that using English parallel data to understand tense in
Chinese is very worthwhile for certain NLP tasks. In the future we would like to integrate our
results into a MT system to see how translation could be improved. We hope that the types of
features we explored will help correct the types of errors that we saw in Figure 1, Figure 2, and
Figure 3.
Bibliography

Kudo, T. (2005). CRF++: Yet another CRF toolkit. Retrieved from http://crfpp.sourceforge.net.
Li, W., Wong, K.-F., & Yuan, C. (2001). A Model for Processing Temporal References in Chinese.
Workshop on Temporal and Spatial Information Processing - Volume 13 (pp. 5:1-5:8).
Stroudsburg: Association for Computational Linguistics.
Liu, F., Liu, F., & Liu, Y. (2011). Learning from Chinese-English Parallel Data for Chinese Tense
Prediction. Fifth International Joint Conference on Natural Language Processing (pp.
1116-1124). Chiang Mai, Thailand: AFNLP.
Llorens, H., Saquete, E., Navarro, B., Li, L., & He, Z. (2011). Data-Driven Approach Based on
Semantic Roles for Recognizing Temporal Expressions and Events in Chinese. NLDB 2011
(pp. 88-99). Berlin: Springer-Verlag.
Marcus, M. P., Santorini, B., & Marcinkiewicz, M. (1993). Building a Large Annotated Corpus of
English: The Penn Treebank. Computational Linguistics, 313-330.
Ng, H., & Low, J. (2004). Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-
Based or Character-Based? Proceedings of EMNLP.
Okazaki, N. (2007). CRFsuite: a fast implementation of Conditional Random Fields (CRFs).
Retrieved from http://www.chokkan.org/software/crfsuite/
Olsen, M., Traum, D., Van Ess-Dykema, C., Weinberg, A., & Dolan, R. (2000). Telicity as a Cue to
Temporal and Discourse Structure in Chinese-English Machine Translation. College Park,
MD.
Pustejovsky, J., Ingria, R., Sauri, R., Castano, R., Littman, J., Gaizauskas, R., et al. (2004). The
Specification Language TimeML. In The Language of Time: A Reader (pp. 185-196).
Oxford.
Wang, H., & Yu, S. (2003). The semantic knowledge-base of contemporary Chinese and its
applications in WSD. Proceedings of the second SIGHAN workshop on Chinese language
processing - Volume 17 (pp. 112-118). Sapporo, Japan: Association for Computational
Linguistics.
Xue, N. (2008). Automatic inference of the temporal location of situations in Chinese text. 2008
Conference on Empirical Methods in Natural Language Processing (pp. 707-714).
Honolulu: Association for Computational Linguistics.
Ye, Y., & Zhang, Z. (2005). Tense Tagging for Verbs in Cross-Lingual Context: A Case Study.
Second International Joint Conference on Natural Language Processing (pp. 885-895).
Jeju Island, Korea: Springer-Verlag.
Ye, Y., Fossum, V. L., & Abney, S. (2006). Latent Features in Automatic Tense Translation
between Chinese and English. Fifth SIGHAN Workshop on Chinese Language Processing
(pp. 48-55). Sidney, Australia: Association for Computational Linguistics.
Zhou, Y., & Xue, N. (2012). Exploring Temporal Vagueness with Mechanical Turk. Proceedings of
the 6th Linguistic Annotation Workshop (pp. 124-128). Jeju, Korea: Association for
Computational Linguistics.
Zhu, X., Yuan, C., Wong, K., & Li, W. (2000). An Algorithm for Situation Classification of Chinese
Verbs. Second Workshop on Chinese Language Processing (pp. 140-145). Stroudsburg:
Association for Computational Linguistics.
Zipf, G. K. (1932). Selected studies of the principle of relative frequency in language. Cambridge,
Mass: Harvard University Press.
APPENDIX A: TIME FILES
新华社 香港 [二月 二十三日]3 电
据 台 “ 经济部 ” 统计 , [去年]4 两 岸 贸易额 为 二百零九亿 美元 。
其中 , 台湾 对 祖国 大陆 输出值 为 一百七十八亿 美元 , 比 [上 一 年]5 增长 百分之二十 ;
输入值 为 三十一亿 美元 , 比 [上 年]6 增长 百分之七十四 。
台湾 在 两 岸 贸易 中 顺差 一百四十七亿 美元 。
统计 还 显示 , 台商 投资 祖国 大陆 正 趋向 大型化 。
[去年]7 经 台 当局 核准 的 台商 投资案 共 四百九十 项 , 金额 为 十点九二亿 美元 。
在 投资 项目 上 比 [上 年] 减少 四百四十四 件 , 但 投资 金额 却 比 [上 年]8 增加 一点三亿
多 美元 。
( 完 )
新华社 北京 [二月 二十九日]9 电
国家 开发 银行 [日前]10 在 日本 资本 市场 成功 地 发行 了 三百亿 日元 武士 债券 。
这 是 国家 开发 银行 首 次 在 国际 资本 市场 发行 债券 , 由 日本 野村 证券 株式会社 和
日本 兴业 银行 证券 株式会社 作为 联合 主干事 , 发行 期限 十 年 , 到期 一 次 偿还 。
据 了解 , 这 次 发行 武士 债券 的 条件 是 [近 几 年]11 来 比较 优惠 的 , 筹集 的 资金 将
主要 用于 广东 岭澳 核电 工程 、 伊敏 电厂 和 绥中 电厂 等 国家 重点 建设 项目 。
国家 开发 银行 自 成立 以来 , 为 国家 重点 建设 项目 筹集 了 大 批 资金 。
[一九九五年]12 , 国家 开发 银行 成功 地 组织 了 首 次 五千万 美元 外国 银团 贷款 , 同
时 承做 了 岭澳 核电 工程 、 秦山 二 期 核电 等 项目 的 国外 出口 信贷 的 转贷 , 从 内资
和 外资 两 个 方面 不断 加大 对 重点 建设 项目 的 支持 力度 , 为 推动 中国 经济 发展 发
挥 了 积极 的 作用 。
( 完 )
3 February 23rd
4 Last year
5 Last year
6 Last year
7 Last year
8 Last year
9 February 29th
10 A few days ago
11 In recent years
12 1995