
Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging

Eric Brill*
The Johns Hopkins University

Recently, there has been a rebirth of empiricism in the field of natural language processing. Manual encoding of linguistic information is being challenged by automated corpus-based learning as a method of providing a natural language processing system with linguistic knowledge. Although corpus-based approaches have been successful in many different areas of natural language processing, it is often the case that these methods capture the linguistic information they are modelling indirectly in large opaque tables of statistics. This can make it difficult to analyze, understand and improve the ability of these approaches to model underlying linguistic behavior. In this paper, we will describe a simple rule-based approach to automated learning of linguistic knowledge. This approach has been shown for a number of tasks to capture information in a clearer and more direct fashion without a compromise in performance. We present a detailed case study of this learning method applied to part-of-speech tagging.

1. Introduction

It has recently become clear that automatically extracting linguistic information from a sample text corpus can be an extremely powerful method of overcoming the linguistic knowledge acquisition bottleneck inhibiting the creation of robust and accurate natural language processing systems. A number of part-of-speech taggers are readily available and widely used, all trained and retrainable on text corpora (Church 1988; Cutting et al. 1992; Brill 1992; Weischedel et al. 1993).
Endemic structural ambiguity, which can lead to such difficulties as trying to cope with the many thousands of possible parses that a grammar can assign to a sentence, can be greatly reduced by adding empirically derived probabilities to grammar rules (Fujisaki et al. 1989; Sharman, Jelinek, and Mercer 1990; Black et al. 1993) and by computing statistical measures of lexical association (Hindle and Rooth 1993). Word-sense disambiguation, a problem that once seemed out of reach for systems without a great deal of handcrafted linguistic and world knowledge, can now in some cases be done with high accuracy when all information is derived automatically from corpora (Brown, Lai, and Mercer 1991; Yarowsky 1992; Gale, Church, and Yarowsky 1992; Bruce and Wiebe 1994). An effort has recently been undertaken to create automated machine translation systems in which the linguistic information needed for translation is extracted automatically from aligned corpora (Brown et al. 1990). These are just a few of the many recent applications of corpus-based techniques in natural language processing.

* Department of Computer Science, Baltimore, MD 21218-2694. E-mail: [email protected].

1995 Association for Computational Linguistics

Computational Linguistics Volume 21, Number 4

Along with great research advances, the infrastructure is in place for this line of research to grow even stronger, with on-line corpora, the grist of the corpus-based natural language processing grindstone, getting bigger and better and becoming more readily available. There are a number of efforts worldwide to manually annotate large corpora with linguistic information, including parts of speech, phrase structure and predicate-argument structure (e.g., the Penn Treebank and the British National Corpus (Marcus, Santorini, and Marcinkiewicz 1993; Leech, Garside, and Bryant 1994)). A vast amount of on-line text is now available, and much more will become available in the future. Useful tools, such as large aligned corpora (e.g., the aligned Hansards (Gale and Church 1991)) and semantic word hierarchies (e.g., Wordnet (Miller 1990)), have also recently become available.

Corpus-based methods are often able to succeed while ignoring the true complexities of language, banking on the fact that complex linguistic phenomena can often be indirectly observed through simple epiphenomena.
For example, one could accurately assign a part-of-speech tag to the word race in (1-3) without any reference to phrase structure or constituent movement: one would only have to realize that, usually, a word one or two words to the right of a modal is a verb and not a noun. An exception to this generalization arises when the word is also one word to the right of a determiner.

(1) He will race/VERB the car.
(2) He will not race/VERB the car.
(3) When will the race/NOUN end?

It is an exciting discovery that simple stochastic n-gram taggers can obtain very high rates of tagging accuracy simply by observing fixed-length word sequences, without recourse to the underlying linguistic structure. However, in order to make progress in corpus-based natural language processing, we must become better aware of just what cues to linguistic structure are being captured and where these approximations to the true underlying phenomena fail. With many of the current corpus-based approaches to natural language processing, this is a nearly impossible task. Consider the part-of-speech tagging example above. In a stochastic n-gram tagger, the information about words that follow modals would be hidden deeply in the thousands or tens of thousands of contextual probabilities (P(tag_i | tag_{i-1} tag_{i-2})) and the result of multiplying different combinations of these probabilities together.

Below, we describe a new approach to corpus-based natural language processing, called transformation-based error-driven learning.
This algorithm has been applied to a number of natural language problems, including part-of-speech tagging, prepositional phrase attachment disambiguation, and syntactic parsing (Brill 1992; Brill 1993a; Brill 1993b; Brill and Resnik 1994; Brill 1994). We have also recently begun exploring the use of this technique for letter-to-sound generation and for building pronunciation networks for speech recognition. In this approach, the learned linguistic information is represented in a concise and easily understood form. This property should make transformation-based learning a useful tool for further exploring linguistic modeling and attempting to discover ways of more tightly coupling the underlying linguistic systems and our approximating models.

    Brill Transformation-Based Error-Driven Learning

[Figure 1. Transformation-Based Error-Driven Learning. Diagram: unannotated text is fed to the initial-state annotator; the resulting annotated text is compared with the truth, and rules are learned.]

2. Transformation-Based Error-Driven Learning

Figure 1 illustrates how transformation-based error-driven learning works. First, unannotated text is passed through an initial-state annotator. The initial-state annotator can range in complexity from assigning random structure to assigning the output of a sophisticated manually created annotator. In part-of-speech tagging, various initial-state annotators have been used, including: the output of a stochastic n-gram tagger; labelling all words with their most likely tag as indicated in the training corpus; and naively labelling all words as nouns. For syntactic parsing, we have explored initial-state annotations ranging from the output of a sophisticated parser to random tree structure with random nonterminal labels.

Once text has been passed through the initial-state annotator, it is then compared to the truth. A manually annotated corpus is used as our reference for truth. An ordered list of transformations is learned that can be applied to the output of the initial-state annotator to make it better resemble the truth. There are two components to a transformation: a rewrite rule and a triggering environment. An example of a rewrite rule for part-of-speech tagging is:

Change the tag from modal to noun.

and an example of a triggering environment is:

The preceding word is a determiner.

Taken together, the transformation with this rewrite rule and triggering environment, when applied to the word can, would correctly change the mistagged:

The/determiner can/modal rusted/verb ./.


to:

The/determiner can/noun rusted/verb ./.
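This transformation can be sketched in a few lines of Python. The function names and tag strings below are illustrative choices for this example, not the paper's actual implementation:

```python
# Sketch of the example transformation: rewrite rule "change the tag
# from modal to noun" with triggering environment "the preceding word
# is a determiner".  Tag names here are illustrative.

def apply_transformation(tagged, from_tag, to_tag, trigger):
    """Return a new (word, tag) list with the rewrite rule applied at
    every position where the triggering environment holds."""
    out = list(tagged)
    for i, (word, tag) in enumerate(tagged):
        if tag == from_tag and trigger(tagged, i):
            out[i] = (word, to_tag)
    return out

def prev_tag_is(target):
    """Triggering environment: the preceding word is tagged `target`."""
    return lambda tagged, i: i > 0 and tagged[i - 1][1] == target

sentence = [("The", "determiner"), ("can", "modal"),
            ("rusted", "verb"), (".", ".")]
fixed = apply_transformation(sentence, "modal", "noun",
                             prev_tag_is("determiner"))
# fixed now reads: The/determiner can/noun rusted/verb ./.
```

Note that the trigger is checked against the input tagging, so a rule's effect is not visible to its own triggering environment within the same pass; the choice between immediate and delayed application is discussed later in this section.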

An example of a bracketing rewrite rule is: change the bracketing of a subtree from:

( A ( B C ) )

to:

( ( A B ) C )

where A, B and C can be either terminals or nonterminals. One possible set of triggering environments is any combination of words, part-of-speech tags, and nonterminal labels within and adjacent to the subtree. Using this rewrite rule and the triggering environment A = the, the bracketing:

( the ( boy ate ) )

would become:

( ( the boy ) ate )

In all of the applications we have examined to date, the following greedy search is applied for deriving a list of transformations: at each iteration of learning, the transformation is found whose application results in the best score according to the objective function being used; that transformation is then added to the ordered transformation list and the training corpus is updated by applying the learned transformation. Learning continues until no transformation can be found whose application results in an improvement to the annotated corpus. Other more sophisticated search techniques could be used, such as simulated annealing or learning with a look-ahead window, but we have not yet explored these alternatives.

Figure 2 shows an example of learning transformations. In this example, we assume there are only four possible transformations, T1 through T4, and that the objective function is the total number of errors. The unannotated training corpus is processed by the initial-state annotator, and this results in an annotated corpus with 5,100 errors, determined by comparing the output of the initial-state annotator with the manually derived annotations for this corpus. Next, we apply each of the possible transformations in turn and score the resulting annotated corpus.[1] In this example,

[1] In the real implementation, the search is data driven, and therefore not all transformations need to be examined.
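As a rough sketch, the greedy search described above can be written as follows. The corpus representation and candidate transformations are simplified stand-ins (arbitrary corpus-to-corpus functions), not the paper's data-driven implementation:

```python
# Greedy derivation of an ordered transformation list.  `corpus` and
# `truth` are parallel lists of annotations; `candidates` maps a
# transformation's name to a corpus -> corpus function.

def errors(corpus, truth):
    """Objective function: count of positions where the current
    annotation disagrees with the manually annotated truth."""
    return sum(a != t for a, t in zip(corpus, truth))

def learn_transformations(corpus, truth, candidates):
    """Repeatedly pick the candidate whose application yields the
    fewest errors; stop when no candidate improves the corpus."""
    learned = []
    best = errors(corpus, truth)
    while True:
        scored = [(errors(t(corpus), truth), name, t)
                  for name, t in candidates.items()]
        score, name, t = min(scored)
        if score >= best:       # no transformation helps: learning stops
            return learned
        corpus = t(corpus)      # update the training corpus
        learned.append(name)
        best = score
```

To annotate fresh text, one would run the initial-state annotator and then apply the learned transformations in order, mirroring the description above.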


[Figure 2. An Example of Transformation-Based Error-Driven Learning. Diagram: the unannotated corpus passes through the initial-state annotator, yielding an annotated corpus with 5,100 errors; candidate transformations T1-T4 are each applied and scored, round by round, until no transformation reduces the error count.]
applying transformation T2 results in the largest reduction of errors, so T2 is learned as the first transformation. T2 is then applied to the entire corpus, and learning continues. At this stage of learning, transformation T3 results in the largest reduction of error, so it is learned as the second transformation. After applying the initial-state annotator, followed by T2 and then T3, no further reduction in errors can be obtained from applying any of the transformations, so learning stops. To annotate fresh text, this text is first annotated by the initial-state annotator, followed by the application of transformation T2 and then by the application of T3.

To define a specific application of transformation-based learning, one must specify the following:

1. The initial-state annotator.
2. The space of allowable transformations (rewrite rules and triggering environments).
3. The objective function for comparing the corpus to the truth and choosing a transformation.

In cases where the application of a particular transformation in one environment could affect its application in another environment, two additional parameters must be specified: the order in which transformations are applied to a corpus, and whether a transformation is applied immediately or only after the entire corpus has been examined for triggering environments. For example, take the sequence:

A A A A A A

and the transformation:


Change the label from A to B if the preceding label is A.

If the effect of the application of a transformation is not written out until the entire file has been processed for that one transformation, then regardless of the order of processing the output will be:

A B B B B B,

since the triggering environment of a transformation is always checked before that transformation is applied to any surrounding objects in the corpus. If the effect of a transformation is recorded immediately, then processing the string left to right would result in:

A B A B A B,

whereas processing right to left would result in:

A B B B B B.

3. A Comparison With Decision Trees

The technique employed by the learner is somewhat similar to that used in decision trees (Breiman et al. 1984; Quinlan 1986; Quinlan and Rivest 1989). A decision tree is trained on a set of preclassified entities and outputs a set of questions that can be asked about an entity to determine its proper classification. Decision trees are built by finding the question whose resulting partition is the purest,[2] splitting the training data according to that question, and then recursively reapplying this procedure on each resulting subset.

We first show that the set of classifications that can be provided via decision trees is a proper subset of those that can be provided via transformation lists (an ordered list of transformation-based rules), given the same set of primitive questions. We then give some practical differences between the two learning methods.

3.1 Decision Trees ⊆ Transformation Lists

We prove here that for a fixed set of primitive queries, any binary decision tree can be converted into a transformation list. Extending the proof beyond binary trees is straightforward.

Proof (by induction)

Base Case: Given the following primitive decision tree, where the classification is A if the answer to the query X? is yes, and the classification is B if the answer is no:

    X?
   /  \
  B    A

[2] One possible measure for purity is entropy reduction.


this tree can be converted into the following transformation list:

1. Label with S    /* Start State Annotation */
2. If X, then S → A
3. S → B    /* Empty Tagging Environment — Always Applies To Entities Currently Labeled With S */

Induction: Assume that two decision trees T1 and T2 have corresponding transformation lists L1 and L2. Assume that the arbitrary label names chosen in constructing L1 are not used in L2, and that those in L2 are not used in L1. Given a new decision tree T3, constructed by making X? the root query with T1 as the yes-subtree and T2 as the no-subtree:


we construct a new transformation list L3. Assume the first transformation in L1 is:

Label with S′

and the first transformation in L2 is:

Label with S″

The first three transformations in L3 will then be:

1. Label with S
2. If X, then S → S′
3. S → S″

followed by all of the rules in L1 other than the first rule, followed by all of the rules in L2 other than the first rule. The resulting transformation list will first label an item as S′ if X is true, or as S″ if X is false. Next, the transformations from L1 will be applied if X is true, since S′ is the initial-state label for L1. If X is false, the transformations from L2 will be applied, because S″ is the initial-state label for L2. □

3.2 Decision Trees ≠ Transformation Lists

We show here that there exist transformation lists for which no equivalent decision trees exist, for a fixed set of primitive queries. The following classification problem is one example. Given a sequence of characters, classify a character based on whether the position index of a character is divisible by 4, querying only using a context of two characters to the left of the character being classified.
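The constructive proof of Section 3.1 can also be rendered as a small recursive procedure. The tree and rule encodings below are assumptions made for illustration; as in the proof, the generated state names are assumed not to collide with leaf labels:

```python
# Recursive rendering of the Section 3.1 construction: any binary
# decision tree becomes an equivalent transformation list.  A tree is
# either a leaf label (a string) or a tuple (query, yes_subtree,
# no_subtree); a rule is a triple (old_label, query_or_None, new_label),
# where None means an empty triggering environment.
import itertools

_fresh = (f"S{i}" for i in itertools.count())  # fresh state labels

def tree_to_transformations(tree):
    """Build the transformation list: an initial labelling rule
    followed by triggered relabellings, one pair per internal node."""
    start = next(_fresh)
    rules = [("START", None, start)]          # "label with S"
    def expand(state, node):
        if isinstance(node, str):             # leaf: unconditional relabel
            rules.append((state, None, node))
            return
        query, yes, no = node
        s_yes, s_no = next(_fresh), next(_fresh)
        rules.append((state, query, s_yes))   # if query, then state -> s_yes
        rules.append((state, None, s_no))     # otherwise state -> s_no
        expand(s_yes, yes)
        expand(s_no, no)
    expand(start, tree)
    return rules

def classify(rules, facts):
    """Apply the list, in order, to one entity; `facts` is the set of
    queries that are true of that entity."""
    label = "START"
    for old, query, new in rules:
        if label == old and (query is None or query in facts):
            label = new
    return label
```

For the base-case tree ("X", "A", "B"), the resulting list labels an entity A when X holds and B otherwise, matching the three-rule list given in the proof.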


Assuming transformations are applied left to right on the sequence, the above classification problem can be solved for sequences of arbitrary length if the effect of a transformation is written out immediately, or for sequences up to any prespecified length if a transformation is carried out only after all triggering environments in the corpus are checked. We present the proof for the former case. Given the input sequence:

A A A A A A A A A A A
0 1 2 3 4 5 6 7 8 9 10

the characters at indices 0, 4, and 8 should be classified as true. To see why a decision tree could not perform this classification, regardless of order of classification, note that, for the two characters before both A_3 and A_4, both the characters and their classifications are the same, although these two characters should be classified differently. Below is a transformation list for performing this classification. Once again, we assume transformations are applied left to right and that the result of a transformation is written out immediately, so that the result of applying transformation x to character a_i will always be known when applying transformation x to a_{i+1}.

1. Label with S
   RESULT: A/S A/S A/S A/S A/S A/S A/S A/S A/S A/S A/S
2. If there is no previous character, then S → F
   RESULT: A/F A/S A/S A/S A/S A/S A/S A/S A/S A/S A/S
3. If the character two to the left is labelled with F, then S → F
   RESULT: A/F A/S A/F A/S A/F A/S A/F A/S A/F A/S A/F
4. If the character two to the left is labelled with F, then F → S
   RESULT: A/F A/S A/S A/S A/F A/S A/S A/S A/F A/S A/S
5. F → yes
6. S → no
   RESULT: A/yes A/no A/no A/no A/yes A/no A/no A/no A/yes A/no A/no
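A minimal simulation of this six-rule transformation list, assuming left-to-right processing with each effect written out immediately:

```python
# Simulation of the six-rule transformation list above: classify each
# position of a length-n sequence as "yes" iff its index is divisible
# by 4, consulting only labels two positions to the left.  Each rule is
# applied left to right with its effect recorded immediately.

def classify_divisible_by_4(n):
    labels = ["S"] * n                    # 1. label with S
    if n > 0:                             # 2. no previous character: S -> F
        labels[0] = "F"
    for i in range(2, n):                 # 3. two-left labelled F: S -> F
        if labels[i] == "S" and labels[i - 2] == "F":
            labels[i] = "F"
    for i in range(2, n):                 # 4. two-left labelled F: F -> S
        if labels[i] == "F" and labels[i - 2] == "F":
            labels[i] = "S"
    # 5. F -> yes   6. S -> no
    return ["yes" if lab == "F" else "no" for lab in labels]
```

Because each rule reads labels it may itself have just rewritten, the immediate-application semantics is what makes the list work for sequences of arbitrary length.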

The extra power of transformation lists comes from the fact that intermediate results from the classification of one object are reflected in the current label of that object, thereby making this intermediate information available for use in classifying other objects. This is not the case for decision trees, where the outcome of questions asked is saved implicitly by the current location within the tree.

3.3 Some Practical Differences Between Decision Trees and Transformation Lists

There are a number of practical differences between transformation-based error-driven learning and learning decision trees. One difference is that when training a decision tree, each time the depth of the tree is increased, the average amount of training material available per node at that new depth is halved (for a binary tree). In transformation-based learning, the entire training corpus is used for finding all transformations. Therefore, this method is not subject to the sparse data problems that arise as the depth of the decision tree being learned increases.

Transformations are ordered, with later transformations being dependent upon the outcome of applying earlier transformations. This allows intermediate results in


classifying one object to be available in classifying other objects. For instance, whether the previous word is tagged as to-infinitival or to-preposition may be a good cue for determining the part of speech of a word.[3] If, initially, the word to is not reliably tagged everywhere in the corpus with its proper tag (or not tagged at all), then this cue will be unreliable. The transformation-based learner will delay positing a transformation triggered by the tag of the word to until other transformations have resulted in a more reliable tagging of this word in the corpus. For a decision tree to take advantage of this information, any word whose outcome is dependent upon the tagging of to would need the entire decision tree structure for the proper classification of each occurrence of to built into its decision tree path. If the classification of to were dependent upon the classification of yet another word, this would have to be built into the decision tree as well. Unlike decision trees, in transformation-based learning, intermediate classification results are available and can be used as classification progresses.
Even if decision trees are applied to a corpus in a left-to-right fashion, they are allowed only one pass in which to properly classify.

Since a transformation list is a processor and not a classifier, it can readily be used as a postprocessor to any annotation system. In addition to annotating from scratch, rules can be learned to improve the performance of a mature annotation system by using the mature system as the initial-state annotator. This can have the added advantage that the list of transformations learned using a mature annotation system as the initial-state annotator provides a readable description or classification of the errors the mature system makes, thereby aiding in the refinement of that system. The fact that it is a processor gives a transformation-based learner greater power than the classifier-based decision tree. For example, in applying transformation-based learning to parsing, a rule can apply any structural change to a tree. In tagging, a rule such as:

Change the tag of the current word to X, and of the previous word to Y, if Z holds

can easily be handled in the processor-based system, whereas it would be difficult to handle in a classification system.

In transformation-based learning, the objective function used in training is the same as that used for evaluation, whenever this is feasible. In a decision tree, using system accuracy as an objective function for training typically results in poor performance[4] and so some measure of node purity, such as entropy reduction, is used instead. The direct correlation between rules and performance improvement in transformation-based learning can make the learned rules more readily interpretable than decision tree rules for increasing population purity.[5]

4. Part-of-Speech Tagging: A Case Study in Transformation-Based Error-Driven Learning

In this section we describe the practical application of transformation-based learning to part-of-speech tagging.[6] Part-of-speech tagging is a good application to test the

[3] The original tagged Brown Corpus (Francis and Kucera, 1982) makes this distinction; the Penn Treebank (Marcus, Santorini, and Marcinkiewicz, 1993) does not.
[4] For a discussion of why this is the case, see Breiman et al. (1984, 94-98).
[5] For a discussion of other issues regarding these two learning algorithms, see Ramshaw and Marcus (1994).
[6] All of the programs described herein are freely available with no restrictions on use or redistribution. For information on obtaining the tagger, contact the author.


learner, for several reasons. There are a number of large tagged corpora available, allowing for a variety of experiments to be run. Part-of-speech tagging is an active area of research; a great deal of work has been done in this area over the past few years (e.g., Jelinek 1985; Church 1988; DeRose 1988; Hindle 1989; DeMarcken 1990; Merialdo 1994; Brill 1992; Black et al. 1992; Cutting et al. 1992; Kupiec 1992; Charniak et al. 1993; Weischedel et al. 1993; Schutze and Singer 1994).

Part-of-speech tagging is also a very practical application, with uses in many areas, including speech recognition and generation, machine translation, parsing, information retrieval and lexicography. Insofar as tagging can be seen as a prototypical problem in lexical ambiguity, advances in part-of-speech tagging could readily translate to progress in other areas of lexical, and perhaps structural, ambiguity, such as word-sense disambiguation and prepositional phrase attachment disambiguation.7 Also, it is possible to cast a number of other useful problems as part-of-speech tagging problems, such as letter-to-sound translation (Huang, Son-Bell, and Baggett 1994) and building pronunciation networks for speech recognition. Recently, a method has been proposed for using part-of-speech tagging techniques as a method for parsing with lexicalized grammars (Joshi and Srinivas 1994).

When automated part-of-speech tagging was initially explored (Klein and Simmons 1963; Harris 1962), people manually engineered rules for tagging, sometimes with the aid of a corpus. As large corpora became available, it became clear that simple Markov-model based stochastic taggers that were automatically trained could achieve high rates of tagging accuracy (Jelinek 1985). Markov-model based taggers assign to a sentence the tag sequence that maximizes Prob(word | tag) * Prob(tag | previous n tags). These probabilities can be estimated directly from a manually tagged corpus.8 These stochastic taggers have a number of advantages over the manually built taggers, including obviating the need for laborious manual rule construction, and possibly capturing useful information that may not have been noticed by the human engineer. However, stochastic taggers have the disadvantage that linguistic information is captured only indirectly, in large tables of statistics. Almost all recent work in developing automatically trained part-of-speech taggers has been on further exploring Markov-model based tagging (Jelinek 1985; Church 1988; DeRose 1988; DeMarcken 1990; Merialdo 1994; Cutting et al. 1992; Kupiec 1992; Charniak et al. 1993; Weischedel et al. 1993; Schutze and Singer 1994).

4.1 Transformation-based Error-driven Part-of-Speech Tagging
Transformation-based part-of-speech tagging works as follows.9 The initial-state annotator assigns each word its most likely tag as indicated in the training corpus. The method used for initially tagging unknown words will be described in a later section. An ordered list of transformations is then learned, to improve tagging accuracy based on contextual cues. These transformations alter the tagging of a word from X to Y iff either:

1. The word was not seen in the training corpus, OR
2. The word was seen tagged with Y at least once in the training corpus.

7 In Brill and Resnik (1994), we describe an approach to prepositional phrase attachment disambiguation that obtains highly competitive performance compared to other corpus-based solutions to this problem. This system was derived in under two hours from the transformation-based part of speech tagger described in this paper.
8 One can also estimate these probabilities without a manually tagged corpus, using a hidden Markov model. However, it appears to be the case that directly estimating probabilities from even a very small manually tagged corpus gives better results than training a hidden Markov model on a large untagged corpus (see Merialdo (1994)).
9 Earlier versions of this work were reported in Brill (1992, 1994).

Brill Transformation-Based Error-Driven Learning

In taggers based on Markov models, the lexicon consists of probabilities of the somewhat counterintuitive but proper form P(WORD | TAG). In the transformation-based tagger, the lexicon is simply a list of all tags seen for a word in the training corpus, with one tag labeled as the most likely. Below we show a lexical entry for the word half in the transformation-based tagger.10

half: CD DT JJ NN PDT RB VB

This entry lists the seven tags seen for half in the training corpus, with NN marked as the most likely. Below are the lexical entries for half in a Markov model tagger, extracted from the same corpus:

P(half | CD)  = 0.000066
P(half | DT)  = 0.000757
P(half | JJ)  = 0.000092
P(half | NN)  = 0.000702
P(half | PDT) = 0.039945
P(half | RB)  = 0.000443
P(half | VB)  = 0.000027

It is difficult to make much sense of these entries in isolation; they have to be viewed in the context of the many contextual probabilities.
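To make the contrast concrete, the sketch below (ours, not from the paper; the toy corpus and function names are invented) builds both lexicon styles from a small tagged corpus: the transformation-based lexicon keeps the tags seen for each word with the most frequent first, while the Markov-style lexicon stores the emit probabilities P(word | tag).

```python
from collections import Counter, defaultdict

def build_lexicons(tagged_corpus):
    """tagged_corpus: list of (word, tag) pairs."""
    word_tag = Counter(tagged_corpus)                  # counts of (word, tag)
    tag_totals = Counter(t for _, t in tagged_corpus)  # counts of each tag

    # Transformation-based lexicon: all tags seen for a word,
    # with the most frequent tag listed first.
    tb_lexicon = defaultdict(list)
    for word in {w for w, _ in tagged_corpus}:
        tags = [t for (w, t) in word_tag if w == word]
        tags.sort(key=lambda t: -word_tag[(word, t)])
        tb_lexicon[word] = tags

    # Markov-style lexicon: emit probabilities P(word | tag).
    emit = {(w, t): c / tag_totals[t] for (w, t), c in word_tag.items()}
    return dict(tb_lexicon), emit

corpus = [("half", "NN"), ("half", "NN"), ("half", "PDT"),
          ("a", "DT"), ("dozen", "NN")]
tb, emit = build_lexicons(corpus)
# tb["half"] == ["NN", "PDT"]; emit[("half", "NN")] == 2/3
```

On this toy corpus the transformation-based entry for half is just the tag list with NN first, while the Markov-style entry is a set of conditional probabilities, one per tag.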

First, we will describe a nonlexicalized version of the tagger, where transformation templates do not make reference to specific words. In the nonlexicalized tagger, the transformation templates we use are:

Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.

where a, b, z and w are variables over the set of parts of speech.
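These templates can be read as predicates over positions in a tag sequence. A minimal sketch (our illustration, not the paper's code; only templates 1, 2 and 5 are shown, and the helper names are invented):

```python
def prev_tag_is(tags, i, z):
    # Template 1: the preceding word is tagged z.
    return i >= 1 and tags[i - 1] == z

def two_before_is(tags, i, z):
    # Template 2: the word two before is tagged z.
    return i >= 2 and tags[i - 2] == z

def surrounded_by(tags, i, z, w):
    # Template 5: the preceding word is tagged z and the following word is tagged w.
    return 0 < i < len(tags) - 1 and tags[i - 1] == z and tags[i + 1] == w

def apply_rule(tags, from_tag, to_tag, trigger):
    """Apply one instantiated transformation everywhere its trigger fires."""
    return [to_tag if t == from_tag and trigger(tags, i) else t
            for i, t in enumerate(tags)]

# "to conflict with" initially mistagged as TO NN IN;
# the rule NN -> VB if the previous tag is TO repairs it.
tags = ["TO", "NN", "IN"]
fixed = apply_rule(tags, "NN", "VB", lambda s, i: prev_tag_is(s, i, "TO"))
# fixed == ["TO", "VB", "IN"]
```

The last two lines replay the to/TO conflict/NN example discussed later in the text.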

To learn a transformation, the learner, in essence, tries out every possible transformation,11 and counts the number of tagging errors after each one is applied. After all possible transformations have been tried, the transformation that resulted in the greatest error reduction is chosen. Learning stops when no transformations can be found whose application reduces errors beyond some prespecified threshold.

10 A description of the part-of-speech tags is provided in Appendix A.
11 All possible instantiations of transformation templates.

1. apply initial-state annotator to corpus
2. while transformations can still be found do
3.   for from_tag = tag1 to tagn
4.     for to_tag = tag1 to tagn
5.       for corpus_position = 1 to corpus_size
6.         if (correct_tag(corpus_position) == to_tag
               && current_tag(corpus_position) == from_tag)
7.           num_good_transformations(tag(corpus_position-1))++
8.         else if (correct_tag(corpus_position) == from_tag
               && current_tag(corpus_position) == from_tag)
9.           num_bad_transformations(tag(corpus_position-1))++
10.      find maxT (num_good_transformations(T) - num_bad_transformations(T))
11.      if this is the best-scoring rule found yet then store as best rule:
           Change tag from from_tag to to_tag if previous tag is T
12.  apply best rule to training corpus
13.  append best rule to ordered list of transformations

Figure 3
Pseudocode for learning transformations.

    In the experiments described below, processing was done left to right. For eachtransformation application, all triggering environments are first found in the corpus,and then the transformation triggered by each triggering environment is carried out.
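This two-pass regime matters: if a transformation were instead applied in place while scanning, one application could destroy (or create) triggering environments for later positions. A sketch of the difference (ours, not the paper's code; the rule and tag sequence are artificial):

```python
def apply_two_pass(tags, from_tag, to_tag, trigger):
    # First find all triggering environments in the corpus...
    sites = [i for i, t in enumerate(tags)
             if t == from_tag and trigger(tags, i)]
    # ...then carry out the transformation at each of them.
    out = list(tags)
    for i in sites:
        out[i] = to_tag
    return out

def apply_in_place(tags, from_tag, to_tag, trigger):
    # Naive alternative: rewrite while scanning left to right,
    # so earlier rewrites can change later triggering environments.
    out = list(tags)
    for i, t in enumerate(out):
        if t == from_tag and trigger(out, i):
            out[i] = to_tag
    return out

prev_is_A = lambda s, i: i >= 1 and s[i - 1] == "A"
apply_two_pass(["A", "A", "A"], "A", "B", prev_is_A)  # ["A", "B", "B"]
apply_in_place(["A", "A", "A"], "A", "B", prev_is_A)  # ["A", "B", "A"]
```

With the artificial rule "change A to B if the previous tag is A," the two regimes give different results on the sequence A A A, which is why the application semantics have to be fixed in advance.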

The search is data-driven, so only a very small percentage of possible transformations really need be examined. In figure 3, we give pseudocode for the learning algorithm in the case where there is only one transformation template:

Change the tag from X to Y if the previous tag is Z.

In each learning iteration, the entire training corpus is examined once for every pair of tags X and Y, finding the best transformation whose rewrite changes tag X to tag Y. For every word in the corpus whose environment matches the triggering environment, if the word has tag X and X is the correct tag, then making this transformation will result in an additional tagging error, so we increment the number of errors caused when making the transformation given the part-of-speech tag of the previous word (lines 8 and 9). If X is the current tag and Y is the correct tag, then the transformation will result in one less error, so we increment the number of improvements caused when making the transformation given the part-of-speech tag of the previous word (lines 6 and 7).
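The pseudocode of figure 3 translates almost directly into the sketch below (ours; the names are invented). For compactness it accumulates the good counts of lines 6-7 and the bad counts of lines 8-9 in a single signed score per candidate rule, then returns the best-scoring instantiation of "change from_tag to to_tag if the previous tag is prev_tag."

```python
from collections import defaultdict

def best_transformation(correct, current):
    """correct, current: equal-length lists of tags (gold and current tagging).
    Returns ((from_tag, to_tag, prev_tag), score) for the single template
    'change from_tag to to_tag if the previous tag is prev_tag'."""
    score = defaultdict(int)
    for i in range(1, len(current)):
        prev = current[i - 1]
        if current[i] != correct[i]:
            # Rewriting current[i] to correct[i] here fixes an error (good).
            score[(current[i], correct[i], prev)] += 1
        else:
            # Any rule rewriting this already-correct tag adds an error (bad).
            for to_tag in set(correct) | set(current):
                if to_tag != current[i]:
                    score[(current[i], to_tag, prev)] -= 1
    return max(score.items(), key=lambda kv: kv[1])

correct = ["TO", "VB", "IN", "TO", "VB"]
current = ["TO", "NN", "IN", "TO", "NN"]
rule, s = best_transformation(correct, current)
# rule == ("NN", "VB", "TO"), with a score of 2
```

On this toy corpus the learner recovers the first rule of figure 4: change NN to VB when the previous tag is TO.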

In certain cases, a significant increase in speed for training the transformation-based tagger can be obtained by indexing in the corpus where different transformations can and do apply. For a description of a fast index-based training algorithm, see Ramshaw and Marcus (1994).

In figure 4, we list the first twenty transformations learned from training on the Penn Treebank Wall Street Journal Corpus (Marcus, Santorini, and Marcinkiewicz 1993).12 The first transformation states that a noun should be changed to a verb if

12 Version 0.5 of the Penn Treebank was used in all experiments reported in this paper.


    Change Tag
#   From  To    Condition
1   NN    VB    Previous tag is TO
2   VBP   VB    One of the previous three tags is MD
3   NN    VB    One of the previous two tags is MD
4   VB    NN    One of the previous two tags is DT
5   VBD   VBN   One of the previous three tags is VBZ
6   VBN   VBD   Previous tag is PRP
7   VBN   VBD   Previous tag is NNP
8   VBD   VBN   Previous tag is VBD
9   VBP   VB    Previous tag is TO
10  POS   VBZ   Previous tag is PRP
11  VB    VBP   Previous tag is NNS
12  VBD   VBN   One of previous three tags is VBP
13  IN    WDT   One of next two tags is VB
14  VBD   VBN   One of previous two tags is VB
15  VB    VBP   Previous tag is PRP
16  IN    WDT   Next tag is VBZ
17  IN    DT    Next tag is NN
18  JJ    NNP   Next tag is NNP
19  IN    WDT   Next tag is VBD
20  JJR   RBR   Next tag is JJ

Figure 4
The first 20 nonlexicalized transformations.

the previous tag is TO, as in: to/TO conflict/NN → VB with. The second transformation fixes a tagging such as: might/MD vanish/VBP → VB. The third fixes might/MD not reply/NN → VB. The tenth transformation is for the token 's, which is a separate token in the Penn Treebank. 's is most frequently used as a possessive ending, but after a personal pronoun, it is a verb (John's, compared to he's). The transformations changing IN to WDT are for tagging the word that, to determine in which environments that is being used as a synonym of which.

4.2 Lexicalizing the Tagger
In general, no relationships between words have been directly encoded in stochastic n-gram taggers.13 In the Markov model typically used for stochastic tagging, state transition probabilities (P(Tag_i | Tag_i-1 ... Tag_i-n)) express the likelihood of a tag immediately following n other tags, and emit probabilities (P(Word_j | Tag_i)) express the likelihood of a word, given a tag. Many useful relationships, such as that between a word and the previous word, or between a tag and the following word, are not directly captured by Markov-model based taggers. The same is true of the nonlexicalized transformation-based tagger, where transformation templates do not make reference to words.

    To remedy this problem, we extend the transformation-based tagger by adding

13 In Kupiec (1992), a limited amount of lexicalization is introduced by having a stochastic tagger with word states for the 100 most frequent words in the corpus.


contextual transformations that can make reference to words as well as part-of-speech tags. The transformation templates we add are:

Change tag a to tag b when:
1. The preceding (following) word is w.
2. The word two before (after) is w.
3. One of the two preceding (following) words is w.
4. The current word is w and the preceding (following) word is x.
5. The current word is w and the preceding (following) word is tagged z.
6. The current word is w.
7. The preceding (following) word is w and the preceding (following) tag is t.
8. The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t.

where w and x are variables over all words in the training corpus, and z and t are variables over all parts of speech.
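Lexicalized triggers condition on the word sequence as well as the tag sequence, so a trigger now takes both. A sketch (ours, not the paper's code; the helper names are invented, and the example encodes a learned rule quoted just below: change IN to RB if the word two positions to the right is as):

```python
def word_two_after_is(words, tags, i, w):
    # Lexicalized template 2 (mirrored to the right):
    # the word two after is w. (tags is unused here but kept so all
    # lexicalized triggers share one signature.)
    return i + 2 < len(words) and words[i + 2] == w

def apply_lexical_rule(words, tags, from_tag, to_tag, trigger):
    """Apply one lexicalized transformation everywhere its trigger fires."""
    return [to_tag if t == from_tag and trigger(words, tags, i) else t
            for i, t in enumerate(tags)]

# "as tall as" after initial-state tagging: both as tokens get IN.
words = ["as", "tall", "as"]
tags = ["IN", "JJ", "IN"]
fixed = apply_lexical_rule(words, tags, "IN", "RB",
                           lambda w, t, i: word_two_after_is(w, t, i, "as"))
# fixed == ["RB", "JJ", "IN"]
```

Only the first as has another as two positions to its right, so only it is retagged as an adverb, matching the as tall as example discussed below.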

Below we list two lexicalized transformations that were learned, training once again on the Wall Street Journal.

Change the tag:

(12) From IN to RB if the word two positions to the right is as.
(16) From VBP to VB if one of the previous two words is n't.14

The Penn Treebank tagging style manual specifies that in the collocation as . . . as, the first as is tagged as an adverb and the second is tagged as a preposition. Since as is most frequently tagged as a preposition in the training corpus, the initial-state tagger will mistag the phrase as tall as as:

as/IN tall/JJ as/IN

The first lexicalized transformation corrects this mistagging. Note that a bigram tagger trained on our training set would not correctly tag the first occurrence of as. Although adverbs are more likely than prepositions to follow some verb form tags, the fact that P(as | IN) is much greater than P(as | RB), and P(JJ | IN) is much greater than P(JJ | RB), lead to as being incorrectly tagged as a preposition by a stochastic tagger. A trigram tagger will correctly tag this collocation in some instances, due to the fact that P(IN | RB JJ) is greater than P(IN | IN JJ), but the outcome will be highly dependent upon the context in which this collocation appears.

The second transformation arises from the fact that when a verb appears in a context such as We don't eat or We didn't usually drink, the verb is in base form. A stochastic trigram tagger would have to capture this linguistic information indirectly from frequency counts of all trigrams of the form shown in figure 5 (where a star can match any part-of-speech tag) and from the fact that P(n't | RB) is fairly high.

14 In the Penn Treebank, n't is treated as a separate token, so don't becomes do/VBP n't/RB.


*  RB  VBP
*  RB  VB
RB *   VBP
RB *   VB

Figure 5
Trigram Tagger Probability Tables.

In Weischedel et al. (1993), results are given when training and testing a Markov-model based tagger on the Penn Treebank Tagged Wall Street Journal Corpus. They cite results making the closed vocabulary assumption that all possible tags for all words in the test set are known. When training contextual probabilities on one million words, an accuracy of 96.7% was achieved. Accuracy dropped to 96.3% when contextual probabilities were trained on 64,000 words. We trained the transformation-based tagger on the same corpus, making the same closed-vocabulary assumption.15 When training contextual rules on 600,000 words, an accuracy of 97.2% was achieved on a separate 150,000 word test set. When the training set was reduced to 64,000 words, accuracy dropped to 96.7%. The transformation-based learner achieved better performance, despite the fact that contextual information was captured in a small number of simple nonstochastic rules, as opposed to 10,000 contextual probabilities that were learned by the stochastic tagger. These results are summarized in table 1. When training on 600,000 words, a total of 447 transformations were learned. However, transformations toward the end of the list contribute very little to accuracy: applying only the first 200 learned transformations to the test set achieves an accuracy of 97.0%; applying the first 100 gives an accuracy of 96.8%. To match the 96.7% accuracy achieved by the stochastic tagger when it was trained on one million words, only the first 82 transformations are needed.

To see whether lexicalized transformations were contributing to the transformation-based tagger accuracy rate, we first trained the tagger using the nonlexical transformation template subset, then ran exactly the same test. Accuracy of that tagger was 97.0%. Adding lexicalized transformations resulted in a 6.7% decrease in the error rate (see table 1).16

    We found it a bit surprising that the addition of lexicalized transformations didnot result in a much greater improvement in performance. When transformations areallowed to make reference to words and word pairs, some relevant information isprobably missed due to sparse data. We are currently exploring the possibility ofincorporating word classes into the rule-based learner, in hopes of overcoming thisproblem. The idea is quite simple. Given any source of word class information, such

15 In both Weischedel et al. (1993) and here, the test set was incorporated into the lexicon, but was not used in learning contextual information. Testing with no unknown words might seem like an unrealistic test. We have done so for three reasons: (1) to allow for a comparison with previously quoted results, (2) to isolate known word accuracy from unknown word accuracy, and (3) in some systems, such as a closed vocabulary speech recognition system, the assumption that all words are known is valid. (We show results when unknown words are included later in the paper.)
16 The training we did here was slightly suboptimal, in that we used the contextual rules learned with unknown words (described in the next section), and filled in the dictionary, rather than training on a corpus without unknown words.


Table 1
Comparison of Tagging Accuracy With No Unknown Words

Method                       Training Corpus   # of Rules or        Acc.
                             Size (Words)      Context. Probs.      (%)
Stochastic                   64 K              6,170                96.3
Stochastic                   1 Million         10,000               96.7
Rule-Based With Lex. Rules   64 K              215                  96.7
Rule-Based With Lex. Rules   600 K             447                  97.2
Rule-Based w/o Lex. Rules    600 K             378                  97.0

as WordNet (Miller 1990), the learner is extended such that a rule is allowed to make reference to parts of speech, words, and word classes, allowing for rules such as

Change the tag from X to Y if the following word belongs to word class Z.

This approach has already been successfully applied to a system for prepositional

phrase attachment disambiguation (Brill and Resnik 1994).

4.3 Tagging Unknown Words
So far, we have not addressed the problem of unknown words. As stated above, the initial-state annotator for tagging assigns all words their most likely tag, as indicated in a training corpus. Below we show how a transformation-based approach can be taken for tagging unknown words, by automatically learning cues to predict the most likely tag for words not seen in the training corpus. If the most likely tag for unknown words can be assigned with high accuracy, then the contextual rules can be used to improve accuracy, as described above.

In the transformation-based unknown-word tagger, the initial-state annotator naively assumes the most likely tag for an unknown word is "proper noun" if the word is capitalized and "common noun" otherwise.17

Below, we list the set of allowable transformations.

Change the tag of an unknown word (from X) to Y if:
1. Deleting the prefix (suffix) x, |x| <= 4, results in a word (x is any string of length 1 to 4).
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word (|x| <= 4).
4. Word w ever appears immediately to the left (right) of the word.
5. Character z appears in the word.
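These conditions need only the unknown word itself plus a list of known words (to test whether stripping or adding an affix "results in a word"). A minimal sketch (ours, not the paper's code; the dictionary and helper names are invented) of the initial-state guess and two of the templates:

```python
# A toy stand-in for the words observed in the training corpus.
KNOWN_WORDS = {"act", "actress", "tall", "walk", "quick"}

def initial_guess(word):
    # Initial-state annotator: proper noun if capitalized, else common noun.
    return "NNP" if word[0].isupper() else "NN"

def has_suffix(word, x):
    # Template 2 (suffix case): the last characters of the word are x.
    return len(x) <= 4 and word.endswith(x)

def deleting_suffix_gives_word(word, x):
    # Template 1 (suffix case): deleting the suffix x results in a known word.
    return len(x) <= 4 and word.endswith(x) and word[:-len(x)] in KNOWN_WORDS

initial_guess("Acme")                        # "NNP"
initial_guess("walks")                       # "NN"
has_suffix("walks", "s")                     # True: rule 1 of figure 6 fires
deleting_suffix_gives_word("quickly", "ly")  # True: "quick" is a known word
```

With these predicates, a rule such as figure 6's first transformation (NN to NNS if the word has suffix -s) is just `has_suffix(word, "s")` guarding a tag rewrite.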

17 If we change the tagger to tag all unknown words as common nouns, then a number of rules are learned of the form: change tag to proper noun if the prefix is "E", "A", "B", etc., since the learner is not provided with the concept of upper case in its set of transformation templates.


    Change Tag
#   From  To    Condition
1   NN    NNS   Has suffix -s
2   NN    CD    Has character .
3   NN    JJ    Has character -
4   NN    VBN   Has suffix -ed
5   NN    VBG   Has suffix -ing
6   ??    RB    Has suffix -ly
7   ??    JJ    Adding suffix -ly results in a word.
8   NN    CD    The word $ can appear to the left.
9   NN    JJ    Has suffix -al
10  NN    VB    The word would can appear to the left.
11  NN    CD    Has character 0
12  NN    JJ    The word be can appear to the left.
13  NNS   JJ    Has suffix -us
14  NNS   VBZ   The word it can appear to the left.
15  NN    JJ    Has suffix -ble
16  NN    JJ    Has suffix -ic
17  NN    CD    Has character 1
18  NNS   NN    Has suffix -ss
19  ??    JJ    Deleting the prefix un- results in a word
20  NN    JJ    Has suffix -ive

Figure 6
The first 20 transformations for unknown words.

An unannotated text can be used to check the conditions in all of the above transformation templates. Annotated text is necessary in training to measure the effect of transformations on tagging accuracy. Since the goal is to label each lexical entry for new words as accurately as possible, accuracy is measured on a per type and not a per token basis.

Figure 6 shows the first 20 transformations learned for tagging unknown words in the Wall Street Journal corpus. As an example of how rules can correct errors generated by prior rules, note that applying the first transformation will result in the mistagging of the word actress. The 18th learned rule fixes this problem. This rule states:

Change a tag from plural common noun to singular common noun if the word has suffix -ss.

Keep in mind that no specific affixes are prespecified. A transformation can make

reference to any string of characters up to a bounded length. So while the first rule specifies the English suffix "s", the rule learner was not constrained from considering such nonsensical rules as:

Change a tag to adjective if the word has suffix "xhqr".
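Since candidate affixes are drawn from the words themselves rather than from a prespecified list, enumerating them is trivial. A sketch (ours; the helper name is invented) of generating the bounded-length suffix candidates for one word:

```python
def suffix_candidates(word, max_len=4):
    # All suffix strings of length 1..max_len drawn from the word itself,
    # leaving at least one character of stem; no affix list is prespecified.
    return [word[-k:] for k in range(1, min(max_len, len(word) - 1) + 1)]

suffix_candidates("walked")  # ['d', 'ed', 'ked', 'lked']
```

The learner scores every such instantiation against the training corpus, so linguistically sensible suffixes like -ed survive while arbitrary strings score poorly.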

Also, absolutely no English-specific information (such as an affix list) need be prespecified in the learner.18

18 This learner has also been applied to tagging Old English. See Brill (1993b). Although the


Figure 7
Accuracy vs. Transformation Number (test-set accuracy plotted against the number of transformations applied, 0-400).

We then ran the following experiment using 1.1 million words of the Penn Treebank Tagged Wall Street Journal Corpus. Of these, 950,000 words were used for training and 150,000 words were used for testing. Annotations of the test corpus were not used in any way to train the system. From the 950,000 word training corpus, 350,000 words were used to learn rules for tagging unknown words, and 600,000 words were used to learn contextual rules; 243 rules were learned for tagging unknown words, and 447 contextual tagging rules were learned. Unknown word accuracy on the test corpus was 82.2%, and overall tagging accuracy on the test corpus was 96.6%. To our knowledge, this is the highest overall tagging accuracy ever quoted on the Penn Treebank Corpus when making the open vocabulary assumption. Using the tagger without lexicalized rules, an overall accuracy of 96.3% and an unknown word accuracy of 82.0% is obtained. A graph of accuracy as a function of transformation number on the test set for lexicalized rules is shown in figure 7. Before applying any transformations, test set accuracy is 92.4%, so the transformations reduce the error rate by 50% over the baseline. The high baseline accuracy is somewhat misleading, as this includes the tagging of unambiguous words. Baseline accuracy when the words that are unambiguous in our lexicon are not considered is 86.4%. However, it is difficult to compare taggers using this figure, as the accuracy of the system depends on the particular lexicon used. For instance, in our training set the word the was tagged with a number of different tags, and so according to our lexicon the is ambiguous. If we instead used a lexicon where the is listed unambiguously as a determiner, the baseline accuracy would be 84.6%.

For tagging unknown words, each word is initially assigned a part-of-speech tag based on word and word-distribution features. Then, the tag may be changed based on contextual cues, via contextual transformations that are applied to the entire corpus, both known and unknown words. When the contextual rule learner learns transformations, it does so in an attempt to maximize overall tagging accuracy, and not unknown-word tagging accuracy. Unknown words account for only a small percentage of the corpus in our experiments, typically two to three percent. Since the distributional behavior of unknown words is quite different from that of known words, and

transformations are not English-specific, the set of transformation templates would have to be extended to process languages with dramatically different morphology.




    Brill    Transformation-Based Error-Driven Learning

    Table 2
    Tagging accuracy on different corpora.

    Corpus       Accuracy
    Penn WSJ     96.6%
    Penn Brown   96.3%
    Orig Brown   96.5%

    since a transformation that does not increase unknown-word tagging accuracy can still be beneficial to overall tagging accuracy, the contextual transformations learned are not optimal in the sense of leading to the highest tagging accuracy on unknown words. Better unknown-word accuracy may be possible by training and using two sets of contextual rules, one maximizing known-word accuracy and the other maximizing unknown-word accuracy, and then applying the appropriate transformations to a word when tagging, depending upon whether the word appears in the lexicon. We are currently experimenting with this idea.
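As a concrete illustration of how learned contextual transformations are applied, the sketch below (not the released implementation; the rule encoding and the example sentence are invented) runs ordered rules of the form "change tag A to tag B when the previous tag is Z", one of the paper's contextual templates, over an initially tagged sequence:

```python
# Minimal sketch of applying contextual transformations in learned order.
# Each rule rewrites tags over the whole sequence before the next rule runs;
# within one rule, matches are checked against a snapshot of the tags.

def apply_transformations(tags, rules):
    """tags: list of (word, tag) pairs from the initial-state annotator.
    rules: ordered list of (from_tag, to_tag, prev_tag) triples."""
    for from_tag, to_tag, prev_tag in rules:
        new_tags = list(tags)
        for i in range(1, len(tags)):
            if tags[i][1] == from_tag and tags[i - 1][1] == prev_tag:
                new_tags[i] = (tags[i][0], to_tag)
        tags = new_tags
    return tags

# Hypothetical rule: retag a past-tense verb as past participle after VBP.
sentence = [("we", "PP"), ("have", "VBP"), ("tagged", "VBD"), ("it", "PP")]
rules = [("VBD", "VBN", "VBP")]
print(apply_transformations(sentence, rules))
# -> [('we', 'PP'), ('have', 'VBP'), ('tagged', 'VBN'), ('it', 'PP')]
```

Because rules fire in the order they were learned, a later rule can repair the occasional damage an earlier rule does, which is central to the error-driven training procedure.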

    In Weischedel et al. (1993), a statistical approach to tagging unknown words is shown. In this approach, a number of suffixes and important features are prespecified. Then, for unknown words:

    p(W | T) = p(unknown word | T) * p(Capitalize-feature | T) * p(suffixes, hyphenation | T)

Using this equation for unknown-word emit probabilities within the stochastic tagger, an accuracy of 85% was obtained on the Wall Street Journal corpus. This portion of the stochastic model has over 1,000 parameters, with 10^8 possible unique emit probabilities, as opposed to a small number of simple rules that are learned and used in the rule-based approach. In addition, the transformation-based method learns specific cues instead of requiring them to be prespecified, allowing for the possibility of uncovering cues not apparent to the human language engineer. We have obtained comparable performance on unknown words, while capturing the information in a much more concise and perspicuous manner, and without prespecifying any information specific to English or to a specific corpus.

    In table 2, we show tagging results obtained on a number of different corpora, in each case training on roughly 9.5 x 10^5 words total and testing on a separate test set of 1.5-2 x 10^5 words. Accuracy is consistent across these corpora and tag sets.

    In addition to obtaining high rates of accuracy and representing relevant linguistic information in a small set of rules, the part-of-speech tagger can also be made to run extremely fast. Roche and Schabes (1995) show a method for converting a list of tagging transformations into a deterministic finite state transducer with one state transition taken per word of input; the result is a transformation-based tagger whose tagging speed is about ten times that of the fastest Markov-model tagger.
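The factorization above treats the unknown-word features as independent given the tag. The sketch below is only a minimal illustration of that independence assumption; the function name and every probability value are invented, and real values would be estimated from training counts:

```python
# Sketch of the stochastic unknown-word emit probability described above:
# p(W|T) is approximated as a product of independent feature probabilities.
# All numeric values here are invented for illustration.

def unknown_emit_prob(p_unknown_given_t, p_cap_given_t, p_suffix_given_t):
    # p(W|T) ~ p(unknown word|T) * p(Capitalize-feature|T)
    #          * p(suffixes, hyphenation|T)
    return p_unknown_given_t * p_cap_given_t * p_suffix_given_t

# e.g. for a proper-noun tag, where unknown words are common and
# usually capitalized (hypothetical numbers):
print(unknown_emit_prob(0.05, 0.9, 0.3))  # ~0.0135
```

The contrast the text draws is that each such probability is a separate parameter to estimate, whereas the transformation-based tagger encodes the same kind of evidence in a few dozen symbolic rules.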
4. K-Best Tags

    There are certain circumstances where one is willing to relax the one-tag-per-word requirement in order to increase the probability that the correct tag will be assigned to each word. In DeMarcken (1990) and Weischedel et al. (1993), k-best tags are assigned within a stochastic tagger by returning all tags within some threshold of probability of being correct for a particular word.




    Computational Linguistics    Volume 21, Number 4

    Table 3
    Results from k-best tagging.

    # of Rules   Accuracy   Avg. # of tags per word
    0            96.5       1.00
    50           96.9       1.02
    100          97.4       1.04
    150          97.9       1.10
    200          98.4       1.19
    250          99.1       1.50

    We can modify the transformation-based tagger to return multiple tags for a word by making a simple modification to the contextual transformations described above. The initial-state annotator is the tagging output of the previously described one-best transformation-based tagger. The allowable transformation templates are the same as the contextual transformation templates listed above, but with the rewrite rule: change tag X to tag Y modified to add tag X to tag Y or add tag X to word W. Instead of changing the tagging of a word, transformations now add alternative taggings to a word.
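A minimal sketch of this add-tag semantics, with an invented rule encoding: each word now carries a set of tags, and a transformation can only enlarge those sets, never shrink them:

```python
# Sketch of the k-best extension: a transformation *adds* an alternative
# tag rather than replacing one. Rule format (hypothetical encoding):
# add tag `add_tag` to any word currently tagged `when_tag` whose
# following word carries `next_tag`.

def apply_kbest_rule(tag_sets, add_tag, when_tag, next_tag):
    """tag_sets: one set of candidate tags per word, from one-best tagging."""
    for i in range(len(tag_sets) - 1):
        if when_tag in tag_sets[i] and next_tag in tag_sets[i + 1]:
            tag_sets[i].add(add_tag)
    return tag_sets

tags = [{"DT"}, {"NN"}, {"VBZ"}]
# Hypothetical rule: a word tagged NN before VBZ may also be a base verb.
result = apply_kbest_rule(tags, add_tag="VB", when_tag="NN", next_tag="VBZ")
print([sorted(s) for s in result])
# -> [['DT'], ['NN', 'VB'], ['VBZ']]
```

Since tags are only ever added, the one-best output is always among the candidates, which is why accuracy in table 3 rises monotonically with the number of k-best rules.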

    When allowing more than one tag per word, there is a trade-off between accuracy and the average number of tags for each word. Ideally, we would like to achieve as large an increase in accuracy with as few extra tags as possible. Therefore, in training we find transformations that maximize the function:

    number of corrected errors
    --------------------------
    number of additional tags
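Selecting the transformation that maximizes this ratio is a one-line argmax over the candidate rules. The rule names and counts below are invented for illustration:

```python
# Sketch of the k-best training criterion: among candidate add-tag
# transformations, pick the one maximizing corrected errors per
# additional tag introduced. Candidate statistics are hypothetical.

candidates = [
    # (rule description, errors it would correct, extra tags it would add)
    ("add VB after MD", 120, 400),    # ratio 0.30
    ("add VBN after VBP", 90, 150),   # ratio 0.60
    ("add JJ before NN", 60, 300),    # ratio 0.20
]

best = max(candidates, key=lambda c: c[1] / c[2])
print(best[0])  # -> add VBN after VBP
```

Note that the rule correcting the most errors outright is not chosen; the ratio penalizes rules that buy their corrections with many spurious extra tags, which is exactly the accuracy/ambiguity trade-off described above.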

    In table 3, we present results from first using the one-tag-per-word transformation-based tagger described in the previous section and then applying the k-best tag transformations. These transformations were learned from a separate 240,000-word corpus. As a baseline, we did k-best tagging of a test corpus. Each known word in the test corpus was tagged with all tags seen with that word in the training corpus, and the five most likely unknown-word tags were assigned to all words not seen in the training corpus.19 This resulted in an accuracy of 99.0%, with an average of 2.28 tags per word. The transformation-based tagger obtained the same accuracy with 1.43 tags per word, one third the number of additional tags as the baseline tagger.20

5. Conclusions

    In this paper, we have described a new transformation-based approach to corpus-based learning. We have given details of how this approach has been applied to part-of-speech tagging and have demonstrated that the transformation-based approach obtains

19 Thanks to Fred Jelinek and Fernando Pereira for suggesting this baseline experiment.
20 Unfortunately, it is difficult to find results to compare these k-best tag results to. In DeMarcken (1990), the test set is included in the training set, and so it is difficult to know how this system would do on fresh text. In Weischedel et al. (1993), a k-best tag experiment was run on the Wall Street Journal corpus. They quote the average number of tags per word for various threshold settings, but do not provide accuracy results.





    competitive performance with stochastic taggers on tagging both unknown and known words. The transformation-based tagger captures linguistic information in a small number of simple nonstochastic rules, as opposed to large numbers of lexical and contextual probabilities. This learning approach has also been applied to a number of other tasks, including prepositional phrase attachment disambiguation (Brill and Resnik 1994), bracketing text (Brill 1993a) and labeling nonterminal nodes (Brill 1993c). Recently, we have begun to explore the possibility of extending these techniques to other problems, including learning pronunciation networks for speech recognition and learning mappings between syntactic and semantic representations.

Appendix A: Penn Treebank Part-of-Speech Tags (Excluding Punctuation)

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential "there"
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO "to"
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb





Acknowledgments
This work was funded in part by NSF grant IRI-9502312. In addition, this work was done in part while the author was in the Spoken Language Systems Group at Massachusetts Institute of Technology under ARPA grant N00014-89-J-1332, and by DARPA/AFOSR grant AFOSR-90-0066 at the University of Pennsylvania. Thanks to Mitch Marcus, Mark Villain, and the anonymous reviewers for many useful comments on earlier drafts of this paper.

References

Black, Ezra; Jelinek, Fred; Lafferty, John; Magerman, David; Mercer, Robert; and Roukos, Salim (1993). "Towards history-based grammars: Using richer models for probabilistic parsing." In Proceedings, 31st Annual Meeting of the Association for Computational Linguistics, Columbus, OH.

Black, Ezra; Jelinek, Fred; Lafferty, John; Mercer, Robert; and Roukos, Salim (1992). "Decision tree models applied to the labeling of text with parts-of-speech." In DARPA Workshop on Speech and Natural Language, Harriman, NY.

Breiman, Leo; Friedman, Jerome; Olshen, Richard; and Stone, Charles (1984). Classification and Regression Trees. Wadsworth and Brooks.

Brill, Eric (1992). "A simple rule-based part of speech tagger." In Proceedings, Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.

Brill, Eric (1993a). "Automatic grammar induction and parsing free text: A transformation-based approach." In Proceedings, 31st Annual Meeting of the Association for Computational Linguistics, Columbus, OH.

Brill, Eric (1993b). A Corpus-Based Approach to Language Learning. Doctoral dissertation, Department of Computer and Information Science, University of Pennsylvania.

Brill, Eric (1993c). "Transformation-based error-driven parsing." In Proceedings, Third International Workshop on Parsing Technologies, Tilburg, The Netherlands.

Brill, Eric (1994). "Some advances in rule-based part of speech tagging." In Proceedings, Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA.

Brill, Eric and Resnik, Philip (1994). "A transformation-based approach to prepositional phrase attachment disambiguation." In Proceedings, Fifteenth International Conference on Computational Linguistics (COLING-1994), Kyoto, Japan.

Brown, Peter; Cocke, John; Della Pietra, Stephen; Della Pietra, Vincent; Jelinek, Fred; Lafferty, John; Mercer, Robert; and Roossin, Paul (1990). "A statistical approach to machine translation." Computational Linguistics, 16(2).

Brown, Peter; Lai, Jennifer; and Mercer, Robert (1991). "Word-sense disambiguation using statistical methods." In Proceedings, 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA.

Bruce, Rebecca and Wiebe, Janyce (1994). "Word-sense disambiguation using decomposable models." In Proceedings, 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM.

Charniak, Eugene; Hendrickson, Curtis; Jacobson, Neil; and Perkowitz, Michael (1993). "Equations for part of speech tagging." In Proceedings, Conference of the American Association for Artificial Intelligence (AAAI-93), Washington, DC.

Church, Kenneth (1988). "A stochastic parts program and noun phrase parser for unrestricted text." In Proceedings, Second Conference on Applied Natural Language Processing, ACL, Austin, TX.

Cutting, Doug; Kupiec, Julian; Pedersen, Jan; and Sibun, Penelope (1992). "A practical part-of-speech tagger." In Proceedings, Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.

DeMarcken, Carl (1990). "Parsing the LOB corpus." In Proceedings, 1990 Conference of the Association for Computational Linguistics, Pittsburgh, PA.

DeRose, Stephen (1988). "Grammatical category disambiguation by statistical optimization." Computational Linguistics, 14.

Francis, Winthrop Nelson and Kucera, Henry (1982). Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, Boston.





Fujisaki, Tetsu; Jelinek, Fred; Cocke, John; and Black, Ezra (1989). "Probabilistic parsing method for sentence disambiguation." In Proceedings, International Workshop on Parsing Technologies, Carnegie Mellon University, Pittsburgh, PA.

Gale, William and Church, Kenneth (1991). "A program for aligning sentences in bilingual corpora." In Proceedings, 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA.

Gale, William; Church, Kenneth; and Yarowsky, David (1992). "A method for disambiguating word senses in a large corpus." Computers and the Humanities.

Leech, Geoffrey; Garside, Roger; and Bryant, Michael (1994). "CLAWS4: The tagging of the British National Corpus." In Proceedings, 15th International Conference on Computational Linguistics, Kyoto, Japan.

Harris, Zellig (1962). String Analysis of Language Structure. Mouton and Co., The Hague.

Hindle, Donald (1989). "Acquiring disambiguation rules from text." In Proceedings, 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC.

Hindle, D. and Rooth, M. (1993). "Structural ambiguity and lexical relations." Computational Linguistics, 19(1):103-120.

Huang, Caroline; Son-Bell, Mark; and Baggett, David (1994). "Generation of pronunciations from orthographies using transformation-based error-driven learning." In International Conference on Speech and Language Processing (ICSLP), Yokohama, Japan.

Jelinek, Fred (1985). "Self-organized language modelling for speech recognition." In Impact of Processing Techniques on Communication, J. Skwirzinski, ed. Dordrecht.

Joshi, Aravind and Srinivas, B. (1994). "Disambiguation of super parts of speech (or supertags): Almost parsing." In Proceedings, 15th International Conference on Computational Linguistics, Kyoto, Japan.

Klein, Sheldon and Simmons, Robert (1963). "A computational approach to grammatical coding of English words." JACM, 10.

Kupiec, Julian (1992). "Robust part-of-speech tagging using a hidden Markov model." Computer Speech and Language, 6.

Marcus, Mitchell; Santorini, Beatrice; and Marcinkiewicz, Maryann (1993). "Building a large annotated corpus of English: the Penn Treebank." Computational Linguistics, 19(2).

Merialdo, Bernard (1994). "Tagging English text with a probabilistic model." Computational Linguistics.

Miller, George (1990). "WordNet: an on-line lexical database." International Journal of Lexicography, 3(4).

Quinlan, J. Ross (1986). "Induction of decision trees." Machine Learning, 1:81-106.

Quinlan, J. Ross and Rivest, Ronald (1989). "Inferring decision trees using the minimum description length principle." Information and Computation, 80.

Ramshaw, Lance and Marcus, Mitchell (1994). "Exploring the statistical derivation of transformational rule sequences for part-of-speech tagging." In The Balancing Act: Proceedings of the ACL Workshop on Combining Symbolic and Statistical Approaches to Language, New Mexico State University, July.

Roche, Emmanuel and Schabes, Yves (1995). "Deterministic part of speech tagging with finite state transducers." Computational Linguistics, 21(2), 227-253.

Schutze, Hinrich and Singer, Yoram (1994). "Part of speech tagging using a variable memory Markov model." In Proceedings, Association for Computational Linguistics, Las Cruces, NM.

Sharman, Robert; Jelinek, Fred; and Mercer, Robert (1990). "Generating a grammar for statistical training." In Proceedings, 1990 D