Universal Dependencies for Portuguese
Alexandre Rademaker (1,3), Fabricio Chalub (1), Livy Real (6),
Claudia Freitas (4), Eckhard Bick (5),
Valeria de Paiva (2)
(1) IBM Research, Brazil
(2) Nuance Communications, USA
(3) FGV/EMAp, Brazil
(4) PUC-Rio, Brazil
(5) University of Southern Denmark, Denmark
(6) USP, Brazil
Dependencies Linguistics 2017 Pisa
Rademaker et al UD for Portuguese Depling 2016 1 30
Linguistic Resources are Important
- Possibly no need to explain this here.
- At IBM Research Brazil, many projects involve language understanding and information extraction in Portuguese and English (technical domains such as O&G and mining).
- We started, and still work on, OpenWordnet-PT, a Portuguese version of Princeton WordNet, which is very useful in many applications.
Universal Dependencies
- Universal Dependencies (UD) offer the promise of greater parallelism between languages.
- Syntactic dependencies are not too far from semantic dependencies, which makes them useful for many applications.
- Manning's Law: the annotation must be satisfactory on linguistic grounds for individual languages, good for linguistic typology, suitable for rapid and consistent annotation, suitable for parsing with high accuracy, easily comprehended by non-linguists, and must support downstream language understanding tasks.
- UD 1.2 included the first UD_Portuguese; UD 1.3 added UD_Portuguese-BR (from Google's treebanks).
The corpus Bosque
'Bosque' means 'woods' in Portuguese. It consists of running news text from both Portugal and Brazil, chunked into sentences and syntactically analyzed as tree structures, first parsed automatically with PALAVRAS and then fully revised by linguists.
PALAVRAS is a rule-based Constraint Grammar (CG) system designed for Portuguese. It produces deep linguistic analyses with tags at the morphological, syntactic (dependency), and semantic levels.
The Corpus Bosque: UD 1.2 version of UD_Portuguese
- UD release 1.2 was the first release to include a Portuguese treebank. UD_Portuguese is a treebank based on the corpus Bosque (Floresta Sintá(c)tica project from Linguateca and VISL).
- It was based on Bosque 7.3 (AD format), converted to the format of the CoNLL-X Shared Task in dependency parsing (2006).
- The CoNLL version was converted to the Prague dependency style as part of HamleDT (since 2011).
- Later versions of HamleDT added a conversion to Stanford dependencies (2014) and to Universal Dependencies (HamleDT 3.0, 2015).
Bosque → CoNLL-X → Prague Deps → Stanford Deps → UD
More at http://www.linguateca.pt/Floresta/levantamento.html
The Corpus Bosque: in UD 1.2
The conversion process started from the AD format. In the end, we decided to implement a direct conversion script from AD to the CoNLL-X format, instead of relying on the pipeline of Eckhard Bick's scripts. However, as far as possible, his head rules are implemented. One detail that is probably different from his rules is the linkage in the case of more than one auxiliary in combination with a coordinated main verb, especially if the main verbs are accompanied by auxiliary particles. There just wasn't enough time to do this, sorry.
In some cases, the Bosque trees contain ambiguities that the annotators could not resolve. For the training data, ambiguity was resolved by simply taking the first annotated possibility. For the test data, sentences that contain ambiguity were discarded.
http://www.linguateca.pt/floresta/CoNLL-X/readme.conll
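The ambiguity policy described on this slide (keep the first annotated possibility for training data, discard ambiguous sentences from test data) can be sketched roughly as below. The data representation (each sentence as a list of alternative analyses) and the function name resolve_ambiguity are assumptions made for illustration; the original conversion script is not reproduced here.

```python
def resolve_ambiguity(sentences, split):
    """Apply the train/test ambiguity policy to a list of sentences.

    Each sentence is a list of alternative analyses (a hypothetical
    representation); an unambiguous sentence has exactly one analysis.
    """
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test'")
    resolved = []
    for analyses in sentences:
        if not analyses:
            continue  # nothing annotated for this sentence
        if len(analyses) == 1 or split == "train":
            # Unambiguous, or training data: keep the first possibility.
            resolved.append(analyses[0])
        # Ambiguous test sentences are discarded.
    return resolved
```

With this policy the training split keeps every sentence, while the test split shrinks by the number of ambiguous sentences.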
The Corpus UD: the consolidation
- Between September 2015 and March 2016, a set of UD conversion rules for the CG input was written, as described in (Bick 2016), and applied to the updated version of the dependency-style Bosque (Linguateca version 7.5 of March 2016).
- A team effort started in October 2016, aiming at full compatibility with UD through consistency checking and discussion.
- The first version of our data, UD 1.4 compliant, was included in UD release 1.4 as UD_Portuguese-Bosque. UD 1.4 thus contained UD_Portuguese, UD_Portuguese-Bosque, and UD_Portuguese-BR.
- We accepted the challenge of updating UD_Portuguese-Bosque to the UD 2.0 guidelines and replacing the previous UD_Portuguese corpus.
Why Bosque? Why not create a new one from scratch?
- Besides the original tagset and the CoNLL 2006 tagset, there are versions in CG, AD (phrase structure tree), tgrep, Penn Treebank, and TIGER formats. All of these are available from http://www.linguateca.pt/Floresta and http://corpora.di.uminho.pt/linguateca/FS/fs.html
- Different versions of the same material foster the study of different tagsets and their impact on NLP systems.
- The team included two researchers who had already worked on previous versions of Bosque.
- But conversion to the UD scheme was much more complicated than initially planned.
Why the effort
- Incorporate changes and additions made to the original treebank after 2006.
- Circumvent possible information loss due to previous conversions.
- A comparison of the results of two different conversions might yield interesting insights.
- We wanted to build a framework where manual revision work and consistency checks could be coordinated with automatic parser annotation and conversion rules, addressing systematic errors and fixing them automatically based on a few examples, rather than repeatedly fixing the same kind of error manually.
- We intend to enlarge the treebank and therefore deem it important to maintain a close link between live parser output and the UD conversion method: integrate the UD conversion grammar into PALAVRAS.
- Having the corpus revised by native Portuguese linguists guarantees better annotation quality.
The CG conversion grammar
- The conversion grammar ultimately used for the first conversion of Bosque to UD contained some 530 rules.
- 70 were simple feature-mapping rules and 130 were local MWE-splitting rules, assigning internal structure, POS, and features to the MWEs from Bosque.
- The remaining rules handled UD-specific dependency and function-label changes in a context-dependent fashion.
- The main issues were raising of copula dependents to subject complements, inversion of prepositional dependency, and the change from syntactic to semantic verb-chain dependency.
- With respect to punctuation attachment, the grammar actually went beyond conversion, identifying meaningful head tokens for commas, parentheses, etc.
Similarities and differences
PALAVRAS niceline format:
Esse [esse] <*> <dem> DET M S >N 1->2
carro [carro] <V> N M S SUBJ> 2->3
foi [ser] <fmc> <aux> V PS 3S IND VFIN FS-STA 3->0
achado [achar] <vH> <mv> V PCP M S ICL-AUX< 4->3
em [em] <sam-> PRP <ADVL 5->4
o [o] <-sam> <artd> DET M S >N 6->7
início [início] <temp> N M S P< 7->5
de [de] <sam-> <np-close> PRP N< 8->7
a [o] <-sam> <artd> DET F S >N 9->10
tarde [tarde] <per> N F S P< 10->8
em [em] <np-close> PRP N< 11->10
Engenheiro Marcilac [Engenheiro=Marcilac] <civ> <*> <heur> <foreign> PROP M S P< 12->11
. 13->0
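As a rough illustration of how such lines can be read programmatically, the sketch below parses one niceline-style line into its parts. It assumes the simplified layout shown on this slide (form, bracketed lemma, angle-bracketed secondary tags, POS, morphological features, syntactic function, and an id->head link); the function name parse_niceline and the dictionary keys are hypothetical, and real PALAVRAS output may carry additional markers that this sketch does not handle.

```python
import re

# form, [lemma], free-form tags, then "id->head" at the end of the line
LINE_RE = re.compile(
    r"^(?P<form>.+?)\s+\[(?P<lemma>[^\]]+)\]\s+(?P<tags>.*?)\s*"
    r"(?P<id>\d+)->(?P<head>\d+)$"
)


def parse_niceline(line):
    """Parse one niceline-style line; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    tags = m.group("tags").split()
    # Secondary tags are fully enclosed in angle brackets, e.g. <aux>.
    secondary = [t for t in tags if t.startswith("<") and t.endswith(">")]
    rest = [t for t in tags if t not in secondary]
    return {
        "id": int(m.group("id")),
        "head": int(m.group("head")),
        "form": m.group("form"),
        "lemma": m.group("lemma"),
        "secondary": secondary,
        "pos": rest[0] if rest else None,
        "features": rest[1:-1],
        # The last remaining tag is taken to be the syntactic function.
        "function": rest[-1] if len(rest) > 1 else None,
    }


token = parse_niceline("foi [ser] <fmc> <aux> V PS 3S IND VFIN FS-STA 3->0")
print(token["pos"], token["function"], token["head"])
```

Function tags such as <ADVL or SUBJ> are kept apart from secondary tags because they are not enclosed in angle brackets on both sides.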
Similarities and differences (cont.)
PALAVRAS analysis (dependency tree):
Form                 Lemma                POS   Function
Esse                 esse                 DET   >N
carro                carro                N     SUBJ>
foi                  ser                  V     root
achado               achar                V     ICL-AUX<
em                   em                   PRP   <ADVL
o                    o                    DET   >N
início               início               N     P<
de                   de                   PRP   N<
a                    o                    DET   >N
tarde                tarde                N     P<
em                   em                   PRP   N<
Engenheiro=Marcilac  Engenheiro=Marcilac  PROP  P<
The sentence-final punctuation token also attaches to the root.

UD analysis (dependency tree):
Form        Lemma       UPOS   Relation
Esse        esse        DET    det
carro       carro       NOUN   nsubjpass
foi         ser         AUX    auxpass
achado      achar       VERB   root
em          em          ADP    case
o           o           DET    det
início      início      NOUN   obl
de          de          ADP    case
a           o           DET    det
tarde       tarde       NOUN   nmod
em          em          ADP    case
Engenheiro  Engenheiro  PROPN  obl
Marsilac    Marsilac    PROPN  flat:name
.           .           PUNCT  punct
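The part-of-speech side of this comparison can be sketched as a small lookup. The mapping below covers only the tags that occur in the example sentence (DET, N, V, PRP, PROP), and it uses the <aux> secondary tag to choose between AUX and VERB, as the example suggests. It is an illustrative assumption, not the actual conversion grammar, which is context dependent; the names POS_MAP and ud_pos are hypothetical.

```python
# PALAVRAS POS -> UD POS, restricted to the tags seen in the example.
POS_MAP = {"DET": "DET", "N": "NOUN", "PRP": "ADP", "PROP": "PROPN"}


def ud_pos(palavras_pos, secondary_tags=()):
    """Map a PALAVRAS POS tag to a UD tag; return None if unknown."""
    if palavras_pos == "V":
        # In the example, 'foi' carries <aux> and becomes AUX,
        # while 'achado' carries <mv> (main verb) and becomes VERB.
        return "AUX" if "<aux>" in secondary_tags else "VERB"
    return POS_MAP.get(palavras_pos)


example = [
    ("DET", ["<*>", "<dem>"]),
    ("N", ["<V>"]),
    ("V", ["<fmc>", "<aux>"]),
    ("V", ["<vH>", "<mv>"]),
    ("PRP", ["<sam->"]),
]
print([ud_pos(pos, tags) for pos, tags in example])
```

Returning None for unmapped tags keeps gaps visible instead of silently guessing a category.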
Similarities and differences (cont.)
- The UD version retains the additional tags for NP definiteness and complex tenses, as well as the original syntactic function tags and secondary morphological tags (xpostag field).
- It keeps its original linguistic focus, but in addition it can be used for the new machine learning scenarios.
- We retain tags on the roots of sentences for their functions, such as question (FS-QUE), command (FS-COM), or statement (FS-STA).
- In some cases, the stored original function tags allow recovering a valency relation otherwise lost in the underspecified UD edge label, such as the distinction between free adverbial prepositional phrases (e.g., trabalhar em (ADV) 'work at') and valency-bound adverbials (e.g., morar em (ARG) 'live at').
Improving the data: gender
Gender is one of the hallmarks of Romance languages, and its annotation can be complicated because some words appear to have an underspecified gender. There are adjectives, such as grande (big) or feliz (happy), that have only one form for both genders. Sometimes we can tell the gender from the context, sometimes not.
Ex. CP652-3: Por enquanto estamos felizes só com o reconhecimento implícito. (For now, we are happy with only the implicit recognition.)
We use the value Unsp (for unspecified value).
Improving the data: MWEs
The PALAVRAS annotation has MWEs tokenized as a single word.
The UD version 1 guidelines proposed the dependency relations mwe or compound, so we started a process of splitting these single-token MWEs and assigning a POS tag to each of their components.
In UD version 2, different relations are used for MWEs (flat, fixed, and name), but this conversion could be done automatically.
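The automatic relabeling from version 1 to version 2 relations can be sketched as a column rewrite over CoNLL-U lines. The mapping used here (mwe to fixed, name to flat:name) is an assumption based on the UD v2 guideline changes and on the flat:name label used in this treebank; it is not the authors' actual script, and the names V1_TO_V2 and relabel_line are hypothetical.

```python
# Assumed v1 -> v2 relation renames; other relations are left untouched.
V1_TO_V2 = {"mwe": "fixed", "name": "flat:name"}

DEPREL = 7  # index of the DEPREL column in a 10-column CoNLL-U line


def relabel_line(line):
    """Rewrite the DEPREL column of one CoNLL-U token line.

    Comment lines, blank lines, and malformed lines are returned as-is.
    Token lines are returned without their trailing newline.
    """
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line
    cols[DEPREL] = V1_TO_V2.get(cols[DEPREL], cols[DEPREL])
    return "\t".join(cols)


old = "2\tMarsilac\tMarsilac\tPROPN\t_\t_\t1\tname\t_\t_"
print(relabel_line(old))
```

A plain rename like this only works when the tree structure is already the one version 2 expects; relations that change their attachment need more than a table lookup.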
Improving the data: participles
How to deal with participles was also a challenging issue. PALAVRAS tags all participles as verbs with the PCP feature.
In UD, a participle can be VERB or ADJ.
We worked on a set of linguistic rules to semi-automatically re-tag participles.
Improving the data: ellipses
In version 1, ellipsis cases were dealt with via the remnant dependency relation. In version 2, the remnant relation was discarded and a new treatment was proposed: the relation orphan.
Ex.: "Opala lasted 23 years, Chevette 20 [ ]."

Version 1 (remnant):
O    Opala  durou  23      anos  o    Chevette  20
det  nsubj  root   nummod  obj   det  remnant   remnant

Version 2 (orphan):
O    Opala  durou  23      anos  o    Chevette   20
det  nsubj  root   nummod  obj   det  parataxis  orphan
Tokenization: MWE
- The first conversion did not handle UD's tokenization: the original treebank's MWEs and the (syntactically motivated) splitting of Portuguese contractions (preposition plus article/determiner/pronoun, e.g., "neste" to "em + este" (in this)).
- The problem in token splitting is the need to assign (a) partial POS tags, (b) additional internal dependency links, and (c) new internal hook-up points for existing outgoing and incoming dependency links. It is not a simple table conversion.
- CG3 offers context-based manipulation not only of tags but also of entire tokens, which makes it possible to split MWE tokens, add POS features, and add dependency links.
- The MWE 'ao vivo' (live), for instance, is an ADV as a whole, while 'ao' is a contraction (ADP 'a' + DET 'o') and 'vivo' (live) is an ADJ.
- We adopted a MWEPOS= attribute in the misc field.
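To make the contraction splitting concrete, the sketch below emits CoNLL-U lines for a contraction: one multiword-token range line (for example 3-4) followed by one line per syntactic word. It covers only step (a), the partial POS tags; the dependency links of steps (b) and (c) are left as underscores, which is exactly the part that a table lookup cannot decide. The contraction table, the DET tag chosen for 'este', the copying of the form into the lemma column, and the function name split_contraction are assumptions for illustration.

```python
# Hypothetical contraction table: surface form -> (word, UPOS) parts.
CONTRACTIONS = {
    "neste": [("em", "ADP"), ("este", "DET")],
    "ao": [("a", "ADP"), ("o", "DET")],
}


def split_contraction(form, start_id):
    """Return CoNLL-U lines (10 tab-separated columns) for one token.

    Known contractions yield a range line plus one line per word;
    any other token yields a single, mostly empty line.
    """
    parts = CONTRACTIONS.get(form.lower())
    if parts is None:
        return [f"{start_id}\t{form}" + "\t_" * 8]
    end_id = start_id + len(parts) - 1
    # Range line: ID and FORM only, all other columns are '_'.
    lines = [f"{start_id}-{end_id}\t{form}" + "\t_" * 8]
    for offset, (word, upos) in enumerate(parts):
        # The lemma simply copies the word here (a simplification);
        # HEAD and DEPREL are left unspecified.
        lines.append(f"{start_id + offset}\t{word}\t{word}\t{upos}" + "\t_" * 6)
    return lines


for row in split_contraction("ao", 3):
    print(row)
```

Filling HEAD and DEPREL for the new words requires looking at the surrounding tree, which is why the conversion relied on context-based rules rather than a table.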
Tokenization: clitics
- Another issue related to tokenization is the problem of clitics in Portuguese. Portuguese has mesoclitics, that is, clitics that come inside the verb and change the verbal structure.
- Ex. CP895-1: Poder-se-á dizer que o estilo resulta da sua profissão, fotojornalista. (It can be said that the style results from his profession, photojournalist.)
- We decided to follow the traditional Portuguese grammars. In the example above, 'poder-se-á' is 'poderá/VERB' (it can, in the future) followed by 'se/PRON' (the reflexive).
The particle 'se'
1. Reflexive and reciprocal constructions. CF314-2: Você se acha louca? (Do you think you are crazy?)
2. Pronominal verbs. CF340-2: O ciclista espanhol, 48, se suicidou em Caupenne d'Armagnac, no sul da França, com um tiro. (The Spanish cyclist, 48, killed himself in Caupenne d'Armagnac, south of France, with a single shot.)
3. Pronominal passive voice. CF32-2: Primeiro aprova-se o texto enxuto e depois negocia-se a aprovação, sem prazo definido, das leis complementares e ordinárias. (First the short text is approved, and then, without a definite deadline, the approval of the complementary and ordinary statutes is negotiated.)
4. Indeterminate subject constructions. CP263-3: Pense-se em Kingsley Amis, Malcolm Bradbury e Albert Finney. (One can think of Kingsley Amis, Malcolm Bradbury, and Albert Finney.)
The particle 'se'
- In terms of universal dependencies, this indicates that in both cases (3) and (4) we could have the particle se as the subject of the verb, although the subject remains non-explicit: "vende-se casas" (houses are sold).
- But in UD the nsubj role is only applied to semantic arguments of a predicate; when there is an empty argument in a grammatical subject position (a pleonastic or expletive), it is labeled as expl.
- UD thus creates a certain uniformity among cases (2), (3), and (4). Since we consider the distinction between (2) (which has an explicit subject) and (3) and (4) (which do not) relevant, we keep this information: cases (3) and (4) carry the label SUBJ_INDEF in the misc field.
Negation
The treatment of negation changed from UD version 1 to version 2.
In UD version 2, a polarity feature was introduced (Polarity=Neg).
We treat não, and other words such as some uses of nada (nothing), as adverbs. So not many words are tagged PART.
CP153-4: Não estava nada à espera disto. ([I] was not waiting nothing for it.) Both are ADV. Sometimes the second is a pronoun:
CP778-11: A coincidência de funerárias e queijarias na nossa circunstância não significava nada. ('The coincidence of mortuaries and cheesemakers in our circumstances did not mean nothing.') Here we have obj(significa, nada).
Appositives
So far we have used a classic and comprehensive notion of appositives (both non-restrictive and restrictive), because:
a) this was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view, the decision favors consistent analysis.
'president Obama' would be appos (restrictive appositive) if we agree that Obama describes, defines, or modifies president. But for UD, since it is not reversible, it is not appos.
However, there are always borderline cases.
It is not clear to us why 'I met the president Obama' should receive a different analysis. So these cases were also tagged as appos in our corpus, but we recognize the issue is still open.
Numbers
Bosque has 9,368 sentences and 227,653 tokens, with 18,140 unique lemmas.
At the moment we still have 957 'dep' relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.
We also plan to check the coverage of the classes of verbs, nouns, adjectives, and adverbs against OpenWordNet-PT.
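Counts such as the number of remaining 'dep' relations can be obtained with a few lines over the CoNLL-U text. The sketch below assumes the standard 10-column CoNLL-U layout; it is not the UD statistics script, and the function name count_deprels and the sample text are hypothetical.

```python
from collections import Counter


def count_deprels(conllu_text):
    """Count dependency relations over the word lines of CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


sample = "\n".join([
    "# text = Esse carro",
    "1\tEsse\tesse\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tcarro\tcarro\tNOUN\t_\t_\t0\troot\t_\t_",
])
print(count_deprels(sample)["dep"])
```

Because the result is a Counter, asking for a relation that never occurs returns 0 rather than raising an error.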
Comparison and Assessment
Some big discrepancies in numbers between the 1.2 and 1.4/2.0 versions of UD_Portuguese, as computed by the statistics script, were easy to see.
Our version had many more cases of auxiliary verbs than UD_Portuguese in UD 1.2, probably because verbs like 'continuar' (to continue), 'começar' (to start), and 'acabar' (to end) can also be seen as modal auxiliaries, and that was our decision.
Ex. CP269-3: O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente. (The soldier fired into the air, but the individual continued to advance and was fatally struck.)
Comparison and Assessment (cont.)
We found that our version of Bosque had many more cases of apposition dependencies (appos).
In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment and conversion.
In the annotation provided by PALAVRAS, the syntactic function N<PRED (non-identifying apposition) can and should be converted into appos, but in the UD 1.2 version of UD_Portuguese all these cases were converted into nmod.
When we looked at the appos relation, considering the different pairs of POS tags being related, we found around 50 distinct POS tag pairs. These still need investigation.
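A survey of the POS tag pairs linked by appos can be sketched as follows. The code assumes standard 10-column CoNLL-U input with sentences separated by blank lines; the function name appos_pos_pairs and the two-token sample are hypothetical, and this is not the tooling actually used for the figures above.

```python
from collections import Counter


def appos_pos_pairs(conllu_text, relation="appos"):
    """Count (head UPOS, dependent UPOS) pairs for one relation."""
    pairs = Counter()
    for block in conllu_text.strip().split("\n\n"):
        upos = {}   # token id -> UPOS within this sentence
        arcs = []   # (head id, dependent id) for the chosen relation
        for line in block.splitlines():
            cols = line.split("\t")
            if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
                continue
            upos[cols[0]] = cols[3]
            if cols[7] == relation:
                arcs.append((cols[6], cols[0]))
        for head, dep in arcs:
            if head in upos:
                pairs[(upos[head], upos[dep])] += 1
    return pairs


sample = "\n".join([
    "1\tpresidente\tpresidente\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\tObama\tObama\tPROPN\t_\t_\t1\tappos\t_\t_",
])
print(appos_pos_pairs(sample).most_common())
```

Sorting the resulting pairs by frequency puts the rare combinations at the end, which is a natural place to start looking for annotation errors.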
Contributions
We implemented the cl-conllu library in Common Lisp; it is open source and freely available.
Since our group has not yet settled on a particular dependency editor, we also implemented an online CoNLL-U validation service.
What's Next
We should note that this work is not finished.
While our treebank is once again syntactically validated by the UD script, we are sure that many errors remain.
First, because, like other treebanks, we still have so-called 'semantic' failures, as described by the UD second level of validation.
But mostly because we know that many phenomena are not yet amenable to validation. Coordination, ellipsis, and negation remain big issues.
A challenge: the lack of an editor. A tabular-based one is easier for linguists, but to facilitate collaboration it must be web-based too.
Problems
- Some cases of <n> were not converted to NOUN, although the dependencies are right:
  "A direção do novo semanal será assinada por Ewaldo Ruy." (The direction of the new weekly will be assumed by Ewaldo Ruy.)
  "Pesquisadores acham que as linhas podem ser falhas geológicas." (Researchers believe that the lines may be geological faults.)
- Reported speech and parataxis are inconsistently annotated in many cases.
- The relation discourse is not consistently annotated.
- Numerals also need to be revised: cases such as 'trinta e sete' (37) and 'cento e dezesseis' (116) must be flat.
- Some obl that have the PALAVRAS tag PIV should be obj.
- We are now revising the appositional modifier appos versus nmod.
Thanks
Rademaker et al UD for Portuguese Depling 2016 30 30
Linguistic Resources are Important
I Possibly do not need to explain it here
I At IBM Research Brazil many projects on language understandingand information extraction Portuguese and English (technicaldomains such as OampG and Mining)
I We started and still work with the OpenWordnet-PT a Portugueseversion of Princeton WordNet very useful in many applications
Rademaker et al UD for Portuguese Depling 2016 2 30
Universal Dependencies
I Universal Dependencies offer the promise of greater parallelismbetween languages
I Syntactic dependencies are not too far from semantic dependenciesuseful for many applications
I Manningrsquos Law grounds for individual languages good for linguistictypology rapid and consistent annotation suitable for parsing withhigh accuracy comprehended by non-linguist support downstreamlanguage understanding tasks
I In UD 12 first UD Portuguese in UD 13 one additionalUD Portuguese-BR (from the Googlersquos treebanks)
Rademaker et al UD for Portuguese Depling 2016 3 30
The corpus Bosque
lsquoBosquersquo means lsquowoodsrsquo in Portuguese It consists of news running textfrom both Portugal and Brazil chunked into sentences syntacticallyanalyzed in tree structures making use of both automatic parsingPALAVRAS and fully revised by linguists
PALAVRAS is a rule-based Constraint Grammar CG system designed forPortuguese It produces deep linguistic analyses with tags at themorphological syntactic (dependency) and semantic levels
Rademaker et al UD for Portuguese Depling 2016 4 30
The Corpus BosqueUD 12 version of UD Portuguese
I UD release 12 was the first release to include a Portuguese treebankUD Portuguese a treebank is based on the corpus Bosque (FlorestaSinta(c)tica project from Linguateca and VISL)
I This was based on Bosque 73 (AD format) converted to CoNLL-XShared Task in dependency parsing (2006)
I CoNLL version was converted to the Prague dependency style as apart of HamleDT (since 2011)
I Later versions of HamleDT added a conversion to the Stanforddependencies (2014) and to Universal Dependencies (HamleDT 302015)
Bosque rarr CoNLL-X rarr Prague Deps rarr Stanford Deps rarr UDMore at httpwwwlinguatecaptFlorestalevantamentohtml
Rademaker et al UD for Portuguese Depling 2016 5 30
The Corpus BosqueIn UD 12
The conversion process started from the AD format In the end wedecided to implement a direct conversion script from AD to theCoNLL-X format instead of relying on the pipeline of Eckhard Bickrsquosscripts However as far as possible his head rules are implementedOne detail that is probably different from his rules is the linkage in caseof more than one auxiliary in combination with a coordinated mainverb especially if the main verbs are accompanied by auxiliary particlesThere just wasnrsquot enough time to do this sorry In some cases the Bosque trees contain ambiguities that theannotators could not resolve For the training data ambiguity wasresolved by simply taking the first annotated possibility For the testdata sentences that contain ambiguity were discarded
httpwwwlinguatecaptflorestaCoNLL-Xreadmeconll
Rademaker et al UD for Portuguese Depling 2016 6 30
The Corpus UDThe consolidation
I Between September 2015 and March 2016 a set of UD conversionrules for the CG input was written as described in (Bick 2016) andapplied to the updated version of the dependency-style Bosque(Linguateca version 75 of Mar 2016)
I We started a team effort starting in Oct 2016 and throughconsistency-checking and discussion aiming at full compatibility withUD
I First version of our data UD 14 compliant included in UD release14 as UD Portuguese-Bosque In UD 14 UD Portuguese andUD Portuguese-Bosque and UD Portuguese-BR
I We accepted the challenge to update UD Portuguese-Bosque to UD20 guidelines and replace the previous UD Portuguese corpus
Rademaker et al UD for Portuguese Depling 2016 7 30
Why BosqueWhy not creating a new one from scratch
I Besides the original tagset and the CONLL 2006 tagset there areversions in CG AD (phrase structure tree) tgrep Penn TreeBankand TIGER formats All these are available fromhttpwwwlinguatecaptFloresta andhttpcorporadiuminhoptlinguatecaFSfshtml
I Different versions of the same material fosters the study aboutdifferent tagsest and its impacts in NLP systems
I We had on the team two researchers who had already worked onprevious versions of Bosque
I But conversion to UD scheme was much more complicated thaninitially planned
Rademaker et al UD for Portuguese Depling 2016 8 30
Why the effortI Incorporate changes and additions made in the original treebank after
2006I circumvent possible information loss due to previous conversionsI A comparison of the results of two different conversions might yield
interesting insightsI We wanted to build a framework where manual revision work and
consistency checks could be coordinated with automatic parserannotation and conversion rules Addressing systematic errors andthus fix them automatically based on a few examples rather thanrepeatedly fixing the same kind of error manually
I We intend to enlarge the treebank and therefore deem it importantto be able to maintain a close link between live parser output and theUD conversion method Integrate UD conversion grammar inPALAVRAS
I Having the corpus revised by native Portuguese linguists guarantees abetter annotation quality
Rademaker et al UD for Portuguese Depling 2016 9 30
The CG conversion grammar
I The conversion grammar ultimately used for the first conversion ofBosque to UD contained some 530 rules
I 70 were simple feature mapping rules and 130 were local MWEsplitting rules assigning internal structure POS and features to theMWEs from Bosque
I The remaining rules handled UD-specific dependency and functionlabel changes in a context-dependent fashion
I Main issues were raising of copula dependents to subjectcomplements inversion of prepositional dependency and the changefrom syntactic to semantic verb chain dependency
I In respect to punctuation attachment the grammar actually wentbeyond conversion identifying meaningful head tokens for commasparenthesis etc
Rademaker et al UD for Portuguese Depling 2016 10 30
similarities and differences
PALAVRAS niceline format
Esse [esse] ltgt ltdemgt DET M S gtN 1-gt2
carro [carro] ltVgt N M S SUBJgt 2-gt3
foi [ser] ltfmcgt ltauxgt V PS 3S IND VFIN FS-STA 3-gt0
achado [achar] ltvHgt ltmvgt V PCP M S ICL-AUXlt 4-gt3
em [em] ltsam-gt PRP ltADVL 5-gt4
o [o] lt-samgt ltartdgt DET M S gtN 6-gt7
inıcio [inıcio] lttempgt N M S Plt 7-gt5
de [de] ltsam-gt ltnp-closegt PRP Nlt 8-gt7
a [o] lt-samgt ltartdgt DET F S gtN 9-gt10
tarde [tarde] ltpergt N F S Plt 10-gt8
em [em] ltnp-closegt PRP Nlt 11-gt10
Engenheiro Marcilac [Engenheiro=Marcilac] ltcivgt ltgt
ltheurgt ltforeigngt PROP M S Plt 12-gt11
13-gt0
Rademaker et al UD for Portuguese Depling 2016 11 30
similarities and differencescont
Esse carro foi achado em o inıcio de a tarde em Engenheiro=Marcilac esse carro ser achar em o inıcio de o tarde em Engenheiro=Marcilac DET N V V PRP DET N PRP DET N PRP PROP
gt N SUBJ gt
root
ICL minus AUX
lt ADVL
gt N
P lt
gt N gt N
P lt
N lt P lt
root
Esse carro foi achado em o inıcio de a tarde em Engenheiro Marsilac esse carro ser achar em o inıcio de o tarde em Engenheiro Marsilac DET NOUN AUX VERB ADP DET NOUN ADP DET NOUN ADP PROPN PROPN PUNCT
det
nsubjpass
auxpass
root
case
det
obl
case
det
nmod
case
obl
flatname
punct
Rademaker et al UD for Portuguese Depling 2016 12 30
similarities and differencescont
I UD version retains the additional tags for NP definiteness andcomplex tenses and the original syntactic functions tags andsecondary morphological tags (xpostag field)
I keeps its original linguistic focus but in addition it can be used forthe new machine learning scenarios
I We retain tags roots of sentences for their functions such as question(FS-QUE) command (FS-COM) or statement (FS-STA)
I In some cases the stored original function tags allow recover avalency relation otherwise lost in the underspecified UD edge labelsuch as the distinction between free adverbial prepositional phrases(eg trabalhar em (ADV) lsquowork atrsquo and valency-bound adverbial (egmorar em (ARG) lsquolive atrsquo)
Rademaker et al UD for Portuguese Depling 2016 13 30
Improving the datagender
Gender is one of the hallmarks of Romance languages and annotation canbe complicated as some words appear to have an underspecified genderThere are adjectives such as grande (big) or feliz (happy) that have onlyone form for both genders Sometimes we can tell by the contextsometimes not
Ex CP652-3 Por enquanto estamos felizes so com o reconhecimentoimplıcito (For now we are happy with only the implicit recognition)
Unsp (for unspecified value)
Rademaker et al UD for Portuguese Depling 2016 14 30
Improving the dataMWEs
The PALAVRAS annotation has MWEs tokenized as a single word
The UD version 1 guidelines proposed the dependency relations mwe orcompound so a process of dismembering these single token MWEs andassigning each of their components a POS-tag was initiated
UD version 2 different tags for MWE are used (flat fixed and name)but this conversion could be done automatically
Rademaker et al UD for Portuguese Depling 2016 15 30
Improving the dataParticiples
How to deal with participles was also a challenging issue PALAVRAS tagsall participles as verbs with the PCP feature
In UD can be VERB or ADJ
We worked on a set of linguistic rules to semi-automatically re-tagparticiples
Rademaker et al UD for Portuguese Depling 2016 16 30
Improving the dataEllipses
In version 1 ellipsis cases were dealt with via a remnant dependencyrelation In version 2 the remnant relation was discarded and a newtreatment was proposed the relation orphanEx ldquoOpala lasted 23 years Chevette 20 [ ]rdquo
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
remnantremnant
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
parataxis
orphan
Rademaker et al UD for Portuguese Depling 2016 17 30
TokenizationMWE
I The first conversion did not handle UDrsquos tokenization Originaltreebankrsquos MWE and - syntactically motivated - splitting ofPortuguese contractions (preposition plus articledeterminerpronouneg ldquonesterdquo to ldquoem + esterdquo (in this)
I The problem in token-splitting is the need to assign (a) partial POStags (b) additional internal dependency links and (c) new internalhook-up points for existing outgoing and incoming dependency linksNot a simple table conversion
I CG3 offers context-based manipulation of not only tags but also ofentire tokens To split MWE tokens add POS features adddependency links
I The MWE lsquoao vivorsquo (live) for instance is an ADV as a whole whilelsquoaorsquo is a contraction (ADP lsquoarsquo + DET lsquoorsquo) and lsquovivorsquo (live) is an ADJ
I We adopted a MWEPOS= in the misc field
Rademaker et al UD for Portuguese Depling 2016 18 30
Tokenizationclitics
I Another issue related to tokenization is the problem of clitics inPortuguese Portuguese we have mesoclitics that is clitics that comeinside the verb and change the verbal structure
I Ex CP895-1 Poder-se-a dizer que o estilo resulta da sua profissaofotojornalista (It can be said that the style results from hisprofession pho- tojournalist)
I We decided to follow the traditional Portuguese grammars In theexample above lsquopoder-se-arsquo is lsquopoderaVERBrsquo followed by lsquosePRONrsquo(it can) in the future plus the reflexive
Rademaker et al UD for Portuguese Depling 2016 19 30
The particle lsquosersquo
1 reflexive and reciprocal constructions CF314-2 Voce se acha louca (Doyou think you are crazy)
2 pronominal verbs CF340-2 O ciclista espanhol 48 se suicidou emCaupenne drsquoArmagnac no sul da Franca com um tiro (The Spanish cyclist48 killed himself in Caupenne drsquoArmagnac south of France with a singleshot)
3 pronominal passive voice CF32-2 - Primeiro aprova-se o texto enxuto edepois negocia-se a aprovacao sem prazo definido das leis complementarese ordinarias (First the short text is approved and then without a definitedeadline the approval of the complementary and ordinary statutes isnegotiated)
4 undeterminate subject constructions CP263-3 Pense-se em KingsleyAmis Malcolm Bradbury e Albert Finney (One can think of Kingsley AmisMalcolm Bradbury and Albert Finney)
Rademaker et al UD for Portuguese Depling 2016 20 30
The particle lsquosersquo
I universal dependencies this indicates that in both cases (3) and (4)we could have the particle se as the subject of the verb although thesubject remains non-explicit ldquovende-se casasrdquo (Houses are sold)
I But in UD lsquonsubjrsquo role is only applied to semantic arguments of apredicate when there is an empty argument in a grammatical subjectposition (a pleonastic or expletive) it is labeled as expl
I UD creates a certain uniformity between the cases (2) (3) and (4)Since we consider relevant the distinction between (2) (which has anexplicit subject) and (3) and (4) (which do not) we keep thisinformation Cases (3) and (4) carry the label SUBJ INDEF in themisc field
Rademaker et al UD for Portuguese Depling 2016 21 30
Negation
The treatment of negation has changed from UD version 1 to 2
In the UD version 2 a polarity feature was introduced (Polarity=Neg)
We understand nao ndash other words as some uses of nada (nothing) ndash asadverbs So not many words tagged PART
CP153-4 Nao estava nada a espera disto ([I] was not waiting nothing forit) both ADV Sometimes the second is pronoun
CP778-11 A coincidecia de funerarias e queijarias na nossa circunstancianao significava nada (rsquoThe coincidence of mortuaries and cheesemakersin our circumstances did not mean nothing rsquo) ndash obj(significanada)
Appositives
So far we have used a classic and comprehensive notion of appositives (non-restrictive and restrictive).
a) This was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view, the decision favors consistent analysis.
‘president Obama’ would be appos (restrictive appositive) if we agree that Obama describes, defines or modifies president. But for UD, since it is not reversible, it is not appos.
However, there are always borderline cases.
It is not clear to us why ‘I met the president Obama’ should receive a different analysis. So these cases were also tagged as ‘appos’ in our corpus, but we recognize the issue is still open.
Numbers
Bosque has 9,368 sentences and 227,653 tokens, with 18,140 unique lemmas.
At the moment we still have 957 ‘dep’ relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.
We also plan to check the coverage of the classes of verbs, nouns, adjectives and adverbs against OpenWordNet-PT.
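Counts of this kind can be recomputed directly from a CoNLL-U file. The following Python sketch is one possible way to do so, not the script that produced the figures above: it counts sentences, syntactic words, unique lemmas and ‘dep’ relations. The helper `row` and the two toy sentences are hypothetical, and whether the reported token count includes multiword-token ranges is not stated, so this sketch simply skips them.

```python
from collections import Counter


def row(idx, form, lemma, upos, head, deprel):
    """Build one 10-column CoNLL-U token line (hypothetical helper for the toy data)."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"])


# Two hand-written toy sentences separated by a blank line (not treebank data).
SAMPLE = "\n".join([
    row(1, "O", "o", "DET", 2, "det"),
    row(2, "carro", "carro", "NOUN", 3, "nsubj"),
    row(3, "parou", "parar", "VERB", 0, "root"),
    "",
    row(1, "O", "o", "DET", 2, "det"),
    row(2, "soldado", "soldado", "NOUN", 3, "nsubj"),
    row(3, "disparou", "disparar", "VERB", 0, "root"),
    row(4, "ali", "ali", "ADV", 3, "dep"),
])


def treebank_stats(conllu_text):
    """Count sentences, word lines, unique lemmas and 'dep' relations."""
    sentences, tokens = 0, 0
    lemmas, deprels = set(), Counter()
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skips multiword ranges (1-2) and empty nodes (1.1)
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        lemmas.add(cols[2])
        deprels[cols[7]] += 1
    return {
        "sentences": sentences,
        "tokens": tokens,
        "unique_lemmas": len(lemmas),
        "dep_relations": deprels["dep"],
    }


print(treebank_stats(SAMPLE))
```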
Comparison and Assessment
Some big discrepancies in numbers between the 1.2 and 1.4/2.0 versions of UD Portuguese, as computed by the statistics script, were easy to see.
Our version had many more cases of auxiliary verbs than UD Portuguese in UD 1.2, probably because verbs like ‘continuar’ (to continue), ‘começar’ (to start) and ‘acabar’ (to end) can also be seen as modal auxiliaries, and that was our decision.
Ex. CP269-3: O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente. (The soldier fired into the air, but the individual continued to advance and was fatally struck.)
Comparison and Assessment (cont.)
We found that our version of Bosque had many more cases of apposition dependencies (appos).
In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment-conversion.
In the annotation provided by PALAVRAS, the syntactic function N<PRED (non-identifying apposition) can and should be converted into appos, but in UD Portuguese (UD 1.2) all these cases were converted into nmod.
When we looked at the appos relation, considering which pairs of POS tags it relates, we found around 50 different POS tag pairs. These still need investigation.
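A survey of this kind can be sketched in a few lines of Python: for every token attached with appos, record the pair (UPOS of the head, UPOS of the dependent) and count the pairs. This is only an illustration of the idea, not the procedure actually used; the helper `row` and the two toy sentences are hypothetical.

```python
import re
from collections import Counter


def row(idx, form, lemma, upos, head, deprel):
    """Build one 10-column CoNLL-U token line (hypothetical helper for the toy data)."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"])


# Two hand-written toy sentences (not treebank data).
SAMPLE = "\n".join([
    row(1, "o", "o", "DET", 2, "det"),
    row(2, "presidente", "presidente", "NOUN", 0, "root"),
    row(3, "Obama", "Obama", "PROPN", 2, "appos"),
    "",
    row(1, "Lisboa", "Lisboa", "PROPN", 0, "root"),
    row(2, ",", ",", "PUNCT", 3, "punct"),
    row(3, "capital", "capital", "NOUN", 1, "appos"),
])


def appos_pos_pairs(conllu_text):
    """Count (head UPOS, dependent UPOS) pairs linked by the appos relation."""
    pairs = Counter()
    # Sentences are separated by blank lines in CoNLL-U.
    for block in re.split(r"\n\s*\n", conllu_text.strip()):
        rows = [l.split("\t") for l in block.splitlines()
                if l and not l.startswith("#")]
        rows = [r for r in rows if len(r) == 10 and r[0].isdigit()]
        upos_by_id = {r[0]: r[3] for r in rows}
        for r in rows:
            if r[7] == "appos" and r[6] in upos_by_id:
                pairs[(upos_by_id[r[6]], r[3])] += 1
    return pairs


for pair, count in sorted(appos_pos_pairs(SAMPLE).items()):
    print(pair, count)
```

Sorting the resulting counter by frequency would show which pairs are common and which are rare enough to inspect by hand.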
Contributions
We implemented the cl-conllu library. It is written in Common Lisp, open-source and freely available.
Since we have not yet decided in our group to use any particular dependency editor, we also implemented an online CoNLL-U validation service.
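To give an idea of what format-level validation involves, here is a minimal Python sketch of a few structural checks on one CoNLL-U sentence: ten tab-separated columns, token IDs running from 1 to n, every HEAD within range, and exactly one root. It is an assumption-laden illustration only; it is neither the cl-conllu API nor the online service, and it covers far less than the official UD validation script.

```python
def validate_sentence(lines):
    """Return a list of error messages for one sentence (empty if none found)."""
    errors = []
    rows = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            errors.append(f"line {number}: expected 10 columns, found {len(cols)}")
        elif cols[0].isdigit():
            rows.append(cols)  # ranges (1-2) and empty nodes (1.1) are skipped

    size = len(rows)
    if [int(r[0]) for r in rows] != list(range(1, size + 1)):
        errors.append("token IDs are not the sequence 1..n")

    roots = 0
    for r in rows:
        token_id, head = r[0], r[6]
        if not head.isdigit() or int(head) > size:
            errors.append(f"token {token_id}: HEAD {head!r} is out of range")
        elif int(head) == 0:
            roots += 1
        elif head == token_id:
            errors.append(f"token {token_id}: token is its own head")
    if roots != 1:
        errors.append(f"expected exactly one root, found {roots}")
    return errors


# Hand-written toy sentence (not treebank data) and a corrupted copy of it.
GOOD = [
    "# text = O carro parou",
    "1\tO\to\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tcarro\tcarro\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tparou\tparar\tVERB\t_\t_\t0\troot\t_\t_",
]
BAD = [GOOD[0], "1\tO\to\tDET\t_\t_\t7\tdet\t_\t_", GOOD[2], GOOD[3]]

print(validate_sentence(GOOD))
print(validate_sentence(BAD))
```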
What’s Next
We should note that this work is not finished.
While our treebank is once again syntactically validated by the UD script, we are sure that many errors remain.
First, because, like other treebanks, we still have so-called ‘semantic’ failures, as described by the UD second level of validation.
But mostly because we know that many phenomena are not as yet susceptible to validation. Coordination, ellipsis and negation remain big issues.
A challenge: the lack of an editor. A tabular-based one is easier for linguists, but to facilitate collaboration it must be web-based too.
Problems
- Some cases of <n> were not converted to NOUN, although the dependencies are right: “A direção do novo semanal será assinada por Ewaldo Ruy” (The direction of the new weekly will be assumed by Ewaldo Ruy); “Pesquisadores acham que as linhas podem ser falhas geológicas” (Researchers believe that the lines may be geological faults).
- Many cases of reported speech and parataxis are inconsistently annotated.
- The relation discourse is not consistently annotated.
- Numerals also need to be revised: cases such as ‘trinta e sete’ (37) and ‘cento e dezesseis’ (116) must be flat.
- Some obl that have the PALAVRAS tag PIV should be obj.
- We are now revising the appositional modifier: appos versus nmod.
Thanks
The corpus Bosque
lsquoBosquersquo means lsquowoodsrsquo in Portuguese It consists of news running textfrom both Portugal and Brazil chunked into sentences syntacticallyanalyzed in tree structures making use of both automatic parsingPALAVRAS and fully revised by linguists
PALAVRAS is a rule-based Constraint Grammar CG system designed forPortuguese It produces deep linguistic analyses with tags at themorphological syntactic (dependency) and semantic levels
Rademaker et al UD for Portuguese Depling 2016 4 30
The Corpus BosqueUD 12 version of UD Portuguese
I UD release 12 was the first release to include a Portuguese treebankUD Portuguese a treebank is based on the corpus Bosque (FlorestaSinta(c)tica project from Linguateca and VISL)
I This was based on Bosque 73 (AD format) converted to CoNLL-XShared Task in dependency parsing (2006)
I CoNLL version was converted to the Prague dependency style as apart of HamleDT (since 2011)
I Later versions of HamleDT added a conversion to the Stanforddependencies (2014) and to Universal Dependencies (HamleDT 302015)
Bosque rarr CoNLL-X rarr Prague Deps rarr Stanford Deps rarr UDMore at httpwwwlinguatecaptFlorestalevantamentohtml
Rademaker et al UD for Portuguese Depling 2016 5 30
The Corpus BosqueIn UD 12
The conversion process started from the AD format In the end wedecided to implement a direct conversion script from AD to theCoNLL-X format instead of relying on the pipeline of Eckhard Bickrsquosscripts However as far as possible his head rules are implementedOne detail that is probably different from his rules is the linkage in caseof more than one auxiliary in combination with a coordinated mainverb especially if the main verbs are accompanied by auxiliary particlesThere just wasnrsquot enough time to do this sorry In some cases the Bosque trees contain ambiguities that theannotators could not resolve For the training data ambiguity wasresolved by simply taking the first annotated possibility For the testdata sentences that contain ambiguity were discarded
httpwwwlinguatecaptflorestaCoNLL-Xreadmeconll
Rademaker et al UD for Portuguese Depling 2016 6 30
The Corpus UDThe consolidation
I Between September 2015 and March 2016 a set of UD conversionrules for the CG input was written as described in (Bick 2016) andapplied to the updated version of the dependency-style Bosque(Linguateca version 75 of Mar 2016)
I We started a team effort starting in Oct 2016 and throughconsistency-checking and discussion aiming at full compatibility withUD
I First version of our data UD 14 compliant included in UD release14 as UD Portuguese-Bosque In UD 14 UD Portuguese andUD Portuguese-Bosque and UD Portuguese-BR
I We accepted the challenge to update UD Portuguese-Bosque to UD20 guidelines and replace the previous UD Portuguese corpus
Rademaker et al UD for Portuguese Depling 2016 7 30
Why BosqueWhy not creating a new one from scratch
I Besides the original tagset and the CONLL 2006 tagset there areversions in CG AD (phrase structure tree) tgrep Penn TreeBankand TIGER formats All these are available fromhttpwwwlinguatecaptFloresta andhttpcorporadiuminhoptlinguatecaFSfshtml
I Different versions of the same material fosters the study aboutdifferent tagsest and its impacts in NLP systems
I We had on the team two researchers who had already worked onprevious versions of Bosque
I But conversion to UD scheme was much more complicated thaninitially planned
Rademaker et al UD for Portuguese Depling 2016 8 30
Why the effortI Incorporate changes and additions made in the original treebank after
2006I circumvent possible information loss due to previous conversionsI A comparison of the results of two different conversions might yield
interesting insightsI We wanted to build a framework where manual revision work and
consistency checks could be coordinated with automatic parserannotation and conversion rules Addressing systematic errors andthus fix them automatically based on a few examples rather thanrepeatedly fixing the same kind of error manually
I We intend to enlarge the treebank and therefore deem it importantto be able to maintain a close link between live parser output and theUD conversion method Integrate UD conversion grammar inPALAVRAS
I Having the corpus revised by native Portuguese linguists guarantees abetter annotation quality
Rademaker et al UD for Portuguese Depling 2016 9 30
The CG conversion grammar
I The conversion grammar ultimately used for the first conversion ofBosque to UD contained some 530 rules
I 70 were simple feature mapping rules and 130 were local MWEsplitting rules assigning internal structure POS and features to theMWEs from Bosque
I The remaining rules handled UD-specific dependency and functionlabel changes in a context-dependent fashion
I Main issues were raising of copula dependents to subjectcomplements inversion of prepositional dependency and the changefrom syntactic to semantic verb chain dependency
I In respect to punctuation attachment the grammar actually wentbeyond conversion identifying meaningful head tokens for commasparenthesis etc
Rademaker et al UD for Portuguese Depling 2016 10 30
similarities and differences
PALAVRAS niceline format
Esse [esse] ltgt ltdemgt DET M S gtN 1-gt2
carro [carro] ltVgt N M S SUBJgt 2-gt3
foi [ser] ltfmcgt ltauxgt V PS 3S IND VFIN FS-STA 3-gt0
achado [achar] ltvHgt ltmvgt V PCP M S ICL-AUXlt 4-gt3
em [em] ltsam-gt PRP ltADVL 5-gt4
o [o] lt-samgt ltartdgt DET M S gtN 6-gt7
inıcio [inıcio] lttempgt N M S Plt 7-gt5
de [de] ltsam-gt ltnp-closegt PRP Nlt 8-gt7
a [o] lt-samgt ltartdgt DET F S gtN 9-gt10
tarde [tarde] ltpergt N F S Plt 10-gt8
em [em] ltnp-closegt PRP Nlt 11-gt10
Engenheiro Marcilac [Engenheiro=Marcilac] ltcivgt ltgt
ltheurgt ltforeigngt PROP M S Plt 12-gt11
13-gt0
Rademaker et al UD for Portuguese Depling 2016 11 30
similarities and differencescont
Esse carro foi achado em o inıcio de a tarde em Engenheiro=Marcilac esse carro ser achar em o inıcio de o tarde em Engenheiro=Marcilac DET N V V PRP DET N PRP DET N PRP PROP
gt N SUBJ gt
root
ICL minus AUX
lt ADVL
gt N
P lt
gt N gt N
P lt
N lt P lt
root
Esse carro foi achado em o inıcio de a tarde em Engenheiro Marsilac esse carro ser achar em o inıcio de o tarde em Engenheiro Marsilac DET NOUN AUX VERB ADP DET NOUN ADP DET NOUN ADP PROPN PROPN PUNCT
det
nsubjpass
auxpass
root
case
det
obl
case
det
nmod
case
obl
flatname
punct
Rademaker et al UD for Portuguese Depling 2016 12 30
similarities and differencescont
I UD version retains the additional tags for NP definiteness andcomplex tenses and the original syntactic functions tags andsecondary morphological tags (xpostag field)
I keeps its original linguistic focus but in addition it can be used forthe new machine learning scenarios
I We retain tags roots of sentences for their functions such as question(FS-QUE) command (FS-COM) or statement (FS-STA)
I In some cases the stored original function tags allow recover avalency relation otherwise lost in the underspecified UD edge labelsuch as the distinction between free adverbial prepositional phrases(eg trabalhar em (ADV) lsquowork atrsquo and valency-bound adverbial (egmorar em (ARG) lsquolive atrsquo)
Rademaker et al UD for Portuguese Depling 2016 13 30
Improving the datagender
Gender is one of the hallmarks of Romance languages and annotation canbe complicated as some words appear to have an underspecified genderThere are adjectives such as grande (big) or feliz (happy) that have onlyone form for both genders Sometimes we can tell by the contextsometimes not
Ex CP652-3 Por enquanto estamos felizes so com o reconhecimentoimplıcito (For now we are happy with only the implicit recognition)
Unsp (for unspecified value)
Rademaker et al UD for Portuguese Depling 2016 14 30
Improving the dataMWEs
The PALAVRAS annotation has MWEs tokenized as a single word
The UD version 1 guidelines proposed the dependency relations mwe orcompound so a process of dismembering these single token MWEs andassigning each of their components a POS-tag was initiated
UD version 2 different tags for MWE are used (flat fixed and name)but this conversion could be done automatically
Rademaker et al UD for Portuguese Depling 2016 15 30
Improving the dataParticiples
How to deal with participles was also a challenging issue PALAVRAS tagsall participles as verbs with the PCP feature
In UD can be VERB or ADJ
We worked on a set of linguistic rules to semi-automatically re-tagparticiples
Rademaker et al UD for Portuguese Depling 2016 16 30
Improving the dataEllipses
In version 1 ellipsis cases were dealt with via a remnant dependencyrelation In version 2 the remnant relation was discarded and a newtreatment was proposed the relation orphanEx ldquoOpala lasted 23 years Chevette 20 [ ]rdquo
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
remnantremnant
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
parataxis
orphan
Rademaker et al UD for Portuguese Depling 2016 17 30
TokenizationMWE
I The first conversion did not handle UDrsquos tokenization Originaltreebankrsquos MWE and - syntactically motivated - splitting ofPortuguese contractions (preposition plus articledeterminerpronouneg ldquonesterdquo to ldquoem + esterdquo (in this)
I The problem in token-splitting is the need to assign (a) partial POStags (b) additional internal dependency links and (c) new internalhook-up points for existing outgoing and incoming dependency linksNot a simple table conversion
I CG3 offers context-based manipulation of not only tags but also ofentire tokens To split MWE tokens add POS features adddependency links
I The MWE lsquoao vivorsquo (live) for instance is an ADV as a whole whilelsquoaorsquo is a contraction (ADP lsquoarsquo + DET lsquoorsquo) and lsquovivorsquo (live) is an ADJ
I We adopted a MWEPOS= in the misc field
Rademaker et al UD for Portuguese Depling 2016 18 30
Tokenization: clitics
- Another issue related to tokenization is the problem of clitics in Portuguese. In Portuguese we have mesoclitics, that is, clitics that come inside the verb and change the verbal structure.
- Ex. CP895-1: Poder-se-á dizer que o estilo resulta da sua profissão: fotojornalista. (It can be said that the style results from his profession: photojournalist.)
- We decided to follow the traditional Portuguese grammars. In the example above, 'poder-se-á' is 'poderá/VERB' followed by 'se/PRON' (it can), in the future plus the reflexive.
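As a rough illustration of the mesoclitic analysis above, the following Python sketch splits a form such as 'poder-se-á' into the reassembled verb and the clitic. It is only a sketch under stated assumptions: the clitic and ending inventories are deliberately incomplete, the names `split_mesoclitic`, `CLITICS` and `ENDINGS` are hypothetical, and forms where the infinitive loses its final r before the clitic (for example 'fá-lo-ei') are not handled.

```python
import re

# Illustrative, incomplete inventories (assumptions, not the treebank's rules).
CLITICS = "me|te|se|lhe|nos|vos|lhes"
ENDINGS = "ei|ás|á|emos|eis|ão|ia|ias|íamos|íeis|iam"

# infinitive stem ending in r, then the clitic, then the future/conditional ending
MESOCLISIS = re.compile(
    rf"^(?P<stem>\w+r)-(?P<clitic>{CLITICS})-(?P<ending>{ENDINGS})$"
)


def split_mesoclitic(token):
    """Return (verb, clitic) for a mesoclitic form, or None if it does not match."""
    match = MESOCLISIS.match(token.lower())
    if match is None:
        return None
    # Rejoin stem and ending to recover the plain future/conditional verb form.
    verb = match.group("stem") + match.group("ending")
    return verb, match.group("clitic")


print(split_mesoclitic("Poder-se-á"))
print(split_mesoclitic("dizer"))
```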
The particle 'se'
1. reflexive and reciprocal constructions. CF314-2: Você se acha louca? (Do you think you are crazy?)
2. pronominal verbs. CF340-2: O ciclista espanhol, 48, se suicidou em Caupenne d'Armagnac, no sul da França, com um tiro. (The Spanish cyclist, 48, killed himself in Caupenne d'Armagnac, south of France, with a single shot.)
3. pronominal passive voice. CF32-2: Primeiro aprova-se o texto enxuto e depois negocia-se a aprovação, sem prazo definido, das leis complementares e ordinárias. (First the short text is approved, and then, without a definite deadline, the approval of the complementary and ordinary statutes is negotiated.)
4. indeterminate subject constructions. CP263-3: Pense-se em Kingsley Amis, Malcolm Bradbury e Albert Finney. (One can think of Kingsley Amis, Malcolm Bradbury and Albert Finney.)
The particle 'se' (cont.)
- Universal Dependencies: this indicates that in both cases (3) and (4) we could have the particle se as the subject of the verb, although the subject remains non-explicit, e.g. "vende-se casas" (Houses are sold).
- But in UD the nsubj role is only applied to semantic arguments of a predicate; when there is an empty argument in a grammatical subject position (a pleonastic or expletive), it is labeled as expl.
- UD creates a certain uniformity between cases (2), (3) and (4). Since we consider the distinction between (2) (which has an explicit subject) and (3) and (4) (which do not) relevant, we keep this information: cases (3) and (4) carry the label SUBJ INDEF in the misc field.
Negation
The treatment of negation has changed from UD version 1 to 2.
In UD version 2, a polarity feature was introduced (Polarity=Neg).
We treat não, like other words such as some uses of nada (nothing), as adverbs. So not many words are tagged PART.
CP153-4: Não estava nada à espera disto. ([I] was not waiting nothing for it.) Both are ADV. Sometimes the second is a pronoun:
CP778-11: A coincidência de funerárias e queijarias na nossa circunstância não significava nada. ('The coincidence of mortuaries and cheesemakers in our circumstances did not mean nothing.') Here we have obj(significa, nada).
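For readers unfamiliar with how such a feature is stored, here is a small Python sketch of reading the FEATS column of a CoNLL-U line and checking for Polarity=Neg. The helper names `parse_feats` and `is_negative` are hypothetical, and the feature strings in the usage lines are made up for illustration rather than taken from the treebank.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'Polarity=Neg' into a dict."""
    if feats == "_":  # "_" marks an empty FEATS column
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def is_negative(feats):
    """True when the token carries the feature Polarity=Neg."""
    return parse_feats(feats).get("Polarity") == "Neg"


print(is_negative("Polarity=Neg"))
print(is_negative("Gender=Masc|Number=Sing"))
```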
Appositives
So far we have used the classic and comprehensive notion of appositives (non-restrictive and restrictive):
a) this was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view, the decision favors consistent analysis.
'president Obama' would be appos (restrictive appositive) if we agree that Obama describes, defines or modifies president. But for UD, since it is not reversible, it is not appos.
However, there are always borderline cases.
It is not clear to us why "I met the president Obama" should receive a different analysis. So these cases were also tagged as 'appos' in our corpus, but we recognize the issue is still open.
Numbers
Bosque has 9368 sentences and 227653 tokens, with 18140 unique lemmas.
At the moment we still have 957 'dep' relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.
We also plan to check the coverage of the classes of verbs, nouns, adjectives and adverbs against OpenWordNet-PT.
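A count such as the number of remaining 'dep' relations can be reproduced with a few lines of Python over a CoNLL-U file. The sketch below is not the UD statistics script: `count_deprels` is a hypothetical helper, and `SAMPLE` is a made-up two-token fragment used only to show the call.

```python
from collections import Counter


def count_deprels(conllu_text):
    """Count DEPREL values (column 8) over the word lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separators and sentence-level comments
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


# Made-up sample (assumption), not a sentence from the treebank.
SAMPLE = (
    "# text = Opala durou\n"
    "1\tOpala\tOpala\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tdurou\tdurar\tVERB\t_\t_\t0\troot\t_\t_\n"
)
counts = count_deprels(SAMPLE)
print(counts["dep"], counts["nsubj"])
```

Listing the sentences behind `counts["dep"]` would be the natural next step for the investigation mentioned above.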
Comparison and Assessment
Some big discrepancies in numbers between the 1.2 and 1.4/2.0 versions of UD Portuguese, as computed by the statistics script, were easy to see.
Our version had many more cases of auxiliary verbs than UD Portuguese in UD 1.2, probably because verbs like 'continuar' (to continue), 'começar' (to start) and 'acabar' (to end) can also be seen as modal auxiliaries, and that was our decision.
Ex. CP269-3: O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente. (The soldier fired into the air, but the individual continued to advance and was fatally struck.)
Comparison and Assessment (cont.)
We found that our version of the Bosque had many more cases of apposition dependencies (appos).
In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment-conversion.
In the annotation provided by PALAVRAS, the syntactic function N<PRED (non-identifying apposition) can and should be converted into appos, but in UD Portuguese (UD 1.2) all these cases were converted into nmod.
When we looked at the appos relation, considering the possible pairs of POS tags being related, we found around 50 different POS tag pairs. These still need investigation.
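The survey of POS tag pairs linked by appos can be sketched as follows. This is an illustrative Python version, not the authors' tooling (which is the Common Lisp library mentioned on the next slide): `appos_pos_pairs` is a hypothetical helper working on simplified (id, upos, head, deprel) tuples, and the toy sentence is an assumption.

```python
from collections import Counter


def appos_pos_pairs(sentence_rows):
    """Count (head UPOS, dependent UPOS) pairs over the appos edges of a sentence.

    sentence_rows: list of (id, upos, head, deprel) tuples with integer ids.
    """
    upos_by_id = {tok_id: upos for tok_id, upos, _, _ in sentence_rows}
    pairs = Counter()
    for _, upos, head, deprel in sentence_rows:
        # head 0 is the artificial root and has no UPOS, so it is skipped.
        if deprel == "appos" and head in upos_by_id:
            pairs[(upos_by_id[head], upos)] += 1
    return pairs


# Toy analysis of "presidente Obama" with the restrictive-appositive choice.
toy = [(1, "NOUN", 0, "root"), (2, "PROPN", 1, "appos")]
pairs = appos_pos_pairs(toy)
print(pairs.most_common())
```

Summing these counters over all sentences gives the inventory of pairs; rare pairs are good candidates for manual review.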
Contributions
We implemented the cl-conllu library. It is written in Common Lisp, open-source and freely available.
Since our group has not yet decided to use any particular dependency editor, we also implemented an online CoNLL-U validation service.
What's Next
We should note that this work is not finished.
While our treebank is once again syntactically validated by the UD script, we are sure that many errors remain.
First, because, like other treebanks, we still have so-called 'semantic' failures, as described by the UD second level of validation.
But mostly because we know that many phenomena are not yet amenable to validation. Coordination, ellipsis and negation remain big issues.
A challenge: the lack of an editor. A tabular-based one is easier for linguists, but to facilitate collaboration it must be web-based too.
Problems
- Some cases of <n> were not converted to NOUN, although the dependencies are right:
"A direção do novo semanal será assinada por Ewaldo Ruy" (The direction of the new weekly will be assumed by Ewaldo Ruy)
"Pesquisadores acham que as linhas podem ser falhas geológicas" (Researchers believe that the lines may be geological faults)
- Reported speech and parataxis are inconsistently annotated in many cases.
- The relation discourse is not consistently annotated.
- Numerals also need to be revised: cases of 'trinta e sete' (37) and 'cento e dezesseis' (116) must be flat.
- Some obl that have the PALAVRAS tag PIV should be obj.
- We are now revising the appositional modifier: appos versus nmod.
Thanks
The Corpus BosqueIn UD 12
The conversion process started from the AD format In the end wedecided to implement a direct conversion script from AD to theCoNLL-X format instead of relying on the pipeline of Eckhard Bickrsquosscripts However as far as possible his head rules are implementedOne detail that is probably different from his rules is the linkage in caseof more than one auxiliary in combination with a coordinated mainverb especially if the main verbs are accompanied by auxiliary particlesThere just wasnrsquot enough time to do this sorry In some cases the Bosque trees contain ambiguities that theannotators could not resolve For the training data ambiguity wasresolved by simply taking the first annotated possibility For the testdata sentences that contain ambiguity were discarded
httpwwwlinguatecaptflorestaCoNLL-Xreadmeconll
Rademaker et al UD for Portuguese Depling 2016 6 30
The Corpus UDThe consolidation
I Between September 2015 and March 2016 a set of UD conversionrules for the CG input was written as described in (Bick 2016) andapplied to the updated version of the dependency-style Bosque(Linguateca version 75 of Mar 2016)
I We started a team effort starting in Oct 2016 and throughconsistency-checking and discussion aiming at full compatibility withUD
I First version of our data UD 14 compliant included in UD release14 as UD Portuguese-Bosque In UD 14 UD Portuguese andUD Portuguese-Bosque and UD Portuguese-BR
I We accepted the challenge to update UD Portuguese-Bosque to UD20 guidelines and replace the previous UD Portuguese corpus
Rademaker et al UD for Portuguese Depling 2016 7 30
Why BosqueWhy not creating a new one from scratch
I Besides the original tagset and the CONLL 2006 tagset there areversions in CG AD (phrase structure tree) tgrep Penn TreeBankand TIGER formats All these are available fromhttpwwwlinguatecaptFloresta andhttpcorporadiuminhoptlinguatecaFSfshtml
I Different versions of the same material fosters the study aboutdifferent tagsest and its impacts in NLP systems
I We had on the team two researchers who had already worked onprevious versions of Bosque
I But conversion to UD scheme was much more complicated thaninitially planned
Rademaker et al UD for Portuguese Depling 2016 8 30
Why the effortI Incorporate changes and additions made in the original treebank after
2006I circumvent possible information loss due to previous conversionsI A comparison of the results of two different conversions might yield
interesting insightsI We wanted to build a framework where manual revision work and
consistency checks could be coordinated with automatic parserannotation and conversion rules Addressing systematic errors andthus fix them automatically based on a few examples rather thanrepeatedly fixing the same kind of error manually
I We intend to enlarge the treebank and therefore deem it importantto be able to maintain a close link between live parser output and theUD conversion method Integrate UD conversion grammar inPALAVRAS
I Having the corpus revised by native Portuguese linguists guarantees abetter annotation quality
Rademaker et al UD for Portuguese Depling 2016 9 30
The CG conversion grammar
I The conversion grammar ultimately used for the first conversion ofBosque to UD contained some 530 rules
I 70 were simple feature mapping rules and 130 were local MWEsplitting rules assigning internal structure POS and features to theMWEs from Bosque
I The remaining rules handled UD-specific dependency and functionlabel changes in a context-dependent fashion
I Main issues were raising of copula dependents to subjectcomplements inversion of prepositional dependency and the changefrom syntactic to semantic verb chain dependency
I In respect to punctuation attachment the grammar actually wentbeyond conversion identifying meaningful head tokens for commasparenthesis etc
Rademaker et al UD for Portuguese Depling 2016 10 30
similarities and differences
PALAVRAS niceline format
Esse [esse] ltgt ltdemgt DET M S gtN 1-gt2
carro [carro] ltVgt N M S SUBJgt 2-gt3
foi [ser] ltfmcgt ltauxgt V PS 3S IND VFIN FS-STA 3-gt0
achado [achar] ltvHgt ltmvgt V PCP M S ICL-AUXlt 4-gt3
em [em] ltsam-gt PRP ltADVL 5-gt4
o [o] lt-samgt ltartdgt DET M S gtN 6-gt7
inıcio [inıcio] lttempgt N M S Plt 7-gt5
de [de] ltsam-gt ltnp-closegt PRP Nlt 8-gt7
a [o] lt-samgt ltartdgt DET F S gtN 9-gt10
tarde [tarde] ltpergt N F S Plt 10-gt8
em [em] ltnp-closegt PRP Nlt 11-gt10
Engenheiro Marcilac [Engenheiro=Marcilac] ltcivgt ltgt
ltheurgt ltforeigngt PROP M S Plt 12-gt11
13-gt0
Rademaker et al UD for Portuguese Depling 2016 11 30
similarities and differencescont
Esse carro foi achado em o inıcio de a tarde em Engenheiro=Marcilac esse carro ser achar em o inıcio de o tarde em Engenheiro=Marcilac DET N V V PRP DET N PRP DET N PRP PROP
gt N SUBJ gt
root
ICL minus AUX
lt ADVL
gt N
P lt
gt N gt N
P lt
N lt P lt
root
Esse carro foi achado em o inıcio de a tarde em Engenheiro Marsilac esse carro ser achar em o inıcio de o tarde em Engenheiro Marsilac DET NOUN AUX VERB ADP DET NOUN ADP DET NOUN ADP PROPN PROPN PUNCT
det
nsubjpass
auxpass
root
case
det
obl
case
det
nmod
case
obl
flatname
punct
Rademaker et al UD for Portuguese Depling 2016 12 30
similarities and differencescont
I UD version retains the additional tags for NP definiteness andcomplex tenses and the original syntactic functions tags andsecondary morphological tags (xpostag field)
I keeps its original linguistic focus but in addition it can be used forthe new machine learning scenarios
I We retain tags roots of sentences for their functions such as question(FS-QUE) command (FS-COM) or statement (FS-STA)
I In some cases the stored original function tags allow recover avalency relation otherwise lost in the underspecified UD edge labelsuch as the distinction between free adverbial prepositional phrases(eg trabalhar em (ADV) lsquowork atrsquo and valency-bound adverbial (egmorar em (ARG) lsquolive atrsquo)
Rademaker et al UD for Portuguese Depling 2016 13 30
Improving the datagender
Gender is one of the hallmarks of Romance languages and annotation canbe complicated as some words appear to have an underspecified genderThere are adjectives such as grande (big) or feliz (happy) that have onlyone form for both genders Sometimes we can tell by the contextsometimes not
Ex CP652-3 Por enquanto estamos felizes so com o reconhecimentoimplıcito (For now we are happy with only the implicit recognition)
Unsp (for unspecified value)
Rademaker et al UD for Portuguese Depling 2016 14 30
Improving the dataMWEs
The PALAVRAS annotation has MWEs tokenized as a single word
The UD version 1 guidelines proposed the dependency relations mwe orcompound so a process of dismembering these single token MWEs andassigning each of their components a POS-tag was initiated
UD version 2 different tags for MWE are used (flat fixed and name)but this conversion could be done automatically
Rademaker et al UD for Portuguese Depling 2016 15 30
Improving the dataParticiples
How to deal with participles was also a challenging issue PALAVRAS tagsall participles as verbs with the PCP feature
In UD can be VERB or ADJ
We worked on a set of linguistic rules to semi-automatically re-tagparticiples
Rademaker et al UD for Portuguese Depling 2016 16 30
Improving the dataEllipses
In version 1 ellipsis cases were dealt with via a remnant dependencyrelation In version 2 the remnant relation was discarded and a newtreatment was proposed the relation orphanEx ldquoOpala lasted 23 years Chevette 20 [ ]rdquo
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
remnantremnant
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
parataxis
orphan
Rademaker et al UD for Portuguese Depling 2016 17 30
TokenizationMWE
I The first conversion did not handle UDrsquos tokenization Originaltreebankrsquos MWE and - syntactically motivated - splitting ofPortuguese contractions (preposition plus articledeterminerpronouneg ldquonesterdquo to ldquoem + esterdquo (in this)
I The problem in token-splitting is the need to assign (a) partial POStags (b) additional internal dependency links and (c) new internalhook-up points for existing outgoing and incoming dependency linksNot a simple table conversion
I CG3 offers context-based manipulation of not only tags but also ofentire tokens To split MWE tokens add POS features adddependency links
I The MWE lsquoao vivorsquo (live) for instance is an ADV as a whole whilelsquoaorsquo is a contraction (ADP lsquoarsquo + DET lsquoorsquo) and lsquovivorsquo (live) is an ADJ
I We adopted a MWEPOS= in the misc field
Tokenization: clitics
- Another tokenization issue is the treatment of clitics. Portuguese has mesoclitics, that is, clitics that come inside the verb and change the verbal structure.
- Ex: CP895-1 Poder-se-á dizer que o estilo resulta da sua profissão: fotojornalista. (It can be said that the style results from his profession: photojournalist.)
- We decided to follow the traditional Portuguese grammars. In the example above, ‘poder-se-á’ is ‘poderá/VERB’ (it can, in the future) followed by ‘se/PRON’, the reflexive.
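The decision above can be sketched for the regular case, assuming a mesoclitic form is written as stem, clitic and ending joined by hyphens. The clitic list and the function name are assumptions of this sketch. Forms whose stem changes before the clitic (for example those with -lo/-la) are not rebuilt correctly by simple concatenation and would need extra rules.

```python
# Assumed inventory of clitic pronouns that can occur inside the verb.
CLITICS = {"me", "te", "se", "lhe", "lhes", "nos", "vos",
           "o", "a", "os", "as", "lo", "la", "los", "las"}

def split_mesoclitic(form):
    """Return (verb, clitic) for a form like 'poder-se-á', else None."""
    parts = form.split("-")
    if len(parts) != 3 or parts[1].lower() not in CLITICS:
        return None
    stem, clitic, ending = parts
    # The verb is rebuilt by joining stem and ending around the clitic.
    return stem + ending, clitic

print(split_mesoclitic("Poder-se-á"))
```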
The particle ‘se’
1. Reflexive and reciprocal constructions: CF314-2 Você se acha louca? (Do you think you are crazy?)
2. Pronominal verbs: CF340-2 O ciclista espanhol, 48, se suicidou em Caupenne d’Armagnac, no sul da França, com um tiro. (The Spanish cyclist, 48, killed himself in Caupenne d’Armagnac, south of France, with a single shot.)
3. Pronominal passive voice: CF32-2 Primeiro aprova-se o texto enxuto e depois negocia-se a aprovação, sem prazo definido, das leis complementares e ordinárias. (First the short text is approved and then, without a definite deadline, the approval of the complementary and ordinary statutes is negotiated.)
4. Indeterminate subject constructions: CP263-3 Pense-se em Kingsley Amis, Malcolm Bradbury e Albert Finney. (One can think of Kingsley Amis, Malcolm Bradbury and Albert Finney.)
The particle ‘se’
- In universal dependencies, this indicates that in both cases (3) and (4) we could have the particle se as the subject of the verb, although the subject remains non-explicit: “vende-se casas” (Houses are sold).
- But in UD the ‘nsubj’ role is only applied to semantic arguments of a predicate; when there is an empty argument in a grammatical subject position (a pleonastic or expletive), it is labeled expl.
- UD creates a certain uniformity between cases (2), (3) and (4). Since we consider the distinction between (2) (which has an explicit subject) and (3) and (4) (which do not) relevant, we keep this information: cases (3) and (4) carry the label SUBJ INDEF in the misc field.
Negation
The treatment of negation changed from UD version 1 to version 2.
In UD version 2 a polarity feature was introduced (Polarity=Neg).
We treat não, and other words such as some uses of nada (nothing), as adverbs, so not many words are tagged PART.
CP153-4 Não estava nada à espera disto. ([I] was not waiting nothing for it): both are ADV. Sometimes the second is a pronoun:
CP778-11 A coincidência de funerárias e queijarias na nossa circunstância não significava nada. (‘The coincidence of mortuaries and cheesemakers in our circumstances did not mean nothing.’): obj(significa, nada)
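For a word kept as an adverb, the version 1 to version 2 change can be sketched as below: the v1 relation neg becomes advmod and the word receives Polarity=Neg in the FEATS column. The function name and the example line are hypothetical, and the sketch assumes CoNLL-U columns already split into a list; it does not decide when nada should be a pronoun, which needs linguistic judgment.

```python
def convert_negation(cols):
    """Rewrite a v1 'neg' token as v2 'advmod' with Polarity=Neg.

    cols is the list of the ten CoNLL-U columns of one token.
    """
    cols = list(cols)  # work on a copy
    if cols[7] == "neg":
        cols[7] = "advmod"
        feats = [] if cols[5] == "_" else cols[5].split("|")
        if "Polarity=Neg" not in feats:
            feats.append("Polarity=Neg")
        # CoNLL-U expects features in alphabetical order.
        cols[5] = "|".join(sorted(feats, key=str.lower))
    return cols

token = ["1", "Não", "não", "ADV", "_", "_", "2", "neg", "_", "_"]
print("\t".join(convert_negation(token)))
```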
Appositives
So far we have used the classic, comprehensive notion of appositives (non-restrictive and restrictive), because:
a) this was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view the decision favors consistent analysis.
‘president Obama’ would be appos (restrictive appositive) if we agree that Obama describes, defines or modifies president. But for UD, since it is not reversible, it is not appos.
However, there are always borderline cases.
It is not clear to us why ‘I met the president Obama’ should receive a different analysis. So these cases were also tagged as ‘appos’ in our corpus, but we recognize the issue is still open.
Numbers
Bosque has 9368 sentences and 227653 tokens, with 18140 unique lemmas.
At the moment we still have 957 ‘dep’ relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.
We also plan to check the coverage of the classes of verbs, nouns, adjectives and adverbs against OpenWordNet-PT.
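Figures of this kind can be recomputed from a CoNLL-U file with a few lines of code. The sketch below is one possible way to do it, not the script the authors used. It counts only syntactic words (skipping multiword-token ranges and empty nodes), so its token count may differ from a count that uses another definition of token.

```python
from collections import Counter

def treebank_stats(conllu_text):
    """Count sentences, syntactic words, unique lemmas and 'dep' relations."""
    sentences = 0
    tokens = 0
    lemmas = set()
    deprels = Counter()
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():          # blank line ends a sentence
            in_sentence = False
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue                  # skip ranges (1-2) and empty nodes (1.1)
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        lemmas.add(cols[2])
        deprels[cols[7]] += 1
    return {"sentences": sentences, "tokens": tokens,
            "lemmas": len(lemmas), "dep": deprels["dep"]}
```

Calling `treebank_stats(open(path, encoding="utf-8").read())` on a treebank file would give the four counts; `path` here stands for whatever file is being inspected.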
Comparison and Assessment
Some big discrepancies in numbers between the 1.2 and 1.4/2.0 versions of UD Portuguese, as computed by the statistics script, were easy to see.
Our version had many more cases of auxiliary verbs than UD Portuguese in UD 1.2, probably because verbs like ‘continuar’ (to continue), ‘começar’ (to start) and ‘acabar’ (to end) can also be seen as modal auxiliaries, and that was our decision.
Ex: CP269-3 O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente. (The soldier fired into the air, but the individual continued to advance and was fatally struck.)
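A comparison of this kind can be sketched by differencing per-label counts from two treebank versions. The function name and the counts in the usage lines are made up for illustration; they are not the figures produced by the UD statistics script.

```python
from collections import Counter

def biggest_differences(old_counts, new_counts, top=5):
    """Return the labels whose counts changed most between two versions."""
    labels = set(old_counts) | set(new_counts)
    diffs = {label: new_counts[label] - old_counts[label] for label in labels}
    # Largest absolute change first; ties broken alphabetically.
    ranked = sorted(diffs.items(), key=lambda item: (-abs(item[1]), item[0]))
    return ranked[:top]

# Made-up counts, for illustration only.
old = Counter({"aux": 2, "appos": 1, "nmod": 5})
new = Counter({"aux": 6, "appos": 4, "nmod": 2})
print(biggest_differences(old, new))
```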
Comparison and Assessment (cont.)
We found that our version of Bosque had many more cases of apposition dependencies (appos).
In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment-conversion.
In the annotation provided by PALAVRAS, the syntactic function @N<PRED (non-identifying apposition) can and should be converted into appos, but in the UD Portuguese of UD 1.2 all these cases were converted into nmod.
When we looked at the appos relation, considering the possible pairs of POS tags being related, we found around 50 different POS tag pairs. These still need investigation.
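The POS-pair survey described above can be sketched as a count of (head tag, dependent tag) pairs for one relation. The tuple layout and the function name are assumptions of this sketch; the toy sentence is not from Bosque.

```python
from collections import Counter

def pos_pairs(sentence, deprel="appos"):
    """Count (head UPOS, dependent UPOS) pairs linked by the given relation.

    sentence is a list of (id, upos, head, deprel) tuples for one sentence.
    """
    upos_by_id = {tid: upos for tid, upos, _, _ in sentence}
    pairs = Counter()
    for tid, upos, head, rel in sentence:
        if rel == deprel and head in upos_by_id:
            pairs[(upos_by_id[head], upos)] += 1
    return pairs

# Toy sentence: token 3 is an apposition to token 1.
toy = [(1, "NOUN", 2, "nsubj"), (2, "VERB", 0, "root"), (3, "PROPN", 1, "appos")]
print(pos_pairs(toy))
```

Summing these counters over all sentences gives the inventory of pairs; unexpected pairs are natural candidates for manual inspection.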
Contributions
We implemented the cl-conllu library in Common Lisp; it is open-source and freely available.
Since our group has not yet settled on a particular dependency editor, we also implemented an online CoNLL-U validation service.
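To give an idea of what a format-level validation involves, here is a minimal sketch in Python. It does not reproduce cl-conllu or the authors' service; the function name and the three checks (ten columns, word IDs forming 1..n, valid HEAD values with exactly one root) are choices made for this illustration and cover far less than the official UD validator.

```python
def validate_sentence(lines):
    """Return a list of error messages for the token lines of one sentence."""
    errors = []
    rows = []  # (line number, word id, head string) for syntactic words
    for n, line in enumerate(lines, start=1):
        cols = line.split("\t")
        if len(cols) != 10:
            errors.append(f"line {n}: expected 10 columns, found {len(cols)}")
            continue
        if cols[0].isdigit():  # ignore ranges (1-2) and empty nodes (1.1)
            rows.append((n, int(cols[0]), cols[6]))
    ids = [word_id for _, word_id, _ in rows]
    if ids != list(range(1, len(ids) + 1)):
        errors.append("word IDs are not the sequence 1..n")
    roots = 0
    for n, word_id, head in rows:
        if not head.isdigit() or int(head) > len(ids) or int(head) == word_id:
            errors.append(f"line {n}: invalid HEAD {head!r}")
        elif head == "0":
            roots += 1
    if roots != 1:
        errors.append(f"expected exactly one root, found {roots}")
    return errors
```

An empty list means the sentence passed these checks; it says nothing about whether the analysis is linguistically right.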
What’s Next
We should note that this work is not finished.
While our treebank is once again syntactically validated by the UD script, we are sure that many errors remain.
First, because like other treebanks we still have so-called ‘semantic’ failures, as described by the UD second level of validation.
But mostly because we know that many phenomena are not yet amenable to validation. Coordination, ellipsis and negation remain big issues.
A challenge: the lack of an editor. A tabular editor is easier for linguists, but to facilitate collaboration it must be web-based too.
Problems
- Some cases of <n> were not converted to NOUN, although the dependencies are right:
  “A direção do novo semanal será assinada por Ewaldo Ruy” (The direction of the new weekly will be assumed by Ewaldo Ruy)
  “Pesquisadores acham que as linhas podem ser falhas geológicas” (Researchers believe that the lines may be geological faults)
- Reported speech and parataxis are inconsistently annotated in many cases.
- The relation discourse is not consistently annotated.
- Numerals also need to be revised: cases such as ‘trinta e sete’ (37) and ‘cento e dezesseis’ (116) must be flat.
- Some obl that have the PALAVRAS tag PIV should be obj.
- We are now revising the appositional modifier: appos versus nmod.
Thanks
The Corpus UD: The consolidation
- Between September 2015 and March 2016 a set of UD conversion rules for the CG input was written, as described in (Bick 2016), and applied to the updated version of the dependency-style Bosque (Linguateca version 7.5 of Mar 2016).
- We started a team effort in Oct 2016, working through consistency-checking and discussion and aiming at full compatibility with UD.
- The first version of our data, UD 1.4 compliant, was included in UD release 1.4 as UD Portuguese-Bosque. UD 1.4 thus contained UD Portuguese, UD Portuguese-Bosque and UD Portuguese-BR.
- We accepted the challenge to update UD Portuguese-Bosque to the UD 2.0 guidelines and replace the previous UD Portuguese corpus.
Why Bosque: why not create a new one from scratch?
- Besides the original tagset and the CoNLL 2006 tagset, there are versions in CG, AD (phrase structure tree), tgrep, Penn Treebank and TIGER formats. All these are available from http://www.linguateca.pt/Floresta and http://corpora.di.uminho.pt/linguateca/FS/fs.html
- Having different versions of the same material fosters the study of different tagsets and their impact on NLP systems.
- We had on the team two researchers who had already worked on previous versions of Bosque.
- But conversion to the UD scheme was much more complicated than initially planned.
Why the effort
- Incorporate changes and additions made in the original treebank after 2006.
- Circumvent possible information loss due to previous conversions.
- A comparison of the results of two different conversions might yield interesting insights.
- We wanted to build a framework where manual revision work and consistency checks could be coordinated with automatic parser annotation and conversion rules, addressing systematic errors and fixing them automatically based on a few examples rather than repeatedly fixing the same kind of error manually.
- We intend to enlarge the treebank and therefore deem it important to maintain a close link between live parser output and the UD conversion method, integrating the UD conversion grammar into PALAVRAS.
- Having the corpus revised by native Portuguese linguists guarantees a better annotation quality.
The CG conversion grammar
- The conversion grammar ultimately used for the first conversion of Bosque to UD contained some 530 rules.
- 70 were simple feature mapping rules and 130 were local MWE-splitting rules, assigning internal structure, POS and features to the MWEs from Bosque.
- The remaining rules handled UD-specific dependency and function label changes in a context-dependent fashion.
- The main issues were the raising of copula dependents to subject complements, the inversion of prepositional dependency, and the change from syntactic to semantic verb chain dependency.
- With respect to punctuation attachment, the grammar actually went beyond conversion, identifying meaningful head tokens for commas, parentheses, etc.
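The actual conversion was written as CG rules, which are not reproduced in these slides. As a rough Python illustration of two of the rule types named above, the sketch below shows a simple feature mapping (PALAVRAS POS to UD POS, with the secondary tag aux as context) and the inversion of prepositional dependency. The function names, the fallback tag X and the simplification that every dependent of a preposition is its complement are assumptions of this sketch.

```python
# Simple feature mapping: PALAVRAS POS tags to UD tags seen in the example.
POS_MAP = {"DET": "DET", "N": "NOUN", "PROP": "PROPN", "PRP": "ADP", "V": "VERB"}

def to_upos(pos, secondary_tags=()):
    """Map a PALAVRAS POS tag to UPOS; verbs tagged <aux> become AUX."""
    if pos == "V" and "aux" in secondary_tags:
        return "AUX"
    return POS_MAP.get(pos, "X")  # X as fallback is an assumption

def invert_prepositions(heads, pos):
    """Make each preposition depend on its complement instead of governing it.

    heads and pos are dicts keyed by token id (0 is the artificial root).
    """
    new_heads = dict(heads)
    for tid, head in heads.items():
        if pos.get(head) == "PRP":
            new_heads[tid] = heads[head]  # complement takes over the attachment
            new_heads[head] = tid         # preposition attaches to complement
    return new_heads

# Heads and POS tags of the niceline example sentence ("Esse carro foi achado ...").
pos = {1: "DET", 2: "N", 3: "V", 4: "V", 5: "PRP", 6: "DET",
       7: "N", 8: "PRP", 9: "DET", 10: "N", 11: "PRP", 12: "PROP"}
heads = {1: 2, 2: 3, 3: 0, 4: 3, 5: 4, 6: 7, 7: 5, 8: 7, 9: 10, 10: 8, 11: 10, 12: 11}
print(invert_prepositions(heads, pos))
```

After the inversion, for instance, the noun at position 7 attaches to the participle at position 4 and the preposition at position 5 attaches to that noun. The change in the verb chain (auxiliary depending on the main verb) would be a further, separate rule.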
similarities and differences
PALAVRAS niceline format
Esse [esse] <*> <dem> DET M S @>N #1->2
carro [carro] <V> N M S @SUBJ> #2->3
foi [ser] <fmc> <aux> V PS 3S IND VFIN @FS-STA #3->0
achado [achar] <vH> <mv> V PCP M S @ICL-AUX< #4->3
em [em] <sam-> PRP @<ADVL #5->4
o [o] <-sam> <artd> DET M S @>N #6->7
início [início] <temp> N M S @P< #7->5
de [de] <sam-> <np-close> PRP @N< #8->7
a [o] <-sam> <artd> DET F S @>N #9->10
tarde [tarde] <per> N F S @P< #10->8
em [em] <np-close> PRP @N< #11->10
Engenheiro Marcilac [Engenheiro=Marcilac] <civ> <*> <heur> <foreign> PROP M S @P< #12->11
#13->0
similarities and differences (cont.)
[Figure: the same sentence as a PALAVRAS dependency tree and as a UD dependency tree. Only tokens, lemmas, POS tags and labels are listed here; the PALAVRAS labels follow the niceline analysis of the sentence.]

PALAVRAS
Token                 Lemma                 POS    Function
Esse                  esse                  DET    @>N
carro                 carro                 N      @SUBJ>
foi                   ser                   V      root
achado                achar                 V      @ICL-AUX<
em                    em                    PRP    @<ADVL
o                     o                     DET    @>N
início                início                N      @P<
de                    de                    PRP    @N<
a                     o                     DET    @>N
tarde                 tarde                 N      @P<
em                    em                    PRP    @N<
Engenheiro=Marcilac   Engenheiro=Marcilac   PROP   @P<

UD
Token                 Lemma                 UPOS   Relation
Esse                  esse                  DET    det
carro                 carro                 NOUN   nsubj:pass
foi                   ser                   AUX    aux:pass
achado                achar                 VERB   root
em                    em                    ADP    case
o                     o                     DET    det
início                início                NOUN   obl
de                    de                    ADP    case
a                     o                     DET    det
tarde                 tarde                 NOUN   nmod
em                    em                    ADP    case
Engenheiro            Engenheiro            PROPN  obl
Marsilac              Marsilac              PROPN  flat:name
.                     .                     PUNCT  punct
similarities and differences (cont.)
- The UD version retains the additional tags for NP definiteness and complex tenses, as well as the original syntactic function tags and secondary morphological tags (xpostag field).
- It keeps its original linguistic focus, but in addition it can be used in the new machine learning scenarios.
- We retain the tags on sentence roots for their functions, such as question (FS-QUE), command (FS-COM) or statement (FS-STA).
- In some cases the stored original function tags allow us to recover a valency relation otherwise lost in the underspecified UD edge label, such as the distinction between free adverbial prepositional phrases (e.g. trabalhar em (ADV) ‘work at’) and valency-bound adverbials (e.g. morar em (ARG) ‘live at’).
Improving the data: gender
Gender is one of the hallmarks of Romance languages, and annotation can be complicated, as some words appear to have an underspecified gender. There are adjectives such as grande (big) or feliz (happy) that have only one form for both genders. Sometimes we can tell by the context, sometimes not.
Ex: CP652-3 Por enquanto, estamos felizes só com o reconhecimento implícito. (For now we are happy with only the implicit recognition.)
In these cases we use the value Unsp (for unspecified value).
(non-identifying apposition) can and should be converted into appos butin the UD Portuguese UD 12 all these cases were converted into nmod
When we looked for the appos relation considering the possible cases ofdifferent POS tags pairs being related we found around 50 possibilities ofPOS tag pairs Still need investigation
Rademaker et al UD for Portuguese Depling 2016 26 30
Contributions
We implemented the cl-conllu library is implemented in Common Lisp it isopen-source and freely available
Since we have not yet decided in our group to use any particulardependencies editor we also implemented an online CoNLL-U validationservice
Rademaker et al UD for Portuguese Depling 2016 27 30
Whatrsquos Next
We should note that this work is not finished
While our treebank once again is syntactically validated by the UD scriptwe are sure that many errors remain
First because like other treebanks we still have so-called lsquosemanticrsquofailures as described by the UD second level of validation
But mostly because we know that many phenomena are not as yetsusceptible of validation Coordination ellipsis and negation remain bigissues
A challenge lack of editor tabular based is easier for linguists But forfacilitate collaboration it must be web-based too
Rademaker et al UD for Portuguese Depling 2016 28 30
Problems
I Some cases of ltngt were not converter to NOUN althouth thedependencies are rightldquoA direcao do novo semanal sera assinada por Ewaldo Ruyrdquo (Thedirection of the new weekly will be assumed by Ewaldo Ruy)ldquoPesquisadores acham que as linhas podem ser falhas geologicasrdquo(Researchers believe that the lines may be geological faults)
I Many problems with reported speech and parataxis are inconsistentannotated
I The relation discourse is not consistent annotated
I Numerals also need to me revised cases of lsquotrinta e setersquo(37) andlsquocento e dezesseirsquo (116) must be flat
I Some obl that have PALAVRAS tag PIV should be obj
I We are now revising the appositional modifier appos versus nmod
Rademaker et al UD for Portuguese Depling 2016 29 30
Thanks
Rademaker et al UD for Portuguese Depling 2016 30 30
Why the effort?
I Incorporate changes and additions made to the original treebank after 2006.
I Circumvent possible information loss due to previous conversions.
I A comparison of the results of two different conversions might yield interesting insights.
I We wanted to build a framework where manual revision work and consistency checks could be coordinated with automatic parser annotation and conversion rules, so that systematic errors are addressed and fixed automatically based on a few examples, rather than repeatedly fixing the same kind of error manually.
I We intend to enlarge the treebank and therefore deem it important to maintain a close link between live parser output and the UD conversion method, by integrating the UD conversion grammar into PALAVRAS.
I Having the corpus revised by native Portuguese linguists guarantees a better annotation quality.
The CG conversion grammar
I The conversion grammar ultimately used for the first conversion of Bosque to UD contained some 530 rules.
I Of these, 70 were simple feature mapping rules and 130 were local MWE splitting rules, assigning internal structure, POS and features to the MWEs from Bosque.
I The remaining rules (some 330) handled UD-specific dependency and function label changes in a context-dependent fashion.
I The main issues were the raising of copula dependents to subject complements, the inversion of prepositional dependencies, and the change from syntactic to semantic verb chain dependencies.
I With respect to punctuation attachment, the grammar actually went beyond conversion, identifying meaningful head tokens for commas, parentheses, etc.
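As a rough illustration of what a simple feature mapping rule does, the sketch below maps PALAVRAS part-of-speech tags to UD tags in Python. It is not the CG grammar itself: the table covers only the tags that appear in the example sentence used in this talk, and the names `PALAVRAS_TO_UPOS` and `map_pos` are made up for illustration.

```python
# Hypothetical sketch of a context-free feature mapping, not the actual CG rules.
# The table covers only the PALAVRAS tags seen in the talk's example sentence.
PALAVRAS_TO_UPOS = {
    "DET": "DET",
    "N": "NOUN",
    "V": "VERB",
    "PRP": "ADP",
    "PROP": "PROPN",
}


def map_pos(pos, secondary_tags=()):
    """Map one PALAVRAS POS tag to a UD tag.

    A verb carrying the secondary tag <aux> becomes AUX, mirroring the
    treatment of 'foi' in the example; unknown tags fall back to X.
    """
    if pos == "V" and "aux" in secondary_tags:
        return "AUX"
    return PALAVRAS_TO_UPOS.get(pos, "X")


print(map_pos("V", ("fmc", "aux")))  # AUX
print(map_pos("PRP"))                # ADP
```

The context-dependent rules (copula raising, inversion of prepositional dependencies) cannot be expressed as a lookup table like this, which is why they make up the larger part of the grammar.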
Similarities and differences
PALAVRAS niceline format:
Esse [esse] <*> <dem> DET M S @>N #1->2
carro [carro] <V> N M S @SUBJ> #2->3
foi [ser] <fmc> <aux> V PS 3S IND VFIN @FS-STA #3->0
achado [achar] <vH> <mv> V PCP M S @ICL-AUX< #4->3
em [em] <sam-> PRP @<ADVL #5->4
o [o] <-sam> <artd> DET M S @>N #6->7
início [início] <temp> N M S @P< #7->5
de [de] <sam-> <np-close> PRP @N< #8->7
a [o] <-sam> <artd> DET F S @>N #9->10
tarde [tarde] <per> N F S @P< #10->8
em [em] <np-close> PRP @N< #11->10
Engenheiro Marcilac [Engenheiro=Marcilac] <civ> <*> <heur> <foreign> PROP M S @P< #12->11
#13->0
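To make the layout of a niceline token concrete, here is a minimal Python sketch that splits one line into its fields. It assumes the layout `form [lemma] <secondary tags> POS and morphology @function #id->head`; the regular expression and the function name `parse_niceline` are illustrative and are not part of PALAVRAS or of the conversion grammar.

```python
import re

# Assumed layout: form [lemma] <secondary tags> POS morphology @function #id->head
LINE = re.compile(
    r"^(?P<form>.+?)\s+\[(?P<lemma>[^\]]+)\]\s+(?P<tags>.*?)\s*"
    r"@(?P<func>\S+)\s+#(?P<id>\d+)->(?P<head>\d+)$"
)


def parse_niceline(line):
    """Split one niceline token into a dictionary of fields."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a niceline token: {line!r}")
    parts = m.group("tags").split()
    # Secondary tags are wrapped in angle brackets; the rest is POS + morphology.
    secondary = [p[1:-1] for p in parts if p.startswith("<") and p.endswith(">")]
    morph = [p for p in parts if not p.startswith("<")]
    return {
        "form": m.group("form"),
        "lemma": m.group("lemma"),
        "secondary": secondary,
        "pos": morph[0] if morph else None,
        "features": morph[1:],
        "function": m.group("func"),
        "id": int(m.group("id")),
        "head": int(m.group("head")),
    }


token = parse_niceline("foi [ser] <fmc> <aux> V PS 3S IND VFIN @FS-STA #3->0")
print(token["pos"], token["function"], token["head"])
```

Punctuation tokens, which carry no lemma or function tag, would need a separate branch in a real reader.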
Similarities and differences (cont.)
[Figure: dependency tree of the original PALAVRAS analysis]
Forms: Esse carro foi achado em o início de a tarde em Engenheiro=Marcilac
Lemmas: esse carro ser achar em o início de o tarde em Engenheiro=Marcilac
POS: DET N V V PRP DET N PRP DET N PRP PROP
Edge labels: >N, SUBJ>, ICL-AUX, <ADVL, P<, N<, root
[Figure: dependency tree of the UD analysis]
Forms: Esse carro foi achado em o início de a tarde em Engenheiro Marsilac
Lemmas: esse carro ser achar em o início de o tarde em Engenheiro Marsilac
UPOS: DET NOUN AUX VERB ADP DET NOUN ADP DET NOUN ADP PROPN PROPN PUNCT
Relations, as listed on the slide: det nsubj:pass aux:pass root case det obl case det nmod case obl flat:name punct
Similarities and differences (cont.)
I The UD version retains the additional tags for NP definiteness and complex tenses, as well as the original syntactic function tags and secondary morphological tags (xpostag field).
I The treebank thus keeps its original linguistic focus, but in addition it can be used in the new machine learning scenarios.
I We retain the tags that mark sentence roots for their function, such as question (FS-QUE), command (FS-COM) or statement (FS-STA).
I In some cases, the stored original function tags allow recovering a valency relation otherwise lost in the underspecified UD edge label, such as the distinction between free adverbial prepositional phrases (e.g. trabalhar em (ADV), ‘work at’) and valency-bound adverbials (e.g. morar em (ARG), ‘live at’).
Improving the data: gender
Gender is one of the hallmarks of Romance languages, and its annotation can be complicated, as some words appear to have an underspecified gender. There are adjectives, such as grande (big) or feliz (happy), that have only one form for both genders. Sometimes we can tell the gender from the context, sometimes not.
Ex. CP652-3: Por enquanto estamos felizes só com o reconhecimento implícito. (For now we are happy with only the implicit recognition.)
In such cases we use the value Unsp (for unspecified value).
Improving the data: MWEs
The PALAVRAS annotation has MWEs tokenized as a single word.
The UD version 1 guidelines proposed the dependency relations mwe or compound, so we started a process of splitting these single-token MWEs and assigning a POS tag to each of their components.
In UD version 2, different relations are used for MWEs (flat, fixed and name), but this conversion could be done automatically.
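The automatic relabeling from version 1 to version 2 can be sketched as a column rewrite over CoNLL-U lines. The Python below is a minimal sketch under stated assumptions: it renames `mwe` to `fixed` (the UD v2 name for that relation) and `name` to `flat:name` (the label that appears on the example tree in this talk; plain `flat` is the universal v2 label). The table and function names are illustrative, not the authors' script.

```python
# Sketch: rename UD v1 multiword relations to UD v2 names in CoNLL-U text.
# The choice of "flat:name" over plain "flat" is an assumption based on the
# label shown for 'Engenheiro Marsilac' in the talk's example.
V1_TO_V2 = {"mwe": "fixed", "name": "flat:name"}


def relabel_line(line):
    """Rewrite the DEPREL column (8th field) of one CoNLL-U token line."""
    if not line.strip() or line.startswith("#"):
        return line  # blank lines and comments pass through unchanged
    cols = line.split("\t")
    if len(cols) == 10:
        cols[7] = V1_TO_V2.get(cols[7], cols[7])
    return "\t".join(cols)


old = "2\tvivo\tvivo\tADJ\t_\t_\t1\tmwe\t_\t_"
print(relabel_line(old))
```

Relations that need a structural change rather than a rename, such as remnant, are outside the scope of this kind of rewrite.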
Improving the data: participles
How to deal with participles was also a challenging issue. PALAVRAS tags all participles as verbs with the PCP feature.
In UD, a participle can be tagged VERB or ADJ.
We worked on a set of linguistic rules to re-tag participles semi-automatically.
Improving the data: ellipses
In version 1, ellipsis cases were dealt with via the remnant dependency relation. In version 2, the remnant relation was discarded and a new treatment was proposed: the relation orphan.
Ex. O Opala durou 23 anos, o Chevette 20. (“Opala lasted 23 years, Chevette 20 [ ]”)
[Figure: UD version 1 tree for the example, with the relations det, nsubj, nummod, obj, det and two remnant relations]
[Figure: UD version 2 tree for the example, with the relations det, nsubj, nummod, obj, det, parataxis and orphan]
Tokenization: MWE
I The first conversion did not handle UD’s tokenization: the splitting of the original treebank’s MWEs and the syntactically motivated splitting of Portuguese contractions (preposition plus article/determiner/pronoun), e.g. “neste” to “em + este” (in this).
I The problem in token splitting is the need to assign (a) partial POS tags, (b) additional internal dependency links and (c) new internal hook-up points for existing outgoing and incoming dependency links. It is not a simple table conversion.
I CG3 offers context-based manipulation not only of tags but also of entire tokens, which lets us split MWE tokens, add POS features and add dependency links.
I The MWE ‘ao vivo’ (live), for instance, is an ADV as a whole, while ‘ao’ is a contraction (ADP ‘a’ + DET ‘o’) and ‘vivo’ (live) is an ADJ.
I We adopted an MWEPOS= attribute in the misc field.
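The token-level part of contraction splitting can be sketched in Python as follows. This is only an illustration of step (a), the partial POS tags, using CoNLL-U style range IDs for the surface token; the dependency links of steps (b) and (c) are not attempted. The table `CONTRACTIONS` is a two-entry toy, the tag chosen for ‘este’ is an assumption, and the real conversion was done with CG3 rules rather than a lookup.

```python
# Illustrative table only; a real conversion needs many more contractions,
# and the DET tag for 'este' is an assumption.
CONTRACTIONS = {
    "neste": [("em", "ADP"), ("este", "DET")],
    "ao": [("a", "ADP"), ("o", "DET")],
}


def split_contraction(form, start_id):
    """Return CoNLL-U style rows (ID, FORM, UPOS) for one surface token.

    A contraction yields a range row such as '1-2' followed by its parts;
    any other token yields a single row with no tag assigned.
    """
    parts = CONTRACTIONS.get(form.lower())
    if parts is None:
        return [(str(start_id), form, None)]
    end_id = start_id + len(parts) - 1
    rows = [(f"{start_id}-{end_id}", form, None)]
    for offset, (part, upos) in enumerate(parts):
        rows.append((str(start_id + offset), part, upos))
    return rows


for row in split_contraction("ao", 1):
    print(row)
```

For ‘ao vivo’, the whole-expression tag ADV would then be recorded separately, for instance through the MWEPOS= attribute mentioned above.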
Tokenization: clitics
I Another issue related to tokenization is the problem of clitics. In Portuguese we have mesoclitics, that is, clitics that come inside the verb and change the verbal structure.
I Ex. CP895-1: Poder-se-á dizer que o estilo resulta da sua profissão: fotojornalista. (It can be said that the style results from his profession: photojournalist.)
I We decided to follow the traditional Portuguese grammars. In the example above, ‘poder-se-á’ is ‘poderá’/VERB followed by ‘se’/PRON, that is, the verb (it can) in the future plus the reflexive.
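The analysis of ‘poder-se-á’ as ‘poderá’ plus ‘se’ can be sketched with a regular expression in Python. This is a toy under stated assumptions: the clitic and ending lists are small illustrative subsets, forms whose stem changes under mesoclisis are not handled, and the function name `split_mesoclitic` is hypothetical.

```python
import re

# Clitics and future/conditional endings below are an illustrative subset,
# not a full inventory of Portuguese mesoclisis.
MESOCLISIS = re.compile(
    r"^(?P<stem>\w+r)-(?P<clitic>se|me|te|lhe|nos)-"
    r"(?P<ending>á|ão|ei|emos|ia|iam)$",
    re.IGNORECASE,
)


def split_mesoclitic(form):
    """Split 'stem-clitic-ending' into a verb token and a pronoun token.

    Returns None when the form does not look like a mesoclitic construction.
    """
    m = MESOCLISIS.match(form)
    if m is None:
        return None
    verb = m.group("stem") + m.group("ending")
    return [(verb, "VERB"), (m.group("clitic"), "PRON")]


print(split_mesoclitic("Poder-se-á"))
```

Enclitic forms such as ‘aprova-se’ deliberately return `None` here, since they only need the pronoun detached from the end of the verb.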
The particle ‘se’
1. Reflexive and reciprocal constructions. CF314-2: Você se acha louca? (Do you think you are crazy?)
2. Pronominal verbs. CF340-2: O ciclista espanhol, 48, se suicidou em Caupenne d’Armagnac, no sul da França, com um tiro. (The Spanish cyclist, 48, killed himself in Caupenne d’Armagnac, south of France, with a single shot.)
3. Pronominal passive voice. CF32-2: Primeiro aprova-se o texto enxuto e depois negocia-se a aprovação, sem prazo definido, das leis complementares e ordinárias. (First the short text is approved, and then, without a definite deadline, the approval of the complementary and ordinary statutes is negotiated.)
4. Indeterminate subject constructions. CP263-3: Pense-se em Kingsley Amis, Malcolm Bradbury e Albert Finney. (One can think of Kingsley Amis, Malcolm Bradbury and Albert Finney.)
The particle ‘se’ (cont.)
I Outside Universal Dependencies, this indicates that in both cases (3) and (4) we could have the particle se as the subject of the verb, although the subject remains non-explicit: “vende-se casas” (Houses are sold).
I But in UD the ‘nsubj’ role is only applied to semantic arguments of a predicate; when there is an empty argument in a grammatical subject position (a pleonastic or expletive), it is labeled as expl.
I UD thus creates a certain uniformity between cases (2), (3) and (4). Since we consider the distinction between (2) (which has an explicit subject) and (3) and (4) (which do not) relevant, we keep this information: cases (3) and (4) carry the label SUBJ INDEF in the misc field.
Negation
The treatment of negation has changed from UD version 1 to version 2.
In UD version 2, a polarity feature was introduced (Polarity=Neg).
We treat não, and other words such as some uses of nada (nothing), as adverbs, so not many words are tagged PART.
CP153-4: Não estava nada à espera disto. ([I] was not waiting nothing for it.) Both are ADV. Sometimes the second is a pronoun:
CP778-11: A coincidência de funerárias e queijarias na nossa circunstância não significava nada. (‘The coincidence of mortuaries and cheesemakers in our circumstances did not mean nothing.’) Here the relation is obj(significa, nada).
Appositives
So far we have used the classic and comprehensive notion of appositives (non-restrictive and restrictive), because:
a) this was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view, the decision favors a consistent analysis.
‘president Obama’ would be appos (restrictive appositive) if we agree that Obama describes, defines or modifies president. But for UD, since it is not reversible, it is not appos.
However, there are always borderline cases.
It is not clear to us why ‘I met the president Obama’ should receive a different analysis. So these cases were also tagged as ‘appos’ in our corpus, but we recognize that the issue is still open.
Numbers
Bosque has 9368 sentences and 227653 tokens, with 18140 unique lemmas.
At the moment we still have 957 ‘dep’ relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.
We also plan to check the coverage of the classes of verbs, nouns, adjectives and adverbs against OpenWordNet-PT.
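Figures of this kind (sentences, tokens, unique lemmas, number of ‘dep’ relations) can be recomputed from a CoNLL-U file with a few lines of Python. The sketch below is not the UD statistics script; it assumes that only rows with a plain integer ID count as tokens, which may differ from how the numbers above were obtained, and `SAMPLE` is a made-up two-sentence fragment.

```python
from collections import Counter


def corpus_stats(conllu_text):
    """Count sentences, tokens, unique lemmas and relation labels."""
    sentences = 0
    tokens = 0
    lemmas = set()
    deprels = Counter()
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line closes the current sentence
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        lemmas.add(cols[2])
        deprels[cols[7]] += 1
    return {"sentences": sentences, "tokens": tokens,
            "lemmas": len(lemmas), "deprels": deprels}


# Made-up fragment, for illustration only.
SAMPLE = "\n".join([
    "# text = O carro",
    "1\tO\to\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tcarro\tcarro\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
    "1\tSim\tsim\tINTJ\t_\t_\t0\troot\t_\t_",
    "2\tcarro\tcarro\tNOUN\t_\t_\t1\tdep\t_\t_",
    "",
])

stats = corpus_stats(SAMPLE)
print(stats["sentences"], stats["tokens"], stats["lemmas"], stats["deprels"]["dep"])
```

Listing the sentences behind `deprels["dep"]` is then a natural next step for the manual investigation mentioned above.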
Comparison and Assessment
Some big discrepancies in numbers between the 1.2 and the 1.4/2.0 versions of UD Portuguese, as computed by the statistics script, were easy to see.
Our version had many more cases of auxiliary verbs than UD Portuguese in UD 1.2. This is probably because verbs like ‘continuar’ (to continue), ‘começar’ (to start) and ‘acabar’ (to end) can also be seen as modal auxiliaries, and that was our decision.
Ex. CP269-3: O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente. (The soldier fired into the air, but the individual continued to advance and was fatally struck.)
Comparison and Assessment (cont.)
We found that our version of Bosque had many more cases of apposition dependencies (appos).
In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment and conversion.
In the annotation provided by PALAVRAS, the syntactic function N<PRED (non-identifying apposition) can and should be converted into appos, but in the UD Portuguese of UD 1.2 all these cases were converted into nmod.
When we looked at the appos relation, considering the different pairs of POS tags being related, we found around 50 possible POS tag pairs. These still need investigation.
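The inventory of POS tag pairs linked by appos can be gathered with a short Python routine like the one below. It is a sketch, not the query we actually ran: the input format (a list of `(id, upos, head, deprel)` tuples per sentence) and the function name are assumptions made for the example.

```python
from collections import Counter


def appos_pos_pairs(sentence_rows):
    """Count (head UPOS, dependent UPOS) pairs for appos edges in one sentence.

    sentence_rows is a list of (id, upos, head, deprel) tuples with integer ids.
    """
    upos_by_id = {tid: upos for tid, upos, _, _ in sentence_rows}
    pairs = Counter()
    for tid, upos, head, deprel in sentence_rows:
        if deprel == "appos" and head in upos_by_id:
            pairs[(upos_by_id[head], upos)] += 1
    return pairs


# Made-up rows, for illustration only.
rows = [
    (1, "NOUN", 0, "root"),
    (2, "PROPN", 1, "appos"),
    (3, "NUM", 1, "appos"),
]
print(appos_pos_pairs(rows))
```

Summing these counters over all sentences gives the table of pairs; rare pairs are the first candidates for manual inspection.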
Contributions
We implemented the cl-conllu library. It is written in Common Lisp, and it is open-source and freely available.
Since our group has not yet settled on any particular dependency editor, we also implemented an online CoNLL-U validation service.
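To give an idea of what a basic CoNLL-U check involves, here is a minimal Python sketch. It is neither cl-conllu (which is written in Common Lisp) nor the online service nor the official UD validation script; it only tests a few structural properties, and the function name `validate_sentence` is hypothetical.

```python
def validate_sentence(lines):
    """Return a list of error messages for one sentence of CoNLL-U token lines."""
    errors = []
    rows = []
    for n, line in enumerate(lines, start=1):
        cols = line.split("\t")
        if len(cols) != 10:
            errors.append(f"line {n}: expected 10 columns, found {len(cols)}")
            continue
        if cols[0].isdigit():  # multiword ranges and empty nodes are ignored here
            rows.append((n, cols))
    ids = [int(cols[0]) for _, cols in rows]
    if ids != list(range(1, len(ids) + 1)):
        errors.append("token ids are not the sequence 1..n")
    roots = 0
    for n, cols in rows:
        if not cols[6].isdigit():
            errors.append(f"line {n}: head is not an integer")
            continue
        head = int(cols[6])
        if head == 0:
            roots += 1
        elif head not in ids:
            errors.append(f"line {n}: head {head} is not a token id")
        if head == int(cols[0]):
            errors.append(f"line {n}: token is its own head")
    if roots != 1:
        errors.append(f"expected exactly one root, found {roots}")
    return errors


good = [
    "1\tO\to\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tcarro\tcarro\tNOUN\t_\t_\t0\troot\t_\t_",
]
print(validate_sentence(good))
```

Checks on the content of the annotation, such as whether a relation is compatible with the POS tags it connects, belong to a further level of validation and are not covered by this sketch.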
What’s Next
We should note that this work is not finished.
While our treebank is once again syntactically validated by the UD script, we are sure that many errors remain.
First, because, like other treebanks, we still have so-called ‘semantic’ failures, as described by the UD second level of validation.
But mostly because we know that many phenomena are not yet amenable to validation. Coordination, ellipsis and negation remain big issues.
A challenge: the lack of an editor. A tabular editor is easier for linguists, but to facilitate collaboration it must be web-based too.
Problems
I Some cases of <n> were not converted to NOUN, although the dependencies are right:
“A direção do novo semanal será assinada por Ewaldo Ruy” (The direction of the new weekly will be assumed by Ewaldo Ruy)
“Pesquisadores acham que as linhas podem ser falhas geológicas” (Researchers believe that the lines may be geological faults)
I Reported speech and parataxis are inconsistently annotated in many cases.
I The relation discourse is not consistently annotated.
I Numerals also need to be revised: cases of ‘trinta e sete’ (37) and ‘cento e dezesseis’ (116) must be flat.
I Some obl that have the PALAVRAS tag PIV should be obj.
I We are now revising the appositional modifier: appos versus nmod.
Thanks
The CG conversion grammar
I The conversion grammar ultimately used for the first conversion ofBosque to UD contained some 530 rules
I 70 were simple feature mapping rules and 130 were local MWEsplitting rules assigning internal structure POS and features to theMWEs from Bosque
I The remaining rules handled UD-specific dependency and functionlabel changes in a context-dependent fashion
I Main issues were raising of copula dependents to subjectcomplements inversion of prepositional dependency and the changefrom syntactic to semantic verb chain dependency
I In respect to punctuation attachment the grammar actually wentbeyond conversion identifying meaningful head tokens for commasparenthesis etc
Rademaker et al UD for Portuguese Depling 2016 10 30
similarities and differences
PALAVRAS niceline format
Esse [esse] ltgt ltdemgt DET M S gtN 1-gt2
carro [carro] ltVgt N M S SUBJgt 2-gt3
foi [ser] ltfmcgt ltauxgt V PS 3S IND VFIN FS-STA 3-gt0
achado [achar] ltvHgt ltmvgt V PCP M S ICL-AUXlt 4-gt3
em [em] ltsam-gt PRP ltADVL 5-gt4
o [o] lt-samgt ltartdgt DET M S gtN 6-gt7
inıcio [inıcio] lttempgt N M S Plt 7-gt5
de [de] ltsam-gt ltnp-closegt PRP Nlt 8-gt7
a [o] lt-samgt ltartdgt DET F S gtN 9-gt10
tarde [tarde] ltpergt N F S Plt 10-gt8
em [em] ltnp-closegt PRP Nlt 11-gt10
Engenheiro Marcilac [Engenheiro=Marcilac] ltcivgt ltgt
ltheurgt ltforeigngt PROP M S Plt 12-gt11
13-gt0
Rademaker et al UD for Portuguese Depling 2016 11 30
similarities and differencescont
Esse carro foi achado em o inıcio de a tarde em Engenheiro=Marcilac esse carro ser achar em o inıcio de o tarde em Engenheiro=Marcilac DET N V V PRP DET N PRP DET N PRP PROP
gt N SUBJ gt
root
ICL minus AUX
lt ADVL
gt N
P lt
gt N gt N
P lt
N lt P lt
root
Esse carro foi achado em o inıcio de a tarde em Engenheiro Marsilac esse carro ser achar em o inıcio de o tarde em Engenheiro Marsilac DET NOUN AUX VERB ADP DET NOUN ADP DET NOUN ADP PROPN PROPN PUNCT
det
nsubjpass
auxpass
root
case
det
obl
case
det
nmod
case
obl
flatname
punct
Rademaker et al UD for Portuguese Depling 2016 12 30
similarities and differencescont
I UD version retains the additional tags for NP definiteness andcomplex tenses and the original syntactic functions tags andsecondary morphological tags (xpostag field)
I keeps its original linguistic focus but in addition it can be used forthe new machine learning scenarios
I We retain tags roots of sentences for their functions such as question(FS-QUE) command (FS-COM) or statement (FS-STA)
I In some cases the stored original function tags allow recover avalency relation otherwise lost in the underspecified UD edge labelsuch as the distinction between free adverbial prepositional phrases(eg trabalhar em (ADV) lsquowork atrsquo and valency-bound adverbial (egmorar em (ARG) lsquolive atrsquo)
Rademaker et al UD for Portuguese Depling 2016 13 30
Improving the datagender
Gender is one of the hallmarks of Romance languages and annotation canbe complicated as some words appear to have an underspecified genderThere are adjectives such as grande (big) or feliz (happy) that have onlyone form for both genders Sometimes we can tell by the contextsometimes not
Ex CP652-3 Por enquanto estamos felizes so com o reconhecimentoimplıcito (For now we are happy with only the implicit recognition)
Unsp (for unspecified value)
Rademaker et al UD for Portuguese Depling 2016 14 30
Improving the dataMWEs
The PALAVRAS annotation has MWEs tokenized as a single word
The UD version 1 guidelines proposed the dependency relations mwe orcompound so a process of dismembering these single token MWEs andassigning each of their components a POS-tag was initiated
UD version 2 different tags for MWE are used (flat fixed and name)but this conversion could be done automatically
Rademaker et al UD for Portuguese Depling 2016 15 30
Improving the dataParticiples
How to deal with participles was also a challenging issue PALAVRAS tagsall participles as verbs with the PCP feature
In UD can be VERB or ADJ
We worked on a set of linguistic rules to semi-automatically re-tagparticiples
Rademaker et al UD for Portuguese Depling 2016 16 30
Improving the dataEllipses
In version 1 ellipsis cases were dealt with via a remnant dependencyrelation In version 2 the remnant relation was discarded and a newtreatment was proposed the relation orphanEx ldquoOpala lasted 23 years Chevette 20 [ ]rdquo
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
remnantremnant
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
parataxis
orphan
Rademaker et al UD for Portuguese Depling 2016 17 30
TokenizationMWE
I The first conversion did not handle UDrsquos tokenization Originaltreebankrsquos MWE and - syntactically motivated - splitting ofPortuguese contractions (preposition plus articledeterminerpronouneg ldquonesterdquo to ldquoem + esterdquo (in this)
I The problem in token-splitting is the need to assign (a) partial POStags (b) additional internal dependency links and (c) new internalhook-up points for existing outgoing and incoming dependency linksNot a simple table conversion
I CG3 offers context-based manipulation of not only tags but also ofentire tokens To split MWE tokens add POS features adddependency links
I The MWE lsquoao vivorsquo (live) for instance is an ADV as a whole whilelsquoaorsquo is a contraction (ADP lsquoarsquo + DET lsquoorsquo) and lsquovivorsquo (live) is an ADJ
I We adopted a MWEPOS= in the misc field
Rademaker et al UD for Portuguese Depling 2016 18 30
Tokenizationclitics
I Another issue related to tokenization is the problem of clitics inPortuguese Portuguese we have mesoclitics that is clitics that comeinside the verb and change the verbal structure
I Ex CP895-1 Poder-se-a dizer que o estilo resulta da sua profissaofotojornalista (It can be said that the style results from hisprofession pho- tojournalist)
I We decided to follow the traditional Portuguese grammars In theexample above lsquopoder-se-arsquo is lsquopoderaVERBrsquo followed by lsquosePRONrsquo(it can) in the future plus the reflexive
Rademaker et al UD for Portuguese Depling 2016 19 30
The particle lsquosersquo
1 reflexive and reciprocal constructions CF314-2 Voce se acha louca (Doyou think you are crazy)
2 pronominal verbs CF340-2 O ciclista espanhol 48 se suicidou emCaupenne drsquoArmagnac no sul da Franca com um tiro (The Spanish cyclist48 killed himself in Caupenne drsquoArmagnac south of France with a singleshot)
3 pronominal passive voice CF32-2 - Primeiro aprova-se o texto enxuto edepois negocia-se a aprovacao sem prazo definido das leis complementarese ordinarias (First the short text is approved and then without a definitedeadline the approval of the complementary and ordinary statutes isnegotiated)
4 undeterminate subject constructions CP263-3 Pense-se em KingsleyAmis Malcolm Bradbury e Albert Finney (One can think of Kingsley AmisMalcolm Bradbury and Albert Finney)
Rademaker et al UD for Portuguese Depling 2016 20 30
The particle lsquosersquo
I universal dependencies this indicates that in both cases (3) and (4)we could have the particle se as the subject of the verb although thesubject remains non-explicit ldquovende-se casasrdquo (Houses are sold)
I But in UD lsquonsubjrsquo role is only applied to semantic arguments of apredicate when there is an empty argument in a grammatical subjectposition (a pleonastic or expletive) it is labeled as expl
I UD creates a certain uniformity between the cases (2) (3) and (4)Since we consider relevant the distinction between (2) (which has anexplicit subject) and (3) and (4) (which do not) we keep thisinformation Cases (3) and (4) carry the label SUBJ INDEF in themisc field
Rademaker et al UD for Portuguese Depling 2016 21 30
Negation
The treatment of negation has changed from UD version 1 to 2
In the UD version 2 a polarity feature was introduced (Polarity=Neg)
We understand nao ndash other words as some uses of nada (nothing) ndash asadverbs So not many words tagged PART
CP153-4 Nao estava nada a espera disto ([I] was not waiting nothing forit) both ADV Sometimes the second is pronoun
CP778-11 A coincidecia de funerarias e queijarias na nossa circunstancianao significava nada (rsquoThe coincidence of mortuaries and cheesemakersin our circumstances did not mean nothing rsquo) ndash obj(significanada)
Rademaker et al UD for Portuguese Depling 2016 22 30
Appositives
So far we used classic and comprehensive notion of appositives(non-restrictive and restrictive)
a) this was already the original analysis provided by PALAVRAS b) this isa gray area of the UD guidelines c) in our view the decision favorsconsistent analysis
lsquopresident Obamarsquo would be appos (restrictive appositive) if we agreethat Obama describes defines or modifies president But for UD since itis not reversible it is not appos
However there are always borderline cases
It is not clear to us why I met the president Obama should receive adifferent analysis So this cases were also tagged as lsquoapposrsquo in our corpusbut we recognize the issue is still open
Rademaker et al UD for Portuguese Depling 2016 23 30
Numbers
Bosque has 9368 sentences and 227653 tokens with 18140 uniquelemmas
At the moment we still have 957 lsquodeprsquo relations which we want toinvestigate since this dependency is mostly used when no other relation isapplicable
We also plan to check the coverage of the classes of verbs nounsadjectives and adverbs against OpenWordNet-PT6
Rademaker et al UD for Portuguese Depling 2016 24 30
Comparison and Assessment
Some big discrepancies in numbers between the 12 and 1420UD Portuguese as computed by the statistics script were easy to see
Our version had many more cases of auxiliary verbs than UD Portuguesein UD 12 Probably due verbs like lsquocontinuarrsquo (to continue) lsquocomecarrsquo (tostart) and lsquoacabarrsquo (to end) can also be seen as modal auxiliaries and thatwas our decision
Ex CP269-3 O soldado disparou para o ar mas o indivıduo continuou aavancar e foi atingido mortalmente (The soldier fired into the air but theindividual continued to advance and was struck deadly)
Rademaker et al UD for Portuguese Depling 2016 25 30
Comparison and Assessmentcont
We found that our version of the Bosque had many more cases ofapposition dependencies (appos)
In addition to our choice to include restrictive appositives under the tagappos the difference in numbers reflects different choices in thealignment-conversion
In the annotation provided by PALAVRAS the syntactic function NltPRED
(non-identifying apposition) can and should be converted into appos butin the UD Portuguese UD 12 all these cases were converted into nmod
When we looked for the appos relation considering the possible cases ofdifferent POS tags pairs being related we found around 50 possibilities ofPOS tag pairs Still need investigation
Rademaker et al UD for Portuguese Depling 2016 26 30
Contributions
We implemented the cl-conllu library is implemented in Common Lisp it isopen-source and freely available
Since we have not yet decided in our group to use any particulardependencies editor we also implemented an online CoNLL-U validationservice
Rademaker et al UD for Portuguese Depling 2016 27 30
Whatrsquos Next
We should note that this work is not finished
While our treebank once again is syntactically validated by the UD scriptwe are sure that many errors remain
First because like other treebanks we still have so-called lsquosemanticrsquofailures as described by the UD second level of validation
But mostly because we know that many phenomena are not as yetsusceptible of validation Coordination ellipsis and negation remain bigissues
A challenge lack of editor tabular based is easier for linguists But forfacilitate collaboration it must be web-based too
Rademaker et al UD for Portuguese Depling 2016 28 30
Problems
I Some cases of ltngt were not converter to NOUN althouth thedependencies are rightldquoA direcao do novo semanal sera assinada por Ewaldo Ruyrdquo (Thedirection of the new weekly will be assumed by Ewaldo Ruy)ldquoPesquisadores acham que as linhas podem ser falhas geologicasrdquo(Researchers believe that the lines may be geological faults)
I Many problems with reported speech and parataxis are inconsistentannotated
I The relation discourse is not consistent annotated
I Numerals also need to me revised cases of lsquotrinta e setersquo(37) andlsquocento e dezesseirsquo (116) must be flat
I Some obl that have PALAVRAS tag PIV should be obj
I We are now revising the appositional modifier appos versus nmod
Rademaker et al UD for Portuguese Depling 2016 29 30
Thanks
Rademaker et al UD for Portuguese Depling 2016 30 30
similarities and differences
PALAVRAS niceline format
Esse [esse] ltgt ltdemgt DET M S gtN 1-gt2
carro [carro] ltVgt N M S SUBJgt 2-gt3
foi [ser] ltfmcgt ltauxgt V PS 3S IND VFIN FS-STA 3-gt0
achado [achar] ltvHgt ltmvgt V PCP M S ICL-AUXlt 4-gt3
em [em] ltsam-gt PRP ltADVL 5-gt4
o [o] lt-samgt ltartdgt DET M S gtN 6-gt7
inıcio [inıcio] lttempgt N M S Plt 7-gt5
de [de] ltsam-gt ltnp-closegt PRP Nlt 8-gt7
a [o] lt-samgt ltartdgt DET F S gtN 9-gt10
tarde [tarde] ltpergt N F S Plt 10-gt8
em [em] ltnp-closegt PRP Nlt 11-gt10
Engenheiro Marcilac [Engenheiro=Marcilac] ltcivgt ltgt
ltheurgt ltforeigngt PROP M S Plt 12-gt11
13-gt0
Rademaker et al UD for Portuguese Depling 2016 11 30
similarities and differencescont
Esse carro foi achado em o inıcio de a tarde em Engenheiro=Marcilac esse carro ser achar em o inıcio de o tarde em Engenheiro=Marcilac DET N V V PRP DET N PRP DET N PRP PROP
gt N SUBJ gt
root
ICL minus AUX
lt ADVL
gt N
P lt
gt N gt N
P lt
N lt P lt
root
Esse carro foi achado em o inıcio de a tarde em Engenheiro Marsilac esse carro ser achar em o inıcio de o tarde em Engenheiro Marsilac DET NOUN AUX VERB ADP DET NOUN ADP DET NOUN ADP PROPN PROPN PUNCT
det
nsubjpass
auxpass
root
case
det
obl
case
det
nmod
case
obl
flatname
punct
Rademaker et al UD for Portuguese Depling 2016 12 30
Similarities and differences (cont.)
- The UD version retains the additional tags for NP definiteness and complex tenses, as well as the original syntactic function tags and secondary morphological tags (xpostag field).
- It keeps its original linguistic focus, but in addition it can be used in the new machine learning scenarios.
- We retain the tags on sentence roots that mark their function, such as question (FS-QUE), command (FS-COM) or statement (FS-STA).
- In some cases the stored original function tags allow us to recover a valency relation otherwise lost in the underspecified UD edge label, such as the distinction between free adverbial prepositional phrases (e.g. trabalhar em (ADV), ‘work at’) and valency-bound adverbials (e.g. morar em (ARG), ‘live at’).
Improving the data: gender
Gender is one of the hallmarks of Romance languages, and its annotation can be complicated, as some words appear to have an underspecified gender. There are adjectives such as grande (big) or feliz (happy) that have only one form for both genders. Sometimes we can tell the gender from the context, sometimes not.
Ex. CP652-3: Por enquanto estamos felizes só com o reconhecimento implícito. (For now we are happy with only the implicit recognition.)
For these cases we use the value Unsp (for unspecified value).
Improving the data: MWEs
The PALAVRAS annotation has MWEs tokenized as a single word.
The UD version 1 guidelines proposed the dependency relations mwe or compound, so we started a process of dismembering these single-token MWEs and assigning each of their components a POS tag.
In UD version 2, different relations are used for MWEs (flat, fixed and name), but this conversion could be done automatically.
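The automatic relabeling mentioned above can be sketched in a few lines of Python. The sketch assumes tab-separated CoNLL-U lines and relies on the UD v2 renaming of `mwe` to `fixed` and of `name` to `flat`; mapping `name` to the subtype `flat:name` is an assumption based on the label used in the example tree of this deck. The function name `relabel_v1_to_v2` is hypothetical.

```python
# UD v1 relation -> UD v2 relation (MWE-related labels only).
# "flat:name" is an assumed subtype choice; plain "flat" would also be valid v2.
V1_TO_V2 = {
    "mwe": "fixed",
    "name": "flat:name",
}


def relabel_v1_to_v2(conllu_line: str) -> str:
    """Rewrite the DEPREL column of one CoNLL-U line, leaving other lines alone."""
    if not conllu_line.strip() or conllu_line.startswith("#"):
        return conllu_line  # blank separator or comment line
    cols = conllu_line.split("\t")
    if len(cols) != 10:
        return conllu_line  # not a token line in the 10-column format
    cols[7] = V1_TO_V2.get(cols[7], cols[7])  # column 8 is DEPREL
    return "\t".join(cols)


v1_line = "2\tvivo\tvivo\tADJ\t_\t_\t1\tmwe\t_\t_"
v2_line = relabel_v1_to_v2(v1_line)
```

A pure table lookup like this only covers the relabeling step; splitting the single-token MWEs into words, discussed on the tokenization slides, needs context and cannot be done this way.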
Improving the data: participles
How to deal with participles was also a challenging issue. PALAVRAS tags all participles as verbs with the PCP feature.
In UD, a participle can be VERB or ADJ.
We worked on a set of linguistic rules to semi-automatically re-tag participles.
Improving the data: ellipses
In version 1, ellipsis cases were dealt with via the remnant dependency relation. In version 2, the remnant relation was discarded and a new treatment was proposed: the relation orphan.
Ex. “Opala lasted 23 years, Chevette 20 [ ]”
Token      UD v1     UD v2
O          det       det
Opala      nsubj     nsubj
durou      (root)    (root)
23         nummod    nummod
anos       obj       obj
o          det       det
Chevette   remnant   parataxis
20         remnant   orphan
Tokenization: MWE
- The first conversion did not handle UD’s tokenization: the original treebank’s MWEs and the (syntactically motivated) splitting of Portuguese contractions (preposition plus article/determiner/pronoun), e.g. “neste” to “em + este” (in this).
- The problem in token-splitting is the need to assign (a) partial POS tags, (b) additional internal dependency links and (c) new internal hook-up points for existing outgoing and incoming dependency links. It is not a simple table conversion.
- CG3 offers context-based manipulation not only of tags but also of entire tokens. It can split MWE tokens, add POS features and add dependency links.
- The MWE ‘ao vivo’ (live), for instance, is an ADV as a whole, while ‘ao’ is a contraction (ADP ‘a’ + DET ‘o’) and ‘vivo’ (live) is an ADJ.
- We adopted a MWEPOS= attribute in the misc field.
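As an illustration of what the splitting has to produce, the following Python sketch expands the MWE ‘ao vivo’ into CoNLL-U-style rows. It is not the CG3 grammar used by the authors. The tiny contraction table, the function name `split_mwe`, the choice to attach every later word to the first one with `fixed`, and the placement of `MWEPOS=` on the first word are all assumptions made for the example; the outer HEAD and DEPREL of the first word are left as `_` because they depend on the rest of the sentence.

```python
from typing import List, Optional, Tuple

# Hypothetical, deliberately tiny contraction table: surface -> (form, UPOS) pieces.
CONTRACTIONS = {
    "ao": [("a", "ADP"), ("o", "DET")],
    "neste": [("em", "ADP"), ("este", "DET")],
}


def split_mwe(
    parts: List[Tuple[str, Optional[str]]], mwe_pos: str, start_id: int = 1
) -> List[List[str]]:
    """Expand an MWE given as (surface, UPOS) parts into 10-column rows.

    Contractions get a multiword-token range row (e.g. "1-2 ao") followed by
    their syntactic words; the UPOS given for a contraction is ignored.
    """
    rows: List[List[str]] = []
    next_id = start_id
    for surface, upos in parts:
        pieces = CONTRACTIONS.get(surface.lower())
        if pieces is None:
            pieces = [(surface, upos or "_")]
        else:
            last_id = next_id + len(pieces) - 1
            rows.append([f"{next_id}-{last_id}", surface] + ["_"] * 8)
        for form, tag in pieces:
            is_first = next_id == start_id
            head = "_" if is_first else str(start_id)   # internal link to first word
            deprel = "_" if is_first else "fixed"
            misc = f"MWEPOS={mwe_pos}" if is_first else "_"
            rows.append(
                [str(next_id), form, "_", tag, "_", "_", head, deprel, "_", misc]
            )
            next_id += 1
    return rows


rows = split_mwe([("ao", None), ("vivo", "ADJ")], mwe_pos="ADV")
conllu = "\n".join("\t".join(row) for row in rows)
```

The example shows the three tasks named above: each piece gets its own POS tag, internal links are added, and the first word becomes the hook-up point whose HEAD must still be connected to the surrounding tree.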
Tokenization: clitics
- Another issue related to tokenization is the problem of clitics. In Portuguese we have mesoclitics, that is, clitics that come inside the verb and change the verbal structure.
- Ex. CP895-1: Poder-se-á dizer que o estilo resulta da sua profissão, fotojornalista. (It can be said that the style results from his profession, photojournalist.)
- We decided to follow the traditional Portuguese grammars. In the example above, ‘poder-se-á’ is ‘poderá/VERB’ followed by ‘se/PRON’: (it can) in the future, plus the reflexive.
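The analysis of ‘poder-se-á’ as ‘poderá’ plus ‘se’ can be sketched as a simple string operation. The Python function below is a hypothetical illustration, not the authors' conversion rule: it only handles regular forms written as stem-clitic-ending, and it would give wrong results for forms where the stem changes before the clitic (for example ‘fá-lo-ei’, where the stem loses its final r).

```python
import re
from typing import Optional, Tuple

# stem - clitic - ending, e.g. "Poder" - "se" - "á"
MESOCLITIC_RE = re.compile(r"^(?P<stem>\w+)-(?P<clitic>\w+)-(?P<ending>\w+)$")


def split_mesoclitic(token: str) -> Optional[Tuple[str, str]]:
    """Return (verb form, clitic) for a regular mesoclitic form, else None."""
    match = MESOCLITIC_RE.match(token)
    if match is None:
        return None  # no clitic, or an enclitic such as "aprova-se"
    verb = match.group("stem") + match.group("ending")
    return verb, match.group("clitic")


result = split_mesoclitic("Poder-se-á")
```

For the corpus example, `result` holds the verb form ‘Poderá’ and the clitic ‘se’, which can then be tagged VERB and PRON respectively.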
The particle ‘se’
1. Reflexive and reciprocal constructions. CF314-2: Você se acha louca? (Do you think you are crazy?)
2. Pronominal verbs. CF340-2: O ciclista espanhol, 48, se suicidou em Caupenne d’Armagnac, no sul da França, com um tiro. (The Spanish cyclist, 48, killed himself in Caupenne d’Armagnac, south of France, with a single shot.)
3. Pronominal passive voice. CF32-2: - Primeiro aprova-se o texto enxuto e depois negocia-se a aprovação, sem prazo definido, das leis complementares e ordinárias. (First the short text is approved and then, without a definite deadline, the approval of the complementary and ordinary statutes is negotiated.)
4. Indeterminate subject constructions. CP263-3: Pense-se em Kingsley Amis, Malcolm Bradbury e Albert Finney. (One can think of Kingsley Amis, Malcolm Bradbury and Albert Finney.)
The particle ‘se’
- In Universal Dependencies terms, this indicates that in both cases (3) and (4) we could have the particle se as the subject of the verb, although the subject remains non-explicit: “vende-se casas” (Houses are sold).
- But in UD the ‘nsubj’ role is only applied to semantic arguments of a predicate; when there is an empty argument in a grammatical subject position (a pleonastic or expletive), it is labeled as expl.
- UD creates a certain uniformity between cases (2), (3) and (4). Since we consider the distinction between (2) (which has an explicit subject) and (3) and (4) (which do not) relevant, we keep this information: cases (3) and (4) carry the label SUBJ INDEF in the misc field.
Negation
The treatment of negation has changed from UD version 1 to version 2.
In UD version 2 a polarity feature was introduced (Polarity=Neg).
We treat não, and other words such as some uses of nada (nothing), as adverbs, so not many words are tagged PART.
CP153-4: Não estava nada à espera disto. ([I] was not waiting nothing for it.) Both are ADV. Sometimes the second is a pronoun:
CP778-11: A coincidência de funerárias e queijarias na nossa circunstância não significava nada. (‘The coincidence of mortuaries and cheesemakers in our circumstances did not mean nothing.’) Here we have obj(significa, nada).
Appositives
So far we have used a classic and comprehensive notion of appositives (non-restrictive and restrictive), because:
a) this was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view the decision favors consistent analysis.
‘president Obama’ would be appos (restrictive appositive) if we agree that Obama describes, defines or modifies president. But for UD, since it is not reversible, it is not appos.
However, there are always borderline cases.
It is not clear to us why ‘I met the president Obama’ should receive a different analysis. So these cases were also tagged as ‘appos’ in our corpus, but we recognize the issue is still open.
Numbers
Bosque has 9,368 sentences and 227,653 tokens, with 18,140 unique lemmas.
At the moment we still have 957 ‘dep’ relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.
We also plan to check the coverage of the classes of verbs, nouns, adjectives and adverbs against OpenWordnet-PT.
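Counts such as the number of remaining ‘dep’ relations can be obtained with a short script over the CoNLL-U file. The Python sketch below is one possible way to do it and is not the UD statistics script; the function name `count_deprels` is hypothetical, and the sample sentence is invented for the example.

```python
from collections import Counter


def count_deprels(conllu_text: str) -> Counter:
    """Count DEPREL values over all syntactic words in a CoNLL-U string."""
    counts: Counter = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # sentence separator or comment
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


sample = "\n".join([
    "# text = O Opala durou",
    "\t".join(["1", "O", "o", "DET", "_", "_", "2", "det", "_", "_"]),
    "\t".join(["2", "Opala", "Opala", "PROPN", "_", "_", "3", "dep", "_", "_"]),
    "\t".join(["3", "durou", "durar", "VERB", "_", "_", "0", "root", "_", "_"]),
])
deprel_counts = count_deprels(sample)
```

Looking up `deprel_counts["dep"]` on the full treebank gives the figure to track while the underspecified relations are being revised.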
Comparison and Assessment
Some big discrepancies in numbers between the UD 1.2 and UD 1.4/2.0 versions of UD Portuguese, as computed by the statistics script, were easy to see.
Our version had many more cases of auxiliary verbs than UD Portuguese in UD 1.2. This is probably because verbs like ‘continuar’ (to continue), ‘começar’ (to start) and ‘acabar’ (to end) can also be seen as modal auxiliaries, and that was our decision.
Ex. CP269-3: O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente. (The soldier fired into the air, but the individual continued to advance and was fatally struck.)
Comparison and Assessment (cont.)
We found that our version of the Bosque had many more cases of apposition dependencies (appos).
In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment-conversion.
In the annotation provided by PALAVRAS, the syntactic function N<PRED (non-identifying apposition) can and should be converted into appos, but in the UD Portuguese of UD 1.2 all these cases were converted into nmod.
When we looked at the appos relation, considering the different POS tag pairs being related, we found around 50 possible POS tag pairs. These still need investigation.
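The inventory of POS tag pairs linked by appos can be collected with a small script. The Python sketch below is an assumed way of doing this over CoNLL-U text, not the procedure actually used for the figures above; `appos_pos_pairs` is a hypothetical name and the sample sentence is invented.

```python
from collections import Counter


def appos_pos_pairs(conllu_text: str) -> Counter:
    """Count (head UPOS, dependent UPOS) pairs for every appos relation."""
    pairs: Counter = Counter()
    for block in conllu_text.strip().split("\n\n"):  # one block per sentence
        words = {}
        for line in block.splitlines():
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            if len(cols) == 10 and cols[0].isdigit():
                words[cols[0]] = cols
        for cols in words.values():
            head_id = cols[6]
            if cols[7] == "appos" and head_id in words:
                pairs[(words[head_id][3], cols[3])] += 1
    return pairs


sample = "\n".join([
    "\t".join(["1", "o", "o", "DET", "_", "_", "2", "det", "_", "_"]),
    "\t".join(["2", "presidente", "presidente", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "Obama", "Obama", "PROPN", "_", "_", "2", "appos", "_", "_"]),
])
pairs = appos_pos_pairs(sample)
```

Sorting the resulting counter by frequency puts the expected noun and proper-noun pairs at the top and leaves the rare, suspicious pairs at the bottom, which is where manual inspection is most useful.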
Contributions
We implemented the cl-conllu library. It is written in Common Lisp, open-source and freely available.
Since our group has not yet decided on any particular dependency editor, we also implemented an online CoNLL-U validation service.
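To give an idea of what a first, purely structural level of CoNLL-U validation involves, here is a minimal Python sketch. It is neither the cl-conllu API nor the official UD validation script; the function name `validate_sentence` and the selection of checks are assumptions made for illustration.

```python
from typing import List


def validate_sentence(lines: List[str]) -> List[str]:
    """Run a few structural checks on the lines of one CoNLL-U sentence."""
    errors: List[str] = []
    ids: List[int] = []
    heads: List[int] = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # comment line
        cols = line.split("\t")
        if len(cols) != 10:
            errors.append(f"line {number}: expected 10 columns, found {len(cols)}")
            continue
        if not cols[0].isdigit():
            continue  # multiword ranges and empty nodes are not checked here
        ids.append(int(cols[0]))
        if not cols[6].isdigit():
            errors.append(f"line {number}: HEAD is not a number")
            continue
        heads.append(int(cols[6]))
    if ids != list(range(1, len(ids) + 1)):
        errors.append("word IDs are not 1..n in order")
    if any(head > len(ids) for head in heads):
        errors.append("HEAD points outside the sentence")
    if heads.count(0) != 1:
        errors.append("expected exactly one root (HEAD = 0)")
    return errors


sentence = [
    "1\tO\to\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tOpala\tOpala\tPROPN\t_\t_\t3\tnsubj\t_\t_",
    "3\tdurou\tdurar\tVERB\t_\t_\t0\troot\t_\t_",
]
problems = validate_sentence(sentence)
```

Checks of this kind only confirm that the file is well formed; they say nothing about whether a relation is linguistically appropriate.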
What’s Next
We should note that this work is not finished.
While our treebank is once again syntactically validated by the UD script, we are sure that many errors remain.
First, because like other treebanks we still have so-called ‘semantic’ failures, as described by the UD second level of validation.
But mostly because we know that many phenomena are not as yet susceptible to validation. Coordination, ellipsis and negation remain big issues.
A challenge is the lack of an editor: a tabular one is easier for linguists, but to facilitate collaboration it must be web-based too.
Problems
- Some cases of <n> were not converted to NOUN, although the dependencies are right:
“A direção do novo semanal será assinada por Ewaldo Ruy” (The direction of the new weekly will be assumed by Ewaldo Ruy)
“Pesquisadores acham que as linhas podem ser falhas geológicas” (Researchers believe that the lines may be geological faults)
- Reported speech and parataxis are inconsistently annotated in many cases.
- The relation discourse is not consistently annotated.
- Numerals also need to be revised: cases such as ‘trinta e sete’ (37) and ‘cento e dezesseis’ (116) must be flat.
- Some obl that have the PALAVRAS tag PIV should be obj.
- We are now revising the appositional modifier appos versus nmod.
Thanks
Improving the data: gender
Gender is one of the hallmarks of Romance languages, and its annotation can be complicated, as some words appear to have an underspecified gender. There are adjectives such as grande (big) or feliz (happy) that have only one form for both genders. Sometimes we can tell the gender from the context, sometimes not.
Ex. CP652-3: Por enquanto estamos felizes só com o reconhecimento implícito (For now we are happy with only the implicit recognition).
Unsp (for unspecified value)
Improving the data: MWEs
The PALAVRAS annotation has MWEs tokenized as a single word.
The UD version 1 guidelines proposed the dependency relations mwe or compound, so we started a process of splitting these single-token MWEs and assigning each of their components a POS tag.
In UD version 2, different tags for MWEs are used (flat, fixed and name), but this conversion could be done automatically.
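
As a rough illustration of why the relabelling step can be automated, the Python sketch below rewrites the DEPREL column of a CoNLL-U token line. The mapping table `V1_TO_V2` is an assumption based on our reading of the UD version 2 renaming (mwe to fixed, name to flat); it is not the conversion script used for Bosque and may need adjusting.

```python
# Hypothetical v1 -> v2 relabelling table (an assumption, not the authors' script).
V1_TO_V2 = {"mwe": "fixed", "name": "flat"}


def relabel_line(line):
    """Rewrite the DEPREL column (8th field) of one CoNLL-U token line."""
    if not line.strip() or line.startswith("#"):
        return line  # comments and sentence breaks pass through unchanged
    cols = line.split("\t")
    if len(cols) != 10:
        return line  # not a token line; leave it alone
    cols[7] = V1_TO_V2.get(cols[7], cols[7])
    return "\t".join(cols)


example = "2\tvivo\tvivo\tADJ\t_\t_\t1\tmwe\t_\t_"
print(relabel_line(example))
```

Only the label changes here; any change of head direction or internal structure would need tree-aware rules.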
Improving the data: Participles
How to deal with participles was also a challenging issue. PALAVRAS tags all participles as verbs with the PCP feature.
In UD, a participle can be VERB or ADJ.
We worked on a set of linguistic rules to semi-automatically re-tag participles.
Improving the data: Ellipses
In version 1, ellipsis cases were dealt with via the remnant dependency relation. In version 2, the remnant relation was discarded and a new treatment was proposed: the relation orphan.
Ex. "Opala lasted 23 years, Chevette 20 [ ]"
[Figure: two dependency trees for "O Opala durou 23 anos, o Chevette 20". The version 1 tree uses the relations det, nsubj, nummod, obj, det and two remnant arcs; the version 2 tree uses det, nsubj, nummod, obj, det, parataxis and orphan.]
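
The version 2 analysis can be written down as a small Python data structure. This is only a sketch: punctuation is omitted, and the head indices are our reconstruction from the relation labels of the slide, so the exact attachments are an assumption.

```python
# Toy encoding of the version 2 analysis; punctuation is omitted and the head
# indices are reconstructed from the relation labels (an assumption).
TOKENS = [
    # (id, form, head, deprel)
    (1, "O", 2, "det"),
    (2, "Opala", 3, "nsubj"),
    (3, "durou", 0, "root"),
    (4, "23", 5, "nummod"),
    (5, "anos", 3, "obj"),
    (6, "o", 7, "det"),
    (7, "Chevette", 3, "parataxis"),
    (8, "20", 7, "orphan"),
]


def arcs(tokens):
    """Render each token as deprel(head form, dependent form)."""
    form_by_id = {tid: form for tid, form, _, _ in tokens}
    form_by_id[0] = "ROOT"
    return [f"{rel}({form_by_id[head]}, {form})" for _, form, head, rel in tokens]


for arc in arcs(TOKENS):
    print(arc)
```

In this encoding the second clause has no verb of its own: "Chevette" is promoted to head the elided clause and "20" hangs from it as orphan.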
Tokenization: MWE
- The first conversion did not handle UD's tokenization: the original treebank's MWEs and the (syntactically motivated) splitting of Portuguese contractions (preposition plus article/determiner/pronoun), e.g. "neste" to "em + este" (in this).
- The problem in token-splitting is the need to assign (a) partial POS tags, (b) additional internal dependency links and (c) new internal hook-up points for existing outgoing and incoming dependency links. It is not a simple table conversion.
- CG3 offers context-based manipulation not only of tags but also of entire tokens: it can split MWE tokens, add POS features and add dependency links.
- The MWE 'ao vivo' (live), for instance, is an ADV as a whole, while 'ao' is a contraction (ADP 'a' + DET 'o') and 'vivo' (live) is an ADJ.
- We adopted a MWEPOS= attribute in the misc field.
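
The surface side of contraction splitting can be sketched in Python as follows. This is not the CG3 rule set described above: the table `CONTRACTIONS` and the function `expand` are hypothetical, the POS tags are fixed per entry (in real text 'este' may be DET or PRON depending on context), and no dependency links are assigned.

```python
# Tiny illustrative table; a real converter needs a far larger, context-aware one.
CONTRACTIONS = {
    "neste": [("em", "ADP"), ("este", "DET")],
    "ao": [("a", "ADP"), ("o", "DET")],
}


def expand(forms):
    """Return CoNLL-U-like (ID, FORM, UPOS) rows, splitting known contractions."""
    rows, next_id = [], 1
    for form in forms:
        parts = CONTRACTIONS.get(form.lower())
        if parts is None:
            rows.append((str(next_id), form, "_"))
            next_id += 1
            continue
        last = next_id + len(parts) - 1
        rows.append((f"{next_id}-{last}", form, "_"))  # surface token range line
        for word, upos in parts:
            rows.append((str(next_id), word, upos))
            next_id += 1
    return rows


for row in expand(["ao", "vivo"]):
    print("\t".join(row))
```

The range line (1-2) keeps the original surface form, while the numbered lines carry the syntactic words; the hard part named in the list above, re-attaching existing dependency links, is exactly what this sketch leaves out.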
Tokenization: clitics
- Another issue related to tokenization is the problem of clitics in Portuguese. In Portuguese we have mesoclitics, that is, clitics that come inside the verb and change the verbal structure.
- Ex. CP895-1: Poder-se-á dizer que o estilo resulta da sua profissão fotojornalista (It can be said that the style results from his profession photojournalist)
- We decided to follow the traditional Portuguese grammars. In the example above, 'poder-se-á' is 'poderá/VERB' followed by 'se/PRON', that is, 'it can' in the future plus the reflexive.
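
The decision above, verb first and clitic second, can be sketched for the simplest surface pattern in Python. The regular expression `MESOCLITIC` and the function `split_mesoclitic` are hypothetical; the sketch assumes exactly one clitic between two hyphens and does not handle forms whose stem changes (for example 'fá-lo-á') or ordinary hyphenated words that happen to have three parts.

```python
import re

# Hypothetical pattern: verb stem, one clitic, then the future/conditional ending.
MESOCLITIC = re.compile(r"^(?P<stem>\w+)-(?P<clitic>\w+)-(?P<ending>\w+)$")


def split_mesoclitic(form):
    """Split 'stem-clitic-ending' into [stem + ending, clitic]; else return [form]."""
    m = MESOCLITIC.match(form)
    if m is None:
        return [form]
    return [m["stem"] + m["ending"], m["clitic"]]


print(split_mesoclitic("Poder-se-á"))
print(split_mesoclitic("dizer"))
```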
The particle 'se'
1. reflexive and reciprocal constructions: CF314-2 Você se acha louca? (Do you think you are crazy?)
2. pronominal verbs: CF340-2 O ciclista espanhol, 48, se suicidou em Caupenne d'Armagnac, no sul da França, com um tiro (The Spanish cyclist, 48, killed himself in Caupenne d'Armagnac, south of France, with a single shot)
3. pronominal passive voice: CF32-2 - Primeiro aprova-se o texto enxuto e depois negocia-se a aprovação, sem prazo definido, das leis complementares e ordinárias (First the short text is approved and then, without a definite deadline, the approval of the complementary and ordinary statutes is negotiated)
4. indeterminate subject constructions: CP263-3 Pense-se em Kingsley Amis, Malcolm Bradbury e Albert Finney (One can think of Kingsley Amis, Malcolm Bradbury and Albert Finney)
The particle 'se'
- In terms of universal dependencies, this suggests that in both cases (3) and (4) we could have the particle se as the subject of the verb, although the subject remains non-explicit: "vende-se casas" (Houses are sold).
- But in UD the nsubj role is only applied to semantic arguments of a predicate; when there is an empty argument in a grammatical subject position (a pleonastic or expletive), it is labeled as expl.
- UD creates a certain uniformity between cases (2), (3) and (4). Since we consider relevant the distinction between (2) (which has an explicit subject) and (3) and (4) (which do not), we keep this information: cases (3) and (4) carry the label SUBJ INDEF in the misc field.
Negation
The treatment of negation has changed from UD version 1 to 2.
In UD version 2 a polarity feature was introduced (Polarity=Neg).
We treat não, and other words such as some uses of nada (nothing), as adverbs, so not many words are tagged PART.
CP153-4: Não estava nada à espera disto ([I] was not waiting nothing for it): both are ADV. Sometimes the second is a pronoun:
CP778-11: A coincidência de funerárias e queijarias na nossa circunstância não significava nada ('The coincidence of mortuaries and cheesemakers in our circumstances did not mean nothing'): obj(significa, nada)
Appositives
So far we have used a classic and comprehensive notion of appositives (non-restrictive and restrictive):
a) this was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view, the decision favors consistent analysis.
'president Obama' would be appos (restrictive appositive) if we agree that Obama describes, defines or modifies president. But for UD, since it is not reversible, it is not appos.
However, there are always borderline cases.
It is not clear to us why "I met the president Obama" should receive a different analysis. So these cases were also tagged as 'appos' in our corpus, but we recognize the issue is still open.
Numbers
Bosque has 9,368 sentences and 227,653 tokens, with 18,140 unique lemmas.
At the moment we still have 957 'dep' relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.
We also plan to check the coverage of the classes of verbs, nouns, adjectives and adverbs against OpenWordNet-PT.
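
Counts such as the number of remaining 'dep' relations can be obtained with a few lines of Python. The sketch below is not the UD statistics script; the function `deprel_counts` is hypothetical and the sample is a toy fragment, not a Bosque sentence.

```python
from collections import Counter


def deprel_counts(conllu_text):
    """Count DEPREL values over the basic token lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


# Toy fragment for illustration only; the 'dep' label is planted on purpose.
sample = "\n".join([
    "# text = ao vivo",
    "1-2\tao\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\ta\ta\tADP\t_\t_\t3\tcase\t_\t_",
    "2\to\to\tDET\t_\t_\t3\tdep\t_\t_",
    "3\tvivo\tvivo\tADJ\t_\t_\t0\troot\t_\t_",
])
print(deprel_counts(sample)["dep"])
```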
Comparison and Assessment
Some big discrepancies in numbers between the 1.2 and 1.4/2.0 versions of UD Portuguese, as computed by the statistics script, were easy to see.
Our version had many more cases of auxiliary verbs than UD Portuguese in UD 1.2. This is probably because verbs like 'continuar' (to continue), 'começar' (to start) and 'acabar' (to end) can also be seen as modal auxiliaries, and that was our decision.
Ex. CP269-3: O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente (The soldier fired into the air, but the individual continued to advance and was fatally struck)
Comparison and Assessment (cont.)
We found that our version of the Bosque had many more cases of apposition dependencies (appos).
In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment-conversion.
In the annotation provided by PALAVRAS, the syntactic function N<PRED (non-identifying apposition) can and should be converted into appos, but in the UD Portuguese of UD 1.2 all these cases were converted into nmod.
When we looked at the appos relation, considering the possible pairs of POS tags being related, we found around 50 possible POS tag pairs. These still need investigation.
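
The POS-pair survey for appos can be sketched in Python as below. The function `appos_pos_pairs` is hypothetical and works on one sentence given as ten-column CoNLL-U rows; the three-token sample is a toy example, not taken from the corpus.

```python
from collections import Counter


def appos_pos_pairs(sentence_rows):
    """Count (head UPOS, dependent UPOS) pairs for appos arcs in one sentence.

    sentence_rows: list of 10-column CoNLL-U token rows (lists of strings).
    """
    upos_by_id = {row[0]: row[3] for row in sentence_rows}
    pairs = Counter()
    for row in sentence_rows:
        if row[7] == "appos":
            pairs[(upos_by_id.get(row[6], "ROOT"), row[3])] += 1
    return pairs


# Toy sentence "o presidente Obama", annotated with the broad notion of appos.
rows = [
    ["1", "o", "o", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "presidente", "presidente", "NOUN", "_", "_", "0", "root", "_", "_"],
    ["3", "Obama", "Obama", "PROPN", "_", "_", "2", "appos", "_", "_"],
]
print(appos_pos_pairs(rows))
```

Summing these counters over all sentences gives the kind of pair inventory mentioned above, which can then be sorted by frequency to find the unusual combinations first.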
Improving the dataEllipses
In version 1 ellipsis cases were dealt with via a remnant dependencyrelation In version 2 the remnant relation was discarded and a newtreatment was proposed the relation orphanEx ldquoOpala lasted 23 years Chevette 20 [ ]rdquo
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
remnantremnant
O Opala durou 23 anos o Chevette 20
det nsubj nummod
obj
det
parataxis
orphan
Rademaker et al UD for Portuguese Depling 2016 17 30
TokenizationMWE
I The first conversion did not handle UDrsquos tokenization Originaltreebankrsquos MWE and - syntactically motivated - splitting ofPortuguese contractions (preposition plus articledeterminerpronouneg ldquonesterdquo to ldquoem + esterdquo (in this)
I The problem in token-splitting is the need to assign (a) partial POStags (b) additional internal dependency links and (c) new internalhook-up points for existing outgoing and incoming dependency linksNot a simple table conversion
I CG3 offers context-based manipulation of not only tags but also ofentire tokens To split MWE tokens add POS features adddependency links
I The MWE lsquoao vivorsquo (live) for instance is an ADV as a whole whilelsquoaorsquo is a contraction (ADP lsquoarsquo + DET lsquoorsquo) and lsquovivorsquo (live) is an ADJ
I We adopted a MWEPOS= in the misc field
Rademaker et al UD for Portuguese Depling 2016 18 30
Tokenizationclitics
I Another issue related to tokenization is the problem of clitics inPortuguese Portuguese we have mesoclitics that is clitics that comeinside the verb and change the verbal structure
I Ex CP895-1 Poder-se-a dizer que o estilo resulta da sua profissaofotojornalista (It can be said that the style results from hisprofession pho- tojournalist)
I We decided to follow the traditional Portuguese grammars In theexample above lsquopoder-se-arsquo is lsquopoderaVERBrsquo followed by lsquosePRONrsquo(it can) in the future plus the reflexive
Rademaker et al UD for Portuguese Depling 2016 19 30
The particle lsquosersquo
1 reflexive and reciprocal constructions CF314-2 Voce se acha louca (Doyou think you are crazy)
2 pronominal verbs CF340-2 O ciclista espanhol 48 se suicidou emCaupenne drsquoArmagnac no sul da Franca com um tiro (The Spanish cyclist48 killed himself in Caupenne drsquoArmagnac south of France with a singleshot)
3 pronominal passive voice CF32-2 - Primeiro aprova-se o texto enxuto edepois negocia-se a aprovacao sem prazo definido das leis complementarese ordinarias (First the short text is approved and then without a definitedeadline the approval of the complementary and ordinary statutes isnegotiated)
4 undeterminate subject constructions CP263-3 Pense-se em KingsleyAmis Malcolm Bradbury e Albert Finney (One can think of Kingsley AmisMalcolm Bradbury and Albert Finney)
Rademaker et al UD for Portuguese Depling 2016 20 30
The particle lsquosersquo
I universal dependencies this indicates that in both cases (3) and (4)we could have the particle se as the subject of the verb although thesubject remains non-explicit ldquovende-se casasrdquo (Houses are sold)
I But in UD lsquonsubjrsquo role is only applied to semantic arguments of apredicate when there is an empty argument in a grammatical subjectposition (a pleonastic or expletive) it is labeled as expl
I UD creates a certain uniformity between the cases (2) (3) and (4)Since we consider relevant the distinction between (2) (which has anexplicit subject) and (3) and (4) (which do not) we keep thisinformation Cases (3) and (4) carry the label SUBJ INDEF in themisc field
Rademaker et al UD for Portuguese Depling 2016 21 30
Negation
The treatment of negation has changed from UD version 1 to 2
In the UD version 2 a polarity feature was introduced (Polarity=Neg)
We understand nao ndash other words as some uses of nada (nothing) ndash asadverbs So not many words tagged PART
CP153-4 Nao estava nada a espera disto ([I] was not waiting nothing forit) both ADV Sometimes the second is pronoun
CP778-11 A coincidecia de funerarias e queijarias na nossa circunstancianao significava nada (rsquoThe coincidence of mortuaries and cheesemakersin our circumstances did not mean nothing rsquo) ndash obj(significanada)
Rademaker et al UD for Portuguese Depling 2016 22 30
Appositives
So far we have used a classic and comprehensive notion of appositives (both non-restrictive and restrictive), for three reasons:
a) this was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view, the decision favors consistent analysis.
'president Obama' would be appos (a restrictive appositive) if we agree that Obama describes, defines or modifies president. But for UD, since it is not reversible, it is not appos.
However, there are always borderline cases.
It is not clear to us why "I met the president Obama" should receive a different analysis. So these cases were also tagged as 'appos' in our corpus, but we recognize that the issue is still open.
Numbers
Bosque has 9,368 sentences and 227,653 tokens, with 18,140 unique lemmas.
At the moment we still have 957 'dep' relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.
We also plan to check the coverage of the classes of verbs, nouns, adjectives and adverbs against OpenWordNet-PT.
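Figures of this kind can be computed directly from a CoNLL-U file. The following Python sketch shows one possible way to do it; it is not the statistics script mentioned in these slides, the sample data is made up, and it assumes that "tokens" means syntactic words (multiword-token ranges such as `1-2` are skipped).

```python
def treebank_stats(conllu_text):
    """Count sentences, syntactic words, unique lemmas and 'dep' relations."""
    sentences = tokens = dep_relations = 0
    lemmas = set()
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        lemmas.add(cols[2])
        if cols[7] == "dep":
            dep_relations += 1
    return {"sentences": sentences, "tokens": tokens,
            "unique_lemmas": len(lemmas), "dep_relations": dep_relations}


# Hypothetical two-sentence sample, invented for illustration.
SAMPLE = "\n".join([
    "# text = Ele dorme",
    "1\tEle\tele\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tdorme\tdormir\tVERB\t_\t_\t0\troot\t_\t_",
    "",
    "# text = Neste caso sim",
    "1-2\tNeste\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tEm\tem\tADP\t_\t_\t3\tcase\t_\t_",
    "2\teste\teste\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tcaso\tcaso\tNOUN\t_\t_\t0\troot\t_\t_",
    "4\tsim\tsim\tINTJ\t_\t_\t3\tdep\t_\t_",
])

print(treebank_stats(SAMPLE))
```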
Comparison and Assessment
Some big discrepancies in numbers between the 1.2 and 1.4/2.0 versions of UD Portuguese, as computed by the statistics script, were easy to see.
Our version had many more cases of auxiliary verbs than UD Portuguese in UD 1.2, probably because verbs like 'continuar' (to continue), 'começar' (to start) and 'acabar' (to end) can also be seen as modal auxiliaries, and that was our decision.
Ex. CP269-3: "O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente" (The soldier fired into the air, but the individual continued to advance and was fatally struck)
Comparison and Assessment (cont.)
We found that our version of the Bosque had many more cases of apposition dependencies (appos).
In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment-conversion.
In the annotation provided by PALAVRAS, the syntactic function N<PRED (non-identifying apposition) can and should be converted into appos, but in the UD Portuguese of UD 1.2 all these cases were converted into nmod.
When we looked at the appos relation, considering the possible pairs of POS tags being related, we found around 50 different POS tag pairs. These still need investigation.
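The POS tag pairs linked by appos can be tabulated with a short script. The Python sketch below is one possible way to do this, under the assumption that sentences are separated by blank lines; the two sample sentences are invented and simplified.

```python
from collections import Counter


def appos_pos_pairs(conllu_text):
    """Count (head UPOS, dependent UPOS) pairs for every appos relation."""
    pairs = Counter()
    for block in conllu_text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#")]
        rows = [r for r in rows if r[0].isdigit()]  # syntactic words only
        upos_by_id = {r[0]: r[3] for r in rows}
        for r in rows:
            if r[7] == "appos" and r[6] in upos_by_id:
                pairs[(upos_by_id[r[6]], r[3])] += 1
    return pairs


# Hypothetical sample, invented for illustration.
SAMPLE = "\n".join([
    "# text = o presidente Obama",
    "1\to\to\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tpresidente\tpresidente\tNOUN\t_\t_\t0\troot\t_\t_",
    "3\tObama\tObama\tPROPN\t_\t_\t2\tappos\t_\t_",
    "",
    "# text = Lisboa , capital",
    "1\tLisboa\tLisboa\tPROPN\t_\t_\t0\troot\t_\t_",
    "2\t,\t,\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "3\tcapital\tcapital\tNOUN\t_\t_\t1\tappos\t_\t_",
])

for pair, count in appos_pos_pairs(SAMPLE).most_common():
    print(pair, count)
```

Sorting the resulting pairs by frequency gives a natural order in which to review the rarer, more suspicious combinations.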
Contributions
We implemented the cl-conllu library in Common Lisp; it is open-source and freely available.
Since our group has not yet settled on any particular dependency editor, we also implemented an online CoNLL-U validation service.
What's Next
We should note that this work is not finished.
While our treebank once again passes the syntactic validation of the UD script, we are sure that many errors remain.
First, because like other treebanks we still have so-called 'semantic' failures, as described by the UD second level of validation.
But mostly because we know that many phenomena are not yet amenable to validation. Coordination, ellipsis and negation remain big issues.
A challenge: the lack of an editor. A tabular-based editor is easier for linguists, but to facilitate collaboration it must be web-based too.
Problems
- Some cases of <n> were not converted to NOUN, although the dependencies are right:
  "A direção do novo semanal será assinada por Ewaldo Ruy" (The direction of the new weekly will be assumed by Ewaldo Ruy)
  "Pesquisadores acham que as linhas podem ser falhas geológicas" (Researchers believe that the lines may be geological faults)
- Many cases of reported speech and parataxis are inconsistently annotated.
- The relation discourse is not consistently annotated.
- Numerals also need to be revised: cases such as 'trinta e sete' (37) and 'cento e dezesseis' (116) must be flat.
- Some obl that have the PALAVRAS tag PIV should be obj.
- We are now revising the appositional modifier: appos versus nmod.
Thanks
Tokenization/MWE
- The first conversion did not handle UD's tokenization: the original treebank's MWEs and the (syntactically motivated) splitting of Portuguese contractions (preposition plus article/determiner/pronoun), e.g. "neste" to "em + este" (in this).
- The problem in token splitting is the need to assign (a) partial POS tags, (b) additional internal dependency links and (c) new internal hook-up points for existing outgoing and incoming dependency links. It is not a simple table conversion.
- CG3 offers context-based manipulation not only of tags but also of entire tokens: it can split MWE tokens, add POS features and add dependency links.
- The MWE 'ao vivo' (live), for instance, is an ADV as a whole, while 'ao' is a contraction (ADP 'a' + DET 'o') and 'vivo' (live) is an ADJ.
- We adopted an MWEPOS= attribute in the misc field.
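To make the target representation concrete, the following Python sketch builds simplified CoNLL-U rows in which a contraction becomes a multiword-token range line followed by its parts. It is not the CG3-based conversion described above: the `CONTRACTIONS` table, the function name, the choice to put MWEPOS on the first word of the expression, and the use of the form as lemma are all assumptions for illustration, and HEAD/DEPREL are left unspecified.

```python
# Hypothetical lookup table; a real conversion would need a much larger one.
CONTRACTIONS = {
    "neste": [("em", "ADP"), ("este", "DET")],
    "ao": [("a", "ADP"), ("o", "DET")],
}


def to_conllu_rows(surface_tokens, mwe_pos=None):
    """Build simplified CoNLL-U rows, splitting known contractions.

    surface_tokens is a list of (form, upos) pairs; upos is ignored for
    contractions, whose parts get their tags from CONTRACTIONS.
    """
    rows = []
    next_id = 1
    for surface, upos in surface_tokens:
        parts = CONTRACTIONS.get(surface.lower())
        if parts:
            last_id = next_id + len(parts) - 1
            # Range line: keeps the surface form, all other columns empty.
            rows.append([f"{next_id}-{last_id}", surface] + ["_"] * 8)
            for form, part_upos in parts:
                rows.append([str(next_id), form, form, part_upos] + ["_"] * 6)
                next_id += 1
        else:
            rows.append([str(next_id), surface, surface.lower(), upos] + ["_"] * 6)
            next_id += 1
    if mwe_pos:
        # Assumption: record the POS of the whole expression on its first word.
        for row in rows:
            if row[0].isdigit():
                row[9] = f"MWEPOS={mwe_pos}"
                break
    return ["\t".join(row) for row in rows]


for line in to_conllu_rows([("ao", None), ("vivo", "ADJ")], mwe_pos="ADV"):
    print(line)
```

The sketch covers only point (a), partial POS tags; the internal dependency links and hook-up points of (b) and (c) depend on context and are exactly what makes this more than a table lookup.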
Tokenization/clitics
- Another issue related to tokenization is the problem of clitics in Portuguese. In Portuguese we have mesoclitics, that is, clitics that come inside the verb and change the verbal structure.
- Ex. CP895-1: "Poder-se-á dizer que o estilo resulta da sua profissão, fotojornalista" (It can be said that the style results from his profession, photojournalist)
- We decided to follow the traditional Portuguese grammars. In the example above, 'poder-se-á' is 'poderá' (VERB) followed by 'se' (PRON): (it can) in the future, plus the reflexive.
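The analysis above can be sketched as a simple string operation: remove the clitic from the middle and rejoin the verb stem with its ending. The Python sketch below is a rough illustration only, not the conversion actually used; it handles just the regular stem-clitic-ending pattern, it would also match any other word with exactly two hyphens, and forms whose stem changes around the clitic would need extra rules.

```python
import re

# stem - clitic - ending, e.g. "Poder-se-á"
MESOCLITIC = re.compile(r"^(?P<stem>\w+)-(?P<clitic>\w+)-(?P<ending>\w+)$")


def split_mesoclitic(form):
    """Split a mesoclitic form into [verb, clitic]; return [form] otherwise."""
    match = MESOCLITIC.match(form)
    if not match:
        return [form]
    return [match["stem"] + match["ending"], match["clitic"]]


print(split_mesoclitic("Poder-se-á"))
```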
The particle 'se'
1. reflexive and reciprocal constructions: CF314-2 "Você se acha louca?" (Do you think you are crazy?)
2. pronominal verbs: CF340-2 "O ciclista espanhol, 48, se suicidou em Caupenne d'Armagnac, no sul da França, com um tiro" (The Spanish cyclist, 48, killed himself in Caupenne d'Armagnac, south of France, with a single shot)
3. pronominal passive voice: CF32-2 "- Primeiro aprova-se o texto enxuto e depois negocia-se a aprovação, sem prazo definido, das leis complementares e ordinárias" (First the short text is approved, and then, without a definite deadline, the approval of the complementary and ordinary statutes is negotiated)
4. indeterminate subject constructions: CP263-3 "Pense-se em Kingsley Amis, Malcolm Bradbury e Albert Finney" (One can think of Kingsley Amis, Malcolm Bradbury and Albert Finney)
Negation
The treatment of negation has changed from UD version 1 to 2
In the UD version 2 a polarity feature was introduced (Polarity=Neg)
We understand nao ndash other words as some uses of nada (nothing) ndash asadverbs So not many words tagged PART
CP153-4 Nao estava nada a espera disto ([I] was not waiting nothing forit) both ADV Sometimes the second is pronoun
CP778-11 A coincidecia de funerarias e queijarias na nossa circunstancianao significava nada (rsquoThe coincidence of mortuaries and cheesemakersin our circumstances did not mean nothing rsquo) ndash obj(significanada)
Appositives
So far we have used a classic and comprehensive notion of appositives (both non-restrictive and restrictive).
a) This was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view, the decision favors consistent analysis.
'president Obama' would be appos (restrictive appositive) if we agree that Obama describes, defines or modifies president. But for UD, since it is not reversible, it is not appos.
However, there are always borderline cases.
It is not clear to us why 'I met the president Obama' should receive a different analysis. So these cases were also tagged as 'appos' in our corpus, but we recognize the issue is still open.
Numbers
Bosque has 9,368 sentences and 227,653 tokens, with 18,140 unique lemmas.
At the moment we still have 957 'dep' relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.
We also plan to check the coverage of the classes of verbs, nouns, adjectives and adverbs against OpenWordNet-PT.
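Counts of this kind (sentences, tokens, unique lemmas, 'dep' relations) can be computed directly from a CoNLL-U file. The sketch below is one way to do so and is not the script used for the figures above. It assumes that multiword-token ranges and empty nodes are excluded from the token count, which may differ from the authors' counting. Function names and the toy sample are hypothetical.

```python
def read_conllu(text):
    """Yield sentences as lists of token rows (ten columns each).
    Comment lines, multiword ranges (1-2) and empty nodes (1.1) are skipped."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():
                sentence.append(cols)
    if sentence:                     # file may lack a trailing blank line
        yield sentence

def treebank_stats(text):
    """Return basic size figures for a treebank given as CoNLL-U text."""
    sentences = list(read_conllu(text))
    tokens = [tok for sent in sentences for tok in sent]
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "unique_lemmas": len({tok[2] for tok in tokens}),
        "dep_relations": sum(1 for tok in tokens if tok[7] == "dep"),
    }

# Toy two-sentence sample (invented for illustration).
SAMPLE = "\n".join("\t".join(row.split()) for row in [
    "1 Ele ele PRON _ _ 2 nsubj _ _",
    "2 chegou chegar VERB _ _ 0 root _ _",
    "",
    "1 Sim sim INTJ _ _ 0 root _ _",
    "2 ele ele PRON _ _ 1 dep _ _",
]) + "\n"

print(treebank_stats(SAMPLE))
```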
Comparison and Assessment
Some big discrepancies in numbers between the 1.2 and 1.4/2.0 versions of UD Portuguese, as computed by the statistics script, were easy to see.
Our version had many more cases of auxiliary verbs than UD Portuguese in UD 1.2. This is probably because verbs like 'continuar' (to continue), 'começar' (to start) and 'acabar' (to end) can also be seen as modal auxiliaries, and that was our decision.
Ex. CP269-3: O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente (The soldier fired into the air, but the individual continued to advance and was fatally struck)
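A comparison of this kind can be approximated by counting dependency relations in two versions of a treebank and listing the relations whose frequency differs. This is a minimal sketch and not the UD statistics script. The function names are hypothetical, and the two toy analyses of 'continuou a avançar' are invented to illustrate a main-verb reading versus an auxiliary reading.

```python
from collections import Counter

def deprel_counts(conllu_text):
    """Count dependency relations (column 8) over plain token rows."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            counts[cols[7]] += 1
    return counts

def deprel_diff(old_text, new_text):
    """Relations whose frequency differs, as {deprel: (old, new)}."""
    old, new = deprel_counts(old_text), deprel_counts(new_text)
    return {rel: (old[rel], new[rel])
            for rel in sorted(set(old) | set(new)) if old[rel] != new[rel]}

def to_conllu(rows):
    """Turn space-separated toy rows into tab-separated CoNLL-U text."""
    return "\n".join("\t".join(row.split()) for row in rows) + "\n"

# Toy analyses (invented): 'continuou' as main verb versus as auxiliary.
OLD = to_conllu([
    "1 continuou continuar VERB _ _ 0 root _ _",
    "2 a a ADP _ _ 3 mark _ _",
    "3 avançar avançar VERB _ _ 1 xcomp _ _",
])
NEW = to_conllu([
    "1 continuou continuar AUX _ _ 3 aux _ _",
    "2 a a ADP _ _ 3 mark _ _",
    "3 avançar avançar VERB _ _ 0 root _ _",
])

print(deprel_diff(OLD, NEW))
```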
Comparison and Assessment (cont.)
We found that our version of Bosque had many more cases of apposition dependencies (appos).
In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment and conversion.
In the annotation provided by PALAVRAS, the syntactic function N<PRED (non-identifying apposition) can and should be converted into appos, but in UD Portuguese (UD 1.2) all these cases were converted into nmod.
When we looked at the appos relation, considering which pairs of POS tags it relates, we found around 50 distinct POS tag pairs. These still need investigation.
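The survey of POS tag pairs linked by appos can be sketched as a tally of (head UPOS, dependent UPOS) pairs. This is an illustrative sketch assuming standard CoNLL-U input, not the procedure the authors used. The function name is hypothetical, and the toy sample follows the restrictive-appositive analysis of 'presidente Obama' discussed earlier.

```python
from collections import Counter

def appos_pos_pairs(conllu_text):
    """Count (head UPOS, dependent UPOS) pairs linked by the appos relation."""
    pairs = Counter()
    sentence = {}
    # The extra "" guarantees that the last sentence is flushed.
    for line in conllu_text.splitlines() + [""]:
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            sentence[cols[0]] = cols          # plain token row, keyed by ID
        elif not line.strip():                # blank line closes a sentence
            for tok in sentence.values():
                if tok[7] == "appos" and tok[6] in sentence:
                    pairs[(sentence[tok[6]][3], tok[3])] += 1
            sentence = {}
    return pairs

# Toy annotation (invented): 'Obama' attached to 'presidente' as appos.
SAMPLE = "\n".join("\t".join(row.split()) for row in [
    "1 O o DET _ _ 2 det _ _",
    "2 presidente presidente NOUN _ _ 4 nsubj _ _",
    "3 Obama Obama PROPN _ _ 2 appos _ _",
    "4 chegou chegar VERB _ _ 0 root _ _",
]) + "\n"

print(appos_pos_pairs(SAMPLE))
```

Sorting the resulting counter by frequency (for example with `most_common()`) puts the rare pairs at the end, which are the likeliest candidates for manual inspection.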
Contributions
We implemented the cl-conllu library in Common Lisp; it is open-source and freely available.
Since our group has not yet decided on a particular dependency editor, we also implemented an online CoNLL-U validation service.