

Automatic Codebook Acquisition

Paper prepared for the workshop Methods and Techniques: Innovations and Applications in Political Science

Politicologenetmaal 2005 (Antwerp, 19-20 May)

Wouter van Atteveldt

[email protected]
Department of Communication Science & Department of Artificial Intelligence

Free University Amsterdam

Page 2: Automatic Codebook Acquisition

Contents

1 Introduction
2 Word lists in Content Analysis
  2.1 Word List Creation
  2.2 Latent Semantic Analysis
  2.3 Synonym extraction using word distance
3 Methodology
  3.1 Term Extraction
  3.2 Latent Semantic Analysis
  3.3 Clustering
4 Domain and Corpus: Dutch political news
  4.1 Corpora
  4.2 Term list: A "silver standard"
5 Results
  5.1 Term extraction
  5.2 Synonym extraction
6 Summary and Discussion
  6.1 Future work
  6.2 Possibilities and Limitations
References
Appendix A: Stoplist
Appendix B: Technical Details
Appendix C: Synonym lists


1 Introduction

Content Analysis is often necessary for researching the effects of or patterns in political communication, such as election studies and agenda setting (Kleinnijenhuis et al. 2003; Bryant and Zillman 2002; McCombs and Shaw 1972). Classification or coding is a crucial step in the execution of most Content Analyses (Krippendorff 2004; Holsti 1969). In this step, a textual unit (word, clause, sentence, paragraph, or document) is assigned a category from a fixed or open list of accepted categories; examples of such categories are frames, actors, issues, or emotions. Later analysis is then performed on these categories rather than on the texts themselves. For the repeatability and accuracy of this coding, it is crucial that this categorisation scheme (codebook or dictionary) is exhaustive and unambiguous.

This is true for both manual and computer-driven content analysis, but especially for the latter the categories need to be very well defined, as the computer has to assign them to the textual units without true understanding of the text. For both types of analysis, the creation of an exhaustive categorisation scheme or entity list is critical, since computers cannot add missing entities to the list. If human coders add entities during coding, this leads to more interpretation by the coders, and thus to lower repeatability.

Researchers are relatively good at deciding whether a given entity should belong to the entity list. Thinking of all possible entities that could be relevant for a domain, however, is more difficult. The same holds for thinking of the different ways an entity can materialize in a text, for example which words or synonyms are indicative of that entity. For example, Tony Blair could be referred to as 'Tony', 'Mr. Blair', 'the Labour leader', 'No. 11', 'Cherie's husband', 'Sedgefield's MP', et cetera.

This paper proposes a method to alleviate both these problems by having the computer automatically suggest relevant concepts/entities and synonyms for these entities based on relative word counts and co-occurrence patterns. This method will be tested by comparing it to the actual entity lists used for four Dutch election campaign studies.

Section 2 will provide a short overview of the relevant literature in the fields of Communication Science and corpus-based Natural Language Processing, ending with an overview of the method that we propose. Then, section 3 will describe the technical details and choices made in implementing this method. Section 4 will introduce the target domain and the corpora that were used for the evaluation, and in section 5 the results of this evaluation will be presented. The last section will briefly discuss the limitations and possibilities of this technique.

2 Word lists in Content Analysis

Many instances of computer content analysis use some form of word list. There are many word lists for determining relations or emotions, such as Osgood's original list of evaluative terms, the lists in the General Inquirer, and the emotional word lists of Pennebaker (Osgood et al. 1967; Stone 1997; Stone et al. 1962; Pennebaker et al. 2001). Programs that automatically detect the mentioning of actors or issues in texts, such as Profiler+, KEDS/TABARI, and the General Inquirer, also implicitly use word lists by having certain words act as anchors or triggers that signal the presence of one of the actors under investigation (Young 2001; Schrodt 2001).

Moreover, non-automated content analysis also uses a form of word list in the shape of codebooks or coding instructions. Although these need not exhaustively list all synonyms of a word, thanks to the linguistic ability of the human coders, it is still of vital importance that these instructions be as complete and detailed as possible. The exact content of a codebook can be seen as the researcher's context as meant by Krippendorff (2004).

The word lists that have received the most attention in scientific research are the ones that measure specific evaluations, affective value, or emotions, such as those mentioned above. Although these lists are generally created manually, research has also been done into the automatic creation and evaluation of such word lists (Bestgen 2002; Kamps and Marx 2002; Turney 2002; van Atteveldt et al. 2004). These word lists can be seen as defining the evaluative categories by extension.

For many analyses, however, the researcher is not solely interested in which emotions or evaluations occur, but rather in what actors or issues are being evaluated. Thus, references to these actors or issues, here collectively called entities, also need to be determined. This is also a coding or classification problem. In traditional content analysis, these entities are defined by intention and the task of finding the referents is left to human coders. In automatic analysis, such definitions also need to be extensional, listing all possible synonyms for these entities.


This study focuses on the latter type of word list: synonym lists for automatically discovering references to entities in a text. The results, however, are also applicable to human annotation, as the entity or category descriptions need to be as unambiguous as possible. Moreover, by suggesting clusters of important terms, the results can also help in creating an exhaustive coding scheme, since humans can easily overlook entities that could be important, especially if the coding scheme is very extensive.

2.1 Word List Creation

As argued above, acquiring an exhaustive list of important terms and their descriptions or synonyms is of vital importance for meaningful content analysis. There are a number of ways to obtain such a list, which can be roughly categorized as follows:

Hand-crafted Probably the most common source for word lists is the Content Analysis researcher. Using his or her own linguistic ability, the researcher thinks of the words that will probably refer to the concepts under investigation. The main disadvantage of this approach is caused by the enormous variety of language: any list a person can think of beforehand will be incomplete with respect to an actual corpus (errors of omission). This is analogous to the finding in Natural Language Processing that hand-crafted rule sets will always miss many of the rarer cases (Manning and Schütze 1999; Holsti 1969). However, when such lists are used extensively and improved through this use, as with the lists in the General Inquirer and other standardized lists, this problem becomes less severe, as the list becomes an actual reflection of the bodies of text on which it is used. But this limits the use of these lists to general and relatively static topics, which makes them less suited for finding political actors or issues, which are usually dynamic and specific in nature.

Lexicographic sources Another approach is to use the prior efforts of lexicographers. In most languages, machine-readable dictionaries and thesauri are available as a source of synonyms and hypernyms; for example, Roget's Thesaurus for English or Brouwers for Dutch (Brouwers 1989; Kirkpatrick 1998). From these sources, term lists can be enriched by looking up all synonyms and hyponyms, and in theory they can be used as a source for suggesting new terms by exploiting the hierarchical nature of thesauri.


However, since the synonyms generally have no frequency information attached, it has been found that using such sources often results in including words that might have the intended meaning but generally mean something else. In fact, these sources generally include many homonyms1 of the words and leave the disambiguation to the language faculty of the human users. This results in errors of commission if applied automatically, since the computer cannot easily perform this disambiguation (van Atteveldt et al. 2004). Additionally, because these lists are based on the hard labour of specialized professionals, they are limited in size (for example, the Brouwers thesaurus contains about 300,000 words divided into one thousand categories), infrequently updated, and general in domain. Although the latter seems an advantage, an equivalently sized word list that is specific to the domain under investigation will both provide better coverage and have fewer homonymy problems.

WordNet Although strictly speaking WordNet (Miller 1990; Miller 1995) can be considered a thesaurus, it has been used often enough to warrant specific mention. Especially for determining the evaluative or affective charge of modifiers, WordNet has proven a very useful resource (Kamps and Marx 2002; Moldovan and Rus 2001). Moreover, since WordNet has very detailed relations, it is possible to devise metrics for the semantic distance between words (Banerjee and Pedersen 2003; Church and Hanks 1989), similar to the distance metrics using co-occurrence patterns described below. However, this approach has two drawbacks: although WordNet is available in more languages (Vossen 1999), these WordNets contain far fewer terms than the original English WordNet and are less suited as a model of language use. Moreover, the problem of applying general lexicographic sources to very specific domains applies to WordNet as well.

2.2 Latent Semantic Analysis

Latent Semantic Analysis (Landauer and Dumais 1997; Deerwester et al. 1990) is a method that leverages co-occurrence statistics to estimate the semantic distance between words from their co-occurrence with other words. Technically, the underlying dimensionality of the document-term matrix (the table containing the frequency of each word in each document) is reduced by

1Homonyms are two words with the same spelling but a different meaning, as opposed to synonyms, which have the same meaning but different spellings.


applying Singular Value Decomposition, a generalized form of Factor Analysis. Thus, new dimensions or factors are formed based on the (co-)occurrence patterns of each word in the documents. Then, the original term-document matrix is recreated as a projection of the underlying (lower-dimensional) space on the original space. Landauer and Dumais (1997) found that lexical acquisition using a cosine distance metric on a corpus reduced by LSA has characteristics similar to human lexical acquisition.
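This reduction can be sketched in a few lines of NumPy. The toy matrix, the term names, and the small k are purely illustrative (the study itself uses 400 dimensions on a far larger matrix); the mechanics are the standard truncated SVD:

```python
# Sketch of Latent Semantic Analysis on a toy term-document matrix.
# The data and term labels are invented for illustration only.
import numpy as np

# Rows = terms, columns = documents (raw counts).
X = np.array([
    [3, 0, 1, 0],   # "minister"
    [2, 0, 2, 0],   # "kabinet"
    [0, 4, 0, 3],   # "voetbal"
    [0, 3, 0, 4],   # "wedstrijd"
], dtype=float)

# Singular Value Decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest dimensions (the paper uses k = 400;
# a toy matrix only supports a very small k).
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k reconstruction

def cosine(a, b):
    """Cosine similarity between two term vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Terms with similar co-occurrence patterns end up close together.
print(cosine(X_k[0], X_k[1]))  # "minister" vs "kabinet": close to 1
print(cosine(X_k[0], X_k[2]))  # "minister" vs "voetbal": close to 0
```

The cosine comparison at the end corresponds to the distance metric Landauer and Dumais found to mimic human lexical acquisition.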

2.3 Synonym extraction using word distance

Two observations made above are central to the method proposed in this paper: exhaustive codebooks are important for Content Analysis; and humans are good at precision but worse at recall, while the opposite holds for automatic extraction from corpora. Thus, by combining the strengths and weaknesses of humans and computers, it might be possible to arrive at a codebook that is both correct and exhaustive.

This method can be outlined as follows: Term Extraction using frequency statistics, dimensionality reduction using Latent Semantic Analysis to leverage the contextual information in these terms, and clustering based on a distance metric defined on the reduced document-term matrix.

3 Methodology

The process described in this paper consists of three steps: term extraction, latent semantic analysis, and term clustering. These three steps, and the evaluation that was performed to assess the quality of the result, are described below.

3.1 Term Extraction

The extraction of the relevant terms was done by measuring which terms occur most often in the set of documents of interest (the target corpus) as compared to a more or less general set of documents of the same type, the reference corpus. To extract these terms, for each word the χ2 value of its frequency in the target corpus as compared to the reference corpus was determined. Candidate terms were words that occurred at least 5 times, did


not occur on a stop list for Dutch2, and occurred significantly more often in the target corpus than in the reference corpus (i.e. χ2 > 6.75, p = 0.01).
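The extraction step can be sketched as follows. All counts and the tiny stoplist are invented for illustration; the χ2 formula is the standard 2×2 contingency form (word vs. other words, target vs. reference corpus):

```python
# Illustrative sketch of the chi-square term-extraction step: a word is a
# candidate term if it occurs at least 5 times, is not a stopword, and is
# significantly overrepresented in the target corpus. Counts are invented.
from collections import Counter

target = Counter({"minister": 120, "kabinet": 80, "voetbal": 5, "de": 5000})
reference = Counter({"minister": 40, "kabinet": 30, "voetbal": 200, "de": 10000})
n_target = sum(target.values())
n_reference = sum(reference.values())

def chi_square(word):
    """Chi-square of the 2x2 table (word vs. other words, target vs. reference)."""
    a = target[word]                  # word in target corpus
    b = reference[word]               # word in reference corpus
    c = n_target - a                  # other words in target
    d = n_reference - b               # other words in reference
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def is_candidate(word, stoplist=frozenset({"de"})):
    # >= 5 occurrences, not on the stoplist, and significantly MORE frequent
    # in the target corpus (chi2 > 6.75, the threshold used in the paper).
    overrepresented = target[word] / n_target > reference[word] / n_reference
    return (target[word] >= 5 and word not in stoplist
            and overrepresented and chi_square(word) > 6.75)

candidates = sorted((w for w in target if is_candidate(w)),
                    key=chi_square, reverse=True)
print(candidates)  # ['minister', 'kabinet']
```

Note the explicit direction check: χ2 alone also flags words that are significantly *under*represented in the target corpus, such as 'voetbal' in this toy example.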

Note that in this study, terms are seen as single words. It is relatively straightforward to extend this to multiword terms (n-grams) by determining the χ2 values of these n-grams. It can also be determined whether these n-grams occur significantly more often than would be predicted from the underlying word frequencies (Manning and Schütze 1999).

3.2 Latent Semantic Analysis

As described in section 2.2, Latent Semantic Analysis (LSA) is a method that leverages co-occurrence statistics to estimate the semantic distance between words from their co-occurrence patterns with other words. This section describes the choices made in applying LSA to the corpus, in particular concerning term and document selection, units of analysis, and weighting.

Sampling Although in principle all words and documents can be used for this procedure, a selection was made for reasons of computability. To create a table containing the counts of the words in the different documents, called a term-document matrix, selection can be applied to both the columns and the rows of the hypothetical full table. For the documents, a random subset of 100,000 (17%) documents was taken from the target corpus. For the terms, the 2,500 terms with the highest χ2 value were used. Moreover, an additional 50,000 terms were used to form the 'co-occurrence context'. The reason for this is that although a term such as 'deciding' is not a political actor, the occurrence of this term can help differentiate between a minister and an action group. Thus, although we are not interested in the actual distance between the target terms and the contextual terms, these terms can help determine the distance between the target terms. These contextual terms were selected based on their total frequency multiplied by their information content3, a compromise meant to exclude both infrequent and uninformative words.

Units of analysis In creating this term-document matrix, there are also two units of analysis: what constitutes a term, and what constitutes a document. In this study, terms are defined as single words, and documents are

2See appendix A for this list
3See also appendix B


full articles. Alternative choices that are very interesting for future investigation are multi-word terms in addition to single words, and smaller textual contexts, effectively treating paragraphs or sentences as separate documents.

Weighting Although the raw word frequencies can be used in the document-term matrix, it has been found that results are greatly improved by using a combination of local weighting, global weighting, and normalization. In this study, local weighting was performed by taking the logarithm of the raw counts in the document, which means that the weight increase from each additional occurrence of a word decreases as the word already occurs more often.

The global weight of terms was defined as the conditional entropy of the documents given the term frequency, divided by the unconditional entropy of the documents. This global weight is an indication of the information content of the term, and will be zero if all documents contain the word, as that means the term gives no information about the documents it occurs in. Finally, the term scores in each document were normalized so that they sum to one, giving equal weight to each article regardless of its length. These choices are fairly standard in applying LSA and are the choices that proved most effective in Nakov et al. (2001).4
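A weighting scheme in this spirit can be sketched as follows. This follows the common log-entropy recipe (local log weight, entropy-based global weight, per-document normalization) rather than necessarily the exact formulas of the study; see appendix B for those. The toy counts are invented:

```python
# A common log-entropy weighting sketch: local log weight, a global weight
# that is zero when a term is spread evenly over all documents, and
# normalization so each document's scores sum to one. This is a standard
# LSA recipe, not necessarily the paper's exact formula.
import numpy as np

X = np.array([          # rows = terms, columns = documents (raw counts)
    [3, 0, 1, 0],
    [1, 1, 1, 1],       # occurs evenly everywhere -> global weight 0
    [0, 4, 0, 3],
], dtype=float)

n_docs = X.shape[1]

local = np.log1p(X)                      # log(1 + count): diminishing returns

gf = X.sum(axis=1, keepdims=True)        # total frequency of each term
p = np.divide(X, gf, out=np.zeros_like(X), where=gf > 0)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
entropy = -plogp.sum(axis=1)             # entropy of docs given the term
global_w = 1.0 - entropy / np.log(n_docs)  # 1 - normalized entropy

W = local * global_w[:, np.newaxis]
W = W / W.sum(axis=0, keepdims=True)     # each document (column) sums to 1

print(np.round(global_w, 3))
```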

Dimensions Analogously to deciding the number of factors in a Factor Analysis, one of the decisions to make in Latent Semantic Analysis is the dimensionality of the underlying 'semantic' space. Previous research points to an optimum somewhere around 300-400 dimensions (Landauer and Dumais 1997; B.M. Pincombe and Welsh). For this study, 400 dimensions were used.

3.3 Clustering

To generate a list of synonyms per concept or entity, possibly overlapping clusters were made by including the closest n words to each entity. Closeness here is defined as the Euclidean distance between the document frequency vectors representing the words (the columns of the term-document matrix). These choices were mainly made for practical reasons. In particular, using cosine distance instead of Euclidean distance may prove a better approximation of human language processing (as found by Landauer and Dumais (1997)) and might avoid some spurious clusterings based on total word frequency rather than actual usage patterns, as Euclidean distance is not normalised for length. Additionally, it might be preferable to use a form of clustering that minimizes total inter-cluster distance rather than only the distance to the original term (the centroid). This can help avoid problems where a word with less frequent homonyms is also assigned the synonyms referring to those homonyms rather than to the intended concept.

4See appendix B for a more detailed description of the weights used.
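The clustering step above can be sketched minimally: for each entity term, take the n nearest terms as candidate synonyms. Euclidean distance is shown, as in the study; swapping in cosine distance is the alternative just discussed. The term names and vectors are toy data:

```python
# Minimal sketch of the clustering step: for each entity term, the n
# closest terms in the (reduced) space become candidate-synonyms.
# Terms and vectors are invented for illustration.
import numpy as np

terms = ["balkenende", "premier", "kabinet", "voetbal"]
# Rows are term vectors, e.g. rows of the rank-reduced term-document matrix.
vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.7, 0.6, 0.1],
    [0.0, 0.1, 1.0],
])

def closest(term, n=2):
    """Return the n terms nearest to `term` by Euclidean distance."""
    i = terms.index(term)
    dist = np.linalg.norm(vectors - vectors[i], axis=1)
    order = np.argsort(dist)
    return [terms[j] for j in order if j != i][:n]

print(closest("balkenende"))  # ['premier', 'kabinet']
```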

4 Domain and Corpus: Dutch political news

4.1 Corpora

The domain of this study is Dutch politics outside of election time. The corpora used were gathered for an investigation of the news coverage of the different ministries, done on behalf of the Rijksvoorlichtingsdienst (Service of Public Information), a service that is responsible for the public relations of the government. Both corpora contain newspaper articles published in the 5 Dutch national newspapers and a number of regional daily newspapers in the period 2003-2004.

As explained in section 3.1, terms were extracted based on their relative frequency in a target corpus compared to a reference corpus. The target corpus used in this study consisted of the articles about the main actors: all articles that mention one of the ministers or ministries under investigation (600,000 in total). The reference corpus (approximately 1,200,000 articles) was selected based on specific issues that the Service was interested in, so it contained no direct selection on actors.

4.2 Term list: A “silver standard”

Part of this research was an automated measure of the attention to the included actors and issues. For this purpose, a hierarchic entity list of terms and synonyms was created. Although this list suffers all the qualms of hand-crafted lists described above, it was produced in a number of iterations over the duration of the project and is partly based on similar lists that have been developed since the 1994 Dutch parliamentary elections (Kleinnijenhuis et al. 2003, and references therein). Thus, although I expect this list to miss many of the important terms used in the target corpus and to represent a specific view on the domain, it can still serve as a base for comparison and should not contain too many errors of commission. The way this list has been used will be described in more detail in section 5 below.


To relate performance to characteristics of the entities, the entities were categorized in three ways: political versus societal actors; abstract concepts, concrete institutions, and persons; and important versus unimportant actors. The latter distinction is a subjective ranking that was jointly determined by two coders, with the guideline that all cabinet members, party leaders, and well-known societal actors are important. Since this distinction is purely explorative, the lack of a formal evaluation of this ranking is not of great importance.

Table 1 lists these categories and the number of entities in each category.

                             Concept  Institution  Person  Total
Non-political  Unimportant      20        300         2     322
               Important        29        144         7     180
Political      Unimportant       1          0       230     231
               Important         1         31        51      83
Total                           51        475       290     816

Table 1: Quantitative description of the 'silver standard' term list

5 Results

5.1 Term extraction

Before synonyms can be extracted for the concepts or entities in the list, the terms themselves need to be extracted. Two central statistics for the evaluation of an extraction process are recall and precision, which are fairly standard measures in Information Extraction (Pazienza 2003). Recall is the percentage of correct entities that were found, making it a measure of the errors of omission. Precision is the percentage of found terms that actually corresponded to an entity in the list, measuring the errors of commission.
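These two measures can be stated compactly in code; the example sets are invented:

```python
# Recall and precision as used in the evaluation, sketched for clarity.
def recall(found, gold):
    """Fraction of gold-standard entities that were found (errors of omission)."""
    gold = set(gold)
    return len(gold & set(found)) / len(gold)

def precision(found, gold):
    """Fraction of found terms that are in the gold standard (errors of commission)."""
    found = set(found)
    return len(found & set(gold)) / len(found)

gold = {"balkenende", "bos", "kamp", "zalm"}
found = {"balkenende", "bos", "voetbal", "kamp", "irak"}

print(recall(found, gold), precision(found, gold))  # 0.75 0.6
```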

Recall

To determine recall, the extracted terms were compared to the 'silver standard' term list defined in section 4.2. Although we do not expect this list to be complete, we do assume that all terms in the list are correct. Thus, the recall on the terms in the list is a good indicator of the actual recall. For technical reasons, the term list was here limited to 65,535 terms. Table 2 below shows the recall for the categories defined in table 1.

                         Concept  Institution  Person  Total
Non-political  Unimp.      20%       36%         -      35%
               Imp.        41%       50%         -      49%
Political      Unimp.       -         -         84%     84%
               Imp.         -        90%        96%     93%
Total                      33%       44%        86%     58%

Table 2: Recall of the method (cells with n<25 omitted, total n=816)

Although the average recall of 58% is not fantastic, it is a reasonable score. Moreover, the recall for political actors is around 90%, going up to 96% for important persons. The general tendency seems to be that the more concrete, important, and political a concept is, the better the method performs. This is not surprising: concrete concepts are more likely to correspond well to words; important concepts are less affected by data scarcity, which is always problematic in corpus-based NLP methods; and the selection criteria for the documents in the target corpus were political terms, making it more likely for these terms to occur often in this corpus.

Apart from whether an entity had a corresponding term in the χ2 list, it is interesting to know at what position in the list it occurred, since the higher up the list the relevant terms are, the shorter the list that needs to be considered. Moreover, since it would be interesting if the matching of entities and terms could be done automatically, it was tested how often the match could be determined by a simple heuristic. This heuristic picks, in order of preference, a direct match of the whole entity, a direct match of one of the words in the entity description, or the closest match to one of the words. Table 3 below contains these two additional scores.

                        Concept        Institution      Person          Total
                      Av.Pos  Acc.   Av.Pos  Acc.   Av.Pos  Acc.   Av.Pos  Acc.
Non-pol.  Unimp.       3,310   75%    5,497   86%      -      -     5,332   86%
          Imp.         3,962  100%    6,700   65%      -      -     6,030   71%
Political Unimp.         -      -       -      -     9,537   94%    9,517   94%
          Imp.           -      -      648    75%    1,197   78%      978   77%
Total                  3,646   88%    5,545   78%    7,800   91%    6,228   85%

Table 3: Average rank of the correct terms in the χ2 list and accuracy of the automatic match method (cells with n<25 omitted, total n=816)
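The matching heuristic can be sketched as follows; `difflib` and the example term list are stand-ins for whatever string-distance measure and data the study actually used:

```python
# Illustrative version of the matching heuristic: prefer an exact match of
# the whole entity name, then an exact match of one of its words, then the
# closest string match. difflib stands in for the actual distance measure.
import difflib

def match(entity, term_list):
    """Map an entity description to a term from the extracted list, or None."""
    name = entity.lower()
    if name in term_list:                     # 1. whole-entity match
        return name
    for word in name.split():                 # 2. match on a single word
        if word in term_list:
            return word
    close = difflib.get_close_matches(name, term_list, n=1)
    for word in name.split():                 # 3. closest match to any word
        close += difflib.get_close_matches(word, term_list, n=1)
    return close[0] if close else None

terms = ["balkenende", "defensie", "kamp"]
print(match("Jan Peter Balkenende", terms))   # 'balkenende'
print(match("Ministerie van Defensie", terms))  # 'defensie'
```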


As can be seen in the table, the average position of terms on the list is much higher than the arbitrary cutoff point of 2,500 that was chosen earlier. This means that many concepts will not appear on this shorter list. The main exception are the important political actors, which have an average position below one thousand, in line with the recall results. Curiously, the unimportant political actors were generally lower on the list than the non-political actors, and the more concrete terms were actually lower on the list than the abstract terms.

The automatic matching heuristic performed well, averaging 90% for persons and abstract concepts and around 80% for institutions. It is interesting that important entities were more difficult to match than unimportant entities, for both persons and institutions. This seems mainly due to three artefacts in the data: relatively obscure acronyms for the ministries; the use of name prefixes for ministers and secretaries but not for MPs; and the listing of 'Royal' or 'Society for' in front of some important non-political institutions.

Precision

As defined above, precision measures the percentage of extracted terms that actually correspond to entities. Since we cannot assume that the Silver Standard is complete as well as correct, a term not corresponding to an entity in the list does not imply that the term is not relevant: the entity could be missing from the term list. Thus, the precision measured on the silver standard is a lower bound for the precision of the method.

To get an upper bound on the precision of the method, a 10% sample of the first 10,000 terms was reviewed by two domain experts, who judged whether these terms are relevant for constructing an entity list of the domain. Since many of these words will be duplicates with slight variations in spelling or conjugation, this is only an upper bound. The true precision of the method will be somewhere between these two figures.

A common graph in Information Extraction is the precision/recall curve, which shows the drop in precision as the threshold is lowered to obtain higher recall. In this case, as the arbitrary cutoff point is lowered, recall increases at the price of lower precision. Figure 1 shows this curve for the two bounds on precision, with the points indicating each thousand-position increase of the cutoff point.

As we can see, the lower bound on precision is quite low: around 13%. Moreover, to get acceptable recall, this precision has to be lowered to around 5%. The upper bound is much better: around 60% of the first 1,000 terms were judged relevant by the human experts, dropping to slightly below 30% at recall scores of around 40%. The F-score, a harmonic average of recall and precision, is around 30% regardless of the cutoff point (not shown in the graph).

[Figure 1: Precision/Recall curves for the Silver Standard (lower bound) and Expert judgements (upper bound)]

Which means...

The above results for precision and recall, which indicate an upper bound on the F-score of around 0.3, mean that the method is not yet suited as a fully automatic way to extract an entity list from the target corpus. However, the method can certainly be useful for aiding a researcher in constructing this list: by looking at the first couple of thousand terms, the researcher can quickly see whether some important concepts were omitted. As people are generally better at filtering out irrelevant terms than at thinking of all relevant concepts, while the computer reaches a recall of over 90% for certain categories, this can prove a very good combination.


5.2 Synonym extraction

The second and third steps of the process, dimensionality reduction and clustering, result in a list of candidate-synonyms per entity. These synonyms are the words that were closest to the word that was matched to the entity in the first step. To get a clean evaluation of this step, a random sample was taken of those entities for which a term was found in the first 2,500 terms. The evaluation was performed by having two domain experts determine the correctness of the first 25 candidate-synonyms for these entities. Three categories were used: direct synonyms ('Prime Minister'5 for the prime minister), indicators ('Party leader' for the leader of Labour), and irrelevant words ('Powell' for the Dutch ministry of Foreign Affairs).

The raw results of this evaluation can be seen in figure 2. In this figure, each row represents one entity, and the different synonyms are shown according to their distance from the entity as determined by the clustering. The rows are sorted by actor type and importance. Although this figure is difficult to interpret, some tentative conclusions can be drawn. The correct synonyms are not all clustered together, and the different synonyms for one term are spread fairly continuously in distance. Thus, it does not seem sensible to define a cutoff point for the synonyms based on the measured distance.

To get a better overview of the performance of the method per entity category, two aggregate scores are presented in table 4 below: the average number of synonyms and indicators among the first 25 candidates; and the average position of these relevant terms. The latter score is an indicator of the precision of the method, while the former indicates whether enough synonyms were found to make the method useful. The recall of the method is very difficult to determine, since we have no good standard for comparison. It would be very interesting to compare the found synonyms to manually annotated texts (which are available for the election studies), but this is left as future work.

On average, 1.4 direct synonyms and 2.6 indicator words were found among the first 25 candidate-synonyms for a term. As the average position in a list of 25 items is 13, the average positions of 7 and 11 for these words indicate that the distance is a useful measure of closeness. The method scores poorly on persons; this is presumably because few synonyms for a person exist apart from his or her name and function. For institutions and abstract concepts, more synonyms are found.

5These multiword terms are single words in Dutch


[Figure 2: Results of the synonym generation; rows represent entities (grouped into political and other actors, important and unimportant), points represent candidate-synonyms plotted by distance to the concept and marked as synonym, indicator, or error. Example: the top row is Geert Wilders, of which the first candidate is a direct synonym, geert, followed at some distance by a cluster containing two indicators, liberale and liberaal, and a lot of noise. See also appendix C]


                              Concept        Institution       Person          Total
                            Av.Pos  Amnt    Av.Pos  Amnt    Av.Pos  Amnt    Av.Pos  Amnt
Non-pol.  Unimp.  Syn          -      -       6.5    1.0       -      -       5.9    1.1
                  Ind          -      -      11.9    6.7       -      -      11.9    5.9
          Imp.    Syn         9.1    5.3      7.4    1.6       -      -       8.0    2.2
                  Ind        14.3    4.4      8.9    2.8       -      -      10.3    2.9
Political Unimp.  Syn          -      -        -      -       7.0    0.8      7.0    0.8
                  Ind          -      -        -      -       8.8    1.0      8.8    1.0
          Imp.    Syn          -      -       7.5    0.9      4.6    1.1      5.8    1.0
                  Ind          -      -      11.5    1.7      9.7    2.0     10.4    1.8
Total             Syn         8.7    4.4      7.3    1.3      5.1    0.9      7.2    1.4
                  Ind        13.9    3.9     10.4    3.1      9.5    1.6     10.7    2.6

Table 4: Average position of the relevant terms among the candidate-synonyms and the number of relevant terms in the first 25 candidate-synonyms (cells with n<5 omitted, total n=104)

Manual inspection of these synonyms6 revealed some interesting results:

• Even though the domain is limited, homonymy still poses a problem, although a much smaller one than would otherwise be expected. For example, 'As' is both an MP and the Dutch translation of 'Axis [of evil]', resulting in many Iraq-related synonyms; and 'Camp' and 'Kamp' are an MP and a minister as well as words for military encampments, the first being part of the Dutch 'Camp Smitty' in Iraq and the latter being the Dutch word meaning camp. Named Entity Recognition and POS tagging might well avoid many of these errors, as the main problem seems to be proper names being mixed up with nouns.

• Sports. Even in a corpus selected using political terms, many articles are about sports, especially football. To aggravate matters, these sports words are so densely clustered that any entity that could be a sports term has many very close candidate-synonyms from these sports articles. Examples include 'advocaat' (lawyer / football coach), 'az' (Ministry of General Affairs / football team), and 'Dekker' (Minister of Housing / cyclist). A better selection procedure (possibly using negative terms as well as positive ones) might avoid these problems.

• Abstract concepts. Although the matching and recall of step one perform much better on very concrete words, the synonyms generated for

6 See http://www.cs.vu.nl/ wva/papers/pol2005 for the raw lists.


abstract concepts include very useful indicators. For example, 'Judicial Power' yields trial, court, trials, courts, jurisdiction, proceedings, evidence, law, inquiry, adjudication: all in all 10 direct synonyms and 12 indicators in the first 25 words (this is the lowest important non-political actor in figure 2).

6 Summary and Discussion

This paper proposes a method to aid the researcher in two steps that are performed in most content analyses: the creation of the list of categories or entities that the researcher wants to count, and the definition of these terms in the codebook or synonym list.

For the first step, our method can suggest terms with a recall of 80%-90%, with the higher figure especially for concrete political actors. The precision of the method is lower, with an upper bound of around 50%. This means that the method is not suited as a fully automatic way to generate entities, since too much noise would be contained in the result. On the other hand, the high recall means that the method can be very useful in helping the researcher prevent errors of omission.

For the second step, our method can suggest lists of candidate-synonyms. In the first 25 candidates, there are on average 1.5 direct synonyms and 2.5 words that indicate the presence of the entity. In contrast to the first step, performance is best on general or abstract terms, although this might be because there simply are not many synonyms for a person.

6.1 Future work

Some additions to this method are fairly straightforward and can be expected to improve results at low cost. Lemmatizing will reduce the number of word forms, and thereby both computational complexity and data scarcity problems. POS tagging and selecting only nouns and proper names as candidate terms and synonyms can also help reduce computational complexity, and it can be expected to solve certain homonymy problems and increase precision. Finally, filtering out documents containing sports terms will tackle the specific problem mentioned in section 5.2.


Another good improvement might be dealing with multiword terms. The most straightforward way to do this is presumably collocation detection based on individual and joint frequency, combined with preprocessing to replace the collocations by single tokens (such as HouseOfCommons).
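A minimal sketch of such a collocation step, scoring adjacent word pairs by pointwise mutual information; the function names, thresholds, and the underscore-joined replacement tokens are illustrative choices, not taken from the paper:

```python
import math
from collections import Counter

def find_collocations(tokens, min_pmi=3.0, min_freq=2):
    """Return word pairs whose joint frequency is high relative to their
    individual frequencies (pointwise mutual information >= min_pmi)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    found = set()
    for (w1, w2), f in bigrams.items():
        if f < min_freq:
            continue
        pmi = math.log2((f / (n - 1)) /
                        ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi >= min_pmi:
            found.add((w1, w2))
    return found

def merge_collocations(tokens, collocations):
    """Replace each detected pair by a single joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            out.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy corpus in which 'house commons' recurs as an adjacent pair:
toks = ('house commons a b c d e f g h '
        'house commons i j k l m n o p').split()
merged = merge_collocations(toks, find_collocations(toks))
```

More refined scores (e.g. the likelihood-ratio tests discussed by Manning and Schütze 1999) would be less sensitive to low-frequency pairs than raw PMI.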

6.2 Possibilities and Limitations

Although this method can aid any content analysis for which enough documents are available, its main strength may well lie in very quick analysis of relatively broad terms. Concepts such as 'Judicial Power', 'European Politics', 'Democracy', and 'Freedom' seem to yield very useful synonyms.

This method can also be used to enhance word lists such as those in the General Inquirer () or used in (). The main problem with such word lists is that it is difficult to attain high recall; allowing the computer to generate synonyms and then removing the irrelevant ones might greatly increase recall without damaging precision.

References

Lee, M. D., B. M. Pincombe, and M. Welsh. An empirical evaluation of models of text document similarity. Submitted manuscript.

Banerjee, S. and T. Pedersen (2003). Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pp. 805-810.

Bestgen, Y. (2002). Détermination de la valence affective de termes dans de grands corpus de textes [Determining the affective valence of terms in large text corpora]. In Actes du Colloque International sur la Fouille de Texte CIFT'02, Nancy. INRIA.

Brouwers, L. (1989). Het juiste woord: Standaard betekeniswoordenboek der Nederlandse taal (7th edition; ed. F. Claes). Antwerpen: Standaard Uitgeverij.

Bryant, J. and D. Zillman (Eds.) (2002). Media Effects: Advances in Theory and Research. Mahwah, NJ: Lawrence Erlbaum.

Church, K. W. and P. Hanks (1989). Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 76-83.


Deerwester, S. C., S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391-407.

Holsti, O. (1969). Content Analysis for the Social Sciences and Humanities. Reading, MA: Addison-Wesley.

Kamps, J. and M. Marx (2002). Words with attitude. In Proceedings of the First International WordNet Conference.

Kirkpatrick, B. (1998). Roget's Thesaurus of English Words and Phrases. Harmondsworth, England: Penguin.

Kleinnijenhuis, J., D. Oegema, J. de Ridder, A. van Hoof, and R. Vliegenthart (2003). De puinhopen in het nieuws, Volume 22 of Communicatie Dossier. Alphen aan den Rijn (Netherlands): Kluwer.

Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (second edition). Sage Publications.

Landauer, T. K. and S. T. Dumais (1997). A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104, 211-240.

Landauer, T. K., P. W. Foltz, and D. Laham (1998). Introduction to latent semantic analysis. Discourse Processes 25, 259-284.

Manning, C. and H. Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

McCombs, M. E. and D. L. Shaw (1972). The agenda-setting function of mass media. Public Opinion Quarterly 36, 176-187.

Miller, G. (1990). WordNet: An on-line lexical database. International Journal of Lexicography (Special Issue) 3, 235-312.

Miller, G. (1995). WordNet: A lexical database for English. Communications of the ACM 38(11), 39-41.

Moldovan, D. I. and V. Rus (2001). Logic form transformation of WordNet and its applicability to question answering. In Meeting of the Association for Computational Linguistics, pp. 394-401.

Nakov, P., A. Popova, and P. Mateev (2001). Weight functions impact on LSA performance. In Proceedings of the EuroConference Recent Advances in Natural Language Processing (RANLP'01), pp. 187-193.

Osgood, C. E., G. J. Suci, and P. H. Tannenbaum (1967). The Measurement of Meaning. Urbana, IL: University of Illinois Press.


Pazienza, M. T. (Ed.) (2003). Information Extraction in the Web Era: Natural Language Communication for Knowledge Acquisition and Intelligent Information Agents. Springer.

Pennebaker, J. W., M. E. Francis, and R. J. Booth (2001). Linguistic Inquiry and Word Count. Mahwah, NJ: Lawrence Erlbaum Associates.

Schrodt, P. (2001). Automated coding of international event data using sparse parsing techniques. In Annual Meeting of the International Studies Association, Chicago.

Stone, P. (1997). Thematic text analysis: new agendas for analyzing text content. In C. Roberts (Ed.), Text Analysis for the Social Sciences. Mahwah, NJ: Lawrence Erlbaum Associates.

Stone, P., R. Bales, J. Namenwirth, and D. Ogilvie (1962). The General Inquirer: a computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science 7.

Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), pp. 417-424.

van Atteveldt, W., D. Oegema, E. van Zijl, I. Vermeulen, and J. Kleinnijenhuis (2004). Extraction of semantic information: New models and old thesauri. In Proceedings of the RC33 Conference on Social Science Methodology, Amsterdam.

Vossen, P. (Ed.) (1999). EuroWordNet: a multilingual database with lexical semantic networks for European languages. Dordrecht: Kluwer.

Young, M. D. (2001). Building worldviews with Profiler+. In M. D. West (Ed.), Applications of Computer Content Analysis, Volume 17 of Progress in Communication Sciences. Ablex Publishing.


Appendix A: Stoplist

The following words were excluded from the analysis:

zich een we je deze aan betreffende eer had juist na overeind van weer aangaande bij eerdat hadden jullie naar overigens vandaan weg aangezien binnen eerder hare kan nadat pas vanuit wegens achter binnenin eerlang heb klaar net precies vanwege wel achterna boven eerst hebben kon niet reeds veeleer weldra afgelopen bovenal elk hebt konden noch rond verder welk al bovendien elke heeft krachtens nog rondom vervolgens welke aldaar bovengenoemd en hem kunnen nogal sedert vol wie aldus bovenstaand enig hen kunt nu sinds volgens wiens alhoewel bovenvermeld enigszins het later of sindsdien voor wier alias buiten enkel hierbeneden liever ofschoon slechts vooraf wij alle daar er hierboven maar om sommige vooral wijzelf allebei daarheen erdoor hij mag omdat spoedig vooralsnog zal alleen daarin even hoe meer omhoog steeds voorbij ze alsnog daarna eveneens hoewel met omlaag tamelijk voordat zelfs altijd daarnet evenwel hun mezelf omstreeks tenzij voordezen zichzelf altoos daarom gauw hunne mij omtrent terwijl voordien zij ander daarop gedurende ik mijn omver thans voorheen zijn andere daarvan langs geen ikzelf mijnent onder tijdens voorop zijne anders dan gehad in mijner ondertussen toch vooruit zo anderszins dat gekund inmiddels mijzelf ongeveer toen vrij zodra behalve de geleden inzake misschien ons toenmaals vroeg zonder behoudens die gelijk is mocht onszelf toenmalig waar zou beide dikwijls gemoeten jezelf mochten onze tot waarom zouden beiden dit gemogen jij moest ook totdat wanneer zowat ben door geweest jijzelf moesten op tussen want zulke beneden doorgaand gewoon jou moet opnieuw uit waren zullen bent dus gewoonweg jouw moeten opzij uitgezonderd was zult bepaald echter haar jouwe mogen over vaak


Appendix B: Technical Details

Latent Semantics Analysis

As described in section 2.2, LSA (Landauer and Dumais 1997; Deerwester et al. 1990) is a method for analysing the semantics of words from their usage patterns. Technically, LSA is the application of the dimensionality reduction method Singular Value Decomposition (SVD) to a matrix containing the frequencies of the terms in the different documents.

SVD decomposes a d × t document-term matrix M into three matrices such that M = U · S · V^T, where U is a d × n matrix mapping the documents in M to the underlying 'factors', S is an n × n diagonal matrix containing the singular values of the original matrix, and V^T is an n × t matrix mapping these dimensions to the terms. Any matrix can be losslessly decomposed into these three matrices.

The dimensionality reduction is then performed by selecting only the top m ≪ n singular values, reducing S to an m × m matrix and truncating the other matrices likewise. The resulting matrix, M′, is the least-squares approximation to the original matrix given the rank restriction on S.

See also Landauer et al. (1998) for a very good and detailed introduction to this method.
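The truncation step can be illustrated with NumPy's dense SVD (the paper used the sparse SVDLIBC toolkit; this toy version, with an invented 6 x 8 matrix, only shows the rank-m reconstruction):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 8))            # toy 6-document x 8-term matrix

# Full decomposition: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

m = 2                             # keep only the top m singular values
M_approx = U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]

# M_approx is the best rank-m least-squares approximation of M; the
# columns of np.diag(s[:m]) @ Vt[:m, :] can serve as reduced term
# vectors for distance computations (scaling conventions vary).
```

Note that `np.linalg.svd` returns the singular values as a vector sorted in descending order, so slicing off the first m entries selects the largest ones.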

Weighting

Section 3.2 describes the weighting choices made in the current study. These choices are described more formally below.

Each cell dt(d, t) in the document-term matrix contains the normalized weighted frequency of the term in that document. The normalization is such that each document (each row in the matrix) sums to one. The weight is a combination of the global weight G(t) for that term and the local weight L(d, t) for that term in that document:

dt(d, t) = L(d, t) · G(t) / Σ_{t′ ∈ T} L(d, t′) · G(t′)

The local weight is the logarithm of one plus the frequency f(d, t) of that term in that document, which reduces the impact of words with relatively high frequencies:


L(d, t) = log₂(1 + f(d, t))

The global weight of a term t over all documents is defined as the relative conditional entropy of the documents given that term. It thus reflects the amount of information that knowing the frequency of that term in a document gives about that document. A word such as 'the', which will occur in almost all documents with comparable relative frequency, does not give any information about the document and will thus be assigned a very low global weight. The conditional entropy is the standard measure −Σ p(d|t) · log₂ p(d|t), where the probability of a document given a word is defined as the frequency of that word in that document divided by the total frequency of that word:

G(t) = 1 − H(D|t) / H(D) = 1 + ( Σ_{d ∈ D} p(d|t) · log₂ p(d|t) ) / log₂ |D|

p(d|t) = f(d, t) / Σ_{d′} f(d′, t)

Taking the conditional entropy relative to the total entropy of the documents, which is defined as the base-2 logarithm of the number of documents, ensures that the resulting measure lies between 0 (no information content) and 1 (maximal information content).
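The weighting scheme above can be transcribed directly into code; the following is a sketch, not the paper's implementation, and the 3 x 3 frequency matrix is invented for illustration:

```python
# L(d,t) = log2(1 + f(d,t)); G(t) = 1 - H(D|t)/H(D); rows normalized to 1.
import numpy as np

def weight_matrix(F):
    """F: documents x terms raw frequency matrix (ndarray of floats)."""
    D = F.shape[0]
    L = np.log2(1 + F)                           # local weight
    p = F / np.maximum(F.sum(axis=0), 1)         # p(d|t), per term column
    with np.errstate(divide='ignore', invalid='ignore'):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    G = 1 + plogp.sum(axis=0) / np.log2(D)       # global weight, in [0, 1]
    W = L * G                                    # combined weight
    return W / W.sum(axis=1, keepdims=True)      # each document sums to 1

# Toy data: term 0 is spread evenly over all documents ('the'-like),
# term 1 occurs in one document only, term 2 is in between.
F = np.array([[2, 0, 1],
              [2, 3, 0],
              [2, 0, 4]], dtype=float)
W = weight_matrix(F)
```

In this example the evenly spread term receives a global weight of (numerically) zero, so it contributes nothing to any document row after normalization.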

Selection of the contextual terms

In order to allow the usage patterns of the target terms, as well as their co-occurrence patterns, to determine the semantic distance, a number of contextual terms was included in the Latent Semantics Analysis alongside the target terms. These contextual terms were picked from the total list of terms by taking those with the highest information content times total frequency, in other words those with the highest f(t) · G(t), with G(t) as defined above.
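Sketched in code (the function name and toy values are hypothetical; G(t) would come from the weighting step above):

```python
# Rank candidate contextual terms by total frequency times global weight.
def select_context_terms(freq, G, k):
    """freq, G: dicts term -> value; return the k terms with the
    highest f(t) * G(t) score, best first."""
    return sorted(freq, key=lambda t: freq[t] * G[t], reverse=True)[:k]

# Toy values: 'the' is frequent but uninformative, so it scores low.
freq = {'the': 1000, 'minister': 40, 'debate': 25}
G = {'the': 0.01, 'minister': 0.6, 'debate': 0.8}
select_context_terms(freq, G, 2)   # ['minister', 'debate']
```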


Implementation

The decomposition of the resulting matrix was done using the SVDLIBC toolkit written by Doug Rohde7, based on the SVDPACKC library written by Berry and others. The creation of the document-term matrix and the recreation of this matrix from the SVD decomposition were done with Python scripts using the Numarray toolkit8. The clustering was done by building a distance matrix using the Pycluster toolkit9 and sorting the list of words per entity on this distance10. All code was executed on *nix machines: the SVD decomposition on a multiprocessor Solaris SPARC machine with 8GB internal memory, while the clustering and preprocessing were performed on a more run-of-the-mill Linux machine with 1GB internal memory. Total execution time was on the order of hours, with much of the self-written code unoptimized.

7 http://tedlab.mit.edu/ dr/SVDLIBC/
8 http://sourceforge.net/projects/numpy
9 http://bonsai.ims.u-tokyo.ac.jp/ mdehoon/software/cluster/software.htm

10 The two toolkits are freely available; the Python scripts to combine them are available upon request from the author.


Appendix C: Synonym lists

The tables below list the first ten synonyms per entity. The order is the reverse of that in figure 2.

Non-political actors

entity: candidate-synonyms

albert heijn/ahold: heijn ah super edah boer albert supermarktconcern casino nutreco meurs offensief
ambtenaren: ambtenaren ambtenaar topambtenaren departement departementen onderzoeken verwijt topambtenaar integriteit bewindslieden concludeerde
apothekers: apothekers medicinale cannabis knmg buijs kinkhoest mankell woedend perle gda wetswijziging
artsen zonder grenzen: azg dagestan losgeld erkel ontvoerde ismail powells kissinger tbilisi mahendra papandreou
belangengroeperingen: groepering jordaanse militanten qaida egyptische riyad offensief ahmad hezbollah tikrit israel
boeren, agrariërs: boeren veerman koeien biologische natuurbeheer kippen gelderse lnv mest pluimveehouders eieren
burgemeester / corps: burgemeester gekozen burgemeesters cohen benoeming fractievoorzitter bestuurlijke raadsvergadering vergadering raadsleden kandidaten
energiewinningsbedrijven: energiebedrijven ez opta laurens karien bouwbedrijven schaduwboekhouding nma clement bewindspersoon heinsbroek
europees parlement: ep europarlementariers europarlement straatsburg buitenen neelie parlementariers eurlings parlementsleden commissievoorzitter parlementarier
europese unie: eu unie lidstaten brussel voorzitterschap onderhandelingen luxemburg top grondwet eurocommissaris regeringsleiders
gedeputeerde staten: gedeputeerde statenlid provincies statenfractie maij ordening statenverkiezingen rietkerk gelderse lnv natuurbeheer
gemeenteraad: gemeenteraad fractievoorzitter raadsvergadering motie fracties raadsleden leefbaar vergadering aangenomen gekozen namens
hoge raad: raad state grlinks cu raadsvergadering motie raadsleden uitspraak fractievoorzitter discussie vergadering
humanitaire organisaties: humanitaire vredesmacht ontwapenen troepenmacht mandaat gezant kofi darfoer darfur diplomaten milities
immigratiedienst: ind uitzetting vreemdelingenbeleid vertrekcentra vertrekcentrum dossiers landsadvocaat rita ambassades pardonregeling jaco
israel: israel palestijnse israelische palestijnen sharon arafat jeruzalem gazastrook vrede abbas hamas
koninklijke landmacht: landmacht vliegbasis baret helikopters luitenant fregatten luchtmacht kolonel majoor defensiepersoneel bevelhebber
koninklijke marechaussee: marechaussee mariniers sergeant majoor missies irakees gelegerd afmp afgevuurd kolonel defensiepersoneel
koninklijke marine: marine landmacht luchtmacht vliegbasis fregatten krijgsmacht helikopters twenthe orion defensiepersoneel veteranen
koninklijke luchtmacht: luchtmacht landmacht vliegbasis marine krijgsmacht helikopters commando commandant luitenant kolonel eenheden
landen: landen lidstaten unie verdrag italie spanje eu brittannie navo rusland luxemburg
laurus: laurus konmar edah super albert heijn ah supermarktconcern casino nutreco jonnie
leger: leger soldaten militaire troepen gedood militair gazastrook arabische aanval strijders jeruzalem
media: media zender pers journalist symfonie nos commissariaat mco rso eo medy
militairen: militairen troepen militair militaire soldaten leger missie irakezen afghanistan as rumsfeld
navo: navo scheffer secretaris jaap bondgenootschap robertson annan missie afghanistan kofi militaire
pluimveehouders: pluimveehouders geruimd pluimveebedrijven besmette vervoersverbod uitbraak eieren ruimen vogelpestvirus mkz besmetting
provincie: provincies sybilla rietkerk statenfractie statenverkiezingen demissionair dusver visserij waddenzee kamercommissie lnv
publieke omroep: publieke omroepen orkest omroep radio media hilversum mco symfonie bnn medy
raad van state: state rechtspraak cassatie juridische tjeenk juridisch oordeel aanwijzing civiele unaniem cultuurnota
rechter: rechter uitspraak geding straf rechtszaak rechters opgelegd politierechter advocaten eis zitting
rechterlijke macht: rechterlijke strafzaken rechtbanken rechtspraak strafproces bewijsmateriaal vooronderzoek wetboek rechtszaken berechting berechten
verenigde staten: vs washington amerikanen vn rusland veiligheidsraad korea powell militaire resolutie naties
vn: vn veiligheidsraad naties resolutie powell annan militaire wederopbouw amerikaans washington iran
vn-veiligheidsraad: veiligheidsraad resolutie amerikaans wapeninspecteurs naties wederopbouw massavernietigingswapens annan sancties inspecteurs blix
wethouder: wethouder ruimtelijke leefbaar raadsvergadering fractievoorzitter ordening motie vergadering raadsleden discussie oudkerk
wto: wto cancun kyoto fischler embargo wapenembargo vetorecht giscard estaing meningsverschillen solana

bolkestein, frits: bolkestein eurocommissaris europarlementarier kroes barroso straatsburg buttiglione richtlijn europarlement neelie lidstaat
buttiglione: buttiglione neelie straatsburg commissievoorzitter europarlement ep europarlementariers buitenen parlementariers prodi socialist
kroes, neelie: kroes barroso buttiglione neelie europarlementarier straatsburg commissievoorzitter europarlement ep portefeuille kandidatuur
(korps) mariniers: mariniers kolonel majoor sergeant missies gelegerd manschappen apache afmp camp defensiestaf
advocaat: advocaat client mr raadsman bewijs bondscoach getuigen advocaten uitspraak davids gerechtshof
aid: aid inspectiedienst ruimingen hobbydieren mkz vervoersverbod vogelpestvirus kalkoenen hobbydierhouders hobbyboeren pluimveesector
ambassade: ambassade ambassadeur diplomaten saudi iraanse diplomaat indonesische teheran osama ambassades thaise
clienten: client raadsman verklaringen zitting vrijspraak ontkent strafzaak rechtszaak vrijgesproken aanklager ontkende
commando's: commando manschappen mariniers gelegerd eenheden commandant luitenant kolonel helikopters troepenmacht stabilisatiemacht
cuba: cuba guantanamo bay vijandelijke bases ashcroft qaida powells hooggeplaatste dissidenten clarke
illegalen: illegale inval onderzoekt woonwagenkamp xtc huiszoeking arrestaties vinkenslag huiszoekingen documenten strafbaar
nam: nam hield legde leek bleef bracht wist toonde zette achterstand verloor
nma: nma bouwbedrijven energiebedrijven schaduwboekhouding bouwfraude opta ez laurens enquetecommissie karien onrechtmatig
slachtoffers: slachtoffers doden schadevergoeding nabestaanden aanslag getuigen gepleegd gedood autoriteiten madrid misdrijven
thailand: thailand thaise uitzitten staatsbezoek machiel doodstraf indonesische kuijt veroordeelden uitgezeten drugssmokkel

(important non-political actors)

(unimportant non-political actors)


Political actors

entity: candidate-synonyms

az: az rbc heerenveen nac nec graafschap roda rkc psv jc eredivisie
buza: buitenlandse bot powell vn ambassadeur rusland iran colin washington hoofdstad ambassade
bzk: binnenlandse remkes aivd veiligheidsdienst inlichtingen ministeries staatsrecht ambtenaren terrorisme terroristische terroristen
christenunie: christenunie sgp leefbaar halsema peiling lijsttrekker rouvoet maurice femke zetel oppositiepartijen
eerste kamer: eerste wedstrijd derde klasse bleef punten speelde rust maakte vierde seizoen
ez: ez opta karien laurens noe comb bewindspersoon mtr quay res delfia
groenlinks: groenlinks christenunie leefbaar zetels fracties halsema sgp fractievoorzitter motie lijsttrekker femke
kabinet balkenende i: kabinet vice ministerraad leider regeerakkoord financien gpd ii premier kok bronnen
lnv: lnv visserij voedselkwaliteit ganzen mest ruimingen hobbydieren pluimveesector mkz natuurbeheer hobbyboeren
lpf: lpf fortuyn pim herben sgp eerdmans cu nawijn zetels grlinks christenunie
min financien: financien gerrit stabiliteitspact wijn afm eichel vice pact trichet eurocommissaris correspondent
min van defensie: defensie militairen militaire militair knaap leger rumsfeld troepen afghanistan krijgsmacht soldaten
min van justitie: justitie officier verdachte donner straf gevangenisstraf veroordeeld celstraf eiste advocaat verdacht
oppositiepartijen: oppositiepartij coalitiepartijen coalitiepartners coalitiepartner coalitiegenoten obrero frente volkspartij moties verkiezingscampagne regeringspartij
parlement, tweede kamer: parlement europees verkiezingen kandidaat barroso europarlementarier bolkestein kroes kandidaten eurocommissaris straatsburg
pvda: pvda groenlinks grlinks fractievoorzitter cu christenunie sgp fracties coalitie zetels verkiezingen
regering, overheid: regering onderhandelingen democratische nationale oppositie macht troepen hoofdstad verklaarde militaire vn
regeringspartijen: regeringspartij coalitiepartner coalitiepartners boris obrero frente fol oppositieleider kamerfracties openlijk volksvertegenwoordiging
vrom: vrom inspectie milieu ruimtelijke ordening volkshuisvesting geel provincies state nota illegale
vvd: vvd fractievoorzitter grlinks liberalen cu sgp fracties coalitie aartsen christenunie zetels
aartsen, tk fv: aartsen fractieleider liberale jozias liberaal geert liberalen dijkstal uitspraken partijgenoot dittrich
balkenende, jan-peter: balkenende premier leider beatrix koningin voorzitterschap peter rijksvoorlichtingsdienst kok formatie dittrich
berg, max van den: berg martijn marcel dennis quick be jeroen mark bergh marco hoofdklasse
bos, wouter, tk fv: bos wouter leider formatie lijsttrekker fractieleider verhagen kok halsema dittrich voorman
bot, ben, cda minister: bot ambassadeur colin ambassade diplomaat thailand mensenrechten betrekkingen regeringsleiders powell ontmoeting
brinkhorst, laurens-jan: brinkhorst energiebedrijven gennip nma laurens ez bewindsman opta bouwbedrijven karien schaduwboekhouding
de geus, aart-jan, cda: geus bewindsman gestuurd apothekers demissionair robin regeerakkoord prinsjesdag oppositiepartijen regeringspartijen agt
de graaf, thom, d66: graaf bestuurlijke antillen burgemeesters thom koninkrijksrelaties gekozen godett curacao mirna anthony
de hoop scheffer, jaap: scheffer jaap robertson bondgenootschap navo ovse annan kofi diplomaat diplomatieke ambassadeurs
dekker, sybille, vvd: dekker erik ronde thomas boogerd renner renners michael tijdrit tour volkshuisvesting
dittrich, boris, tk: dittrich verhagen boris regeringspartijen voorman maxime jozias regeerakkoord partijleider regeringspartij kamerdebat
donner, piet-hein, cda: donner hein gedetineerden terrorisme tbs casino vervolgd misdrijven bevoegdheden advocaten informateur
halsema, tk fv: halsema femke rouvoet oppositiepartijen linkse rosenmoller vos kamerverkiezingen duyvendak kamerdebat vlies
herben, mat, tk fv: herben mat eerdmans nawijn hammerstein partijbestuur hilbrand marten belder freeke joost
hirsi ali, tk: hirsi ayaan ali mohammed aartsen wilders geert politica bedreigingen theo uitspraken
kamp, henk, vvd minister: kamp militairen afghanistan henk woonwagenkamp missie militaire militair troepen isaf marine
marijnissen, tk fv: marijnissen lazrak eenmansfractie velzen kamervoorzitter standpunten verkiezingscampagne boris maxime partijleider melkert
nicolaï, edzo, staatssecretaris: nicolai atzo conventie solana lidstaat zuydewijn vetorecht giscard estaing bourbon karien
nijs, staatssecretaris: nijs collegegeld groenendaal zesde gewonnen belg hoof sven wellens richard demissionair
peijs, karla, cda minister: peijs karla waterstaat haegen schultz hofstra betuwelijn bewindsvrouw netelenbos verkeersminister puntenrijbewijs
remkes, vvd minister: remkes binnenlandse aivd veiligheidsdienst inlichtingen ambtenaren ministeries voordracht voorgedragen kuiper cohen
weisglas, tk: weisglas kamervoorzitter eenmansfractie presidium binnenhof kamerfractie voorkeurstemmen thom politica spoeddebat boris
zalm, gerrit, vvd minister: zalm financien gerrit vice peper stabiliteitspact dittrich aartsen regeerakkoord fractieleider voorman

as, tk: as samawah gelegerd muthanna camp mariniers manschappen irakees omgekomen sergeant basra
atsma, tk: atsma ormel buijs rijpstra buma haersma fessem cat aanvoering sorgdrager parlementair
baalen, tk: baalen rijpstra oplaat luchtenveld griffith kamerfracties buijs ormel orions korvetten haersma
bakker, tk: bakker joost bart quick bram mulder voskamp pol jongh martijn ham
bomhoff, eduard, lpf: bomhoff heinsbroek sorgdrager rowi luns quay bewindspersonen stv rijnsb gda foreholte
bommel, tk: bommel sneijder bouma heitinga ooijer meyde bronckhorst bosvelt vaart zenden nistelrooy
buijs, tk: buijs mtiliga niemeyer mosselveld lopes nieuwstadt teixeira zuurman ormel lindenbergh ax
koning: koning kroonprins claus monarchie laurentien oranjes vorstin johan koningshuis staatsbezoek zorreguieta
wilders, vvd, tk: wilders geert liberale liberaal ayaan aartsen bedreigingen eenmansfractie baalen uitspraken kamervoorzitter

(important political actors)

(unimportant political actors)
