
A Structural Approach to Authorship Attribution using Dependency Grammars

Victor Wennberg

Fall 2012
Thesis project, 15 hp
Supervisor: Johanna Högberg
Examiner: Johanna Högberg
Bachelor of Science in Computer Science programme, 180 hp

Abstract

Authorship attribution is an important problem, with many applications of practical use in the real world. One principal constraint in dealing with this problem is related to the type of text being written, and for what purpose: its context. The context of a text influences the stylistics of the resulting text.

This thesis presents an approach to the problem that attempts to avoid the implications of context by analyzing grammatical structures, in practice dependency structures derived by computerized parsing software. For classification, latent semantic indexing is employed. Results are presented as a performance comparison with a similar approach based on phrase-structure trees.

The corpus used in these experiments is a subset of the ICWSM2009 corpus, provided by the International Conference on Weblogs and Social Media. The subset contains only blog posts, and shows a high degree of variance in a number of aspects, such as author attributes and the actual textual content.

In conclusion, the approach to the problem of attributing authorship appears to be significantly weaker than its phrase-structure counterpart. The outcome is discussed further, and possible approaches beyond the realm of authorship attribution are identified.

Acknowledgements

Thanks to Johanna Högberg for supervising this thesis project, providing clear descriptions when necessary and answering my questions throughout the project! Thanks also to Lars Bergstrom for his work on the thesis project upon which this project has been based.

I also want to thank everyone who, knowingly or unknowingly of this project, was supportive of me during the sometimes stressful work on this thesis. Thanks!

Contents

1 Introduction

2 Preliminaries
   2.1 Authorship Attribution
   2.2 Dependency Grammars
   2.3 Latent Semantic Indexing
   2.4 Unique Terms and Term Selection
   2.5 Feature Reduction

3 Problem Approach
   3.1 Processing of the Corpus
   3.2 Dependency Parser
   3.3 Document Selection
   3.4 Term Selection
   3.5 Classification
   3.6 Attribution of Authorship
   3.7 Comparison of Methods
   3.8 Implementation Details

4 Results

5 Discussion
   5.1 Corpus
   5.2 Term Extraction
   5.3 Latent Semantic Indexing
   5.4 Parser
   5.5 Statistical Analysis
   5.6 Implementation

6 Conclusions and Future Work

References


1 Introduction

A widely recognized definition of authorship attribution is the science of inferring characteristics of an author from the documents that author has written [12]. Historically, documents of unknown, or disputed, authorship have perplexed human beings and awakened their curiosity as to the true source. A number of particularly notable examples can be identified from history, such as the Hitler Diaries, published in 1983, and the conflicts regarding their authorship. Another well known example is the controversies regarding Shakespeare and his plays[7].

However, deciding authorship only represents the most obvious use of techniques for authorship attribution. In practice, the field encapsulates a wide range of diverse application areas, such as combating cybercrime[24] and de-anonymizing authors on the Internet, with implications for the whole concept of Internet anonymity[2]. It might not be apparent at first glance, but techniques for authorship attribution are readily applicable to other forms of production, such as source code, works of art, and music[10].

Historical approaches to the problem are mostly based on, in some cases naive, stylometric techniques. These include, but are not limited to, quantitative analysis of word usage and word frequencies, and analysis of richness of vocabulary[16]. The advent of complex computerized techniques for artificial intelligence and classification has been a catalyst in the development of novel approaches.

The primary obstacle in attempting to solve this problem is the size of the author set, that is, the number of potential authors. Another parameter with significant implications for the tractability of the problem is the amount of data from which to draw conclusions as to the characteristics of the authors. It appears that achieving good performance, despite these obstacles, requires novel approaches[2].

Another problem present in attempting to attribute authorship is the context in which the text was written, and the substantial consequences this has on the stylistics of the written text. The approach presented in this thesis project primarily attempts to circumvent the constraints imposed by context by employing grammatical dependencies derived from dependency grammar structures, generated by computerized parsers. For classification, latent semantic indexing (LSI) is utilized.

The experiments within the project are carried out on a subset, described later, of the ICWSM2009 corpus, containing blog posts from a large number of different authors[1]. The corpus has a number of interesting properties, such as high variance regarding both the authors and the context of the documents contained.

The main objective of the project, which also encapsulates the working hypothesis, is to analyze this approach by comparing it with another, highly similar, approach based on phrase-structure trees. The metric of interest is strictly performance-related, but other issues are brought up as topics of discussion.


2 Preliminaries

This section aims to equip the reader with a basic understanding of the concepts applied in this report, to make it easier to follow the material.

2.1 Authorship Attribution

As mentioned in the Introduction, authorship attribution is the problem of extracting characteristics from an author and subsequently using them in the analysis of whether or not this author has composed another piece, or, given the nature of the problem, how likely he or she is to have authored that piece[12].

One might ask: how is it possible to attribute authorship of texts? As David Holmes, stylometrist at the College of New Jersey, so eloquently puts it, “People’s unconscious use of every-day words comes out with a certain stamp”. The very existence of authors leaving their characteristic marks on their texts can be inferred from this quote. The possibility of authorship attribution is further reinforced by the fact that even primitive approaches do yield, although limited, results.

An alternative way of explaining how authorship attribution is possible is to look at a text as a series of choices made by its author. From these choices, patterns characteristic of the author emerge. By comparing these patterns, the authorship of texts can be determined.

By extrapolating the idea of patterns in the choices made by authors, the techniques developed for authorship attribution can be re-purposed for other forms of production in which the creator is frequently faced with choices. Examples include artwork, works of music and computer source code.

As to the factors that dictate how difficult a given instance of this problem is to solve, author set size stands out. The author set is simply the set of authors that could possibly have written the text. Another important parameter is the amount of text with known authorship available for analysis, that is, for deriving characteristic patterns.

Another issue that is a limiting factor for most suggested approaches to the problem is the context of the texts. Authors tend to change the stylistics of their language greatly depending on the purpose, or context, of the text being written [7]. In practice this means that documents from the same author written in different contexts need not share any characteristics.

A few other issues are worth mentioning. The style of an author tends to change during his or her life span. Another issue, similar to the impact of context, is the use of stylistic envelopes. A stylistic envelope is a passage within a text written in a different style than the surrounding text, primarily used to distinguish the dialogue of different characters[7].

Techniques for authorship attribution range from human beings, mostly literary scholars, analyzing texts by hand and determining whether or not a text is likely to have been written by one of the authors in the candidate author set, to approaches based on computerized analysis, often called non-traditional authorship attribution. A few approaches that have been suggested throughout history to deal with this problem are:

• Analyzing texts in terms of word or sentence length. This is not a reliable method.

• Function words: Frequency, position and/or immediate context of certain context-free words. A context-free word is a word that is used independently of the context of the text.

• Vocabulary richness: Attempts to measure the richness or diversity of an author's vocabulary, thereby creating a personal profile.

Computerized authorship attribution, which obviously can be done on a much larger scale than its human counterpart, has a vast array of practical application areas, not limited to:

• Signals intelligence: Identifying texts written by, for example, known terrorists or criminals

• Forensics: Evaluating the likelihood that a certain text connected to a crime was written by a suspect

• Solving issues of authorship within literary history[7]: Did Shakespeare write his own plays?

• De-anonymizing authors on the Web: Identification of a real-world person responsible for texts published online.

• Detecting hoaxes, frauds and deception online[3]: Automation of Internet spam detection

2.2 Dependency Grammars

Dependency grammars are a collection of modern syntactic theories, all based on the idea that the words constituting a sentence have dependency relations between them. The roots of this idea can be traced back to the work of Lucien Tesnière, who described a sentence as a set of words with connections between them, each representing a dependency between a superior and an inferior word[19].

A number of representational styles exist for these structures, satisfying different graph-theoretical properties such as acyclicity, whether parallel edges are allowed, etc. An example of a representational style is basic dependencies, which yields a connected, rooted tree[8].

Figure 1: Example of a dependency grammar tree structure, with relational labels included

For an example of a dependency structure using the basic dependencies representational style, see Figure 1.

For large-scale applications, these representations are commonly extracted using computerized software called parsers. A number of different parsers are freely available today, such as MaltParser[14], which is a designated dependency parser. The dependency relations within a sentence can also be extracted from a phrase-structure tree; the Stanford NLP Parser[13] provides facilities for producing dependency structures using a number of different representational styles.

In addition to the raw graphical structure, with nodes denoting words and edges denoting relationships, parsers attach labels to the dependency relations in the graph. These labels describe the nature of the relation between the terms.

With regard to these different approaches to parsing, there is a noticeable trade-off between speed and accuracy. Designated dependency parsers achieve noticeably lower accuracy, but much improved speed, compared to using constituent parsers and subsequently converting their output to dependency structures[6].


2.3 Latent Semantic Indexing

Latent semantic indexing, or latent semantic analysis, is a method for machine learning that relies on a mathematical technique called singular value decomposition. It is used to identify patterns in the relationships between terms and concepts within an unstructured mass of text. The main idea underlying LSI is that words used in the same contexts tend to have similar meanings. Experiments have shown that there are a number of correlations between the workings of LSI and the way humans process and categorize text[20]. The strength of latent semantic indexing lies in its ability to overcome the two most problematic constraints in dealing with Boolean keyword queries: synonymy and polysemy, that is, different words with similar meaning, and single words with several related meanings, respectively[20].

Latent semantic indexing works by constructing a matrix representation, denoted the term-document matrix, containing a column for each unique term and a row for each document. The value at each position in this matrix denotes the number of times the term, given by the column, occurs in the document represented by the row. Terms, also called features, are in this context simply attributes of the documents.

Commonly, a weighting function is applied to the values within the term-document matrix, aiming to increase the performance of the model. These functions attempt to emphasize terms of importance and disregard those deemed irrelevant. Weighting functions are commonly partitioned into global and local term weighting functions.

The documents encapsulated within the term-document matrix are then projected onto a space of dimensionality k by applying singular value decomposition. In effect, this partitions the terms into k latent classes. This k-space is said to capture the important information from the original matrix, as well as latent information from it. It is within this k-space that the actual comparisons of documents are performed, in terms of vector comparisons.

The choice of dimensionality for the space onto which the documents used to construct the model are projected, commonly denoted the k-value, is one of the most challenging tasks when using latent semantic indexing. First, the choice of k-value has a linear impact on the memory usage and computational time of the algorithm. The choice of k-value also has direct consequences for the actual performance of the algorithm: choosing a k-value that is too high will introduce too much noise, while choosing one that is too low will impair the model's ability to draw conclusions. There are some systematic approaches to choosing the k-value, such as evaluating the entropy, but in practice these perform rather poorly. Thus, the generally recognized approach to choosing the k-value is simply trial and error[5].
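As a concrete illustration of the pipeline just described, the following minimal sketch builds a tiny term-document matrix with NumPy, truncates its singular value decomposition to k dimensions and compares two documents in the reduced space. The matrix, its values and the choice k = 2 are invented for illustration and are not taken from the experiments in this report.

    import numpy as np

    # Toy term-document matrix: rows are documents, columns are unique terms,
    # entries are raw term frequencies.
    A = np.array([[2.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 3.0],
                  [1.0, 1.0, 0.0, 0.0],
                  [0.0, 2.0, 1.0, 1.0]])

    # Singular value decomposition A = U S V^T.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Keep only the first k singular values and vectors; in practice k is
    # found by trial and error, as noted above.
    k = 2
    doc_coords = U[:, :k] * s[:k]   # document coordinates in the k-space

    # Documents are now compared as k-dimensional vectors, e.g. documents 0 and 2.
    d0, d2 = doc_coords[0], doc_coords[2]
    print(d0 @ d2 / (np.linalg.norm(d0) * np.linalg.norm(d2)))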

Latent semantic indexing has a large number of practical applications, primarily within the field of information retrieval. These include information discovery, relationship discovery, automatic keyword annotation of images and interpreting source code. Another application, related to the very topic of this thesis, is authorship attribution, where approaches differ only in the method for choosing terms.

The complexity of latent semantic indexing leads to the widespread use of established libraries for this task. Canonical examples include SVDLIBC[15] and LAPACK[17]. By employing a standardized library such as these, the probability of errors is mitigated.

2.4 Unique Terms and Term Selection

Up to this point, terms have only been discussed in vague terms, but being the most primitive objects in the context of latent semantic indexing, they have substantial implications for performance. In this context, terms are used to extract information about the couplings between documents.

The primary difficulty in the selection of terms is that of finding the optimal balance between the amount of structure and information that is retained in the terms, while still providing a basis for establishing conclusions regarding the connections between documents. At a minimum, terms need to be present in more than one document. In data mining, the number of documents that contain a term is denoted its support-value[18].

Selecting appropriate terms is hard enough in the case of plain strings, and these difficulties are aggravated when attempting to extract terms from graph-like structures, such as trees, rather than from more primitive units such as words. Thus, a scheme for the extraction of terms is required, and it has to negotiate the difficulties described earlier.

A topic from the scientific field of data mining that relates to this is the frequent substructure pattern problem, stated as follows: given a graph data set D and a minimum support-value, enumerate the substructures from the data set having a support-value no less than the stipulated minimum value[11].
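To make the notion of a support-value concrete, the small sketch below counts, for each term, the number of documents it occurs in and keeps only the terms meeting a minimum support. The documents and term names are invented placeholders, not terms from the actual corpus.

    from collections import Counter

    # Each document is represented here by the set of terms extracted from it.
    docs = [
        {"nsubj->dobj", "det->amod", "prep->pobj"},
        {"nsubj->dobj", "prep->pobj"},
        {"det->amod", "advmod->neg"},
    ]

    min_support = 2   # a term must occur in at least this many documents

    support = Counter(term for doc in docs for term in doc)
    frequent_terms = {t for t, count in support.items() if count >= min_support}
    print(frequent_terms)   # the three terms that occur in two documents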

2.5 Feature Reduction

Another topic related to term selection is feature reduction, sometimes called feature set selection. The goal is to reduce the number of terms, while at least retaining model performance. The principal methodology of these techniques is to pick the terms deemed most important, yielding a subset of terms. This subset is supposed to contain the terms from the original set thought to be the most relevant, or expressive, for the construction of the latent semantic indexing model. The use of frequent substructure patterns with a minimum support-value, as described earlier, can be considered one form of feature set selection[23].

In effect, feature set selection reduces the dimensionality of the term-document matrix, with implications for memory usage and computational requirements. The performance of the model is also obviously affected, given the correlation between term selection and model performance. Correct employment of a feature reduction technique can also increase the model's ability to generalize.


3 Problem Approach

This chapter explains how the experiments are conducted. The objective is to provide a reader desiring to reproduce the experiments with the facilities to do so at his or her own will. For brevity, no explanations or motivations for the choices are presented in this chapter; instead these are left for Chapter 5.

Due to time constraints, no thorough description of the phrase-structure experiments will be given, even though this project ultimately aims to statistically establish the difference in performance between the phrase-structure approach and the dependency approach. The reader wishing to gain a deeper understanding of the phrase-structure tests is referred to Lars Bergstrom's thesis[4]. For the reader content with a rough outline of Bergstrom's approach: it is identical to this approach up to the choice of tree structures.

3.1 Processing of the Corpus

This section describes the corpus in general, the processing that was undertaken in order to make the corpus more manageable in terms of size, and also the pre-processing employed to facilitate subsequent parsing. Note that all of this work, except the final processing into dependency tree structures, was performed by Lars Bergstrom as a part of his thesis project[4].

The corpus used is commonly known as the ICWSM2009 corpus. The source of the data that makes up the corpus is the now defunct Tailrank.com (formerly Spinn3r.com). The website focused on providing feeds of content from the World Wide Web, distributing the content currently being discussed. The corpus in its entirety consists of nearly 200 gigabytes of both blog and news posts in the form of RSS feeds. The corpus was published in conjunction with the International AAAI Conference on Weblogs and Social Media, and is available to the public[1].

This particular corpus is interesting from several perspectives. First, blog posts are generally different from many other forms of text in terms of, for example, the language used. Further, there are no boundaries regarding the context of the texts in the corpus, implying a wide variety of different types of text. The corpus is also composed of texts from a set of authors showing very high variation across many different attributes.

A subset of this gigantic set of data is then created by selecting authors and their posts satisfying the following set of criteria[4]:

• All posts written by the author are in English (derived from meta-data)


• A total of between 10000 and 100000 characters of posts per author

• Authors have between 25 and 60 distinct posts satisfying the other criteria

• Every post contains no less than 400 characters

• All posts are of known authorship and source, as defined by the post meta-data

This subset is further made more manageable for a computerized parser by a number of pre-processing passes, each performing one of the items on the list below (a rough sketch of such passes is given after the list):

• Replacement of URL addresses with their hostname

• Collapsing of repeated whitespace and newlines into a single occurrence

• Replacement of repeated punctuation marks with a single occurrence

• Accumulation of all variants of quotation marks and apostrophes under one common character

• Removal of text within the li and blockquote HTML tags

• Removal of non-ASCII characters
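A rough sketch of what such normalization passes might look like is given below. The exact regular expressions and tag handling used in the original pipeline are not documented in this report, so the patterns here are illustrative assumptions rather than the actual implementation.

    import re
    from urllib.parse import urlparse

    def preprocess(text):
        # Replace URL addresses with their hostname.
        text = re.sub(r"https?://\S+",
                      lambda m: urlparse(m.group(0)).hostname or "", text)
        # Remove text within li and blockquote HTML tags.
        text = re.sub(r"<(li|blockquote)\b[^>]*>.*?</\1>", " ",
                      text, flags=re.S | re.I)
        # Replace repeated punctuation marks with a single occurrence.
        text = re.sub(r"([!?.,;:])\1+", r"\1", text)
        # Unify quotation marks and apostrophes under one common character.
        text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
        # Remove non-ASCII characters.
        text = text.encode("ascii", "ignore").decode("ascii")
        # Collapse runs of whitespace and newlines to a single space.
        return re.sub(r"\s+", " ", text).strip()

    print(preprocess('So  cool!!!  <li>menu item</li> http://example.com/post \u201cquoted\u201d'))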

Table 1 Corpus statistics for the original data and after filtering, phrase-structure parsing (denoted P-S) and dependency grammar parsing (denoted Dep.). Values left out are not available.

                                Original   Filtered       P-S      Dep.
Number of posts                 12873609     316260    191683    191683
Number of unique authors          834744      10580      7506      7506
Number of sentences                    -          -   4638035   4567654
Avg. post length (characters)       2150       3143      1590      1539
Avg. posts per author                 15         30        26        26
Avg. sentences per document            -          -      19.0      18.0
Avg. sentence length (char)            -          -      80.7      81.6

After the above pre-processing phases, each sentence is parsed using the Stanford NLP Parser[13], producing a phrase-structure representation for each sentence. Table 1 shows a breakdown of the document statistics before and after the different phases. Note that the differences in the number of sentences between the columns for the parsings reflect sentences that could not be parsed.

The final step in preparing the corpus for the experiments is the further processing of the phrase-structure parser output to generate dependency structures. Although this process achieves remarkably high accuracy, it is unable to process some sentences, as is apparent from Table 1. The total number of sentences without dependency representations is 70381. It is important to note that no documents are left completely void of dependency structures. The fact that sentences without dependency parsings are simply ignored during the experiments is discussed in the Discussion chapter.


3.2 Dependency Parser

The data constituting the corpus are organized by author and further divided naturally into documents, corresponding to blog entries. These are parsed using the Stanford Parser[13] and stored in the database used by the main framework, retaining the original author/document hierarchy. However, as parsers operate with textual sentences as the smallest unit, this adds another level to the hierarchy.

3.3 Document Selection

This section describes the selection process for documents, that is, how documents are chosen and further partitioned into the training set and the query set.

Table 2 Parameters considered in the document selection process. Note that the meaningful ranges for some of these parameters are confined by the filtering procedure described earlier in this chapter.

Parameter        Description
authors          Number of distinct authors to include documents from
queryDocs        Number of documents used to query the classifier
authorMinItems   Min. number of documents for a certain author to be considered
authorMaxItems   Max. number of documents belonging to a certain author
authorMinChars   Min. number of characters for documents to be applicable
authorMaxChars   Max. number of characters for documents to be considered

First, a number of authors are selected at random, in accordance with a number of parameters. These parameters can be seen in Table 2; note that some of them limit the total number of authors in the selection by imposing requirements on them. Once the authors have been selected, all their documents are retrieved from the database and partitioned into a training set and a query set. The number of documents in these sets also depends on the parameters set.

The training set is used for constructing the model, and the query set is used to test its performance. Each document in the query set represents one query to the system after the model has been constructed.

The only parameter fixed in the tests conducted in this project is queryDocs, in this report always set to 1. The reason for using this parameter value is to maximize the size of the training set for a given set of authors. Further, all the parameters concerning the eligibility of authors (authorMinItems, authorMaxItems, authorMinChars and authorMaxChars) are ignored, because they are already constrained implicitly by the processing of the corpus.
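A minimal sketch of this selection and partitioning step is given below, under the assumption that the documents are available as a mapping from author to documents; the database access and the ignored eligibility parameters are omitted, and the toy corpus is hypothetical.

    import random

    def select_documents(docs_by_author, n_authors, query_docs=1, seed=None):
        # Pick n_authors at random and split their documents into a training
        # set and a query set with query_docs documents per author.
        rng = random.Random(seed)
        authors = rng.sample(sorted(docs_by_author), n_authors)
        training, queries = [], []
        for author in authors:
            docs = list(docs_by_author[author])
            rng.shuffle(docs)
            queries += [(author, d) for d in docs[:query_docs]]
            training += [(author, d) for d in docs[query_docs:]]
        return training, queries

    corpus = {"a1": ["d1", "d2", "d3"], "a2": ["d4", "d5"], "a3": ["d6", "d7", "d8"]}
    train, query = select_documents(corpus, n_authors=2, query_docs=1, seed=0)
    print(len(train), len(query))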

3.4 Term Selection

This section describes the extraction of terms from the corpus in more detail.


Given the dependency structures presented earlier in this report, a number of choices are available in the context of extracting terms. First, there is the perhaps most apparent option of whether to consider words and/or relations. The tests conducted within this project only regard the relations between nodes, and not the words themselves.

Table 3 Example grouping of relationship labels, derived from the Stanford NLP Parser manual[8].

Group identifier   Member relationship labels
aux                aux, auxpass, cop
arg                arg, agent, acomp, attr, ccomp, xcomp, complm, obj, dobj, iobj, pobj, mark, rel, subj, nsubj, nsubjpass, csubj, csubjpass
mod                mod, abbrev, amod, appos, advcl, purpcl, preconj, infmod, mwe, partmod, advmod, neg, rcmod, quantmod, nn, npadvmod, num, number, prep, poss, possessive, prt

Further, a form of feature set reduction is applied to the terms extracted after stripping them of their words, while keeping the relations. This is implemented by creating so-called mergers, groups consisting of a number of relational labels. The reason this reduces the number of features, or terms, is that by lowering the number of possible relational labels, the domain of possible terms is also restricted.

Table 3 presents the relational merger used for the tests within this report. These groupings are derived from, and constructed with, the hierarchical structure presented in the Stanford NLP Parser manual[8] as a starting point. Every possible label not included in the table is not placed in any particular group, but instead makes up a new group containing only that label. It is important to note that the identifiers for the groupings presented in the table are there merely to facilitate human understanding, and have no computational significance.

Table 4 Example of subtree generation of height 1 with at most 2 children. The root node is presented together with the children and their respective relation to the root node. The merged relation is presented within parentheses; relations not grouped are represented by -.

Term   Root node   Child #1   Rel.              Child #2      Rel.
1      submitted   Bills      nsubjpass (arg)   were          auxpass (aux)
2      submitted   Bills      nsubjpass (arg)   by            prep (mod)
3      submitted   were       auxpass (aux)     by            prep (mod)
4      Brownback   Senator    nn (mod)          Republican    appos (mod)
5      ports       and        cc (-)            immigration   conj (-)

A clarifying example of the term selection, as performed within the scope of this thesis, is provided in Table 4. The example is derived from the dependency tree in the introductory section, Figure 1. Terms are systematically extracted from the trees by enumerating all possible combinations of subtrees such that the height is 1 and there are 2 child nodes. Note that the ordering among the children is irrelevant, as this is handled explicitly when comparing terms.

As shown in Table 4, a total of 5 distinct subtrees are generated from this particular dependency structure. The table includes the words as well as the labels of the dependency relations. The grouped relations are also presented, when applicable, within parentheses. It is important to note that even though words are included in the table, they are otherwise ignored.
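The enumeration of height-1 subtrees with two children, combined with the label merging of Table 3, could be sketched as follows. The tree representation (a word paired with a list of (relation, child) entries) and the abbreviated merger table are assumptions made for this example; in the actual experiments the words are discarded and only the relation labels are kept.

    from itertools import combinations

    # Abbreviated merger from Table 3; labels not listed map to themselves.
    GROUPS = {"nsubjpass": "arg", "auxpass": "aux", "prep": "mod"}

    def merge(label):
        return GROUPS.get(label, label)

    def extract_terms(node):
        # A node is a tuple (word, [(relation, child_node), ...]).
        word, children = node
        terms = []
        for (r1, _), (r2, _) in combinations(children, 2):
            # Child order is irrelevant, so sort the merged relations.
            terms.append((word,) + tuple(sorted((merge(r1), merge(r2)))))
        for _, child in children:
            terms.extend(extract_terms(child))
        return terms

    # Fragment of the dependency tree behind Figure 1 and Table 4.
    tree = ("submitted", [("nsubjpass", ("Bills", [])),
                          ("auxpass", ("were", [])),
                          ("prep", ("by", []))])
    for term in extract_terms(tree):
        print(term)   # root word shown for readability; only relations are used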

3.5 Classification

Continuing, all terms are extracted from the documents that constitute the training set mentioned earlier. Based on these terms, a term-document matrix for the latent semantic indexing is formed. The columns in this matrix represent every term that was encountered, on a global basis. To emphasize this further: every term that was encountered at least once in the extraction of terms from the documents will have a corresponding column in this matrix. The rows in the term-document matrix represent the documents that constitute the training set. Each entry in the matrix denotes the term frequency of the term represented by the column of the entry, in the document represented by the row. The term frequency is the number of occurrences of the term.

Note that any rows containing only zeros are removed from the matrix. A row consisting of only zeros corresponds to an empty document.

The values constituting the matrix are then weighted according to the term frequency-inverse document frequency scheme (abbreviated tf*idf). This weighting approach emphasizes terms appearing in few documents. The idea is that these terms are more characteristic, or unique.
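A compact sketch of tf*idf weighting applied to such a count matrix (documents as rows, terms as columns) is shown below. The exact weighting variant used in Lind is not spelled out in this report, so the standard logarithmic inverse document frequency used here is an assumption.

    import numpy as np

    def tf_idf(counts):
        # counts: documents x terms matrix of raw term frequencies.
        n_docs = counts.shape[0]
        doc_freq = np.count_nonzero(counts, axis=0)       # documents containing each term
        idf = np.log(n_docs / np.maximum(doc_freq, 1))    # rarer terms get larger weights
        return counts * idf

    A = np.array([[2.0, 0.0, 1.0],
                  [0.0, 3.0, 1.0],
                  [1.0, 0.0, 1.0]])
    print(np.round(tf_idf(A), 2))   # the term present in every document is weighted to zero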

The next step in processing this matrix is the singular value decomposition (SVD) of the matrix. Mathematically, this consists of decomposing the term-document matrix A into A = U S V^T. The resulting matrix U is an m×n matrix with orthonormal columns, V is an n×n orthogonal matrix and S is a diagonal matrix containing the singular values. Moreover, an approximation of rank k is created by using the first k columns of U and V and the first k columns and rows of S. The resulting sub-matrices are denoted Uk, Vk and Sk respectively.

Upon querying the latent semantic indexing system, an n×1 matrix is constructed. This matrix is generally denoted the query vector, q. It is a representation of the query document, where the i:th entry denotes the term frequency, or number of occurrences, of the unique term represented by the i:th column in the term-document matrix.

However, this query vector is still expressed in coordinates of the space corresponding to the original term-document matrix. So, in order to query the reduced k-space, coordinates for this space have to be derived. In practice, these coordinates, here denoted qk, are calculated from the query vector q as

    qk = q^T Uk Sk^-1


In querying, this query vector is related to the documents constituting the reduced space of dimensionality k constructed earlier. This is done by calculating the cosine similarity, that is, the cosine of the angle between two vectors. In this case, the similarities calculated are between the query vector q and all the documents, denoted di. The highest of these cosine similarities identifies the document most similar to the query vector when considering the terms in the original term-document matrix, or rather what was retained after reducing it to k dimensions. The cosine similarity is a value between 0 and 1. The mathematical formula for the cosine similarity is:

    sim(q, di) = (q · di) / (|q| |di|)

The strength of using a metric such as the cosine similarity for assessing the similarity between vectors is that, by reducing the comparison to a single value, the issues of high dimensionality and the human inability to interpret such spaces are avoided.
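Putting the projection and the cosine similarity together, a minimal sketch is given below, with documents as rows as described earlier in this chapter. With that orientation, the term-side factor of the decomposition, here called Vk, takes the place that Uk has in the projection formula quoted above; the numerical values are invented for illustration.

    import numpy as np

    A = np.array([[2.0, 0.0, 1.0, 0.0],    # training documents (rows) x terms (columns)
                  [0.0, 1.0, 0.0, 3.0],
                  [1.0, 1.0, 0.0, 0.0]])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

    doc_coords = Uk @ Sk                    # training documents in the k-space

    q = np.array([1.0, 0.0, 1.0, 0.0])      # query vector of term frequencies
    q_k = q @ Vk @ np.linalg.inv(Sk)        # fold the query into the k-space

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print([round(cosine(q_k, d), 3) for d in doc_coords])   # one similarity per document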

3.6 Attribution of Authorship

Figure 2: Illustration of the scoring approach utilizing summation of cosine similarities by author, using a cut-off value of 10.

The primary method for attributing authorship is to consider the n documents with the highest cosine similarity to the query vector q, where n is simply a cut-off value. The cosine similarities are then summed by author, creating a sum of similarities for each author having documents among these n. The author whose summed cosine similarity is the highest is the guess regarding authorship. The process described in this paragraph is clarified in Figure 2. The example uses a cut-off value of 10. In the figure, di denotes the distinct documents and ai the unique authors. The sub-graph in the upper-right corner shows the summed cosine similarities, by author.


Another possible approach to the attribution of authorship, also based on the k best results having the highest cosine similarities, is to create an entry in a table for every author having documents within this set of k results. Note that this subset is sorted. This collection of resulting cosine similarities is then traversed and, for every document, the author who wrote it is assigned points according to the document's position in the list. The number of points allotted is 1 divided by the current document's position within the list. The author with the highest number of points at the end is the guess provided.
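Both scoring variants can be sketched as follows, given the ranked list of (author, cosine similarity) pairs for the n best-matching training documents; the list used here is a hypothetical example.

    from collections import defaultdict

    # Top-n training documents for one query, sorted by decreasing cosine similarity.
    ranked = [("a2", 0.91), ("a1", 0.85), ("a2", 0.80), ("a3", 0.60), ("a1", 0.55)]

    # Variant 1: sum the cosine similarities per author (cf. Figure 2).
    summed = defaultdict(float)
    for author, sim in ranked:
        summed[author] += sim
    print(max(summed, key=summed.get))   # guess by summed similarity

    # Variant 2: reciprocal-rank points, 1 divided by the position in the list.
    points = defaultdict(float)
    for position, (author, _) in enumerate(ranked, start=1):
        points[author] += 1.0 / position
    print(max(points, key=points.get))   # guess by accumulated points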

3.7 Comparison of Methods

A Wilcoxon signed-rank test is employed for comparing the two approaches statistically. The test assesses hypotheses concerning whether or not the means of the population ranks differ. Thus, it is a paired difference test, and it is useful when the populations cannot be assumed to follow a normal distribution[22].

A number of assumptions have to be satisfied for the test to be applicable[21]:

1. Data is paired

2. Pairs are chosen in a random, and independent manner

3. Data is measured on an interval scale, to enable calculation of differences

A coarse outline of the test procedure is to calculate the absolute values of the differences for all pairs, and to assign ranks based on the sorted absolute differences. Finally, the test statistic is produced by taking the absolute value of the sum of the signed ranks over all pairs.
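In practice the test can be carried out with an existing statistics library. The sketch below uses SciPy's wilcoxon function on invented paired values; it is meant only to show the mechanics, not to reproduce the p-values reported in Chapter 4.

    from scipy.stats import wilcoxon

    # Paired observations from the same test runs: mean summed cosine similarity
    # for the dependency approach and for the phrase-structure approach.
    dependency       = [0.11, 0.09, 0.13, 0.10, 0.08, 0.12, 0.07, 0.14]
    phrase_structure = [0.21, 0.20, 0.25, 0.23, 0.22, 0.27, 0.23, 0.31]

    # Two-sided test on the paired differences; recent SciPy versions also
    # accept alternative="greater" or "less" for a one-sided hypothesis.
    statistic, p_value = wilcoxon(dependency, phrase_structure)
    print(statistic, p_value)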

3.8 Implementation Details

The implementation of the approach, as described earlier in this chapter, is realized as an extension of an already existing framework for authorship attribution. This system is called Lind, and was described by Lars Bergstrom in his thesis[4]. The Lind framework encapsulates facilities for performing authorship attribution using latent semantic indexing, and employs a number of different approaches to term selection. Examples include stop words and phrase-structure trees, as well as an approach based on the combination of the two. The dependency structure approach described in this report is implemented in the same manner as these two.

Underlying the system, managing the corpus and all related data, is a MySQL database, providing a uniform approach to retrieval.

The framework is primarily coded in the Java programming language, but certain units employ external programs. An example of an external program is a stand-alone C application for performing the computations involved in the singular value decomposition; for this, functionality in SVDLIBC[15] is used. Further, for management and manipulation of matrices, the Colt Framework for High Performance Scientific and Technical Computation in Java[9] is used.


4 Results

Table 5 This table describes a selection of relevant parameters for fine-tuning Lind,and the respective values used in the tests.

Parameter                    Value
authors                      Varies
queryDocs                    1
authorMinItems               -1 (ignored)
authorMaxItems               -1 (ignored)
authorMinChars               -1 (ignored)
authorMaxChars               -1 (ignored)
k (for LSI)                  150, unless explicitly stated otherwise
n (cut-off in attribution)   20

Table 5 provides a summary of the values used for the different parameters described earlier. For more information about the parameters and their involvement in the actual process of analysis, the reader is referred to Section 3.3, Document Selection.

Table 6 Mean number of documents contained in the training set for different sizesof the author set (n). The number of documents is the average over 25 test runs.

n     Number of documents
10    251
20    497
30    769
40    1007
50    1290
60    1525
70    1761
80    2002
90    2264
100   2498

Table 6 shows the mean number of documents over the 25 runs conducted for each setting of the number of authors considered.

Table 7 shows the number of unique terms, as well as the total number of terms (that is, the sum of each term's term frequency), for both the tests using dependency grammar trees and those using phrase-structure trees.


Table 7 Statistics on the number of terms for different author set sizes (n), showing the number of unique terms and the total number of terms for phrase-structure (UP and TP respectively) and dependency (UD and TD respectively). The numbers are means over 25 executions.

n     UP      TP       UD    TD
10    3190    60405    514   64661
20    5300    183222   628   183962
30    6794    266567   678   256605
40    7302    336198   667   322080
50    8457    436846   692   418574
60    8706    510476   718   509641
70    9617    598914   727   582616
80    10079   672018   715   667327
90    10715   761398   738   733228
100   11314   866720   750   841779


Figure 3: Mean summed similarities for different author set sizes (n = 10, 20, ..., 100) for phrase-structure trees (o), dependency trees (x) and chance (+)

Figure 3 contains the mean summed cosine similarities per author for a number of different author set sizes (n); o indicates phrase-structure trees, x dependency trees and + chance.



Figure 4: Mean score for different author set sizes (n) for phrase-structure trees (o), dependency trees (x) and chance (+)

Figure 4 shows the average scores for different author set sizes (n = 10, 20, ..., 100) for phrase-structure trees (o), dependency trees (x) and chance (+).

Table 8 Results from the Wilcoxon signed-rank test in the form of p-values fordifferent author set sizes, n.

n     p-value
10    0.0000005
20    0.0000000
30    0.0001019
40    0.0000043
50    0.0000554
60    0.0000020
70    0.0023172
80    0.0000031
90    0.0000009
100   0.0000001

Table 8 presents the p-values for a number of different author set sizes. With Ψd and Ψs denoting the mean summed cosine similarities over the 25 distinct runs for the dependency and phrase-structure approaches, and µd and µs the corresponding population means, the hypothesis being tested is:

    H0: µd - µs > 0


5 Discussion

In addition to more speculative discussions, this chapter motivates the choices made, and the pros and cons associated with them.

5.1 Corpus

The parameters relating to the corpus raise the questions of how much impact the particular choice of parameter values had on performance, and whether using another corpus would change the outcome. As already mentioned, the corpus displays a high degree of variation, both in terms of the authors and in the context of the documents.

The fact that the corpus has such a high degree of variance regarding the context of the texts most likely affects the performance. However, this does not necessarily mean that it has a negative impact. Quite the contrary, it could actually aid performance. This is because the variance in context is most likely at a level above individual authors, meaning that documents written by a certain author are likely to share contextual traits. As the attribution is carried out by comparing documents, this might tend to lower the scores of other authors due to contextual effects[7].

Further, the linguistic nature of postings on the Internet is likely to affect the results. This stems from the idea that texts published on Internet blogs are more likely to contain grammatical errors, and tokens such as smileys, which make the text harder to parse. It is possible that the sentences that could not be parsed actually express more of the characteristics of individual authors.

In addition, the nature of the corpus, and most importantly the fact that it is a collection of Internet postings, is obviously also very important. One problem that can be identified is the usage of acronyms that are commonplace on the Web, but cannot be handled by the parser used. One could argue that these have as much, if not greater, value in terms of expressing the characteristic features of authors.

5.2 Term Extraction

Intuitively, the choices made during the extraction of terms are highly important for the performance of the model. A number of caveats can be identified in the selection procedure. First, there is the somewhat crude issue of achieving overlap between the terms of the documents in such a way that conclusions can be drawn. By choosing terms in such a way that no overlap is present, terms simply become useless, as they are the least significant unit in the context of LSI. On the other hand, too high a degree of overlap between the terms of documents also implies that no conclusions can be drawn.

Another, albeit more intricate, aspect that could be elaborated further is how to actually conduct the selection of terms. It may seem impractical to just enumerate all the subtrees of a certain size from the structures. A more refined approach, which may yield better results in the real world, would be to actively look for terms having a support-value believed to match the optimal level of overlap, and to rank the importance of terms in accordance with the support-value. The implication of this approach is that by actively choosing the important structures, less responsibility is placed upon the classification model for emphasizing and suppressing terms.

By analyzing the statistics on the number of unique terms and the total number of terms presented in Table 7, it can be argued that the unique terms for the dependency structures saturate too early. This could have direct consequences for the performance of the latent semantic indexing model; for example, it places further emphasis on the weighting of terms.

Personally, I believe that by modifying the way in which terms are selected, and the nature of the information contained within them, performance can be improved significantly. However, no systematic approaches to term selection exist, and a huge number of variations would have to be considered. Further, it is a time-consuming task, as actual testing has to be conducted for every method. Therefore no systematic attempts at assessing term selection have been performed.

An alternative approach, not implemented here due to time constraints, is to conduct term selection in two phases. The first phase would aim to find terms within the documents of a specific author, looking for terms appearing in as many of these documents as possible. The second phase would select which terms from each author to consider, by using those that occur in as few documents from other authors as possible. The number of terms selected per author would of course have to be tweaked for performance.

5.3 Latent Semantic Indexing

An issue when using latent semantic indexing is to find the optimal choice of dimensionality for the vector space, the k-value. Even though there are some approaches to estimating this value, none of them are of any practical use when applied to a real-world problem domain. As the comparison presented in this thesis is the result of tests using different author set sizes (n = 10, 20, ..., 100), the most rational approach is to simply conduct tests for the different author set sizes, attempting to find the optimal dimensionality (k-value) by analyzing the outcome.

A critical reader might object to the fact that, in the comparison between the methods, the same dimensionality (k-value) is applied to both. The main critique here is partly that the sheer numbers of terms are significantly different between the two methods, as seen in Table 7. This implies that for the phrase-structure trees, more information is disregarded than for the dependency tree approach. Further, regardless of the number of unique terms, no analysis has been conducted to find the number of terms that actually have an impact on the model's ability to infer relationships. This has not been done for either of the approaches.

Term weighting is also a factor in the performance of the model. Choosing the right approach to weighting is analogous to the choice of dimensionality, that is the k-value: a craft rather than a science. One could argue that the optimal choice of weighting is determined by the nature of, and the features contained in, the text being analyzed.

A possible approach to weighting, which could either stand on its own or act as an auxiliary, is to add a weighting that takes the authorship of the document rows within the term-document matrix into account. One intuitive idea is to emphasize terms in proportion to how often they appear within the documents of a given author, compared to the inverse of how often they appear within the matrix globally. This would attempt to find terms that are common within the texts of each author, but rare on a global level, that is, characteristic terms.

An alternative approach to the construction of the term-document matrix can also be identified. Instead of constructing the model by considering documents as stand-alone entries, it is possible to merge all documents from a certain author. The principal idea behind this is to derive a set of characteristics for the author by taking all of his or her documents into account. This would lead to every author being represented by one row in the term-document matrix. This approach does, however, ignore the partitioning of the text into distinct documents, and thus information is lost in this regard.

5.4 Parser

As mentioned earlier, a number of parsers are available. The reason for using the Stanford NLP parser is that it was used in Lars Bergstrom's thesis project[4] and that the parser output from the phrase-structure parsing was provided. This output could then be used to produce the dependency relations, effectively cutting down the time needed to produce them to one tenth of the original time. The fact that the same parser was used also aids the analysis when comparing the methods.

Another consequence of the choice of parsing software is the remarkably high degree of accuracy achieved in parsing the sentences composing the documents. In fact, every document from the original data set used in Lars Bergstrom's thesis[4] has at least one sentence with a valid dependency representation. In practice this means that all documents eligible for partaking in the analysis can be included. An alternative way of dealing with documents lacking dependency representations would be to simply exclude them; however, this might lead to the withdrawal of all documents from an author in the case where that author no longer satisfies the constraints imposed, as described in the Problem Approach chapter.

5.5 Statistical Analysis

Critique could be raised against the fact that there is a mismatch in the number of sentences that have their dependency relations parsed. This comes from the fact that the parser was unable to process all the phrase-structure parsings of the sentences. One could say that the phrase-structure approach to authorship attribution is favored for these documents. However, transferred to a real-world situation, the fact that the dependency parsing cannot process as many sentences as the phrase-structure parsing has to be taken into account. Viewed in this way, the shortcoming can be seen as a natural consequence of the choice of approach rather than as a problem.

5.6 Implementation

Limited primarily by the time constraints imposed, testing has not been conducted extensively. This does not mean that the system has not been tested at all, but rather that the tests have been analogous to sanity tests, rather than systematic.

In the absence of time constraints, the employment of a more systematic approach to testing would be favorable. In particular, tests aimed at assessing the correctness of the more mathematically intensive parts of the system could prove beneficial, since errors in these parts tend to be more obscure and less apparent.

In conclusion, the lack of systematic testing does not in any way directly imply the presence of errors in the implementation. However, it does increase the likelihood of actual errors being present.


6 Conclusions and Future Work

In the face of the p-values resulting from the Wilcoxon signed-rank test, the null hypothesis can be rejected. It is apparent that this approach is significantly weaker than its counterpart utilizing phrase-structure trees. This implies that dependency structures are less expressive in terms of author characteristics than phrase-structure trees. Further, given the poor performance of this approach, it appears to lack the performance capacity required to be used on its own.

However, one cannot deduce from the results that dependency trees are of no merit in dealing with authorship attribution. It is also likely that the dependency structures, and the metrics inherent in them, could prove efficient when used as an auxiliary to other, already established, techniques such as stop words. For example, an additional dimension could be added for each stop word: the level of the stop word in the dependency grammar structure of the sentence containing it.

Further, it is in the very nature of this alternative way of representing sentences that more metrics about words can be extracted than from a plain sentence. These metrics, whose extraction is trivial, could prove to be of use in dealing with problems such as text classification.


Bibliography

[1] ICWSM-11 – 5th International AAAI Conference, 2011. http://www.icwsm.org/data/ [Online; accessed 05-April-2012].

[2] A. Narayanan, H. Paskov, N. Z. Gong, and J. Bethencourt. On the feasibility of internet-scale author identification. In Proceedings of the 33rd IEEE Symposium on Security and Privacy. IEEE, 2012.

[3] S. Afroz, M. Brennan, and R. Greenstadt. Detecting hoaxes, frauds, and deception in writing style online. In Proceedings of the 33rd IEEE Symposium on Security and Privacy. IEEE.

[4] L. Bergstrom. Syntaxbaserad författarigenkänning. 2010.

[5] R. B. Bradford. An empirical study of required dimensionality for large-scale latent semantic indexing applications. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 153–162, New York, NY, USA, 2008. ACM.

[6] D. Cer, M. de Marneffe, D. Jurafsky, and C. D. Manning. Parsing to Stanford dependencies: Trade-offs between speed and accuracy. In LREC 2010, 2010.

[7] D. H. Craig and A. F. Kinney. Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press, 2009.

[8] M. de Marneffe and C. D. Manning. Stanford dependencies manual. http://nlp.stanford.edu/software/dependencies_manual.pdf [Online; visited 2012-April-08].

[9] CERN European Organization for Nuclear Research. Colt, 2012. http://acs.lbl.gov/software/colt/ [Online; accessed 05-May-2012].

[10] G. Frantzeskou, E. Stamatatos, S. Gritzalis, and S. Katsikas. Effective identification of source code authors using byte-level information. In Proceedings of the 28th International Conference on Software Engineering, ICSE '06, pages 893–896, New York, NY, USA, 2006. ACM.

[11] A. Gago Alonso, J. Medina Pagola, J. Carrasco-Ochoa, and J. Martinez-Trinidad. Mining frequent connected subgraphs reducing the number of candidates. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 365–376. Springer Berlin / Heidelberg, 2008.

[12] P. Juola. Authorship attribution. Found. Trends Inf. Retr., 1(3):233–334, 2006.


[13] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 423–430, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[14] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, and E. Marsi. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135, 2007.

[15] D. Rohde. SVDLIBC, 2012. http://tedlab.mit.edu/~dr/SVDLIBC/ [Online; accessed 08-May-2012].

[16] E. Stamatatos. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556, 2009.

[17] Univ. of Tennessee; Univ. of California, Berkeley; Univ. of Colorado Denver; and NAG Ltd. LAPACK (Linear Algebra PACKage), 2012. http://www.netlib.org/lapack/ [Online; accessed 12-May-2012].

[18] W. Wang, Q.-Q. Yuan, H.-F. Zhou, M.-S. Hong, and B.-L. Shi. Extracting frequent connected subgraphs from large graph sets. Journal of Computer Science and Technology, 19(6):867–875, November 2004.

[19] Wikipedia. Dependency grammar — Wikipedia, the Free Encyclopedia, 2012.[Online; accessed 02-May-2012].

[20] Wikipedia. Latent semantic indexing — Wikipedia, the Free Encyclopedia,2012. [Online; accessed 02-May-2012].

[21] Wikipedia. Wilcoxon signed-rank test — Wikipedia, the Free Encyclopedia, 2012. [Online; accessed 21-May-2012].

[22] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.

[23] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. 1997.

[24] R. Zheng, Y. Qin, Z. Huang, and H. Chen. Authorship analysis in cybercrime investigation. In H. Chen, R. Miranda, D. Zeng, C. Demchak, J. Schroeder, and T. Madhusudan, editors, Intelligence and Security Informatics, volume 2665 of Lecture Notes in Computer Science, pages 959–959. Springer Berlin / Heidelberg, 2003.