Determining Common Authorship Among Documents
Paul Bonamy
Mentor: Dr. Paul Kantor
Author Identification & Common Authorship
Author Identification: “Who wrote this?”
  Mosteller/Wallace, 1964: The Federalist
  12 disputed papers attributed to Madison
  Generally utilizes statistical analysis
Common Authorship: “Do these share an author?”
  Does not (necessarily) require statistics/training
  Useful for detecting forgeries, etc.
BMR/BXR
Implements Bayesian Multinomial Regression
Used to perform 1-of-k classification
BMRtrain accepts feature vectors, outputs assignment model
BMRclassify accepts model & vectors, outputs assignments
Can output author probability vectors
Bayesian Analysis
Consider two match boxes
Probability of Box 1, given black marble?
H0 = We have Box 1, E = We see a black marble
Bayes' Theorem:
$$P(H_0 \mid E) = \frac{P(E \mid H_0)\,P(H_0)}{P(E)}$$
$$P(E \mid H_0) = .5, \quad P(H_0) = .5, \quad P(E) = .75$$
$$P(H_0 \mid E) = \frac{.5(.5)}{.75} = \frac{.25}{.75} = \frac{1}{3}$$
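The match box calculation can be checked with a few lines of Python. This is a minimal sketch using only the three probabilities given on the slide; the function name `posterior` is illustrative, and the value derived for the other box is an inference from those numbers, not something the slide states.

```python
from fractions import Fraction

def posterior(likelihood, prior, evidence):
    """Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# Values taken from the slide.
p_e_given_h0 = Fraction(1, 2)   # P(E|H0): black marble, given Box 1
p_h0 = Fraction(1, 2)           # P(H0): we hold Box 1
p_e = Fraction(3, 4)            # P(E): we see a black marble

p_h0_given_e = posterior(p_e_given_h0, p_h0, p_e)
print(p_h0_given_e)  # 1/3

# Inferred, not stated on the slide: the law of total probability gives
# P(E) = P(E|H0)P(H0) + P(E|H1)P(H1), which we solve for P(E|H1).
p_e_given_h1 = (p_e - p_e_given_h0 * p_h0) / (1 - p_h0)
print(p_e_given_h1)  # 1
```

Exact fractions are used so the result prints as 1/3 instead of a rounded decimal.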
Bayesian Analysis in BMR
Bayes’ Theorem
  Extendable to P(C|F1…FN)
  C is a class
  F1…FN are features
Effectively applies Bayes’ Theorem to itself
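Written out, the extension to several features takes the form below. The second line is one way to read "applies Bayes’ Theorem to itself": the posterior after n-1 features serves as the prior when the n-th feature is added. This reading is an assumption; the slides do not spell out how BMR performs the computation internally.

```latex
P(C \mid F_1, \dots, F_N) = \frac{P(F_1, \dots, F_N \mid C)\, P(C)}{P(F_1, \dots, F_N)}

P(C \mid F_1, \dots, F_n) = \frac{P(F_n \mid C, F_1, \dots, F_{n-1})\, P(C \mid F_1, \dots, F_{n-1})}{P(F_n \mid F_1, \dots, F_{n-1})}
```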
BMR/BXR Workflow
[Diagram: Data (Doc Corpus) → Feature Extractor → Feature Vectors → Test/Train Splitter → Training Set and Testing Set; Training Set → BMRtrain → Model; Model and Testing Set → BMRclassify → Author Identification and Author Probabilities]
Corpus Construction
Articles from 2006-07 issues of The Compass Newspaper
16 Authors
130 Documents
  300 - 500 Words: 69
  500+ Words: 61
Varied Topics
On Friday, November 3, LSSU experienced its first closing of the semester due to inclement weather. The Soo Evening News reported a “number of minor mishaps,” and “slippery-road induced mishaps,” including two crashes near the campus of LSSU. All classes before 10 AM were canceled because of the snow and ice that had accumulated overnight, but many students arrived for classes as usual, unaware of the cancellation. …
Feature Extraction
Perl script using Lingua::EN::Tagger
Selects words, part-of-speech (POS), or both (wordPOS)
  address/VB address/NN
Used wordPOS in common authorship study
Returns vector of feature frequencies
4:9.0 16:5.0 22:4.0 23:2.0 28:5.0 29:1.0 33:4.0 36:9.0 38:1.0 41:3.0 46:13.0 56:2.0 …
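The sample vector above looks like a sparse list of `index:value` tokens. A minimal Python sketch for reading such a line follows; it assumes each token is a feature index followed by that feature's frequency, which matches the slide's description but is not confirmed by it. The function name `parse_sparse_vector` is hypothetical, and only the entries visible on the slide are used.

```python
def parse_sparse_vector(line):
    """Parse 'index:value' tokens into a {feature_index: frequency} dict."""
    vector = {}
    for token in line.split():
        index, value = token.split(":")
        vector[int(index)] = float(value)
    return vector

# Only the entries shown on the slide; the original vector continues.
sample = ("4:9.0 16:5.0 22:4.0 23:2.0 28:5.0 29:1.0 "
          "33:4.0 36:9.0 38:1.0 41:3.0 46:13.0 56:2.0")
features = parse_sparse_vector(sample)
print(features[46])   # 13.0
print(len(features))  # 12
```

Features that do not appear in the line are treated as absent, so a lookup with `features.get(index, 0.0)` yields a frequency of zero for them.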
Author Probability Vectors
Produced by BMR/BXR upon request
Probability doc belongs to each author in the training set
Not normalized (sum not necessarily 1)
0.17% 0.68% 9.13% 8.90% 2.42% 0.94% 10.55% 0.32% 0.72% 36.95% 0.31% 0.50% 0.48% 22.08% 1.34% 4.52%
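As a quick check on the "not normalized" remark, the sample vector can be summed and its largest entry located. This is a small Python sketch over the sixteen percentages shown above; the variable names are illustrative, and positions are counted from zero purely by assumption.

```python
import math

sample = ("0.17% 0.68% 9.13% 8.90% 2.42% 0.94% 10.55% 0.32% "
          "0.72% 36.95% 0.31% 0.50% 0.48% 22.08% 1.34% 4.52%")

# Strip the percent signs and convert each entry to a float.
probabilities = [float(token.rstrip("%")) for token in sample.split()]

total = math.fsum(probabilities)
best_position = max(range(len(probabilities)), key=probabilities.__getitem__)

print(len(probabilities))   # 16
print(round(total, 2))      # 100.01
print(best_position)        # 9
```

The entries sum to 100.01% instead of exactly 100%, and the largest value (36.95%) sits at position 9 when counting from zero.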
Computed With Features
Start with feature vectors
Select all distinct pairs of vectors
Compute dot product and Euclidean distance
Sort data
  Descending by dot product
  Ascending by Euclidean distance
Computed With Authors
Start with author probability vectors
Select all distinct pairs of vectors
Compute dot product and Euclidean distance
Sort data
  Descending by dot product
  Ascending by Euclidean distance
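The same pairing procedure applies to both kinds of vectors and can be sketched in Python as follows. The vectors here are hypothetical toy values, not data from the study, and the plain (unscaled) dot product is an assumption, since the slides do not say whether vectors were normalized first.

```python
from itertools import combinations
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def score_pairs(vectors):
    """Return (i, j, dot, euclid) for every distinct pair of vectors."""
    return [(i, j, dot(vectors[i], vectors[j]), euclid(vectors[i], vectors[j]))
            for i, j in combinations(range(len(vectors)), 2)]

# Hypothetical toy vectors standing in for feature or author probability vectors.
vectors = [[1.0, 0.0, 2.0], [1.0, 1.0, 2.0], [0.0, 3.0, 0.0]]
pairs = score_pairs(vectors)

by_dot = sorted(pairs, key=lambda p: p[2], reverse=True)  # descending by dot product
by_euclid = sorted(pairs, key=lambda p: p[3])             # ascending by Euclidean distance

print(by_dot[0][:2])     # (0, 1)
print(by_euclid[0][:2])  # (0, 1)
```

With n documents there are n(n-1)/2 distinct pairs, so the 130-document corpus yields 8,385 pairs to score and sort.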
What Are We Looking For?
Dot product (DP) and Euclidean distance measure how close two vectors are
  Higher DP means closer; lower Euclidean distance means closer
Computed for every pair of vectors
Sorted from closest to furthest
Docs by same author are close together
Docs by different authors are far apart
Same Auth?  Doc #  Auth #  Doc #  Auth #  DP     Euclid
1           5      2       6      2       0.756  28.302
0           2      0       27     9       0.702  30.116
0           5      2       32     13      0.711  30.133
1           32     13      33     13      0.771  30.381
0           6      2       32     13      0.729  30.708
ROC Curve
Plots fraction of not-pairs (different authors) against fraction of pairs (same author)
Area under curve indicates model accuracy
  Higher is better
This curve: Euclidean distance of feature vectors
  Area under curve: 64.7%
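The area under an ROC curve can be computed directly from a sorted pair list. The Python sketch below uses the rank-based formulation (the fraction of same-author/different-author combinations in which the same-author pair is ranked closer). The slides do not say how the area was actually computed, so this is an assumed method. The labels are the "Same Auth?" column of the five example rows sorted by Euclidean distance, which is only an excerpt and will not reproduce the 64.7% figure.

```python
def auc_from_ranking(labels):
    """Area under the ROC curve for labels listed from closest to furthest pair.

    labels[i] is 1 for a same-author pair and 0 otherwise. Ties in the
    ranking are not handled in this sketch.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    wins = 0
    negatives_seen = 0
    for label in labels:
        if label == 1:
            # Every negative not yet seen is ranked further away than this positive.
            wins += negatives - negatives_seen
        else:
            negatives_seen += 1
    return wins / (positives * negatives)

# "Same Auth?" values from the five example rows, ascending by Euclidean distance.
labels_by_euclid = [1, 0, 0, 1, 0]
print(round(auc_from_ranking(labels_by_euclid), 3))  # 0.667
```

A perfect ranking (all same-author pairs first) gives 1.0, while a random ordering gives about 0.5.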
Can We Improve This?
          Euclid  Dot
Features  64.7%   65.2%
Authors   78.6%   83.3%
Results for Other Data Splits
[Chart: Analysis vs. Area Under ROC Curve; x-axis: Analysis Type; y-axis: Area Under ROC Curve, 60.0% to 100.0%]
Data Split       Features Euclid  Features DP  Author Euclid  Author DP
33.33% Accurate  73.5%            69.9%        95.1%          95.3%
38.10% Accurate  77.8%            65.7%        69.9%          75.2%
56.40% Accurate  64.7%            65.2%        78.6%          83.3%
80.00% Accurate  65.0%            77.0%        88.3%          92.0%
Analyzing Other Corpora
Obtained second corpus
  9377 Documents
  24 Authors
Results similar to those on Compass dataset
          Euclid  Dot
Features  55.2%   59.5%
Authors   79.7%   84.5%
Open Questions
Are Area Under Curve variations significant?
How does Author ID model accuracy affect same-author accuracy?
  A low Author-ID accuracy model did very well
Can we reduce memory/processing requirements?