Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
► Statistical text analysis has a long history in literary analysisand in solving disputed authorship problems
► First (?) is Thomas C. Mendenhall in 1887
Statistical Analysis of Textual Data
► Automatic assignment of documents with respect tomanually defined set of categories
► Applications automated indexing, spam filtering, contentfilters, medical coding, CRM, essay grading
► Dominant technology is supervised machine learning:
> Manually classify some documents, then learn aclassification rule from them (possibly with manualintervention)
Text categorization
► Documents usually represented as “bag of words:”
Document Representation
► xi’s might be 0/1, counts, or weights (e.g. tf/idf, LSI)
► Many text processing choices: stopwords, stemming, phrases,synonyms, NLP, etc.
► For instance, linear classifier:
Classifier Representation
IF ,THEN 1
ELSE 1
j i j i
j
i
x y
y
! "> = +
= #
$
► xi’s derived from text of document
► yi indicates whether to put document in category
► βj are parameters chosen to give good classificationeffectiveness
► Linear model for log odds of category membership:
Logistic Regression Model
( 1| )ln
( 1| )
i
j i j
ji
P yx
P y!
= += =
= "#i
i
i
xâx
x
► Equivalent to
( 1| )1
i
eP y
e= + =
+
i
i
âx
i âxx
► Conditional probability model
► If estimated probability of category membership is greaterthan p, assign document to category:
Logistic Regression as a Linear Classifier
IF ln ,THEN 11
j i j i
j
px y
p! > = +
"#
► Choose p to optimize expected value of your effectivenessmeasure
► Can change measure w/o changing model
Polytomous Logistic Regression
• Sparse Bayesian (aka lasso) Logistic regressiontrivially generalizes to 1-of-k problems
• Laplace prior particularly appealing here:– Suppose 100 classes and a word that predicts class 17– Word gets used 100 times if build 100 binary models,
or if use polytomous with Gaussian prior– With Laplace prior and polytomous it's used only once
1-of-K Sample Results: brittany-l1-of-K Sample Results: brittany-l
5249223.9All words
2205727.63suff+POS+3suff*POS+Argamon
1297627.93suff*POS
867628.73suff
365534.92suff*POS
184940.62suff
55450.91suff*POS
12164.21suff
4475.1POS
38074.8“Argamon” functionwords, raw tf
Number ofFeatures
%errors
Feature Set
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter
4.6 million parameters
The Federalist• “The authorship of certain numbers of the ‘Federalist’
has fairly reached the dignity of a well-establishedhistorical controversy.” (Henry Cabot Lodge, 1886)
• Historical evidence is muddled
Table 1 Authorship of the Federalist Papers
Paper Number Author
1 Hamilton
2-5 Jay
6-9 Hamilton
10 Madison
11-13 Hamilton
14 Madison
15-17 Hamilton
18-20 Joint: Hamilton and Madison
21-36 Hamilton
37-48 Madison
49-58 Disputed
59-61 Hamilton
62-63 Disputed
64 Jay
65-85 Hamilton
http://www.gutenberg.org/dirs/etext91/feder16.txt
•Used function words with Naïve Bayes with Poissonand Negative Binomial model
•Out-of-sample predictive performance
four papers to Hamilton
0.05Each Word
0.05Words (>=2)
0.05Wallace features
0.05484 features
0.08Words+POS
0.04Suffix3+POS
0.08Suffix2+POS
0.12Charcount+POS
0.10Words
0.09Suffix3
0.12Suffix2
0.19POS
0.21Charcount
10-fold Error RateFeature Set
Conclusion
• Authorship attribution needs to pay seriousattention to predictive uncertainty derivingfrom representational issues.