Statistical Analysis of Textual Data - University of Washington · 2017-02-02 · Statistical Analysis of Textual Data Automatic assignment of documents with respect to manually defined

► Statistical text analysis has a long history in literary analysisand in solving disputed authorship problems

► First (?) is Thomas C. Mendenhall in 1887

Statistical Analysis of Textual Data

► Automatic assignment of documents with respect tomanually defined set of categories

► Applications automated indexing, spam filtering, contentfilters, medical coding, CRM, essay grading

► Dominant technology is supervised machine learning:

> Manually classify some documents, then learn aclassification rule from them (possibly with manualintervention)

Text categorization

► Documents usually represented as “bag of words:”

Document Representation

► xi’s might be 0/1, counts, or weights (e.g. tf/idf, LSI)

► Many text processing choices: stopwords, stemming, phrases,synonyms, NLP, etc.

► For instance, linear classifier:

Classifier Representation

IF ,THEN 1

ELSE 1

j i j i

j

i

x y

y

! "> = +

= #

$

► xi’s derived from text of document

► yi indicates whether to put document in category

► βj are parameters chosen to give good classificationeffectiveness

► Linear model for log odds of category membership:

Logistic Regression Model

( 1| )ln

( 1| )

i

j i j

ji

P yx

P y!

= += =

= "#i

i

i

xâx

x

► Equivalent to

( 1| )1

i

eP y

e= + =

+

i

i

âx

i âxx

► Conditional probability model

► If estimated probability of category membership is greaterthan p, assign document to category:

Logistic Regression as a Linear Classifier

IF ln ,THEN 11

j i j i

j

px y

p! > = +

"#

► Choose p to optimize expected value of your effectivenessmeasure

► Can change measure w/o changing model

Polytomous Logistic Regression

• Sparse Bayesian (aka lasso) Logistic regressiontrivially generalizes to 1-of-k problems

• Laplace prior particularly appealing here:– Suppose 100 classes and a word that predicts class 17– Word gets used 100 times if build 100 binary models,

or if use polytomous with Gaussian prior– With Laplace prior and polytomous it's used only once

1-of-K Sample Results: brittany-l1-of-K Sample Results: brittany-l

5249223.9All words

2205727.63suff+POS+3suff*POS+Argamon

1297627.93suff*POS

867628.73suff

365534.92suff*POS

184940.62suff

55450.91suff*POS

12164.21suff

4475.1POS

38074.8“Argamon” functionwords, raw tf

Number ofFeatures

%errors

Feature Set

89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.

BMR-Laplace classification, default hyperparameter

4.6 million parameters

The Federalist• “The authorship of certain numbers of the ‘Federalist’

has fairly reached the dignity of a well-establishedhistorical controversy.” (Henry Cabot Lodge, 1886)

• Historical evidence is muddled

Table 1 Authorship of the Federalist Papers

Paper Number Author

1 Hamilton

2-5 Jay

6-9 Hamilton

10 Madison

11-13 Hamilton

14 Madison

15-17 Hamilton

18-20 Joint: Hamilton and Madison

21-36 Hamilton

37-48 Madison

49-58 Disputed

59-61 Hamilton

62-63 Disputed

64 Jay

65-85 Hamilton

http://www.gutenberg.org/dirs/etext91/feder16.txt

•Used function words with Naïve Bayes with Poissonand Negative Binomial model

•Out-of-sample predictive performance

four papers to Hamilton

0.05Each Word

0.05Words (>=2)

0.05Wallace features

0.05484 features

0.08Words+POS

0.04Suffix3+POS

0.08Suffix2+POS

0.12Charcount+POS

0.10Words

0.09Suffix3

0.12Suffix2

0.19POS

0.21Charcount

10-fold Error RateFeature Set

Conclusion

• Authorship attribution needs to pay seriousattention to predictive uncertainty derivingfrom representational issues.