Computational Stylometry

Ben Verhoeven

CLiPS, University of Antwerp

Guest lecture at Université Libre de Bruxelles, 8 May 2015

Text mining

Three layers of information in text

I Objective
  I Facts, concepts, characteristics of concepts, relations between concepts, . . .
  I Who does what, where, how and why?

I Subjective
  I Opinion, sentiment, emotion, . . .
  I Who believes what about what?

I Metadata - Profile
  I Age, gender, region, . . .
  I What do we know about the author?

Example

Objective

Who? Ed Miliband
What? Steve Coogan backs Labour this election

Subjective

What? Really great that SC backs Labour
Who believes this? Ed Miliband

Example

Profile

Who? Ed Miliband
Age? 40+
Gender? Male
Personality? weird?
Education? High

Sentiment Analysis

Commercial applications

Semantria - https://semantria.com/demo

Open-source implementations

Pattern - http://www.clips.ua.ac.be/pages/pattern
NLTK - http://www.nltk.org

Stylometry

The quantitative study of stylistic characteristics of a text

Writing style

A combination of invariant and unconscious decisions in language production on all linguistic levels, uniquely associated with specific authors or groups of authors

→ Human Stylome Hypothesis (Van Halteren et al. 2005)

Federalist papers

I Collection of 85 essays from the 1780s in favour of the new US constitution

I Written by Hamilton, Madison (& Jay)

I Disputed authorship

Federalist papers

Traditional stylometry

I Guesswork

I Odd verbs

I Checklist of conspicuous features

I But: schools, workshops, imitation, . . .

Mosteller & Wallace (1964)

I Quantitative

I Inconspicuous features

I Functors

Experiment

Count the number of letters ‘f’ on the next slide.

Experiment

Finished files are the result
of years of scientific study
combined with the experience
of many years.

Experiment

How many were there?

Experiment

[F]inished [f]iles are the result
o[f] years o[f] scienti[f]ic study
combined with the experience
o[f] many years.

Experiment

Which text is on the following slide?

Experiment

Experiment

Difficult error spotting . . .

Does our brain process little words differently?
And is this important?

Functors

I ↔ content words

I also called function words

I words or morphemes with little lexical meaning that serve grammatical functions

I the, and, they, she, on, at, . . .
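A minimal sketch of how function-word usage could be turned into numbers, assuming a simple regular-expression tokenizer. The word list `FUNCTION_WORDS` and the helper `function_word_profile` are illustrative names, not part of any tool mentioned in the lecture; real studies use much longer lists.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption; real studies use longer lists).
FUNCTION_WORDS = ["the", "and", "they", "she", "on", "at", "of", "to", "a", "in"]

def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in vocabulary}

profile = function_word_profile("The cat sat on the mat and the dog sat at the door.")
print(profile["the"])  # share of tokens that are "the"
```

Relative rather than absolute frequencies are used so that texts of different lengths stay comparable.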

Measuring unconscious language decisions

Can we measure them?
How would we measure them?
What do they mean?

Functors

Use of pronouns reveals personality and mental state of the author

I More ‘we’ than ‘I’ after 9/11 in USA

I More use of pronouns when in depression

I www.analyzewords.com

Computational Stylometry

Authorship attribution & verification

I Attribution - attribute text to one of limited set of authors

I Verification - is unknown text written by given author?

Author profiling

I Age

I Gender

I Location

I Personality

I Education

I Ideology

I Mental health

Authorship attribution

Digital humanities

The study of arts and human sciences by use of digital methods.

Case studies

I Federalist Papers

I Unabomber

I Robert Galbraith = J.K. Rowling

I Dutch medieval writers

Unmasking the Unabomber

I Theodore Kaczynski

I Professor of mathematics at Berkeley

I Letter bombs targeting universities and airlines in the 70s and 80s

I Wrote a manifesto of 35,000 words

I Word use recognised by family member

Robert Galbraith

I Crime writer, debut in 2013

I Stylometric investigation by Patrick Juola confirms gossip started by lawyer’s wife

I Pseudonym of J.K. Rowling

Medieval Dutch literature

Case: Spiegel Historiael

I Gigantic chronicles in rhyme verses

I History from the creation until around 1316

I Three authors
  I Jacob Van Maerlant
  I Filip Utenbroeke
  I Lodewijk van Velthem

Medieval Dutch literature

Problems

I Few texts in bad shape

I Spelling variation

I Copyists changing the text, making errors

Medieval Dutch literature

Solution? Rhyme words

Fraeye historie ende al waer
Mach ic v tellen hoort naer
Het was op enen auontstont
Dat karel slapen begonde
Tengelem op den rijn
Dlant was alle gader sijn.
Hi was keyser ende coninc mede.
Hoort hier wonder ende waerhede
Wat den coninc daer gheuel
Dat weten noch die menige wel
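A minimal sketch of rhyme-word extraction, assuming each verse is available as one string and that the rhyme word is simply the last token of the line. The helper name `rhyme_words` is hypothetical, and this is not presented as the procedure used in the actual study.

```python
def rhyme_words(verse_lines):
    """Return the last word of each verse line, lowercased, without trailing punctuation."""
    words = []
    for line in verse_lines:
        tokens = line.strip().rstrip(".,;:!?").split()
        if tokens:
            words.append(tokens[-1].lower())
    return words

verses = [
    "Fraeye historie ende al waer",
    "Mach ic v tellen hoort naer",
    "Tengelem op den rijn",
    "Dlant was alle gader sijn.",
]
print(rhyme_words(verses))
```

The resulting words can then be counted per text and used as features in the same way as any other word list.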

Medieval Dutch literature

Maerlant and Utenbroeke

I Knew each other well

I Worked closely together

I Hard to distinguish

Using machine learning?

Medieval literature

Documentary on Hildegard of Bingen

https://vimeo.com/70881172

Text categorization

Given a text and a predefined set of classes, predict the class the text belongs to

Many applications

I Spam

I News article topics

I Stylometry

I . . .

Methods

I Handcrafted approach

I Machine learning (since 1992)

Text categorization

Different aspects

I Class representation

I Document representation (features)

I Supervised machine learning method
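The three aspects above can be sketched in a few lines. This is an illustration under stated assumptions, not the method used in any study cited here: classes are plain labels, documents are represented as bags of lowercased words, and the learner is a multinomial Naive Bayes classifier with add-one smoothing, chosen only because it fits in a short example. The toy training data and the function names `train` and `predict` are made up.

```python
import math
from collections import Counter, defaultdict

def train(documents):
    """documents: list of (text, label) pairs. Returns per-class word counts and class counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in documents:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def predict(text, word_counts, class_counts):
    """Return the label with the highest smoothed log-probability for `text`."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (invented for illustration).
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train(training)
print(predict("cheap money", *model))  # spam
```

For a stylometric task, only the labels (for example author names) and the document representation would change; the learning step stays the same.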

Class representation

Binary

I Spam vs. ‘ham’

I Man vs. woman writer

I Maerlant vs. Utenbroeke

Multiclass

I Genres: blogs, news, jokes, novels, . . .

I Topics of reviews: books, phones, movies, hotels, . . .

I Age groups: 10s, 20s, 30s, 40s

Brief catalogue of features for stylometry

Numeric

I Complexity, readability

I Vocabulary richness
  I Type-token ratio
  I Hapax legomena

I Averages or distributions of
  I Syllable length
  I Word length
  I Sentence length
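A minimal sketch of some of these numeric features, assuming a crude regular-expression tokenizer and sentence splitter. The function name `style_numbers` is illustrative. Syllable length is left out because it needs a language-specific syllabifier.

```python
import re
from collections import Counter

def style_numbers(text):
    """Compute a few simple numeric style features for a non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    return {
        # number of distinct words (types) divided by number of running words (tokens)
        "type_token_ratio": len(counts) / len(tokens),
        # hapax legomena: words that occur exactly once
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),
    }

stats = style_numbers("The cat saw the dog. The dog ran.")
print(stats)
```

Note that the type-token ratio drops as texts get longer, so it is only comparable between texts of similar length.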

Character-level

I Letter frequency

I Punctuation

I Spelling errors

I Character n-grams
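Character n-grams can be counted with a sliding window over the raw text. A minimal sketch, with `char_ngrams` as an illustrative helper name:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count all overlapping character sequences of length n in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

grams = char_ngrams("banana", n=2)
print(grams.most_common())
```

Because the window runs over characters rather than words, the same counts also pick up punctuation habits, spelling variants and affixes.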

Brief catalogue of features for stylometry

Word-level

I Word n-grams

I Special dictionaries

I Morphology: prefixes and suffixes

Syntax

I Part-of-speech distributions

I Frequencies of syntactic chunks (e.g. NP = Det + Adj + N)

. . .

Metadata reflected in → language

Document variation

I Topic

I Register, genre

I Diachrony

Individual variation

I Age

I Gender

I Region

I Personality

I Education

I Ideology

I Mental Health

Language variation

I Spelling, punctuation

I Word choice

I Sentence structure

I Thematics, tone

I Text structure

Language: problematic for machine learning

Ambiguity at multiple levels

I Lexical

I Structural

Naive assumptions are false

I Word order doesn’t matter ?

I Features are independent ?

Ambiguity

Lexical

I Polysemy: word with different meanings (“book”)

I Context matters (“fall terribly” vs. “terribly interesting”)

Structural

I Metaphor (“Her smile was like sunshine”)

I Implication (“A bus!” could mean “Watch out!”)

I Co-reference (“I am here”)

Naive assumptions in machine learning are false for language

Word order matters

I The woman hit the man.

I The man hit the woman.
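A small illustration of why this matters for a bag-of-words representation: both sentences yield exactly the same word counts, and only a representation that keeps order (here word bigrams) tells them apart. The helper names are illustrative.

```python
from collections import Counter

def tokens(sentence):
    """Lowercase a sentence, drop the final period and split on whitespace."""
    return sentence.lower().rstrip(".").split()

def bag_of_words(sentence):
    """Unordered word counts."""
    return Counter(tokens(sentence))

def bigrams(sentence):
    """Ordered pairs of adjacent words."""
    words = tokens(sentence)
    return list(zip(words, words[1:]))

a = "The woman hit the man."
b = "The man hit the woman."
print(bag_of_words(a) == bag_of_words(b))  # True: the bags are identical
print(bigrams(a) == bigrams(b))            # False: the word order differs
```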

Word occurrences are not independent

I Syntax has rules, certain word forms demand others (Determiner + Noun)

I Phrases, proverbs (“Kind regards”, “Raining cats and dogs”)

I Whole idea of distributional semantics: a word can be understood by the company it keeps

Experiment: gender recognition

Is the author male or female?

omdat je t zo netjes vraagt.. en omdat je mijn tekst zo goed vond, een spoor achterlaten op je profiel....

dezze is gans moooiii meisje ik mis je verschrikkelijk hard we moeteeeens afspreken iloveyou

niiice (’; srry datk ni antwoorde maar pc ga traag ;)

o; ge moogt wel ni vloeken ea o; & ontkenning (aa) das ni goed eao; x

Experiment: gender recognition

Is the author male or female?

male
omdat je t zo netjes vraagt.. en omdat je mijn tekst zo goed vond, een spoor achterlaten op je profiel....

female
dezze is gans moooiii meisje ik mis je verschrikkelijk hard we moeteeeens afspreken iloveyou

female
niiice (’; srry datk ni antwoorde maar pc ga traag ;)

male
o; ge moogt wel ni vloeken ea o; & ontkenning (aa) das ni goed eao; x

Experiment: age recognition

Is the author younger than 16 or older than 25?

thx vr u notaaa ˆˆ ik ben toch geweldig in de plaaats van superhahaha das een grapje he xd?

kan alles tege uu zeggge ;s gy kom maandag bx mx slape ea ;otaniaa ga ook kome ;o & da ga leuuk worde enall x

erg knappe en sexy vrouw ben je!!! ge prikkelt me!!! val je nog op jongere mannen ??? ;-)

mercikes! hij is echt ongelofelijk mooi en zo lief, echt niet te doen! het is een echt beertje [hug]

Experiment: age recognition

Is the author younger than 16 or older than 25?

−16
thx vr u notaaa ˆˆ ik ben toch geweldig in de plaaats van superhahaha das een grapje he xd?

+25
kan alles tege uu zeggge ;s gy kom maandag bx mx slape ea ;otaniaa ga ook kome ;o & da ga leuuk worde enall x

+25
erg knappe en sexy vrouw ben je!!! ge prikkelt me!!! val je nog op jongere mannen ??? ;-)

+25
mercikes! hij is echt ongelofelijk mooi en zo lief, echt niet te doen! het is een echt beertje [hug]

Age and gender recognition

Dutch

I Chat language on Netlog
  I Age: −16 vs. +25 achieves 82% accuracy
  I Gender: male vs. female achieves 70% accuracy

English

I Argamon & Koppel (2003)
  I Age: 10s vs. 20s vs. 30s achieves 75% accuracy
  I Gender: male vs. female achieves 80% accuracy

Gender recognition: Explanation

Dutch chat language

Male

I dame

I knappe

I vrouw

I maat

I profiel

I den

I tege

I :-)

Female

I kei

I italics

I zonder

I iloveyou

I twee

I zijn

I ll

I ;-)

Gender recognition: Explanation

British National Corpus

I Use of pronouns (more by women) and certain types of noun modification (more by men)
  I ‘Male’ words: a, the, that, these, one, two, more, some
  I ‘Female’ words: I, you, she, her, their, myself, yourself, herself

I More ‘relational’ language by women, more ‘informative/rational’ language by men

I Even in formal language (non-fiction)

Gender recognition: by content

Age and gender recognition

www.tweetgenie.nl

Personality recognition

What is personality?

I “individual differences among people in behavior patterns, cognition and emotion” (Mischel, Shoda & Smith, 2004)

I Personality explains 35% of variance in life satisfaction
  I Compare: income (4%), employment (4%), marital status (1-4%)

I Personality changes
  I Reflected in language use?

Personality traits

I Personality can be broken down into components

I Different typologies
  I Big Five (OCEAN)
  I Myers-Briggs Type Indicator (MBTI)

Personality recognition

Big Five (OCEAN)

I Openness to experience
  I Inventive/curious vs. consistent/cautious

I Conscientiousness
  I Efficient/organized vs. easy-going/careless

I Extraversion
  I Outgoing/energetic vs. solitary/reserved

I Agreeableness
  I Friendly/compassionate vs. analytical/detached

I Neuroticism (emotional stability)
  I Sensitive/nervous vs. secure/confident

Do the test at http://www.outofservice.com/bigfive/

Personality recognition

Interesting method - LIWC

I Linguistic Inquiry and Word Count

I James W. Pennebaker

I 80 categories of words related to personality and mental state
  I Syntactic categories: e.g. self-reference, articles, . . .
  I Emotional categories: e.g. positive emotion, anxiety, . . .
  I Thematic categories: e.g. job, leisure, family, . . .
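The idea behind category counting can be sketched as follows. The mini-lexicon `CATEGORIES` is invented for illustration and is not the LIWC dictionary, which is far larger and is not reproduced here. The function name `category_rates` is also hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only (NOT the actual LIWC dictionary).
CATEGORIES = {
    "self_reference": {"i", "me", "my", "myself"},
    "articles": {"a", "an", "the"},
    "positive_emotion": {"happy", "great", "love"},
}

def category_rates(text, categories=CATEGORIES):
    """Return, per category, the share of tokens that belong to that category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    hits = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                hits[name] += 1
    return {name: hits[name] / total for name in categories}

rates = category_rates("I love my dog and the dog loves me")
print(rates)
```

In this sketch "loves" is not counted because only exact word forms are matched; handling inflected forms would need stemming or wildcard entries.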

Personality recognition

Interesting corpora

I Essays dataset (Pennebaker, later Mairesse)
  I English stream-of-consciousness texts by students

I myPersonality (Stillwell & Kosinski)
  I Large-scale data collection through Facebook app, many languages

I Personae (Luyckx & Daelemans)
  I Dutch essays, written by students

I CSI Corpus (Verhoeven & Daelemans)
  I Dutch papers, essays and reviews written by students

Results

I 55-65% for most traits

I Better than humans

Experiment: Personality recognition

Which text was written by an extravert/introvert author?

Hey there, if you are watching this movie you probably all ready know what Circle Lens are. For those of you that don’t I will just let you know really quick. Um, Circle Lens is a type of contact lens, um, that make your iris appear larger. So they’re really good for cross playing or giving a dolly effect. They also help with helping make somebody look, like, more awake. And, um, they’re colored lens usually. They come in, like, black, brown, but like, green, blue, all different colors.

This is how it is in my school. Okay, here’s an example. All right, um, when they see two guys are gay, they’re together, they’re like no, ew, no. No, no that – that doesn’t go together - - you know, two guys, no. two sticks, no. It just doesn’t work like . But when they see two girls, they’re like, get it on. And I don’t get these people. I’ve never seen someone say like, oh, you’re so homosexual or you’re so lesbian or you’re such a child molester. It is always the word gay, cause apparently gay is now an insult, even though the word means like happy and lively and that kinda giddy feeling you have inside

Experiment: Personality recognition

Which text was written by an extravert/introvert author?

Extravert
Hey there, if you are watching this movie you probably all ready know what Circle Lens are. For those of you that don’t I will just let you know really quick. Um, Circle Lens is a type of contact lens, um, that make your iris appear larger. So they’re really good for cross playing or giving a dolly effect. They also help with helping make somebody look, like, more awake. And, um, they’re colored lens usually. They come in, like, black, brown, but like, green, blue, all different colors.

Introvert
This is how it is in my school. Okay, here’s an example. All right, um, when they see two guys are gay, they’re together, they’re like no, ew, no. No, no that – that doesn’t go together - - you know, two guys, no. two sticks, no. It just doesn’t work like . But when they see two girls, they’re like, get it on. And I don’t get these people. I’ve never seen someone say like, oh, you’re so homosexual or you’re so lesbian or you’re such a child molester. It is always the word gay, cause apparently gay is now an insult, even though the word means like happy and lively and that kinda giddy feeling you have inside

Region recognition

I Especially interesting on chat data, where regional language variation is very visible

Mental health recognition

Mental illnesses such as Alzheimer’s disease or schizophrenia might be discovered by diachronically looking at writing style changes

Possible indicators

I Reduced vocabulary size

I Increased repetition

I More vague words

Case study

I Agatha Christie, British mystery writer

I Never diagnosed, but believed to have Alzheimer’s

I Investigated by Hirst, Le & Lancashire

Mental health recognition

Repetition of content words within ten lemmatized words

She got near the door. She stopped suddenly, then walked on. It looked as though something like a bundle of clothes was lying near the door. Something they’d pulled out of Mathilde and not thought to look at, Tuppence wondered. She quickened her pace, almost running. When she got near the door she stopped suddenly. It was not a bundle of old clothes. The clothes were old enough, and so was the body that wore them. Tuppence bent over and then stood up again, steadied herself with a hand on the door.
(Agatha Christie, Postern of Fate)
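A rough sketch of how such close repetitions could be counted. It makes two simplifying assumptions that differ from the measure named on the slide: tokens are only lowercased rather than lemmatized, and content words are approximated by removing a small, made-up stopword list. The names `STOPWORDS` and `close_repetitions` are illustrative.

```python
import re

# Small illustrative stopword list (an assumption; a real study would use a proper one).
STOPWORDS = {"the", "a", "an", "she", "it", "was", "were", "of", "and", "then", "on", "not"}

def close_repetitions(text, window=10, stopwords=STOPWORDS):
    """Return every non-stopword token that already occurred within the previous `window` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    repeats = []
    for i, token in enumerate(tokens):
        if token in stopwords:
            continue
        if token in tokens[max(0, i - window):i]:
            repeats.append(token)
    return repeats

print(close_repetitions("The clothes were old. The clothes were old enough."))
```

Dividing the number of repeats by the number of content words would give a rate that can be compared across novels of different lengths.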

Mental health recognition

The Nun Study

I Life-long diaries of nuns of Notre Dame congregation in Minnesota (Kemper et al., 2001)

I Measure scores for
  I Grammatical complexity
  I Idea density: number of distinct ideas per 10 words

I Results
  I Nuns with Alzheimer’s disease (AD) initially score lower than non-AD nuns
  I AD scores decline at a faster rate

I Possible explanation
  I Early-life language ability can predict risk of dementia

Ideology detection

Task

I Predict the cultural or ideological differences between textual sources

Possible use cases

I Find cultural differences between Western and local sources in African election (Pollak, 2008)

I Can we distinguish left-wing from right-wing politicians by their social media writing?

I Can we distinguish left-wing from right-wing newspapers by their writing?

Political opinion mining

Politieke Barometer

I Track mentions of politicians and parties on Twitter

I Analyse sentiment of these tweets

I Try to predict outcome of elections

www.politiekebarometer.be

Applications

Marketing

I TextGain

Text forensics

I Daphne & AMiCA

I Adversarial stylometry

I Deception detection

Stylometry for marketing purposes

Demographic market research

I Who says what about your product?
  I Are young educated women critical of it?

Demographic marketing

I Aim your advertising at specific groups of people
  I Google and Facebook are already doing this, because they just have all your personal data
  I e.g. pregnancy advertising
  I http://mashable.com/2014/04/26/big-data-pregnancy/

Text forensics

Daphne

I Defending Against Paedophiles in Heterogeneous Network Environments

I Predict age and gender of user
  I Compare predicted with profile information
  I Suspect if they don’t match

AMiCA

I Automatic Monitoring in Cyberspace Applications

I Cyberbullying detection
  I Children of different ages find different things offensive
  I Personality may have an influence on the way people bully

Text forensics

Adversarial stylometry

I Style = beyond conscious control?

I Can you make your style unrecognisable? (Yes.)
  I Machine translation (bad idea, but works)
  I Obfuscation: try to cover up your own writing style
  I Imitation: try to pretend to be someone else (pastiche, fanfiction)

I Context of cyberpaedophilia
  I Are older people recognisable when they pretend to be younger?

Text forensics

Deception detection

I Problem: fake reviews
  I Positive by owner/producer
  I Negative by competitor

I Let’s do the test

Experiment: real or fake review?

I have stayed at many hotels traveling for both business and pleasure and I can honestly stay that The James is tops. The service at the hotel is first class. The rooms are modern and very comfortable. The location is perfect within walking distance to all of the great sights and restaurants. Highly recommend to both business travellers and couples.

My husband and I stayed at the James Chicago Hotel for our anniversary. This place is fantastic! We knew as soon as we arrived we made the right choice! The rooms are BEAUTIFUL and the staff very attentive and wonderful!! The area of the hotel is great, since I love to shop I couldn’t ask for more!! We will definatly be back to Chicago and we will for sure be back to the James Chicago.

Experiment: real or fake review?

True
I have stayed at many hotels traveling for both business and pleasure and I can honestly stay that The James is tops. The service at the hotel is first class. The rooms are modern and very comfortable. The location is perfect within walking distance to all of the great sights and restaurants. Highly recommend to both business travellers and couples.

Fake
My husband and I stayed at the James Chicago Hotel for our anniversary. This place is fantastic! We knew as soon as we arrived we made the right choice! The rooms are BEAUTIFUL and the staff very attentive and wonderful!! The area of the hotel is great, since I love to shop I couldn’t ask for more!! We will definatly be back to Chicago and we will for sure be back to the James Chicago.

Deception detection

Cornell University Study

I Positive reviews
  I Truthful reviews from TripAdvisor
  I Deceptive reviews from Mechanical Turk

I Features
  I LIWC, word unigrams and bigrams

I Results
  I Human judges fail to make the distinction
  I Classifier is 90% accurate
  I Deceptive language is imaginative and narrative rather than informative and contains more superlatives

Deception detection

CLiPS Stylometry Investigation Corpus

I Positive and negative reviews
  I Same authors
  I Deceptive reviews about fictional products from same categories

I Features
  I Word unigrams
  I Without domain-specific words (product names)

I Results

Final demo

Stylene

I Analyse your writing style

I www.stylene.be