Computational Extraction of Social and Interactional Meaning from Speech
Dan Jurafsky and Mari Ostendorf
Lecture 5: Register & Genre (Mari Ostendorf)


Page 1:

Computational Extraction of Social and Interactional Meaning

from Speech

Dan Jurafsky and Mari Ostendorf

Lecture 5: Register & Genre Mari Ostendorf

Page 2:

Register and Genre

Variation in language associated with the social context/situation:
- Formal vs. casual, status and familiarity
- Spoken vs. written
- Audience size
- Broadcast vs. private (performance vs. personal?)
- Reading level or audience age
- Rhetorical form or purpose: reporting, editorial, review, entertainment

Page 3:

Example:

Yeah. Yeah. I’ve noticed that that that’s one of the first things I do when I go home is I either turn on the TV or the radio. It’s really weird.

I want to go from Denver to Seattle on January 15.

In which case is the speaker assuming that a human (vs. a computer) will be listening to them?

Page 4:

Example:

A: Ohio State’s pretty big, isn’t it?
B: Yeah. We’re about to do the Fiesta Bowl there.
A: Oh, yeah.

o- ohio state’s pretty big isn’t it yeah yeah I mean oh it’s you know we’re about to do like the the uh fiesta bowl there oh yeah

structure, content

register/genre

Page 5:

More Speech Examples

A: Ok, so what do you think?
B: Well that’s a pretty loaded topic.
A: Absolutely.
B: Well, here in – Hang on just a minute, the dog is barking -- Ok, here in Oklahoma, we just went through a major educational reform…

A: This’s probably what the LDC uses. I mean they do a lot of transcription at the LDC.
B: OK.
A: I could ask my contacts at the LDC what it is they actually use.
B: Oh! Good idea, great idea.

A: After all these things, he raises hundreds of millions of dollars. I mean uh the fella
B: but he never stops talking about it.
A: but ok
B: Aren’t you supposed to y- I mean
A: well that’s a little- the Lord says
B: Does charity mean something if you’re constantly using it as a cudgel to beat your enemies over the- I’m better than you. I give money to charity.
A: Well look, now I…

What can you tell about these people?

Page 6:

Text Examples

WSJ treebank:
Fujitsu Ltd.'s top executive took the unusual step of publicly apologizing for his company's making bids of just one yen for several local government projects, while computer rival NEC Corp. made a written apology for indulging in the same practice.

Amazon book review:
By tradition, I have to say SPOILERS ALERT here. By opinion, I have to say the book is spoiled already. I don't think I've ever seen a worse case of missed opportunity than Breaking Dawn.

Weblogs:
I realise I'm probably about the last person in the blogosphere to post on this stuff, but I was away so have some catching up to do. When I read John Reid's knuckle-headed pronouncements on Thursday, my first thought was that the one who "just doesn't get it" is Reid.*

Newsgroups:
Many disagreements almost charge the unique committee. How will we neglect after Rahavan merges the hon tour's match? Some entrances project, cast, and terminate. Others wickedly set. Somebody going perhaps, unless Edwina strokes flowers without Mohammar's killer.

Page 7:

Text or Speech?

Interestingly, four Republicans, including the Senate Majority Leader, joined all the Democrats on the losing end of a 17-12 vote. Welcome to the coven of secularists and atheists. Not that this has anything to do with religion, as the upright Senator ___ swears, … I guess no one told this neanderthal that lying is considered a sin by most religions.

Brian: we should get a RC person if we can
Sidney: ah, nice connection
Sidney: Mary ___
Brian: she's coming to our mini-conference too
Sidney: great. we can cite her :)
Brian: copiously
Alan: ha…
Sidney: Brian, check out ch 2 because it relates directly to some of the task modeling issues we're discussing. the lit review can be leveraged

Page 8:

Word Usage in Different Genres

UW '03:
                  mtgs   swbd   email   papers   rand. web   conv. web
  nouns            17     13      29      31         32          19
  pronouns         10     14       3       2          2          13
  adjectives        6      4      11      10         10           9
  uh                3      3       0       0          0          .04

Biber '93:
                                           conversations   press reports
  relative clauses                              2.9             4.6
  causative adverbial subord. clauses           3.5              .5
  that complement clauses                       4.1             3.4

Page 9:

Why should you care about genre?
- Document retrieval:
  - Genre specification improves IR
  - Junk filtering
- Automatic detection of register characteristics provides cues to social context
  - Social role, group affinity, etc.
- Training computational models for ASR, NLP, or text classification: word usage varies as a function of genre
  - Impacts utility of different data sources in training, strategies for mixing data
  - Impacts strategy for domain transfer

Page 10:

Overview
- Dimensions of register and genre
- Genre classification
  - Computational considerations
  - Examples
- Cues to social context: accommodation examples
- Impact on NLP: system engineering examples

Page 11:

Overview
- Dimensions of register and genre
- Genre classification
  - Computational considerations
  - Examples
- Cues to social context: accommodation examples
- Impact on NLP: system engineering examples

Page 12:

Biber’s 5 Main Dimensions of Register
1. Informational vs. Involved Production
2. Narrative vs. Nonnarrative Concerns
3. Elaborated vs. Situation-Dependent Reference
4. Overt Expression of Persuasion
5. Abstract vs. Non-abstract Style

Page 13:

[Figure (from Biber, 1993, Computational Linguistics): genres plotted along Dimension 1 (involved vs. informational) and Dimension 3 (situated vs. elaborated reference). Running roughly from the involved/situated end to the informational/elaborated end: conversations, personal letters, broadcasts, fiction, spontaneous speeches, professional letters, news editorials, news reportage, academic prose.]

Page 14:

Examples of Other Dimensions
- Narrative vs. nonnarrative: fiction vs. exposition, professional letters, telephone conversations
- Overt argumentation and persuasion: editorials vs. news reports
- Abstract vs. non-abstract style: academic prose vs. conversations, public speeches, fiction

Page 15:

Register as Formality

Brown & Levinson (1987) model of politeness: factors that influence communication techniques
- Symmetric social distance between participants
- Asymmetric power/status difference between participants
- Weight of an imposition

Petersen et al. (2010) mapping for Enron email: influencing factors
- Symmetric social distance: person vs. business, frequency of social contact
- Asymmetric power/status: rank difference (CEO > pres > VP > director …)
- Weight of an imposition: automatic request classifier
- Size of audience
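A compact way to see how these factors combine (a sketch of Brown & Levinson's additive formulation, not a detail from the slides): the weight of a face-threatening act x is estimated as W_x = D(S,H) + P(H,S) + R_x, where D(S,H) is the social distance between speaker and hearer, P(H,S) is the hearer's power over the speaker, and R_x is the ranking of the imposition. The Petersen et al. mapping above supplies proxies for each term: contact frequency and person-vs.-business status for D, corporate rank difference for P, and an automatic request classifier for R_x.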

Page 16:

Overview
- Dimensions of register and genre
- Genre classification
  - Computational considerations
  - Examples
- Cues to social context: accommodation examples
- Impact on NLP: system engineering examples

Page 17:

Features for Genre Classification
- Layout (generally used for web pages; we won't consider these here)
  - Inclusion of graphics, links, etc.
  - Line spacing, tabulation, …
- Features of text or transcriptions
  - Lexical
  - Structural
- Acoustic features (if speech)

NOTE: Typically you need to normalize for doc length.

Page 18:

Features for Genre Classification
- Features of text or transcriptions
  - Words, n-grams, phrases
  - Word classes (POS, LIWC, slang, fillers, …)
  - Punctuation, emoticons, case
  - Sentence complexity, verb tense
  - Disfluencies
- Acoustic features
  - Speaker turn-taking
  - Speaking rate
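To make these lexical cues concrete, here is a minimal feature-extraction sketch in Python; the word lists and feature names are illustrative assumptions, not the feature set of any study cited in this lecture, and counts are normalized by document length as noted on the previous slide.

    import re
    from collections import Counter

    # Illustrative word classes; real systems use POS taggers, LIWC, slang lists, etc.
    FILLERS = {"uh", "um", "like", "well"}
    PRONOUNS = {"i", "you", "he", "she", "we", "they", "it"}

    def genre_features(doc):
        """Count simple lexical cues and normalize by document length."""
        tokens = re.findall(r"[a-z']+|[!?.]", doc.lower())
        n = max(len(tokens), 1)
        counts = Counter(tokens)
        n_sents = max(counts["."] + counts["!"] + counts["?"], 1)
        return {
            "frac_pronouns": sum(counts[w] for w in PRONOUNS) / n,
            "frac_fillers": sum(counts[w] for w in FILLERS) / n,
            "frac_exclaim": counts["!"] / n,
            "mean_sent_len": n / n_sents,
        }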

Page 19:

Feature Selection Methods

- Information filtering (information theoretic)
  - Max MI between word & class label
  - Max information gain (MI of word indicator & class label)
  - Max KL distance: D[p(c|w) || p(c)], where D(p||q) = Σ_i p(i) log[p(i)/q(i)] (see the sketch below)
- Decision tree learning
- Regularization in learning (see later slide)

MI = mutual information (see lecture 1)
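A minimal sketch of the KL-distance filter above, assuming binary word-indicator features and labeled training documents; the function and variable names are made up for illustration.

    import math
    from collections import Counter, defaultdict

    def kl_feature_scores(docs, labels):
        """Score each word by D[p(c|w) || p(c)]; higher = more class-indicative."""
        class_counts = Counter(labels)
        total = sum(class_counts.values())
        p_c = {c: n / total for c, n in class_counts.items()}

        class_given_word = defaultdict(Counter)
        for doc, c in zip(docs, labels):
            for w in set(doc.lower().split()):   # word indicator: present or not
                class_given_word[w][c] += 1

        scores = {}
        for w, counts in class_given_word.items():
            n_w = sum(counts.values())
            scores[w] = sum((counts[c] / n_w) * math.log((counts[c] / n_w) / p_c[c])
                            for c in counts)
        return scores   # keep the top-k scoring words as features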

Page 20:

Popular Classifiers
- Naïve Bayes (see lecture 1)
  - Assumes features are independent (e.g. bag of words)
  - Different variations in Rainbow toolkit for weighting word features, feature selection, smoothing
- Decision tree (in Mallet)
  - Greedy rule learner, good for mixed continuous & discrete features (can be high variance)
  - Implicit feature selection in learning
- AdaBoost (ICSIboost = version that's good for text)
  - Weighted combination of little trees, progressively trained to minimize errors from previous iterations

Page 21:

Popular Classifiers (cont.)
- Maximum entropy (in Mallet)
  - Log-linear model: exp(weighted comb. of features)
  - Used with regularization (penalty on feature weights), which provides a feature selection mechanism
- Support vector machine (SVM, in svmlight)
  - 2-class linear classifier: weighted sum of similarity to important examples
  - Can use kernel functions to compute similarity and increase complexity
  - For multi-class problems, use a collection of binary classifiers

Page 22:

Genre Classification
- Standard text classification problem
  - Extract feature vector → apply model → score classes
  - Choose class with best score (see the sketch below)
- Possible variation: threshold test for "unknown genre"
- Evaluation:
  - Classification accuracy
  - Precision/recall (if allowing unknown genre)
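A minimal sketch of that recipe with scikit-learn: bag-of-words features plus a regularized log-linear (maximum entropy) classifier. The "unknown genre" threshold value is an arbitrary illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_genre_classifier(train_docs, train_genres):
        """Word/bigram features + regularized log-linear classifier."""
        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                            LogisticRegression(max_iter=1000))
        return clf.fit(train_docs, train_genres)

    def classify(clf, doc, threshold=0.5):
        """Pick the best-scoring genre; below the threshold, back off to 'unknown'."""
        probs = clf.predict_proba([doc])[0]
        best = probs.argmax()
        return clf.classes_[best] if probs[best] >= threshold else "unknown genre"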

Page 23:

Genre as Text Types
- IR-motivated text types (Dewdney et al., 2001)
- 7 types: ads, bulletin board, FAQ, message board, Reuters news, radio news, TV news (ASR audio transcripts)
- Forced decision: 92% recall
- Best results with SVM & multiple feature types
- Most confusable categories:
  1. radio vs. TV transcripts (5-10%)
  2. ads vs. bulletin board (2-8%)

Page 24:

Genre as Text Types (II)
- British National Corpus: 4 text & 6 speech genres
- Results (Santini et al., 2004):
  - POS trigrams & Naïve Bayes, truncated documents
  - 85.8% accuracy 10-way, 99.3% speech vs. text
- Unpublished UW replication:
  - Similar results with full documents
  - Slight improvement for POS histogram approach

Page 25:

Genre as Text Types (III)

Data sources (from LDC):
- Speech: broadcast news (bn), broadcast conversations (bc), meetings (mt), switchboard (sb)
- Text: newswire (nw), weblogs (wl)

Pipeline (Feldman et al. '09): document → POS tagging → collect windowed histograms → compute histogram statistics → Z-norm + PCA → Gaussian classifier (sketch below)
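A minimal sketch of that pipeline with off-the-shelf components; the window size, number of PCA components, and choice of summary statistics are illustrative guesses, and QDA stands in for the Gaussian classifier.

    import numpy as np
    import nltk   # needs the 'punkt' and 'averaged_perceptron_tagger' data packages
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    def pos_histogram_stats(doc, tagset, window=50):
        """POS-tag the document, take tag histograms over fixed-size windows,
        and summarize each tag's windowed frequency by its mean and variance."""
        tags = [t for _, t in nltk.pos_tag(nltk.word_tokenize(doc))]
        windows = [tags[i:i + window]
                   for i in range(0, max(len(tags) - window, 0) + 1, window)]
        hists = np.array([[w.count(t) / max(len(w), 1) for t in tagset]
                          for w in windows])
        return np.concatenate([hists.mean(axis=0), hists.var(axis=0)])

    # docs, labels, and tagset (POS tags seen in training) are assumed given:
    # X = np.array([pos_histogram_stats(d, tagset) for d in docs])
    clf = make_pipeline(StandardScaler(), PCA(n_components=20),
                        QuadraticDiscriminantAnalysis())
    # clf.fit(X, labels)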

Page 26:

Open Set Challenges

Test on matched sources:

                                 % correct
  QDA w/ POS histograms             98%
  Naïve Bayes w/ bag-of-words       95%

  Main confusions: BC → BN; WL → BN, NW

Test on "BC-like" web text collected with frequent n-grams (Feldman et al. '09):
- Improve classifier with higher-order moments & more genres for training.
- Very little of the BC-like web data is actually classified as BC!
- Consider formality filtering instead.

Page 27:

Genre as Formality (Peterson et al. 2011)
- Features (see the sketch below):
  - Informal words, including interjections, misspellings, and words classified as informal, vulgar, or offensive by Wordnik
  - Punctuation: !, …, absence of sentence-final punctuation
  - Case: various measures of lower-casing
- Classifier: maximum entropy
- Results: 81% accuracy, 72% F-measure
  - Punctuation is the single most useful feature; informal words and case are lower on recall
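A minimal sketch of those surface cues in Python; the informal-word set is a tiny stand-in for the Wordnik-based lists, and the feature names are illustrative.

    import re

    INFORMAL = {"lol", "gonna", "wanna", "dude", "yeah", "omg"}   # placeholder for Wordnik lookups

    def formality_features(text):
        """Surface cues: informal words, punctuation, and casing."""
        tokens = re.findall(r"\w+", text)
        sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        return {
            "frac_informal": sum(t.lower() in INFORMAL for t in tokens) / max(len(tokens), 1),
            "frac_lowercase_sent_start": sum(s[0].islower() for s in sents) / max(len(sents), 1),
            "has_exclamation": "!" in text,
            "has_ellipsis": "..." in text or "…" in text,
            "missing_final_punct": not text.rstrip().endswith((".", "!", "?", "…")),
        }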

Page 28:

Overview
- Dimensions of register and genre
- Genre classification
  - Computational considerations
  - Examples
- Cues to social context: accommodation examples
- Impact on NLP: system engineering examples

Page 29:

Group Accommodation
- Language use & socialization (Nguyen & Rose, 2011)
- Data: online health forum (re breast cancer), Jan 2011 crawl, <8 year span, only long-term users (2+ yrs)
- Analysis variables: distribution change of high-frequency words
- Questions:
  - What are characteristic language features of the group?
  - How does language change for long-term participants?

Page 30:

Language Change for Long-Term Poster

Early post:
I am also new to the form, but not new to bc, diagnosed last yr, [..] My follow-up with surgeon for reports is not until 8/9 over a week later. My husband too is so wonderful, only married a yr in May, 1 month before bc diagnosed, I could not get through this if it weren’t for him, […] I wish everyone well. We will all survive.

2-4 years later:
Oh Kim- sorry you have so much going on – and an idiot DH on top of it all. [..] Steph- vent away – that sucks – [..] XOXOXOXXOXOXOX [..] quiet weekend kids went to DD’s & SIL o Friday evening, [..] mad an AM pop in as I am supposed to, SIL is an idiot but then you all know that

Page 31:

Short- vs. Long-time Members

Predicting long- vs. short-time users:
- 88 LIWC categories better than 1258 POS
- Best single feature type is unigrams+bigrams

Page 32:

K-L Divergence between…

Page 33:

Gender Accommodation/Differences

- Language use & gender pairs (Boulis & Ostendorf, 2005)
- Data: Switchboard telephone conversations
  - Mostly strangers
  - Prescribed topics, 5-min conversations
- Analysis variables: MM, MF, FM, FF
- Questions: Can you detect gender or gender pair? What words matter?
- Classification features: unigrams or unigrams+bigrams
- Feature selection: KL distance

Page 34:

Accommodation??

Detecting gender pair from one side of the conversation (unigrams):

         F-measure
  FF        .78
  FM        .07
  MF        .21
  MM        .64

Distinguishing same- vs. different-gender pairs (accuracy):

           unigrams   bigrams
  FF-MM      98.9       99.5
  FM-MF      69.2       78.9

People change styles more with matched genders… OR the matched gender is an affiliation group.

Page 35:

Gender-Indicative Language Use

Men:
- Swear words
- Wife
- Names of men
- Bass, dude
- Filled pauses (uh) – floor holding

Women:
- Family relation terms
- Husband, boyfriend
- Names of women
- Cute
- Laughter, backchannels (uh-huh) – acknowledging

Page 36:

Gender-dependent Language Models
- Big matched/mismatched differences in perplexity for pair-dependent LMs (FF vs. MM biggest difference)
- Significant F/M difference
- BUT, best results are from combining all data, since more data trumps gender differences

Page 37:

Overview
- Dimensions of register and genre
- Genre classification
  - Computational considerations
  - Examples
- Cues to social context: accommodation examples
- Impact on NLP: system engineering examples

Page 38:

Design Issues in HLT for New Genres

- Text normalization
  - 101 → one hundred and one (text to speech)
  - lol → laugh out loud (messaging, twitter, etc.)
- Lexicon differences
  - New words/symbols
  - Same words but different senses
- Feature engineering
- Model retraining or adaptation

Page 39:

Sentiment Detection on Twitter

- Text normalization
  - Abbreviations: gr8 → great, rotf → rolling on the floor (http://www.noslang.com)
  - Mapping targets and URLs to generic tokens (||T||, ||U||)
  - Spelling variants: coooool → coool
- Punctuation and other symbols
  - Emoticons → emoticon polarity dictionary
  - Emphasis punctuation: !!!!, ????, !*?#!!
- Only 30% of tokens are found in WordNet

(Agarwal et al., 2011)
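A minimal normalization sketch in Python; the slang map is a tiny stand-in for the noslang.com dictionary, and the exact patterns are illustrative rather than the paper's rules.

    import re

    SLANG = {"gr8": "great", "rotf": "rolling on the floor", "lol": "laugh out loud"}

    def normalize_tweet(text):
        text = re.sub(r"https?://\S+", "||U||", text)      # URLs -> generic token
        text = re.sub(r"@\w+", "||T||", text)               # mention targets -> generic token
        text = re.sub(r"(.)\1{2,}", r"\1\1\1", text)         # coooooool -> coool
        return " ".join(SLANG.get(t.lower(), t) for t in text.split())

    # normalize_tweet("@bob gr8 movie, coooooool!! lol http://t.co/xyz")
    # -> '||T|| great movie, coool!! laugh out loud ||U||'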

Page 40:

Sentiment in Twitter (Agarwal et al., cont.)
- Feature engineering:
  - Unigrams
  - Sentiment features (counts & polarity scores of pos/neg sentiment words from a dictionary; punctuation, capitalization)
  - Tree kernel
- Model: SVM
- Observations:
  - 100 senti-features have similar performance to 10k unigrams alone; tree kernel is better; combo is best
  - Pos/neg acc = 75.4%, pos/neg/neutral acc = 60.6%
  - Most important features are prior word polarity & POS

Page 41:

N-gram Language Modeling
- Conventional wisdom:
  - Mercer: There's no data like more data.
  - Banko & Brill: Getting more data has more impact than algorithm tuning.
  - Manning & Schütze: Having more training data is generally more useful than any concern of balance.
  - Since the 70's, the amount of data used in language model training has grown by an order of magnitude every decade.
- Problem: genre mismatch
  - Mismatched data can actually hurt performance (e.g. using newswire to train an air travel information system)
  - General web n-gram statistics ≠ general English (bias of advertising & pornography)

Page 42:

More Text/Transcript Examples

Meeting transcript:
A: okay. so there are certain cues that are very strong either lexical or topic-based um concept cues
B: from the discourse that – yeah.
A: for one of those. and then in that second row or whatever that row of time of day through that – so all of those – some of them come from the utterance and some of them are sort of either world knowledge or situational things. right? so that you have no distinction between those and okay
B: right. one uh – uh. um, anything else you want to say Bhaskara?
C: um
A: time of day
C: yeah i m- i mean –
B: one thing – uh –
D: yeah. they’re – they’re are a couple of more things. i mean uh. I would actually suggest we go through this one more time so we – we all uh agree on what – what the meaning of these things is at the moment and maybe what changes....

WSJ:
Fujitsu Ltd.'s top executive took the unusual step of publicly apologizing for his company's making bids of just one yen for several local government projects, while computer rival NEC Corp. made a written apology for indulging in the same practice.

Page 43:

Examples (cont.)

Lecture transcript:
right okay so the expectation operator is actually a functional which means it is a function of a function so we describe it as such here is e. which means expectation then we may or may not indicate the actual random variable over which we are taking the expectation then we have an open bracket we have the function of which we are taking the expectation and a closing bracket and this is in fact equal to the integral over all x minus infinity infinity of f. at x. times the probability of x. d. x. and this is actually the probability density function of x. okay so us there are two expectations that are far more important than all the rest the first one is ...

Page 44:

N-gram Language Modeling (cont.)

Standard approach to dealing with this: mixture modeling
- Train separate language models on each data source
- Learn weights of the different components from target data

P(w_t | w_{t-1}) = Σ_i λ_i(w_{t-1}) P_i(w_t | w_{t-1})
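A minimal sketch of this interpolation in Python, simplified to history-independent weights rather than the context-dependent λ_i(w_{t-1}) above; each component LM is assumed to expose a bigram probability method, and the weights are re-estimated on target-domain data with EM.

    def mixture_prob(word, prev, components, weights):
        """P(word | prev) as a weighted sum of component bigram models."""
        return sum(lam * lm.prob(word, prev) for lam, lm in zip(weights, components))

    def estimate_weights(heldout_bigrams, components, iters=20):
        """EM re-estimation of mixture weights from (word, prev) pairs of target data."""
        k = len(components)
        weights = [1.0 / k] * k
        for _ in range(iters):
            counts = [0.0] * k
            for word, prev in heldout_bigrams:
                posts = [lam * lm.prob(word, prev) for lam, lm in zip(weights, components)]
                z = sum(posts) or 1.0
                counts = [c + p / z for c, p in zip(counts, posts)]
            weights = [c / len(heldout_bigrams) for c in counts]
        return weights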

Page 45:

Class-dependent Mixture Weights

[Figure (Bulyko et al. '03): class-dependent mixture weights for a conversational telephone speech (CTS) language model, broken down by n-gram order (1gr, 2gr, 3gr) and word class (no class, noun, backchannel), over component data sources: ch_en, swbd-p2, swbd-cell, swbd, BN, web.]

- Weights for web data are higher for content words, lower for conversational speech phenomena
- Higher-order n-grams have higher weight on web data

Page 46:

N-gram Language Modeling (cont.)

- Some text sources hurt WER unless the weight is very low:
  - Newswire for telephone speech (Iyer & Ostendorf '99)
  - Newswire for lectures (Fuegen et al. '06)
  - General web data for talk shows (Marin et al. '09, even with weight = .001)
- Small, query-based topic language model outperforms large, static topic mixture (UW unpublished)
- Question: Can we get BETTER data from the web?
  - Genre-specific web queries
  - Genre filtering

Page 47:

N-gram Language Models (cont.)
(Bulyko et al., 2007)

Page 48:

Register/Genre Take-Aways

Our choice of wording depends on the social context: the event, the audience, and our relationship to them

Detecting different genres:
- Is useful for information retrieval
- Is fairly reliable with just word & POS features and standard classifiers

Genre variations reflect social phenomena, so genre cues are also useful for detecting social role, affiliation, etc.

Genre variations in language impact the design of human language technology in terms of text processing, feature engineering, and how we leverage different data sources.