Choosing the Right Words: Characterizing and Reducing Error of the Word Count Approach H. Andrew Schwartz, Johannes Eichstaedt, Lukasz Dziurzynski, Eduardo Blanco,* Margaret L. Kern, Stephanie Ramones, Martin Seligman, and Lyle Ungar University of Pennsylvania *Lymba Corporation wwbp.org

Choosing the Right Words - unitn.itWord Count Approach | wwbp.org Discourse from people Lexicon Count Words (Relative Frequency) expression of psychological stateclic.cimec.unitn.it/starsem2013-program/33_Presentation.pdf ·

Word Count Approach

Discourse from people → Lexicon → Count Words (Relative Frequency) → expression of psychological state

Example: Discourse from people → Happy words (friend, happy, play, good, like, love, ...) → Count Words (Relative Frequency) → Happiness of people (over time / space)
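The pipeline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon is a toy set built from the example words on the slide, and the tokenizer and the function name `relative_frequency` are hypothetical choices, not the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical toy lexicon built from the example words on the slide.
HAPPY_WORDS = {"friend", "happy", "play", "good", "like", "love"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(text, lexicon):
    """Return the fraction of tokens in `text` that appear in `lexicon`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in lexicon)
    return hits / len(tokens)

message = "Time to go play with Chalk from the Easter Bunny!"
score = relative_frequency(message, HAPPY_WORDS)
print(f"{score:.2f}")  # one lexicon hit ("play") out of ten tokens
```

Averaging such scores over the messages written in a given time window or region would give the "over time / space" view named on the slide.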

Some problems?

● “so everyone should come to the play tomorrow...” (part of speech)

● “Does anyone what type of file I need to convert youtube videos to play on PS3???” (word sense)

● “Time to go play with Chalk from the Easter Bunny!” (OK)

● “...all work no play :-(” (negation)

● “I sure wish I had about 50 hours a day to play cod” (desire)

Why?

• Simple to implement

• Scalable

• Not a black box (somewhat interpretable)

It is used extensively in social science, including some high-impact publications (Golder and Macy, 2011, Science; Dodds et al., 2011, PLoS ONE; Kramer, 2010).

Outline

● What is the word count approach?

● Background

● Characterizing Error

● Refining Lexica to Reduce Error

● Conclusion

Background

Development of Word Count Approach

Social scientists most often use: Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2007)

● 4500 words across ~64 categories

● Originally used mostly for analyzing long-form text

● e.g. how many emotion words in an essay

● Recently, increased use on short-form text as a psychological state measurement tool.

Why recent interest?

Inexpensive

Temporal and Spatial Resolution

Unobtrusive measurement

Evaluating LIWC over social media

• Bantum and Owen, 2009

• evaluated emotion lexicons over a Web-based breast cancer support group

• Sensitivity: 0.88

• Predictive validity: 0.31

Use of lexica in computational linguistics

• Lexicon expansion from seed words (Hatzivassiloglou and McKeown, 1997; Kamps and Marx, 2002; Kim and Hovy, 2004; Kanayama and Nasukawa, 2006; Baccianella et al., 2010)

• Supervised learning of lexica for sentiment or subjectivity analysis (Pang et al., 2002; Wiebe & Cardie, 2005)

Distinguishing our method:

• Improve human-created lexicons

• Refine rather than expand

• Explore utility of a lexical ambiguity metric

Characterizing Word Count Errors

Corpus: 1000 instances of LIWC terms occurring in Facebook status updates

Judged whether terms contribute intended signal (e.g. positive emotion) to message.

For a sample of erroneous instances, labeled the type of signal error.

Annotation Process

● Stuck with POSEMO and NEGEMO in LIWC

● Well vetted and developed over 2 decades (Pennebaker et al., 2007)

● Instruction (tolerant criteria):

● “Does the term contribute to the <associated psychological state (POSEMO or NEGEMO)> within the sentence it appears in?” In other words:

● “Would the sentence convey less … without this term?”

● 3 judges per instance

● Used majority vote of yes/no answers
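The aggregation step can be sketched as follows. The instance identifiers and votes are made up for illustration, and `majority_vote` is a hypothetical helper rather than the authors' code; the only facts taken from the slide are three judges per instance and yes/no answers.

```python
from collections import Counter

def majority_vote(judgments):
    """Return the answer given by more than half of the judges.

    `judgments` is a list of "yes"/"no" strings; an odd number of judges
    (three on the slide) guarantees there is no tie.
    """
    if len(judgments) % 2 == 0:
        raise ValueError("use an odd number of judges to avoid ties")
    answer, _ = Counter(judgments).most_common(1)[0]
    return answer

# Hypothetical annotations: instance id -> answers from three judges.
annotations = {
    "good_day": ["yes", "yes", "yes"],
    "his_number": ["no", "no", "yes"],
}
labels = {key: majority_vote(votes) for key, votes in annotations.items()}

# Share of instances where the lexicon term was judged to carry its signal.
accuracy = sum(label == "yes" for label in labels.values()) / len(labels)
print(labels, accuracy)
```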

Examples

Has had a very good day (POSEMO)

is so very bored. (NEGEMO)

damn. That octopus is good, lol (NEGEMO)

thank you for his number. (NEGEMO: “numb*”)

don't be afraid to fail (NEGEMO)

I wish I could … and we could all just be happy (POSEMO)
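The “numb*” example shows how a wildcard entry can fire on an unrelated word. A minimal sketch, assuming a trailing asterisk means "any token starting with this prefix" (which is what the match on "number" suggests); the function name `matches` is hypothetical:

```python
def matches(entry, token):
    """Return True if `token` matches a lexicon entry.

    An entry ending in "*" matches any token that starts with the text
    before the asterisk; any other entry must match the token exactly.
    """
    token = token.lower()
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry

print(matches("numb*", "number"))  # True: the unintended match from the slide
print(matches("numb*", "numb"))    # True: the intended match
print(matches("good", "goodbye"))  # False: no wildcard, so exact match only
```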

Results: Agreement

Results: Accuracy

tolerant criteria

Analysis of Errors

over 100 randomly selected erroneous instances

Refining Lexica

The Idea

Remove words likely to carry erroneous signal (e.g. “play”, “number”, etc.)

… with a self-imposed impairment:

● No training data of which posts are indicative of the outcome (positive or negative emotion).

Why this limitation?

– apply to any lexica

– scalable for social media

– within bounds of accepted approach

Focus on lexical ambiguity

explains over 50% of errors

● Part of Speech

– Google N-grams 2.0 (Lin et al., 2010)

● Word Sense

– SemCor (Miller et al., 1993)

● probability that a given instance of a word is:

– the most frequent part of speech, and

– the most frequent sense of that part of speech.
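One way to read this metric is as a joint probability, which by the chain rule is P(most frequent part of speech) × P(most frequent sense | that part of speech). The sketch below follows that reading. All counts are invented for illustration, the function names are hypothetical, and removing words that score below the threshold is an assumption; the default of 0.5 mirrors the "theta = 0.5" setting named later in the deck.

```python
def dominant_reading_probability(pos_counts, sense_counts):
    """Estimate P(most frequent POS) * P(most frequent sense | that POS).

    pos_counts:   {pos: count} for one word, e.g. from a tagged n-gram corpus.
    sense_counts: {pos: {sense: count}} for the same word, e.g. from a
                  sense-annotated corpus.
    """
    top_pos = max(pos_counts, key=pos_counts.get)
    p_pos = pos_counts[top_pos] / sum(pos_counts.values())
    senses = sense_counts[top_pos]
    p_sense = max(senses.values()) / sum(senses.values())
    return p_pos * p_sense

def refine_lexicon(lexicon, stats, theta=0.5):
    """Keep only words whose dominant reading has probability >= theta."""
    return {
        word for word in lexicon
        if dominant_reading_probability(*stats[word]) >= theta
    }

# Made-up counts for illustration only; not taken from any real corpus.
stats = {
    "play": ({"verb": 60, "noun": 40},
             {"verb": {"engage_in_game": 30, "perform": 20, "run_media": 10}}),
    "happy": ({"adj": 100},
              {"adj": {"joyful": 90, "fortunate": 10}}),
}
print(refine_lexicon({"play", "happy"}, stats))
```

With these invented counts, "play" scores 0.6 × 0.5 = 0.3 and is dropped, while "happy" scores 1.0 × 0.9 = 0.9 and is kept. No labeled posts are needed, which matches the self-imposed limitation stated earlier.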

Evaluation

● 1000 instances of LIWC POSEMO and NEGEMO terms judged for the error analysis.

Filtered (theta = 0.5)

Conclusions

word count approach has problems

… mostly due to lexical ambiguity

word count with refined lexica => fewer errors

… simple and scalable ... within bounds of social science's accepted approach

Future Work:

● refinements based on other criteria

● supervised and more sophisticated approaches to measuring psychological state

See what else we've been up to: wwbp.org