
How CATS can help researchers collect and analyze a corpus of tweets


Page 1: How CATS can help researchers collect and analyze a corpus of tweets

Laboratoire ERIC, Université Lumière Lyon 2

How CATS can help researchers collect and analyze a corpus of tweets
Adrien Guille*, Ciprian-Octavian Truică** & Michael Gauthier***

* Université Lyon 2 (ERIC)
** University Politehnica of Bucharest
*** Université Lyon 2 (CRTT)

Lyon, 18 June 2015
Institut des Sciences de l'Homme
Big Data Mining and Visualization - Digital Humanities

Page 2: How CATS can help researchers collect and analyze a corpus of tweets

Why study Twitter?
• A social medium popular with the general public
• A source of textual data for researchers

Page 3: How CATS can help researchers collect and analyze a corpus of tweets

Why study Twitter?
• Textual data useful in a variety of fields
  • Language sciences
  • Political science
  • Medicine
  • etc.

Page 4: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: motivation
• Observations
  • Twitter data is inaccessible to non-programmers
  • Advanced text-analysis methods are unusable by non-specialists
• CATS: Collection and Analysis of Tweets made Simple
  • A tool that is simple to use (a web site)
  • A robust implementation (distributed data and computation, backed by MongoDB); see the storage sketch below
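Below is a minimal sketch (not the actual CATS code base) of how tweets can be stored and aggregated in MongoDB with pymongo; the "cats" database, the "tweets" collection and the document layout are illustrative assumptions.

```python
# Minimal sketch: storing tweets in MongoDB and running a simple aggregation
# with pymongo. Database/collection names and fields are assumptions.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
tweets = client["cats"]["tweets"]

# One document per tweet; MongoDB's flexible schema fits the JSON returned
# by the Twitter API.
tweets.insert_one({
    "id": 123456789,
    "author": "some_user",
    "text": "Example tweet about #CATS",
    "created_at": datetime(2015, 6, 18, 10, 0),
    "lang": "en",
})

# Full-text index so the corpus can later be filtered by keyword.
tweets.create_index([("text", "text")])

# Example aggregation, executed server-side: number of tweets per author.
per_author = tweets.aggregate([
    {"$group": {"_id": "$author", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
])
for row in per_author:
    print(row["_id"], row["count"])
```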


Page 6: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: collecting a corpus of tweets
• Targeting tweets by content, location or author (see the collection sketch below)
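As a rough illustration of the kind of collection CATS automates behind its web interface, here is a sketch against the Twitter Streaming API as it existed in 2015 (statuses/filter); the credentials, tracked keywords, bounding box and user ids are placeholders, and CATS's own collector may work differently.

```python
# Sketch of targeted tweet collection with the 2015-era Twitter Streaming API.
# All credentials and targeting parameters below are placeholders.
import json

import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

params = {
    "track": "#digitalhumanities,text mining",   # target by content
    "locations": "-5.8,49.9,1.8,58.7",           # or by area (rough bounding box of Great Britain)
    # "follow": "1234,5678",                     # or by author (user ids)
}

response = requests.post(
    "https://stream.twitter.com/1.1/statuses/filter.json",
    auth=auth,
    data=params,
    stream=True,
)

# The endpoint keeps the connection open and sends one JSON tweet per line.
for line in response.iter_lines():
    if line:
        tweet = json.loads(line)
        print(tweet.get("user", {}).get("screen_name"), tweet.get("text"))
```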

Page 7: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: analyzing a corpus of tweets
• Filtering the corpus by date, keyword, gender and age (see the query sketch below)
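A corpus filter of this kind maps naturally onto a single MongoDB query; the sketch below is illustrative only, and the field names (created_at, user_gender, user_age) are assumptions rather than the fields CATS actually stores.

```python
# Sketch of corpus filtering as one MongoDB query; field names are assumptions.
from datetime import datetime

from pymongo import MongoClient

tweets = MongoClient()["cats"]["tweets"]

query = {
    "created_at": {"$gte": datetime(2015, 4, 1), "$lte": datetime(2015, 5, 31)},
    "$text": {"$search": "debate"},          # keyword filter (requires a text index)
    "user_gender": "female",
    "user_age": {"$gte": 19, "$lte": 30},
}

for tweet in tweets.find(query).limit(10):
    print(tweet["created_at"], tweet["text"])
```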


Page 9: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: analyzing a corpus of tweets
• Exploring the corpus: vocabulary (see the sketch below)
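Vocabulary exploration boils down to counting token frequencies over the corpus; a minimal sketch using NLTK (cited in the bibliography) follows. The two-tweet corpus and the stop-word filter are deliberately simplistic placeholders.

```python
# Sketch of vocabulary exploration: token frequencies over the corpus, using
# NLTK's TweetTokenizer. Requires the NLTK stop-word list
# (nltk.download("stopwords")); the corpus below is a placeholder.
from collections import Counter

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False)
stop_words = set(stopwords.words("english"))

corpus = [
    "Watching the #bbcdebate tonight, what a mess",
    "That debate was bloody brilliant #bbcdebate",
]

counts = Counter(
    token
    for tweet in corpus
    for token in tokenizer.tokenize(tweet)
    if token.isalpha() and token not in stop_words
)
print(counts.most_common(10))
```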


Page 11: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: analyzing a corpus of tweets
• Exploring the corpus: tweets


Page 13: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: analyzing a corpus of tweets
• Mining the corpus: recognizing named entities (Finkel, Grenager & Manning 05); see the sketch below
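A minimal sketch of named-entity recognition with the Stanford NER model of Finkel, Grenager & Manning (2005), driven through NLTK; the jar and model paths are placeholders pointing to a local Stanford NER download, and CATS's own pipeline may differ.

```python
# Sketch of named-entity recognition with the Stanford NER model (a CRF
# classifier) through NLTK. Paths are placeholders; word_tokenize also
# requires the NLTK "punkt" tokenizer data.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # PERSON / LOCATION / ORGANIZATION model
    "stanford-ner.jar",
)

tweet = "Watching the BBC debate in London with Ed Miliband tonight"
print(tagger.tag(word_tokenize(tweet)))
# Expected output along the lines of:
# [('Watching', 'O'), ('the', 'O'), ('BBC', 'ORGANIZATION'), ('debate', 'O'),
#  ('in', 'O'), ('London', 'LOCATION'), ('with', 'O'), ('Ed', 'PERSON'),
#  ('Miliband', 'PERSON'), ('tonight', 'O')]
```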


Page 15: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: analyzing a corpus of tweets
• Mining the corpus: modeling latent topics (Blei, Ng & Jordan 03); see the sketch below
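A minimal sketch of latent topic modeling with LDA (Blei, Ng & Jordan 2003) using gensim (Řehůřek & Sojka 2010, cited in the bibliography); the four-tweet toy corpus and the number of topics are for illustration only.

```python
# Sketch of LDA topic modeling with gensim on a toy tokenized corpus.
from gensim import corpora, models

tokenized_tweets = [
    ["general", "election", "debate", "vote"],
    ["football", "match", "goal", "league"],
    ["debate", "leaders", "vote", "bbc"],
    ["goal", "league", "cup", "match"],
]

# Map tokens to ids, then represent each tweet as a bag of words.
dictionary = corpora.Dictionary(tokenized_tweets)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_tweets]

# Fit two latent topics and print their top words.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, top_words in lda.print_topics():
    print(topic_id, top_words)
```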


Page 17: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: analyzing a corpus of tweets
• Mining the corpus: detecting and tracking events (Guille & Favre 14, 15); see the sketch below
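As a toy illustration only: the core intuition of MABED is to flag time slices where a word is mentioned far more than expected given the overall tweet volume. The sketch below captures that anomaly measure in a simplified form; it is not the published algorithm, which additionally selects the time interval maximizing the anomaly and picks the related words that describe each event.

```python
# Toy illustration of the mention-anomaly idea behind MABED
# (Guille & Favre 2014); not the published algorithm.
def mention_anomaly(mentions_per_slice, tweets_per_slice):
    """Observed minus expected mentions for each time slice.

    The expectation assumes mentions are spread across slices in
    proportion to the overall tweet volume.
    """
    total_mentions = sum(mentions_per_slice)
    total_tweets = sum(tweets_per_slice)
    return [
        observed - tweets * total_mentions / total_tweets
        for observed, tweets in zip(mentions_per_slice, tweets_per_slice)
    ]

# Hypothetical counts for a hashtag over six 30-minute slices.
mentions = [2, 3, 40, 90, 60, 5]
volume = [1000, 1100, 1200, 1300, 1200, 1000]
print(mention_anomaly(mentions, volume))
# Large positive values flag the slices where an event bursts.
```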


Page 19: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: a use case
• Sociolinguistics: studying swearing (Gauthier, Guille, Rico & Deseille 15)
  • Goal: analyze how men and women use swear words
  • Particular focus on young women in Great Britain
• Building a corpus of tweets with CATS
  • Tweets collected by location
• Analyzing the collected corpus with CATS
  • Filtering the corpus of tweets by gender and age

TEXT MINING AND TWITTER TO ANALYZE BRITISH SWEARING HABITS — p. 5

Table 2. Basic corpus properties.

              Male      Female    Total
# of users    10313     7747      18060
# of tweets   579864    381322    961186

Figure 1. Distribution of the number of tweets per gender and age (number of tweets per age group [5;11], [12;18], [19;30], [31;45], [46;60], [61;99], for men and women).

showing that the interference of potential spam accounts producing a great number of tweets in a limited amount of time is very limited.

Proportion of swearing tweets among women and men

In our corpus, 5.8% of the male tweets contained at least one swear word, compared to 4.8% for women. Figures 3 and 4 present the proportion of tweets containing the eleven most common swear words for women and men. However, as percentages of this kind do not provide much information about the specific use of each word, we decided to normalize the frequency of each swear word on one million words for both women and men. The results are presented below in Table 3.

Figure 2. Distribution of the number of tweets per user, on a log-log scale.

Figure 3. Most common swear words found in swearing tweets published by male users: fuck, shit, hell, cunt, piss, tit, bloody, dick, bitch, damn, bastard.

Figure 4. Most common swear words found in vulgar tweets published by female users: fuck, shit, hell, bitch, piss, bloody, damn, dick, tit, crap, cunt.

Proportion of vulgar tweets by gender per million words

Table 3 presents the proportions of use of all the swear words we took into account for both genders. As we mentioned before, there is an imbalance in the number of male and female users, as well as in the number of tweets for each gender. Thus, raw percentages would have been useless in that they are not comparable in such situations. To be able to efficiently compare the use of swear words by women and men, we calculated the number of instances of each swear word there would be in one million words. This then gives us an objective value on which to base our analyses. To those values, we added a log-likelihood (LL) test in order to check whether the word is more statistically likely to be used by males or females according to our data. In other words, the more gendered a word is, the higher the LL score will be. In Table 3, the three most statistically significant words for women and men are highlighted (orange for women and grey for men). These words are, in descending order of significance, bitch, bloody and hell for women, and cunt, tit and fuck for men. It would seem that some of the findings of McEnery (2006) are verified here, as in his study of the use of profanity of women and men on MySpace, he found that fucking, fuck, jesus, cunt and fucker were more typical of
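The per-million-words normalization and the log-likelihood (LL) keyness test described in this excerpt can be reproduced in a few lines. The sketch below assumes the standard two-corpus log-likelihood formula from corpus linguistics, and the counts in the example are hypothetical, not the paper's figures.

```python
# Sketch of frequency normalization per million words and a two-corpus
# log-likelihood (keyness) score. Example counts are hypothetical.
import math

def per_million(count: int, corpus_size: int) -> float:
    """Frequency of a word normalized to one million words."""
    return count / corpus_size * 1_000_000

def log_likelihood(count_a: int, size_a: int, count_b: int, size_b: int) -> float:
    """Two-corpus log-likelihood score: LL = 2 * sum(O * ln(O / E)),
    with expected counts distributed proportionally to corpus sizes."""
    expected_a = size_a * (count_a + count_b) / (size_a + size_b)
    expected_b = size_b * (count_a + count_b) / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Hypothetical: a word occurring 900 times in 5,000,000 words of female tweets
# and 600 times in 7,500,000 words of male tweets.
print(per_million(900, 5_000_000))               # 180 occurrences per million words
print(log_likelihood(900, 5_000_000, 600, 7_500_000))
```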


Page 20: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: a use case
• Sociolinguistics: studying swearing (Gauthier, Guille, Rico & Deseille 15)
  • Goal: analyze how men and women use swear words
  • Particular focus on young women in Great Britain
• Analyzing the collected corpus with CATS
  • Vocabulary used in vulgar tweets


Page 21: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: a use case
• Sociolinguistics: studying swearing (Gauthier, Guille, Rico & Deseille 15)
  • Goal: analyze how men and women use swear words
  • Particular focus on young women in Great Britain
• Analyzing the collected corpus with CATS
  • Distribution of named-entity types within vulgar tweets, by gender and age

TEXT MINING AND TWITTER TO ANALYZE BRITISH SWEARING HABITS — p. 7

Figure 6. Cursing ratio versus time of day in C^m_{12-18} (vulgarity ratio from 00:00 to 20:00, women vs. men).

rules based on features of the word sequences that constitute tweets. Table 4 reveals that on average, men use named entities more than women. Also, for both women and men, users tend to mention named entities consistently more as they get older. Figures 7 and 8 present detailed proportions of named entities per gender and age group in swearing tweets. This shows that whatever their age, both women and men predominantly mention named entities referring to people when swearing. However, what differs is the fact that women from every age group seem to favor locations more than men, who prefer mentioning organizations. This method then highlights the fact that as far as swearing is concerned, context plays a big role, and that the pragmatic functions of swear words for women and men of the same age groups may differ. We suggest that these differences may point at gendered differences in the topics women and men focus on, at least when they swear.

Table 4. Proportion of tweets that contain named entities.

          [12;18]   [19;30]   [31;45]   Average
Women     10.28%    13.41%    14.60%    12.76%
Men       14.25%    19.67%    20.59%    18.18%

Event detection

In order to analyze the impact that real-world events may have on Twitter discussions, we studied specific reactions on Twitter triggered by the most influential of these real-world events (e.g. the broadcast of a popular TV show, a political event, etc.) for users. We use Mention-Anomaly-Based Event Detection (MABED), a statistical method proposed by Guille & Favre (2015, 2014) for the detection of significant events from tweets. Thanks to this method, we are able to map both macro and micro levels of gendered reactions, as it describes each event it detects with a set of words, a time interval and a score that reflects the magnitude of impact of the event over users. Moreover, it is possible to analyze the tweets associated with these events, to understand their underlying composition, and the way swear words are used in them in our case, for example. Table 5 and Table 6 present as an example the ten most significant events detected by MABED for women and men aged 19-30. These events are numbered from 1 to 10 in decreasing order of significance. The 'event' column presents the keywords recognized as the most representative of the event, and the last column presents the percentage of tweets containing at least one swear word inside each event. Events marked as "spam" were considered as such because one single user posted the same spam tweet very often, thus virtually generating keywords considered by MABED as relevant events. These spammers, though being a minority in our corpus as shown in Figure 1, create a considerable amount of noise for our event detection method and prevent a more accurate analysis of gendered events. It is however interesting to note that in this sample, spammers are twice as present in the female corpus as in the male one, thus suggesting that spam accounts are more likely to adopt a female name.

Figure 7. Distribution of the types of named entities (Person, Location, Organization) in vulgar tweets published by men, by age group (12-18, 19-30, 31-45).

Figure 8. Distribution of the types of named entities (Person, Location, Organization) in vulgar tweets published by women, by age group (12-18, 19-30, 31-45).


Page 22: How CATS can help researchers collect and analyze a corpus of tweets

The CATS project: a use case
• Sociolinguistics: studying swearing (Gauthier, Guille, Rico & Deseille 15)
  • Goal: analyze how men and women use swear words
  • Particular focus on young women in Great Britain
• Analyzing the collected corpus with CATS
  • Reactions to events by gender

MICHAEL GAUTHIER, ADRIEN GUILLE, FABIEN RICO, ANTHONY DESEILLE — p. 8

Figure 9. Evolution of the number of tweets containing #bbcdebate (tweets per 30 minutes, women vs. men, from 04-16 19:00 to 04-17 01:00).

Apart from spam, what the results from MABED reveal is that, generally speaking, male events could be summarized by sport (boxing and soccer) and politics, and female events by media/entertainment (birth of the Royal baby, BGT (Britain's Got Talent)), sport (Grand National) and politics. Generally speaking, we observe that throughout those ten events, men use more swear words than women, with the least vulgar event among men still being more vulgar than the most vulgar event among women. The most and least vulgar events are highlighted for each gender, and this reveals that both women and men are more vulgar when talking about politics. What is interesting to notice is that on average, the proportion of swearing tweets in reaction to an event is higher (11.3% for men and 6.53% for women) than in ordinary interactions on Twitter (5.8% for men and 4.8% for women). It is also worth mentioning that the only common topic between women and men is the one containing the hashtag #bbcdebate. Figure 9 plots the evolution of the number of gendered tweets containing this hashtag. Though men tweet consistently more about that hashtag than women, we observe that the two patterns are very similar, and that both events are detected roughly when the broadcast of the debate starts on television and gradually decrease after the broadcast is over, as it triggers fewer and fewer reactions. The proportion of vulgar tweets inside this common event does not differ much between men and women, which may again suggest that gendered differences in swearing are not triggered by gender alone, but by the context in which swearing occurs. In other words, women and men in the exact same context would not differ much in the linguistic attitudes they display. This would imply that swear words are not so much gendered as contextualized, which would correspond to other studies pointing to the fact that when considering gendered speech patterns, the context of use plays a greater role than gender alone (Eckert, 2008; Bamman et al., 2014; Baker, 2014; Holmes, 1995; Ladegaard, 2004). In our case, further qualitative research would however be needed to confirm or refute that hypothesis.

Limitations

This study presents certain limits. The first one concerns the way we categorized users according to their age. Though it has some advantages, it is not perfect, as some users will have children before age 30, or will leave school before age 18, so the linguistic patterns potentially influenced by those social phenomena may differ. Another potential problem is that we did not include hashtags in our swear word detection methods, and hashtags often contain curse words, thus potentially limiting our data in this regard. A manual verification of the information provided in the description of a lot of users in our sample reveals that many people mention that they are students. Even if it sounds normal, as the most represented age group is the 19-30 one, there may exist a bias towards this category of users.

Conclusion

In this article, we tried to give hints about new methods which could be used to analyze specific sociolinguistic parameters on Twitter. For that purpose, we analyzed the data of a corpus of about one million tweets from users for whom we could infer both the age and the gender. Even if our data would need to be analyzed more thoroughly in order to draw more generalizable conclusions, our goal here was to show that by combining techniques from both computer science and linguistics, it is possible to provide innovative ways of studying the way women and men swear on Twitter. These tools showed that beyond mere quantitative data, which could lead to erroneous impressions and generalizations on the reasons why women and men swear, contextual parameters are sometimes more important in determining what is influential, as we concluded with the event detection and NER analyses. This work is then the continuation of prior studies which showed that gender is often enacted in subtle ways, hence the necessity to develop more tools to explore these questions. We believe that some of the tools presented here can be used and improved in future research based on Twitter data, so that the analyses presented in this paper can be refined.

References

Baker, P. (2014). Using corpora to analyze gender. Bloomsbury Publishing.

Bamman, D., Eisenstein, J., & Schnoebelen, T. (2014). Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2), 135-160.

Baruch, Y., & Jenkins, S. (2007). Swearing at work and permissive leadership culture: When anti-social becomes social and in-

Page 23: How CATS can help researchers collect and analyze a corpus of tweets

Thank you for your attention

• Bibliography
  • S. Bird. NLTK: the natural language toolkit. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2006
  • D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3, 2003
  • J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2005
  • M. Gauthier, A. Guille, F. Rico and A. Deseille. Text Mining and Twitter to Analyze British Swearing Habits. Proceedings of the International Conference on using Twitter for Academic Research, 2015
  • A. Guille and C. Favre. Mention-anomaly-based Event Detection. Proceedings of the IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM), 2014
  • A. Guille and C. Favre. Event detection, tracking and visualization in Twitter: a mention-anomaly-based approach. Springer Social Network Analysis and Mining, vol. 5, iss. 1, 2015
  • R. Řehůřek and P. Sojka. Software framework for topic modeling with large corpora. Proceedings of the Workshop on New Challenges for NLP Frameworks (LREC), 2010

• Links
  • CATS: http://mediamining.univ-lyon2.fr/cats (or search for «CATS lyon2» on Google)
  • MongoDB: http://www.mongodb.org
  • Twitter: http://twitter.com