
Abstract - IJSkt.ijs.si/markodebeljak/Lectures/Seminar_MPS/2012... · Abstract With advances in natural language processing, ... contains more nouns, adjectives, prepositions and


Abstract

With advances in natural language processing, author profiling has become a task of growing importance. Author profiling applies the study of language use to determine demographic factors or personality traits of text authors. One of the most common target variables in author profiling tasks is gender. This paper is concerned with the features used to discriminate between male and female authors. Because social media have become extremely popular and widespread in recent years, the paper focuses on gender identification tasks carried out on user-generated content, such as tweets, blog entries, chats and the like. Additionally, we try to determine whether qualitative approaches to the subject of gender and language, namely critical discourse analysis, can provide useful insight into the language use of men and women. Corpus analyses that take into account the theoretical argumentation of the subject are also discussed. Finally, the paper suggests possible improvements for gender identification and proposes features that are based on qualitative studies but can contribute to author profiling tasks.

Keywords: user-generated content, gender identification, corpus analysis, critical discourse analysis


Table of contents

Abbreviations
1 Introduction
2 Gender identification in user-generated content
  2.1 Features and methods
  2.2 Related studies for Slovene
  2.3 Goals and application
3 Gender and language in CDA and corpus linguistics
  3.1 Related work in critical discourse analysis
  3.2 Gender in corpus analyses
4 Critical judgement of the existing research in the field
5 Suggestions for upgrading the existing conditions
References


Abbreviations

BOW Bag-of-Words

CDA Critical Discourse Analysis

LDA Latent Dirichlet Allocation

MD Manhattan Distance

ML Machine Learning

NLP Natural Language Processing

POS Part-of-Speech

SVM Support Vector Machines

UGC User-Generated Content


1 Introduction

Data analysis is often performed in connection with other research disciplines, as its general aim is to discover useful information in data and to transform large quantities of data into information that can be used in various applications. The study of natural languages, for example, can combine text mining, linguistics and social sciences. With the rise of social media, the focus of many studies on natural languages has shifted to so-called user-generated content (UGC), which is defined as “any form of content such as blogs, wikis, discussion forums, posts, chats, tweets, podcasts, digital images, video, audio files, advertisements and other forms of media that was created by users of an online system or service, often made available via social media website” (Chua et al., 2014). The data from UGC is especially interesting because of its vast quantity, availability and real-time publishing. In contrast to other media, UGC tends to be less formal and more personal (Crystal, 2001).

One of the questions posed by text analytics is what textual data tells us about the author(s) of a text and how information about the author can be extracted from the text. This task is addressed in the field of author profiling¹. Among the author attributes that have received the longest attention from the data and text mining community are demographic characteristics, gender among them. The issue of gender and language or discourse originates from curiosity about the differences and similarities between the language use of the sexes and from the search for their underlying reasons. It has been addressed in various research fields concerned with language: from statistically oriented text mining to applied linguistics (corpus linguistics and pragmatics) and critical discourse analysis (CDA).

This paper is concerned with the sociolinguistic interpretability of features used for gender identification² tasks. These tasks are valuable for applications in user profiling, marketing and text forensics, so their main objective is to construct a model that performs successfully on unseen text documents. Little or no attention has been paid to why certain features discriminate better between the genders. However, some studies of gender identification in UGC (Rao et al., 2010; Sarawgi et al., 2011) have applied features based on linguistic findings on gender and discourse. These findings take a more theoretically founded and qualitative approach. According to Kendall and Tannen (2001), two perspectives dominate CDA research on gender and discourse. The first perspective views the issue in terms of power relations and is concerned with the way social power and inequality between different groups are enacted, reproduced and resisted by language use in talk and text (van Dijk, 2001). The other approach perceives the male and female genders as two cultures, and the communication between them as cross-cultural (Tannen, 1990). Corpus linguistics, on the other hand, uses corpus methodology to retrieve information when studying the use of particular words (Schmid, 2003) or topic distribution based on keyword extraction (Baker, 2014). When gathered and compared, the features from gender identification tasks on the one hand and the characteristics of gendered discourse from linguistics on the other may offer an insight

1 It is important to distinguish between author profiling and authorship attribution. In author profiling, the goal is to determine the profile of text authors based on their texts (factors such as gender, age, political orientation, ethnicity, native language, personality type etc.). In contrast, authorship attribution deals with a closed group of authors, and the goal is to identify which of these authors produced a given anonymous text (Rangel et al., 2013).

2 Besides gender identification, the terms gender prediction and gender attribution also occur with the same meaning.


into the language of men and women, which can deepen our understanding of the underlying social structure of communication and identity. The rest of the paper is organized as follows. In Section 2, we give a summary of the research in the field of gender identification in tasks concerning UGC. In Section 3, the studies on gender and discourse in CDA and corpus linguistics are described. The related works are critically evaluated in Section 4. In Section 5, we finally conclude the paper and provide suggestions for future work.

2 Gender identification in user-generated content

In the data and text mining community, the gender identification task has been addressed for various text genres, e.g. scientific papers (Vogel and Jurafsky, 2012), literature (Argamon et al., 2009), emails (Prabhakaran et al., 2014) and social media/UGC. In this section, we provide an overview of studies focusing on user-generated content. First, the features and methods used for gender identification are discussed. Next, we present related work on Slovene texts. We conclude the section by presenting the practical applications of the related work.

2.1 Features and methods

The input data in text mining are more or less structured natural language documents. Because most machine learning and data mining algorithms are not designed to work on textual data directly, feature extraction or construction, which yields a feature-based representation, is an important step in text mining (Brank et al., 2010). Given the structure of the constructed features, we distinguish between four feature construction types: word-based, character-based, kernel function and other (ibid.). In the setting of UGC, material other than documents may be used for feature construction, e.g. user metadata such as usernames, full names and descriptions (Burger et al., 2011). In the context of gender identification, the features that benefit the model can be viewed through a social lens: they represent the discriminating characteristics of male and female language and, provided that the output allows a linguistic and social interpretation, can be applied to studies of gender and communication in general.

Mukherjee and Liu (2010) improved the existing methods for gender classification of blog authors and proposed a new feature selection algorithm which uses an ensemble of feature selection criteria. They used a combination of four existing features and one newly constructed feature. The F-measure³ feature introduces a distinction between the contextuality and formality of a document and is based on the frequency of POS usage. A lower F-measure score indicates contextuality (implicitness) of the text, marked by a greater use of pronouns, verbs, adverbs and interjections. In contrast, a higher F-measure value means that the text is more explicit (indicating formality), as it contains more nouns, adjectives, prepositions and articles. The second feature of Mukherjee and Liu (2010) is what they call “stylistic features”, which are actually words and blog-specific words, i.e. abbreviations (lol), forms similar to spoken discourse (hmmm) or typical of UGC (emoticons). Their next feature is a list of word endings that indicate the use of emotionally intensive adverbs and adjectives, which are often ascribed to female language. They also used a word list of twenty factors from Argamon et al. (2007);

3 Note that F-measure here is not the F-score or F-measure used for performance evaluation.


factors are groups of related words that tend to occur in similar documents and are mostly topic-related (Family, Work, Location, Poetic, Swearing, Romance etc.). Mukherjee and Liu (2010) added three new “word classes” (mostly adverbs and adjectives) according to their connotation, which is either emotional (careful, puzzled), positive (cool, wow) or negative (stupid, hopeless). Their newly constructed features are POS sequence patterns, which differ from ordinary POS tags and n-grams in that the sequences are not of fixed length (maximum 7 consecutive tags) and must satisfy the minimum support (30%) and adherence (20%) constraints. All the mined POS sequence patterns were used as features. They applied their feature selection algorithm to reduce the vector space and carried out experiments using different classification (Naïve Bayes, SVM) and regression (SVM regression) methods with different feature weighting schemes (Boolean, TF). They determined that the highest accuracy (88.56%) was achieved when only the POS sequence feature was used together with their feature selection algorithm.

The use of metadata about the Twitter user was reported to benefit the gender identification model by Burger et al. (2011), who, interestingly, compiled a corpus of tweets in various languages and carried out no language-specific processing. Each Twitter user has a unique username; users can also provide their full name and a short description which sometimes indicates their gender (Father, Wife). Burger et al. (2011) compared the performance in settings with tweet text only, metadata only and a combination of both. They also experimented with different classifier types (SVM, Naïve Bayes, and Balanced Winnow2) and found that the model performed best with Balanced Winnow2: with tweet text features only it reached an accuracy of 75.5%, whereas it achieved the best accuracy (92.0%) when the metadata features and the tweet texts (with a Boolean indicator) were combined.

Rao et al. (2010) carried out an interesting comparison of three SVM models for gender prediction of Twitter users. Based on sociolinguistic research on spoken discourse (Macaulay, 2005), they constructed a list of what they call sociolinguistic features (in fact more pragmatic ones), as the prosodic cues of spoken discourse are absent from Twitter. The list included smileys, ellipses, possessive bigrams (my_XX, our_XX), references to self, markers of agreement, affection, excitement etc. They also built a unigram and bigram feature model. The third model was stacked: its features were the predictions of the n-gram and sociolinguistic models along with their prediction weights. The sociolinguistic model performed better (71.76%) than the n-gram model (68.70%), while the stacked model performed best (72.22%). They also determined that emoticons, ellipses, character repetition, repeated exclamation, puzzled punctuation and the abbreviation OMG were more typical of tweets by women, and that the sociolinguistic model improved when bigrams with the possessive pronoun my from the bigram model were added to it.

The performances of statistical approaches to gender identification reported above were achieved by models trained and tested on the same UGC genre, either blog entries or tweets. Some studies (Sarawgi et al., 2011; Rangel et al., 2013) indicate a genre bias and show that classifiers trained on one UGC genre and tested on another perform worse than classifiers trained and tested on the same genre, suggesting that the existing achievements of gender identification are limited to a single genre within the range of UGC genres. Sarawgi et al. (2011) also point out a topic bias between the genders, which helps the models listed above achieve high accuracy. In order to find stronger evidence of gender-specific styles in language and to assess the robustness of gender identification models against changes in topic and genre, Sarawgi et al. (2011) define their gender identification task as cross-topic and cross-genre. Their dataset consisted of blog entries from seven distinctive topics (education, travel, spirituality, history, book reviews, entertainment and politics) and scientific papers from the NLP community. Several


experiments were conducted using three types of statistical language models: a probabilistic context-free grammar that learns deep long-distance syntactic patterns, token-level language models that learn shallow lexico-semantic patterns, and character-level models that learn morphological patterns. Finally, they also used the BOW approach with a maximum entropy classifier. On topic-balanced blog data, the character-level model performed best (71.3%). In the cross-topic experiment (trained on 6 topics, tested on the remaining one), the character-level model again performed best (68.3%). In the next setting, the models were evaluated across topic and genre, as they were trained on the blog dataset and tested on scientific papers; the best performing model was the BOW approach with an accuracy of 61.5%, while the character-level model achieved 58.5%. When the cross-topic approach was applied to scientific papers, which use formal language and do not give lexical or topical cues, the best performing models were the probabilistic context-free grammar and the character-level model (both 76.0%), suggesting that deep syntactic patterns play a much greater role in detecting gender-specific styles.

We next describe the winners of the PAN author profiling tasks of 2013–2015. The objective of the tasks was the identification of age and gender (Rangel et al., 2013, 2014 and 2015), with the identification of personality traits added in 2015 (Rangel et al., 2015). Meina et al. (2013) achieved the best result for gender classification of English Netlog posts. They used an ensemble method combining several weak classifiers into one classifier. They constructed features of various types: structural features (conversation length, number of conversations, paragraphs, special characters and words per sentence), POS tags, POS sequences, text difficulty or readability (based on the Dale-Chall list of familiar words), dictionary-based features (number of emoticons, abbreviations and bad words, basic emotion words, connective words, and persuasive words using Nodebox⁴), number of errors and language mistakes, topic-specific features and an n-gram model. They used a random forest classifier, which reached an accuracy of 59.20%. These features did not work as well on Spanish texts. The winners of the gender identification task for Spanish were Santosh et al. (2013), who also used various style-, content- and topic-based features. The content-based features included n-grams (the 40,000 n-grams that differentiate male and female users the most), while the stylistic features consisted of punctuation symbols and n-grams of POS tags. For the topic-based features they carried out detailed topic modelling: having found that men and women may use the same words but in different contexts, they employed the LDA algorithm and created 200 topics for each gender. Interestingly, they used different machine learning algorithms to train on different features: SVM for content- and style-based features, maximum entropy for topic-based features and decision trees for the merged features. The latter performed best with an accuracy of 64.73%.

The organizers of the PAN 2014 author profiling task set a cross-genre task, for which a corpus of four UGC genres (social media, tweets, blogs, hotel reviews) in English and Spanish (all genres except hotel reviews) was compiled. The most successful gender identification model for English and Spanish was built using a LibLinear classifier (Lopez-Monroy et al., 2014). The winning team used second-order attributes while also considering the relationships among documents belonging to the same class, i.e. the same profile (females, males). The approach found more specific subgroups of authors (male employees, female teenagers etc.). By using intra-class relationships inside target profiles, they automatically generated few but more detailed attributes, which improved the classification rates and achieved 64.76% accuracy.
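The F-measure feature of Mukherjee and Liu (2010) described earlier in this section can be illustrated with a short sketch. The text above does not give the formula; the version below assumes the formality score usually attributed to Heylighen and Dewaele, with POS frequencies expressed as percentages of all tokens. The coarse tag names and the hand-tagged example inputs are hypothetical and would have to be mapped from a real tagger's tagset.

```python
from collections import Counter

# Coarse POS categories, following the description in the text:
# "formal" categories raise the score, "contextual" ones lower it.
FORMAL = ("noun", "adjective", "preposition", "article")
CONTEXTUAL = ("pronoun", "verb", "adverb", "interjection")


def formality_score(pos_tags):
    """Return the formality F-measure (0 to 100) for a list of coarse POS tags.

    Assumed formula: F = 0.5 * ((formal % - contextual %) + 100).
    """
    if not pos_tags:
        raise ValueError("need at least one POS tag")
    counts = Counter(pos_tags)
    total = len(pos_tags)

    def pct(tag):
        # Frequency of a tag as a percentage of all tokens.
        return 100.0 * counts[tag] / total

    formal = sum(pct(tag) for tag in FORMAL)
    contextual = sum(pct(tag) for tag in CONTEXTUAL)
    return 0.5 * (formal - contextual + 100.0)


# Hypothetical hand-tagged inputs: a noun-heavy and a pronoun-heavy text.
explicit_text = ["article", "adjective", "noun", "preposition", "article", "noun"]
implicit_text = ["pronoun", "adverb", "verb", "pronoun", "interjection"]
print(formality_score(explicit_text), formality_score(implicit_text))
```

Under this assumed formula, a text made up only of the "formal" categories scores at the top of the range and one made up only of the "contextual" categories at the bottom, which matches the interpretation of high and low F-measure values given above.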

4 The Nodebox English Linguistics Library enables users to perform grammar inflection and semantic operations on English content: https://www.nodebox.net/code/index.php/Linguistics


In the PAN 2015 author profiling task (Rangel et al., 2015), the participants had to identify the gender, age and five personality traits (extroverted, stable, agreeable, conscientious and open) of Twitter users for English, Dutch, Italian and Spanish. The best performing team (Alvarez-Carmona et al., 2015) built the most discriminative and descriptive features by jointly using second-order attributes and latent semantic analysis techniques. Using a LibLinear classifier for each language, their model reached an average accuracy of 78.95%. Another interesting experiment was carried out by Bartle and Zheng (2015), who classified the gender of bloggers (using the same dataset as Mukherjee and Liu (2010)) and of writers from the 19th and 20th centuries, applying deep learning models and comparing different models. They built a windowed recurrent convolutional neural network (WRCNN) which reached an accuracy of 86%, whereas a POS-tag model trained with SVM achieved 88%.
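To make the sociolinguistic features of Rao et al. (2010) discussed earlier in this section more concrete, the following sketch counts a few of the listed cues in a tweet. The regular expressions are our own rough approximations chosen for illustration, not the patterns used in the original study, and the feature names are hypothetical.

```python
import re

# Illustrative approximations of some cues listed by Rao et al. (2010).
# Every pattern here is an assumption made for this sketch.
PATTERNS = {
    "emoticon": re.compile(r"[:;=]-?[)(DPp]"),            # e.g. :) ;-) =D
    "ellipsis": re.compile(r"\.{3,}|…"),                  # ... or the single character
    "char_repetition": re.compile(r"([A-Za-z])\1{2,}"),   # e.g. sooo, hmmm
    "repeated_exclamation": re.compile(r"!{2,}"),         # e.g. !!!
    "puzzled_punctuation": re.compile(r"\?+!+[?!]*|!+\?+[?!]*"),  # e.g. ?! or !?
    "omg": re.compile(r"\bomg\b", re.IGNORECASE),
    "possessive_my": re.compile(r"\bmy\s+\w+", re.IGNORECASE),    # my_XX bigrams
}


def pragmatic_features(text):
    """Count the occurrences of each cue in a single text."""
    return {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}


# Hypothetical example tweet.
print(pragmatic_features("OMG my sister is sooo funny!!! :) What?! ..."))
```

Counts of this kind could be used directly as a feature vector or normalized by text length; which option works better would have to be established experimentally.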

2.2 Related studies for Slovene

According to the findings of related work on gender identification, the language of the source documents plays an important role, as the same features and classification models may not yield the same classification accuracy even when applied to the same genre (Rangel et al., 2013, 2014 and 2015). However, Burger et al. (2011) identified the gender of Twitter users using a tweet corpus in over 14 languages (English, Portuguese, Spanish, Indonesian, Malay, German, Chinese, Japanese, French, Dutch, Swedish, Filipino, Italian and others) without carrying out any language-specific text pre-processing. For Slovene, both author profiling and authorship attribution have been studied on practical examples. Author profiling has been applied mostly in criminal investigations using text forensics, e.g. Brglez and Umek (2009) found that annotators of texts by anonymous authors could successfully determine the authors' demographic characteristics (gender and age), whereas the psychological profile (personality) was more difficult to recognize. Zwitter Vitez (2014) performed authorship attribution to find the author of an anonymous text which was published on the website of a Slovene political party and upset the Slovene general public.

To the best of our knowledge, there has been only one attempt at automatic gender identification for Slovene. Slovene is a highly inflected language, and when language users talk or write about themselves, they usually express their gender. Automatic gender annotation has been carried out on UGC texts (tweets, blogs, forum posts and news comments) and is described in Erjavec et al. (2015). The rules rely on combinations of finite auxiliary verb forms and participles occurring in a minimum of 5 entries. Osrajnik et al. (2015) examined the automatically annotated tags for tweets and found that 90% of users were classified correctly; of the 10% classified incorrectly, 98% were corporate tweets, i.e. tweets posted by companies, news or travel agencies etc.
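The idea behind such rules can be shown with a toy sketch. It is not the rule set of Erjavec et al. (2015), which is not reproduced here: we assume only a handful of first-person singular auxiliary forms (sem, nisem, bom), an immediately following l-participle whose ending marks gender (-l for masculine, -la for feminine), and one possible reading of the 5-entry threshold. All names are hypothetical, and the suffix heuristic would also match some adjectives.

```python
import re
from collections import Counter

# Assumed subset of first-person singular auxiliary forms of "biti".
AUX = r"(?:sem|nisem|bom)"
# Toy heuristic: the word right after the auxiliary ends in -la (feminine)
# or -l (masculine). Real rules would need POS tags and a wider context.
FEMININE = re.compile(rf"\b{AUX}\s+(\w+la)\b", re.IGNORECASE)
MASCULINE = re.compile(rf"\b{AUX}\s+(\w+l)\b", re.IGNORECASE)


def annotate_gender(entries, min_entries=5):
    """Label a user 'female' or 'male', or return None when evidence is weak.

    min_entries is our reading of the 5-entry threshold: the winning label
    must be supported by at least that many of the user's entries.
    """
    votes = Counter()
    for text in entries:
        fem = bool(FEMININE.search(text))
        masc = bool(MASCULINE.search(text))
        if fem != masc:  # ignore entries with no cue or with conflicting cues
            votes["female" if fem else "male"] += 1
    female, male = votes["female"], votes["male"]
    if female == male or max(female, male) < min_entries:
        return None
    return "female" if female > male else "male"


# Hypothetical entries of one user.
entries = ["Včeraj sem šla v kino.", "Danes sem bila doma."] * 3
print(annotate_gender(entries))
```

Accounts that never write in the first person yield no votes under this heuristic and remain unlabeled.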

2.3 Goals and application

There are a number of potential applications in computational linguistics and social sciences where statistical methods for gender identification can be used. From the related work we can extract three main application fields: text forensics,


psycholinguistics and personality studies, and commercial/marketing domains; however, not all related works provide results of a real-life application of their model. Within text forensics, determining the profile of the author of a text that raises suspicion is valuable, e.g. in discovering cybercriminal acts. Different studies focus on discovering social media users who sexually harass other users and, moreover, on recognizing the lines where their intention is revealed (Inches and Crestani, 2012), or on detecting pedophiles (Peersman et al., 2011), where gender is one of the demographic factors usually predicted in combination with age. Recently, researchers have become interested in latent user attributes other than demographic features, such as the analysis of human behavior by means of personality type recognition (Rangel et al., 2015; Schwartz et al., 2013). Rangel et al. (2015) conducted a shared task of identifying the age, gender and personality type of Twitter users, whereby the participants had to determine five main personality traits: extroversion, emotional stability/neuroticism, agreeableness, conscientiousness, and openness to experience. The researchers obtained the ground truth about personality type from traits that were self-assessed with an online test. Another study on personality, gender and age in social media was carried out by Schwartz et al. (2013), but on a much larger scale. They collected Facebook messages of 75,000 volunteers who also had to take a standard Five Factor Model personality test, also used in Rangel et al. (2015). Interestingly, when building features for the task, they also used LIWC (Linguistic Inquiry and Word Count), which is a psycholinguistic dictionary for English based on various word-category lexica, ranging from POS to topical categories (Pennebaker et al., 2007).

3 Gender and language in CDA and corpus linguistics

While text mining approaches to gender mostly focus on extracting the features that are most useful for identifying the gender of a text's author, many researchers in the field of gender and language favor qualitative analyses of small collections of texts in order to perform detailed, theoretically grounded studies. This is the case in critical discourse analysis, which often serves as a basis for corpus studies of gender and language; these, however, rely on statistical techniques applied to a large dataset, a language corpus. In this section, we present the central questions around which CDA positions its studies. Then we describe the main approaches to discovering differences and similarities between male and female speech in corpus linguistics and present related work.

3.1 Related work in critical discourse analysis

In this section, the most significant contributions to the question of gender and language in the CDA community are discussed. The general focus in these studies is placed on whether and how language is used to impose and reinforce social norms and expectations of gender roles and behavior, and on how social power is distributed among different groups of speakers. In one of the first and most frequently cited works, Lakoff (1975) finds that girls are taught from early on to use specific words or syntactic patterns, whereas the same limitations are not imposed on boys. She bases her claims mainly on her own observations of herself, her colleagues and her friends. She argues that the social norms imposed on women's language use support their subordination and prevent them from asserting a position equal to that of men. Women are educated to talk like ladies and Lakoff provides the core examples of


“women’s language”, focusing only on spoken language. Much of the vocabulary women use relates to what is considered women’s work, e.g. household errands and sewing, and includes a wide vocabulary for colors, implying that these topics reflect what is most important to women and what they do during the day. Another lexical characteristic of women’s language is the use of adjectives that indicate the speaker’s approbation and admiration for something. Neutral adjectives are used by both men and women (great, terrific, cool, neat), whereas some are used mostly by women (adorable, charming, sweet, lovely, divine, cute). According to Lakoff’s observations on syntax, some (over)polite forms are used more often by women. In order to soften the impact of an indicative sentence, women frequently hedge their statements by using words or phrases like well, y’know, kinda, sorta, I guess, I think (for statements), I wonder (for indirect questions). Such features are part of the general belief that women are more polite in their speech than men and use polite forms more often (please, thank you). Being put in their place, as Lakoff characterizes the discriminatory treatment of women, women are more reserved and careful about their speech and general behaviour, which results in a seeming lack of humour on the one hand and hypercorrect grammar (no use of non-standard forms like ain’t) on the other. For Lakoff, all of these restrictions work like a double-edged sword: women are not rewarded for complying with the norms in their language use; on the contrary, Lakoff states that because of their use of hedges and polite forms, women are seen as lacking assertiveness and as unsuitable leaders. Using empty adjectives and talking about housework gives the impression that what they have to say is trivial. However, if women refuse to talk “like ladies” (not using polite forms; swearing), they may be criticized as unfeminine and bossy. Lakoff concludes that gender norms with regard to language clearly reflect the existing unequal and even discriminatory treatment of women.

Tannen (1990) approaches the differences between male and female speech from the perspective of communication and interaction between men and women. She draws examples from anecdotes, from fiction and from the experiences of her family, friends and colleagues. As is evident from the title (You just don’t understand: Women and men in conversation), she understands the linguistic differences in a broader sense and deals mostly with meaning rather than with particular words or syntactic patterns. Again, the underlying reason for the misunderstandings between men and women lies in the way girls and boys were brought up; however, the implication here is not as political as in Lakoff (1975). Tannen depicts the differences in male and female conversational style with opposing concepts: women talk to start interaction and promote social affiliation, whereas men talk to provide or receive information and to “negotiate and maintain status in a hierarchical social order” (Tannen, 1990). She names women’s language rapport talk and men’s language report talk. Women value intimacy, whereas men prefer independence, which they feel is threatened by intimacy; this results in a conflict between a man and a woman. One of the interesting points she makes is that men often engage in a conversation in a more competitive way, performing an imitation of a duel which is, however, not a real conflict or fight. When a man uses this approach in a conversation with his (female) partner, the woman feels that this way of conversing might endanger their relationship, so she does not wish to imitate his conversational style, which again may end in a conflict.

Generally, Lakoff (1975) sees language as one of the fields where women are subjected to discrimination, whereas Tannen (1990) observes men and women as groups with two different cultural backgrounds, whereby cultural differences lead to conflict, which can be overcome. There have of course been other CDA studies of gender and language, but many focus on a very specific environment (work or home) to observe how language is

Page 12: Abstract - IJSkt.ijs.si/markodebeljak/Lectures/Seminar_MPS/2012... · Abstract With advances in natural language processing, ... contains more nouns, adjectives, prepositions and

9

used. It is important to note that both Lakoff (1975) and Tannen (1990) provide examples from speech, not text. However, because UGC tends to be less structured and formal (no proof-reading in general) and the main purpose of social media is for the users to meet and interact, UGC language often contains patterns similar to those in speech, e.g. elements of interaction between users (Zwitter Vitez and Fišer, 2015).

3.2 Gender in corpus analyses

In this section, studies of language use within corpus linguistics are presented. In Section 3.1, we discussed two important contributions of CDA to the issue of gender and language. Both Lakoff (1975) and Tannen (1990) based their findings on detailed qualitative observations of a small sample of people consisting of themselves, their friends, family and colleagues. If we approach the issue of gender and language in order to extract general facts about how society imposes norms through language, and how these norms are in turn reproduced by language use, building and analyzing a representative sample is key. One possible approach is corpus analysis, where statistical techniques can be applied to large amounts of textual data. Tognini-Bonelli (2001) introduced the distinction between two methodological approaches to corpus analysis: the corpus-based approach, where the corpus is used to test hypotheses set prior to corpus analysis; and the corpus-driven approach, where the corpus is the source of hypotheses about language. This distinction is to be understood as a continuum, not a binary choice, as both approaches may be used in a single study. A typical corpus-based method in gender studies is the analysis of the use of particular words, phrases or other tokens (Schmid, 2003; Osrajnik et al., 2015). Researchers who take a more corpus-driven approach usually build frequency lists or keyword lists for the male and female parts of the corpus in question (Baker, 2014; Jeon and Choe, 2009). Baker (2014) argues that some methods in corpus analysis are problematic, as they favor findings that reveal differences between male and female language while paying little or no attention to the similarities. Another possible analysis is to compare the male and female subcorpora using measures of statistical similarity, e.g. the Manhattan distance or the Spearman rank correlation test (Baker, 2014).
Tannen’s hypothesis that male-female conversation is cross-cultural communication (Tannen, 1990) was taken into consideration in Schmid’s analysis (2003) of the male and female parts of the BNC Demographic corpus. He used a fairly simple formula for the difference coefficient based on relative frequencies. First, he observed the use of empty words and some syntactic patterns which, according to Lakoff (1975), are used by women or which women are expected to use. Second, he analyzed the use of words typical of different topics to find out whether the topics of conversation of men and women differ. When comparing the use of what he calls “women’s words” (adjectives and adverbs like handsome, lovely, sweet, awful), Schmid found that, with a few exceptions, the words in question do occur more frequently among female speakers. Markers of hesitation and hedges (well, really, maybe, erm, er) are distributed equally between the male and female subcorpora, with some of them (well, really, you see) being used more often by women, and others (er, erm, in fact) by men. Minimal responses (mm, aha, yes) indicate that one is paying attention to what other discourse participants are saying; all minimal responses were found to occur more frequently in the female subcorpus. Schmid also examined hand-crafted lists of words which are typical of particular topics, 14 topics altogether. An overrepresentation of women speakers was detected for the domains clothing, basic colors, home, food and drink, body and health, and people. In contrast, the domains work, computing, sports and public affairs were considered more typical of the male subcorpus. Interestingly, in the domain car and traffic, the more general terms (bus, train, car) were found more often among female speakers, while more specific words were more typical of male speakers. The investigation of swear words yielded interesting results, as four of seven items were more frequently used by women. Schmid concludes that the findings from the spoken BNC reveal the differences between British men and women: what they are occupied with during the day, their problems, hobbies and interests, whereby these linguistic differences indicate some sort of cultural differences. Similarly, Baker (2014) tested the hypothesis of previous studies that men tend to use more profane language than women. First, he examined the use of the word fucking in the spoken part of BNC Demographic and found that the word was used 1,724 times by a very small percentage of women (4.19%) and men (3.63%). Most of the occurrences of this swear word were contributed by male speakers, which means that only a few men contributed many occurrences of the word, with some of the speakers using it frequently in short passages of transcripts. Baker then expanded the analysis of swear words in the BNC to a list of swear words (fuck, bastard, shit, cunt, bloody, bugger, arse, piss and their derivatives). The results showed that the words were relatively equally distributed between male and female speakers, with a slightly greater frequency in the male subcorpus. Baker reports that 18.3% of the female speakers (of the 1,360 women) and 15.5% of the male speakers (of the 2,448 men) used swear words in their speech. Interestingly, he found that these words were “relatively more widely dispersed among female speakers than males.” Baker concluded that the majority of men and women did not swear when being recorded. Osrajnik et al. 
(2015) compared the use of expressive punctuation in a corpus of Slovene tweets with regard to the gender of Twitter users. In the tweet corpus, over 1 million tweets (27%) were contributed by female users, over 2 million by male users (55%) and over 700,000 by gender-neutral users (18%). They observed only the tweets with a lower level of linguistic standardness. The term expressive punctuation was used to denote either emoticons or punctuation signs that were repeated. The only non-repeated punctuation signs that were considered expressive were the question and exclamation marks. Osrajnik et al. found that emoticons occurred more frequently in the female subcorpus, whereas repeated punctuation signs were used more often by male Twitter users. The automatically classified sentiment of the tweets was also observed: a great majority of emoticons and punctuation signs were annotated with a negative sentiment regardless of gender. Some expressive punctuation signs occurred only in one or the other tweet subcorpus; however, the unequal size of the subcorpora should be taken into consideration. Jeon and Choe (2009) conducted a keyword analysis of male and female speech in the International Corpus of English – Great Britain (ICE-GB). In the study, they focused on the use of intensifying adverbs, which, according to Lakoff (1975), typically occur in women’s language. They used the chi-squared statistic in the WordSmith tool to calculate the keywords. When observing the raw frequency of the top ten rank-ordered adverbs, they found that the lists of the male and female subcorpora differ in only two adverbs (as in the male and really in the female list). Then, each of the subcorpora was compared against a reference corpus (here the entire ICE-GB). 
In the female corpus, 13 key intensifying adverbs were found (really, sort of, so, a bit, that, very, too, slightly, all, quite, completely, at all, and much), whereas the male corpus has seven such keywords (entirely, right, around, over, some, relatively, and well). They also investigated the dispersion of the keywords obtained. All the keywords in the female subcorpus showed very high dispersion values, but the dispersion values of the male keywords seem to fluctuate, and thus some words might not be representative of the subcorpus.
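The frequency comparisons described above can be illustrated with a minimal Python sketch. It is not the implementation used by Schmid (2003) or by WordSmith: the difference coefficient is assumed here to take the common form (a − b) / (a + b) over relative frequencies, the chi-squared keyness treats the target and reference corpora as two independent samples, and all function names and counts are hypothetical.

```python
def relative_frequency(count, corpus_size, per=1_000_000):
    """Occurrences of a word per `per` tokens of the corpus."""
    return count * per / corpus_size


def difference_coefficient(rel_a, rel_b):
    """Assumed form (a - b) / (a + b) over relative frequencies.

    Ranges from -1 (word occurs only in B) to +1 (only in A);
    0 means the word is equally frequent in both subcorpora.
    """
    total = rel_a + rel_b
    return 0.0 if total == 0 else (rel_a - rel_b) / total


def chi_squared_keyness(count_target, size_target, count_ref, size_ref):
    """Pearson chi-squared (no continuity correction) for the 2x2 table
    {word, all other words} x {target corpus, reference corpus}."""
    observed = [
        [count_target, size_target - count_target],
        [count_ref, size_ref - count_ref],
    ]
    row_totals = [size_target, size_ref]
    col_totals = [count_target + count_ref,
                  size_target + size_ref - count_target - count_ref]
    total = size_target + size_ref
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts for one word in a female and a male subcorpus.
female_rel = relative_frequency(count=420, corpus_size=1_500_000)
male_rel = relative_frequency(count=310, corpus_size=2_000_000)
coefficient = difference_coefficient(female_rel, male_rel)
keyness = chi_squared_keyness(420, 1_500_000, 310, 2_000_000)
```

A positive coefficient marks a word as relatively more frequent in the first subcorpus; a larger chi-squared value marks a stronger keyword candidate, although, as the dispersion results above suggest, a high score alone does not show that the word is used by many speakers.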

A keyword analysis of male and female speech in the BNC was carried out by Baker (2014), who used the FLOB corpus of written British English as a reference, which means that the male and female subcorpora were not compared directly, but rather to a third corpus. Baker considered the 100 strongest keywords from each keyword list. Out of the top 100 keywords, 88 were identical and occurred in both lists. He reported that the actual ranking of the keywords in each list was very similar. Among the keywords that occurred in the male subcorpus only, numerals seem to stand out the most. Female speakers tend to use verbs that indicate reported speech (says and said). Baker points out that the context and circumstances of recording were not entirely symmetrical when recording men and women for the spoken BNC: the demographic part (recorded at home) contains larger amounts of female speech, whereas the context-governed part (recorded at work or in public) contains larger amounts of male speech. Taking into account the 20 most frequent male words in the male and female speech in the BNC, Baker (2014) performed a Spearman rank correlation test and obtained a coefficient of 0.93, which indicates a very high correlation, meaning that the similarities between the subcorpora outweigh the differences. Without regard to particular groups of words or phrases, subcorpora of male and female speech can still be compared according to their overall similarity. Baker (2014) used the Manhattan distance (MD), which takes into account the actual frequencies of words. He randomly split both the male and the female subcorpus into four equal parts and computed the MD over the combinations. Baker found that the greatest distance was indeed between “mixed-sex pairs”, as he calls them. However, the highest distances between same-sex combinations were larger than some mixed-sex ones, which he interprets as evidence that the speech of a single gender is not homogeneous. 
He argues that the differences between male and female language evident in the corpus mostly arise from the different roles men and women perform in society. Thus, lexical differences “should not be viewed as telling us anything essential about a particular sex, instead these differences are circumstantial” (Baker, 2014).
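The two similarity measures mentioned above can be sketched in a few lines of Python. This is a minimal illustration and not Baker's (2014) procedure: the word list, the toy frequencies and the function names are hypothetical, and the frequencies are assumed to be already normalised (e.g. per million tokens), which is advisable when the subcorpora differ in size.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from largest (rank 1) to smallest; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 0.0 if sd_x == 0 or sd_y == 0 else cov / (sd_x * sd_y)


def spearman(freq_a, freq_b, words):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    ranks_a = average_ranks([freq_a.get(w, 0) for w in words])
    ranks_b = average_ranks([freq_b.get(w, 0) for w in words])
    return pearson(ranks_a, ranks_b)


def manhattan_distance(freq_a, freq_b, words):
    """Sum of absolute frequency differences over the chosen word list."""
    return sum(abs(freq_a.get(w, 0) - freq_b.get(w, 0)) for w in words)


# Hypothetical normalised frequencies for a handful of frequent words.
male = {"the": 50, "i": 40, "you": 30, "two": 20}
female = {"the": 48, "i": 45, "you": 35, "said": 25}
words = sorted(set(male) | set(female))
rho = spearman(male, female, words)
distance = manhattan_distance(male, female, words)
```

A coefficient close to 1 indicates that the words are ranked almost identically in both subcorpora. The distance is only meaningful in comparison, e.g. between same-sex and mixed-sex pairs of subcorpus parts, as in the split described above.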

4 Critical judgement of the existing research in the field

In this section, we summarize the overview of related work by providing a critical evaluation of the approaches to finding differences in how language is used by men and women. First, we focus on the features that yielded the best results in gender identification tasks for the users in UGC. Then, the approaches in corpus linguistics referring to CDA are discussed. After studying the related work most relevant to our research, we found that in general the features used for gender attribution can be divided into several groups, with some features belonging to more than one: lexical (unigrams, POS-tags, POS-sequences), stylistic (special characters, punctuation marks, number of paragraphs, number of words per sentence), topical (topic-specific words or hand-crafted lexica of e.g. emotional words), language models (n-grams), complex features (readability, F-measure, linguistic mistakes) and other (second order representation, metadata, sociolinguistic, dictionary-based). The majority of studies took into account a combination of different types of features; only two used a single feature: Mukherjee and Liu (2010) used POS-sequences, while Bartle and Zheng (2015) employed POS-tags. With regard to machine learning techniques, most related works applied SVM (Mukherjee and Liu, 2010; Rao et al., 2010; Bartle and Zheng, 2015), while others used the LibLinear classifier (Lopez-Monroy et al., 2014; Alvarez-Carmona et al., 2015), Balanced Winnow2 (Burger et al., 2011), maximum entropy (Sarawgi et al., 2011) or other methods. These models are normally not visualized or presented in a manner that is easily interpretable. It might be interesting to see which features or feature values from the entire combination were the most decisive for gender identification, as this kind of information may be valuable for researchers of gender and language coming from other fields (social sciences and humanities in particular). Only two approaches used an interpretable model, such as decision trees (Santosh et al., 2013) or random forest (Meina et al., 2013); however, neither of them shows or provides examples of the most discriminating feature values. As Sarawgi et al. (2011) point out, recent work in gender identification does not reveal stylistic (syntactic) differences between male and female language use, but rather non-stylistic characteristics which are based on topic bias. Baker (2014) argues that differentiating words do not describe a particular sex, but rather reflect its role in society and the norms or expectations imposed by society. Lexical variation between genders is a surface-level phenomenon, but it is accompanied by various prosodic and pragmatic factors and occurs in different syntactic patterns, which, at least in corpus analysis, remain out of primary focus (Baker, 2014). Sarawgi et al. (2011) conducted a cross-topic and cross-genre analysis of male and female language to find a model that demonstrates syntactic differences. Rao et al. (2010) tested a set of hand-crafted sociolinguistic features inspired by Lakoff (1975); these are mostly written variations of spoken prosodic cues, which are otherwise absent in UGC. Just as Lakoff (1975) gives examples of characteristics of women’s language, Rao et al. (2010) report that most of their features are more typical of female than male Twitter users. To our knowledge, this is the only study where a model was improved with knowledge of the social aspects of language and gender. 
In contrast to ML tasks of gender identification, many corpus analyses take into account the social factor of male and female language and refer to theoretical argumentation (CDA) on why and how the language of women differs from or is similar to the language of men. However, CDA grounds its theory and gives examples based on speech, because speech is spontaneous and largely unprepared. Among the works discussed in Section 3.2, all but one (Osrajnik et al., 2015) used corpora of spoken language. As Baker (2014) points out, a potential problem arises if the circumstances of recording vary too much between the male and female subcorpora, creating asymmetries that occur due to gender roles, not differences in how language is used by men and women. This is why it is interesting that only Osrajnik et al. (2015) used a corpus of tweets, as both men and women can participate equally on social media and create UGC. Furthermore, Lakoff (1975) and Tannen (1990) draw examples of language use from their own experience and from the lives of their colleagues, friends and family. Their method is highly introspective; moreover, the representativeness of their sample is questionable, as the people they describe presumably show little demographic variation (North American, white, middle class and educated). Lakoff’s and Tannen’s studies do not provide strong evidence, but rather a social perspective that is made use of in corpus analyses, even though the corpus itself can be used as a source of theory (the corpus-driven approach), especially because there are tests (MD and Spearman rank correlation in Baker (2014)) which can easily be used to demonstrate the general similarity between the language of men and women.

5 Suggestions for upgrading the existing conditions

In this section, we provide suggestions for improving the features used for gender identification on the one hand, and for conducting corpus analyses of male and female language on the other. As is evident from Sections 3 and 4, the body of research on gender and language in UGC within the ML community is much larger and continues to grow, whereas corpus analyses focus mostly on the existing speech corpora. Some of the features used for the best performing gender identification models can be manually converted into features for corpus analysis, such as POS-tags, POS-sequences or the sociolinguistic/prosodic features used by Rao et al. (2010). For their gender identification model, Meina et al. (2013) used a wide combination of features, including the number of errors and language mistakes; similarly, Ljubešić et al. (2015) developed a model that predicts the linguistic and technical standardness of tweets, distinguishing between three different levels. The corpus we plan to work with first is the Janes corpus⁵, which is already annotated for the levels of linguistic and technical standardness. Feature construction heavily depends on the task and the data we are working with (tweets, blogs, news comments, Facebook messages etc.). However, the approaches to feature construction presented in Section 3 seem not to take into account all the possible context features of UGC. One of the most prominent pragmatic circumstances, which is nevertheless disregarded in the related work presented, is the dialogue. Communication between users is one of the key functions of social media. In a dialogue between two or more participants, we can observe the role each of the participants plays and the relationship between them, e.g. hierarchy as opposed to equality. 
Based on Lakoff’s (1975) and Tannen’s (1990) findings on the dialogue between same-sex and mixed-sex groups, we can build complex features, some of them typical of women’s language according to Lakoff (1975): use of hedges and lack of assertiveness, (im)politeness, or emphasis on certain words in text (using italic print or capital letters only) and in speech (intonation patterns). To avoid generalizations of language use based on topic bias, we can observe the discursive strategies in more detail by focusing on a single topic or on several topics independently, e.g. politics, family or sports. The documents on the topic in question can be extracted from the corpus according to keywords or hashtags (Rao et al., 2010). The topics common to both genders and those more exclusive to one gender can be discovered by employing topic ontology construction, e.g. by importing the documents into an ontology editor such as OntoGen⁶, a semi-automatic, data-driven ontology editor.
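A minimal Python sketch of how such features might be extracted from a single UGC document is given below. It is an illustration under stated assumptions rather than a proposed final feature set: the hedge list is taken from the examples quoted from Schmid (2003) in Section 3.2, the regular expressions are simplifications, and the names `extract_features` and `filter_by_topic` are hypothetical.

```python
import re

# Hedges and hesitation markers quoted in Section 3.2; a real lexicon would be larger.
HEDGES = {"well", "really", "maybe", "erm", "er"}

TOKEN_RE = re.compile(r"#?\w+")            # words and hashtags
REPEATED_PUNCT_RE = re.compile(r"[!?.]{2,}")  # runs such as "!!!", "?!", "..."


def extract_features(text):
    """Return a few hand-crafted features for one document (e.g. a tweet)."""
    tokens = TOKEN_RE.findall(text)
    words = [t for t in tokens if not t.startswith("#")]
    n_words = max(len(words), 1)  # avoid division by zero for empty texts
    return {
        # share of words that are hedges or hesitation markers
        "hedge_ratio": sum(w.lower() in HEDGES for w in words) / n_words,
        # emphasis by capital letters only (single letters such as "I" are ignored)
        "all_caps_ratio": sum(len(w) > 1 and w.isupper() for w in words) / n_words,
        # number of runs of repeated expressive punctuation
        "repeated_punct": len(REPEATED_PUNCT_RE.findall(text)),
        # lower-cased hashtags, usable for topic filtering
        "hashtags": [t.lower() for t in tokens if t.startswith("#")],
    }


def filter_by_topic(documents, topic_hashtags):
    """Keep only the documents that carry at least one of the given hashtags."""
    wanted = {h.lower() for h in topic_hashtags}
    return [d for d in documents if wanted & set(extract_features(d)["hashtags"])]


features = extract_features("Well, maybe this is REALLY great!!! #sports")
sports_only = filter_by_topic(
    ["Go team!! #Sports", "dinner with family #home"], {"#sports"}
)
```

Filtering by hashtag first and computing the features per topic afterwards keeps topic bias out of the comparison, which is the motivation given above for analyzing topics independently.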

⁵ The Janes project: http://nl.ijs.si/janes/
⁶ The OntoGen tool is an ontology editor developed at the Jožef Stefan Institute: http://ontogen.ijs.si/

References

Álvarez-Carmona, M. A., López-Monroy, A. P., Montes-y-Gómez, M., Villaseñor-Pineda, L. and Jair-Escalante, H. 2015. INAOE’s participation at PAN’15: Author profiling task—Notebook for PAN at CLEF 2015. In: Cappellato, L., Ferro, N., Jones, G. and San-Juan, E. (eds.): CLEF 2015 Labs and Workshops, Notebook Papers. CEUR-WS.org vol. 1391.
Argamon, S., Koppel, M., Pennebaker, J. W. and Schler, J. 2007. Mining the Blogosphere: Age, gender and the varieties of self-expression. First Monday, September 2007.
Argamon, S., Goulain, J., Horton, R. and Olsen, M. 2009. Vive la Différence! Text Mining Gender Difference in French Literature. Digital Humanities Quarterly, Spring 2009, Volume 3 Number 2.
Baker, P. 2014. Using Corpora to Analyze Gender. London: Bloomsbury.
Bartle, A. and Zheng, J. 2015. Gender classification with deep learning. Stanford University report.
Brank, J., Mladenić, D. and Grobelnik, M. 2010. Feature Construction in Text Mining. In: Sammut, C. and Webb, G. I. (eds.): Encyclopedia of Machine Learning. 397–401.
Brglez, L. and Umek, P. 2009. Uporabnost spoznanj sociolingvistike in psiholingvistike za kriminalistično preiskovanje. 10. slovenski dnevi varstvoslovja: Varstvoslovje med teorijo in prakso. Maribor: Fakulteta za varnostne vede.
Burger, J. D., Henderson, J., Kim, G. and Zarrella, G. 2011. Discriminating gender on Twitter. In: Proceedings of the 2011 Conference on Empirical Methods in NLP. Edinburgh, Scotland, UK.
Chua, T., Juanzi, L. and Moens, M. 2014. Mining user generated content. Chapman and Hall/CRC.
Crystal, D. 2001. Language and the Internet. Cambridge University Press.
Erjavec, T., Fišer, D. and Ljubešić, N. 2015. Razvoj korpusa slovenskih spletnih uporabniških vsebin Janes. In: Fišer, D. (ed.): Zbornik konference Slovenščina na spletu in v novih medijih. Ljubljana: Znanstvena založba Filozofske fakultete, 20–26.
Inches, G. and Crestani, F. 2012. Overview of the International Sexual Predator Identification Competition at PAN 2012. CLEF 2012.
Jeon, J. and Choe, J. 2009. A Key Word Analysis of English Intensifying Adverbs in Male and Female Speech in ICE-GB. PACLIC 2009. 210–219.
Kendall, S. and Tannen, D. 2001. Discourse and Gender. In: Tannen, D., Schiffrin, D. and Hamilton, H. (eds.): Handbook of Discourse Analysis. Oxford: Blackwell, 352–371.
Lakoff, R. 1975. Language and Woman’s Place. New York: Harper and Row.
Ljubešić, N., Fišer, D., Erjavec, T., Čibej, J., Marko, D., Pollak, S. and Škrjanec, I. 2015. Predicting the level of text standardness in user-generated content. 10th International Conference on Recent Advances in Natural Language Processing: Proceedings of RANLP 2015 Conference, 7–9 September 2015, Hissar, Bulgaria. Hissar: 371–378.
Macaulay, R. K. 2005. Talk that counts: Age, Gender, and Social Class Differences in Discourse. Oxford University Press.
Meina, M., Brodzinska, K., Celmer, B., Czokow, M., Patera, M., Pezacki, J. and Wilk, M. 2013. Ensemble-based Classification for Author Profiling Using Various Features—Notebook for PAN at CLEF 2013. In: Forner, P., Navigli, R. and Tufis, D. (eds.): CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers, 23–26 September, Valencia, Spain.
Mukherjee, A. and Liu, B. 2010. Improving Gender Classification of Blog Authors. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. 207–217.
Osrajnik, E., Fišer, D. and Popič, D. 2015. Primerjava rabe ekspresivnih ločil v tvitih slovenskih uporabnikov in uporabnic. In: Fišer, D. (ed.): Zbornik konference Slovenščina na spletu in v novih medijih. Ljubljana: Znanstvena založba Filozofske fakultete, 50–74.
Pastor López-Monroy, A., Montes y Gómez, M., Jair-Escalante, H. and Villaseñor Pineda, L. 2014. Using Intra-Profile Information for Author Profiling—Notebook for PAN at CLEF 2014. In: Cappellato, L., Ferro, N., Halvey, M. and Kraaij, W. (eds.): CLEF 2014 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings.
Pennebaker, J. W., Chung, C. K., Ireland, M., Gonzales, A. and Booth, R. J. 2007. The development and psychometric properties of LIWC2007. The University of Texas at Austin. LIWCNET 1: 1–22.
Prabhakaran, V., Reid, E. E. and Rambow, O. 2014. Gender and Power: How Gender and Gender Environment Affect Manifestations of Power. EMNLP.
Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B. and Daelemans, W. 2014. Overview of the 2nd author profiling task at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M. and Kraaij, W. (eds.): CLEF 2014.
Rangel, F., Rosso, P., Koppel, M., Stamatatos, E. and Inches, G. 2013. Overview of the Author Profiling Task at PAN 2013—Notebook for PAN at CLEF 2013. In: Forner, P., Navigli, R. and Tufis, D. (eds.): CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers. CEUR-WS.org vol. 1179.
Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B. and Daelemans, W. 2015. Overview of the Author Profiling Task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G. and San Juan, E. (eds.): CLEF 2015 Labs and Workshops, Notebook Papers, 8–11 September, Toulouse, France. CEUR Workshop Proceedings. ISSN 1613-0073, http://ceur-ws.org/Vol-1391/.
Rao, D., Yarowsky, D., Shreevats, A. and Gupta, M. 2010. Classifying latent user attributes in Twitter. In: Proceedings of the 2nd international workshop on Search and mining user-generated contents, New York, NY, USA, 37–44.
Sarawgi, R., Gajulapalli, K. and Choi, Y. 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. CoNLL.
Schmid, H. J. 2003. Do men and women really live in different cultures? Evidence from the BNC. In: Wilson, A., Rayson, R. and McEnery, T. (eds.): Corpus Linguistics by the Lune. Lódź Studies in Language 8. Frankfurt: Peter Lang. 185–221.
Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., et al. 2013. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE 8(9): e73791. doi:10.1371/journal.pone.0073791.
Tannen, D. 1990. You Just Don’t Understand: Women and Men in Conversation. New York: Ballantine Books.
Tognini-Bonelli, E. 2001. Corpus Linguistics at Work. Amsterdam: John Benjamins.
Van Dijk, T. A. 2001. Critical Discourse Analysis. In: Tannen, D., Schiffrin, D. and Hamilton, H. (eds.): Handbook of Discourse Analysis. Oxford: Blackwell, 352–371.
Vogel, A. and Jurafsky, D. 2012. He said, she said: gender in the ACL anthology. Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries. 33–41.
Zwitter Vitez, A. 2014. Ugotavljanje avtorstva besedil: primer “Trenirkarjev”. In: Erjavec, T. and Žganec Gros, J. (eds.): Language technologies: proceedings of the 17th International Multiconference Information Society – IS 2014, October 9th – 10th, 2014, Ljubljana, Slovenia: volume G. Ljubljana: Institut Jožef Stefan, 131–134.
Zwitter Vitez, A. and Fišer, D. 2015. Elementi interakcije v govorjenih in spletnih besedilih. In: Fišer, D. (ed.): Zbornik konference Slovenščina na spletu in v novih medijih. Ljubljana: Znanstvena založba Filozofske fakultete, 87–90.