How Social Q&A Sites are Changing Knowledge …web.cs.ucdavis.edu/~filkov/papers/r_so.pdfHow Social Q&A Sites are Changing Knowledge Sharing in Open Source Software Communities Bogdan

How Social QampA Sites are Changing Knowledge Sharingin Open Source Software Communities

Bogdan Vasilescu12 Alexander Serebrenik1 Premkumar Devanbu2 Vladimir Filkov2

1Eindhoven University of Technology The Netherlands bnvasilescu aserebreniktuenl2University of California Davis USA devanbu filkovcsucdavisedu

ABSTRACTHistorically mailing lists have been the preferred means forcoordinating development and user support activities Withthe emergence and popularity growth of social QampA sitessuch as the StackExchange network (eg StackOverflow)this is beginning to change Such sites offer different socio-technical incentives to their participants than mailing lists doeg rich web environments to store and manage content col-laboratively or a place to showcase their knowledge and ex-pertise more visibly to peers or potential recruiters A keydifference between StackExchange and mailing lists is gam-ification ie StackExchange participants compete to obtainreputation points and badges Using a case study of R apopular data analysis software in this paper we investigatehow mailing list participation has evolved since the launchof StackExchange Our main contribution is assembling ajoint data set from the two sources in which participants inboth the r-help mailing list and StackExchange are identi-fiable This allows for linking their activities across the tworesources and also over time With this data set we foundthat user support activities are showing a strong shift awayfrom r-help In particular mailing list experts are mi-grating to StackExchange where their behaviour is differentFirst participants active both on r-help and on StackEx-change are more active than those who focus exclusively ononly one of the two Second they provide faster answers onStackExchange than on r-help suggesting they are moti-vated by the gamified environment To our knowledge ourstudy is the first to directly chart the changes in behaviour ofspecific contributors as they migrate into gamified environ-ments and has important implications for knowledge man-agement in software engineering

Author KeywordsCrowdsourced knowledge social QampA mailing lists opensource gamification

ACM Classification KeywordsH53 [Information Interfaces and Presentation (eg HCI)]Computer-supported cooperative work

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page Copyrights for components of this work owned by others than theauthor(s) must be honored Abstracting with credit is permitted To copy otherwise orrepublish to post on servers or to redistribute to lists requires prior specific permissionandor a fee Request permissions from permissionsacmorgCSCWrsquo14 February 15ndash19 2014 Baltimore Maryland USACopyright is held by the ownerauthor(s) Publication rights licensed to ACMACM 978-1-4503-2540-01402$1500httpdxdoiorg10114525316022531659

INTRODUCTIONHistorically mailing lists have been the preferred mediumfor coordinating development and user support activities [163132] In particular mailing lists have been viewed as the defacto communication medium between knowledge seekers(eg users of the software asking for support) and knowledgeproviders (eg other users more knowledgeable about thetopic or the developers themselves) in models of knowledgesharing in open source [32] The two categories of knowl-edge actors have been reported to co-exist in a symbiotic re-lationship wherein ldquothe community learns from its partici-pants and each individual learns from the communityrdquo [32]However their motivations for participation may differ Forinstance knowledge seekers may directly benefit from hav-ing their problems solved while knowledge providers maybe motivated intrinsically (eg by altruism) or by learningabout the problems other users are experiencing [20 32]

Recent years have witnessed the emergence and grow-ing popularity of software-development-related social me-dia sites such as GitHub1 (coding) Jira2 (issue tracking)or the StackExchange network (question and answer web-sites eg StackOverflow for ldquoprofessional and enthusiastprogrammersrdquo3 or CrossValidated for ldquostatisticians data an-alysts data miners and data visualization expertsrdquo4) Suchsites are rapidly changing the ways in which developers col-laborate learn and communicate among themselves andwith their users [4 8 9 30 34] Moreover they are offer-ing different socio-technical incentives to their participantseg rich Web 20 platforms to store and manage content col-laboratively or a place to showcase their knowledge and ex-pertise more visibly to peers and potential recruiters [8] Inaddition StackExchange sites employ gamification [11] toengage users more questions and answers are voted uponby the community the number of votes is reflected in theposterrsquos reputation and badges exceeding various reputationthresholds grants access to additional features (eg moder-ation rights on topics and posts) reputation and badges canalso be seen as a measure of onersquos expertise by potential re-cruiters [8] and are known to motivate users to contributemore [1 2 10 42] Activity on StackExchange sites can alsoelevate one to celebrity status within the developer commu-nity (see eg the discussion around Jon Skeet5 the mostprolific contributor to StackOverflow)1httpsgithubcom2httpwwwatlassiancomsoftwarejira3httpstackoverflowcom4httpstatsstackexchangecom5httpmetastackoverflowcomq9134

Naturally the richer user interfaces wider audiences or dif-ferent incentives and motivations for participation inherent insocial QampA sites are challenging the supremacy of mailinglists as the de facto communication medium between knowl-edge seekers and knowledge providers For example Stack-Overflow is known to provide good technical solutions [25]and to provide them fast [23] At the same time mailing listparticipants are signalling the need for more modern support6and are even promoting a transition to StackExchange7 Ourgoal here is to study in detail the effects of such a transitionon contributors and their work Are mailing list participantstransitioning to StackExchange If so do they behave differ-ently on StackExchange than on the mailing list

To study the phenomena associated with such a transitionwe need a longitudinal data set combining mailing list andStackExchange activity wherein participants overlap In thispaper we create such a data set for R [28] a popular dataanalysis software by integrating mailing list activity andStackExchange activity (the latter under the [r] tag) Usingthis data set we find that

bull activity on r-help (the main user support mailing listfor R) has been consistently decreasing since around 2010(ie participants are asking fewer and fewer questions)while at the same time the number of R-related questionsasked on the two main StackExchange sites for R (Cross-Validated and StackOverflow) has been growing at an in-creasing ratebull participants in the two communities overlap but different

categories of r-help contributors are ldquoattractedrdquo differ-ently by StackExchange For example the proportion ofmailing list users active on StackExchange is much higheramong R developers than non-developersbull the levels of activity for r-help participants who are also

active on StackExchange differ relative to those who re-strict themselves to either the mailing list or to StackEx-change those participating in both r-help and StackEx-change are more activebull knowledge providers active in both communities answer

questions significantly faster on StackExchange than onr-help and their total output increases after the transi-tion to StackExchange

Apart from uncovering interesting phenomena these find-ings reveal that knowledge management in open source ischanging and that the different socio-technical incentivesoffered by QampA sites such as StackOverflow positively in-fluence participation in these communities facilitate usercontributions and foster productivity Therefore they couldserve as inspiration for start-up open source or commercialprojects looking to establish user support platforms or bystakeholders interested in knowledge management in gen-eral

The rest of the paper is organised as follows We first focuson the background underlying knowledge-sharing in opensource and present our research questions Then we de-scribe the data set and data gathering process followed byour methods results and concluding sections6httpgoogl0i4j1nhttpgoogl6h8bMa7httpgooglaeoBJAhttpgooglOtzZT5

BackgroundThe importance of user support for the adoption growth andsuccess of open source projects is well-recognized [32035]Traditionally user support was organised through mailinglists forums user groups etc However as new venues ofinformation and tools for information access are emerging(eg weblogs wikis social QampA sites) peoplersquos online in-formation seeking behaviour is also evolving [1229] Whenit comes to user support activities around open source soft-ware though what is common among both traditional andnew venues of information is that this ldquomundane but neces-sary taskrdquo [20] is typically carried out by unpaid volunteersOften the developers themselves take part in these activitiesbut thriving open source projects also succeed in enlistingsome of their users to offer assistance to peers But what mo-tivates knowledge providers to answer other peoplersquos ques-tions Together with online information seeking behavioursthese motives are also evolving

In traditional information-sharing venues knowledge-providers participate for reasons related to albeit differ-ent than those of developers contributing to open sourceDevelopers contribute for reasons such as a direct needfor the software enjoyment of the work itself or theenhanced reputation arising from high-quality contribu-tions [20] Knowledge-providers on the other hand are re-portedly motivated by learning about problems other usersare experiencing an enhanced feeling of being part of a com-munity personal benefits of learning through teaching anenhanced likelihood of receiving help in the future or a senseof obligation from having received help from others in thepast [20 24]

As social QampA sites eg the StackExchange network arebecoming more popular instances of what the economic lit-erature calls ldquosignalling incentivesrdquo start to better explainwhat motivates knowledge providers to help others [21] inaddition to the previous reasons Career concerns (eg on-line activities in social QampA sites are more visible to em-ployers and recruiters who may use them to find qualifiedpeople [8]) or a desire for peer recognition (eg reputationbuilding) are among the listed motives [202124] Howeversuch signalling incentives may not be equally applicable toall knowledge providers For example we can expect thatmore knowledgeable providers would draw greater benefitfrom signalling and thus signal more Therefore to obtaina more fine-grained understanding of the transition from tra-ditional information venues to social QampA it is importantto distinguish between different groups of participants (egdevelopers with likely more knowledge about the softwareversus non-developers)

Another dimension of social QampA participation incentivesarises from gamification ie using elements of game de-sign in non-game contexts [10 11] Gamification has beenwidely used in online platforms in recent years where itwas shown to motivate users to contribute more [121042]and there are many psychological theories that can explainwhy [40] The blueprint8 that all current implementations ofgamification (including all StackExchange sites) follow as-8httpcodingconductccMeaningful-Play

sumes stimulating users to perform a desirable activity byawarding them points Specifically users (i) are rewardedwith points to encourage the desired behaviour (and may besubtracted points to sanction undesired behaviour) (ii) areawarded badges after collecting sufficiently many pointsor when performing certain activities and (iii) have theirprogress tracked and their achievements displayed publiclyin a leaderboard to create competition between them OnStackExchange sites the goal of ldquothe gamerdquo is to have par-ticipants teaching and learning from each other9 by askingrelevant questions and providing helpful well-documentedand clear answers Users receive reputation points when fel-low users vote up their questions and answers In turn rep-utation points or directly generating good content translateto badges (eg ldquoFamous Questionrdquo if a question received10000 views or ldquoLegendaryrdquo if the user earned 200 reputa-tion points daily 150 times) Top performing users in termsof their reputation count (either at week month quarteryear or overall level) are displayed in public leaderboardsExceeding various reputation thresholds grants users addi-tional privileges on StackExchange sites including at thehigher end the privilege to help moderate the site

Therefore coming back to our study of mailing lists it seemsthat the different incentives inherent to social QampA siteshave the potential to ldquodisruptrdquo mailing list activity and cat-alyze a transition to social QampA for knowledge seekers andknowledge providers alike Seekers may turn to StackEx-change expecting faster answers and a wider expert audi-ence Providers may transition to it to satisfy a rising demandfor information or pursuing signaling incentives and recog-nition The latter is for instance supported by ldquoalpha-malerdquobehaviours self-reported by Apache help forum knowledgeproviders interviewed by Lakhani and von Hippel [20] thosewanting to be known as ldquotherdquo expert in a particular aspectof Apache ldquowould strive to answer all questions associatedwith their areardquo and seek to ldquodrive out all other informationproviders from [that] chosen field of expertiserdquo In the pres-ence of gamification such as on StackExchange sites suchbehaviours may be even further accentuated For exampleStackOverflow is said to suffer from the fastest gun in theWest problem10 to maximise their chances of collectingup-votes from their peers participants would race to answerquestions as quickly as possible rather than as correctly oras exhaustively as possible The said problem is based onthe anecdotal belief that given two answers of comparablequality it is the earliest that would typically receive the mostvotes or that one might refrain from answering altogether ifsomeone else already offered a similar solution

Research QuestionsIn this paper our goal is to analyze a mature and vibrantcommunity where knowledge-transfer is transitioning froman older mailing list modality to a new social QampA modal-ity and understand some of the effects of that transition Wechose the R software community

9httpgoogl4kPzBP10httpmetastackoverflowcomq9731182512

Research Question 1 Can we find evidence in the Rcommunity for a decreasing popularity of the mailinglist and an increasing popularity of StackExchange

RQ1 Discussion R [28] a popular data analysis softwaremakes for an interesting case study since there have beeninitiatives by members of its community to promote a tran-sition to StackExchange11However to quantitatively assessthis phenomenon we first need to construct a historical dataset combining mailing list and StackExchange activity forR and identify participant overlap (if any) Given such adata set we can then evaluate activity in the two commu-nities and look for transition phenomena To the best ofour knowledge we are the only ones creating and analysingsuch overlapping data sets between StackExchange and opensource communities [38]

In addition to the potentially different motivations drivingparticipants other factors may influence these phenomenaThe R mailing lists have been around since 1997 whileStackOverflow the first of the StackExchange sites wasonly launched in 2008 The relative maturity of the for-mer compared to the latter as well as the direct contactwith R developers may mean that mailing lists are seen aswell-established ldquoeducational institutionsrdquo by R users iethe default go-to place for requesting R support Moreovernot all mailing list discussions are equally as well suitedfor StackExchange hence not all are at risk of being sub-sumed For example StackExchange discourages questionsthat are too broad or primarily opinion-based12 Such ques-tions may therefore only be suitable for the mailing lists Inother words despite addressing some mailing lists limita-tions (eg pertaining to user interface) StackExchange maynot be a disruptive influence

On the other hand R users asking questions on StackOver-flow may be given answers referring to earlier posts fromr-help13 This suggests that mailing lists may not be aswell indexed by search engines (hence solutions posted theremay not appear as high up in the results list) or that themindset of users (asking directly on a StackExchange site) isaltogether changing in disfavour of the mailing lists In ad-dition to the factors above the two communities also showsigns of symbiosis For example StackOverflow discussionsare being followed by R developers and bug reporters whouse the feedback received from this community to stir updevelopment discussions on r-devel14 the other mainmailing list for R dedicated to developers The varying andsynergistic activities of developers and knowledge-providerson mailing lists and QampA sites suggest that some develop-ers and knowledge-providers in R are splitting their time be-tween the two while others perhaps are not Thus R aproject with a knowledge-exchange in transition includessome participants who stay with the old some who are intransition and some who go entirely with the new It would

11httpgooglaeoBJAhttpgooglOtzZT512httpmetastackoverflowcoma1058213httpstackoverflowcoma1996404httpstackoverflowcoma4947528

14httpgoogl6UAUgrhttpstackoverflowcoma1321491

be important to understand if this transition is in fact tak-ing place and to what degree Community processes for an-swering user questions on-line are a vital component of theindustry and it is important to understand and promote thisimportant process

Research Question 2 How do the contributors activeboth on the mailing list and on StackExchange differfrom those focused on a single community (either of thetwo) in terms of their activity levels

RQ2 Discussion The different socio-technical incentivesinherent in social QampA (eg related to user interface po-tentially wider audiences and gamification) may result inStackExchange becoming more attractive ie taking over(part of) the mailing list activity and engaging (part of) themailing list community Previous studies (eg [8]) arguethat onersquos activity in social media can be used as a signal ofqualifications and similarly onersquos reputation in social mediaas a signal of references from peers Are there mailing listparticipants active also on StackExchange sites Who arethey and how do they differ (in terms of their activity levels)from other mailing list or StackExchange participants whochoose to focus solely on the mailing list or solely on Stack-Exchange Are developers more likely to be active both onr-help and on StackExchange (ie to ldquosignalrdquo more) thannon-developers

Research Question 3 For the contributors active bothon the mailing list and on StackExchange can we finddifferences in their behaviour in one community versusthe other

RQ3 Discussion If StackExchange sites are attracting mail-ing list participants (eg knowledge seekers looking forfaster answers or a wider expert audience or knowledgeproviders looking to satisfy a demand for knowledge ldquosig-nalrdquo their ldquofitnessrdquo as experts or engage in the game for rep-utation and badges) we should observe a decreasing trendin the mailing list activity concomitant with an increasingtrend in StackExchange activity for those participants (espe-cially knowledge providers) active on both Moreover gam-ification one of the key differences between r-help andStackExchange sites might influence r-help participantsactive on StackExchange to adopt a different behaviour onStackExchange [1 2 10 42] For example in order to ldquosur-viverdquo in the game for reputation and badges they might haveto provide faster answers than they do on the mailing listCan we observe such a trend Do participants active bothon r-help and on StackExchange behave differently whenon the mailing list than when on StackExchange sites Ifanswers are indeed quicker and the speed is somewhat at-tributable to ldquogame-playingrdquo this might confirm that gamifi-cation is a useful adjunct in the design of QampA fora

RELATED WORKOur work touches upon different fields of research Firstmail archives have been studied since as early as 2006 (seerecent review by Squire [33]) eg to understand knowledgesharing [32] and how users of the software get help [31]

Bettenburg et al [5] identified challenges that arise when us-ing off-the-shelf techniques for processing mailing list dataSingh et al [31] and Guzzi et al [16] qualitatively studiedthe types of discussions that occur in mailing lists the typesof questions asked and the types of responses that are givenFor example Singh et al [31] reported that about 90 of thequestions being asked in the online support forums of a num-ber of open source projects are related to problem solvingand information seeking while the remaining can be classi-fied as social discussions or feature requests

Second StackExchange (in particular StackOverflow) is re-ceiving increasing attention from researchers as witnessedby the growing number of papers using StackExchange datapublished each year15 as well as the yearly mining chal-lenge of the International Working Conference on MiningSoftware Repositories (MSR) choosing StackOverflow as itstopic in 2013 For example Anderson et al [1] investi-gated the dynamics of the StackOverflow community andfound eg significant assortativity in the reputations of co-answerers or relationships between reputation and answerspeed Vasilescu et al [36 37] studied the representative-ness and activity of genders on StackOverflow compared totraditional mailing lists and found StackOverflow to be a rel-atively ldquounhealthyrdquo community in which women disengagesooner although their activity levels are comparable to menrsquosAlso related to the current topic is our previous study [38]of the interplay between StackOverflow QampA activities andthe development process reflected by code changes commit-ted to GitHub by developers also active on StackOverflowThere we found that the more prolific StackOverflow ex-perts (ie those providing the most answers) are also veryactive GitHub committers and that in general participatingin StackOverflow catalyses committing to GitHub

Third numerous studies focused on the motivations of par-ticipants in knowledge-sharing (online) communities Forexample Lin [22] analysed the effects of extrinsic and intrin-sic factors in explaining the knowledge sharing intentions ofa number of employees from fifty large organisations in Tai-wan Hendriks [17] proposed a theoretical model linking thevariables and motivators involved with sharing knowledgeusing ICT Sowe et al [32] discussed the altruistic sharing ofknowledge between participants in the Debian mailing listsDeterding et al [10] presented a number of views on howgamification impacts participation of users in online commu-nities Zhuolun et al [42] studied the value of using badgesin StackOverflow and found that this reward system helpscultivate usersrsquo loyalty to the community

Finally R has been the object of two recent studies [13 41]German et al [13] found differences in the growth rate aswell as the way in which active contributors are attracted be-tween user-contributed packages and core packages Voul-garopoulou et al [41] analysed the quality of 508 R pack-ages and found eg that changes in social attributes such asthe number of developers do not influence the code quality

METHODOLOGYAs case for our case study we selected R [28] a popular dataanalysis software for several reasons First R is a typical

15httpmetastackoverflowcomq134495

Source Period messages participants Multiplethreads unique aliases

r-help 41997ndash 344854 33338 1532013 97125 28096

StackOverflow [r] 92008ndash 67248 10534 232013 24957 10284

CrossValidated [r] 72010ndash 7351 2312 132013 3208 2285

Table 1 Basic statistics for the two data sources

example of an open-source software ecosystem comprisinga (relatively) closed core of developers providing the basicfunctionality and coordinating new releases various devel-opers contributing patches and bug fixes numerous devel-opers contributing packages (plugins) that extend the func-tionality way beyond that provided with each release anda plethora of users Other examples of such ecosystemsinclude the Eclipse IDE and its third-party plugins or thePythonRubyLaTeX programming languages and their vari-ous contributed packagesgems Second R has been evolv-ing for almost 20 years and its entire history of mailing listcommunication is archived and publicly available ThirdR has been the recent subject of an extensive study of itsevolution [13] Fourth R promises to provide broader rele-vance outside the software developersrsquo community as manyusers and contributors are data analysts from different do-mains such as economics or biology with none or limitedsoftware engineering experience

Data extraction We integrated data from two differ-ent sources the community behind the main R user sup-port mailing list (r-help) and the StackExchange Rsubcommunities behind StackOverflow and CrossValidatedthe StackExchange websites containing the most R-related([r]-tagged) questions and answers

r-help is the principal mailing list for user support in RIt hosts discussions about problems and solutions using Ras well as announcements about the availability of new func-tionality and documentation Moreover it mirrors announce-ments from r-packages on new or enhanced contributedpackages or major developments from r-announce Theother major mailing list for R is r-devel targeting devel-opers testers and bug reporters with topics that are consid-ered ldquotoo technicalrdquo for r-helprsquos audience The r-helparchives can be downloaded in standard mbox format16 Wewrote Python scripts to automatically download extract andparse the archives which date as far back as April 1997

For each r-help participant we recorded their role (i) coredeveloper ie the 22 developers with ldquowrite access tothe R sourcerdquo since its inception17 (ii) peripheral devel-oper ie the 43 developers who ldquocontributed by donat-ing code bug fixes and documentationrdquo17 (iii) packageauthormaintainer ie the 2617 developers maintaining orhaving authored packages on CRAN the largest R packagerepository18 (iv) user ie those not fitting in any of theprevious categories

16httpwwwr-projectorgmailhtml17 httpwwwr-projectorgcontributorshtml18Comprehensive R Archive Network httpgooglhHJROZ

StackExchange is a network of QampA sites started in 2008now comprising more than 100 sites on different topics suchas English language video gaming photography or parent-ing All have similar look and feel and function by the sameprinciples participants can ask and answer questions ques-tions are organised by tags (eg [r] for R-related ques-tions) and voted upon by members of the community withvotes translating to reputation points and badges questionsand answers can be edited and improved by other memberswith sufficiently high reputation each participant has a ded-icated profile page combining the representation of oneself(self-determined eg name location or personal website)with activity-related information (automatically providedeg a list of the answers provided the total reputation countor a list of badges) users can subscribe to tags and receiveemail notifications when new questions are being asked (lesspractical for high-volume tags such as [c] [java] oreven [r] the latter receiving around 500 questions per weekat the time of writing) or simply browse the website

StackExchange releases quarterly data dumps in XML for-mat from all the websites under its umbrella We exploredthe one dated March 201319using Python scripts to parse theXML archives We restricted our analysis to StackOverflow(a generic programming QampA site and the first and largestStackExchange member) and CrossValidated (dedicated tostatistics) This left out a number of other StackExchangewebsites where questions tagged [r] occasionally (ie veryinfrequently) pop up such as TeX or GIS as well as all theactivity on StackOverflow and CrossValidated which did notoccur within the [r] tag The oldest StackOverflow ques-tions tagged [r] are dated September 2008 CrossValidatedhosts [r]-tagged questions dating as far back as July 2010The [r] activity on both websites mainly targets R users asopposed to R developers (hence the comparison to r-helpis sound) although R developers may also follow it as dis-cussed in the introduction

The organization of StackOverflow and CrossValidated interms of questions answers tags badges and reputationpoints is the same with the only difference being the targetaudience programmers in the former vs data analysts in thelatter Therefore we expect that differences between Stack-Overflow and CrossValidated should not affect data collec-tion analysis and interpretation Table 1 lists some basicstatistics about the data sources

Identity merging One of the biggest challenges when min-ing software repositories is identity merging [6 14 19 39]Both within a single repository (eg mailing lists) aswell as across repositories (eg mailing lists and Stack-Exchange) the same person may use different aliases iedifferent (name email) tuples For instance John Smithmay go by (John Smith johnsmithgmailcom) (Johnjohnsmithcom) etc The extent of the problem is unpre-dictable eg one of the Gnome developers reportedly used168 different aliases in the source code repository [39]

A solution is to merge identities and existing approaches arevery diverse For example Bird et al [6] try to match fullnames or email addresses shared by different aliases and

19httpwwwclearbitsnettorrents2121-mar-2013

use heuristics to ldquoguessrdquo email prefixes based on combina-tions of name parts (eg jsmith is likely to belong to JohnSmith) Kouters et al [19] use Latent Semantic Analysis(LSA) a popular information retrieval technique and reportbetter results in presence of very noisy data However all ex-isting approaches are known to produce false positives andfalse negatives [14] We followed different approaches forr-help and for StackExchange

StackExchange The email addresses of the StackExchangeparticipants are not publicly available for privacy reasonsbut their MD5 hashed versions are offered instead Since it ishighly unlikely for two different email addresses to share anMD5 hash we performed identity merging on the StackEx-change data if two MD5 hashes coincided This process re-sulted in a reduction in the number of participants of approx-imately 2 on StackOverflow and 1 on CrossValidated

We decided to limit the identity merging solely to the caseabove due to the following reasons First of all on StackEx-change as opposed to the mailing lists users have accountsand can log in using a password This suggests that the inci-dence of multiple aliases (ie accounts) by the same personshould be much lower than on the mailing lists where one ismore likely to send an email eg from the account which ismost at hand at any given time While it is common for peo-ple to own multiple email accounts (eg employer-relatedopen-source-related private etc) we believe it to be lesscommon for the same person to own multiple accounts onthe same StackExchange site (at least because one cannot in-tegrate the activity and reputation points earned using each)Moreover since the email addresses are not publicly avail-able identity merging would rely solely on names increas-ing the likelihood of false positives

Mailing list For r-help we performed a fixed-point computation for each email address we (i)collected all other email addresses having the sameprefix (eg for johnsmithgmailcom we collectedjohnsmithhotmailcom) as long as the prefix did not con-sist of a single word (ie contained either a dot or anunderscore character) and (ii) collected all the differentnames associated with it (eg John Smith and John if bothused the email address johnsmithgmailcom) Then foreach of these names we collected the different email ad-dresses with which they were associated (eg John Smithmight be associated with both johnsmithgmailcom andjohnnygmailcom) and repeated the process until a fixed-point was reached Checks for a minimal length of the emailprefixes and names were used to limit the number of falsepositives Applying this technique resulted in a reduction inthe number of participants of approximately 15 Amongthe users with the most aliases were eg an active partici-pant with ten different emails under the same name or oneof the peripheral R developers with two email addresses andnine different variations of his name

Intersecting the two data sets A prerequisite for study-ing migration phenomena from the mailing lists to StackEx-change was computing the overlap between the two commu-nities First we made use of the fact that email addresses areavailable for the mailing list participants as opposed to only

MD5 hashes for the StackExchange users As a result wemerged a mailing list user with one on StackExchange if theMD5 hash computed for any of the formerrsquos email addresses(there might have been multiple after identity merging on themailing list) was identical to the MD5 email hash of the lat-ter Then to expand the merges beyond only matching emailaddresses we followed a fixed-point approach similar to theone described above using names and email address prefixes(eg John Smith from StackExchange with John Smith fromr-help despite the latter not having an email address thatmatches the formerrsquos MD5 hash) As a side effect at thisstep two StackExchange users could have been merged tothe same r-help individual hence also among themselves(in addition to the merges using MD5 hashes discussed inthe previous paragraph) if their MD5 hashes matched emailaddresses known to belong to this r-help individual

Using this approach we found that approximately 15 (or3894) of the r-help unique participants (ie after identitymerging) have StackOverflow accounts and 5 (or 1159)have CrossValidated accounts20 These numbers are influ-enced by the difference in age between the two communi-ties since r-help started in 1997 while StackOverflowonly in 2008 and CrossValidated in 2010 part of the over-all mailing list population is not active anymore as observedalso for other open source projects [7] The size of the over-lap increases (eg to slightly over 20 for StackOverflow)when considering only the r-help participants active start-ing with September 2008 Variations may also occur be-tween different user groups (eg knowledge seekers andknowledge providers) For example activity of open sourcecontributors typically follows very skewed distributions [39]ie most have very few contributions Therefore it seemsmore likely that active knowledge providers would partici-pate in StackExchange than one-time knowledge seekers

Comparing apples and oranges One of our goals is under-standing how the amount of activity differs between mail-ing lists and StackExchange and whether mailing list par-ticipants are migrating to StackExchange However activ-ity is organised in different ways in the two communitiesOn StackExchange users ask questions and typically receiveseveral answers from other members of the community An-swerers ldquocompeterdquo with each other at offering good answersas reflected by the up votes received from other members ofthe community If a question was ill-formulated eg as sig-nalled by other users through comments the original posterhas the option of editing and improving it any number oftimes Similarly if the answers were incomplete the answer-ers or any other members of the community have the optionof updating them Questions may remain unanswered Ac-knowledgements for answers that solve the original problemcan be made by the original poster by accepting one answerusing a dedicated button andor posting comments to the oth-ers (we do not consider comments) The standard structureof a QampA post is therefore one question followed by zero ormore answers typically offered by different people

20Having a StackExchange account does not guarantee having en-gaged in QampA there hence a smaller fraction will have actuallybeen active on StackExchange

The equivalent communication structure on mailing lists isrepresented by discussion threads The first new messagewith a particular topic (subject) is the one starting the thread(ie the question) Similarly to StackExchange it can re-main unanswered or it can receive responses identifiable asmessages sent In-Reply-To the original message or to othermessages within the same discussion thread However asopposed to StackExchange the original poster and the an-swerers can all appear multiple times within a thread

Therefore to avoid comparing apples to oranges whencomparing activity on the two communication media we(i) grouped related email messages into discussion threadsoperation known as threading and (ii) discounted multipleanswers by the same person and discounted the originalposter as answerer within the same thread Due to numerouscaveats threading is a non-trivial operation typically takenlightly in the literature We used a slightly modified versionof Zawinskirsquos algorithm21 (based on Kuchlingrsquos Python im-plementation)22 to the best of our knowledge the leadingpublicly-available threading algorithm The number of dis-tinct threads obtained for r-help is displayed in Table 1For the StackExchange websites the number of threads isthe same as the number of questions by definition

User survey To better understand the context of our re-search related to changes in information seeking and infor-mation providing behaviours in the R community we aug-mented the quantitative study with a user survey23used to tri-angulate the quantitative findings We asked the participantsfor background information about their occupation experi-ence with R and involvement in the development of R (egas core developers package maintainers users) their infor-mation seeking behaviour (eg how frequently they need in-formation where and how they search for it how satisfac-tory what they find is what their preferred information seek-ing medium is and whether anything has changed in theirseeking behaviour over the years) and their information pro-viding behaviour (how often and by what means they shareinformation with others and what motivates them to do so)Survey participants were recruited by posting an ad aboutthe questionnaire on various channels such as r-helpthe StackOverflow Meta (the site for meta-discussion of theStackExchange family of QampA websites where the main-tainers of StackExchange also hang out) the StackOverflowR chat room Google groups other French Russian or Ger-man fora our own LinkedIn Twitter Google+ or Facebookprofiles or by contacting R developers directlyRESULTS

Overall knowledge seeking activity We start by analysingthe activity on r-help (Figure 1) After an initial in-crease in the number of threads started (questions asked)each month we observe drops in recent activity (since 2010)It is interesting to note that the 2010 inflection point is syn-chronised with initiatives to promote StackExchange amongR users24 as illustrated by this excerpt from a blog entry25

21httpwwwjwzorgdocthreadinghtml22httpsgithubcomakuchlingjwzthreading23httpgooglmZtz9X24httpgoogl0i4j1nhttpgoogljSZQFZ25httpgoogl12nJxe

1998 2000 2002 2004 2006 2008 2010 2012Date

rminushelp

StackOverflow

CrossValidated

Number of questions per month on rminushelp and StackExchange

Figure 1 Number of questions asked (threads started) each monthon r-help and StackExchange (StackOverflow and CrossValidated)in the [r]-tag The trend curves are Loess curves with 05 span

ldquoYoung experts donrsquot want to have to monitor email all day tobe part of the discussion Their answers belong on a websitewith a normal content management system with good searchfunctions and user interactions Go [to StackExchange and]sign uprdquo On the other hand [r] activity on StackExchangeand in particular StackOverflow (Figure 1) grows at an in-creasing rate The fraction of [r] questions relative to allquestions grows linearly26(ie [r] gains popularity)

RQ1 Knowledge seeking on r-help decreasessharply since around 2010 At the same time the num-ber of R-related questions asked on StackOverflow andCrossValidated grows at an increasing rate

Structure of the community Recall that we have distin-guished between different roles within the r-help popula-tion Here we wish to understand the knowledge seeking andknowledge providing activities of the different roles Devel-opers (mostly package authorsmaintainers) are responsiblefor starting only a small fraction of the discussion threads(approximately one tenth in recent years) as depicted in Fig-ure 2 top Non-developers the vast majority of questionaskers are most responsible for the decreasing activity onr-help since 2010 The situation is reversed for thread an-swerers (Figure 2 bottom) The R core developers and thepackage authorsmaintainers are responsible for most repliesto threads started on r-help However since the decreasein new r-help threads started in 2010 the answering activ-ity of developers (package authorsmaintainers in particular)has also decreased the most Both users and developers maybe tempted by StackExchange eg one in search of betteror faster answers the othermdashof recognition

Activity of contributors engaged in both communitiesIn this section we focus on those mailing list participantswho were also active on StackExchange To control for thedifference in age between r-help (started in April 1997)and StackExchange (started by StackOverflow in September2008) we only consider the mailing list participants startingwith September 2008 While we were able to link approx-imately 20 (from September 2008 or 15 overall) of ther-help participants with StackExchange accounts (either

26httpgooglgWKxbQ

1998 2000 2002 2004 2006 2008 2010 2012Date

)Roles

Others

Package authorsmaintainers

Peripheral developers

Core developers

Breakdown of new rminushelp threads (questions) per role

1998 2000 2002 2004 2006 2008 2010 2012Date

Others

Core developers

Breakdown of rminushelp thread answers per role

Figure 2 Breakdown of new threads (top) and thread answers (bottom)on r-help by initiator role Only the trend component of each time se-ries is displayed after using a seasonal-trend decomposition procedurebased on Loess The top plot corresponds to Figure 1

StackOverflow CrossValidated or both) we also observedthat not all of them have actually engaged in [r]-taggedQampA on the two StackExchange sites Indeed only 78(or 1293 out of 16569) have asked or answered at least one[r]-tagged question on StackOverflow or CrossValidated

Different population subgroups again exhibit different be-haviour (Figure 3) the fraction of participants withStackExchange accounts is much higher among developers(core peripheral or package authorsmaintainers) than non-developers (left) within those with StackExchange accountsdevelopers are also more likely to have been active (ie tohave engaged in QampA) than non-developers (right) Thegroups of core and peripheral developers are too small toenable statistically significant comparisons (eg we linkedonly 8 out the 18 core developers active on r-help sinceSeptember 2008 with StackExchange accounts out of theseonly 5 asked or answered at least one question on Stack-Exchange) However when considering the core periph-eral and package developers together the differences in join-ing and contributing to StackExchange between them andthe users become statistically significant and the effect sizespractically significant (Table 2) For example developers(183 active out of 331 with StackExchange accounts) havebetween a 134 and 166 times higher chance of being activeon StackExchange than users (second row)

The more things you do the more you can do Activelycontributing to both communities requires dividing onersquostime between them Do mailing list participants contributemore or less to r-help if they are also active on Stack-Exchange To answer this question we recorded for eachr-help participant a flag indicating whether or not he orshe was active on StackExchange in the [r] tag the num-

All(333416569)

Core(818)

Package(312899)

Periph(1121)

Users(300315631)

on SE not on SE

All(12933334)

Core(58)

Package(173312)

Periph(511)

Users(11103003)

active inactive

Figure 3 Left Fraction of r-help participants (starting with Septem-ber 2008) with StackExchange accounts by role Right Among thosewith StackExchange accounts the fraction that asked or answered atleast one question The absolute values are displayed under each label

Outcome Devs (D) Users (U) χ2 (p-val) D-U (CI) RR (CI)Have SE 331938 300315631 14128 16 183accounts (3528) (1921) (lt22E-16) (13 19) (167 201)

Active on 183331 11103003 4139 18 149SE (553) (369) (12E-10) (12 24) (134 166)

Table 2 Results of statistical testing for the Figure 3 cases D-U (CI)Estimate and 95 confidence intervals for the difference of propor-tions RR (CI) Risk ratio by unconditional maximum likelihood esti-mation and 95 confidence intervals using normal approximation

ber of threads started and the number of threads answeredon the mailing list To control for differences in age weagain restricted the counts to posts after September 2008Depending on their r-help activity types three groups ofparticipants emerged (Table 3) (i) those who only startedthreads (ii) those who only replied to threads and (iii) thosewho both started and replied to threads For each group wecompared the amount of activity (number of threads startedfor (i) and (iii) and number of threads answered for (ii) and(iii)) between those active and those inactive on StackEx-change using the Wilcoxon-Mann-Whitney test (since activ-ity is not normally distributed) In all cases the results re-veal that the mailing list participants who also contributedto StackExchange are more active than those who did notFor example the median number of threads answered is 4among those who ask and answer on r-help and engageon StackExchange compared to 1 for the others

Next we compared the StackExchange activity for r-helpparticipants also active there to that of the rest of the Stack-Exchange [r] community which were more prolific Sim-ilarly we distinguished between exclusive askers exclusiveanswerers and users who both asked and answered ques-tions (Table 4) Similar comparisons using Wilcoxon-Mann-Whitney tests show that among the StackExchange [r] pop-ulation those who also participate in r-help are more ac-tive For example the median number of [r] answers forthose who ask and answer on StackExchange and contributeto the mailing list is 3 as opposed to just 1 for the others

These results show that the most prominent mailing list par-ticipants ldquosignalrdquo the most both on the mailing list and onStackExchange The group of users who participate in theR mailing lists and engage in [r] QampA on StackExchangeare the most active contributors relative to the other mail-ing list participants and similarly the most active contrib-utors relative to the other StackExchange participants Thisresult is in line with a recent observation made by Posnettet al [27] in their study of expertise on StackExchange

Activity onr-helpSE

No account orinactive (N)

WMW comparison (p-val) Active(A)

Only ask 11724 Asking AgtN (47E-14) 790Only answer 1413 Answering AgtN (42E-3) 123Ask answer 2139 Asking AgtN (283E-8) 380

Answering AgtN (lt22E-16)Total 15276 1293

Table 3 Activity comparisons for three groups of r-help participantsalso active on StackExchange relative to the other r-help participants(Wilcoxon-Mann-Whitney tests with Benjamini-Hochberg correction)StackExchange = StackOverflow + CrossValidated

Activity onSE r-help

Inactive (N) WMW comparison (p-val) Active(A)

Only ask 5464 Asking AgtN (525E-5) 794Only answer 2824 Answering AgtN (lt22E-16) 338Ask answer 1470 Asking AgtN (324E-12) 564

Table 4 Activity comparisons for three groups of StackExchange con-tributors also active on r-help relative to the other StackExchangeparticipants (Wilcoxon-Mann-Whitney tests with Benjamini-Hochbergcorrection) StackExchange = StackOverflow + CrossValidated

ldquoexpertise is present from the beginning [of onersquos participa-tion] and doesnrsquot increase with time spent with the commu-nity [] In other words experts join the community asexperts and provide good answers immediatelyrdquo Assumingthe number of answers one provides to be a proxy for theirexpertise we showed eg that StackExchange experts (ieheavy answerers) stem from mailing list experts Recentlywe made a similar observation about GitHub developers ac-tive on StackOverflow [38] those who provide the most an-swers are also those who perform the most commits

RQ2 The more things you do the more you can doContributors to both communities (more likely devel-opers than non-developers) are more active than thosewho focus on just one

Behavioural differences We compared the question an-swering activity on r-help for two groups of knowl-edge providers those with (denoted ldquor-help and Stack-Exchangerdquo) and those without (denoted ldquoonly r-helprdquo)StackExchange accounts (Figure 4) For each group we plot-ted the total number of answers given to r-help threadseach month (denoted ldquoOn r-helprdquo) for the ldquor-help andStackExchangerdquo group we also plotted the number of an-swers given to StackExchange questions each month (de-noted ldquoOn StackExchangerdquo)

We draw the following observations The inflection point(mid 2010) in the number of answers given to r-helpthreads by the ldquor-help and StackExchangerdquo group coin-cides with the inflection point in general activity trend onr-help depicted in Figure 1 top This should not come as asurprise In the previous sections we have seen that it is thosemailing list participants that are also active on StackEx-change that are most active ie the activity drivers or trendsetters (both relative to the other mailing list participants aswell as to the other StackExchange participants) Thereforeit is natural that they also exhibit a decreasing trend in ac-

2009 2010 2011 2012 2013Date

On rminushelp(rminushelp and StackExchange)

On StackExchange(rminushelp and StackExchange)

On rminushelp(only rminushelp)

Activity of rminushelp answerers

Figure 4 Number of questions answered on r-help (after Septem-ber 2008) and StackExchange each month participants exclusive to themailing list versus those also active on StackExchange Only the trendcomponent of each time series is displayed after using a seasonal-trenddecomposition procedure based on Loess

tivity since mid 2010 However it is interesting to observethat the activity trend of the r-help knowledge providerswithout StackExchange accounts remains relatively constant(or decreases much slower) Corroborated with the increas-ing trend in answering activity on StackExchange by mailinglist participants active on StackExchange this suggests thatR mailing list experts are migrating to StackExchange

However not all members of the ldquor-help and StackEx-changerdquo group exhibit similar patterns (Figure 5) For exam-ple participant 3513 (topmost subfigure) is the norm single-handedly responsible for approximately one fifth of the totalnumber of answers he exhibits a similar decreasing trendin activity as the entire group In contrast participant 9440(second topmost subfigure) has migrated entirely to Stack-Exchange in the second half of 2010 Participant 209 (secondbottommost subfigure) has been active on both r-help andStackExchange ever since the beginning albeit not as inten-sively on StackExchange over the years he has become in-creasingly disinterested in r-help without becoming moreinterested in StackExchange Finally participant 9859 (bot-tommost subfigure) is barely active on StackExchage but isbecoming increasingly more active on r-help

Next we investigated whether there are significant differ-ences between the speed with which r-help participantsalso active on StackExchange answer questions on r-helpversus on StackExchange If the incentives on StackEx-change (eg gamification more attractive user interface)do not influence behaviour then we should not observe anymeaningful differences If instead users are engaging in therace for reputation and badges then they might try to an-swer more questions (we have already seen this in the pre-vious section) as well as answer questions faster on Stack-Exchange than on the mailing list where they are not ldquore-wardedrdquo for their haste To test this hypothesis we com-pute for each member of the ldquor-help and StackExchangerdquogroup the time intervals between their first answer within athread and the thread start (for all threads for which theyprovide answers) on the one hand and between their firstanswer to a StackExchange question and the question date(for all questions they answer) on the other hand Then us-ing a Wilcoxon-Mann-Whitney test we compare two largegroups of r-help and StackExchange time deltas obtainedby concatenating the intervals for all ldquor-help and Stack-Exchangerdquo members (Figure 6)

2008 2009 2010 2011 2012 2013Date

Participant 3513 rminushelp

StackExchange

2009 2010 2011 2012 2013Date

StackExchange

2010 2011 2012 2013Date

StackExchange

2008 2009 2010 2011 2012 2013Date

StackExchange

Figure 5 Different patterns of co-participation in r-help and Stack-Exchange for knowledge providers (Loess curves with 05 span)

Our results show that the StackExchange answers are sig-nificantly faster (p lt 22E-16) with a median of 47 minutesversus 3 hours on r-help This confirms that r-help par-ticipants behave differently when on StackExchange wherethey are rewarded for their efforts more To further put theseresults into context note that on mailing lists a knowledgeprovider is passively (automatically) provided with oppor-tunities to answer questions (by receiving emails with newquestions directly or in the form of a digest) on StackEx-change although subscribing to tags (to receive notificationsof new questions) is possible knowledge providers will fre-quently actively pursue new questions being asked by brows-ing the site27and will rush to answer them

RQ3 Knowledge providers active both on r-helpand on StackExchange (ie the mailing list experts) aremigrating to StackExchange where they answer ques-tions significantly faster than on the mailing list

TriangulationIn this subsection we compare the quantitative findings withthe user survey results We received 115 responses Onerespondent participated twice and two respondents did notconsent to participate in the survey leaving 112 valid re-sponses mostly from academics (32 of the respondents)statisticians (35) students (14) and software engineers

On rminushelp On StackExchange

Speed of answers for rminushelp participants active on StackExchange

1 hour

1 month

1 year5 years

Figure 6 Answer speed for r-help users active on StackExchangeBeans corresponding density shapes Longer horizontal lines mediansper group Dashed line median over all groups

(6) Most respondents described themselves as R users(51) or authors or maintainers of R packages (35) Wealso received two responses from the core developers Westress that the survey was not intended for quantitative anal-ysis rather it served to triangulate the earlier findings

Our first empirical finding was the evidence of a transitionin seeking support from the mailing list to StackExchangewhile activity on r-help decreases sharply since around2010 the number of R-related questions asked on StackEx-change grows at an increasing rate In order not to imposeour perception of transition on the survey participants weopted for an open question and asked whether the partici-pants have experienced changes in their information seekingbehaviour over the years On the one hand we observedthat all survey participants reporting changes related to themailing lists indicated disengagement eg ldquoGoogle is get-ting better at finding answers related to R so I use it moreI rely less on going directly to mailing lists nowrdquo (user anddocumentation contributor statisticiandata analyst 6 yearsof experience) or ldquor-help used to be very helpful Butas the number of posts has gone up I find that reading it isnot as useful as it had beenrdquo (package maintainer academic8 years of experience) On the other hand when the surveyparticipants reported changes related to StackExchange sitesthey are much more positive eg ldquoStackExchange has be-come more prevalent and more useful in the last two yearsrdquo(user academic 5 years of experience) or ldquoI started usingRSeekorg but currently I prefer stackoverflowcom whosequestion database is increasingrdquo (user statisticiandata ana-lyst 6 years of experience) Still these observations shouldbe placed in the right context more than half of the respon-dents are satisfied with the quality of the answers found in themailing list archives and more than a third will occasionallyshare their knowledge through mailing list discussions

Our next empirical finding was that developers are morelikely to be active on StackExchange than non-developersTo triangulate it we asked the survey participants what mo-tivates them to contribute their knowledge While numer-ous reasons such as reciprocity have been named both by theusers and by the developers developers are the only ones thatmentioned StackExchange and gamification explicitly ldquoIncase of StackExchange the reputation ratings are a nice littleincentiverdquo (academic package maintainer 6 years of experi-ence) or ldquo[I am motivated by] peer recognitiongamificationwithin StackOverflowrdquo (academic contributes to documen-tation submits bugs 5 years of experience) Moreover de-

0 5 10 15 20 25SE only

StackExchange activity of the survey participants

0 5 10 15 20 25ML only

Mailing list activity of the survey participants

Figure 7 Number of survey participants frequently (green) occasion-ally (yellow) and rarely (red) sharing knowledge about R on StackEx-change (SE) and mailing lists (ML)

velopers put more stress on being motivated by the desireto enhance onersquos own reputation (ldquowanting to be seen as arelative expert in some sub-domainrdquo academic contributesto documentation submits bugs 5 years of experience) andon evangelisation (ldquoI like to promote R because I think itrsquosa great tool to learn about the principles of statistics andprogrammingrdquo academic submits bugs 4 years of experi-ence) We believe that StackExchange is a better platform forthese goals than mailing lists since StackExchange quanti-fies onersquos reputation and makes it visible to everyone (seeprevious discussion of the gamification blueprint) as op-posed to the mailing lists lacking commonly agreed uponand readily accessible representations of reputation More-over StackExchange is a multi-topic community hence bet-ter suited to evangelisation than ldquopreaching to the choirrdquo ona mailing list

The empirical study also showed that contributors active bothon the mailing list and on StackExchange are more activethan those focused on only one of the two To triangu-late this finding we asked the survey participants how oftenthey share information about R by replying to threads on Rmailing lists and by answering questions on StackExchangeFigure 7 shows that survey participants active in both com-munities are more actively participating in the mailing listdiscussions than those active only on the mailing lists ie itsupports the findings of the empirical study A similar obser-vation can be made for the StackExchange activity althoughdisparity between the numbers of survey respondents activeonly on StackExchange (12) and active in both communities(28) hinders interpretation in this case

Finally in the empirical study we observed that knowledgeproviders active both on r-help and on StackExchange(ie the mailing list experts) are migrating to StackExchangeand are answering questions faster there than on the mailinglist To triangulate we revisited the survey questions aboutchanges in information seeking behaviour and motivationsfor sharing knowledge Half of the survey participants re-porting changes related to the mailing lists explicitly indi-cated their transition to StackExchange websites eg dueto an easier user interface a friendlier community or thesites being better indexed by the search engines In addi-tion although not explicitly commenting on the speed withwhich they provide answers respondents who contributetheir knowledge to StackExchange acknowledged gamifica-tion (ie reputation building) as important eg ldquo[I am moti-

vated to answer questions on StackExchange because] itrsquos agame which also serves a good purposerdquo (academic pack-age maintainer contributes to documentation submits bugs6 years of experience) Not everyone is attracted by theStackExchange incentives though R users might still preferto offer support via the mailing lists eg ldquomailing lists arevery nice to read and to reply to due to their text-only policyrdquo(academic package maintainer 11 years of experience)

IMPLICATIONSIn this section we review the implications of our study forQampA sites design knowledge communities and future re-search

Implications of our work for QampA site designers are twofoldFirst of all we have observed that the movement to socialgamified QampA is correlated with an increase in the engage-ment of knowledge providers and the rapidity of responseThis finding suggests that QampA site designers should con-sider gamification elements to increase engagement of theparticipants and indirectly popularity of their sites Sec-ond we have seen that users and developers exhibit dif-ferent behaviour (eg developers are more likely to be ac-tive on StackExchange) This means that the QampA sitedesigners should cater for different groups of communitymembers (eg developers and users) having different needsand expectations by providing different knowledge sharingchannels involving different participation and reward mech-anisms This is also why we do not believe that r-helpwilleventually die off as StackOverflow and r-help users haveexplained ldquoIf you have a problem and you are completelystuck ask a question on the mailing listrdquo28and ldquoAlthoughmany people there [StackOverflow] gave very detailed an-swers I have the feeling that there is much more wisdom onthe subject that is still only available in this mailing listrdquo29

For knowledge communities our findings suggest that a moveto gamified social QampA can be a beneficial strategy whensearching for a better visibility and contributorsrsquo activityWe believe that gamification may be of particular interestin closed environments such as commercial corporationswhere knowledge is proprietary and the set of potentiallyknowledge providers is closed Furthermore the afore-mentioned distinction between different groups of commu-nity members and different knowledge sharing channels re-quired suggests that knowledge communities might preferto combine different knowledge channels

Finally our findings provide new insights for researchers ofcollaborative software engineering While there is a signifi-cant body of work on why gamification features should posi-tively influence participation in knowledge communities fa-cilitate user contributions and foster productivity our studyadds concrete evidence that gamification features or commu-nity design affect productivity of the contributors

THREATS TO VALIDITYDespite our detailed efforts in experimental design datagathering and data analysis we do note several threats tothe validity of our methodology and conclusions The core

28httpstackoverflowcoma338247729httpgooglCoJIyp

step in the data gathering extracting information from theStackExchange data dump and the mail archives is sub-ject to threats to validity akin to those identified for digitaltrace data [15 18] Following the classification of Howi-son [18] system and practice issues in our work may berelated to communication through other mailing lists thanr-help and other StackExchange sites than StackOverflowand CrossValidated Indeed advanced R packages such aslme4 have separate mailing lists30 and experts workingwith these packages might prefer not to migrate to StackEx-change However r-help StackOverflow and CrossVali-dated are the biggest platforms of their kind We also know-ingly omitted other venues where knowledge exchange hap-pens eg Google+ groups or individual blogs Our resultsare very strong and give us confidence that StackExchangeis a good representative of the missing QampA communities

Moreover the data extracted is subject to potential reliabilitythreats arising from missing messages from mail archivesand questions and answers being deleted from StackEx-change Indeed as indicated by one of the survey partic-ipants StackExchange sites encourage the participants ldquotoask well-formed questions leading to well-formed answersrdquowhile ill-formed questions are being removed Question oranswer removal might lead to underestimating the activity ofthe StackExchange contributors as well as the overlap be-tween StackExchange and r-help Similarly removal ofStackExchange accounts might lead to underestimation ofthe overlap Another reliability threat is related to multiplesystem representations of a single individual which againmight incur an underestimate of the overlap between Stack-Exchange and r-help We have explicitly addressed thisthreat when discussing identity merging in the methodol-ogy section while no data gathering is perfect our effortsare particularly detailed and have been shown elsewhere towork satisfactorily Finally reliability might be threatenedby noise in the data such as spam messages stored in themail archives

The next group of threats to validity refers to representationof the data extracted in the model of one question followedby multiple answers While the distinction and relation be-tween questions and answers on StackExchange is explicitin the data organisation the natural counterpart in the mail-ing lists is the discussion thread (see ldquoComparing apples andorangesrdquo in the methodology section) Recognition of thediscussion thread starter as the asker and other thread par-ticipants as answerers can however be threatened by thethread starter posting an announcement rather than a ques-tion as well as by thread participants trying to clarify theintention of the thread starter rather than answering her ques-tions We have performed an informal evaluation of the dis-cussion threads in r-help and observed that the lionrsquos shareof threads adhere to the ldquoquestion and answersrdquo model simi-lar to StackExchange

Temporal aggregation threats arise when aggregating eventsthat occur at different points in time (cf Figures 1 and 4)Indeed inappropriate choice of the time granularity mighthave made our work subject to ecological fallacy [26] How-

30httpgooglofUT3L

ever our choice for month as the basic time unit stems fromthe way traditional mail archives are presented (per month)as well as the literature

CONCLUSIONSBy means of a longitudinal data set which connects the samepeople over time in different knowledge-transfer venues wehave been able to examine the phenomena associated withthe transition to StackExchange from the mailing lists

Future research directions would involve designing and set-ting up follow-up interviews with R developers and usersto address the causality of user migration between differ-ent mailing lists and StackExchange which incentives suchas more modern user interface potentially wider audiencesand gamification are being perceived as most important fordifferent subgroups within the R community Understand-ing causality could result in quantitative predictive modelsof user behaviour in the presence of various incentives

ACKNOWLEDGEMENTSWe thank the anonymous reviewers for their numerousremarks that helped us improve the paper significantlyVasilescu gratefully acknowledges support from the DutchScience Foundation (NWO) through the project NWO60006512010N235 Filkov gratefully acknowledges sup-port from the US Air Force Office of Scientific Researchthrough the project FA955-11-1-0246 Devanbu gratefullyacknowledges support from the NSF grants CISE SHFMEDIUM 0964703 and CISE SHF EAGER 1247088Part of this research has been carried out during Vasilescursquosvisit at the University of California Davis USA

REFERENCES1 Anderson A Huttenlocher D P Kleinberg J M and Leskovec J

Discovering value from community activity on focused questionanswering sites a case study of Stack Overflow In Proc KDD ACM(2012) 850ndash858

2 Antin J and Churchill E F Badges in Social Media A SocialPsychological Perspective In Proc CHI ACM (2011)

3 Bagozzi R P and Dholakia U M Open source software usercommunities A study of participation in Linux user groupsManagement Science 52 7 (2006) 1099ndash1115

4 Begel A Bosch J and Storey M-A Social networking meetssoftware development Perspectives from GitHub MSDN StackExchange and TopCoder IEEE Software 30 1 (2013) 52ndash66

5 Bettenburg N Shihab E and Hassan A E An empirical study on therisks of using off-the-shelf techniques for processing mailing list dataIn Proc ICSM IEEE (2009) 539ndash542

6 Bird C Gourley A Devanbu P T Gertz M and Swaminathan AMining email social networks In Proc MSR ACM (2006) 137ndash143

7 Bird C Gourley A Devanbu P T Swaminathan A and Hsu GOpen borders Immigration in open source projects In Proc MSRIEEE (2007) 6

8 Capiluppi A Serebrenik A and Singer L Assessing technicalcandidates on the social web IEEE Software 30 1 (2013) 45ndash51

9 Dabbish L A Stuart H C Tsay J and Herbsleb J D Social codingin GitHub transparency and collaboration in an open softwarerepository In Proc CSCW ACM (2012) 1277ndash1286

10 Deterding S Gamification designing for motivation Interactions 19 4(2012) 14ndash17

11 Deterding S Sicart M Nacke L OrsquoHara K and Dixon DGamification Using game-design elements in non-gaming contexts InProc CHI ACM (2011) 2425ndash2428

12 Gentle A Conversation and Community The Social Web forDocumentation XML Press 2009

13 German D M Adams B and Hassan A E The evolution of the Rsoftware ecosystem In Proc CSMR IEEE (2013)

14 Goeminne M and Mens T A comparison of identity merge algorithmsfor software repositories Science of Computer Programming (2011)

15 Goggins S Mascaro C and Valetto G Group informatics Amethodological approach and ontology for understandingsocio-technical groups JASIST 64 3 (2013) 516ndash539

16 Guzzi A Bacchelli A Lanza M Pinzger M and van Deursen ACommunication in open source software development mailing lists InProc MSR IEEE (2013) 277ndash286

17 Hendriks P Why share knowledge The influence of ICT on themotivation for knowledge sharing Knowledge and ProcessManagement 6 2 (1999) 91ndash100

18 Howison J Crowston K and Wiggins A Validity issues in the use ofsocial network analysis with digital trace data Journal of theAssociation for Information Systems 12 (2011)

19 Kouters E Vasilescu B Serebrenik A and van den Brand M G JWhorsquos who in Gnome Using LSA to merge software repositoryidentities In Proc ICSM IEEE (2012) 592ndash595

20 Lakhani K R and von Hippel E How open source software worksldquofreerdquo user-to-user assistance Research Policy 32 6 (2003) 923ndash943

21 Lee S Moisa N and Weiss M Open source as a signalling device -an economic analysis Working Paper Series Finance and Accounting102 Department of Finance Goethe University Frankfurt am Main2003

22 Lin H-F Effects of extrinsic and intrinsic motivation on employeeknowledge sharing intentions J Information Science 33 2 (2007)135ndash149

23 Mamykina L Manoim B Mittal M Hripcsak G and Hartmann BDesign lessons from the fastest QampA site in the west In Proc CHIACM (2011) 2857ndash2866

24 Oram A Why do people write free documentation Results of a surveyhttpwwwonlampcomlpta7062 2007

25 Parnin C Treude C Grammel L and Storey M-A Crowddocumentation Exploring the coverage and the dynamics of APIdiscussions on Stack Overflow Tech rep Georgia Institute ofTechnology 2012

26 Posnett D Filkov V and Devanbu P T Ecological inference inempirical software engineering In ASE (2011) 362ndash371

27 Posnett D Warburg E Devanbu P and Filkov V Mining StackExchange Expertise is evident from earliest interactions In Proc ASESocialInformatics IEEE (2012) 199ndash204

28 R Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical Computing ViennaAustria 2010

29 Shah C Oh S and Oh J S Research agenda for social QampALibrary amp Information Science Research 31 4 (2009) 205ndash209

30 Singer L Filho F Cleary B Treude C Storey M and SchneiderK Mutual assessment in the social programmer ecosystem Anempirical investigation of developer profile aggregators In ProcCSCW ACM (2013)

31 Singh V Twidale M B and Nichols D M Users of open sourcesoftware - How do they get help In Proc HICSS IEEE (2009) 1ndash10

32 Sowe S K Stamelos I and Angelis L Understanding knowledgesharing activities in freeopen source software projects An empiricalstudy JSS 81 3 (2008) 431ndash446

33 Squire M How the floss research community uses email archivesIJOSSP 4 1 (2012) 37ndash59

34 Storey M-A D Treude C van Deursen A and Cheng L-T Theimpact of social media on software engineering practices and tools InProc FoSER ACM (2010) 359ndash364

35 Swisher J Open source user assistance Ensuring that everybody winsOpen Source Business Resource (2010)

36 Vasilescu B Capiluppi A and Serebrenik A Gender representationand online participation A quantitative study of StackOverflow InProc ASE SocialInformatics IEEE (2012) 332ndash338

37 Vasilescu B Capiluppi A and Serebrenik A Gender representationand online participation A quantitative study Interacting withComputers (2013) 1ndash24

38 Vasilescu B Filkov V and Serebrenik A StackOverflow andGitHub Associations between software development and crowdsourcedknowledge In Proc ASE SocialCom IEEE (2013) 188ndash195

39 Vasilescu B Serebrenik A Goeminne M and Mens T On thevariation and specialisation of workloadndashA case study of the Gnomeecosystem community Empirical Software Engineering (2013) 1ndash54

40 Vassileva J Motivating participation in social computing applicationsa user modeling perspective User Modeling and User-AdaptedInteraction 22 1-2 (2012) 177ndash201

41 Voulgaropoulou S Spanos G and Angelis L Analyzingmeasurements of the R statistical open source software In Proc SEWIEEE (2012) 1ndash10

42 Zhuolun L Huang K-W and Cavusoglu H Can we gamifyvoluntary contributions to online QampA communities Quantifying theimpact of badges on user engagement In Proc WISE (2012)

Abstract

Introduction

Background

Research Questions

Related Work

Methodology

Results

Triangulation

Implications

Threats to Validity

Conclusions

Acknowledgements

References