
[IEEE 2007 IEEE International Conference on Software Maintenance - Paris, France (2007.10.2-2007.10.5)] 2007 IEEE International Conference on Software Maintenance - Discovering Dynamic


Discovering Dynamic Developer Relationships from Software Version Histories by Time Series Segmentation

Harvey Siy§, Parvathi Chundi∗§, Daniel J. Rosenkrantz‡, Mahadevan Subramaniam†§

§Computer Science Department, University of Nebraska at Omaha, Omaha, NE 68182
{hsiy,pchundi,msubramaniam}@mail.unomaha.edu

‡Computer Science Department, University at Albany, SUNY, Albany, NY 12222
[email protected]

Abstract

Time series analysis is a promising approach to discover temporal patterns from time stamped, numeric data. A novel approach to apply time series analysis to discern temporal information from software version repositories is proposed. Version logs containing numeric as well as non-numeric data are represented as an item-set time series. A dynamic programming based algorithm to optimally segment an item-set time series is presented. The algorithm automatically produces a compacted item-set time series that can be analyzed to discern temporal patterns. The effectiveness of the approach is illustrated by applying it to the Mozilla data set to study the change frequency and developer activity profiles. The experimental results show that the segmentation algorithm produces segments that capture meaningful information, and that their information content is superior to that obtained by arbitrarily dividing the time period into regular time intervals.

1 Introduction

Software version repositories contain an enormous wealth of information regarding development and maintenance activities in a project. Recently, there has been a lot of interest in analyzing these repositories to discern information pertaining to changes performed on software artifacts, study developer profiles, understand the impact of the development process on software quality by predicting future changes, and so on. The importance of explicitly considering the temporal dimension while analyzing version repositories based on time stamps associated with the version logs has been well-recognized in these works.

∗Supported in part by NSF Grant IIS-0534616 and by Grant Number P20 RR16469 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH).

†Supported in part by NSF Grant CCF-0541057.

Time series analysis is a well-established area of research that has been highly successful in discovering nuggets of temporal information from time stamped data. Time series analysis has been traditionally applied to numeric measurements/observations performed at regular intervals of natural phenomena like rainfall or man-made phenomena like stock prices.

The overall objective of this paper is to enable the application of time series analysis techniques to discover information from version logs. This poses several interesting challenges. First, unlike traditional time series data, which consists of numeric measurements, version logs typically include both numeric and non-numeric data. Effective analysis of such logs requires that we extend the current time series analysis techniques to handle non-numeric data. Further, version logs of a large project can span a long period of time with rich and varying temporal patterns of activities. These version logs can result in non-numeric time series data containing hundreds to thousands of measurements. Methods are needed to compactly represent such time series data to highlight the underlying temporal patterns.

One approach that has been successfully used to compactly represent time series data is segmentation. Segmentation of time series data automatically partitions the time period associated with a time series into a sequence of time intervals (or segments), which can then be further analyzed to find interesting temporal patterns. The importance of segmenting time series is well-established and several studies have demonstrated the advantages of segmentation to compactly represent a time series [8, 14, 12].

In this paper, we propose a novel approach that models version logs as time series data and describe a dynamic programming based method to construct optimal segmentations of version logs. The effectiveness of the proposed approach is illustrated by applying it to one aspect of software version logs, namely, developer activity profiles. We automatically segment a given version history such that each segment identifies developers with significant activity in that time period. This information can then be used to understand the core group of developers active at a given time and to relate trends in developer activity to global events such as release dates and future fixes. We believe that the proposed approach is general enough to be applied to study trends in several other aspects of version logs, such as change couplings and developer file ownership.

1-4244-1256-0/07/$25.00 © 2007 IEEE ICSM 2007

The approach presented in this paper represents a version history as an item-set time series¹, a time series where each observation is a set of discrete items. We then show how an item-set time series can be segmented to obtain a compact representation. There are many ways to segment a time series. One can simply group all observations within a day, a week, a month, etc., into a single segment. We call these fixed segmentations. Alternatively, one can divide the time period into k segments of uniform length. Such a fixed manner of producing segments may not be well-suited for time series data such as version logs, in which temporal patterns may occur in bursts. Therefore, automatic segmentation methods that discover variable length segments that closely represent the observations of an item-set time series are essential. We present an approach to automatically generate a segmentation of item-set time series data.

Our main contributions are as follows.

• A software version history is modeled as an item-set time series where each observation in the time series is a set of user ids of developers who made changes to files at the same time. The notion of an item set of a segment is used to compactly represent the item set that results from combining consecutive observations of an item-set time series. A dynamic programming based algorithm is presented to construct an optimal segmentation of an item-set time series that minimizes the difference between the item set of a segment and the individual item sets that were combined to generate that segment.

• The proposed approach has been applied to the Mozilla data set to study change frequency over time and developer activity profiles. The preliminary results are extremely promising and highlight the power of the proposed approach in applying time series analysis to version histories. The data set is represented as an item-set time series and several optimal as well as fixed segmentations were constructed and compared.

¹ An item-set time series is much like the sequence data in market-basket analysis. The only difference is that consecutive item sets in an item-set time series are measured / recorded at regular intervals.

• The results show that the generated segments capture time durations involving significant events in the history of Mozilla. The segments identify groups of developers active over a certain period of time and also reflect the difference in composition of active developers before and after major releases.

• The variable length segments produced by the algorithm outperform the fixed segmentations (produced by arbitrarily partitioning the project time period at regular intervals) in terms of active developers identified per segment, the distinctiveness of developers identified across adjoining segments, as well as quality of changes made by active developers within a segment.

The rest of the paper is organized as follows. Section 2 introduces the terms and definitions for the optimal segmentation problem and describes a dynamic programming algorithm to construct an optimal segmentation. Section 3 describes the Mozilla data and the effectiveness of applying optimal segmentation to it. Section 4 describes the related work and Section 5 concludes the paper.

2 Segmentation of Software Version History

Let $I$ be a finite set of items $d_1, d_2, \ldots, d_m$. An item set is a subset of $I$. The fractional difference between two item sets $x$ and $y$ is $(|x - y| + |y - x|) / |x \cup y|$ if $x \cup y$ is nonempty, and is 0 otherwise.
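As a minimal sketch of this definition, the fractional difference can be computed with Python sets. The function name `fractional_difference` is our own label for illustration and does not come from the paper.

```python
def fractional_difference(x, y):
    """Return (|x - y| + |y - x|) / |x ∪ y|, or 0.0 when both sets are empty."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # The definition assigns 0 when the union is empty.
        return 0.0
    return (len(x - y) + len(y - x)) / len(union)


# Identical sets differ by 0; disjoint non-empty sets differ by 1.
print(fractional_difference({"a", "b"}, {"a", "b"}))
print(fractional_difference({"a", "b"}, {"c"}))
```

The value always lies between 0 and 1, which is what lets per-time-point differences be summed into a segment difference later in this section.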

An item-set time series $T$ consists of a finite sequence of $n$ samples $x_1, \ldots, x_n$, where each $x_k$ is an item set, recorded at successive time points $t_1, \ldots, t_n$.

A time point $t_i$ may represent different units of time such as seconds, minutes, hours, days, etc. Below, for ease of exposition, we use time points that are at the granularity of a day.

Example 2.1 Let $I$ be a set of developer ids. A software version history can be modeled as an item-set time series $T_D$ where each sample $x_i$ is a set of developers. Item set $x_i$ denotes the set of developers that checked in files on the day denoted by $t_i$.

A segment $s(a, b)$ ($1 \le a \le b \le n$) of a time series $T$ consists of the consecutive time points $t_a, \ldots, t_b$. If $s_1 = s(a, b)$ and $s_2 = s(b + 1, c)$ are segments, the concatenation of $s_1$ and $s_2$, denoted as $s_1 s_2$, is the segment $s(a, c)$.

Example 2.2 Suppose the following time points appear consecutively in an item-set time series: Jan 21, 2003, Jan 22, 2003, Jan 23, 2003, Jan 24, 2003, Jan 25, 2003. Then, s(1, 3) is a segment containing the three time points Jan 21, 2003, Jan 22, 2003, Jan 23, 2003. Segment s(2, 2) contains the single time point Jan 22, 2003.

A measure function (denoted by $f$) is used to assign a numeric value to each item in a segment to capture the relevance of an item to that segment². There are many types of measure functions that one can formulate to capture the relevance of items in a segment. We define a measure function based on the occurrence frequency of an item in that segment.

Definition 2.1 The density measure ($f_m$) takes an item $d_q$ and a segment $s(a, b)$ as input, and returns the fraction of the item sets in $s(a, b)$ that contain $d_q$.

The numeric values assigned to items by a measure function $f$ in a given segment $s(a, b)$ are used to identify items that are deemed to be significant for that segment, as follows. Let $\alpha$ be a user specified threshold. An item $d_q$ is called significant in segment $s(a, b)$ if $f(d_q, s(a, b)) \ge \alpha$. The item set of segment $s(a, b)$, denoted by $I_\alpha(s(a, b), f)$, is the set of all significant items in $s(a, b)$.

Example 2.3 Suppose we have the following item-set time series $T_D$ containing developer sets constructed from a software version history, where each developer set contains all developers that checked in files on the day of the corresponding time point. $T_D$ = { 〈a, b, c〉, 〈b, c, d〉, 〈a, b, e〉, 〈a, b, f〉 } recorded at time points Jan 21, 2003, Jan 22, 2003, Jan 23, 2003, and Jan 24, 2003.

Consider segment s(1, 2). $f_m(a, s(1, 2)) = 0.5$, $f_m(b, s(1, 2)) = 1.0$, $f_m(c, s(1, 2)) = 1.0$, and $f_m(d, s(1, 2)) = 0.5$. Therefore, developers a and d checked in files on half the days during Jan 21 to 22, 2003, while b and c checked in files on both days.

Let $\alpha = 0.5$. Then, $I_\alpha(s(1, 2), f_m) = \{a, b, c, d\}$ and $I_\alpha(s(2, 4), f_m) = \{a, b\}$.

The item set associated with a segment represents developers that have checked in files on at least half of the days in that segment and thus are deemed to have been active during the time period associated with the segment.
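The density measure and the item set of a segment from Example 2.3 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the names `density` and `segment_item_set` are ours, segments use the paper's 1-indexed inclusive bounds, and exact fractions are used so that threshold comparisons do not depend on floating-point rounding.

```python
from fractions import Fraction


def density(item, series, a, b):
    """Fraction of the item sets x_a..x_b (1-indexed, inclusive) that contain item."""
    window = series[a - 1:b]
    return Fraction(sum(item in x for x in window), len(window))


def segment_item_set(series, a, b, alpha):
    """Items whose density in s(a, b) is at least the threshold alpha."""
    window = series[a - 1:b]
    candidates = set().union(*window)  # only items that occur can be significant
    return {d for d in candidates if density(d, series, a, b) >= alpha}


# The time series T_D of Example 2.3.
series = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b", "e"}, {"a", "b", "f"}]
alpha = Fraction(1, 2)
print(sorted(segment_item_set(series, 1, 2, alpha)))
print(sorted(segment_item_set(series, 2, 4, alpha)))
```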

A segmentation $\Pi$ of a time series $T$ is defined as a sequence $s(b_0, b_1), s(b_1 + 1, b_2), \ldots, s(b_{l-1} + 1, b_l)$ of segments such that the concatenation $s(b_0, b_1)\, s(b_1 + 1, b_2) \cdots s(b_{l-1} + 1, b_l) = T$. The size of $\Pi$, denoted by $|\Pi|$, is $l$, the number of segments in $\Pi$. A segmentation $\Pi$ of a time series $T$ is a fixed segmentation if all of the segments in $\Pi$ contain the same number of time points.

Example 2.4 A size 3 segmentation of $T_D$ is s(1, 1), s(2, 3), s(4, 4). A size 2 fixed segmentation of $T_D$ is s(1, 2), s(3, 4).

² The term measure function first appeared in [4].

2.1. Non-Homogeneity of a Segmentation

Let $s(a, b)$ be a segment and $t_h$ be a time point such that $a \le h \le b$. Let $\delta_h$ denote the fractional difference between $I_\alpha(s(a, b), f)$ and $x_h$. The segment difference of segment $s(a, b)$, denoted by $\delta(s(a, b))$, is $\sum_{a \le h \le b} \delta_h$. The segment difference of a segment represents how closely the item set of the segment captures the item sets of individual time points in that segment. The following example illustrates the segment differences for a couple of segments of the previous example.

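Example 2.5 Consider $I_\alpha(s(1, 2), f_m)$ from the previous example. $I_\alpha(s(1, 2), f_m) = \{a, b, c, d\}$. $\delta_1 = 0.25$ and $\delta_2 = 0.25$. Therefore, $\delta(s(1, 2)) = 0.5$.

$I_\alpha(s(2, 4), f_m) = \{a, b\}$. Applying the fractional difference definition, $\delta_2 = 3/4 = 0.75$ (the sets {a, b} and {b, c, d} differ in a, c and d, and their union has four items), $\delta_3 = 0.33$, and $\delta_4 = 0.33$. Therefore, $\delta(s(2, 4)) \approx 1.42$.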
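The segment difference can be sketched by combining the two earlier definitions. The helper names below are hypothetical, the density measure is assumed as the measure function, and the check uses segment s(1, 2) of Example 2.3.

```python
def fractional_difference(x, y):
    """(|x - y| + |y - x|) / |x ∪ y|, or 0.0 for two empty sets."""
    union = x | y
    return (len(x - y) + len(y - x)) / len(union) if union else 0.0


def segment_item_set(series, a, b, alpha):
    """Items contained in at least a fraction alpha of the item sets x_a..x_b."""
    window = series[a - 1:b]
    return {d for d in set().union(*window)
            if sum(d in x for x in window) / len(window) >= alpha}


def segment_difference(series, a, b, alpha):
    """Sum over a <= h <= b of the fractional difference between I_alpha and x_h."""
    item_set = segment_item_set(series, a, b, alpha)
    return sum(fractional_difference(item_set, x) for x in series[a - 1:b])


series = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b", "e"}, {"a", "b", "f"}]
print(segment_difference(series, 1, 2, alpha=0.5))  # delta_1 + delta_2 for s(1, 2)
```

A segment of length one always has segment difference 0 when alpha is at most 1, because its item set equals the single observation.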

A desirable property of a segmentation is that the item set of each of the segments closely reflects the item sets of the time points contained in that segment. The segment difference of a given segment is a measure of how internally homogeneous that segment is. There are a variety of ways to measure the non-homogeneity of a given segmentation of a time series, given the segment difference of each segment in the segmentation. We describe three such measures here. The summation difference measure, denoted by $\Delta_{sum}$, is the sum of the segment differences of the segments in the segmentation. The average difference measure, denoted by $\Delta_{avg}$, is the average segment difference (ratio of the summation difference to the size of the segmentation). The max difference measure, denoted by $\Delta_{max}$, is the maximum segment difference.

2.2. Optimal Segmentation Problem

Segmentation of a time series reduces the number of samples to be examined, while hopefully preserving much of the information of the original time series. For a given measure function and difference measure, the optimal segmentation problem takes as input an item-set time series, segment difference values for each of $O(n^2)$ segments, and an upper bound $p$ on the size of the desired segmentation of the input time series, and constructs a segmentation of at most $p$ segments with minimal non-homogeneity. A dual formulation of the optimal segmentation problem is to take as input an item-set time series and an upper bound on the amount of non-homogeneity, and construct a minimal size segmentation whose non-homogeneity does not exceed the given limit.

Dynamic programming is typically employed to solve the optimal segmentation problem [8, 12]. The dynamic programming approach uses as input data the segment difference values for each of the $O(n^2)$ segments of the input time series, i.e. the $\delta(s(i, j))$ value for each segment $s(i, j)$ ($1 \le i \le j \le n$). Prior to carrying out the dynamic programming algorithm, these segment difference values for all segments are computed, using the measure function specified by the user.

The dynamic programming algorithm operates as follows. We assume that the item-set time series to be segmented begins at index 1 and that $p \le n$. A two-dimensional table $R$ is maintained by the dynamic programming algorithm. Entry $R[j, k]$ in the table records the minimum possible amount of non-homogeneity that can be incurred in combining time points 1 through $j$ into $k$ segments ($j \ge 1$, $k \le j$, $k \le p$). For every measure, entry $R[j, 1]$ (that is, $k = 1$) is set to $\delta(s(1, j))$. For $k > 1$, a recursive equation is used to compute the value of $R[j, k]$ from previously computed entries in the table. The specifics of the recursive equation depend on which non-homogeneity measure is being used.

For the summation difference measure of non-homogeneity, each entry $R[j, k]$, $k > 1$, is set to

$$R[j, k] = \min_{k-1 \le z < j} \bigl( R[z, k-1] + \delta(s(z+1, j)) \bigr).$$

In case of the max difference measure of non-homogeneity, for $k > 1$,

$$R[j, k] = \min_{k-1 \le z < j} \max\bigl( R[z, k-1],\ \delta(s(z+1, j)) \bigr).$$

Finally, in case of the average difference measure of non-homogeneity, for $k > 1$,

$$R[j, k] = \min_{k-1 \le z < j} \frac{(k-1) \cdot R[z, k-1] + \delta(s(z+1, j))}{k}.$$

It can be easily seen that the average difference measure value of an entry $R[j, k]$ is simply the summation difference measure of $R[j, k]$ divided by $k$. Similarly, the average difference measure of an $R[j, k]$ entry multiplied by $k$ gives the summation difference measure.

A dynamic programming algorithm based on the above recursive equations can be used to construct table $R$ in $O(n^2 p)$ time. Since $p$ is an upper bound on the number of segments, the optimum segmentation's non-homogeneity value is given by the minimal value of $R[n, k]$, $1 \le k \le p$. If more than one value of $k$ yields the minimal value, then the least $k$ with this minimal value is returned.

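The dynamic programming algorithm typically uses an additional two-dimensional table, distinct from the time series $T$ and here also called $T$, where $T[j, k]$ records the value of $z$ that minimizes the $R[j, k]$ entry. Table $T$ is used to construct an optimal segmentation by tracing these values back from the final entry.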
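The recurrences above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names are ours, the back-pointer table is called `back` to avoid clashing with the time series name, only the summation and max measures are covered, and the segment differences are precomputed naively with the density measure.

```python
from fractions import Fraction


def segment_differences(series, alpha):
    """Return {(i, j): delta(s(i, j))} for all 1 <= i <= j <= n (density measure)."""
    n = len(series)
    delta = {}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            window = series[i - 1:j]
            item_set = {d for d in set().union(*window)
                        if Fraction(sum(d in x for x in window), len(window)) >= alpha}
            # len(s ^ x) equals |s - x| + |x - s|; an empty union contributes 0.
            delta[i, j] = sum(Fraction(len(item_set ^ x), len(item_set | x))
                              for x in window if item_set | x)
    return delta


def optimal_segmentation(delta, n, p, measure="sum"):
    """Return (non-homogeneity, [(a, b), ...]) using at most p segments."""
    combine = {"sum": lambda prev, d: prev + d, "max": max}[measure]
    R, back = {}, {}
    for j in range(1, n + 1):
        R[j, 1] = delta[1, j]                      # base case: one segment s(1, j)
        for k in range(2, min(j, p) + 1):
            for z in range(k - 1, j):              # last segment is s(z + 1, j)
                cost = combine(R[z, k - 1], delta[z + 1, j])
                if (j, k) not in R or cost < R[j, k]:
                    R[j, k], back[j, k] = cost, z
    # Least k among those with minimal non-homogeneity.
    k = min(range(1, min(n, p) + 1), key=lambda q: (R[n, q], q))
    cost, segments, j = R[n, k], [], n
    while k > 1:                                   # walk the back pointers
        z = back[j, k]
        segments.append((z + 1, j))
        j, k = z, k - 1
    segments.append((1, j))
    return cost, segments[::-1]


# Three bursts of activity by distinct developer groups, alpha = 0.5, p = 4.
series = [{"a", "b"}] * 3 + [{"d", "e"}] * 2 + [{"f", "g"}] * 2
delta = segment_differences(series, Fraction(1, 2))
cost, segments = optimal_segmentation(delta, len(series), p=4)
print(cost, segments)
```

The three nested loops over j, k and z give the $O(n^2 p)$ table construction; the naive precomputation of the segment differences is more expensive than that and would need to be made incremental for long histories.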

2.3. Examples

The relationship between the size and the amount of non-homogeneity of a segmentation may be dependent on a given data set and the value of $\alpha$. We now illustrate through several simple examples the interaction between $\alpha$, $p$, and an optimal segmentation. We employ the summation measure of non-homogeneity to construct the optimal segmentation in all of the examples.

The item sets in the time series analyzed in these examples are drawn from the set $I = \{a, b, c, d, e, f, g\}$. We assume that $I$ is a set of developer ids obtained from a version history and each item set in $T_D$ contains developers who checked in files on the day of the corresponding time point.

The following example considers a scenario in which the item-set time series contains time periods where distinct sets of developers are active. This example illustrates that the segmentation in which each distinct set of developers is associated with a different segment is an optimal segmentation and will be constructed by the algorithm.

Example 2.6 Let $T_D$ = {a, b}, {a, b}, {a, b}, {d, e}, {d, e}, {f, g}, {f, g}. Suppose $\alpha = 0.5$, and the specified value of $p = 4$. All optimal segmentations of size at least 3 have zero non-homogeneity. Hence, an optimal segmentation of size 3 is returned by the algorithm. It contains segments s(1, 3), s(4, 5), s(6, 7), where each segment captures a distinct developer set. The item sets corresponding to these segments are: {a, b}, {d, e}, and {f, g}.

Consider the following item-set time series $T_D$ = {a, b, c}, {a, b, d}, {a, b}, {d, e, h}, {d, e}, {f, g}, {f, g}. Here, consecutive item sets are not exactly identical. Suppose $\alpha$ is 0.5. An optimal segmentation of size 3 has a loss (non-homogeneity) of 1, and contains the following segments: s(1, 3), s(4, 5), s(6, 7). In this case, optimal segmentations of sizes greater than 3 have a smaller amount of loss. However, the developer activity pattern is not as compactly represented.

In our next example, we consider a time series where consecutive item sets are not overlapping.

Example 2.7 Let $T_D$ = {a, b}, {c, d}, {e, f}, {g, a}. Suppose $\alpha = 0.5$.

In this case, the algorithm simply merges consecutive time points to reduce the size of the segmentation. Therefore, a size 3 optimal segmentation would be s(1, 1), s(2, 2), s(3, 4) (non-homogeneity is 1) and a size 2 optimal segmentation would be s(1, 2), s(3, 4) (non-homogeneity is 2).

We use the following example to show that the non-homogeneity of a segmentation does not necessarily decrease as its size increases. In some cases, segmentations with smaller sizes may have less non-homogeneity.

Example 2.8 Suppose $T_D$ contains three time points. The item sets in the time series are {a}, {b}, {a}. Suppose $\alpha = 0.66$.

We first consider the segmentation $\Pi_1 = s(1, 3)$ of $T_D$, of size 1. $I_\alpha(s(1, 3), f_m) = \{a\}$. The non-homogeneity of $\Pi_1$ is $\delta(s(1, 3))$, which equals $\sum_{1 \le j \le 3} \delta_j$. $\delta_1 = 0$, $\delta_2 = 1$, and $\delta_3 = 0$. Therefore, $\Delta_{sum}(\Pi_1) = 1$.

Now consider a segmentation of size 2. There are two possible size 2 segmentations of $T_D$. Let $\Pi_2 = s(1, 1), s(2, 3)$. (The other segmentation of size 2 is s(1, 2), s(3, 3), and is symmetric to $\Pi_2$.)

$I_\alpha(s(1, 1), f_m) = \{a\}$ and $I_\alpha(s(2, 3), f_m) = \{\}$. $\delta(s(1, 1)) = 0$. $\delta(s(2, 3)) = \sum_{2 \le j \le 3} \delta_j = 2$. $\Delta_{sum}(\Pi_2) = 2$.

Therefore, in this case the non-homogeneity of every size 2 segmentation is greater than that of the size 1 segmentation.

The above observation is true even if we use $\Delta_{max}$ as the measure of non-homogeneity. However, if $\Delta_{avg}$ is used, then, for this example, both $\Pi_1$ and $\Pi_2$ have the same non-homogeneity, namely 1.
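The three non-homogeneity measures for the two segmentations of Example 2.8 can be checked with a short sketch. The function names are hypothetical, the density measure is assumed, and exact fractions are used so that the threshold 0.66 is compared against 2/3 without rounding.

```python
from fractions import Fraction


def segment_difference(series, a, b, alpha):
    """delta(s(a, b)) under the density measure, with 1-indexed inclusive bounds."""
    window = series[a - 1:b]
    item_set = {d for d in set().union(*window)
                if Fraction(sum(d in x for x in window), len(window)) >= alpha}
    return sum(Fraction(len(item_set ^ x), len(item_set | x))
               for x in window if item_set | x)


def non_homogeneity(series, segmentation, alpha):
    """Summation, max and average difference measures of a segmentation."""
    diffs = [segment_difference(series, a, b, alpha) for a, b in segmentation]
    total = sum(diffs)
    return {"sum": total, "max": max(diffs), "avg": total / len(diffs)}


series = [{"a"}, {"b"}, {"a"}]
alpha = Fraction(66, 100)
pi_1 = [(1, 3)]
pi_2 = [(1, 1), (2, 3)]
print(non_homogeneity(series, pi_1, alpha))
print(non_homogeneity(series, pi_2, alpha))
```

In this instance the size 2 segmentation is worse under the summation and max measures and ties under the average measure, matching the discussion above.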

3. Empirical Studies

To evaluate the effectiveness of the optimal segmentation algorithm, we conducted a case study using data from the Mozilla repository. This open source project has been studied by many previous researchers, providing a wealth of background information about its history, structure and development process. Mozilla uses CVS to track changes to its artifacts. Before diving into the empirical studies, we discuss our analysis strategy.

We first provide an overview of the Mozilla repository data. We focus on the temporal aspects of the data to show several time-varying characteristics of this project. In particular we examine the change frequency over time as well as developer turnover. This information sets the background as well as gives some justification of the need for temporal segmentation.

We then illustrate the use of the temporal segmentation algorithm by segmenting the entire Mozilla version history. The goal of this activity is to provide confidence in the scalability of the proposed algorithm and in the meaningfulness of the generated results.

Next, we perform segmentation on a smaller time interval to identify local temporal patterns and groups of developers who are actively working in close temporal proximity to each other. We then evaluate the constructed segmentation by investigating the consistency of the contents and the well-definedness of the boundaries of the resulting segments. We also perform some preliminary assessment of the effect of changes committed by these active developers on the quality of the resulting files.

3.1. Overview of Mozilla Data

The Mozilla CVS repository was used for the Mining Challenge of the 2007 Mining Software Repositories Workshop [27] and has deltas from 1998 to 2006.

Figure 1 is a SiZeR [3] plot showing the change frequency trend over the lifetime of the project. The bold line shows the trend when the curve is "smoothed" over a 6-month window. This plot shows the time-varying nature of the change frequency. Furthermore, we note that the flatter curves resulting from larger smoothing windows show overall trends at the expense of losing information about local periods of time with high change frequency. On the other hand, smaller smoothing windows show too much variation, making it difficult to visually spot any trend. To provide a meaningful summary of the history, we need a method that can show overall trends while preserving information about significant localized phenomena. This argues for a variable segmentation approach to analysis of time data.

[Figure 1 plot: "Family of smooths"; x-axis: time (1998 to 2006); y-axis: frequency.]

Figure 1. Mozilla change frequency over time. This is part of a SiZeR visualization of the change frequency over time. The y-axis is the change frequency (log-transformed). The plot shows a superimposed series of smoothing curves, each plotted with a different smoothing window. The bold curve was calculated with a smoothing window of approximately 6 months.

We also note that the set of active developers steadily changes over time. Early developers became less active or left the project altogether while new ones joined the project. Figure 2 highlights this pattern. The darker bars show contributions in terms of changes by developers who have been on the project since the early years. Over time, newer developers (denoted by lighter bars) increase in percentage of contribution. This phenomenon lends further support to the suitability of this project for temporal segmentation.

3.2. Segmenting the Version History

We applied the segmentation algorithm to the developer information recorded with the deltas from the CVS repository. In this initial analysis, each time point is a month and each item set consists of the developers who checked in code at least 5 times during that month.
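The construction of the monthly item-set time series can be sketched as follows. The input format, a list of `(date, developer_id)` pairs with one pair per check-in, is an assumption made for illustration, as are the function name and the choice to keep months without qualifying developers as empty item sets so that time points stay evenly spaced.

```python
from collections import Counter, defaultdict
from datetime import date


def monthly_item_sets(checkins, min_checkins=5):
    """Build one item set per calendar month from (date, developer_id) pairs.

    A developer belongs to a month's item set when they have at least
    min_checkins check-ins in that month.
    """
    counts = Counter((d.year, d.month, dev) for d, dev in checkins)
    if not counts:
        return []
    active = defaultdict(set)
    for (year, month, dev), n in counts.items():
        if n >= min_checkins:
            active[year, month].add(dev)
    months = sorted({(y, m) for y, m, _ in counts})
    (year, month), last = months[0], months[-1]
    series = []
    while (year, month) <= last:                    # walk every month in range
        series.append(((year, month), active.get((year, month), set())))
        year, month = (year, month + 1) if month < 12 else (year + 1, 1)
    return series


# Hypothetical check-ins: "alice" meets the threshold in March 1998,
# "bob" only in May 1998; April has no check-ins at all.
checkins = ([(date(1998, 3, d), "alice") for d in range(1, 6)]
            + [(date(1998, 3, d), "bob") for d in range(1, 5)]
            + [(date(1998, 5, d), "bob") for d in range(1, 6)])
for month, developers in monthly_item_sets(checkins):
    print(month, sorted(developers))
```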

There were 106 time points (months) from March 1998 to December 2006. The resulting segmentations with 10, 20, 30 and 40 segments are shown in Figure 3. We used a threshold $\alpha = 0.3$. This $\alpha$ was chosen because higher values led to many empty segments. The item set associated with each segment represents developers who appear in the item sets of at least 30% of the time points within that segment. We designate them as the active developers of this segment.

[Figure 2 plot: bar chart; x-axis: year (1998 to 2006); y-axis: annual contribution (%), 0 to 100.]

Figure 2. Annual percentage of changes by developers. The darker bars show contributions by developers who have been on the project since the early years.

The corresponding non-homogeneity curve shown in Figure 4 gives an indication of the relative increase in the number of developers whose presence in a particular time point is captured within the segments. The fewer the number of segments, the more information we lose regarding the presence of developers in a time point. As shown in Figures 3 and 4, as the number of segments increases, the precision is refined, but it becomes harder to look at trends. Our aim is to come up with a suitable abstraction of the version history, summarizing the history with the least number of segments. We selected the 30-segment partition because there were large segments in the 20-segment partition that were further divided. The 40-segment partition did not seem to add more useful insight. In general, as the loss rate decreases, the benefit of using more segments also decreases.

Many of the segmentation dates are meaningful in the history of Mozilla. Table 1 shows the approximate correspondence between our identified segment boundaries and the known significant dates from Mozilla history [19].³ The segmentation reflects a substantial difference in the composition of active developers before and after several major releases. This gives us confidence in the suitability of the algorithm for studying change information from version history.

³ The entry marked "Mozilla Foundation" refers to the creation of the Mozilla Foundation on July 15, 2003. This organization oversees the continuing development of the Mozilla suite. Another study [10] also noted a disruption in the group of active developers as the Foundation took over.

[Figure 3 plots: four panels titled "10 time segments", "20 time segments", "30 time segments" and "40 time segments"; x-axis: time (1998 to 2006); y-axis: number of developers (0 to 200).]

Figure 3. Segmentation of Mozilla history. Four segmentations are shown, for 10, 20, 30 and 40 segments. In each plot, each bar represents one segment. The width of the bar shows the segment duration while the height of the bar shows the number of developers who made the threshold ($\alpha = 0.3$).

3.3. Analysis of Local Patterns

To identify developers who are likely to communicate, we looked at a smaller period of time. We picked the 200 days starting from January 2000. Based on the analysis of Figure 1, this seems to be a period of relative stability as reflected by the decreasing trend in the number of changes committed to the repository.

[Figure 4: line plot; x-axis: NUMBER OF SEGMENTS (0 to 100); y-axis: NON-HOMOGENEITY (0 to 60).]

Figure 4. Non-homogeneity curve. The y-axis shows the non-homogeneity of the segmentation for the given number of segments on the x-axis. The points plot the values for 10, 20, 30 and 40 segments.

Seg Num   Seg Start   Seg End   Release Dates
11        Nov’00      Dec’00    Mozilla 0.6 (Dec 6)
14        Dec’01      Jun’02    Mozilla 1.0 (Jun 5)
16        Aug’02      Nov’02    Mozilla 1.2 (Nov 26)
17        Dec’02      Mar’03    Mozilla 1.3 (Mar 13)
18        Apr’03      Jul’03    Mozilla Foundation
23        Sep’04      Dec’04    Firefox 1.0 (Nov 29)
26        Jun’05      Dec’05    Firefox 1.5 (Nov 29)
29        Jul’06      Oct’06    Firefox 2.0 (Oct 24)

Table 1. Applicable release dates correlated to some segment boundaries.

For this analysis, each time point is a day and each item set consists of the developers who checked in code at least once during that day. We conducted a similar analysis of non-homogeneity as in Section 3.2. The resulting non-homogeneity curve indicates the point of inflection somewhere past 6 segments. Therefore, for the remainder of this section, we partition the time period into 6 segments. The top half of Figure 5 shows the results given 6 segments. Note that we used a threshold of α = 0.6 to ensure that developers who are counted in the segment actually overlap with each other at least part of the time during the segment interval. Such a group of developers is most likely to be aware of each other’s work and hence, also most likely to communicate and coordinate their changes.
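The daily item sets used in this local analysis could be assembled from check-in records roughly as follows. This is a minimal sketch, not the authors' tooling: the `(date, developer)` record format, the function name `daily_item_sets`, the developer names, and the exact start date of 1 January 2000 are all assumptions made for illustration.

```python
from datetime import date

def daily_item_sets(commits, start, n_days):
    """Build an item-set time series with one set of developers per day.

    `commits` is an iterable of (date, developer) pairs (assumed format).
    Entry i of the result holds everyone who checked in on day start + i.
    """
    series = [set() for _ in range(n_days)]
    for day, developer in commits:
        offset = (day - start).days
        if 0 <= offset < n_days:          # ignore check-ins outside the window
            series[offset].add(developer)
    return series

# Hypothetical check-in records.
commits = [
    (date(2000, 1, 1), "dev_a"),
    (date(2000, 1, 1), "dev_b"),
    (date(2000, 1, 1), "dev_a"),    # a second check-in on the same day counts once
    (date(2000, 1, 3), "dev_a"),
    (date(1999, 12, 31), "dev_c"),  # before the window, dropped
]
series = daily_item_sets(commits, date(2000, 1, 1), 3)
# series == [{'dev_a', 'dev_b'}, set(), {'dev_a'}]
```

Using sets means that the number of check-ins a developer makes in a day does not matter, only whether the developer was present that day.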

In the top half of Figure 5, there was one segment in which no developer made the α threshold. The second segment (20 days) had the largest number of developers (11). The third segment shows a period of over a month (34 days) where there is only one dominant developer. This was followed by a 36-day segment with 8 active developers. Next was a 20-day segment with an almost completely different set of 9 active developers.

[Figure 5: two bar-chart panels titled "VARIABLE-SIZE SEGMENTS" and "FIXED-SIZE SEGMENTS"; x-axis: MONTH (1 to 8); y-axis: #DEVELOPERS (0 to 15).]

Figure 5. Segmentation of Mozilla development (first 200 days of 2000). Each bar represents one segment. The width of the bar shows the segment duration while the height of the bar shows the number of developers who made the threshold (α = 0.6). The top plot shows the partitioning with 6 segments. The bottom plot shows the result when the same period of time is partitioned using fixed length segments.

Variable versus fixed segment lengths. For comparison, we analyzed the same 200-day time period using a fixed segmentation of 33 days per segment. The resulting segmentation (also with α = 0.6) is shown in the bottom half of Figure 5. Clearly, this segmentation was not able to identify many active developers. In general, due to the arbitrary setting of segment boundaries in fixed length segments, many important localized features of the data may be missed.
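A toy example may help show why fixed boundaries can miss localized activity. The sketch below is illustrative only: the 12-day series, the developer name, the hand-picked variable boundaries, and the helper names `active` and `fixed_segments` are all hypothetical, and the variable boundaries are not produced by the segmentation algorithm.

```python
from collections import Counter

def active(item_sets, alpha):
    """Developers present in at least `alpha` of the given time points."""
    counts = Counter(dev for s in item_sets for dev in s)
    return {d for d, n in counts.items() if n / len(item_sets) >= alpha}

def fixed_segments(n_points, length):
    """Split range(n_points) into (start, end) slices of `length` points.

    The last slice may be shorter (an assumption about the remainder).
    """
    return [(s, min(s + length, n_points)) for s in range(0, n_points, length)]

# 'ann' commits on six consecutive days (days 3..8) of a 12-day period.
series = [set() for _ in range(12)]
for day in range(3, 9):
    series[day].add("ann")

fixed = fixed_segments(12, 6)          # [(0, 6), (6, 12)]
variable = [(0, 3), (3, 9), (9, 12)]   # boundaries aligned with the burst

fixed_active = [active(series[a:b], 0.6) for a, b in fixed]
variable_active = [active(series[a:b], 0.6) for a, b in variable]
# fixed_active    == [set(), set()]         (burst split 3/6 + 3/6, below 0.6)
# variable_active == [set(), {'ann'}, set()]
```

The fixed boundary at day 6 cuts the burst in half, so the developer falls below α = 0.6 in both segments. Applied to 200 points with a length of 33, `fixed_segments` yields six full segments plus a 2-day remainder; how the remainder was treated in the study is not stated.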

We repeated this comparison between fixed and variable length segments for segmentations of 10, 20, 40 and 100 segments. In each case, the variable segmentation yielded more varied developer sets.

Distinctiveness of developer sets. Another question we wanted to answer was: how distinct were the developer sets from each other? We answer this question in two ways: 1) we calculated a similarity measure between every pair of developer sets, and 2) we tracked the contributions of each developer set over the entire time period.

      1     2     3     4     5     6
1     1     0.14  0     0.18  0.08  0
2     0.14  1     0     0.19  0.43  0
3     0     0     1     0     0     0
4     0.18  0.19  0     1     0.06  0
5     0.08  0.43  0     0.06  1     0
6     0     0     0     0     0     0

Table 2. Inter-segment similarity matrix. This table shows the similarity between the groups of active developers within each segment of the 6-segment partitioning.

DevSet   Seg 1   Seg 2   Seg 3   Seg 4   Seg 5   Seg 6
1        64%     54%     51%     47%     36%     36%
2        47%     67%     43%     45%     57%     31%
3        43%     50%     66%     31%     15%     0%
4        38%     51%     45%     65%     40%     27%
5        41%     56%     38%     45%     74%     35%
6        0%      0%      0%      0%      0%      0%

Table 3. Changes contributed by each developer set in each segment. Each value shows, for each segment, the average number of days each developer in a set had committed changes, as a percentage of the number of days in the segment.

We calculated the similarity between two developer sets x and y as |x ∩ y| / |x ∪ y|. That is, the similarity measure is a value between 0 and 1, with 0 being the lowest (the sets are disjoint) and 1 being the highest (the sets are equal). The more developers the two sets have in common, the stronger the similarity. Table 2 shows the similarity measures for all pairs of the 6 segments. The values indicate that there is very low similarity for adjacent developer sets, even between the ones with a relatively larger number of developers such as segments 4 and 5.
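The similarity measure |x ∩ y| / |x ∪ y| is straightforward to compute with Python sets. The sketch below is illustrative; the function names and developer names are hypothetical. Returning 0 when both sets are empty is an assumption inferred from Table 2, where the segment with no active developers has similarity 0 even with itself.

```python
def similarity(x, y):
    """Similarity |x & y| / |x | y| of two developer sets, in [0, 1]."""
    union = x | y
    if not union:
        return 0.0  # both sets empty: treated as 0 (assumption based on Table 2)
    return len(x & y) / len(union)

def similarity_matrix(dev_sets):
    """Pairwise similarity for a list of developer sets."""
    return [[similarity(a, b) for b in dev_sets] for a in dev_sets]

# Hypothetical developer sets for three segments; the last one is empty.
dev_sets = [{"ann", "bob", "cat"}, {"bob", "cat", "dan"}, set()]
matrix = similarity_matrix(dev_sets)
# matrix[0][1] == 0.5  (2 shared developers out of 4 distinct ones)
# matrix[2][2] == 0.0  (empty set compared with itself)
```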

While Table 2 tells us that each segment’s developer set is distinct from the others, it does not tell us if there were other developers that should have been part of the developer set. Hence, we also tracked the work contribution of each developer set in every segment. This is calculated by taking the average number of days each developer in a set had worked (by committing changes) for each of the 6 segments. This average is then divided by the total number of days in the segment. The results are given in Table 3. To understand this table, we take as an example the work of Developer Set 2 members during Segment 1. On average, each member of Developer Set 2 committed changes on 13.2 days out of the 28 days in Segment 1. The percentage work contribution is then calculated as 13.2/28 ≈ 47%. We note that the highest value in each column corresponds to the contribution of the developer set identified for that segment. This value is above 60% because α = 0.6. Within each column, the contribution of the other developer sets falls below this threshold.
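The work-contribution calculation can be written as a small function. This is a sketch under assumptions: the per-developer day counts are invented so that their mean is 13.2, matching the worked example for a 28-day segment, and the function name `contribution` and the developer names are hypothetical.

```python
def contribution(dev_set, days_worked, segment_days):
    """Average fraction of the segment's days on which set members committed.

    `days_worked` maps a developer to the number of days in the segment on
    which that developer committed at least one change.
    """
    if not dev_set or segment_days == 0:
        return 0.0  # an empty developer set contributes nothing
    mean_days = sum(days_worked.get(dev, 0) for dev in dev_set) / len(dev_set)
    return mean_days / segment_days

# Hypothetical day counts for a five-member set; the mean is 66 / 5 = 13.2 days.
days_worked = {"d1": 10, "d2": 12, "d3": 14, "d4": 15, "d5": 15}
share = contribution(set(days_worked), days_worked, 28)
percent = round(share * 100)  # 47, i.e. 13.2 / 28 is about 47%
```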

3.4. Quality Impacts

We also started to investigate the impact of changes made by active developers on the quality of the resulting files. We hypothesize that within a given segment x, files that were changed exclusively by the active developers of segment x will have fewer changes in the immediate future (1-2 weeks). The rationale for this is that active developers know about each other’s activities and would be more likely to coordinate changes. On the other hand, developers who are not as active may make some changes that might not have been coordinated with other related changes, thus resulting in additional subsequent changes to fix problems. Even though Mozilla follows a 2-stage code inspection process which requires code submissions to be inspected by module owners and, if applicable, by super-reviewers [17], some problems may still get past the review process and get checked in.

Experimental variables. We analyzed the change data to look for evidence supporting the hypothesis. For each segment x, we identified the files changed exclusively by the active developers of segment x on any of the days near the end of segment x. We then collected the number of subsequent changes 3, 7, 10 and 14 days in the future. From this procedure, we have the following independent variables: file, the file that was changed; active, whether or not the file was changed exclusively by active developers; day, the day in the segment when this file was changed by a group of developers; and future, the number of days in the future (3, 7, 10 or 14) to count future changes. The dependent variable is delta, the number of future changes to the file.

Analysis and results. For a given day and future, we classified the set of files into two sets, according to active. We then used the t-test to check if there is a difference in the dependent variable delta.

We found that, in every segment, there is at least one set of files changed exclusively by active developers which had significantly fewer future changes (p < 0.01). This encouraging result provides preliminary validation of our hypothesis. Subsequent work involves refining the future changes to only those classified as fixes, and refining the changes in the segment to include only those that are semantically related.
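The measurement and comparison steps might look like the following Python sketch. Several details are assumptions rather than facts from the study: the text does not say which variant of the t-test was used, so Welch's unequal-variance form is shown; the counting window is taken to be the days strictly after `day` up to and including `day + future`; and all dates, delta values and function names are invented for illustration.

```python
from datetime import date, timedelta
from math import sqrt
from statistics import mean, variance

def future_changes(change_dates, day, future):
    """delta: number of changes after `day` and within `future` days of it."""
    last = day + timedelta(days=future)
    return sum(1 for d in change_dates if day < d <= last)

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples."""
    na, nb = len(a), len(b)
    se_a, se_b = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df

# Hypothetical change history of one file after a change on 1 March 2000.
history = [date(2000, 3, 2), date(2000, 3, 5), date(2000, 3, 12), date(2000, 3, 20)]
deltas = [future_changes(history, date(2000, 3, 1), f) for f in (3, 7, 10, 14)]
# deltas == [1, 2, 2, 3]

# Hypothetical delta values for the two groups of files (active vs. not).
delta_active = [0, 1, 0, 2, 1]
delta_other = [3, 2, 4, 3, 5]
t, df = welch_t(delta_active, delta_other)  # t is about -4.11, df about 7.3
```

A negative t indicates fewer future changes in the active group. Turning t and df into a p-value requires the t distribution, for example through a statistics package such as SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)`.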

3.5. Threats to Validity

Internal validity problems arise from the presence of confounding factors, which lead to alternative explanations of the results. One of the issues that has also been pointed out in other studies of open source repositories is the presence of aliasing; in other words, a developer may be checking in code under different usernames [2]. This is especially a risk for Mozilla, whose CVS repository records developer email addresses. We manually inspected the top 50 developers of the project as given by CVS, and found that at least 5 persons have two email addresses. While this may be a problem for the global analysis of the Mozilla version history, this is not an issue for the local analysis. A manual inspection of the 220 email addresses used in the local analysis did not find any conclusive duplicates.

Another issue pertains to the representativeness of the 200-day time period in 2000 used in the local analysis. This period was chosen for analysis because it appeared to be relatively stable. The SiZer analysis shows the change frequency to be significantly decreasing at this point in time. This is important for the quality analysis, which depends on future changes not increasing due to significant new code development. In addition, we are repeating the analysis on other subsets of the data to look for corroboration from earlier Mozilla studies. We will analyze the 2003 time period where the change frequency is also decreasing, and also when the Mozilla Foundation took over the project. The results will be corroborated with findings from [10]. We will also perform the analysis at the module level, where developers are more likely to have closer interactions. The candidate modules will be based on the ones identified in [5] to have many balanced developer contributions.

External validity concerns the ability to generalize the results of the study, and in our case, the representativeness of the Mozilla project. As Figure 2 indicates, the project had a lot of developer turnover. This made it a good candidate for our initial analysis. We plan to replicate this study on other projects where the development team is more stable [21] as well as other well-studied projects such as Eclipse and Apache. The results will be reported in an expanded version of the paper.

4. Related Work

Data mining of version histories has been an active area of research (please see [25] for a recent bibliography). Version histories have been mined to identify related pieces of source code and change couplings to guide future changes [7, 6, 29, 18, 26, 23], to explain system architecture [28], to identify bug-introducing changes [24, 15] and also to discover partial orderings among change-sets [13]. There has been a growing interest in studying the meta-data available in version histories to identify developer activity and relate this activity to source code in terms of file (and logical entities) ownerships and changes [2, 9, 11].

The specific application considered in this paper, studying coordination and communication among developers and relating it to source code changes, is also not new. In this context, there is earlier work on studying coordination and communication among developers by mapping them to social networks, especially for open-source projects [22, 16, 21]. Several techniques including graph-theoretic analyses [16] and mining of explicit communication logs like emails [2] have been used to construct social networks among developers. The work described in [21] is perhaps the most closely related to the application in this paper. In [21], the authors develop methods to identify a core group of developers that are most active and study how this group evolves over a project history. In [22], a general overview of open source projects and communication patterns among developers is presented. In [16], the authors analyze version histories from a number of software projects to create a network of developers based on the modules committed by them. An analysis of these networks using social network metrics concludes that developers form a small-world network in these projects. In [2], email logs from version histories of the Apache project are mined to study questions like: are frequent email senders also active developers, do highly active developers also have high social status, etc.

However, most of these works do not seem to adequately address the temporal dimension in analyzing version histories. In most cases, either the version history is viewed as a single time interval [16, 2, 21, 5] or the time interval is arbitrarily partitioned based on a fixed interval of time (such as commits being, at most, 200 seconds apart) [6, 29, 1, 13]. This work generalizes our earlier work on parallel changes [20] by considering varying temporally close changes as parallel changes. Our problem of finding changes that most likely lead to future fixes is somewhat related to the problem of finding bug-introducing changes in [15, 24]. However, [15, 24] use origin tracking and semantic analysis techniques to accurately identify such changes.

This paper is our first attempt at applying time series segmentation to software repositories. To the best of our knowledge, this is also the first work to devise time series segmentation for non-numeric data and apply it to the mining of project version histories. We believe the proposed work is complementary and can potentially be used to enhance the power of the earlier proposed approaches.

5. Conclusion

Mining of software version histories is an important and active area of research. This paper proposes a novel approach to represent a version history as an item-set time series. Each measurement in an item-set time series is a set of discrete items and therefore, an item-set time series is especially suitable to capture the non-numeric aspects of a version history as a time series. A dynamic-programming based method has been described to construct an optimal segmentation of an item-set time series. Segmentation of a time series represents a time series in a compact manner by partitioning the time period associated with the time series into segments. The effectiveness of segmentation of version histories is then studied by representing the developer activity of the Mozilla data set as an item-set time series and constructing several optimal as well as fixed segmentations (where each segment consists of the same number of time points). The preliminary results are extremely encouraging. The segments constructed by the proposed approach capture time durations involving significant events in the history of Mozilla. It is also shown that optimal segmentations of an item-set time series are superior to fixed segmentations in terms of the number of developers captured, their activity, and the quality of their changes within a segment.

Acknowledgements. We’d like to thank the anonymous reviewers for their helpful comments, which improved the clarity of the paper.

References

[1] L. Aversano, L. Cerulo, and M. D. Penta. How clones are maintained: An empirical study. In Proc. of Conf. on Software Maintenance and Reengineering (CSMR), 2007.

[2] C. Bird, A. Gourley, P. Devanbu, M. Gertz, and A. Swaminathan. Mining email social networks. In International Workshop on Mining Software Repositories (MSR 2006), Shanghai, China, May 2006.

[3] P. Chaudhuri and J. Marron. SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94(447):807–823, Sept 1999.

[4] P. Chundi and D. Rosenkrantz. Constructing time decompositions for time stamped documents. In 2004 SIAM International Conference on Data Mining, pages 57–68, 2004.

[5] M. D’Ambros, M. Lanza, and H. C. Gall. Fractal Figures: Visualizing development effort for CVS entities. In Proceedings of the 3rd International Workshop on Visualizing Software For Understanding and Analysis, 2005.

[6] M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In Proc. of Intl. Conference on Software Maintenance (ICSM), 2003.

[7] H. Gall, K. Hajek, and M. Jazayeri. Detection of logical coupling based on product release history. In Proc. of Intl. Conf. On Software Maintenance (ICSM), 1998.

[8] A. Gionis and H. Mannila. Segmentation algorithms for time series and sequence data. In A Tutorial in the SIAM International Conference on Data Mining, 2005.

[9] T. Girba, A. Kuhn, M. Seeberger, and S. Ducasse. How developers drive software evolution. In Proc. of Intl. Workshop on Principles of Software Evolution (IWPSE), 2005.

[10] J. Gonzalez-Barahona, G. Robles, and I. Herraiz. Impact of the creation of the Mozilla Foundation in the activity of developers. In Proc. of Workshop on Mining Software Repositories (MSR), 2007.

[11] O. Greevy, T. Girba, and S. Ducasse. How developers develop features. In Proc. of Conference on Software Maintenance and Reengineering, 2007.

[12] K. K. J. Himberg, H. Mannila, J. Tikanmaki, and H. Toivonen. Time series segmentation for context recognition in mobile devices. In IEEE International Conference on Data Mining, pages 203–210, 2001.

[13] H. Kagdi, S. Yusuf, and J. Maletic. Mining of sequences of changed-files from version histories. In Proc. of Workshop on Mining Software Repositories (MSR), 2006.

[14] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In IEEE International Conference on Data Mining, pages 289–296, 2001.

[15] S. Kim, T. Zimmermann, K. Pan, and E. J. Whitehead. Automatic identification of bug-introducing changes. In Proc. of Conference on Automated Software Engineering (ASE 2006), 2006.

[16] L. Lopez-Fernandez, G. Robles, and J. Gonzalez-Barahona. Applying social network analysis to the information in CVS repositories. In International Workshop on Mining Software Repositories (MSR 2004), Edinburgh, Scotland, May 2004.

[17] A. Mockus, R. Fielding, and J. Herbsleb. Two case studies of open source software development: Apache and Mozilla. ACM Trans. on Software Engineering and Methodology, 11(3), July 2002.

[18] A. Mockus and D. Weiss. Globalization by chunking: A quantitative approach. IEEE Software, 18(2), 2001.

[19] Mozilla.org. Mozilla application suite release history. http://www.mozilla.org/releases/history.html.

[20] D. Perry, H. Siy, and L. Votta. Parallel changes in large scale software development: An observational case study. ACM Trans. on Software Engineering and Methodology, 10(3), 2001.

[21] G. Robles and J. Gonzalez-Barahona. Contributor turnover in libre software projects. In The Second International Conference on Open Source Systems (OSS2006), Como, Italy, June 2006.

[22] W. Scacchi. Socio-technical interaction networks in free/open source software development processes. In S. T. Acuna and N. Juristo, editors, Software Process Modeling, pages 1–27. Springer Science+Business Media Inc., 2005.

[23] J. S. Shirabad, T. Lethbridge, and S. Matwin. Supporting of legacy software with data mining techniques. In Proc. of Conf. on Advanced Studies for Collaborative Research, 2000.

[24] J. Sliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? In Proc. of Intl. Workshop on Mining Software Repositories (MSR 2005), 2005.

[25] T. Xie. Mining software engineering data bibliography. http://ase.csc.ncsu.edu/dmse/setasks.html.

[26] A. Ying, G. Murphy, R. Ng, and M. Chu-Carroll. Predicting source code changes by mining version history. Trans. of Software Engineering, 30(9), 2004.

[27] T. Zimmermann. MSR mining challenge 2007. http://msr.uwaterloo.ca/msr2007/challenge.

[28] T. Zimmermann, S. Diehl, and A. Zeller. How history justifies system architecture (or not). In Proc. of Intl. Symp. On Principles of Software Evolution, 2003.

[29] T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. IEEE Trans. on Software Engineering, 31(6), 2005.
