Topology and Evolution of the Open Source Software Community Advisors: Dr. Vincent W. Freeh Dr. Kevin Bowyer Supported in part by the National Science Foundation – Digital Science & Technology Yongqin Gao


Topology and Evolution of the Open Source Software Community

Advisors:

Dr. Vincent W. Freeh

Dr. Kevin Bowyer

Supported in part by the National Science Foundation – Digital Science & Technology

Yongqin Gao


Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusions


Overview (about OSS)

• What is OSS
– Free to use, free to distribute
– Unlimited users and usage
– Source code available and modifiable

• Potential advantages over commercial software
– Higher quality
– Faster development
– Lower cost
– Transparent


Overview (about our research)

• Our goal
– Understanding the OSS phenomenon

• Approach
– SourceForge is the source of our empirical data
– Modeling as a social network
– Analysis of topological statistics
– Use simulation to verify and validate the model


Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis

• Simulations

• Publications

• Conclusions


Data Collection — Monthly

• Web crawler (scripts)
– Python
– Shell
– AWK
– Sed

• Monthly
• Since Jan 2001
• ProjectID
• DeveloperID
• Almost 2 million records
• Relational database

PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
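A minimal sketch of how records in the PROJ|DEVELOPER layout could be loaded into the two sides of a bipartite structure. The sample rows are the ones shown on the slide; the function name `load_memberships` and the in-memory dictionaries are assumptions for illustration, not the crawler or database schema actually used.

```python
from collections import defaultdict

# Sample rows taken from the slide; the full table has almost 2 million records.
RECORDS = """\
PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
"""


def load_memberships(text):
    """Return (project -> developers, developer -> projects) from PROJ|DEVELOPER rows."""
    devs_by_project = defaultdict(set)
    projects_by_dev = defaultdict(set)
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        project, developer = line.split("|")
        devs_by_project[project].add(developer)
        projects_by_dev[developer].add(project)
    return devs_by_project, projects_by_dev


devs_by_project, projects_by_dev = load_memberships(RECORDS)
print(sorted(devs_by_project["8007"]))     # developers on project 8007
print(sorted(projects_by_dev["dev8972"]))  # projects that dev8972 belongs to
```

Keeping both directions of the mapping makes it cheap to ask either "who works on this project" or "which projects does this developer work on".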


Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusions


Modeling as Collaboration Network

• What is a collaboration network?
– A social network representing collaborative relationships
– Examples: the movie actor network and the scientist collaboration network

• Differences in the SourceForge collaboration network
– Link detachment
– Virtual collaboration
– Voluntary
– Global

• Bipartite property of collaboration networks


Collaboration network - bipartite

Adapted from Newman, Strogatz and Watts, 2001


SourceForge Developer Network

15850 dev[46] dev[83]
15850 dev[46] dev[48]
15850 dev[46] dev[56]
15850 dev[46] dev[58]
6882 dev[58] dev[47]
6882 dev[47] dev[79]
6882 dev[47] dev[52]
6882 dev[47] dev[55]
7028 dev[46] dev[99]
7028 dev[46] dev[51]
7028 dev[46] dev[57]
7597 dev[46] dev[45]
7597 dev[46] dev[72]
7597 dev[46] dev[55]
7597 dev[46] dev[58]
7597 dev[46] dev[61]
7597 dev[46] dev[64]
7597 dev[46] dev[67]
7597 dev[46] dev[70]
9859 dev[46] dev[49]
9859 dev[46] dev[53]
9859 dev[46] dev[54]
9859 dev[46] dev[59]

[Figure: graph of the developers listed above, grouped by Project 6882, Project 9859, Project 7597, Project 7028 and Project 15850]

OSS Developer Network (Part)

Developers are nodes / Projects are links

24 Developers

5 Projects

2 hub Developers

1 Cluster
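The "developers are nodes / projects are links" view is a one-mode projection of the bipartite graph. The sketch below shows one way to derive both the developer network and its dual, the project network. The membership data and the function names `project_bipartite` and `invert` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical membership data: project ID -> developers on that project.
memberships = {
    "p1": {"a", "b", "c"},
    "p2": {"c", "d"},
    "p3": {"e"},
}


def project_bipartite(groups):
    """One-mode projection: members of the same group become neighbors."""
    adjacency = defaultdict(set)
    for members in groups.values():
        for member in members:
            adjacency[member]  # touch the key so isolated members still appear
        for u, v in combinations(sorted(members), 2):
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


def invert(groups):
    """Swap the two sides of the bipartite graph (developer -> projects)."""
    inverted = defaultdict(set)
    for group, members in groups.items():
        for member in members:
            inverted[member].add(group)
    return dict(inverted)


developer_net = project_bipartite(memberships)        # developers are nodes
project_net = project_bipartite(invert(memberships))  # projects are nodes
print(developer_net["c"])  # c is linked to everyone it shares a project with
print(project_net)         # p1 and p2 are linked through shared developer c
```

In this toy data, developer `c` plays the role of a hub because it sits on two projects and so joins their teams into one cluster.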


Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusion


Topological Analysis

• Statistics inspected
– Diameter
– Average degree
– Clustering coefficient
– Degree distribution
– Cluster size distribution
– Relative size of major cluster
– Fitness and life cycle

• Evolution of these statistics

• Dual networks
– Developer network and project network


Terminology

• Diameter
– Average length of shortest paths between all pairs of vertices

• Degree
– The count of edges connected to a given vertex

• Average degree
– Average of the degrees of all vertices in the network

• Cluster
– The connected components of the network

• Clustering coefficient (CC)
– CCi: fraction representing the number of links actually present relative to the total possible number of links among the vertices in the neighborhood of vertex i
– CC: average of all CCi in a network

• Degree distribution
– The distribution of degrees throughout a network

• Major cluster
– The largest cluster in the network
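These definitions can be sketched in plain Python on a small adjacency map. The graph and function names are hypothetical. Two choices below are assumptions rather than something the slides state: vertices with fewer than two neighbors get CCi = 0, and the diameter (as defined here, an average shortest-path length) is averaged only over pairs that are actually connected. This is an exact computation for illustration, not the approximate method applied to the full data set.

```python
from collections import deque

# Hypothetical undirected network: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}


def average_degree(g):
    return sum(len(nbrs) for nbrs in g.values()) / len(g)


def clusters(g):
    """Connected components, largest (the major cluster) first."""
    seen, found = set(), []
    for start in g:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            for nbr in g[queue.popleft()]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        found.append(component)
    return sorted(found, key=len, reverse=True)


def local_cc(g, node):
    """CCi: links present among the neighbors of node / links possible."""
    nbrs = g[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: fewer than 2 neighbors counts as 0
    links = sum(1 for u in nbrs for v in g[u] if v in nbrs) / 2
    return links / (k * (k - 1) / 2)


def clustering_coefficient(g):
    return sum(local_cc(g, n) for n in g) / len(g)


def average_path_length(g):
    """Mean shortest-path length over all connected ordered pairs (BFS)."""
    total = pairs = 0
    for source in g:
        dist, queue = {source: 0}, deque([source])
        while queue:
            u = queue.popleft()
            for v in g[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


major = clusters(graph)[0]
print(average_degree(graph), clustering_coefficient(graph))
print(average_path_length(graph), len(major) / len(graph))
```

The last value printed is the relative size of the major cluster, that is, the share of all vertices that fall in the largest connected component.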


Diameter of Developer Network vs. Time

• Network size increased from 30,000 to 70,000


Diameter of Project Network vs. Time

• Network size increased from 20,000 to 50,000.

• Diameter decreases with time for both the developer network and the project network


Clustering Coefficient of Developer Network vs. Time


Clustering Coefficient of Project Network vs. Time


Degree Distribution (developers)


Degree Distribution (projects)


Cluster Size Distribution

• R² with major cluster is 0.7426

• R² without major cluster is 0.9799
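The slides do not say how these R² values were obtained. A common approach, assumed here, is a least-squares line through the log-log histogram. The sketch below uses made-up counts to show why a single giant cluster far off the line pulls R² down, and why dropping it restores a near-perfect fit.

```python
import math


def loglog_fit(sizes, counts):
    """Least-squares line through (log size, log count): slope, intercept, R^2."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical cluster-size histogram: counts fall off exactly as size^-2.
sizes = [1, 2, 4, 8, 16]
counts = [1024, 256, 64, 16, 4]
slope, _, r2 = loglog_fit(sizes, counts)

# Same histogram plus one hypothetical major cluster of 4096 nodes.
_, _, r2_with_major = loglog_fit(sizes + [4096], counts + [1])
print(slope, r2, r2_with_major)
```

The fitted slope is the negative of the power-law exponent, so a slope of about -2 corresponds to counts proportional to size^-2.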


Relative Size of Major Cluster vs. Time

• The relative size of the major cluster increases with time

• The rate of increase is slowing

• May be an indication of network evolution


Existence of Fitness

• Investigating the development of individual projects can verify the existence of the “newcomer” phenomenon

• We tracked every project created in July 2001 from then until now (1,660 projects in total)

• The maximum monthly growth per project is 13, while the average monthly growth per project is just 0.3639


Life Cycle of Project


Summary


Summary of Results

• Power law rules
– Degree distributions, cluster distribution

• Average degree increasing with time

• Diameter decreasing with time

• Clustering coefficient decreasing with time

• Fitness exists in SourceForge

• Projects show life-cycle behavior


Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusion


Conceptual Framework

[Diagram: Empirical data, Model and Simulation, connected by the steps Characterization, Description, Generation, Verification, Validation and Adjustment]


Agent-based Modeling

• EBM (equation-based modeling) vs. ABM (agent-based modeling)
– Heterogeneous individuals
– Complex network

• Experiment environment
– Hardware: computer cluster
– Software:
  • Simulation toolkit: Swarm
  • Database: Oracle
  • Language: Java, PL/SQL


Model for SourceForge

• ABM based on bipartite graph

• Model description
– Agent: developer
– Behaviors: create, join, abandon and idle
– Preference: developer’s and project’s
– Fitness

• Four models in iterations
– ER, BA, BA with constant fitness and BA with dynamic fitness

• Comparison of empirical and simulated data
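A toy sketch of the difference between ER-style and BA-style joining in a bipartite growth model. It is not the Swarm/Java model described here: it covers only the create and join behaviors (no abandon or idle), and the function `simulate` and the parameter `p_create` are hypothetical names introduced for illustration.

```python
import random


def simulate(steps, p_create=0.3, use_preference=True, seed=0):
    """Grow a developer-project bipartite graph, one new developer per step.

    Returns a dict mapping project ID -> set of developer IDs.
    """
    rng = random.Random(seed)
    projects = {0: {0}}  # start with developer 0 on project 0
    for dev in range(1, steps + 1):
        if rng.random() < p_create:
            projects[len(projects)] = {dev}  # "create" behavior
            continue
        ids = list(projects)
        if use_preference:
            # BA-style: the chance of joining grows with current team size
            weights = [len(projects[p]) for p in ids]
        else:
            # ER-style: every existing project is equally likely
            weights = [1] * len(ids)
        chosen = rng.choices(ids, weights=weights, k=1)[0]
        projects[chosen].add(dev)  # "join" behavior
    return projects


ba_like = simulate(500, use_preference=True)
er_like = simulate(500, use_preference=False)
print(max(len(team) for team in ba_like.values()))
print(max(len(team) for team in er_like.values()))
```

Comparing the team-size histograms of the two runs is one way to see how preferential joining tends to produce a few very large projects, while uniform joining spreads developers more evenly.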


ER Model - Diameter

• Average degree decreases in the simulation, while it increases in the empirical data

• Diameter increases in the simulation, while it decreases in the empirical data


ER Model – Clustering Coefficient

• Clustering coefficient is relatively low (under 0.3), while it is around 0.7 in the empirical data.


ER Model – Degree Distribution

• Degree distribution is a normal distribution, while it follows a power law in the empirical data


ER Model – Cluster Size Distribution

• Power law fit has R² of 0.6667 (0.9653 without the major cluster), while R² in the empirical data is 0.7426 (0.9799 without the major cluster)

• The actual distribution differs from the empirical data


BA Model – Diameter and Clustering Coefficient

• Small diameter and high clustering coefficient like empirical data

• Diameter and clustering coefficient are both decreasing like empirical data


BA Model – Degree Distribution

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).

• For developer distribution: simulated data has R² of 0.9798 and empirical data has R² of 0.9714.

• For project distribution: simulated data has R² of 0.6650 and empirical data has R² of 0.9838.


BA Model with Constant Fitness

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).

• For developer distribution: simulated data has R² of 0.9742 and empirical data has R² of 0.9714.

• For project distribution: simulated data has R² of 0.7253 and empirical data has R² of 0.9838.


BA Model with Dynamic Fitness

• Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).

• For developer distribution: simulated data has R² of 0.9695 and empirical data has R² of 0.9714.

• For project distribution: simulated data has R² of 0.8051 and empirical data has R² of 0.9838.


Advantage of Dynamic Fitness

• Intuition: fitness should decrease with time.

• Statistics: projects show life-cycle behavior, which cannot be replicated by the BA model with constant fitness but can be replicated by the BA model with dynamic fitness
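One way to picture the difference between constant and dynamic fitness is to weight each project by fitness times team size and let fitness shrink with project age. The exponential decay, the `decay` parameter and the function names below are assumptions for illustration; the slides do not give the actual fitness function.

```python
import math


def attachment_weight(size, age, base_fitness=1.0, decay=0.0):
    """Weight for joining a project: fitness(age) * current team size.

    decay=0 gives constant fitness; decay>0 makes fitness shrink with age.
    """
    fitness = base_fitness * math.exp(-decay * age)
    return fitness * size


def join_probabilities(projects, decay):
    """projects: list of (size, age_in_months); returns normalized probabilities."""
    weights = [attachment_weight(size, age, decay=decay) for size, age in projects]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical snapshot: an old, large project and a new, small one.
snapshot = [(20, 24), (2, 1)]
constant = join_probabilities(snapshot, decay=0.0)
dynamic = join_probabilities(snapshot, decay=0.2)
print(constant)  # the large old project dominates
print(dynamic)   # the young project attracts most newcomers
```

With constant fitness the large project keeps attracting developers indefinitely, so it never stops growing. With decaying fitness its growth levels off, which is the kind of life-cycle behavior the dynamic-fitness model is meant to reproduce.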


Summary


Summary of Results

• We use ABM to model and simulate the SourceForge collaboration network.

• A conceptual framework is proposed for agent-based modeling and simulation.

• Case study of this framework: SourceForge study through ER, BA, BA with constant fitness and BA with dynamic fitness.


Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusion


Publications To-date

• Yongqin Gao, "Modeling and Simulation of the OSS Community", Seventh Annual Swarm Researchers Meeting (Swarm2003), Notre Dame, IN, 2003.

• Yongqin Gao, Vince Freeh, and Greg Madey, "Analysis and Modeling of the Open Source Software Community", NAACSOS Conference 2003, Pittsburgh.

• Yongqin Gao, Vince Freeh, and Greg Madey, "Conceptual Framework for Agent-based Modeling and Simulation", NAACSOS Conference 2003, Pittsburgh.

• Greg Madey, Vincent Freeh, Renee Tynan, Yongqin Gao, Chris Hoffman, "Agent-based Modeling and Simulation of Collaborative Social Networks", AMCIS 2003, Tampa, FL.


Possible Journals

• Chapter 3
– Physica A: Statistical Mechanics and its Applications
– Journal of Social Structure (JSS)

• Chapter 4
– Journal of Artificial Societies and Social Simulation (JASSS)
– Journal of Statistical Computation and Simulation (JSCS)


Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusion


Conclusion

• Studying the SourceForge collaboration network can help us understand the OSS community

• We investigate not only the topological statistics but also the evolution of these statistics.

• Simulation is used to investigate the SourceForge collaboration network.


Contribution

• Statistical study of the SourceForge community (snapshot and evolution)

• Verification of the approximate method to calculate the diameter and CC

• Proposal of a model for the SourceForge community

• Extension of the BA model with dynamic fitness


Future Work

• Data collection
– Database dump from SourceForge (PostgreSQL, 8 GB)
– All the possible attributes
– Database schema in UML

• More topology analysis (with more attributes)
– Discussion forum
– Task assignment
– Project management
– Active testing

• Behavior-based analysis
– Interaction between agents
– H. Peyton Young’s model

• Information entropy analysis


Acknowledgements

• Committee

• Advisors

• Colleagues

• SourceForge

• NSF

• Others


Thank you