Abstract
Many open-source software projects depend on a few core developers, who take over both the bulk of coordination and programming tasks. They are supported by peripheral developers, who contribute either via discussions or programming tasks, often for a limited time. It is unclear what role these peripheral developers play in the programming and communication efforts, as well as the temporary task-related sub-groups in the projects. We mine code-repository data and mailing-list discussions to model the relationships and contributions of developers in a social network and devise a method to analyze the temporal collaboration structures in communication and programming, learning about the strength and stability of social sub-groups in open-source software projects. Our method uses multi-modal social networks on a series of time windows. Previous work has reduced the network structure representing developer collaboration to networks with only one type of interaction, which impedes the simultaneous analysis of more than one type of interaction. We use both communication and version-control data of open-source software projects and model different types of interaction over time. To demonstrate the practicability of our measurement and analysis method, we investigate 10 substantial and popular open-source software projects, and show that, if sub-groups evolve, modeling these sub-groups helps predict the future evolution of interaction levels of programmers and groups of developers. Our method allows maintainers and other stakeholders of open-source software projects to assess instabilities and organizational changes in developer interaction and can be applied to different use cases in organizational analysis, such as understanding the dynamics of a specific incident or discussion.
Keywords: Coordination, Group Structures, Open-Source Software, Repository Mining, Tensor Decomposition
Research Questions and Methodology
- RQ1: Are there stable group structures in open-source software projects? That is, are there groups of developers that steadily interact with each other during the project's evolution? Or are there no stable group structures, merely developers who just rally round certain tasks and vanish afterwards?
- RQ2: Does the communication behavior of developers result in the same group structures as arises from co-editing behavior? To what extent do the group structures that emerge from communication and from co-editing source code overlap in terms of developers who participate?
- RQ3: Does considering past activity in co-editing or communication improve the prediction of future co-editing or communication? Following the "mirroring hypothesis", can the prediction on one channel be improved by incorporating past activity on the other channel respectively?
Data Extraction and Processing
To assess and process the data for a single project, we performed three steps:
- download the mailing list and the Git repository,
- preprocess the data in both data sources for easier access,
- process the data to extract all relevant variables from the data and aggregate the data needed for our analysis.
Codeface
Codeface is a framework and interactive web frontend for the social and technical analysis of software projects.
nntp2mbox
nntp2mbox is a small Python script to download mailing-list archives from GMane to an mbox file.
codeface-extraction
codeface-extraction is a small extension to Codeface to extract and preprocess the version-control system and e-mail data.
coronet
coronet is a library to construct socio-technical developer networks based on various data sources in a configurable and reproducible way.
We developed a set of R scripts on top of Codeface and coronet for our analysis, which are available in the Downloads section. (Note that we use our customized version of Codeface.) To run our scripts, you need to clone coronet (we have used our scripts with version 3.7) into a subdirectory of the scripts directory. Further information about the data format and input directories can be found in the README.md of coronet and in the README.md delivered with our scripts in the Downloads section.
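The aggregation step described above can be pictured as counting interactions per developer pair, interaction channel, and time window. The following is a minimal sketch in Python with NumPy, not the R scripts themselves: the event format, the function name `build_tensor`, the developer names, and the choice to treat interactions as undirected are all assumptions made for illustration.

```python
import numpy as np

# Interaction channels as named in the results table of this page.
CHANNELS = ("cochange", "mail")


def build_tensor(events, developers, n_windows):
    """Count interactions in a developer x developer x channel x window array.

    `events` is a hypothetical list of (dev_a, dev_b, channel, window) tuples.
    """
    index = {dev: i for i, dev in enumerate(developers)}
    n = len(developers)
    tensor = np.zeros((n, n, len(CHANNELS), n_windows))
    for dev_a, dev_b, channel, window in events:
        i, j = index[dev_a], index[dev_b]
        c = CHANNELS.index(channel)
        # Assumption: interactions are undirected, so both cells are updated.
        tensor[i, j, c, window] += 1
        tensor[j, i, c, window] += 1
    return tensor


# Hypothetical example data: two e-mail exchanges and one co-edit.
events = [
    ("alice", "bob", "mail", 0),
    ("alice", "bob", "mail", 0),
    ("bob", "carol", "cochange", 1),
]
X = build_tensor(events, ["alice", "bob", "carol"], n_windows=2)
print(X.shape)
```

Each slice `X[:, :, c, t]` is then the adjacency matrix of one channel in one time window.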
Subject Projects
Project | Time | # 3-month Ranges | # Developers | # Commits | # E-Mails | Mailing List |
---|---|---|---|---|---|---|
Jailhouse | 2013-11-20–2016-08-24 | 11 | 17 | 1459 | 5598 | gmane.linux.jailhouse |
OpenSSL | 2002-04-21–2016-02-19 | 55 | 153 | 10981 | 32642 | gmane.comp.encryption.openssl.devel |
BusyBox | 2003-01-14–2016-02-16 | 52 | 217 | 10799 | 41995 | gmane.linux.busybox |
ownCloud | 2010-03-24–2018-05-20 | 32 | 471 | 27856 | 14384 | gmane.comp.kde.devel.owncloud |
QEMU | 2003-04-29–2016-07-27 | 52 | 919 | 45243 | 430202 | gmane.comp.emulators.qemu |
Git | 2005-04-13–2017-03-12 | 47 | 943 | 34811 | 313795 | gmane.comp.version-control.git |
Wine | 2002-04-06–2017-11-16 | 62 | 1092 | 112509 | 111331 | gmane.comp.emulators.wine.devel |
Django | 2005-08-01–2017-12-04 | 49 | 1131 | 24277 | 51338 | gmane.comp.python.django.devel |
FFmpeg | 2003-01-06–2017-12-12 | 59 | 1256 | 78871 | 242250 | gmane.comp.video.ffmpeg.devel |
U-Boot | 2000-01-01–2017-12-18 | 71 | 1356 | 44674 | 318719 | gmane.comp.boot-loaders.u-boot |
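The "# 3-month Ranges" column is consistent with counting only the complete three-month windows that fit between the first and the last date of a project. The sketch below illustrates that reading. It is an assumption about how the column relates to the time span, not the procedure of the R scripts, and the helper names `add_months` and `complete_windows` are hypothetical.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Return the date `months` calendar months after `d`, clamping the day."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def complete_windows(start, end, months=3):
    """List all complete, consecutive windows of `months` months within [start, end]."""
    windows = []
    k = 0
    while True:
        lower = add_months(start, k * months)
        upper = add_months(start, (k + 1) * months)
        if upper > end:
            break  # the next window would not fit completely
        windows.append((lower, upper))
        k += 1
    return windows


# Time spans of Jailhouse and OpenSSL as listed in the table above.
print(len(complete_windows(date(2013, 11, 20), date(2016, 8, 24))))
print(len(complete_windows(date(2002, 4, 21), date(2016, 2, 19))))
```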
Results
Our results are arranged into four sections:
- Descriptive Insights
- Decomposition Insights
- Predictive Performance by R
- Overall Performance for fixed R
Descriptive Insights
Project | N | nMtmax | M | M% | nCtmax | C | C% | n11 | |
---|---|---|---|---|---|---|---|---|---|
Jailhouse | 17 | 27 | 11.8 | 8.69 | 27 | 9.8 | 7.22 | 4.6 | 0.39 |
OpenSSL | 153 | 159 | 20.3 | 0.17 | 737 | 94.5 | 0.81 | 6.8 | 0.15 |
BusyBox | 217 | 150 | 62.6 | 0.27 | 300 | 110.7 | 0.47 | 16.2 | 0.19 |
ownCloud | 471 | 111 | 29.2 | 0.03 | 1964 | 929.3 | 0.84 | 12.8 | 0.08 |
QEMU | 919 | 1651 | 723.3 | 0.17 | 9888 | 2586.3 | 0.61 | 368.0 | 0.20 |
Git | 943 | 1892 | 750.0 | 0.17 | 3855 | 2270.1 | 0.51 | 230.8 | 0.18 |
Wine | 1092 | 912 | 446.5 | 0.07 | 5567 | 3671.1 | 0.62 | 218.3 | 0.18 |
Django | 1131 | 266 | 131.7 | 0.02 | 9370 | 1991.5 | 0.31 | 48.4 | 0.18 |
FFmpeg | 1256 | 1595 | 569.5 | 0.07 | 8572 | 3888.6 | 0.49 | 279.4 | 0.21 |
U-Boot | 1356 | 1139 | 455.4 | 0.05 | 3643 | 1197.3 | 0.13 | 163.4 | 0.18 |
Figures: one descriptive plot per project (BusyBox, Jailhouse, OpenSSL, ownCloud, QEMU, Git, Wine, Django, FFmpeg, U-Boot).
Decomposition Insights
In the following plots, the results of the canonical decomposition for a certain rank R of decomposition are displayed. Within every plot, panel (a) shows the weights of the latent factors, panel (b) shows the developer effects, panel (c) shows the interaction channel effects, and panel (d) shows the dynamic weights and thereby which factor was important at what time.
Figures: one decomposition plot per project (BusyBox, Jailhouse, OpenSSL, ownCloud, QEMU, Git, Wine, Django, FFmpeg, U-Boot) and per rank R = 2, ..., 9.
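To make the roles of the panels described in this section more concrete, the following sketch fits a canonical (CP) decomposition of rank R to a small three-way tensor (developers × channels × time windows) using alternating least squares in NumPy. It is a simplified illustration under stated assumptions, not the implementation behind the plots: the three-way layout, the function names `cp_als` and `reconstruct`, and the toy data are all hypothetical.

```python
import numpy as np


def cp_als(X, rank, n_iter=200, seed=0):
    """Fit a rank-`rank` CP model to a 3-way array with alternating least squares.

    Returns the factor weights and the column-normalized factor matrices
    for developers, channels, and time windows.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, rank))
    B = rng.random((J, rank))
    C = rng.random((K, rank))
    for _ in range(n_iter):
        # Each update solves a linear least-squares problem for one factor
        # matrix while the other two are held fixed.
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum("ijk,ir,kr->jr", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    # Move the column scales into one weight per latent factor.
    weights = np.ones(rank)
    factors = []
    for F in (A, B, C):
        norms = np.linalg.norm(F, axis=0)
        norms[norms == 0] = 1.0
        weights *= norms
        factors.append(F / norms)
    return weights, factors


def reconstruct(weights, factors):
    """Rebuild the tensor as a weighted sum of rank-one terms."""
    A, B, C = factors
    return np.einsum("r,ir,jr,kr->ijk", weights, A, B, C)


# Toy data: 4 developers, 2 channels, 5 windows, built from two latent groups.
dev = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
chan = np.array([[1.0, 0.2], [0.3, 1.0]])
window = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
X_toy = np.einsum("ir,jr,kr->ijk", dev, chan, window)

weights, (A, B, C) = cp_als(X_toy, rank=2)
error = np.linalg.norm(X_toy - reconstruct(weights, (A, B, C))) / np.linalg.norm(X_toy)
print("relative reconstruction error:", error)
```

In this sketch, `weights` plays the role of panel (a), the columns of `A` correspond to the developer effects in panel (b), the columns of `B` to the channel effects in panel (c), and the columns of `C` to the dynamic weights in panel (d).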
Predictive Performance by R
Our results for the predictive performance by R consist of two different kinds of results:
AUC by channel and time
Figures: one plot per project (BusyBox, Jailhouse, OpenSSL, ownCloud, QEMU, Git, Wine, Django, FFmpeg, U-Boot) and per rank R = 2, ..., 9, each for forecast horizon h = 1.
AUC by forecast horizon h and rank of reduction R
Figures: one plot per project (BusyBox, Jailhouse, OpenSSL, ownCloud, QEMU, Git, Wine, Django, FFmpeg, U-Boot).
Overall Performance for fixed R
Overall AUC measures by interaction channel, subject project, model, and forecast horizon
| | cochange: naive | cochange: sum | cochange: 3d | cochange: 4d | cochange: 3d-ext | cochange: 4d-ext | mail: naive | mail: sum | mail: 3d | mail: 4d | mail: 3d-ext | mail: 4d-ext |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ∅ (h=1) | .70 | .80 | .87 | .85 | .85 | .86 | .68 | .81 | .86 | .83 | .66 | .79 |
| ∅ (h=5) | .63 | .70 | .90 | .88 | .83 | .88 | .62 | .73 | .89 | .87 | .68 | .83 |
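For readers unfamiliar with the measure, the sketch below shows one common way to compute an AUC (area under the ROC curve) from predicted scores and observed binary outcomes. It is an illustration only: the function name `auc`, the pairwise formulation, and the example scores are assumptions and do not reproduce the evaluation pipeline behind the table above.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example receives
    a higher score than a randomly chosen negative one; ties count as 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = the developer pair interacted in the forecast
# window, 0 = it did not; scores are the model's predicted interaction levels.
observed = [1, 0, 1, 0, 0]
predicted = [0.9, 0.8, 0.7, 0.3, 0.1]
print(auc(observed, predicted))
```

A value of 0.5 corresponds to a ranking no better than chance, and 1.0 to a ranking that places every interacting pair above every non-interacting pair.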
Downloads
Note: For data privacy reasons, we cannot distribute the raw data that we gathered using our data-extraction tools (e.g., Codeface). Please refer to the respective tools to produce a data set yourself. The analyzed time ranges and all other information needed for this are listed in Table 1 in the Subject Projects section above.
Contact
If you have any questions regarding this paper or any other related project, please do not hesitate to contact us:
- Thomas Bock (Saarland University, Saarland Informatics Campus, Saarbrücken, Germany)
- Angelika Schmid (IBM, München, Germany)
- Sven Apel (Saarland University, Saarland Informatics Campus, Saarbrücken, Germany)