Measuring and Modeling Group Dynamics in Open-Source Software Development: A Tensor Decomposition Approach

— Supplementary Website

Thomas Bock, Angelika Schmid, Sven Apel

Abstract

Many open-source software projects depend on a few core developers, who take over both the bulk of coordination and programming tasks. They are supported by peripheral developers, who contribute either via discussions or programming tasks, often for a limited time. It is unclear what role these peripheral developers play in the programming and communication efforts, as well as the temporary task-related sub-groups in the projects. We mine code-repository data and mailing-list discussions to model the relationships and contributions of developers in a social network and devise a method to analyze the temporal collaboration structures in communication and programming, learning about the strength and stability of social sub-groups in open-source software projects. Our method uses multi-modal social networks on a series of time windows. Previous work has reduced the network structure representing developer collaboration to networks with only one type of interaction, which impedes the simultaneous analysis of more than one type of interaction. We use both communication and version-control data of open-source software projects and model different types of interaction over time. To demonstrate the practicability of our measurement and analysis method, we investigate 10 substantial and popular open-source software projects, and show that, if sub-groups evolve, modeling these sub-groups helps predict the future evolution of interaction levels of programmers and groups of developers. Our method allows maintainers and other stakeholders of open-source software projects to assess instabilities and organizational changes in developer interaction and can be applied to different use cases in organizational analysis, such as understanding the dynamics of a specific incident or discussion.

Keywords: Coordination, Group Structures, Open-Source Software, Repository Mining, Tensor Decomposition

Research Questions and Methodology

Graphical Abstract

RQ1:
Are there stable group structures in open-source software projects? That is, are there groups of developers that steadily interact with each other during the project's evolution? Or are there no stable group structures, merely developers who just rally round certain tasks and vanish afterwards?
RQ2:
Does the communication behavior of developers result in the same group structures as their co-editing behavior? To what extent do the group structures that emerge from communication and from co-editing source code overlap in terms of the developers who participate?
RQ3:
Does considering past activity in co-editing or communication improve the prediction of future co-editing or communication? Following the "mirroring hypothesis", can the prediction on one channel be improved by incorporating past activity on the other channel?

Canonical Decomposition

Data Extraction and Processing

To assess and process the data for a single project, we performed three steps:

  1. download the mailing list and the Git repository,
  2. preprocess the data in both data sources for easier access,
  3. process the data to extract all relevant variables and aggregate the data needed for our analysis.

For these three steps, we have used the following tools:

Codeface

Codeface is a framework and interactive web frontend for the social and technical analysis of software projects.

nntp2mbox

nntp2mbox is a small Python script to download mailing-list archives from GMane to an mbox file.

codeface-extraction

codeface-extraction is a small extension to Codeface to extract and preprocess the version-control system and e-mail data.

coronet

coronet is a library to construct socio-technical developer networks based on various data sources in a configurable and reproducible way.



We developed a set of R scripts on top of Codeface and coronet for our analysis, which are available in the Downloads section. (Note that we use our customized version of Codeface.) To run our scripts, you need to clone coronet (we have used our scripts with version 3.7) into a subdirectory of the scripts directory. Further information about the data format and input directories can be found in the README.md of coronet and in the README.md delivered with our scripts in the Downloads section.

Subject Projects

Table 1. List of subject projects.
Project    Time                   # 3-month Ranges  # Developers  # Commits  # E-Mails  Mailing List
Jailhouse  2013-11-20–2016-08-24  11                17            1,459      5,598      gmane.linux.jailhouse
OpenSSL    2002-04-21–2016-02-19  55                153           10,981     32,642     gmane.comp.encryption.openssl.devel
BusyBox    2003-01-14–2016-02-16  52                217           10,799     41,995     gmane.linux.busybox
ownCloud   2010-03-24–2018-05-20  32                471           27,856     14,384     gmane.comp.kde.devel.owncloud
QEMU       2003-04-29–2016-07-27  52                919           45,243     430,202    gmane.comp.emulators.qemu
Git        2005-04-13–2017-03-12  47                943           34,811     313,795    gmane.comp.version-control.git
Wine       2002-04-06–2017-11-16  62                1,092         112,509    111,331    gmane.comp.emulators.wine.devel
Django     2005-08-01–2017-12-04  49                1,131         24,277     51,338     gmane.comp.python.django.devel
FFmpeg     2003-01-06–2017-12-12  59                1,256         78,871     242,250    gmane.comp.video.ffmpeg.devel
U-Boot     2000-01-01–2017-12-18  71                1,356         44,674     318,719    gmane.comp.boot-loaders.u-boot
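
The "# 3-month Ranges" column appears consistent with counting only the complete three-month windows that fit between the start and end date of a project. The following is a minimal sketch under that assumption; it is not the authors' implementation (the actual splitting is done by the R tooling), and the helper names `add_months` and `count_full_windows` are hypothetical.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole calendar months, clamping the day to the month's end."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def count_full_windows(start, end, months_per_window=3):
    """Count consecutive windows of the given length that fit entirely in [start, end]."""
    count = 0
    # Always offset from the original start so day clamping cannot accumulate.
    while add_months(start, (count + 1) * months_per_window) <= end:
        count += 1
    return count


if __name__ == "__main__":
    # Date ranges taken from Table 1.
    print(count_full_windows(date(2013, 11, 20), date(2016, 8, 24)))  # Jailhouse
    print(count_full_windows(date(2003, 4, 29), date(2016, 7, 27)))   # QEMU
```

Under this assumption, the sketch reproduces the counts listed in Table 1 for the given date ranges, for example 11 for Jailhouse and 52 for QEMU. How a trailing incomplete window is treated in the actual analysis is not stated on this page.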

Results

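Our results are arranged into four sections: Descriptive Insights, Decomposition Insights, Predictive Performance by R, and Overall Performance for fixed R.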

Descriptive Insights

Table 2: Characterization of mail and cochange activity in 10 subject projects. N is the total number of developers involved in each project. nMtmax is the maximum number of mail edges per time window, nM is the average of the number of mail edges over time, and nM% is the average density of the mail network. The definitions are equivalent for nCtmax, nC, and nC%. n11 is the average number of intersecting edges, ϕ is the average ϕt coefficient.
Project    N      nMtmax  nM     nM%   nCtmax  nC       nC%   n11    ϕ
Jailhouse  17     27      11.8   8.69  27      9.8      7.22  4.6    0.39
OpenSSL    153    159     20.3   0.17  737     94.5     0.81  6.8    0.15
BusyBox    217    150     62.6   0.27  300     110.7    0.47  16.2   0.19
ownCloud   471    111     29.2   0.03  1,964   929.3    0.84  12.8   0.08
QEMU       919    1,651   723.3  0.17  9,888   2,586.3  0.61  368.0  0.20
Git        943    1,892   750.0  0.17  3,855   2,270.1  0.51  230.8  0.18
Wine       1,092  912     446.5  0.07  5,567   3,671.1  0.62  218.3  0.18
Django     1,131  266     131.7  0.02  9,370   1,991.5  0.31  48.4   0.18
FFmpeg     1,256  1,595   569.5  0.07  8,572   3,888.6  0.49  279.4  0.21
U-Boot     1,356  1,139   455.4  0.05  3,643   1,197.3  0.13  163.4  0.18
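
To illustrate the quantities in Table 2, the sketch below computes the density of an undirected developer network and a ϕ coefficient between the mail and cochange edge sets of one time window. It assumes that ϕ is the standard phi coefficient of the 2x2 table "pair has a mail edge" by "pair has a cochange edge" over all developer pairs, and that n11 counts the pairs connected in both networks. These are assumptions for illustration; the function names and toy data are hypothetical and not taken from the authors' scripts.

```python
from itertools import combinations
from math import sqrt


def density(n_edges, n_developers):
    """Share of realized edges among all possible undirected developer pairs."""
    possible = n_developers * (n_developers - 1) / 2
    return n_edges / possible


def phi_coefficient(mail_edges, cochange_edges, developers):
    """Phi coefficient between two undirected edge sets over the same developers."""
    mail = {frozenset(e) for e in mail_edges}
    cochange = {frozenset(e) for e in cochange_edges}
    n11 = n10 = n01 = n00 = 0
    for pair in combinations(sorted(developers), 2):
        key = frozenset(pair)
        in_mail, in_cochange = key in mail, key in cochange
        if in_mail and in_cochange:
            n11 += 1  # edge present in both networks
        elif in_mail:
            n10 += 1  # mail edge only
        elif in_cochange:
            n01 += 1  # cochange edge only
        else:
            n00 += 1  # no edge in either network
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else float("nan")


if __name__ == "__main__":
    devs = ["a", "b", "c", "d"]  # toy data, not from the study
    mail = [("a", "b"), ("a", "c"), ("b", "c")]
    cochange = [("b", "a"), ("a", "c"), ("c", "d")]
    print(phi_coefficient(mail, cochange, devs))
    print(100 * density(11.8, 17))  # Jailhouse: average mail edges, N = 17
```

For Jailhouse, 11.8 average mail edges among 17 developers (136 possible pairs) give a density of roughly 8.7 %, close to the 8.69 reported in the nM% column, which suggests that the density columns are percentages.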

BusyBox

busybox-plot-edges

Jailhouse

jailhouse-plot-edges

OpenSSL

openssl-plot-edges

ownCloud

owncloud-plot-edges

QEMU

qemu-plot-edges

Git

git-plot-edges

Wine

wine-plot-edges

Django

django-plot-edges

FFmpeg

ffmpeg-plot-edges

U-Boot

uboot-plot-edges

Decomposition Insights

The following plots show the results of the canonical decomposition for each rank R of decomposition from R = 2 to R = 9. Within every plot, panel (a) shows the weights of the latent factors, panel (b) shows the developer effects, panel (c) shows the interaction channel effects, and panel (d) shows the dynamic weights and thereby which factor was important at what time.
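
To make the roles of the four panels concrete, the sketch below evaluates a single entry of a rank-R canonical (CP) decomposition of a four-way tensor indexed by developer, developer, interaction channel, and time window. It assumes that both developer modes share one factor matrix; that symmetry, the function name `cp_entry`, and all numbers are illustrative assumptions rather than details taken from the paper.

```python
def cp_entry(weights, dev, chan, time, i, j, k, t):
    """Model value for developers i and j on channel k in time window t.

    Each latent factor r contributes the product of its weight, the effects
    of both developers, the channel effect, and the dynamic weight.
    """
    return sum(
        w * dev[i][r] * dev[j][r] * chan[k][r] * time[t][r]
        for r, w in enumerate(weights)
    )


# Toy example: R = 2 latent factors, 3 developers, 2 channels, 2 time windows.
weights = [2.0, 1.0]                         # (a) weights of the latent factors
dev = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # (b) developer effects, one row per developer
chan = [[1.0, 0.5], [0.0, 1.0]]              # (c) channel effects: row 0 = mail, row 1 = cochange
time = [[1.0, 0.0], [0.0, 1.0]]              # (d) dynamic weights, one row per time window

if __name__ == "__main__":
    # Developers 0 and 1 both load on factor 0, which is active on mail in window 0.
    print(cp_entry(weights, dev, chan, time, 0, 1, 0, 0))
    # Developers 0 and 2 share no factor, so the model predicts no interaction.
    print(cp_entry(weights, dev, chan, time, 0, 2, 0, 0))
```

In this reading, a latent factor corresponds to a sub-group: developers with large effects on the same factor are predicted to interact on the channels and in the time windows where that factor carries weight.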

BusyBox

R = 2
busybox-channel-effect-7-2
busybox-channel-time-7-2
R = 3
busybox-channel-effect-7-3
busybox-channel-time-7-3
R = 4
busybox-channel-effect-7-4
busybox-channel-time-7-4
R = 5
busybox-channel-effect-7-5
busybox-channel-time-7-5
R = 6
busybox-channel-effect-7-6
busybox-channel-time-7-6
R = 7
busybox-channel-effect-7-7
busybox-channel-time-7-7
R = 8
busybox-channel-effect-7-8
busybox-channel-time-7-8
R = 9
busybox-channel-effect-7-9
busybox-channel-time-7-9

Jailhouse

R = 2
jailhouse-channel-effect-7-2
jailhouse-channel-time-7-2
R = 3
jailhouse-channel-effect-7-3
jailhouse-channel-time-7-3
R = 4
jailhouse-channel-effect-7-4
jailhouse-channel-time-7-4
R = 5
jailhouse-channel-effect-7-5
jailhouse-channel-time-7-5
R = 6
jailhouse-channel-effect-7-6
jailhouse-channel-time-7-6
R = 7
jailhouse-channel-effect-7-7
jailhouse-channel-time-7-7
R = 8
jailhouse-channel-effect-7-8
jailhouse-channel-time-7-8
R = 9
jailhouse-channel-effect-7-9
jailhouse-channel-time-7-9

OpenSSL

R = 2
openssl-channel-effect-7-2
openssl-channel-time-7-2
R = 3
openssl-channel-effect-7-3
openssl-channel-time-7-3
R = 4
openssl-channel-effect-7-4
openssl-channel-time-7-4
R = 5
openssl-channel-effect-7-5
openssl-channel-time-7-5
R = 6
openssl-channel-effect-7-6
openssl-channel-time-7-6
R = 7
openssl-channel-effect-7-7
openssl-channel-time-7-7
R = 8
openssl-channel-effect-7-8
openssl-channel-time-7-8
R = 9
openssl-channel-effect-7-9
openssl-channel-time-7-9

ownCloud

R = 2
owncloud-channel-effect-7-2
owncloud-channel-time-7-2
R = 3
owncloud-channel-effect-7-3
owncloud-channel-time-7-3
R = 4
owncloud-channel-effect-7-4
owncloud-channel-time-7-4
R = 5
owncloud-channel-effect-7-5
owncloud-channel-time-7-5
R = 6
owncloud-channel-effect-7-6
owncloud-channel-time-7-6
R = 7
owncloud-channel-effect-7-7
owncloud-channel-time-7-7
R = 8
owncloud-channel-effect-7-8
owncloud-channel-time-7-8
R = 9
owncloud-channel-effect-7-9
owncloud-channel-time-7-9

QEMU

R = 2
qemu-channel-effect-7-2
qemu-channel-time-7-2
R = 3
qemu-channel-effect-7-3
qemu-channel-time-7-3
R = 4
qemu-channel-effect-7-4
qemu-channel-time-7-4
R = 5
qemu-channel-effect-7-5
qemu-channel-time-7-5
R = 6
qemu-channel-effect-7-6
qemu-channel-time-7-6
R = 7
qemu-channel-effect-7-7
qemu-channel-time-7-7
R = 8
qemu-channel-effect-7-8
qemu-channel-time-7-8
R = 9
qemu-channel-effect-7-9
qemu-channel-time-7-9

Git

R = 2
git-channel-effect-7-2
git-channel-time-7-2
R = 3
git-channel-effect-7-3
git-channel-time-7-3
R = 4
git-channel-effect-7-4
git-channel-time-7-4
R = 5
git-channel-effect-7-5
git-channel-time-7-5
R = 6
git-channel-effect-7-6
git-channel-time-7-6
R = 7
git-channel-effect-7-7
git-channel-time-7-7
R = 8
git-channel-effect-7-8
git-channel-time-7-8
R = 9
git-channel-effect-7-9
git-channel-time-7-9

Wine

R = 2
wine-channel-effect-7-2
wine-channel-time-7-2
R = 3
wine-channel-effect-7-3
wine-channel-time-7-3
R = 4
wine-channel-effect-7-4
wine-channel-time-7-4
R = 5
wine-channel-effect-7-5
wine-channel-time-7-5
R = 6
wine-channel-effect-7-6
wine-channel-time-7-6
R = 7
wine-channel-effect-7-7
wine-channel-time-7-7
R = 8
wine-channel-effect-7-8
wine-channel-time-7-8
R = 9
wine-channel-effect-7-9
wine-channel-time-7-9

Django

R = 2
django-channel-effect-7-2
django-channel-time-7-2
R = 3
django-channel-effect-7-3
django-channel-time-7-3
R = 4
django-channel-effect-7-4
django-channel-time-7-4
R = 5
django-channel-effect-7-5
django-channel-time-7-5
R = 6
django-channel-effect-7-6
django-channel-time-7-6
R = 7
django-channel-effect-7-7
django-channel-time-7-7
R = 8
django-channel-effect-7-8
django-channel-time-7-8
R = 9
django-channel-effect-7-9
django-channel-time-7-9

FFmpeg

R = 2
ffmpeg-channel-effect-7-2
ffmpeg-channel-time-7-2
R = 3
ffmpeg-channel-effect-7-3
ffmpeg-channel-time-7-3
R = 4
ffmpeg-channel-effect-7-4
ffmpeg-channel-time-7-4
R = 5
ffmpeg-channel-effect-7-5
ffmpeg-channel-time-7-5
R = 6
ffmpeg-channel-effect-7-6
ffmpeg-channel-time-7-6
R = 7
ffmpeg-channel-effect-7-7
ffmpeg-channel-time-7-7
R = 8
ffmpeg-channel-effect-7-8
ffmpeg-channel-time-7-8
R = 9
ffmpeg-channel-effect-7-9
ffmpeg-channel-time-7-9

U-Boot

R = 2
uboot-channel-effect-7-2
uboot-channel-time-7-2
R = 3
uboot-channel-effect-7-3
uboot-channel-time-7-3
R = 4
uboot-channel-effect-7-4
uboot-channel-time-7-4
R = 5
uboot-channel-effect-7-5
uboot-channel-time-7-5
R = 6
uboot-channel-effect-7-6
uboot-channel-time-7-6
R = 7
uboot-channel-effect-7-7
uboot-channel-time-7-7
R = 8
uboot-channel-effect-7-8
uboot-channel-time-7-8
R = 9
uboot-channel-effect-7-9
uboot-channel-time-7-9

Predictive Performance by R

The predictive performance by R is reported in two ways: AUC by channel and time, and AUC by forecast horizon h and rank of reduction R.

AUC by channel and time

BusyBox

R = 2, h = 1
busybox-plot-result-cv-7-2
R = 3, h = 1
busybox-plot-result-cv-7-3
R = 4, h = 1
busybox-plot-result-cv-7-4
R = 5, h = 1
busybox-plot-result-cv-7-5
R = 6, h = 1
busybox-plot-result-cv-7-6
R = 7, h = 1
busybox-plot-result-cv-7-7
R = 8, h = 1
busybox-plot-result-cv-7-8
R = 9, h = 1
busybox-plot-result-cv-7-9

Jailhouse

R = 2, h = 1
jailhouse-plot-result-cv-7-2
R = 3, h = 1
jailhouse-plot-result-cv-7-3
R = 4, h = 1
jailhouse-plot-result-cv-7-4
R = 5, h = 1
jailhouse-plot-result-cv-7-5
R = 6, h = 1
jailhouse-plot-result-cv-7-6
R = 7, h = 1
jailhouse-plot-result-cv-7-7
R = 8, h = 1
jailhouse-plot-result-cv-7-8
R = 9, h = 1
jailhouse-plot-result-cv-7-9

OpenSSL

R = 2, h = 1
openssl-plot-result-cv-7-2
R = 3, h = 1
openssl-plot-result-cv-7-3
R = 4, h = 1
openssl-plot-result-cv-7-4
R = 5, h = 1
openssl-plot-result-cv-7-5
R = 6, h = 1
openssl-plot-result-cv-7-6
R = 7, h = 1
openssl-plot-result-cv-7-7
R = 8, h = 1
openssl-plot-result-cv-7-8
R = 9, h = 1
openssl-plot-result-cv-7-9

ownCloud

R = 2, h = 1
owncloud-plot-result-cv-7-2
R = 3, h = 1
owncloud-plot-result-cv-7-3
R = 4, h = 1
owncloud-plot-result-cv-7-4
R = 5, h = 1
owncloud-plot-result-cv-7-5
R = 6, h = 1
owncloud-plot-result-cv-7-6
R = 7, h = 1
owncloud-plot-result-cv-7-7
R = 8, h = 1
owncloud-plot-result-cv-7-8
R = 9, h = 1
owncloud-plot-result-cv-7-9

QEMU

R = 2, h = 1
qemu-plot-result-cv-7-2
R = 3, h = 1
qemu-plot-result-cv-7-3
R = 4, h = 1
qemu-plot-result-cv-7-4
R = 5, h = 1
qemu-plot-result-cv-7-5
R = 6, h = 1
qemu-plot-result-cv-7-6
R = 7, h = 1
qemu-plot-result-cv-7-7
R = 8, h = 1
qemu-plot-result-cv-7-8
R = 9, h = 1
qemu-plot-result-cv-7-9

Git

R = 2, h = 1
git-plot-result-cv-7-2
R = 3, h = 1
git-plot-result-cv-7-3
R = 4, h = 1
git-plot-result-cv-7-4
R = 5, h = 1
git-plot-result-cv-7-5
R = 6, h = 1
git-plot-result-cv-7-6
R = 7, h = 1
git-plot-result-cv-7-7
R = 8, h = 1
git-plot-result-cv-7-8
R = 9, h = 1
git-plot-result-cv-7-9

Wine

R = 2, h = 1
wine-plot-result-cv-7-2
R = 3, h = 1
wine-plot-result-cv-7-3
R = 4, h = 1
wine-plot-result-cv-7-4
R = 5, h = 1
wine-plot-result-cv-7-5
R = 6, h = 1
wine-plot-result-cv-7-6
R = 7, h = 1
wine-plot-result-cv-7-7
R = 8, h = 1
wine-plot-result-cv-7-8
R = 9, h = 1
wine-plot-result-cv-7-9

Django

R = 2, h = 1
django-plot-result-cv-7-2
R = 3, h = 1
django-plot-result-cv-7-3
R = 4, h = 1
django-plot-result-cv-7-4
R = 5, h = 1
django-plot-result-cv-7-5
R = 6, h = 1
django-plot-result-cv-7-6
R = 7, h = 1
django-plot-result-cv-7-7
R = 8, h = 1
django-plot-result-cv-7-8
R = 9, h = 1
django-plot-result-cv-7-9

FFmpeg

R = 2, h = 1
ffmpeg-plot-result-cv-7-2
R = 3, h = 1
ffmpeg-plot-result-cv-7-3
R = 4, h = 1
ffmpeg-plot-result-cv-7-4
R = 5, h = 1
ffmpeg-plot-result-cv-7-5
R = 6, h = 1
ffmpeg-plot-result-cv-7-6
R = 7, h = 1
ffmpeg-plot-result-cv-7-7
R = 8, h = 1
ffmpeg-plot-result-cv-7-8
R = 9, h = 1
ffmpeg-plot-result-cv-7-9

U-Boot

R = 2, h = 1
uboot-plot-result-cv-7-2
R = 3, h = 1
uboot-plot-result-cv-7-3
R = 4, h = 1
uboot-plot-result-cv-7-4
R = 5, h = 1
uboot-plot-result-cv-7-5
R = 6, h = 1
uboot-plot-result-cv-7-6
R = 7, h = 1
uboot-plot-result-cv-7-7
R = 8, h = 1
uboot-plot-result-cv-7-8
R = 9, h = 1
uboot-plot-result-cv-7-9

AUC by forecast horizon h and rank of reduction R

BusyBox

busybox-plot-AUC-by-R-7

Jailhouse

jailhouse-plot-AUC-by-R-7

OpenSSL

openssl-plot-AUC-by-R-7

ownCloud

owncloud-plot-AUC-by-R-7

QEMU

qemu-plot-AUC-by-R-7

Git

git-plot-AUC-by-R-7

Wine

wine-plot-AUC-by-R-7

Django

django-plot-AUC-by-R-7

FFmpeg

ffmpeg-plot-AUC-by-R-7

U-Boot

uboot-plot-AUC-by-R-7

Overall Performance for fixed R

Overall AUC measures by interaction channel, subject project, model, and forecast horizon
R = 3
heatmap-all-results-7
Table 3: Arithmetic mean of all AUC measures by model and forecast horizon (h = 1 or h = 5), for R = 3. Model 3d performs best for all four combinations of interaction channel and forecast horizon.
        cochange                                 mail
        naive  sum  3d   4d   3d-ext  4d-ext     naive  sum  3d   4d   3d-ext  4d-ext
(h=1)   .70    .80  .87  .85  .85     .86        .68    .81  .86  .83  .66     .79
(h=5)   .63    .70  .90  .88  .83     .88        .62    .73  .89  .87  .68     .83
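
As a reminder of what the AUC values in Table 3 capture in a link-prediction setting, the sketch below computes the AUC as the fraction of (edge, non-edge) pairs in which the edge received the higher predicted score, with ties counted as one half. How AUC is computed in the authors' R scripts is not stated on this page, so the function `auc` and the toy data are illustrative assumptions.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Compare every positive with every negative; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: predicted interaction scores for four developer pairs and
    # whether each pair actually interacted in the forecast window.
    predicted = [0.9, 0.8, 0.3, 0.1]
    observed = [1, 0, 1, 0]
    print(auc(predicted, observed))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of interacting pairs above non-interacting pairs.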

Downloads

Note: For data privacy reasons, we cannot distribute the raw data that we gathered using our data-extraction tools (e.g., Codeface). Please refer to the respective tools to produce a set of data for yourself. The analyzed time ranges and all further information needed are listed for each subject project in Table 1 above.

Contact

If you have any questions regarding this paper or any other related project, please do not hesitate to contact us: