Automatic Core-Developer Identification on GitHub:
A Validation Study

— Supplementary Website

Thomas Bock, Nils Alznauer, Mitchell Joblin, Sven Apel

Abstract

Many open-source software (OSS) projects are self-organized and do not maintain official lists with information on developer roles. So, knowing which developers take core and maintainer roles is, despite being relevant, often tacit knowledge. We propose a method to automatically identify core developers based on the role permissions of privileged events triggered in GitHub issues and pull requests. In an empirical study on 25 GitHub projects, we (1) validate the set of automatically identified core developers against a sample of project-reported developer lists, and (2) use our set of identified core developers to assess the accuracy of state-of-the-art unsupervised developer classification methods. Our results indicate that the set of core developers, which we extracted from privileged issue events, is sound, and that the accuracy of state-of-the-art unsupervised classification methods depends mainly on the data source (commit data vs. issue data) rather than on the network-construction method (directed vs. undirected, etc.). In perspective, our results shall guide research and practice in choosing appropriate unsupervised classification methods, and our method can help create reliable ground-truth data for training supervised classification methods.

Keywords: open-source software projects · developer classification · developer networks

Subject Projects

Subject Project | Investigated Time Period | # Commit Authors | # Issue Participants | # Commits | # Issues incl. PRs | Project Domain | Programming Language
Angular | 2014-09--2020-09 | 667 | 22859 | 12349 | 38502 | Web development platform | TypeScript
Atom | 2012-01--2020-12 | 298 | 21047 | 15627 | 21138 | Text editor | JavaScript
Bootstrap | 2011-08--2020-12 | 219 | 24744 | 2266 | 31735 | Web front-end framework | JavaScript, HTML
Deno | 2018-05--2020-12 | 348 | 3070 | 3417 | 8760 | Runtime for JavaScript | Rust, JavaScript, TypeScript
DTP | 2018-01--2020-04 | 16 | 73 | 633 | 859 | Framework for data transfer | Java
Electron | 2013-05--2020-12 | 392 | 15559 | 10664 | 26733 | Application development framework | C++, TypeScript
Flutter | 2015-03--2020-12 | 683 | 34460 | 13367 | 72504 | UI development kit | Dart
jQuery | 2010-09--2020-12 | 244 | 3118 | 2675 | 4723 | JavaScript library | JavaScript
Keras | 2015-03--2019-11 | 716 | 12688 | 3471 | 13468 | Deep learning API | Python
Kubernetes | 2014-06--2020-12 | 2408 | 23220 | 38619 | 97218 | Container management | Go
Moby | 2013-01--2020-12 | 1154 | 29083 | 14072 | 41731 | Software containerization | Go
Nextcloud | 2016-06--2020-09 | 355 | 9510 | 9718 | 22689 | Cloud server | PHP, JavaScript
Next.js | 2016-10--2020-12 | 867 | 11087 | 3891 | 15344 | React framework | JavaScript, TypeScript
Node.js | 2014-11--2020-02 | 1793 | 13190 | 12118 | 31372 | JavaScript runtime environment | JavaScript, C++, Python
OpenSSL | 2013-05--2019-12 | 400 | 3303 | 8722 | 10639 | Crypto library | C, Perl
ownCloud | 2012-08--2019-10 | 393 | 10141 | 18274 | 36178 | Cloud server | PHP, JavaScript
React | 2013-05--2020-12 | 796 | 16056 | 6921 | 20252 | JavaScript library | JavaScript
Redux | 2015-06--2020-12 | 228 | 4123 | 701 | 3931 | Container for JavaScript | TypeScript, JavaScript
reveal.js | 2011-06--2020-10 | 141 | 2861 | 1090 | 2762 | HTML presentation framework | JavaScript, HTML
TensorFlow | 2015-11--2020-12 | 1519 | 35781 | 55499 | 45652 | Machine learning framework | C++, Python
three.js | 2010-04--2020-12 | 954 | 8280 | 15999 | 20845 | JavaScript library | JavaScript, HTML
TypeScript | 2014-07--2020-12 | 467 | 18397 | 17934 | 40973 | JavaScript language | TypeScript
VS Code | 2015-11--2020-12 | 1001 | 67882 | 49814 | 111073 | Integrated development environment | TypeScript
Vue | 2016-04--2020-11 | 217 | 8754 | 2256 | 9325 | JavaScript UI framework | JavaScript
webpack | 2012-05--2020-12 | 501 | 13091 | 5671 | 11710 | Bundler for modules | JavaScript

Results

Our results are arranged according to our three research questions:

RQ1: How long is the typical time difference between a developer's events that require, at least, write permission?

Time differences between privileged or privileged+extended events

RQ2: Is the set of privileged developers D_priv a sound approximation for the set of core developers?

Validation of our set of core developers

RQ3: Which metrics and network-construction methods are most accurate in classifying developers into core and peripheral?


Using privileged events:
  • Accuracy of the classification methods (across all projects)
  • Accuracy of the classification methods (individually for each project)
  • Distribution of the ranks of the classification methods
  • Sizes of the sets of privileged developers, classified core developers, and classified peripheral developers
  • Percentage of privileged developers not part of the classification

Using privileged+extended events:
  • Accuracy of the classification methods (across all projects)
  • Accuracy of the classification methods (individually for each project)
  • Distribution of the ranks of the classification methods
  • Sizes of the sets of privileged+extended developers, classified core developers, and classified peripheral developers
  • Percentage of privileged+extended developers not part of the classification

Tools Used for Data Extraction from GitHub

Codeface

Codeface logo

Codeface is a framework and interactive web frontend for the social and technical analysis of software projects.

GitHubWrapper

GitHubWrapper logo

GitHubWrapper is a tool to extract information from the GitHub issues API.

codeface-extraction

codeface-extraction logo

codeface-extraction is a small extension to Codeface to extract and preprocess commit, e-mail, and issue data.


coronet

coronet logo

coronet is a library to construct socio-technical developer networks based on various data sources in a configurable and reproducible way.



We developed a set of R scripts on top of Codeface and coronet for our analysis, which are available in the Downloads section. In order to run our scripts, you need to clone coronet (we used our scripts with version 4.0) into a subdirectory of the scripts directory. Further information about the data format and input directories can be found in the README.md of coronet and in the README.md delivered with our scripts in the Downloads section.
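The directory setup described above can be sketched as follows. Note that the repository URL, the tag name, and the `scripts` directory name are assumptions for illustration; verify them against the README.md files mentioned above:

```shell
# Illustrative setup -- assumes the analysis scripts live in a directory
# named "scripts" and that coronet version 4.0 is published as tag "v4.0".
mkdir -p scripts
git clone https://github.com/se-sic/coronet.git scripts/coronet
git -C scripts/coronet checkout v4.0   # pin the coronet version used in the study
```

After this step, the analysis scripts can locate coronet in their subdirectory, as required by our setup.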


Steps to replicate the study:
  1. Obtain the input data for our study. There are two independent options to achieve this:
    • Either run the tools Codeface and GitHubWrapper on your own to collect commit and issue data from GitHub, and run BoDeGHa afterwards to identify bots. Thereafter, you need to run several parts of codeface-extraction (issue_processing, bot_processing, codeface_extraction, author_postprocessing, and anonymization). Note that each of these tools might run for several hours or even multiple days (depending on the subject project). For large projects, a large amount of RAM is necessary (we recommend > 250 GB).
    • Or use the pseudonymized, sanity-checked, and manually corrected input data that we used in our study, which we provide in the Downloads section below (see item 2).
  2. Run our analysis scripts (available below in the Downloads section, see item 3) in R (we used R version 4.1.1). As our scripts use the library coronet (version 4.0), coronet needs to be available in a subdirectory of the analysis scripts. How our analysis scripts work, how they can be configured, etc., is documented in a README.md file that we deliver together with our scripts. For large projects, a large amount of RAM is necessary (we recommend > 250 GB). Our analysis scripts process the issue events, extract the set of privileged developers, construct developer networks, apply the state-of-the-art developer classification methods, assess their accuracy, etc. As output, our scripts create numerous plots and data files, including the resulting classification data as .csv files. The resulting classification data can also be downloaded below in the Downloads section (see item 4).
To replicate step 2, we provide a Dockerfile below in the Downloads section (together with a README.md file that describes how to use it). If you set up a Docker container based on this Dockerfile, the pseudonymized input data, our analysis scripts, and coronet are already downloaded in the container, and the necessary dependencies get installed, such that you can directly run our analysis scripts within the container. For large projects, a large amount of RAM is necessary (we recommend > 250 GB). More details (also regarding the required R libraries, etc.) can be found in the respective README.md files.
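Assuming a standard Docker installation, the container-based replication can be sketched as follows. The image tag and the memory limit are illustrative choices, not prescribed by the Dockerfile's README.md:

```shell
# Build the replication image from the downloaded Dockerfile
# (assumed to reside in the current working directory).
docker build -t core-dev-replication .

# Start an interactive container; the memory limit mirrors the RAM
# recommendation above and can be lowered for smaller subject projects.
docker run -it --memory=250g core-dev-replication
```

Inside the container, the analysis scripts can then be run directly, as the input data and coronet are already in place.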

Downloads

Note: For data privacy reasons, we cannot distribute the raw data that we gathered using our data-extraction tools. Instead, we only provide pseudonymized raw data as input data for our scripts. Please refer to the respective tools to produce a non-pseudonymized set of raw data yourself, as described above. More information on the analyzed time ranges and all further details on our subject projects can be found in the corresponding table above.

  1. Our tool GitHubWrapper for getting issue data from GitHub (available on GitHub)
  2. Pseudonymized input data (containing commit data and issue data)
  3. Our analysis scripts to perform the analyses
  4. Resulting pseudonymized classification data (values of the different classification methods for each developer for each project, as well as the set of privileged and privileged+extended developers)
  5. Setup analysis via Dockerfile

Contact

If you have any questions regarding this paper, please do not hesitate to contact us: