Automatic Core-Developer Identification on GitHub:
A Validation Study

— Supplementary Website

Thomas Bock, Nils Alznauer, Mitchell Joblin, Sven Apel

Abstract

Many open-source software (OSS) projects are self-organized and do not maintain official lists with information on developer roles. So, knowing which developers take core and maintainer roles is, despite being relevant, often tacit knowledge. We propose a method to automatically identify core developers based on the role permissions of privileged events triggered in GitHub issues and pull requests. In an empirical study on 25 GitHub projects, we (1) validate the set of automatically identified core developers against a sample of project-reported developer lists, and (2) use our set of identified core developers to assess the accuracy of state-of-the-art unsupervised developer classification methods. Our results indicate that the set of core developers, which we extracted from privileged issue events, is sound, and that the accuracy of state-of-the-art unsupervised classification methods depends mainly on the data source (commit data vs. issue data) rather than on the network-construction method (directed vs. undirected, etc.). In perspective, our results shall guide research and practice in choosing appropriate unsupervised classification methods, and our method can help create reliable ground-truth data for training supervised classification methods.

Keywords: open-source software projects · developer classification · developer networks

Subject Projects

Subject Project | Investigated Time Period | # Commit Authors | # Issue Participants | # Commits | # Issues incl. PRs | Project Domain | Programming Language
Angular | 2014-09--2020-09 | 667 | 22859 | 12349 | 38502 | Web development platform | TypeScript
Atom | 2012-01--2020-12 | 298 | 21047 | 15627 | 21138 | Text editor | JavaScript
Bootstrap | 2011-08--2020-12 | 219 | 24744 | 2266 | 31735 | Web front-end framework | JavaScript, HTML
Deno | 2018-05--2020-12 | 348 | 3070 | 3417 | 8760 | Runtime for JavaScript | Rust, JavaScript, TypeScript
DTP | 2018-01--2020-04 | 16 | 73 | 633 | 859 | Framework for data transfer | Java
Electron | 2013-05--2020-12 | 392 | 15559 | 10664 | 26733 | Application development framework | C++, TypeScript
Flutter | 2015-03--2020-12 | 683 | 34460 | 13367 | 72504 | UI development kit | Dart
jQuery | 2010-09--2020-12 | 244 | 3118 | 2675 | 4723 | JavaScript library | JavaScript
Keras | 2015-03--2019-11 | 716 | 12688 | 3471 | 13468 | Deep learning API | Python
Kubernetes | 2014-06--2020-12 | 2408 | 23220 | 38619 | 97218 | Container management | Go
Moby | 2013-01--2020-12 | 1154 | 29083 | 14072 | 41731 | Software containerization | Go
Nextcloud | 2016-06--2020-09 | 355 | 9510 | 9718 | 22689 | Cloud server | PHP, JavaScript
Next.js | 2016-10--2020-12 | 867 | 11087 | 3891 | 15344 | React framework | JavaScript, TypeScript
Node.js | 2014-11--2020-02 | 1793 | 13190 | 12118 | 31372 | JavaScript runtime environment | JavaScript, C++, Python
OpenSSL | 2013-05--2019-12 | 400 | 3303 | 8722 | 10639 | Crypto library | C, Perl
ownCloud | 2012-08--2019-10 | 393 | 10141 | 18274 | 36178 | Cloud server | PHP, JavaScript
React | 2013-05--2020-12 | 796 | 16056 | 6921 | 20252 | JavaScript library | JavaScript
Redux | 2015-06--2020-12 | 228 | 4123 | 701 | 3931 | Container for JavaScript | TypeScript, JavaScript
reveal.js | 2011-06--2020-10 | 141 | 2861 | 1090 | 2762 | HTML presentation framework | JavaScript, HTML
TensorFlow | 2015-11--2020-12 | 1519 | 35781 | 55499 | 45652 | Machine learning framework | C++, Python
three.js | 2010-04--2020-12 | 954 | 8280 | 15999 | 20845 | JavaScript library | JavaScript, HTML
TypeScript | 2014-07--2020-12 | 467 | 18397 | 17934 | 40973 | JavaScript language | TypeScript
VS Code | 2015-11--2020-12 | 1001 | 67882 | 49814 | 111073 | Integrated development environment | TypeScript
Vue | 2016-04--2020-11 | 217 | 8754 | 2256 | 9325 | JavaScript UI framework | JavaScript
webpack | 2012-05--2020-12 | 501 | 13091 | 5671 | 11710 | Bundler for modules | JavaScript

Results

Our results are arranged according to our three research questions:

RQ1: How long is the typical time difference between a developer's events that require, at least, write permission?

Time differences between privileged or privileged+extended events

RQ2: Is the set of privileged developers D_priv a sound approximation for the set of core developers?

Validation of our set of core developers

RQ3: Which metrics and network-construction methods are most accurate in classifying developers into core and peripheral?


Using privileged events:
  • Accuracy of the classification methods (across all projects)
  • Accuracy of the classification methods (individually for each project)
  • Distribution of the ranks of the classification methods
  • Sizes of the sets of privileged developers, classified core developers, and classified peripheral developers
  • Percentage of privileged developers not part of the classification

Using privileged+extended events:
  • Accuracy of the classification methods (across all projects)
  • Accuracy of the classification methods (individually for each project)
  • Distribution of the ranks of the classification methods
  • Sizes of the sets of privileged+extended developers, classified core developers, and classified peripheral developers
  • Percentage of privileged+extended developers not part of the classification

Tools Used for Data Extraction from GitHub

Codeface

Codeface logo

Codeface is a framework and interactive web frontend for the social and technical analysis of software projects.

GitHubWrapper

GitHubWrapper logo

GitHubWrapper is a tool to extract information from the GitHub issues API.

codeface-extraction

codeface-extraction logo

codeface-extraction is a small extension to Codeface to extract and preprocess commit, e-mail, and issue data.


coronet

coronet logo

coronet is a library to construct socio-technical developer networks based on various data sources in a configurable and reproducible way.



We developed a set of R scripts on top of Codeface and coronet for our analysis, which are available in the Downloads section. In order to run our scripts, you need to clone coronet (we used our scripts with version 4.0) into a subdirectory of the scripts directory. Further information about the data format and input directories can be found in the README.md of coronet and in the README.md delivered with our scripts in the Downloads section.
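The directory setup described above can be sketched as follows. Note that the repository URL, the tag name, and the `scripts` directory name are assumptions for illustration; verify them against the README.md files mentioned above:

```shell
# Illustrative setup -- assumes the analysis scripts live in a directory
# named "scripts" and that coronet version 4.0 is published as tag "v4.0".
mkdir -p scripts
git clone https://github.com/se-sic/coronet.git scripts/coronet
git -C scripts/coronet checkout v4.0   # pin the coronet version used in the study
```

After this step, the analysis scripts can locate coronet in their subdirectory, as required by our setup.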


Steps to replicate the study:
  1. Obtain the input data for our study. There are two independent options to achieve this:
    • Either run the tools Codeface and GitHubWrapper on your own to collect commit and issue data from GitHub, and run BoDeGHa afterwards to identify bots. Thereafter, you need to run several parts of codeface-extraction (issue_processing, bot_processing, codeface_extraction, author_postprocessing, and anonymization). Note that each of these tools might run for several hours or even multiple days (depending on the subject project). For large projects, a large amount of RAM is necessary (we recommend > 250 GB).
    • Or use the pseudonymized, sanity-checked, and manually corrected input data that we used in our study, which we provide in the Downloads section below (see item 2).
  2. Run our analysis scripts (available below in the Downloads section, see item 3) in R (we used R version 4.1.1). As our scripts use the library coronet (version 4.0), coronet needs to be available in a subdirectory of the analysis scripts. How our analysis scripts work, how they can be configured, etc., is documented in a README.md file that we deliver together with our scripts. For large projects, a large amount of RAM is necessary (we recommend > 250 GB). Our analysis scripts process the issue events, extract the set of privileged developers, construct developer networks, apply the state-of-the-art developer classification methods, assess their accuracy, etc. As output, our scripts create numerous plots and data files, including the resulting classification data as .csv files. The resulting classification data can also be downloaded below in the Downloads section (see item 4).
To replicate step 2, we provide a Dockerfile below in the Downloads section (together with a README.md file that describes how to use it). If you set up a Docker container based on this Dockerfile, the pseudonymized input data, our analysis scripts, and coronet are already downloaded in the container, and the necessary dependencies get installed, such that you can directly run our analysis scripts within the container. For large projects, a large amount of RAM is necessary (we recommend > 250 GB). More details (also regarding the required R libraries, etc.) can be found in the respective README.md files.
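Assuming a standard Docker installation, the container-based replication can be sketched as follows. The image tag and the memory limit are illustrative choices, not prescribed by the Dockerfile's README.md:

```shell
# Build the replication image from the downloaded Dockerfile
# (assumed to reside in the current working directory).
docker build -t core-dev-replication .

# Start an interactive container; the memory limit mirrors the RAM
# recommendation above and can be lowered for smaller subject projects.
docker run -it --memory=250g core-dev-replication
```

Inside the container, the analysis scripts can then be run directly, as the input data and coronet are already in place.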

Downloads

Note: For data privacy reasons, we cannot distribute the raw data that we gathered using our data-extraction tools. Instead, we only provide pseudonymized raw data as input data for our scripts. Please refer to the respective tools to produce a non-pseudonymized set of raw data yourself, as described above. More information on the analyzed time ranges and all further details on our subject projects can be found in the corresponding table above.

  1. Our tool GitHubWrapper for getting issue data from GitHub (available on GitHub)
  2. Pseudonymized input data (containing commit data and issue data)
  3. Our analysis scripts to perform the analyses
  4. Resulting pseudonymized classification data (values of the different classification methods for each developer for each project, as well as the set of privileged and privileged+extended developers)
  5. Setup analysis via Dockerfile

Contact

If you have any questions regarding this paper, please do not hesitate to contact us: