Abstract
Many open-source software (OSS) projects are self-organized and do not maintain official lists with information on developer roles. So, knowing which developers take core and maintainer roles is, despite being relevant, often tacit knowledge. We propose a method to automatically identify core developers based on role permissions of privileged events triggered in GitHub issues and pull requests. In an empirical study on 25 GitHub projects, we (1) validate the set of automatically identified core developers with a sample of project-reported developer lists, and we (2) use our set of identified core developers to assess the accuracy of state-of-the-art unsupervised developer classification methods. Our results indicate that the set of core developers, which we extracted from privileged issue events, is sound and the accuracy of state-of-the-art unsupervised classification methods depends mainly on the data source (commit data vs. issue data) rather than the network-construction method (directed vs. undirected, etc.). In perspective, our results shall guide research and practice to choose appropriate unsupervised classification methods, and our method can help create reliable ground-truth data for training supervised classification methods.
Keywords: open-source software projects, developer classification, developer networks
Subject Projects
Subject Project | Investigated Time Period | # Commit Authors | # Issue Participants | # Commits | # Issues incl. PRs | Project Domain | Programming Language
Angular | 2014-09--2020-09 | 667 | 22859 | 12349 | 38502 | Web development platform | TypeScript |
Atom | 2012-01--2020-12 | 298 | 21047 | 15627 | 21138 | Text editor | JavaScript |
Bootstrap | 2011-08--2020-12 | 219 | 24744 | 2266 | 31735 | Web front-end framework | JavaScript, HTML |
Deno | 2018-05--2020-12 | 348 | 3070 | 3417 | 8760 | Runtime for JavaScript | Rust, JavaScript, TypeScript |
DTP | 2018-01--2020-04 | 16 | 73 | 633 | 859 | Framework for data transfer | Java |
Electron | 2013-05--2020-12 | 392 | 15559 | 10664 | 26733 | Application development framework | C++, TypeScript |
Flutter | 2015-03--2020-12 | 683 | 34460 | 13367 | 72504 | UI development kit | Dart |
jQuery | 2010-09--2020-12 | 244 | 3118 | 2675 | 4723 | JavaScript library | JavaScript |
Keras | 2015-03--2019-11 | 716 | 12688 | 3471 | 13468 | Deep learning API | Python |
Kubernetes | 2014-06--2020-12 | 2408 | 23220 | 38619 | 97218 | Container management | Go |
Moby | 2013-01--2020-12 | 1154 | 29083 | 14072 | 41731 | Software containerization | Go |
Nextcloud | 2016-06--2020-09 | 355 | 9510 | 9718 | 22689 | Cloud server | PHP, JavaScript |
Next.js | 2016-10--2020-12 | 867 | 11087 | 3891 | 15344 | React framework | JavaScript, TypeScript |
Node.js | 2014-11--2020-02 | 1793 | 13190 | 12118 | 31372 | JavaScript runtime environment | JavaScript, C++, Python |
OpenSSL | 2013-05--2019-12 | 400 | 3303 | 8722 | 10639 | Crypto library | C, Perl |
ownCloud | 2012-08--2019-10 | 393 | 10141 | 18274 | 36178 | Cloud server | PHP, JavaScript |
React | 2013-05--2020-12 | 796 | 16056 | 6921 | 20252 | JavaScript library | JavaScript |
Redux | 2015-06--2020-12 | 228 | 4123 | 701 | 3931 | State container for JavaScript | TypeScript, JavaScript
reveal.js | 2011-06--2020-10 | 141 | 2861 | 1090 | 2762 | HTML presentation framework | JavaScript, HTML |
TensorFlow | 2015-11--2020-12 | 1519 | 35781 | 55499 | 45652 | Machine learning framework | C++, Python |
three.js | 2010-04--2020-12 | 954 | 8280 | 15999 | 20845 | JavaScript library | JavaScript, HTML |
TypeScript | 2014-07--2020-12 | 467 | 18397 | 17934 | 40973 | JavaScript language | TypeScript |
VS Code | 2015-11--2020-12 | 1001 | 67882 | 49814 | 111073 | Integrated development environment | TypeScript |
Vue | 2016-04--2020-11 | 217 | 8754 | 2256 | 9325 | JavaScript UI framework | JavaScript |
webpack | 2012-05--2020-12 | 501 | 13091 | 5671 | 11710 | Bundler for modules | JavaScript |
Results
Our results are arranged according to our three research questions:
RQ1: How long is the typical time difference between a developer's events that require, at least, write permission?
- Time differences between privileged or privileged+extended events
RQ2: Is the set of privileged developers D_priv a sound approximation for the set of core developers?
- Validation of our set of core developers
RQ3: Which metrics and network-construction methods are most accurate in classifying developers into core and peripheral?
Using privileged events:
- Accuracy of the classification methods (across all projects)
- Accuracy of the classification methods (individually for each project)
- Distribution of the ranks of the classification methods
- Sizes of the sets of privileged developers, classified core developers, and classified peripheral developers
- Percentage of privileged developers not part of the classification
Using privileged+extended events:
- Accuracy of the classification methods (across all projects)
- Accuracy of the classification methods (individually for each project)
- Distribution of the ranks of the classification methods
- Sizes of the sets of privileged+extended developers, classified core developers, and classified peripheral developers
- Percentage of privileged+extended developers not part of the classification
Tools Used for Data Extraction from GitHub
Codeface
Codeface is a framework and interactive web frontend for the social and technical analysis of software projects.
GitHubWrapper
GitHubWrapper is a tool to extract information from the GitHub issues API.
codeface-extraction
codeface-extraction is a small extension to Codeface to extract and preprocess commit, e-mail, and issue data.
coronet
coronet is a library to construct socio-technical developer networks based on various data sources in a configurable and reproducible way.
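For intuition, the count-based classification scheme commonly used in the literature ranks developers by activity and declares the smallest set of top contributors that jointly account for roughly 80% of all activity as core. The following is a simplified textbook sketch (the 80% threshold and the function name are our choices for illustration, not necessarily the exact variant implemented in coronet):

```python
def classify_count_based(commit_counts, threshold=0.8):
    """Split developers into core and peripheral by cumulative activity share.

    commit_counts: dict mapping developer -> number of commits (or another
    activity count, e.g., issue events).
    Core = smallest set of top contributors covering `threshold` of all activity.
    """
    total = sum(commit_counts.values())
    core, covered = set(), 0
    # Walk through developers in descending order of activity until the
    # cumulative share reaches the threshold.
    for dev, count in sorted(commit_counts.items(), key=lambda kv: -kv[1]):
        if covered >= threshold * total:
            break
        core.add(dev)
        covered += count
    peripheral = set(commit_counts) - core
    return core, peripheral

# Toy example with hypothetical developers and counts:
counts = {"alice": 60, "bob": 25, "carol": 10, "dave": 5}
core, peripheral = classify_count_based(counts)
print(sorted(core), sorted(peripheral))  # ['alice', 'bob'] ['carol', 'dave']
```

Network-based variants follow the same scheme but rank developers by a network metric (e.g., degree or eigenvector centrality) instead of raw counts.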
We developed a set of R scripts on top of Codeface and coronet for our analysis, which are available in the Downloads section. In order to run our scripts, you need to clone coronet (we have used our scripts with version 4.0) into a subdirectory of the scripts directory. Further information about the data format and input directories can be found in the README.md of coronet and also in the README.md delivered with our scripts in the Downloads section.
Steps to replicate the study:
- Obtain the input data for our study. There are two independent ways to achieve this:
- Either run the tools Codeface and GitHubWrapper on your own to collect commit and issue data from GitHub, and run BoDeGHa afterwards to identify bots. Thereafter, you need to run several parts of codeface-extraction (issue_processing, bot_processing, codeface_extraction, author_postprocessing, and anonymization). Note that each of these tools might run for several hours or even multiple days (depending on the subject project) to obtain the data for a project. For large projects, a high amount of RAM is necessary (we recommend > 250 GB).
- Or use our pseudonymized, sanity-checked, and manually corrected input data that we have used in our study, which we provide in the Downloads section below (see item 2).
- Run our analysis scripts (available below in the Downloads section, see item 3) in R (we used R version 4.1.1). As our scripts use the library coronet (version 4.0), coronet needs to be available in a subdirectory of the analysis scripts. A documentation of how our analysis scripts work, how they can be configured, etc., is contained in a README.md file that we deliver together with our scripts. For large projects, a high amount of RAM is necessary (we recommend > 250 GB). Our analysis scripts process the issue events, extract the set of privileged developers, construct developer networks, apply the state-of-the-art developer classification methods, and assess their accuracy. As output, our scripts create numerous plots and data files, including the resulting classification data as .csv files. The resulting classification data can also be downloaded below in our Downloads section (see item 4).
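To illustrate what the extraction of privileged developers in the pipeline above amounts to, here is a minimal sketch. The event types and the data layout are illustrative stand-ins; the paper defines the exact set of GitHub issue events that require at least write permission:

```python
# Illustrative subset of GitHub issue/PR event types that typically require
# elevated repository permissions; see the paper for the authoritative set.
PRIVILEGED_EVENT_TYPES = {"locked", "merged", "pinned", "transferred"}

def extract_privileged_developers(events):
    """Return the set of developers who triggered at least one privileged event.

    `events` is an iterable of (actor, event_type) pairs, e.g., as parsed
    from the GitHub issues API.
    """
    return {actor for actor, event_type in events
            if event_type in PRIVILEGED_EVENT_TYPES}

# Toy example with hypothetical event data:
events = [
    ("alice", "commented"),
    ("bob", "merged"),
    ("alice", "locked"),
    ("carol", "labeled"),
]
print(sorted(extract_privileged_developers(events)))  # ['alice', 'bob']
```

Developers outside this set remain candidates for the peripheral role; the analysis scripts then compare this set against the classifications produced by the unsupervised methods.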
Downloads
Note: For data privacy reasons, we cannot distribute the raw data that we gathered using our data-extraction tools. Instead, we only provide pseudonymized raw data as input data for our scripts. Please refer to the respective tools to produce a non-pseudonymized set of raw data yourself, as described above. You can find the analyzed time ranges and all further information about our subject projects in the corresponding table above.
- Our tool GitHubWrapper for getting issue data from GitHub (available on GitHub)
- Pseudonymized input data (containing commit data and issue data)
- Our analysis scripts to perform the analyses
- Resulting pseudonymized classification data (values of the different classification methods for each developer for each project, as well as the set of privileged and privileged+extended developers)
- Setup analysis via Dockerfile
Contact
If you have any questions regarding this paper, please do not hesitate to contact us:
- Thomas Bock (Saarland University, Saarland Informatics Campus, Saarbrücken, Germany)
- Nils Alznauer (Saarland University, Saarland Informatics Campus, Saarbrücken, Germany)
- Mitchell Joblin (Siemens AG, Munich, Germany & Saarland University, Saarland Informatics Campus, Saarbrücken, Germany)
- Sven Apel (Saarland University, Saarland Informatics Campus, Saarbrücken, Germany)