Abstract
Communication among software developers plays an essential role in open-source software (OSS) projects. Not unexpectedly, previous studies have shown that the conversational tone and, in particular, aggressiveness influence the participation of developers in OSS projects. Therefore, we aimed at studying aggressive communication behavior on the Linux Kernel Mailing List (LKML), which is known for aggressive e-mails of some of its contributors. To that aim, we attempted to assess the extent of aggressiveness of 720 e-mails from the LKML with a human annotation study, involving multiple annotators, to select a suitable sentiment analysis tool.
The results of our annotation study revealed that there is substantial disagreement, even among humans, which uncovers a deeper methodological challenge of studying aggressiveness in the software-engineering domain. Adjusting our focus, we dug deeper and investigated why the agreement among humans is generally low, based on manual investigations of ambiguously rated e-mails. Our results illustrate that human perception is individual and context dependent, especially when it comes to technical content. Thus, when identifying aggressiveness in software-engineering texts, it is not sufficient to rely on aggregated measures of human annotations. Hence, sentiment analysis tools specifically trained on human-annotated data do not necessarily match human perception of aggressiveness, and corresponding results need to be taken with a grain of salt. By reporting our results and experience, we aim at confirming and raising additional awareness of this methodological challenge when studying aggressiveness (and sentiment, in general) in the software-engineering domain.
Keywords: Software Developer Communication · Sentiment Analysis · Human Annotation
Data Extraction and Processing
We downloaded the e-mails of the Linux Kernel Mailing List (LKML) from the mailing-list archive Gmane using the tool https://github.com/xai/nntp2mbox/, providing the list name gmane.linux.kernel.
Afterwards, we processed the header and the content of these e-mails using our script "list-mbox.py" (see the directory "data_collection" in the downloadable archive that contains our preprocessing and evaluation scripts; see the section on Downloads below). Then, we extracted the names from the headers and generated a list of potential names, which we used to anonymize the e-mail content by replacing names with tokens that indicate the role of the respective person (sender, recipient, cc recipient). For data-privacy reasons, we cannot publish the name files. Finally, we removed all citations and formatted the e-mails using our script "dictify.py" (see also the directory "data_collection").
All the scripts mentioned here can be found in the downloadable zip archive below in the section Downloads.
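To illustrate the kind of preprocessing these scripts perform, here is a minimal sketch of replacing header names with role tokens and removing quoted lines. The function names, the mbox file name, and the simplified logic are ours for illustration and do not reproduce the actual implementation of "list-mbox.py" and "dictify.py":

```python
import mailbox
import re
from email.utils import getaddresses

# Role tokens used to replace personal names, depending on the header
# in which the name occurs (sender, recipient, cc recipient).
ROLE_TOKENS = {"from": "[SENDER]", "to": "[RECIPIENT]", "cc": "[CC-RECIPIENT]"}

def anonymize_and_strip(message):
    """Replace names from the header with role tokens and drop quoted lines."""
    payload = message.get_payload(decode=True) or b""
    body = payload.decode("utf-8", errors="replace")

    # Replace every real name found in From/To/Cc by its role token.
    for header, token in ROLE_TOKENS.items():
        for name, _addr in getaddresses(message.get_all(header, [])):
            if name:
                body = body.replace(name, token)

    # Remove citation lines (lines quoted from previous e-mails).
    lines = [line for line in body.splitlines()
             if not line.lstrip().startswith(">")]
    return "\n".join(lines)

# Example: process a local mbox file downloaded with nntp2mbox
# (the file name is hypothetical).
for msg in mailbox.mbox("gmane.linux.kernel.mbox"):
    cleaned = anonymize_and_strip(msg)
```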
Annotation Study
In the following, we provide images of the tutorial e-mails and of all 720 e-mails that were part of our annotation study. The formatting and layout of the e-mails conform to the formatting and layout in which we showed the e-mails to the annotators, and the e-mails appear in the same order in which they were shown to the annotators. Each of these e-mails was annotated by 6 to 9 annotators. (For data-privacy and copyright reasons, we cannot distribute the raw e-mail data as plain text, which is why we generated images of it.)
Tutorial e-mails
The following images display the 10 e-mails that were part of our tutorial, which was mandatory for all annotators before the actual annotation study began. The e-mails' contents have been provided together with an ID and an example rating:
All 720 e-mails of our annotation study
The following images display the 720 e-mails that were part of our annotation study in the actual order in which the e-mails were shown to the annotators:
The 92 e-mails where the annotators disagreed on the overall label
The following images display the 92 e-mails of our annotation study for which we received inconclusive annotation labels (i.e., more than one annotator disagreed with the annotation result of the rest of the annotators). These e-mails were part of our detailed manual investigation of why the inter-rater agreement on these e-mails is considerably low:
Background information about the annotators
experience: 1 (novice) to 10 (expert)
experience compared to colleagues: 1 (junior) to 5 (senior)
paradigms: 0 (very unfamiliar) to 4 (very familiar)
num opensource: 0 (contributions to no OSS project), 1 (contributions to 1-2 OSS projects), 2 (contributions to more than 2 OSS projects)
| user_id | age | gender | experience | experience compared to colleagues | familiar coding languages | paradigm objectoriented | paradigm functional | paradigm imperative | paradigm logic | num opensource |
|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 25-34 | female | 3 | 2 | R, Python, SQL | 3 | 0 | 0 | 0 | 1 |
| 29 | 18-24 | male | 4 | 2 | Java;R;C;Python | 3 | 2 | 1 | 1 | 0 |
| 31 | - | female | 4 | 3 | Java; Haskell; R | 3 | 3 | 3 | 1 | 1 |
| 28 | 25-34 | female | 5 | 3 | python | 2 | 2 | 2 | 2 | 0 |
| 30 | 18-24 | male | 6 | 2 | Java; Python; R | 3 | 1 | 0 | 2 | 0 |
| 15 | 25-34 | female | 6 | 3 | R; python; Java; Haskell | 3 | 3 | 2 | 2 | 1 |
| 16 | 25-34 | male | 8 | 3 | Java; R; Python; Scala; Haskell; Bash; | 4 | 4 | 4 | 2 | 2 |
| 32 | 18-24 | male | 8 | 3 | Java; R; Python | 4 | 2 | 3 | 1 | 0 |
| 27 | 25-34 | male | 8 | 4 | Java;Python;C;Bash;awk | 4 | 2 | 4 | 4 | 2 |
| 24 | 25-34 | male | 9 | 4 | R;Python;Java | 3 | 3 | 3 | 3 | 2 |
Screenshot of annotation tool and annotation guidelines
Here is an example screenshot from our annotation tool (it does not show any guidelines, but there is a "guidelines" button at the top of the e-mail to be annotated):
The following annotation guidelines have been shown to the annotators before the start of the annotation study, and were also available during annotation via a button on the top of the annotation e-mail:
When clicking on the link "annotated samples" at the end of the annotation guidelines (as can be seen in the screenshot above), further examples were shown to the annotators, which we also provide here:
Design Choices for Our Annotation Study
Overview of our design choices and the corresponding papers from the literature.
| Design choice | Papers found in our literature review |
| Oversampling of aggressive comments | El Asri et al. [2019], Sarker et al. [2020] |
| E-mail preprocessing: removing URLs, etc. | Biswas et al. [2020], Calefato et al. [2018], Imran et al. [2022], Sanei et al. [2021] |
| Replacing names and e-mail addresses by tokens | Klünder et al. [2020] |
| Removing auto-generated e-mails | Tourani et al. [2014] |
| Removing citations | Ferreira et al. [2019a], Garcia et al. [2013], Lin et al. [2018], Rousinopoulos et al. [2014], Sanei et al. [2021], Tourani et al. [2014] |
| Selection of annotation labels | see the different options in Table 2 in the paper |
| Annotation guidelines | Biswas et al. [2020], Calefato et al. [2018], Cassee et al. [2022], El Asri et al. [2019], Fucci et al. [2021], Hata et al. [2022], Herrmann et al. [2022], Imtiaz et al. [2018], Novielli et al. [2018a], Sarker et al. [2020] |
| Tutorials with supervised annotation | Blaz and Becker [2016], Calefato et al. [2018] |
| Number of annotators (6–9) | e.g., Mansoor et al. [2021], Murgia et al. [2018], Ortu et al. [2016b], Park and Sharif [2021] |
| Disagreement resolution | see the different options in Table 2 in the paper |
Results
Our results are arranged into the following sections:
General Overview of the Annotation Results
The plot shows the overall annotation results (i.e., the assigned label) for the 720 annotated e-mails, each of which was annotated by 6 to 9 annotators. We excluded 28 e-mails that had been labeled as spam, corrupted, or auto-generated.
Of the remaining 692 e-mails, 548 were labeled as non-aggressive by at least all but one of the annotators, and 52 were labeled as aggressive by at least all but one of the annotators. The remaining 92 e-mails received inconclusive annotations (i.e., at least two annotators selected a different label than the other annotators; in many cases, even half of the annotators disagreed with the label chosen by the other half).
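As an illustration, the following sketch shows how such an overall label per e-mail could be derived with the "all but one" rule. It assumes a long-format table with one row per e-mail and annotator; the file and column names are ours, not the layout of our actual data:

```python
# Sketch of the "all but one" rule described above.
import pandas as pd

df = pd.read_csv("annotations.csv")  # hypothetical columns: email_id, annotator_id, label

def overall_label(labels):
    n = len(labels)
    n_aggr = (labels == "aggressive").sum()
    if n - n_aggr <= 1:          # at least all but one voted "aggressive"
        return "aggressive"
    if n_aggr <= 1:              # at least all but one voted "non-aggressive"
        return "non-aggressive"
    return "inconclusive"        # two or more annotators deviate

summary = df.groupby("email_id")["label"].apply(overall_label)
print(summary.value_counts())
```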
Inter-Rater Agreement
K's α: Krippendorff's alpha (0 = no agreement beyond chance, 1 = perfect agreement; it is customary to require at least 0.8)
ICC: Intraclass Correlation Coefficient (0 = no agreement, 1 = perfect agreement; values above 0.75 indicate good reliability)
| Group | Annotators per group* | Binary, all annotations: ICC | Binary, all annotations: K's α | Binary, excl. unsure (>= 4 sure): ICC | Binary, excl. unsure (>= 4 sure): K's α | Binary, excl. unsure: ICC | Binary, excl. unsure: K's α | Multi, all annotations: ICC | Multi, all annotations: K's α | Multi, excl. unsure (>= 4 sure): ICC | Multi, excl. unsure (>= 4 sure): K's α | Multi, excl. unsure: ICC | Multi, excl. unsure: K's α |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| all | 10 | 0.49 | 0.49 | 0.53 | 0.52 | 0.53 | 0.52 | 0.50 | 0.43 | 0.52 | 0.44 | 0.52 | 0.44 |
| male | 6 | 0.51 | 0.51 | 0.53 | 0.53 | 0.53 | 0.53 | 0.51 | 0.43 | 0.52 | 0.44 | 0.53 | 0.44 |
| female | 4 | 0.33 | 0.37 | 0.33 | 0.37 | 0.41 | 0.43 | 0.46 | 0.45 | 0.46 | 0.45 | 0.52 | 0.49 |
| contribution to no OSS project | 4 | 0.57 | 0.57 | 0.57 | 0.57 | 0.60 | 0.60 | 0.61 | 0.57 | 0.61 | 0.57 | 0.63 | 0.58 |
| contribution to >= 1 OSS projects | 6 | 0.44 | 0.44 | 0.46 | 0.45 | 0.48 | 0.47 | 0.42 | 0.33 | 0.43 | 0.34 | 0.45 | 0.35 |
| contribution to <= 2 OSS projects | 7 | 0.53 | 0.53 | 0.55 | 0.55 | 0.56 | 0.56 | 0.57 | 0.52 | 0.58 | 0.53 | 0.60 | 0.55 |
| contribution to > 2 OSS projects | 3 | 0.45 | 0.46 | 0.45 | 0.46 | 0.49 | 0.48 | 0.43 | 0.35 | 0.43 | 0.35 | 0.45 | 0.34 |
| experience: 1-5 | 4 | 0.55 | 0.46 | 0.55 | 0.46 | 0.62 | 0.51 | 0.58 | 0.48 | 0.58 | 0.38 | 0.64 | 0.52 |
| experience: 6-10 | 6 | 0.50 | 0.51 | 0.52 | 0.52 | 0.53 | 0.52 | 0.48 | 0.40 | 0.49 | 0.41 | 0.50 | 0.51 |
| experience compared to colleagues: 1-2 | 3 | 0.54 | 0.52 | 0.54 | 0.52 | 0.59 | 0.56 | 0.60 | 0.55 | 0.60 | 0.55 | 0.63 | 0.58 |
| experience compared to colleagues: 3-5 | 7 | 0.45 | 0.46 | 0.47 | 0.47 | 0.49 | 0.49 | 0.45 | 0.38 | 0.45 | 0.38 | 0.47 | 0.39 |
Each circle represents one single e-mail. An e-mail is put into the column matching the most prominent target category. The height of a circle displays the Krippendorff's Alpha inter-rater agreement coefficient of the corresponding e-mail.
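For reference, the agreement values in the table above could be computed roughly as follows, using the Python packages krippendorff and pingouin. The long-format input with numeric labels is an assumption for this sketch; per-group values would be obtained by filtering the annotators accordingly:

```python
# One possible way to compute Krippendorff's alpha and the ICC.
import pandas as pd
import krippendorff
import pingouin as pg

df = pd.read_csv("annotations.csv")  # hypothetical columns: email_id, annotator_id, label_num

# Krippendorff's alpha expects a raters x units matrix (np.nan = missing rating).
matrix = df.pivot(index="annotator_id", columns="email_id", values="label_num")
alpha = krippendorff.alpha(reliability_data=matrix.to_numpy(dtype=float),
                           level_of_measurement="nominal")

# ICC on the long-format data; "omit" handles e-mails that were not rated
# by every annotator. pingouin reports several ICC variants.
icc = pg.intraclass_corr(data=df, targets="email_id", raters="annotator_id",
                         ratings="label_num", nan_policy="omit")

print(f"Krippendorff's alpha: {alpha:.2f}")
print(icc[["Type", "ICC"]])
```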
Annotation Insights
The following plots give insights into the annotation data.
Each bar on the x-axis represents one of the 76 ambiguous e-mails. The bars show the distribution of the overall label; the total height of each bar matches the number of annotators who annotated the respective e-mail.
The plot shows how many of the 76 e-mails categorized as "ambiguous" fall into each of five sub-categories, depending on their content and based on our manual investigation.
Each bar on the x-axis represents one of the 92 e-mails with disagreement among the annotators (i.e., more than one annotator deviates from the annotation result of the majority). The bars show the number of aggression targets for each category. The targets have been summarized into only four categories to increase readability. The total number of targets can exceed the number of raters, as each rater can specify multiple targets.
Each bar on the x-axis represents one e-mail that is not categorized as unsure and is rated as aggressive (by all or all but one of the raters). The bars show the number of aggression targets for each category.
Each bar on the x-axis represents one e-mail that is not categorized as unsure and is rated as not aggressive (by all or all but one of the raters). The bars show the number of aggression targets for each category. E-mails having only "none" or "other" targets are omitted for the sake of readability.
Each marker represents a single e-mail (there are 360 e-mails in the first batch), the x-axis represents the order of the e-mails from left to right. The height of the points shows the relative amount of aggressive votes for this e-mail.
Each marker represents a single e-mail (there are 360 e-mails in the second batch), the x-axis represents the order of the e-mails from left to right. The height of the points shows the relative amount of aggressive votes for this e-mail.
Each marker represents a single e-mail (there are 360 e-mails in the first batch), the x-axis represents the order of the e-mails from left to right. The height of the points shows the relative amount of "none" target votes for this e-mail.
Each marker represents a single e-mail (there are 360 e-mails in the second batch), the x-axis represents the order of the e-mails from left to right. The height of the points shows the relative amount of "none" target votes for this e-mail.
Each marker represents a single e-mail (there are 360 e-mails in the first batch), the x-axis represents the order of the e-mails from left to right. The height of the points shows the relative amount of target votes that are not "none" for this e-mail.
Each marker represents a single e-mail (there are 360 e-mails in the second batch), the x-axis represents the order of the e-mails from left to right. The height of the points shows the relative amount of target votes that are not "none" for this e-mail.
Explorative Experiment Using Different Potential Ground Truths and Tools
To demonstrate how the selection of a potential ground truth for aggressiveness biases the comparison of human annotation data and tool results, we conducted a small explorative experiment (see also Section 7.3 in the paper). In particular, we applied the four sentiment analysis tools Perspective API [Google and Jigsaw, 2017], Stanford CoreNLP [Manning et al., 2014], VADER [Hutto et al., 2014], and SentiStrength-SE [Islam and Zibran, 2018d] to our 720 sampled e-mails. While the first three are general-purpose sentiment analysis tools, the last one is a software-engineering-specific sentiment analysis tool. There is no specific reason for selecting exactly these four tools, other than that all of them are widely used and established. With our explorative experiment, we aim to show that, depending on the ground truth that is selected (i.e., the aggregation method for the annotation data), the tools' accuracies differ, and that many other factors, such as text preprocessing, could potentially matter as well.
For our explorative experiment, we used the following four e-mail preprocessings: (1) removing only citations, (2) additionally removing URLs, e-mail addresses, and signatures, (3) additionally removing developer names, and (4) additionally removing code snippets. After running the sentiment analysis tools on the e-mail texts (using the different preprocessings), we mapped the tools' results to binary labels, computed the tools' agreement, and evaluated their performance against several potential ground truths that we derived from the human annotations via different methods for aggregating labels across the annotators. In what follows, we provide information on the tools' agreement as well as on the tools' accuracy with respect to the human-annotated data (which needs to be taken with a grain of salt, due to the generally low agreement among the humans). More information on how we used the tools can be found in the downloadable zip archive in the section on Downloads below (see the "tools" directory).
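The following sketch illustrates the four cumulative preprocessing variants. The regular expressions and heuristics are simplified stand-ins for the actual preprocessing code in the archive:

```python
# Sketch of the four cumulative preprocessing variants (1)-(4).
import re

def strip_citations(text):
    # (1) Drop quoted lines from previous e-mails.
    return "\n".join(l for l in text.splitlines() if not l.lstrip().startswith(">"))

def strip_urls_addresses_signatures(text):
    # (2) Remove URLs, e-mail addresses, and everything after the signature delimiter.
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"\S+@\S+\.\S+", "", text)
    return text.split("\n-- \n")[0]

def strip_names(text, names):
    # (3) Replace known developer names by a placeholder token.
    for name in names:
        text = text.replace(name, "[NAME]")
    return text

def strip_code(text):
    # (4) Very rough heuristic: drop diff/patch-style and code-like lines.
    return "\n".join(l for l in text.splitlines()
                     if not re.match(r"^\s*[+\-@]|^\s*(if|for|while)\s*\(", l))

def preprocess(text, level, names=()):
    text = strip_citations(text)                          # (1)
    if level >= 2:
        text = strip_urls_addresses_signatures(text)      # (2)
    if level >= 3:
        text = strip_names(text, names)                   # (3)
    if level >= 4:
        text = strip_code(text)                           # (4)
    return text
```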
Inter-Tool Agreement
First of all, we ran the four selected sentiment analysis tools on the 720 e-mails with four different preprocessings and computed the agreement of the tools (for which we mapped the resulting scores to binary labels). The resulting inter-tool agreement can be seen in the following table, separately for each kind of preprocessing:
K's α: Krippendorff's alpha (0 = no agreement beyond chance, 1 = perfect agreement; it is customary to require at least 0.8)
ICC: Intraclass Correlation Coefficient (0 = no agreement, 1 = perfect agreement; values above 0.75 indicate good reliability)
| Preprocessing | ICC | K's α |
|---|---|---|
| Remove citations | 0.03 | 0.03 |
| + URLs, e-mail addr., signatures | 0.00 | 0.00 |
| + developer names | -0.01 | -0.01 |
| + code snippets | -0.07 | -0.07 |
The results in the table above show that, independent of the preprocessing method, there is almost no agreement between the classifications of the tools, as Krippendorff's alpha is between -0.07 and 0.03, which is even lower than the inter-rater agreement of our human annotators. This also holds when using the ICC as a measure of agreement.
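For illustration, the mapping of the tools' heterogeneous outputs to binary labels and the computation of the inter-tool agreement could look roughly as follows. The thresholds and the pre-computed score columns are assumptions of this sketch, not necessarily the exact mapping we used:

```python
# Map tool outputs to binary labels (1 = negative/aggressive) and compute
# the inter-tool agreement.
import pandas as pd
import krippendorff
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

emails = pd.read_csv("emails_preprocessed.csv")   # hypothetical: email_id, text, ...

# VADER: use the conventional threshold on the compound score (assumed here).
analyzer = SentimentIntensityAnalyzer()
emails["vader"] = [int(analyzer.polarity_scores(t)["compound"] <= -0.05)
                   for t in emails["text"]]

# For the other tools, assume their raw outputs were collected beforehand
# into hypothetical columns; the cut-offs below are illustrative.
emails["perspective"] = (emails["toxicity_score"] > 0.5).astype(int)
emails["corenlp"] = emails["corenlp_class"].isin(["Negative", "Very negative"]).astype(int)
emails["sentistrength_se"] = (emails["negative_score"] <= -2).astype(int)

tools = ["vader", "perspective", "corenlp", "sentistrength_se"]
matrix = emails[tools].T.to_numpy(dtype=float)    # raters (tools) x units (e-mails)
alpha = krippendorff.alpha(reliability_data=matrix, level_of_measurement="nominal")
print(f"Inter-tool Krippendorff's alpha: {alpha:.2f}")
```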
Tool Accuracy Based on Different Potential Ground Truths Derived From Human Annotation
Then, we compared the tools' results against human annotations. As the human inter-rater agreement was generally low, we constructed three different potential ground truths for comparison by aggregating the annotation data in different ways:
- GT1: An e-mail is aggressive if the mode of its annotations is aggressive (considering a tie as non-aggressive, and excluding unsure annotations if there are ≥ 4 sure annotations for an e-mail).
- GT2: An e-mail is aggressive if, at least, one person has labeled it as aggressive (excluding all unsure annotations).
- GT3: Only e-mails where all but one person agreed on the same binary label (excluding all unsure annotations) are considered.
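The following sketch shows one way of deriving these three potential ground truths from the raw annotation data, again assuming a long-format table with illustrative column names:

```python
# Sketch of deriving GT1, GT2, and GT3 from the annotation data.
import pandas as pd

df = pd.read_csv("annotations.csv")  # hypothetical: email_id, annotator_id, label, sure (bool)

def gt1(group):
    # Mode of the annotations; tie counts as non-aggressive; exclude unsure
    # annotations if there are at least 4 sure annotations.
    g = group[group["sure"]] if group["sure"].sum() >= 4 else group
    n_aggr = (g["label"] == "aggressive").sum()
    return "aggressive" if n_aggr > len(g) - n_aggr else "non-aggressive"

def gt2(group):
    # Aggressive if at least one (sure) annotation says so.
    g = group[group["sure"]]
    return "aggressive" if (g["label"] == "aggressive").any() else "non-aggressive"

def gt3(group):
    # Only e-mails where all but one (sure) annotators agree; otherwise excluded.
    g = group[group["sure"]]
    counts = g["label"].value_counts()
    if len(counts) > 0 and counts.iloc[0] >= len(g) - 1:
        return counts.index[0]
    return None

ground_truths = df.groupby("email_id").apply(
    lambda g: pd.Series({"GT1": gt1(g), "GT2": gt2(g), "GT3": gt3(g)}))
```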
Accuracy (F1 score) of Sentiment Analysis Tools when evaluated against GT1:
| Preprocessing | Perspective API | Stanford CoreNLP | VADER | SentiStrength-SE |
| Remove citations | 0.55 | 0.20 | 0.30 | pos* 0.34 neg* 0.58 |
| + URLs, e-mail addr., signatures | 0.57 | 0.18 | 0.30 | pos* 0.33 neg* 0.56 |
| + developer names | 0.57 | 0.20 | 0.30 | pos* 0.33 neg* 0.56 |
| + code snippets | 0.56 | 0.20 | 0.30 | pos* 0.34 neg* 0.56 |
Accuracy (F1 score) of Sentiment Analysis Tools when evaluated against GT2:
| Preprocessing | Perspective API | Stanford CoreNLP | VADER | SentiStrength-SE |
| Remove citations | 0.35 | 0.46 | 0.47 | pos* 0.34 neg* 0.58 |
| + URLs, e-mail addr., signatures | 0.39 | 0.43 | 0.48 | pos* 0.33 neg* 0.56 |
| + developer names | 0.39 | 0.43 | 0.48 | pos* 0.33 neg* 0.56 |
| + code snippets | 0.39 | 0.4e | 0.48 | pos* 0.34 neg* 0.56 |
Accuracy (F1 score) of Sentiment Analysis Tools when evaluated against GT3:
| Preprocessing | Perspective API | Stanford CoreNLP | VADER | SentiStrength-SE |
| Remove citations | 0.68 | 0.15 | 0.27 | pos* 0.24 neg* 0.28 |
| + URLs, e-mail addr., signatures | 0.67 | 0.14 | 0.27 | pos* 0.25 neg* 0.29 |
| + developer names | 0.70 | 0.15 | 0.27 | pos* 0.25 neg* 0.28 |
| + code snippets | 0.70 | 0.15 | 0.27 | pos* 0.25 neg* 0.29 |
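The F1 scores in the tables above could be computed along the following lines, here exemplified for GT3 and using the hypothetical data frames from the previous sketches:

```python
# Evaluate each tool's binary labels against one potential ground truth (GT3).
from sklearn.metrics import f1_score

merged = emails.join(ground_truths, on="email_id")
gt3_subset = merged.dropna(subset=["GT3"])        # GT3 excludes inconclusive e-mails

for tool in ["perspective", "corenlp", "vader", "sentistrength_se"]:
    score = f1_score(gt3_subset["GT3"] == "aggressive",
                     gt3_subset[tool] == 1)
    print(f"{tool}: F1 = {score:.2f}")
```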
Summary of Our Explorative Experiment
Independent of the preprocessing method, there is almost no agreement between the classifications of the tools, as Krippendorff's alpha is between -0.07 and 0.03, which is lower than the inter-rater agreement of our human annotators. This, again, also holds when using the ICC as a measure of agreement. When comparing the tools' classifications against the human annotations, we made the following observations: When considering only the e-mails for which at least all but one of the annotators agreed on the same binary label (GT3), Stanford CoreNLP performs worst (F1 ∼ 0.15), whereas Perspective API performs best (F1 ∼ 0.70). Interestingly, the tool SentiStrength-SE (F1 ∼ 0.27), which was specifically designed for the software-engineering domain [Islam and Zibran, 2018d], does not perform better than tools designed outside the domain (e.g., VADER, F1 ∼ 0.27), which is a result that we did not expect. Nevertheless, as already stated, such results need to be taken with a grain of salt, given that the different potential ground truths led to very different tool accuracies. The different e-mail preprocessings, however, had almost no effect on the tools' accuracies. This, again, shows that further research is needed to develop appropriate aggregation methods and ways to handle disagreement among humans.
Literature Review
A complete overview of our literature review (i.e., a table of the papers that fulfilled our inclusion criteria, as well as a table of the papers that we excluded and the reasons why we excluded them) can be found here on a separate page:
In the following, we provide additional information about the papers that we have found in our literature review and that fulfilled our inclusion criteria (see Section 3 in the paper). First of all, we provide a list of abbreviations of the venues of these papers (which we also use in Table 6 in the paper). Afterwards, we provide more details on the approaches of these papers, may it either be tool development and tool evaluation, or tool application and tool usage.
- List of Abbreviations of Conferences and Journals
- Literature on SE-specific Tool Development (see also Section 3.1.1 and Appendix A.1 in the paper)
- Literature on SE-specific Tool Evaluation (see also Section 3.1.1 and Appendix A.1 in the paper)
- Literature on SE-specific Tool Application & Tool Usage (see also Section 3.1.2 and Appendix A.2 in the paper)
List of Abbreviations of Conferences and Journals
The following list contains the abbreviations of conferences and journals of all the papers that fulfilled our inclusion criteria in our literature review:
| ACIIW: | Int. Conf. Affective Computing and Intelligent Interaction Workshops and Demos | |
| ACIS: | Proc. Australasian Conf. Information Systems | |
| AffectRE: | Proc. Int. Workshop on Affective Computing for Requirements Engineering | |
| APSEC: | Proc. Asia-Pacific Software Engineering Conf. | |
| APSECW: | Proc. Asia-Pacific Software Engineering Conf. Workshops | |
| ASE: | Proc. Int. Conf. Automated Software Engineering | |
| ASEW: | Int. Conf. Automated Software Engineering Workshops | |
| Big Data: | Int. Conf. Big Data | |
| CASCON: | Proc. Int. Conf. Computer Science and Software Engineering | |
| CGC: | Proc. Int. Conf. Cloud and Green Computing | |
| CHASE: | Proc. Int. Workshop on Cooperative and Human Aspects of Software Engineering | |
| CIC: | Proc. Int. Conf. Collaboration and Internet Computing | |
| CONISOFT: | Int. Conf. Software Engineering Research and Innovation | |
| DASC: | Proc. Int. Conf. Dependable, Autonomic and Secure Computing / Int. Conf. Pervasive Intelligence and Computing / Int. Conf. Cloud and Big Data Computing / Int. Conf. Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) | |
| DEXA: | Int. Conf. Database and Expert Systems Applications | |
| DTGS: | Int. Conf. Digital Transformation and Global Society | |
| EASE: | Evaluation and Assessment in Software Engineering | |
| EDM: | Proc. Int Conf. Educational Data Mining | |
| EISEJ: | e-Informatica Software Engineering Journal | |
| EMSE: | Empirical Software Engineering | |
| ENASE: | Proc. Int. Conf. Evaluation of Novel Approaches to Software Engineering | |
| ESEC/FSE: | Proc. Europ. Software Engineering Conf. and the Int. Sympos. Foundations of Software Engineering | |
| ESEM: | Proc. Int. Sympos. Empirical Software Engineering and Measurement | |
| ESSoS: | Int. Sympos. Engineering Secure Software and Systems | |
| HCI: | Proceedings of the ACM on Human-Computer Interaction, CSCW2 | |
| HCSE: | Proc. Int. Working Conf. Human-Centered Software Engineering | |
| HICSS: | Proc. Hawaii Int. Conf. System Sciences | |
| HotStorage: | Proc. Workshop on Hot Topics in Storage and File Systems | |
| ICACI: | Int. Conf. Advanced Computational Intelligence | |
| ICAT: | Int. Conf. Applied Technologies | |
| ICCSAW: | Int. Conf. Computational Science and Its Applications Workshops | |
| ICECA: | Proc. Int. Conf. Electronics, Communication and Aerospace Technology | |
| ICEIS: | Int. Conf. Enterprise Information Systems | |
| ICITA: | Proc. Int. Conf. Information Technology and Applications | |
| ICNGIoT: | Proc. Int. Conf. Next Generation of Internet of Things | |
| ICPC: | Proc. Int. Conf. Program Comprehension | |
| ICSE: | Proc. Int. Conf. Software Engineering | |
| ICSE-NIER: | Proc. Int. Conf. Software Engineering: New Ideas and Emerging Results | |
| ICSE-SEIS: | Proc. Int. Conf. Software Engineering: Software Engineering in Society | |
| ICSESS: | Int. Conf. Software Engineering and Service Science | |
| ICSEW: | Proc. Int. Conf. Software Engineering Workshops | |
| ICSME: | Proc. Int. Conf. Software Maintenance and Evolution | |
| ICSS: | Int. Conf. Service Science | |
| ICT Express: | Information & Communications Technology Express | |
| IEEE Access: | IEEE Access | |
| IEEE Software: | IEEE Software | |
| IJSEA: | International Journal of Software Engineering & Applications | |
| Inf. Syst.: | Information Systems | |
| Internetware: | Proc. Asia-Pacific Sympos. Internetware | |
| IST: | Information and Software Technology | |
| I3E: | Conf. e-Business, e-Services, e-Society | |
| JSS: | Journal of Systems and Software | |
| J. Softw. Evol. Process: | Journal of Software: Evolution and Process | |
| KBS: | Knowledge-Based Systems | |
| Mathematics: | Mathematics | |
| MSR: | Proc. Int. Workshop on Mining Software Repositories | |
| OpenSym: | Proc. Int. Sympos. Open Collaboration | |
| OSS: | Int. Conf. Open Source Systems | |
| PeerJ Comp. Sci.: | PeerJ Computer Science | |
| PLOS ONE: | PLOS ONE | |
| PROFES: | Int. Conf. Product-Focused Software Process Improvement | |
| PROMISE: | Proc. Int. Conf. Predictive Models and Data Analytics in Software Engineering | |
| RE Journal: | Requirements Engineering | |
| REW: | Int. Requirements Engineering Conf. Workshops | |
| RESI: | Revista Eletrônica de Sistemas de Informação | |
| SAC: | Proc. Sympos. Applied Computing | |
| SANER: | Int. Conf. Software Analysis, Evolution, and Reengineering | |
| SCAM: | Int. Workshop on Source Code Analysis and Manipulation | |
| SEDE: | Proc. Int. Conf. Software Engineering and Data Engineering | |
| SEKE: | Proc. Int. Conf. Software Engineering & Knowledge Engineering | |
| SEmotion: | Proc. Int. Workshop on Emotion Awareness in Software Engineering | |
| SERA: | Proc. Int. Conf. Software Engineering Research, Management and Applications | |
| SSE: | Proc. Int. Workshop on Social Software Engineering | |
| TOSEM: | ACM Transactions on Software Engineering and Methodology | |
| Trans. Info. Syst.: | IEICE Transactions on Information and Systems | |
| TRel: | IEEE Transactions on Reliability | |
| TSE: | IEEE Transactions on Software Engineering | |
| VISSOFT: | Proc. Working Conf. Software Visualization | |
| XP: | Proc. Int. Conf. Agile Processes in Software Engineering and Extreme Programming |
Literature on SE-specific Tool Development
Jongeling et al. [2015, 2017] investigated whether existing sentiment analysis tools from outside the software-engineering domain agree with each other when used on technical texts (e.g., StackOverflow posts or issue trackers). In particular, they compared the tools SentiStrength, Alchemy, NLTK, and StanfordNLP, which produced different sentiment classifications. In addition, they compared the tools' classifications against a human-annotated emotion dataset from Murgia et al. [2014], observing disagreement between tools and humans for up to 60% of the analyzed texts on which the humans had agreed among themselves.
Novielli et al. [2015] manually annotated a StackOverflow dataset regarding emotions and opinions. They found that sentiment polarity is a complex phenomenon, which varies depending on recipients and technical matters. In later studies, in which they used four different technical datasets, they came to the conclusion that “reliable sentiment analysis in software engineering is possible” when existing tools are specifically tuned to the software-engineering domain [2018a, 2018b].
Blaz and Becker [2016] manually annotated technical tickets from a ticketing system and developed three customized sentiment-analysis methods based on self-created dictionaries containing IT vocabulary and specific templates. Their methods have a low accuracy on negative sentiment. Mäntylä et al. [2017] created a lexicon to detect emotional arousal in software-engineering texts based on manually scored issue data. Ahmed et al. [2017] manually labeled 2000 code review comments and developed the tool SentiCR specifically tuned to code reviews. Islam and Zibran [2017b] manually labeled 5600 JIRA issue comments and proposed the tool SentiStrength-SE, which is an adaptation of SentiStrength specifically trained on issue comments [Islam and Zibran, 2018d]. In additional studies, they compared their tool against existing tools and showed that the tuning for software-engineering texts significantly improves classification accuracy [Islam and Zibran, 2017a, 2018a]. Moreover, they developed the tool DEVA, which not only detects sentiment, but also emotional states (e.g., excitement or stress) in software-engineering texts [Islam and Zibran, 2018b]. Later on, they developed the tool MarValous to detect emotional states using machine learning, which significantly outperformed DEVA [Islam et al., 2019]. Calefato et al. [2017] first developed the tool EmoTxt to extract emotions from texts and later proposed the tool Senti4SD, which they specifically trained and validated on StackOverflow posts [Calefato et al., 2018]. Subsequently, they proposed the toolkit EMTk [2019], a refactoring of EmoTxt and Senti4SD that is faster and easier to install than these two tools. Ding et al. [2018] manually labeled 3000 issue comments and developed the domain-specific sentiment analysis tool SentiSW. Efstathiou et al. [2018] released word2vec word embeddings trained on StackOverflow data, and Gachechiladze et al. [2017] manually annotated 723 sentences from Apache issue reports to develop a tool to identify anger direction (i.e., whether anger is directed against the commenters themselves, against other people, or against objects). Werder and Brinkkemper [2018] proposed the tool MEME, in which they focused on dealing with error messages and negations, as existing tools use these aspects as indicators for negative sentiment even if the actual message is not perceived negatively. Murgia et al. [2018] manually analyzed 792 issues of 117 projects of the Apache Software Foundation with regard to emotions and proposed the machine-learning classifier ESEM-E for identifying the emotions gratitude, joy, and sadness. Lin et al. [2019] manually analyzed 4346 sentences from Stack Overflow regarding different aspects and designed the tool POME to extract sentiment polarity using pattern-matching techniques. Uddin and Khomh [2021] manually annotated 4522 sentences from 1338 StackOverflow posts regarding polarity and developed two algorithms for sentiment analysis, namely OpinerDSO (a lexicon-based approach enriched with software-engineering-specific words) and a combination of SentiStrength and OpinerDSO. They compared their approaches with state-of-the-art sentiment analysis tools and showed that OpinerDSO outperforms the other tools and approaches on positive and negative sentences, but not on neutral ones. In a later study, Uddin et al.
[2022b] compared five state-of-the-art software-engineering-specific sentiment analysis tools on six different datasets and observed a substantial disagreement between these tools. Therefore, they developed the tool Sentisead, which combines different polarity and bag-of-words approaches. Sentisead outperforms other state-of-the-art tools and achieves a precision similar to that of pre-trained transformer models such as RoBERTa.
Sun et al. [2021] developed the tool SESSION as an improvement over other existing tools by identifying polysemous words in software-engineering texts. They demonstrated that their tool outperforms two other approaches. Sarker et al. [2020, 2023b] manually labeled 6533 code review comments and 4140 Gitter messages; they concluded that there is room for improvement of existing tools and developed the tool ToxiCR to identify toxic comments in code reviews. In addition, Sarker et al. [2023a] developed the tool ToxiSpanSE, which is able to detect which parts of a toxic review comment are actually toxic. They suggest that their tool could be used in real time to detect toxic phrases while they are being typed.
As Lin et al. [2018] had negative experiences with state-of-the-art sentiment analysis tools, they retrained the deep-learning sentiment analysis tool Stanford CoreNLP with a new model on StackOverflow data and called their retrained tool Stanford CoreNLP SO. Unfortunately, they did not achieve a significant improvement over the state-of-the-art tools. Biswas et al. [2019] built their own neural-network-based sentiment classifier RNN4SentiSE based on word embeddings. Later on, Biswas et al. [2020] customized the BERT language model using sentences from StackOverflow and proposed the model BERT4SentiSE. Using a larger dataset than in previous studies, namely 4000 manually labeled StackOverflow sentences, they showed that BERT4SentiSE "achieves reliable performance" on software-engineering texts. Maipradit et al. [2019] proposed a machine-learning-based classification approach based on n-grams with SMOTE*, which resulted in a comparably good performance. Klünder et al. [2020] used different machine learning classifiers (random forest, support vector machine, naive Bayes) and trained them on different statistical and content-related metrics from different communication channels of developers in an industrial case study. They evaluated their approach on 3778 manually labeled sentences and claimed that their approach achieves an accuracy similar to that of a human classifier. Zhang et al. [2020] fine-tuned existing transformer models (e.g., BERT, RoBERTa, XLNet) for software-engineering texts and compared them with four standard sentiment analysis tools for software-engineering texts on six different datasets. They showed that the fine-tuned transformer models outperform all the other tools on all datasets. However, they also noted that training transformer models is more expensive than performing the predictions. Wu et al. [2021] also fine-tuned BERT for software-engineering texts and developed the tool BERT-FT, which also outperforms state-of-the-art software-engineering-specific sentiment analysis tools on GitHub, StackOverflow, JIRA, and Gerrit datasets. Similarly, Batra et al. [2021] compared three different BERT-based models on GitHub, JIRA, and StackOverflow datasets and showed that the BERT-based models perform better than prevailing sentiment analysis tools for software-engineering texts. Zhang et al. [2021] developed the sentiment analysis tool SentiLog based on different machine learning models to detect sentiment in log statements. Bleyl and Buxton [2022] trained BERT with vocabulary from StackOverflow and emojis, resulting in a better performance than the state-of-the-art tool EmoTxt. Prenner and Robbes [2022] compared standard BERT models, which have been trained on English text, with their own software-engineering-specific model StackOBERTflow on multiple software-engineering datasets, showing that all the transformer models outperform the state-of-the-art sentiment analysis tools. Sun et al. [2022] developed the tool EASTER, a combination of the deep-learning frameworks TextCNN and RoBERTa, which outperforms state-of-the-art approaches. Von der Mosel et al. [2023] compared transformer models that have been trained on software-engineering texts with generally trained transformer models with regard to vocabulary, understanding missing words, and classification tasks. They found that for understanding tasks on software-engineering texts, generally trained models are sufficient, but for more complex tasks, models trained on software-engineering texts are beneficial.
For that reason, they proposed the tool seBERT, a BERT-based model trained on StackOverflow, GitHub, and JIRA data.
Raman et al. [2020] manually labeled 386 “too heated” locked issues and 300 randomly chosen issues. Using these data, they trained a toxicity classifier called STRUDEL toxicity detector based on comment length and word frequencies, also using a combination of various sentiment analysis tools. Qiu et al. [2022] used their toxicity detection tool together with a pushback detection tool. They found that the combination of the two tools performs better on identifying inter-personal conflicts than the individual tools. In a similar way, Cheriyan et al. [2021] explored offensive language on four social coding and communication platforms based on manual annotation. They proposed an offensive-language detection approach as a combination of various existing tools. Also Sayago-Heredia et al. [2022a] built their own toxicity detection tool based on various existing approaches.
Chen et al. [2019] developed the tool SEntiMoji, which detects the sentiment in software-engineering related texts based on the used emojis. In contrast, Venigalla and Chimalakonda [2021a] developed StackEmo to detect the sentiment in StackOverflow posts and augment them with appropriate emojis.
To investigate how developers' emotions change over time, Neupane et al. [2019] developed the tool EmoD, which uses a combination of already existing sentiment analysis tools. Similarly, Cagnoni et al. [2020] used multiple machine-learning algorithms to detect joy, love, surprise, fear, anger, and sadness in StackOverflow posts regarding different programming languages.
Literature on SE-specific Tool Evaluation
Ferreira et al. [2021] compared several of the aforementioned tools with human-annotated data and found that these tools have a low overall accuracy and are not suited to detect incivility in software-engineering texts. Shen et al. [2019] compared machine-learning-based and lexicon-based tools and proposed an approach combining the different methods. Serva et al. [2015] used bigrams and unigrams to extract negative code examples (i.e., code examples that occur with negative sentiment in the corresponding comments) and showed that this approach has a precision of 75% and a recall of 74% on a sample set of 40 questions on StackOverflow. Biswas et al. [2019] compared word embeddings derived from StackOverflow posts with word embeddings derived from GoogleNews, resulting in a better performance of the embeddings derived from GoogleNews even on software-engineering texts. Already in 2018, Lin et al. [2018] created a human-annotated dataset derived from 5 annotators and showed that even software-engineering-related sentiment analysis tools do not work sufficiently well and "warn[ed] the research community about the strong limitations" of such tools. Novielli et al. [2021] compared four sentiment analysis tools from the software-engineering domain against each other on 600 discussions on GitHub and StackOverflow and observed that the different tools led to contradictory results. They hypothesized that such tools need to be adjusted to platform-specific conventions and jargon to deliver reliable results. In 2020, Novielli et al. [2020] evaluated four sentiment analysis tools for software-engineering texts on 6000 JIRA comments, 4000 StackOverflow posts, and a manually annotated dataset of 7000 GitHub pull-request comments (annotated by three persons; comments with disagreement were removed from the dataset), showing that lexicon-based approaches outperform supervised training approaches. They also derived guidelines on how these tools could be used "reliably" in the software-engineering domain (e.g., choose a tool that is appropriate for the respective purpose and tune the tool to the data source). Fucci et al. [2021] manually annotated 1038 texts from self-admitted technical debt (SATD) data and tried three sentiment analysis tools from the software-engineering domain. However, all of them had a lower performance (i.e., a higher disagreement with the humans) than expected, and the authors concluded that these tools are not appropriate for analyzing such texts. Consequently, in a subsequent study [Cassee et al., 2022], instead of using automated tools, they simply used the manually annotated dataset to further investigate SATD comments. According to Ahasanuzzaman et al. [2020], there is a "substantial agreement" among the tools that are specifically developed for the software-engineering domain. Nevertheless, the agreement among the tools is higher than the agreement between tools and humans [2018]. Mansoor et al. [2021] compared human annotations on StackOverflow posts with different sentiment analysis tools, partly from the software-engineering domain and partly not from the software-engineering domain. They found that, in general, the software-engineering-related tools performed better than the general-purpose tools. Disagreement between humans and tools was mostly caused by emojis, simple words, grammar, missing clarity, or excitement [Mansoor et al., 2021].
Wang [2019] noticed in weekly-collected emotions from project teams that there are differences between the team's view and an individual's view, which could lead to huge challenges for automatic emotion analysis tools.
Many tools ignore emojis, which are used by authors of a comment to explicitly express their sentiment in a "self-reported" way [Chen et al., 2021]. Park and Sharif [2021] compared human annotations of GitHub pull requests from six participants against five state-of-the-art sentiment analysis tools and observed an agreement between humans and tools between 6% and 56%. In particular, using eye tracking, they noticed that the humans had a high focus on emojis, but many tools ignore them. Chen et al. [2021] analyzed how well the state-of-the-art software-engineering-related sentiment analysis tools can detect emojis in JIRA and StackOverflow datasets. In particular, they noticed that the main difficulties are implicit sentiment, complex content, politeness, data preprocessing, and figurative language.
Robbes and Janes [2019] showed that pre-training neural network models specifically on software-engineering texts is an improvement over state-of-the-art neural network models. Cabrera-Diego et al. [2020] performed 16 experiments using different multi-label classifiers on StackOverflow and JIRA datasets and found that these multi-label classifiers outperform state-of-the-art sentiment analysis tools in the software-engineering domain. Imran et al. [2022] used data-augmentation techniques to obtain additional training data to improve emotion recognition on GitHub pull-request data. This way, they were able to improve the accuracy of state-of-the-art sentiment analysis tools for the software-engineering domain. However, they also noted that the emotions fear and anger are the most difficult to identify. Kadhar and Kumar [2022] compared various deep-learning approaches and found that classifiers that were trained on Google News performed better on a JIRA dataset than classifiers trained on software-engineering datasets, as the Google News dataset is large compared to software-engineering datasets.
Obaidi et al. [2022] combined three different sentiment analysis tools that have been trained in different domains and evaluated them on five established GitHub datasets. They found that using the majority vote of the three tools does not necessarily lead to better results (especially when one tool performs badly on a specific dataset). They conclude that the annotators as well as their way of annotating the different datasets were different, which may have led to the different tool performances on the different datasets. Similarly, Sun et al. [2022] pointed out that the way in which humans have annotated sentiments in software-engineering-related texts "plays a more important role for the importance of automated sentiment analysis" on such texts. In a similar vein, Mula et al. [2022] compared the performances of 14 different classifiers on JIRA and StackOverflow datasets and found that, while some models are quite accurate, others perform comparably badly. Moreover, adding or removing context (e.g., quotes of previous messages) can affect the performance of sentiment analysis tools, but does not lead to a general improvement in their accuracy, as Ferreira et al. [2024] showed in a cross-platform study with 1533 e-mails from the LKML and 5511 issue comments from GitHub.
Closest to our paper, after manually annotating 589 GitHub comments, Imtiaz et al. [2018] came to the conclusion that sentiment analysis in the software-engineering domain is unreliable, as “human raters also have a low agreement among themselves”. They evaluated 6 sentiment analysis tools and observed that neither the tools agreed among each other nor did they agree with the consensus (which was achieved after discussing the disagreeing annotations) of their human raters. Their results are in line with our study. In fact, we are able to confirm their general results on a different dataset, which is a valuable contribution on its own. In addition to that, the main difference to our study (except for data source and data sampling) is that, in their study, only two human raters annotated each comment, whereas we put our analysis on a broader basis by having 6 to 9 human annotators per text.
Similarly to our study, Herrmann et al. [2022] analyzed the human perception of 100 statements from pre-labeled and widely used datasets from GitHub, JIRA, or StackOverflow. To that aim, they asked 94 participants to label each of these statements, resulting in a huge difference between the labels assigned by the different participants. Only for 7 statements did they achieve a substantial agreement between the pre-defined labels from the datasets and the majority vote of their participants. Notably, none of their 94 participants agreed with all of the 100 pre-defined labels from the datasets. In our study, we use a different software-engineering-related dataset (e-mails from the LKML) and a larger number of texts (720 e-mails), but a lower number of annotators (6 to 9) per text. Despite these deviations between their study and our annotation study, we can confirm their results on the subjectivity of human perception, that is, the occurrence of non-negligible disagreement between human annotators. Even more, in a qualitative study, we investigate why the disagreement on specific texts is particularly high.
Literature on SE-specific Tool Application and Tool Usage
There are many sentiment analysis tools specifically designed for software engineering. Besides their development and evaluation, these tools have also been used in various studies to empirically answer specific research questions. In what follows, we provide an overview of relevant studies; due to the concerns raised above, however, their results have to be taken with a grain of salt.
In general, emotions and sentiment polarity are present in user and developer mailing lists [Tourani et al., 2014; Ferreira et al., 2019c] as well as in many GitHub projects [Jurado and Rodriguez, 2015] and JIRA issue comments [Murgia et al., 2014]. However, only a small fraction of the e-mails and issue comments expresses a positive or negative sentiment [Sengupta and Haythornthwaite, 2020; Hata et al., 2022; Valdez et al., 2020; Ferreira et al., 2019c; Skriptsova et al., 2019]. Sengupta and Haythornthwaite [2020] showed that 90% of the comments that contain emotions are answers to previous comments. However, the initial comments, in which developers ask for other developers' opinions, are mostly neutral [Chatterjee et al., 2021]. Beyond open-source software projects, the communication of student teams in software-engineering courses [Marshall et al., 2016] and the communication of children in Scratch projects also contain sentiment polarity and emotions [Graßl et al., 2022]. Hence, extracting emotions could be useful for improving emotional awareness in software projects [Guzman and Bruegge, 2013]. Claes et al. [2020] created a dataset of 17,300 issue comments from Apache and Mozilla projects and ran sentiment analysis tools on it. Also, GitHub issue comments are often analyzed by researchers. Whereas most GitHub projects are neutral, there are 10% more projects with negative sentiment than projects with positive sentiment [Sinha et al., 2016]. Projects using Java tend to be more negative than projects using other programming languages [Guzman and Bruegge, 2013].
Texts written by female developers seemed to be more positive than texts written by male developers (based on three months of data from a company's e-mails, forum discussions, bug reports, and review comments) [Patwardhan, 2017]. In contrast, Imtiaz et al. [2019] found that women are rather cautious with expressing sentiment and have a lower probability of expressing politeness than men. Paul et al. [2019] made similar observations to Imtiaz et al. [2019].
The sentiment of developers in software projects is also subject to constant change. Werder [2018] investigated 1121 GitHub projects and found that the amount of positive sentiment decreases over time. Robinson et al. [2016] also investigated how developer sentiment varies in certain situations. Rousinopoulos et al. [2014] investigated 4176 messages from the top ten contributors within two years of the OpenSUSE Factory mailing list. In general, the amount of positive sentiment decreases over time. In particular, there are months with more positive and months with more negative sentiment. Directly before a new release, there is an increase in detected sentiment (either positive or negative). Moreover, Guzman [2013] noticed in 354 e-mails from a developer mailing list that e-mails are more neutral and shorter at the beginning of a project and become more emotional and longer during later phases of a project. This is corroborated by the empirical observation that positive and negative comments tend to be longer than neutral comments [Lanovaz and Adams, 2019]. Ortu et al. [2018] investigated the politeness of about 650,000 comments and found that users and developers communicate differently. Even among developers, there seems to be a difference in communication style: Developers who only contributed one commit were more polite than developers who contributed regularly. Also, developers who have a significantly higher commenting activity than their peers often tend toward an increased amount of negative sentiment in their comments [Sarker et al., 2019]. On the other hand, people who never created an issue and never contributed to the source code but just comment on issues are also less polite than others [Destefanis et al., 2018]. Developers often became inactive in the bug reports and mailing lists of Gentoo after they had expressed (positively or negatively) strong emotions [Garcia et al., 2013]. Also, special kinds of events, such as disagreement with other developers on how to implement a specific feature, are related to negative sentiment [Li et al., 2021]. Freira et al. [2018] analyzed about 268,000 comments on GitHub to detect changes in developers' mood. They identified a mood variation in 31% of the cases within one hour after a developer had received feedback, and a mood variation of 18% between before and after feedback arrived. Up to 24% of the developers in a project never contributed again after they had received negative feedback. In particular, the sentiment that is prevalent in the replies that newcomers (i.e., new developers joining a project) receive influences whether they continue contributing to the project [Mahbub et al., 2021]. Nevertheless, developers seem to be less influenced by negative sentiment than users, and replies often continue the emotion of the initial message [Lanovaz and Adams, 2019]. However, many developers also try to resolve conflicts and reply in a neutral or polite manner after receiving a comment that contains negative sentiment [Ortu et al., 2016a]. Even comments with negative sentiment are not only criticizing, but often also constructive [Assavakamhaenghan et al., 2023]. In general, conversations among developers are more neutral compared to conversations between developers and users [Robe et al., 2022].
Moreover, positive or negative sentiment in organizational discussions seems to be related to changes in the socio-technical structure of a software project and, thus, also has an impact on the sustainability of the project [Yin et al., 2023].
As another factor, the discussion topic seems to impact developers' sentiment. Rahman et al. [2015] show that bugs or warnings are often associated with negative emotions (e.g., due to annoyance or frustration), whereas thankful comments are mostly positive. Similarly, bug reports of bugs that are not reproducible contain more negative sentiment than bug reports of reproducible bugs [Goyal and Sardana, 2017]. Security-related discussions on GitHub contain more negative sentiment than discussions that are not related to security [Pletea, 2014]. Hence, multiple studies detect the sentiment in bug reports to predict the bug severity in order to automatically prioritize bug reports according to their severity [Yang et al., 2017; Yang et al., 2018; Ramay et al., 2019; Umer et al., 2020; Dao and Yang, 2021]. In a similar vein, Ahasanuzzaman et al. [2018, 2020] use sentiment information to classify whether a StackOverflow post describes an issue or not, and Werner et al. [2018, 2019] show that sentiment can be used to identify escalated support tickets.
Islam and Zibran [2016] studied more than 490,000 commit messages and observed different sentiment polarity in different types of commits: The messages of bug-fixing or refactoring commits tend to be more positive, whereas the commit messages related to new features tend to be more negative. In particular, according to their study, developers who are more emotional tend to create longer commit messages. Researchers found that commit messages of bug-introducing and also of bug-fixing commits have a more positive sentiment than other commit messages [Islam and Zibran, 2018c]. In contrast, other studies reveal that commit messages of commits that introduce, precede, or fix bugs are more negative than other commit messages [Huq et al., 2020]. Sentiment analysis can also be used to distinguish between buggy and correct commits [Huq et al., 2020]. Buggy changes are often preceded by negative commits but by positive reviews [Huq et al., 2019]. Venigalla and Chimalakonda [2021b] analyzed the sentiment of commit messages and found that 45% of them express trust, whereas only 2% of them express disgust. In general, 78% contain positive emotions, whereas 21.5% contain negative emotions. Kaur et al. [2022] show that also the team size and the commit activity in a project influence the sentiment of commit messages. In addition, they claim that the launch of GitHub has led to more negative sentiment among developers. Moreover, negative sentiment is affected by continuous-integration build processes (e.g., failing builds), but negative sentiment also affects the build process (e.g., leads to failing builds) [Souza and Silva, 2017]. Madampe et al. [2020] investigate how developers' sentiment changes with respect to requirements changes. They found that receiving requirements changes leads to negative sentiment, whereas delivering the new requirements changes leads to positive sentiment.
Whereas two studies found that commit messages written on Mondays are more negative than on other days [Guzman et al., 2014, Kumar et al., 2022], other researchers found that the most negative sentiment is present in comments written on Tuesdays [Sinha et al., 2016], and still another study reports that comments written on Tuesdays and Fridays are least negative but those that are written on Sundays are most negative [Valdez et al., 2020]. Also, positive sentiment increases between Wednesday and Saturday [Valdez et al., 2020]. Also the time of day can affect the sentiment of a comment: Whereas positive sentiment seems to occur most frequently in the morning and least frequently shortly before midnight, negative sentiment is prevalent throughout the whole day (and there is only a slight increase of negative sentiment at night) [Valdez et al., 2020].
Munaiah et al. [2017] investigated the sentiment in code reviews and found that the more emotional and the less complex the code changes are, the more likely a code review fails to notice vulnerabilities. Furthermore, Tourani and Adams [2016] showed that the more negative sentiment occurs in a code review, the more defect-prone the code changes are. In addition, pull requests that contain anger or sadness have a lower probability of being merged than pull requests that contain positive emotions [Ortu et al., 2019]. According to El Asri et al. [2019], code reviews with negative comments also need more time to be addressed by the developer than code reviews with positive comments. In particular, newcomers react more emotionally to code reviews than core members of a project do [El Asri et al., 2019; Skriptsova et al., 2019]. However, whether developers perceive code reviews as toxic seems to be subjective [Chouchen et al., 2021]. Nevertheless, when multiple developers post comments to the same issues, their comments often share the same sentiment [Li et al., 2020]. In addition, there seems to be a correlation between toxic commit messages and various code-quality metrics [Sayago-Heredia et al., 2022a, 2022b].
When comparing different platforms, there appears to be more positive sentiment in GitHub issue discussions than in StackOverflow posts [Hata et al., 2022]. Novielli et al. [2014] investigated which factors affect the chance of getting an answer accepted. When writing an answer to StackOverflow posts, avoiding a negative attitude increases one's chances of getting the answer accepted [Calefato et al., 2015]. Both positive and negative sentiment in discussions seem to increase developer productivity, though [Kuutila et al., 2020]. Licorish and MacDonell [2014, 2018] analyzed the attitudes in software development teams and found that teams express different attitudes when working on different tasks, but the attitude is not related to task completion or productivity. However, the teams become more emotional when they remedy defects. Gao et al. [2022] explored the impact of bots on developers' sentiment and found that there is significantly less sentiment in pull requests and issues that are created by bots. Patnaik and Padhy [2022] analyzed sentiment with respect to code refactoring and software quality. Singh and Singh [2017] found that developers' commit messages are more negative than positive while refactoring. Swillus and Zaidman [2023] investigated how developers express sentiment in StackOverflow posts that are related to software testing. They found that lack of experience is often indicated by negative sentiment, whereas trust and confidence are indicated by positive sentiment. Also, unexpected behavior leads to negative sentiment, whereas inspiring books or blog posts lead to positive sentiment. Politeness also has a positive influence on developer attraction and project attractiveness [2015b]. To further investigate the relationship between project attractiveness and sentiment, Brisson et al. [2020] compared issue discussions among several forks of 385 software projects and found that sentiment is not related to the number of stars a project has.
Sentiment also seems to affect the issue fixing time [Yang et al., 2017], as issue discussions that contain negative sentiment tend to have a longer fixing time. Destefanis et al. [2016] studied JIRA comments and stated that politeness has a positive effect on issue fixing time and a project's attractiveness. On the contrary, extremely polite or extremely impolite issues both result in a shorter issue fixing time than the average [Ortu et al., 2015a, 2015b, 2016b]. As different studies targeting the same research question lead to contrary results, this again reinforces our concern that simply applying sentiment analysis tools without further validation is unreliable. Mäntylä et al. [2016] showed that prioritized issues contain a higher amount of arousal, that issues that are resolved faster are correlated with a higher arousal of the assigned developer, and that fast issue resolution increases the valence of the issue reporter. Sanei et al. [2021] observed that negative sentiment and a bad or polite tone lead to shorter response times, as do an excited tone and the necessity of substantial code changes. However, negative sentiment and a sad tone lead to longer discussions. Interestingly, they found that the more positive/neutral and polite comments occur, the longer the discussions lasted. Also, positive comments or a sad tone, in general, led to shorter resolution times [Sanei et al., 2021].
Furthermore, there are also studies that cover very specific aspects of sentiment in software engineering. For example, Claes et al. [2018] investigated the use of emojis in JIRA comments and found that they mostly express positive sentiment. They also noted that the use of emojis depends on projects, weekdays, and location. Batoun et al. [2023] investigated the emoji use in GitHub pull requests and found that emojis are mostly used to express positive reactions. Rong et al. [2022] showed that developers use emojis for various intentions and during every phase of a project. They also showed that core developers use emojis to a similar extent as other developers. However, sometimes there is a contradiction between the used emojis and the sentiment present in a comment. Wang et al. [2023] determined such inconsistencies in 11.8% of their investigated comments on GitHub pull requests. Using open coding, they searched for potential reasons for such inconsistencies. In 23% of the inconsistencies, they found that the sentiment analysis tool had detected the sentiment incorrectly. Another prevalent reason for the inconsistencies was the acknowledgment of a mistake [Wang et al., 2023]. Investigating tweets about software applications using sentiment analysis tools can help to find specific information that could be used to further improve these applications [Guzman et al., 2017]. Mostafa and Abd Elghany [2018] particularly investigated the emotions of game developers to find out whether they feel guilty because of the negative effects that game addiction can cause.
Further studies on GitHub issues showed that there are various forms of toxic comments: In most cases, the toxic comment in an issue is the first comment, which opens the issue. Often, there is no (concrete) target, but toxic comments also target people or source code. Though most of the toxic comments contain a complaint, only few of them appear to be aggressive. The authors of toxic comments come from both groups, external people and project members [Miller et al., 2022; Cohen, 2021]. The first reply (i.e., the second comment), however, is generally neutral [Assavakamhaenghan, 2023]. Moreover, the form and intensity of toxicity vary between projects and decreased over time from 2012 to 2018 [Raman, 2020]. In general, “heated” locked issues have the same number of participants and comments as regular issues; only about 9% of the comments are uncivil [Ferreira et al., 2022]. Issues that contain comments with negative sentiment tend to be reopened more often than issues without negative sentiment [Cheruvelil and da Silva, 2019].
Sokolovsky et al. [2021] use sentiment analysis to predict software releases, since emotions change during the course of a release cycle (e.g., there is more negative sentiment in the days prior to a release) [Alesinloye et al., 2019, Ferreira et al., 2019b]. Sapkota et al. [2019] and da Cruz et al. [2016] use sentiment analysis to automatically detect trust between developers, as the sentiment in a comment on GitHub may represent the developer's opinion towards the activity the developer is commenting on. Almarimi et al. [2023] built different machine-learning models to detect community smells in open-source software projects and found that models that use sentiment information from commits, issues, or pull requests perform better in detecting smells than models without sentiment information, which shows that there is a relationship between community smells and developers' sentiment. In particular, developers who exhibit smelly behavior seem to be comparably less polite but write rather positive issue comments [Huang et al., 2021]. Other researchers investigate whether dependencies between non-functional software requirements can be derived from the sentiments that occur in issue comments [Portugal et al., 2018]. Zhang and Hou [2013] proposed the tool Haystack to extract problematic API features from online forums based on negative sentiment in the corresponding discussions. Moreover, Uddin et al. [2020, 2021] identify developers' sentiment in StackOverflow comments and use this information to generate API documentation. Especially during the first months of the COVID-19 pandemic, developers often complained about missing documentation, which was reflected in negative sentiment within developer discussions at that point in time [2022a]. To explore future usage scenarios of sentiment analysis in software engineering, Schroth et al. [2022] tried out the concept of “realtime sentiment analysis” (i.e., visualizing sentiment scores to developers while they type a message). In a subsequent developer survey, some developers considered this useful, whereas others voiced misgivings about being observed while typing.
To identify sentiment in oral developer communication, Herrmann et al. [2021] used speech recognition to transcribe oral conversations and afterwards applied sentiment analysis tools to the textual representation of the conversations. To detect “speech acts” (i.e., communication that should affect other people's beliefs or behavior) in issue trackers, sentiment analysis tools can also be used to identify positive or negative opinions [Morales-Ramirez et al., 2019].
Closest to our aim, Ferreira et al. [2019a] assessed on the LKML whether the maintainers' sentiment changed after Linus Torvalds's temporary break. Similar to our methodology, they analyzed only e-mail threads that contained at least two e-mails, excluded patches, removed citations and greetings, and ignored e-mail addresses. In summary, they did not find any significant changes in the maintainers' sentiment between 2017 and 2019. Later on, Ferreira et al. [2021] collected 1545 e-mails from the LKML that were related to rejected patches. In their study, two persons manually coded the e-mails with respect to incivility, resulting in substantial agreement between the two coders. Afterwards, they identified potential causes for incivility in these e-mails and evaluated how much incivility exists. In contrast to their study, we sampled 720 e-mails from the LKML and had 6 to 9 humans annotate each e-mail with respect to aggressiveness. Moreover, as opposed to their manual coding, we observe disagreement among our human annotators on a substantial number of e-mails. Another difference is that we identify potential causes for the different perceptions of aggressiveness among humans (i.e., among the annotators) and discuss why individual perceptions matter and why human-labeled data as well as tool-annotated data are not reliable, whereas Ferreira et al. [2021] focused on understanding the communication between the developers (who authored the collected e-mails) and analyzed causes for incivility in general (not causes for different perceptions of aggressiveness).
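To illustrate how agreement among multiple annotators (as in our study) or between two coders (as in the study by Ferreira et al. [2021]) is typically quantified, the following minimal sketch computes a chance-corrected agreement measure (Fleiss' kappa). The toy rating matrix and the use of the statsmodels library are our own illustrative assumptions and do not reproduce the actual annotation data of any of the studies mentioned here:

    # Minimal sketch: chance-corrected agreement (Fleiss' kappa) among multiple
    # annotators. The ratings below are made up for illustration only.
    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Hypothetical ratings: rows = e-mails, columns = annotators,
    # values = aggressiveness label (0 = not aggressive, 1 = aggressive).
    ratings = np.array([
        [0, 0, 0, 0, 1, 0],  # annotators largely agree
        [1, 1, 1, 0, 1, 1],  # annotators largely agree
        [0, 1, 1, 0, 1, 0],  # ambiguous e-mail: substantial disagreement
        [1, 0, 1, 0, 0, 1],  # ambiguous e-mail: substantial disagreement
    ])

    # Convert the rater matrix into a subjects x categories count table
    # and compute Fleiss' kappa (1 = perfect agreement, <= 0 = chance level).
    table, _ = aggregate_raters(ratings)
    print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))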
Notably, some researchers have already pointed out that existing sentiment analysis tools face additional challenges. For example, Kritikos et al. [2020] analyzed the sentiment of 6960 bug comments on Bugzilla using sentiment analysis tools and stated that irony detection does not work properly. Murgia et al. [2018] analyzed 792 issue comments regarding emotions. They concluded that specific keywords, such as “thanks” or “sorry”, are important for sentiment analysis tools to detect sentiment polarity correctly. Further, Ferreira et al. [2019b] mentioned another challenge: Sometimes, a single sentence contains contradicting sentiments, which makes it difficult for sentiment analysis tools to determine an overall sentiment for the whole sentence.
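The following minimal sketch illustrates these challenges with a general-purpose, lexicon-based sentiment analysis tool. We use VADER here only as an easily available example; it is not necessarily one of the tools used in the studies cited above, and the example sentences are made up:

    # Minimal sketch: how a lexicon-based sentiment tool (here: VADER, as an
    # illustrative choice) reacts to politeness keywords and to contradicting
    # sentiments within a single sentence. The sentences are made up.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    examples = [
        "Thanks, applied.",                                 # politeness keyword dominates
        "Sorry, this patch breaks the build.",              # "sorry" softens a negative report
        "Great idea, but the implementation is horrible.",  # contradicting sentiments in one sentence
    ]

    for text in examples:
        scores = analyzer.polarity_scores(text)  # returns neg/neu/pos/compound scores
        print(f"{scores['compound']:+.2f}  {text}")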
Complete Literature Overview
A complete overview of our literature review (i.e., papers that fulfilled our inclusion criteria, as well as papers that we excluded) can be found here.
Downloads
Note: For data-privacy reasons, we cannot distribute the complete raw data that we gathered using our data-extraction tools. Names and message IDs have been anonymized. Please refer to the respective tools to produce a data set yourself.
Contact
If you have any questions regarding this paper or any other related project, please do not hesitate to contact us:
- Thomas Bock (Saarland University, Saarland Informatics Campus, Saarbrücken, Germany)
- Niklas Schneider (Saarland University, Saarland Informatics Campus, Saarbrücken, Germany)
- Angelika Schmid (IBM, Nürnberg, Germany)
- Sven Apel (Saarland University, Saarland Informatics Campus, Saarbrücken, Germany)
- Janet Siegmund (Chemnitz University of Technology, Chemnitz, Germany)