coronet: The network library

Have you ever wanted to build socio-technical developer networks the way you want? Here, you are in the right place. Using this network library, you are able to construct such networks based on various data sources (commits, e-mails, issues) in a configurable and modular way. Additionally, we provide, e.g., analysis methods for network motifs, network metrics, and developer classification.

The network library coronet can be used to construct analyzable networks based on data extracted from Codeface [https://github.com/siemens/codeface] and its companion tool codeface-extraction [https://github.com/se-sic/codeface-extraction]. The library reads the written/extracted data from disk and constructs intermediate data structures for convenient data handling, either data containers or, more importantly, developer networks.

If you wonder: The name coronet derives as an acronym from the words “configurable”, “reproducible”, and, most importantly, “network”. The name says it all and very much conveys our goal.

Exemplary plot of multi network

Integration

Requirements

While using the package, we require the following infrastructure.

R

The minimum requirement is R version 3.3.1; later R versions also work.

We currently recommend version 4.1.1 or 3.6.3 for reliability reasons and packrat compatibility, but later R versions should also work (and are tested using our CI script).

packrat, the local package manager of R, enables the user to store all needed R packages for this repository inside the repository itself. All R tools and IDEs should provide a more sophisticated interface for the interaction with packrat (RStudio does).

Folder structure of the input data

To use this network library, the input data has to match a certain folder structure and agree on certain file names. The data folder – which can result from consecutive runs of Codeface [https://github.com/se-sic/codeface] (branch infosaar-updates) and codeface-extraction [https://github.com/se-sic/codeface-extraction] – needs to have the following structure (roughly):

  codeface-data
  ├── configurations
  │   ├── threemonth
  │   │     └──{project-name}_{tagging}.conf
  │   ├── releases
  │   │     └──{project-name}_{tagging}.conf
  │   ├── ...
  │
  └── results
      ├── threemonth
      │     └──{project-name}_{tagging}
      │           └──{tagging}
      │                ├── authors.list
      │                ├── bots.list
      │                ├── commits.list
      │                ├── commitMessages.list
      │                ├── emails.list
      │                ├── issues-github.list
      │                ├── issues-jira.list
      │                └── revisions.list
      ├── releases
      │     └──{project-name}_{tagging}
      │           └──{tagging}
      │                ├── authors.list
      │                ├── ...
      ├── ...

The names “threemonth” and “releases” correspond to selection processes that are used inside Codeface and describe the notation of the revs key in the Codeface configuration files. Essentially, these are arbitrary names that are used internally for grouping. If you are in doubt, just pick a name and you are fine (you just need to take care that you give Codeface the correct folders!). E.g., if you use “threemonth” as selection process, you need to give Codeface and codeface-extraction the folder “results/threemonth” as results folder (resdir command-line parameter of Codeface), matching the folder structure shown above.

{tagging} corresponds to the different Codeface commit-analysis types. In this network library, {tagging} can be either proximity or feature. While proximity triggers a file/function-based commit analysis in Codeface, feature triggers a feature-based analysis. When using this network library, the user only needs to give the artifact parameter to the ProjectConf constructor, which automatically ensures that the correct tagging is selected.

The configuration files {project-name}_{tagging}.conf are mandatory and contain some basic configuration regarding a performed Codeface analysis (e.g., project name, name of the corresponding repository, name of the mailing list, etc.). For further details on those files, please have a look at some example files in the Codeface repository.

All the *.list files listed above are output files of codeface-extraction and contain meta data of, e.g., commits or e-mails to the mailing list, etc., in CSV format. This network library lazily loads and processes these files when needed.
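The folder layout above can be turned into a file path mechanically. The following is a minimal sketch in base R; the helper build.list.path is hypothetical (it is not part of coronet), and the project name "busybox" is only an illustrative value.

```r
## Hypothetical helper (not part of coronet): assemble the path of one *.list
## file according to the folder structure described above, i.e.,
## {data-root}/results/{selection-process}/{project-name}_{tagging}/{tagging}/{file}
build.list.path = function(data.root, selection.process, project, tagging, file) {
    project.folder = paste(project, tagging, sep = "_")
    return(file.path(data.root, "results", selection.process, project.folder, tagging, file))
}

## illustrative values only
path = build.list.path("codeface-data", "threemonth", "busybox", "proximity", "commits.list")
print(path)
```

In practice, you do not need to build these paths yourself; the sketch only makes explicit how the selection process, the project name, and the tagging are combined.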

Needed R packages

To manage the following packages, we recommend using packrat via the R command install.packages("packrat"); packrat::on(). This will automatically detect all needed packages and install them. Alternatively, you can run Rscript install.R to install the packages.

Submodule

Please integrate this project into yours using git submodules. Furthermore, the file install.R installs all needed R packages (see above) into your R library. However, we recommend using packrat with your project.

This library is written in a way to not interfere with the loading order of your project’s R packages (i.e., library() calls), so that the library does not lead to masked definitions.

To initialize the library in your project, you need to source all files of the library in your project using the following command:

source("path/to/util-init.R", chdir = TRUE)

Skipping this step may lead to unpredictable behavior, as we need to set some system and environment variables to ensure correct behavior of all functionality (e.g., parsing timestamps in the correct timezone and reading files from disk using the correct encoding).
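The effect of chdir = TRUE in the command above can be illustrated with a small, self-contained sketch. It assumes that an initialization file loads its sibling files via relative paths; the files init.R and helper.R below are hypothetical stand-ins created in a temporary folder, not files of this library.

```r
## create a throw-away "library" folder with two files (hypothetical stand-ins)
lib.dir = file.path(tempdir(), "library-sketch")
dir.create(lib.dir, showWarnings = FALSE)
writeLines("helper.value = 42", file.path(lib.dir, "helper.R"))
## the init file loads its sibling via a relative path
writeLines("source(\"helper.R\")", file.path(lib.dir, "init.R"))

## chdir = TRUE temporarily switches the working directory to the folder of
## init.R, so that the relative path "helper.R" can be resolved from anywhere
source(file.path(lib.dir, "init.R"), chdir = TRUE)
print(helper.value)
```

Without chdir = TRUE, the relative path would be resolved against the current working directory of your own project instead.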

Note: If you have used this library as a submodule already before it was renamed as coronet, you need to ensure that the right remote URL is used. The best way to do that is to remove the current submodule and re-add it with the new URL.

Selecting the correct version

When selecting a version to work with, you should consider the following points:

Functionality

Configuration

There are two different classes of configuration objects in this library: ProjectConf and NetworkConf.

You can find an overview on all the parameters in these classes below in this file.

Data sources

There are two distinguishable types of data sources that are both handled by the class ProjectData (and possibly its subclass RangeData): main data sources and additional data sources.

The important difference is that the main data sources are used internally to construct artifact vertices in relevant types of networks. Additionally, these data sources can be used as a basis for splitting ProjectData in a time-based or activity-based manner – obtaining RangeData instances as a result (see file split.R and the contained functions). Thus, RangeData objects contain only data of a specific period of time.

The additional data sources are orthogonal to the main data sources, can augment them by additional information, and, thus, are not split at any time.

All data sources are accessible from the ProjectData and RangeData objects through their respective getter methods. For some data sources, there are additional methods available to access, for example, a more aggregated version of the data.

Network construction

When constructing networks by using a NetworkBuilder object, we basically construct igraph objects. You can find more information on how to handle these objects on the igraph project website.

Data sources for network construction

For the construction to work, you need to pass an instance of each of the classes ProjectData and NetworkConf as parameters when calling the NetworkBuilder constructor. The ProjectData object holds the data that is used as basis for the constructed networks, while the NetworkConf object configures the construction process in detail (see below and also Section NetworkConf for more information).

Beware: The ProjectData instance passed to the constructor of the class NetworkBuilder is cloned inside the NetworkBuilder instance! The main reason is the latent ability to cut data to unified date ranges (the parameter unify.date.ranges in the class NetworkConf), which would compromise the original given data object; consequently, data cutting is only performed on the cloned data object. Further implications are:

Types of networks

There are four types of networks that can be built using this library: author networks, artifact networks, bipartite networks, and multi networks (which are a combination of author, artifact, and bipartite networks). In the following, we give some more details on the various types. All types and their incorporated relations can be configured using a NetworkConf object supplied to a NetworkBuilder object. The respective relations and their meaning are explained in the next section in more detail.

Relations

Relations determine which information is used to construct edges among the vertices in the different types of networks. In this network library, you can specify, if wanted, several relations for a single network using the corresponding NetworkConf attributes mentioned in the following.

Edge-construction algorithms for author networks

When constructing author networks, we use events in time (i.e., commits, e-mails, issue events) to model interactions among authors on the same artifact as edges. Therefore, we group the events on artifacts, based on the configured relation (see the previous section).

We have four different edge-construction possibilities, based on two configuration parameters in the NetworkConf (whether the temporal order of the events is respected and whether the network is directed):

In the following, we illustrate the edge construction for all combinations of temporally (un-)ordered data and (un-)directed networks on an example with one mail thread:

Consider the following raw e-mail data for one thread (i.e., one group of events), temporally ordered from the first to the last e-mail:

| Author | Date (Timestamp) | Artifact (Mail Thread) |
|--------|------------------|------------------------|
| A      | 1                | <thread-1>             |
| A      | 2                | <thread-1>             |
| B      | 3                | <thread-1>             |

Based on the above raw data, we get the following author networks with relation mail:

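|                    | respect temporal order          | without respecting temporal order |
|--------------------|---------------------------------|-----------------------------------|
| network directed   | A ←(2)– A, A ←(3)– B, A ←(3)– B | A –(1)→ B, A –(2)→ B, A ←(3)– B   |
| network undirected | A –(2)– A, A –(3)– B, A –(3)– B | A –(1)– B, A –(2)– B, A –(3)– B   |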

When constructing author networks while respecting the temporal order, there is one edge for each answer in a mail thread from the answer’s author to the senders of every previous e-mail in this mail thread. Note that this can lead to duplicated edges if an author has sent several previous e-mails to the mail thread (see the duplicated edges A –(3)– B in the above example). This also leads to loop edges if an author of an answer has already sent an e-mail to this thread before (see the edge A –(2)– A).

If the temporal order is not respected, for each e-mail in a mail thread, there is an edge from the sender of the e-mail to every other author participating in this mail thread (regardless of the order in which the e-mails were sent). In this case, no loop edges are contained in the network. However, it is possible that there are several edges (having different timestamps) between two authors (see the edges A –(1)– B and A –(2)– B in the example above). If directedness is configured, the edges are directed from the sender of an e-mail to the other authors.

Analogously, these edge-construction algorithms also apply to all other relations among authors (see the Section Relations).
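To make the two algorithms more tangible, here is a simplified sketch in base R that reproduces the directed edges of the example above. It is an illustration of the described behavior only, not the implementation used inside the library; the function names edges.temporal and edges.unordered are hypothetical.

```r
## raw e-mail data of one thread, temporally ordered (taken from the example above)
mails = data.frame(author = c("A", "A", "B"), date = c(1, 2, 3),
                   stringsAsFactors = FALSE)

## temporal order respected: every e-mail yields one edge to the sender of each
## earlier e-mail in the thread, carrying the date of the later e-mail
edges.temporal = function(mails) {
    from = character(0); to = character(0); date = numeric(0)
    for (i in seq_len(nrow(mails))) {
        for (j in seq_len(i - 1)) {
            from = c(from, mails$author[i])
            to = c(to, mails$author[j])
            date = c(date, mails$date[i])
        }
    }
    return(data.frame(from = from, to = to, date = date, stringsAsFactors = FALSE))
}

## temporal order not respected: every e-mail yields one edge from its sender
## to every other author participating in the thread (hence, no loop edges)
edges.unordered = function(mails) {
    participants = unique(mails$author)
    from = character(0); to = character(0); date = numeric(0)
    for (i in seq_len(nrow(mails))) {
        for (other in setdiff(participants, mails$author[i])) {
            from = c(from, mails$author[i])
            to = c(to, other)
            date = c(date, mails$date[i])
        }
    }
    return(data.frame(from = from, to = to, date = date, stringsAsFactors = FALSE))
}

print(edges.temporal(mails))
print(edges.unordered(mails))
```

For the undirected variants, the same edges are obtained and the direction (columns from and to) is simply ignored.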

Vertex and edge attributes

There are some mandatory attributes that are added to vertices and edges in the process of network construction. These are not optional and will be added in all cases when using instances of the class NetworkBuilder to obtain networks.

To add further edge attributes, please see the parameter edge.attributes in the NetworkConf class. To add further vertex attributes – which can only be done after constructing a network –, please see the functions add.vertex.attribute.* in the file util-networks-covariates.R for the set of corresponding functions to call.

Further functionalities

Splitting data and networks based on defined time windows

Often, it is interesting to build the networks not only for the whole project history but also to split the data into smaller ranges. One benefit is the ability to observe changes in the network over time. Further details can be found in the Section Splitting information.
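The idea of time-based splitting can be sketched with base R alone. The data, the window boundaries, and the variable names below are made up for illustration; the library's own splitting functions (see file split.R) operate on ProjectData objects and are not shown here.

```r
## made-up commit data
commits = data.frame(
    author = c("A", "B", "A", "C"),
    date = as.Date(c("2016-01-10", "2016-02-20", "2016-04-05", "2016-05-15")),
    stringsAsFactors = FALSE
)

## made-up window boundaries; each window is left-closed and right-open
bins = as.Date(c("2016-01-01", "2016-04-01", "2016-07-01"))

## assign each commit the index of the window it falls into ...
window = findInterval(as.numeric(commits$date), as.numeric(bins))

## ... and split the data accordingly, obtaining one data.frame per window
ranges = split(commits, window)
print(ranges)
```

Each element of ranges then only contains the data of one period of time and can be analyzed separately.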

Cutting data to unified date ranges

Since we extract the data for each data source independently, the time ranges for available data can be quite different. For example, there may be a huge amount of time between the first extracted commit and the first extracted e-mail (and, analogously, between the last commit and the last e-mail). This circumstance can affect various analyses using this network library.

To compensate for this, the class ProjectData supplies a method ProjectData$get.data.cut.to.same.date(), which returns a clone of the underlying ProjectData instance for which the data sources are cut to their common latest first entry date and their common earliest last entry date.
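The cutting rule (latest first entry date, earliest last entry date) can be sketched in base R as follows. The data is made up, the variable names are hypothetical, and treating both boundaries as inclusive is an assumption of this sketch, not a statement about the library's behavior.

```r
## made-up data sources with different time ranges
sources = list(
    commits = data.frame(date = as.Date(c("2016-01-01", "2016-03-01", "2016-06-01"))),
    mails = data.frame(date = as.Date(c("2016-02-01", "2016-04-01", "2016-08-01")))
)

## first and last entry date of each data source (as numeric values)
first.dates = sapply(sources, function(df) as.numeric(min(df$date)))
last.dates = sapply(sources, function(df) as.numeric(max(df$date)))

## common range: latest first entry date and earliest last entry date
common.start = max(first.dates)
common.end = min(last.dates)

## keep only the entries within the common range (boundaries inclusive: assumption)
sources.cut = lapply(sources, function(df) {
    keep = as.numeric(df$date) >= common.start & as.numeric(df$date) <= common.end
    return(df[keep, , drop = FALSE])
})
print(sources.cut)
```

In this example, the common range spans from 2016-02-01 (first e-mail) to 2016-06-01 (last commit), so the first commit and the last e-mail are dropped.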

Analogously, the NetworkConf parameter unify.date.ranges enables this very functionality latently when constructing networks with a NetworkBuilder instance. Note: Please see also Section Data sources for network construction for further information on data handling inside the class NetworkBuilder!

Handling data independently

In some cases, it is not necessary to build a network to get the information you need. Therefore, please remember that we offer the possibility to get the raw data or mappings between, e.g., authors and the files they edited. The data inside an instance of ProjectData can be accessed independently. Examples can be found in the file showcase.R.

How-to

In this section, we give a short example on how to initialize all needed objects and build a bipartite network.

Disclaimer: The following code is configured to use sample data shipped with this repository. If you want to use the network library with a real-world project such as BusyBox, you need actual data and must adjust the variables in the first block of the code to the existing data.

CF.DATA = "./sample/" # path to codeface data
CF.SELECTION.PROCESS = "testing" # selection process
CASESTUDY = "sample" # project name
ARTIFACT = "feature" # the source-code artifact to use

## configuration of network relations
AUTHOR.RELATION = "mail"
ARTIFACT.RELATION = "cochange"

## initialize network library
source("./util-init.R", chdir = TRUE)

## create the configuration objects
proj.conf = ProjectConf$new(CF.DATA, CF.SELECTION.PROCESS, CASESTUDY, ARTIFACT)
net.conf = NetworkConf$new()

## update the values of the NetworkConf object to the specific needs
net.conf$update.values(list(author.relation = AUTHOR.RELATION,
                            artifact.relation = ARTIFACT.RELATION,
                            simplify = TRUE))

## get project-folder information from project configuration
cf.project.folder = proj.conf$get.entry("project") # obtaining: "sample_feature"

## create data object which actually holds and handles data
data = ProjectData$new(proj.conf)

## create network builder to construct networks from the given data object
netbuilder = NetworkBuilder$new(data, net.conf)

## create and get the bipartite network
## (construction configured by net.conf's "artifact.relation")
bpn = netbuilder$get.bipartite.network()

## plot the retrieved network
plot.network(bpn)

Please also see the other types of networks we can construct. For more information on how to use the configuration classes and how to construct networks with them, please see the corresponding section. Additionally, for more examples, the file showcase.R is worth a look.

File/Module overview

Configuration classes

ProjectConf

In this section, we give an overview on the parameters of the ProjectConf class and their meaning.

All parameters can be retrieved with the method ProjectConf$get.entry(...), by passing one parameter name as method parameter. There is no way to update the entries, except for the revision-based parameters.

Basic information

Note: This data is updated after performing a data-based splitting (i.e., by calling the functions split.data.*(...)).

Note: These parameters can be updated using the method ProjectConf$set.splitting.info(), but you should not do that manually!

Data paths

Splitting information

Note: This data is added to the ProjectConf object only after performing a data-based splitting (by calling the functions split.data.*(...)).

Note: These parameters can be updated using the method ProjectConf$set.splitting.info(), but you should not do that manually!

Note: These parameters can be configured using the method ProjectConf$update.values().

NetworkConf

In this section, we give an overview on the parameters of the NetworkConf class and their meaning.

All parameters can be retrieved with the method NetworkConf$get.variable(...), by passing one parameter name as method parameter. Updates to the parameters can be done by calling NetworkConf$update.variables(...) and passing a list of parameter names and their respective values.

Note: Default values are shown in italics.

The class NetworkBuilder holds an instance of the NetworkConf class; just pass the object as parameter to the constructor. You can also update the NetworkConf object at any time by calling NetworkBuilder$update.network.conf(...), but as soon as you do so, all cached data of the NetworkBuilder object are reset and have to be rebuilt.

For more examples, please have a look into the file showcase.R.

Changelog

For the most recent changes and releases, please have a look at our NEWS.

Contributing

If you want to contribute to this project, please have a look at the file CONTRIBUTING.md for guidelines and further details.

License

This project is licensed under GNU General Public License v2.0.

Work in progress

To see what will be the next things to be implemented, please have a look at the list of issues.