Have you ever wanted to build socio-technical developer networks the way you want? Then you are in the right place! Using this network library, you can construct such networks based on various data sources (commits, e-mails, issues) in a configurable and modular way. Additionally, we provide, e.g., analysis methods for network motifs, network metrics, and developer classification.
The network library `coronet` can be used to construct analyzable networks based on data extracted from Codeface [https://github.com/siemens/codeface] and its companion tool codeface-extraction [https://github.com/se-sic/codeface-extraction]. The library reads the written/extracted data from disk and constructs intermediate data structures for convenient data handling, either data containers or, more importantly, developer networks.

If you wonder: The name `coronet` derives as an acronym from the words "configurable", "reproducible", and, most importantly, "network". The name says it all and very much conveys our goal.
To use the package, we require the following infrastructure:

- `R`: The minimum requirement is `R` version 3.4.4; later `R` versions also work. (Earlier `R` versions from version 3.3.1 on should also work, but some packages are no longer available for these versions, so we no longer test them in our CI pipeline.) We currently recommend `R` version 4.1.1 or 3.6.3 for reliability reasons and `packrat` compatibility, but later `R` versions should work as well (and are tested using our CI script).

- `packrat` (recommended): The local package manager of `R` enables the user to store all needed `R` packages for this repository inside the repository itself. All `R` tools and IDEs should provide a more sophisticated interface for the interaction with `packrat` (RStudio does).
To use this network library, the input data has to match a certain folder structure and agree on certain file names. The data folder – which can result from consecutive runs of Codeface [https://github.com/se-sic/codeface] (branch `infosaar-updates`) and codeface-extraction [https://github.com/se-sic/codeface-extraction] – needs to have the following structure (roughly):
codeface-data
├── configurations
│ ├── threemonth
│ │ └──{project-name}_{tagging}.conf
│ ├── releases
│ │ └──{project-name}_{tagging}.conf
│ ├── ...
│
└── results
├── threemonth
│ └──{project-name}_{tagging}
│ └──{tagging}
│ ├── authors.list
│ ├── bots.list
│ ├── commits.list
│ ├── commitMessages.list
│ ├── emails.list
│ ├── issues-github.list
│ ├── issues-jira.list
│ └── revisions.list
├── releases
│ └──{project-name}_{tagging}
│ └──{tagging}
│ ├── authors.list
│ ├── ...
├── ...
The names "threemonth" and "releases" correspond to selection processes that are used inside Codeface and describe the notation of the `revs` key in the Codeface configuration files. Essentially, these are arbitrary names that are used internally for grouping. If you are in doubt, just pick a name and you are fine (you just need to take care that you give Codeface the correct folders!). E.g., if you use "threemonth" as selection process, you need to give Codeface and codeface-extraction the folder "releases/threemonth" as results folder (`resdir` command-line parameter of Codeface).
`{tagging}` corresponds to the different Codeface commit-analysis types. In this network library, `{tagging}` can be either `proximity` or `feature`. While `proximity` triggers a file/function-based commit analysis in Codeface, `feature` triggers a feature-based analysis. When using this network library, the user only needs to give the `artifact` parameter to the `ProjectConf` constructor, which automatically ensures that the correct tagging is selected.
The configuration files `{project-name}_{tagging}.conf` are mandatory and contain some basic configuration regarding a performed Codeface analysis (e.g., project name, name of the corresponding repository, name of the mailing list, etc.). For further details on those files, please have a look at some example files in the Codeface repository.

All the `*.list` files listed above are output files of codeface-extraction and contain metadata of, e.g., commits or e-mails to the mailing list, etc., in CSV format. This network library lazily loads and processes these files when needed.
To manage the following packages, we recommend using `packrat` via the `R` command `install.packages("packrat"); packrat::on()`. This will automatically detect all needed packages and install them. Alternatively, you can run `Rscript install.R` to install the packages.
- `yaml`: To read YAML configuration files (i.e., Codeface configuration files)
- `R6`: For proper classes
- `igraph`: For the construction of networks (package version 1.3.0 or higher is recommended)
- `plyr`: For the `dlply` splitting-function and `rbind.fill`
- `parallel`: For parallelization
- `logging`: For logging
- `sqldf`: For advanced aggregation of `data.frame` objects
- `data.table`: For faster data processing
- `reshape2`: For reshaping of data
- `testthat`: For the test suite
- `patrick`: For the test suite
- `ggplot2`: For plotting of data
- `ggraph`: For plotting of networks (needs the `udunits2` system library, e.g., `libudunits2-dev` on Ubuntu!)
- `markovchain`: For core/peripheral transition probabilities
- `lubridate`: For convenient date conversion and parsing
- `viridis`: For plotting of networks with nice colors
- `jsonlite`: For parsing the issue data
- `rTensor`: For calculating EDCPTD centrality
- `Matrix`: For sparse matrix representation of large adjacency matrices

Please integrate the project into yours by using git submodules.
Furthermore, the file `install.R` installs all needed R packages (see above) into your R library. Nevertheless, the use of `packrat` with your project is recommended.

This library is written in a way that does not interfere with the loading order of your project's `R` packages (i.e., `library()` calls), so that the library does not lead to masked definitions.

To initialize the library in your project, you need to source all files of the library in your project using the following command:

source("path/to/util-init.R", chdir = TRUE)

Not doing so may lead to unpredictable behavior, as we need to set some system and environment variables to ensure correct behavior of all functionality (e.g., parsing timestamps in the correct timezone and reading files from disk using the correct encoding).

Note: If you have used this library as a submodule already before it was renamed to `coronet`, you need to ensure that the right remote URL is used. The best way to do that is to remove the current submodule and re-add it with the new URL.
When selecting a version to work with, you should consider the following points:

- Each release of this library is tagged with a version of the form `v{major}.{minor}[.{bugfix}]`.
- On the branch `master`, there is always the most recent and complete version.
- If you, nonetheless, work on a former version, there might be a branch called `{your_version}-fixes` (e.g., `v2.3-fixes`) when we have fixed some extreme bugs in that version; select this one, as it contains backported bugfixes for the former version. We will backport some very important bug fixes only in special cases and only for the last minor version of the second last major version.
- New features are developed on the `dev` branch.

There are two different classes of configuration objects in this library:

- the `ProjectConf` class, which determines all configuration parameters needed for the configured project (mainly data paths), and
- the `NetworkConf` class, which is used for all configuration parameters concerning data retrieval and network construction.

You can find an overview on all the parameters in these classes below in this file.
There are two distinguishable types of data sources that are both handled by the class `ProjectData` (and possibly its subclass `RangeData`):

- Main data sources:
  - Commit data (called `"commits"` internally)
  - Mailing-list data (called `"mails"` internally)
  - Issue data (called `"issues"` internally)

- Additional data sources:
  - Commit messages (can be configured using the parameter `commit.messages` in the `ProjectConf` class). Three values can be used:
    - `none` is the default value and does not impact the configuration at all.
    - `title` merges the commit-message titles (i.e., the first non-whitespace line of a commit message) to the commit data. This gives the data frame an additional column `title`.
    - `messages` merges both titles and message bodies to the commit data frame. This adds two new columns, `title` and `message`.
  - Gender data (can be configured using the parameter `gender` in the `ProjectConf` class)
  - PaStA data (can be configured using the parameter `pasta` in the `ProjectConf` class)
  - Synchronicity data (can be configured using the parameter `synchronicity` in the `ProjectConf` class)
  - Custom event timestamps (can be configured using the parameter `custom.event.timestamps.file` in the `ProjectConf` class)

The important difference is that the main data sources are used internally to construct artifact vertices in relevant types of networks. Additionally, these data sources can be used as a basis for splitting `ProjectData` in a time-based or activity-based manner – obtaining `RangeData` instances as a result (see the file `split.R` and the contained functions). Thus, `RangeData` objects contain only data of a specific period of time.
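For illustration, a time-based split could look like the following sketch (it assumes a splitting function `split.data.time.based` with a `time.period` parameter, following the `split.data.*` naming used in this document; check the splitting file for the actual signatures):

```r
## split the project data into consecutive three-month ranges;
## the result is a list of RangeData objects, one per range
ranges = split.data.time.based(data, time.period = "3 months")

## each RangeData object only contains the data of its specific range
range.data = ranges[[1]]
```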
The additional data sources are orthogonal to the main data sources, can augment them by additional information, and, thus, are not split at any time.
All data sources are accessible from the `ProjectData` and `RangeData` objects through their respective getter methods. For some data sources, there are additional methods available to access, for example, a more aggregated version of the data.
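As a sketch, accessing the main data sources could look like this (the getter names are assumptions derived from the internal data-source names above; see the class documentation for the methods that actually exist):

```r
## lazily read and access the main data sources of a ProjectData object
commits = data$get.commits()
mails = data$get.mails()
issues = data$get.issues()
```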
When constructing networks by using a `NetworkBuilder` object, we basically construct `igraph` objects. You can find more information on how to handle these objects on the igraph project website.

For the construction to work, you need to pass an instance of each of the classes `ProjectData` and `NetworkConf` as parameters when calling the `NetworkBuilder` constructor. The `ProjectData` object holds the data that is used as the basis for the constructed networks, while the `NetworkConf` object configures the construction process in detail (see below and also Section NetworkConf for more information).
Beware: The `ProjectData` instance passed to the constructor of the class `NetworkBuilder` is cloned inside the `NetworkBuilder` instance! The main reason is the latent ability to cut data to unified date ranges (the parameter `unify.date.ranges` in the class `NetworkConf`), which would compromise the originally given data object; consequently, data cutting is only performed on the cloned data object. Further implications are:

- When calling `NetworkBuilder$reset.environment()`, the cloned `ProjectData` object gets replaced by a new clone based on the originally given `ProjectData` instance.
- If you want to adapt the data used inside the `NetworkBuilder`, you need to adapt it via `NetworkBuilder$get.project.data()`. This also includes that, if data is read and cached inside a `ProjectData` object during network construction, the cached data is only available through the `NetworkBuilder` instance!
- If you change the original `ProjectData` object in any way, you need to create a new `NetworkBuilder` instance!

There are four types of networks that can be built using this library: author networks, artifact networks, bipartite networks, and multi networks (which are a combination of author, artifact, and bipartite networks). In the following, we give some more details on the various types. All types and their incorporated relations can be configured using a `NetworkConf` object supplied to a `NetworkBuilder` object. The respective relations and their meaning are explained in the next section in more detail.
- Author networks: The relation(s) among authors can be configured using the `NetworkConf` attribute `author.relation`. For the edge-construction algorithms used for constructing author networks, please also see the respective section.
- Artifact networks: The relation(s) among artifacts can be configured using the `NetworkConf` attribute `artifact.relation`. The relation also describes which kinds of artifacts are represented as vertices in the network. (For example, if "mail" is selected as `artifact.relation`, only mail-thread vertices are included in the network.)
- Bipartite networks: The relation(s) between authors and artifacts can be configured using the `NetworkConf` attribute `artifact.relation`.
- Multi networks: The relations can be configured using the `NetworkConf` attributes `author.relation` and `artifact.relation`, respectively.

Relations determine which information is used to construct edges among the vertices in the different types of networks. In this network library, you can specify, if wanted, several relations for a single network using the corresponding `NetworkConf` attributes mentioned in the following.
- `cochange`:
  - For author networks (parameter `author.relation` in the `NetworkConf`), authors who change the same source-code artifact are connected with an edge.
  - For artifact networks (parameter `artifact.relation` in the `NetworkConf`), source-code artifacts that are concurrently changed in the same commit are connected with an edge.
  - For bipartite networks (parameter `artifact.relation` in the `NetworkConf`), authors get linked to all source-code artifacts they have changed in their respective commits.
- `mail`:
  - For author networks (parameter `author.relation` in the `NetworkConf`), authors who contribute to the same mail thread are connected with an edge.
  - For artifact networks (parameter `artifact.relation` in the `NetworkConf`), mail threads are connected when they reference each other. (Note: There are no edges available right now.)
  - For bipartite networks (parameter `artifact.relation` in the `NetworkConf`), authors get linked to all mail threads they have contributed to.
- `issue`:
  - For author networks (parameter `author.relation` in the `NetworkConf`), authors who contribute to the same issue are connected with an edge.
  - For artifact networks (parameter `artifact.relation` in the `NetworkConf`), issues are connected when they reference each other. (Note: There are no edges available right now.)
  - For bipartite networks (parameter `artifact.relation` in the `NetworkConf`), authors get linked to all issues they have contributed to.
- `callgraph`:
  - For artifact networks (parameter `artifact.relation` in the `NetworkConf`), source-code artifacts are connected when they reference each other (i.e., one artifact calls a function contained in the other artifact).
  - For bipartite networks (parameter `artifact.relation` in the `NetworkConf`), authors get linked to all source-code artifacts they have changed in their respective commits (same as for the relation `cochange`).

When constructing author networks, we use events in time (i.e., commits, e-mails, issue events) to model interactions among authors on the same artifact as edges. Therefore, we group the events on artifacts, based on the configured relation (see the previous section).
We have four different edge-construction possibilities, based on two configuration parameters in the `NetworkConf`:

On the one hand, networks can either be directed or undirected (configured via `author.directed` in the `NetworkConf`). If directedness is configured, the edges are directed from the author of an event (i.e., the actor) to the authors the actor interacted with via this event.

On the other hand, we can construct edges based on the temporal order of events or neglect the temporal order (configured via `author.respect.temporal.order` in the `NetworkConf`). When respecting the temporal order, for every group of events, there will be edges for each event in the group from its author to the actors of all previous events in the group. More precisely, if there are several previous events of an author, we construct an individual edge for each of those events (resulting in several duplicated edges arising from the same event). Potentially, this also includes loop edges (i.e., edges from one vertex to itself). Otherwise, when neglecting the temporal order, there will be mutual edges among all pairs of authors, representing all events in the group performed by one pair of authors (i.e., if directedness is configured, there are edges in both directions).
In the following, we illustrate the edge construction for all combinations of temporally (un-)ordered data and (un-)directed networks on an example with one mail thread:
Consider the following raw e-mail data for one thread (i.e., one group of events), temporally ordered from the first to the last e-mail:
| Author | Date (Timestamp) | Artifact (Mail Thread) |
|---|---|---|
| A | 1 | <thread-1> |
| A | 2 | <thread-1> |
| B | 3 | <thread-1> |
Based on the above raw data, we get the following author networks with relation mail
:
| | respect temporal order | without respecting temporal order |
|---|---|---|
| network directed | A ←(2)– A <br> A ←(3)– B <br> A ←(3)– B | A –(1)→ B <br> A –(2)→ B <br> A ←(3)– B |
| network undirected | A –(2)– A <br> A –(3)– B <br> A –(3)– B | A –(1)– B <br> A –(2)– B <br> A –(3)– B |
When constructing author networks with respecting the temporal order, there is one edge for each answer in a mail thread from the answer's author to the senders of every previous e-mail in this mail thread. Note that this can lead to duplicated edges if an author has sent several previous e-mails to the mail thread (see the duplicated edges `A –(3)– B` in the above example). This also leads to loop edges if the author of an answer has already sent an e-mail to this thread before (see the edge `A –(2)– A`).

If the temporal order is not respected, for each e-mail in a mail thread, there is an edge from the sender of the e-mail to every other author participating in this mail thread (regardless of the order in which the e-mails were sent). In this case, no loop edges are contained in the network. However, it is possible that there are several edges (having different timestamps) between two authors (see the edges `A –(1)– B` and `A –(2)– B` in the example above). If directedness is configured, the edges are directed from the sender of an e-mail to the other authors.
Analogously, these edge-construction algorithms apply also for all other relations among authors (see the Section Relations).
There are some mandatory attributes that are added to vertices and edges in the process of network construction. These are not optional and will be added in all cases when using instances of the class `NetworkBuilder` to obtain networks.

Mandatory vertex attributes:

- `type`: the abstract type of the vertex; possible values: [`"Author"`, `"Artifact"`]
- `kind`: the specific kind of the vertex, a refinement of `type`; possible values: [`"Author"`, `"File"`, `"Feature"`, `"Function"`, `"MailThread"`, `"Issue"`, `"FeatureExpression"`]
- `name`: the name of the vertex

Mandatory edge attributes:

- `type`: the abstract type of the edge; possible values: [`"Unipartite"`, `"Bipartite"`]
- `relation`: the relation that produced the edge, a refinement of `type` (see also the attributes `artifact.relation` and `author.relation` in the `NetworkConf` class); possible values: [`"mail"`, `"cochange"`, `"issue"`, `"callgraph"`]
- `artifact.type`: the type of artifact the edge is based on; possible values: [`"File"`, `"Feature"`, `"Function"`, `"Mail"`, `"IssueEvent"`, `"FeatureExpression"`]
- `weight`: the weight of the edge
- `date`: the date of the event the edge is based on

To add further edge attributes, please see the parameter `edge.attributes` in the `NetworkConf` class. To add further vertex attributes – which can only be done after constructing a network – please see the functions `add.vertex.attribute.*` in the file `util-networks-covariates.R` for the set of corresponding functions to call.
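A sketch of adding such a vertex attribute after network construction (the concrete function name `add.vertex.attribute.commit.count.author` is an assumption following the `add.vertex.attribute.*` naming scheme, and so is its signature; see `util-networks-covariates.R` for the functions that actually exist):

```r
## add a vertex attribute to an already constructed network 'net',
## based on the project data in 'data' (hypothetical function name)
net = add.vertex.attribute.commit.count.author(net, data, default = 0)
```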
Often, it is interesting to build the networks not only for the whole project history but also to split the data into smaller ranges. One benefit is the ability to observe changes in the network over time. Further details can be found in the Section Splitting information.
Since we extract the data for each data source independently, the time ranges for available data can be quite different. For example, there may be a huge amount of time between the first extracted commit and the first extracted e-mail (and also analogously for the last commit resp. e-mail). This circumstance can affect various analyses using this network library.
To compensate for this, the class `ProjectData` supplies a method `ProjectData$get.data.cut.to.same.date()`, which returns a clone of the underlying `ProjectData` instance in which the data sources are cut to their common latest first entry date and their common earliest last entry date.
Analogously, the `NetworkConf` parameter `unify.date.ranges` enables this very functionality latently when constructing networks with a `NetworkBuilder` instance. Note: Please see also Section Data sources for network construction for further information on data handling inside the class `NetworkBuilder`!
In some cases, it is not necessary to build a network to get the information you need. Therefore, please remember that we offer the possibility to get the raw data or mappings between, e.g., authors and the files they edited. The data inside an instance of `ProjectData` can be accessed independently. Examples can be found in the file `showcase.R`.
In this section, we give a short example on how to initialize all needed objects and build a bipartite network.
Disclaimer: The following code is configured to use sample data shipped with this repository. If you want to use the network library with a real-world project such as BusyBox, you need actual data and adjust the variables in the first block of the code to the existing data.
```r
CF.DATA = "./sample/" # path to codeface data
CF.SELECTION.PROCESS = "testing" # selection process
CASESTUDY = "sample" # project name
ARTIFACT = "feature" # the source-code artifact to use

## configuration of network relations
AUTHOR.RELATION = "mail"
ARTIFACT.RELATION = "cochange"

## initialize network library
source("./util-init.R", chdir = TRUE)

## create the configuration objects
proj.conf = ProjectConf$new(CF.DATA, CF.SELECTION.PROCESS, CASESTUDY, ARTIFACT)
net.conf = NetworkConf$new()

## update the values of the NetworkConf object to the specific needs
net.conf$update.values(list(author.relation = AUTHOR.RELATION,
                            artifact.relation = ARTIFACT.RELATION,
                            simplify = TRUE))

## get project-folder information from project configuration
cf.project.folder = proj.conf$get.entry("project") # obtaining: "sample_feature"

## create data object which actually holds and handles data
data = ProjectData$new(proj.conf)

## create network builder to construct networks from the given data object
netbuilder = NetworkBuilder$new(data, net.conf)

## create and get the bipartite network
## (construction configured by net.conf's "artifact.relation")
bpn = netbuilder$get.bipartite.network()

## plot the retrieved network
plot.network(bpn)
```
Please also see the other types of networks we can construct.
For more information on how to use the configuration classes and how to construct networks with them, please see the corresponding section.
Additionally, for more examples, the file `showcase.R` is worth a look.
- `util-init.R`: initialization of the library (source this file to use the library, see above)
- `util-conf.R`: the configuration classes `ProjectConf` and `NetworkConf`
- `util-read.R`: functionality to read the extracted data from disk
- `util-data.R`: the data classes `ProjectData` and `RangeData`
- `util-networks.R`: the `NetworkBuilder` class and all corresponding helper functions to construct networks
- `util-split.R`: splitting functionality for data and networks
- `util-bulk.R`: bulk construction of networks
- `util-networks-covariates.R`: functions to add vertex attributes to constructed networks
- `util-networks-metrics.R`: network-metric functions
- `util-data-misc.R`: miscellaneous functions for data handling
- `util-networks-misc.R`: miscellaneous functions for network handling
- `util-tensor.R`: tensor functionality (e.g., for EDCPTD centrality)
- `util-core-peripheral.R`: core/peripheral developer classification
- `util-motifs.R`: network-motif functionality
- `util-plot.R`: plotting functionality for networks
- `util-plot-evaluation.R`: plotting functionality for evaluation data
- `util-misc.R`: miscellaneous helper functions
- `showcase.R`: a showcase of the library's functionality
- `tests.R`: the test suite (see also the `tests/` subfolder)

In this section, we give an overview on the parameters of the `ProjectConf`
class and their meaning.

All parameters can be retrieved with the method `ProjectConf$get.entry(...)`, by passing one parameter name as method parameter. There is no way to update the entries, except for the revision-based parameters.
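As a sketch, retrieving entries works as follows (using `get.entry` as named above, with parameter names from the lists below):

```r
## retrieve single ProjectConf entries by name
project = proj.conf$get.entry("project")
tagging = proj.conf$get.entry("tagging")
```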
- `project`: the project name from the Codeface analysis (e.g., `busybox_feature`)
- `repo`: the repository name (e.g., `busybox`)
- `description`: the project description
- `mailinglists`: the mailing lists of the project; mail-thread IDs in the `mails` data are prefixed with the index of their mailing list (e.g., an entry `13#5` in the column `thread` corresponds to thread ID `5` on mailing list `13`)
- `artifact`: the artifact of the project to analyze
- `artifact.short`: the short name of the artifact
- `artifact.codeface`: the artifact name as used in Codeface
- `tagging`: the Codeface tagging parameter, automatically derived from the `artifact` parameter; either `"proximity"` or `"feature"`
Note: This data is updated after performing a data-based splitting (i.e., by calling the functions `split.data.*(...)`).
Note: These parameters can be updated using the method `ProjectConf$set.splitting.info()`, but you should not do that manually!

- `revisions`: the analyzed revisions of the project
- `revisions.dates`: the dates corresponding to the `revisions`
- `revisions.callgraph`: the revisions in the format used for call-graph data
- `ranges`: the revision ranges constructed from the `revisions`
- `ranges.callgraph`: the revision ranges constructed from `revisions.callgraph`
- `datapath`: the data path to the Codeface results
- `datapath.callgraph`: the data path to the call-graph data
- `datapath.synchronicity`: the data path to the synchronicity data
- `datapath.pasta`: the data path to the PaStA data
Note: This data is added to the `ProjectConf` object only after performing a data-based splitting (by calling the functions `split.data.*(...)`).
Note: These parameters can be updated using the method `ProjectConf$set.splitting.info()`, but you should not do that manually!

- `split.type`: either `"time-based"` or `"activity-based"`, depending on the splitting function
- `split.length`: the length of the constructed ranges
- `split.basis`: the data source used as basis for the splitting (either `"commits"` or `"mails"`)
- `split.sliding.window`: whether a sliding-window approach was used (`"TRUE"` or `"FALSE"`)
- `split.revisions`: the revisions used for splitting
- `split.revisions.dates`: the dates corresponding to `split.revisions`
- `split.ranges`: the ranges constructed from `split.revisions` (either in sliding-window manner or not, depending on `split.sliding.window`)

Note: The following parameters can be configured using the method `ProjectConf$update.values()`.
- `commits.filter.base.artifact`: whether to filter the base artifact when calling `get.commits.filtered`, so that the result does not contain any commit information about changes to the base artifact. Networks built on top of this `ProjectData` then also do not contain any base-artifact information anymore. [`TRUE`, `FALSE`]
- `commits.filter.untracked.files`: whether to filter untracked files when calling `get.commits.filtered`, so that the result does not contain any commits that solely changed untracked files. Networks built on top of this `ProjectData` then also do not contain any information about untracked files. [`TRUE`, `FALSE`]
- `commits.locked`: whether the commit data is locked [`TRUE`, `FALSE`]
- `commit.messages`: how to add commit messages to the commit data; if selected, the column `title` will contain the first line of the message and, if selected, the column `message` will contain the rest [`none`, `title`, `messages`]
- `filter.bots`: whether to filter out bots, as identified by the `bots.list` file [`TRUE`, `FALSE`]
- `gender`: whether to read and add gender data (adding the column `gender`) [`TRUE`, `FALSE`]
- `issues.only.comments`: whether to use only comment events from the issue data [`TRUE`, `FALSE`]
- `issues.from.source`: the source(s) from which the issue data is read [`github`, `jira`]
- `issues.locked`: whether the issue data is locked [`TRUE`, `FALSE`]
- `mails.filter.patchstack.mails`: whether to filter out patchstack mails (the filtering is applied when `mails.filter.patchstack.mails = TRUE`) [`TRUE`, `FALSE`]
- `mails.locked`: whether the mail data is locked [`TRUE`, `FALSE`]
- `pasta`: whether to read and add PaStA data (adding the columns `pasta` and `revision.set.id`) [`TRUE`, `FALSE`]; this parameter needs to be set to use the `"pasta"` edge attribute for `edge.attributes`
- `synchronicity`: whether to read and add synchronicity data (adding the column `synchronicity`) [`TRUE`, `FALSE`]; this parameter needs to be set to use the `"synchronicity"` edge attribute for `edge.attributes`
- `synchronicity.time.window`: the time window used for the synchronicity data; only used if `synchronicity = TRUE`
- `custom.event.timestamps.file`: the file to read custom event timestamps from
- `custom.event.timestamps.locked`: whether the custom event timestamps are locked [`TRUE`, `FALSE`]

In this section, we give an overview on the parameters of the `NetworkConf` class and their meaning.

All parameters can be retrieved with the method `NetworkConf$get.variable(...)`, by passing one parameter name as method parameter. Updates to the parameters can be done by calling `NetworkConf$update.variables(...)` and passing a list of parameter names and their respective values.

Note: Default values are shown in italics.
- `author.relation`: the relation(s) among authors, encoded as edges in author networks. Note: This parameter is handled differently than `artifact.relation`! [`"mail"`, `"cochange"`, `"issue"`]
- `author.directed`: the directedness of edges in author networks [`TRUE`, `FALSE`]
- `author.respect.temporal.order`: whether to respect the temporal order of events during edge construction in author networks. If no explicit value is configured (i.e., `NA` is used), the value of `author.directed` is used for determining whether to respect the temporal order during edge construction. [`TRUE`, `FALSE`, `NA`]
- `author.all.authors`: whether to add all authors of the project to the network, even those without relevant activity in the underlying data [`TRUE`, `FALSE`]
- `author.only.committers`: whether to only keep authors that are connected to artifacts via the `artifact.relation` as relation, i.e., all authors that have no bipartite relations in a bipartite/multi network are removed [`TRUE`, `FALSE`]
- `artifact.relation`: the relation(s) among artifacts, encoded as edges in artifact networks; this relation also determines the artifacts used as vertices in bipartite and multi networks [`"cochange"`, `"callgraph"`, `"mail"`, `"issue"`]
- `artifact.directed`: the directedness of edges in artifact networks; this parameter only takes effect for the `issue` relation, as the `cochange` relation is always undirected, while the `callgraph` relation is always directed. For the `mail` relation, we currently do not have data available to exhibit edge information. [`TRUE`, `FALSE`]
- `edge.attributes`: the list of edge attributes to add during network construction; possible values include:
  - general: `"date"`, `"date.offset"`, `"artifact.type"`
  - author information: `"author.name"`, `"author.email"`
  - committer information: `"committer.date"`, `"committer.name"`, `"committer.email"`
  - e-mail information: `"message.id"`, `"thread"`, `"subject"`
  - commit information: `"hash"`, `"file"`, `"artifact"`, `"changed.files"`, `"added.lines"`, `"deleted.lines"`, `"diff.size"`, `"artifact.diff.size"`, `"synchronicity"`, `"pasta"`
  - issue information: `"issue.id"`, `"event.name"`, `"issue.state"`, `"creation.date"`, `"closing.date"`, `"is.pull.request"`

  Note: `"date"` and `"artifact.type"` are always included, as this information is needed for several parts of the library, e.g., time-based splitting.
  Note: To use the attributes `"pasta"` and `"synchronicity"`, the project configuration's parameters `pasta` and `synchronicity` need to be set to `TRUE`, respectively (see the `ProjectConf` parameters above).
- `edges.for.base.artifacts`: whether to construct edges for base artifacts (note that base artifacts may be filtered from the data when `commits.filter.base.artifact == TRUE`, or, when `commits.filter.untracked.files == TRUE` and `artifact == FILE`; all of these options can be configured in the `ProjectConf`; warning: `commits.filter.base.artifact` and `commits.filter.untracked.files` are `TRUE` by default) [`TRUE`, `FALSE`]
- `simplify`: whether to simplify the networks after construction (i.e., condense multiple edges between the same pair of vertices into a single edge) [`TRUE`, `FALSE`]
- `simplify.multiple.relations`: whether to combine edges of multiple relations into a single edge during simplification. Note: This parameter does not take effect if `simplify = FALSE`! [`TRUE`, `FALSE`]
- `skip.threshold`: a threshold for the maximum number of edges that a single event group may produce; groups exceeding this threshold are omitted during network construction. For example, the maximum number of `mail`-based directed edges in an author network for one mail thread with 100 authors is 5049. A value of 5000 for `skip.threshold` (as it is smaller than 5049) would lead to the omission of this mail thread from the network.
- `unify.date.ranges`: whether to latently cut the data sources to their common date ranges when constructing networks with a `NetworkBuilder` instance. See also Section Cutting data to unified date ranges for more information on this. [`TRUE`, `FALSE`]

To equip a `NetworkBuilder` with an instance of the `NetworkConf` class, just pass the object as a parameter to the constructor. You can also update the `NetworkConf` object at any time by calling `NetworkBuilder$update.network.conf(...)`, but as soon as you do so, all cached data of the `NetworkBuilder` object are reset and have to be rebuilt.
For more examples, please have a look into the file `showcase.R`.
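As a sketch, updating the configuration of an existing builder looks as follows (using `update.network.conf` as named above; passing a list of parameter names and values mirrors `NetworkConf$update.variables` and is an assumption):

```r
## change the artifact relation on the existing network builder;
## beware: this resets all data cached inside the builder
netbuilder$update.network.conf(list(artifact.relation = "issue"))
```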
For the most recent changes and releases, please have a look at our NEWS.
If you want to contribute to this project, please have a look at the file CONTRIBUTING.md for guidelines and further details.
This project is licensed under GNU General Public License v2.0.
To see what will be the next things to be implemented, please have a look at the list of issues.