1 Introduction

Software maintenance is, according to van Vliet [27], one of the most effort- and time-intensive activities in the software lifecycle, since it consumes 50–75 % of the overall project resources. According to ISO/IEC 14764:2006, the most common maintenance activities include the adaptation of a system to new environments, the implementation of additional requirements, the improvement of run-time or design-time quality properties, and the identification of latent defects. It is expected that the worse the internal quality of a system is, the more difficult the software will be to maintain [2]. However, there are several precautionary actions that managers can take (e.g., software refactorings [10]) to improve the structural quality of the software, which in turn leads to decreased maintenance effort and cost. Nevertheless, the time budget allocated to such maintenance activities is usually limited. Thus, it should be cautiously allocated to the modules that suffer from the most important bad smells [10].

In this study we focus on the relationship between structural software quality characteristics (expressed through metrics), maintenance production (i.e., software units maintained), and maintenance duration. Although we acknowledge that structural quality is not the only parameter that should be used in such a model (i.e., a complete model should also consider the number of changing requirements, the number of defects, etc.), we isolate software structure in order to explore its influence exclusively. In classical economics, production and duration are combined through the measure of productivity, which is defined as an average measure of production per unit of time [6]. In this line of thought, we define maintenance productivity as the average number of lines of code maintained (software units maintained) in a given time unit. Although the exploration of the relationship between software units maintained and structural quality measures is not novel (see Sect. 2), to the best of our knowledge this is the first study that considers maintenance productivity. To achieve this goal we used a Bayesian Belief Network (BBN) to model the relationships between the input parameters (structural software metrics) and the output parameters (maintenance production, maintenance duration, and maintenance productivity). The network has been trained through a case study on 454 versions of 20 Open Source Software (OSS) projects. The rest of the paper is organized as follows: Sect. 2 presents research efforts related to maintenance production; Sect. 3 presents background information on BBNs and the metrics used in the developed network; Sect. 4 presents the case study design; Sect. 5 presents and discusses our results and their implications for researchers and practitioners; Sect. 6 presents threats to validity; and, finally, Sect. 7 concludes the paper.

2 Related Work

Research on software maintenance effort can be categorized into two groups: (a) development of effort estimation models, and (b) development of methods that improve the maintenance outcomes.

Among the popular effort estimation models we find Estimation by Analogy (EbA) and regression-based estimation. Shepperd et al. [24] suggested EbA for allocating staff in software maintenance projects based on similar completed historical maintenance projects. De Lucia et al. [16] used multiple regression analysis to construct corrective maintenance effort estimation models; their results show that the performance of the model improves when the different types of maintenance tasks are included. The use of function points has been proposed for estimating the effort of software maintenance projects [1]. Bayesian Networks have also been used in software maintenance [26]: van Koten and Gray [26] used them to predict maintainability, measured as the change in lines of code between subsequent releases, and obtained improved results compared to regression models. In [19], BBNs are used to predict delays in software maintenance tasks. Bayesian Networks have also been applied to software fault prediction [9] and cost estimation [25]. Their application indicated certain advantages, namely the ability to incorporate expert judgement and to provide flexible and informative estimates.

On the other hand, there are studies in the literature that examine the factors affecting software maintenance effort. System size, system age, number of input/output data items, application type, and programming language are among the parameters that can affect software maintenance effort according to [1, 12]. There is also a vast literature on code metrics that affect the maintainability of software and, consequently, the corresponding effort required. According to Riaz et al. [22], the metric suites proposed by Li and Henry [15] and Chidamber and Kemerer [8] are the most accurate ones for assessing software maintainability. Other studies explore the specific parameters that affect open source project development, quality, and evolution [14, 20, 21]. Developer participation, core team size, code ownership, productivity, defect density, and problem resolution intervals were examined in these studies, leading, among others, to the conclusion that defect density and productivity benefit from the large open source community of testers and bug fixers, and that a high speed of releases seems to require highly modularized software. To the best of our knowledge, this is the first study that considers maintenance productivity.

3 Background Information

In this section we present background information needed to comprehend this study. In particular, we provide information on Bayesian Belief Networks and on the structural quality metrics associated with maintainability.

Bayesian Belief Networks are Directed Acyclic Graphs (DAGs), i.e., causal networks that consist of a set of nodes and a set of directed links between them that do not form a cycle [13]. Each node represents a random variable that can take mutually exclusive values according to a probability distribution, which can be different for each node. Each link expresses a probabilistic cause-effect relation between the linked variables and is depicted by an arc starting from the influencing variable (parent node) and terminating at the influenced variable (child node). The relation between the two nodes is based on Bayes’ rule:

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$
(1)

A simple BBN example is presented in Fig. 1. The model consists of two nodes. The first node (NOC) represents the number of classes in a software package and the second node (Maintenance Effort) represents the effort required for package maintenance, measured in Person-Months (widely reported in the literature as Man-Months, MMs). We consider that the values of these two nodes fall into two discrete categories (Low and High). For the node NOC, let us suppose that Low values range between 1 and 10 classes (and similarly for Maintenance Effort). For example, the first column of the Node Probability Table states that if the number of classes is low, there is a 70 % probability that the maintenance effort will be low and a 30 % probability that it will be high (a worked computation is sketched after Fig. 1). The algorithm used for learning Bayesian Networks is analytically presented in [7].

Fig. 1. (a) A BBN for maintenance effort; (b) Node probability table for Fig. 1a
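To make the example concrete, the following plain-Python sketch encodes the node probability table of Fig. 1 and applies Bayes’ rule (Eq. 1) to it. Only the first column of the table is stated above; the second column and the prior on NOC are hypothetical values chosen for illustration.

```python
# Minimal sketch of the two-node BBN of Fig. 1 in plain Python.
# Only the first NPT column (NOC = Low) is given in the text; the second
# column and the prior on NOC are hypothetical values used for illustration.

# P(MaintenanceEffort | NOC)
npt = {
    "Low":  {"Low": 0.7, "High": 0.3},   # column given in the text
    "High": {"Low": 0.2, "High": 0.8},   # hypothetical column
}
prior_noc = {"Low": 0.5, "High": 0.5}    # hypothetical prior on NOC

# Marginal P(Effort = e) = sum over n of P(e | n) * P(n)
def p_effort(e):
    return sum(npt[n][e] * prior_noc[n] for n in prior_noc)

# Bayes' rule (Eq. 1): P(NOC = n | Effort = e) = P(e | n) * P(n) / P(e)
def p_noc_given_effort(n, e):
    return npt[n][e] * prior_noc[n] / p_effort(e)

if __name__ == "__main__":
    print("P(Effort=High) =", p_effort("High"))                       # 0.55
    print("P(NOC=High | Effort=High) =", p_noc_given_effort("High", "High"))
```

Under these assumed values, for instance, a package observed to require high maintenance effort is more likely to contain many classes, which is exactly the kind of inference the trained network supports.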

The tool used for constructing the networks and the classifier is available on the web (http://www.cs.ualberta.ca/~jcheng/bnpc.htm).

Maintainability Predictors: In this study we focus on metrics that are the most accurate software maintainability predictors. According to Riaz et al. [22], the metric suites proposed by Li and Henry [15] and Chidamber and Kemerer [8] are the most accurate ones, while the model of van Koten and Gray [26] is the most stable one [22]. The metric suite drawn from the aforementioned studies and used in this work is presented in Table 1. In order to automate the calculation of software quality metrics we used Percerons Client (http://www.percerons.com), a tool developed in our research group.

Table 1. Maintainability predictors

4 Case Study Design

In order to investigate the influence of structural quality on several managerial maintenance indices (i.e., duration, production, and productivity), we performed a case study on 20 open-source software (OSS) projects. The case study is designed and reported according to the guidelines of Runeson et al. [23].

4.1 Objectives and Research Questions

The objective of this study is to analyze software metrics and investigate their influence on maintenance production, duration, and productivity. In particular, we investigate three research questions:

  • RQ1: Which quality metrics are related to the duration of maintenance among two successive releases?

    This question aims at identifying the metrics that most influence the time required for developing a new version of an OSS project, measured in days elapsed from one release to the next. Quick successive releases may indicate a high level of readiness of the project community, a tendency to launch frequent releases, and a short time to fix bugs or add functionality.

  • RQ2: Which quality metrics are related to the production of maintenance among two successive releases?

    The objective of this question is to reveal which quality metrics are most associated with maintenance production in OSS, measured as the number of lines of code added since the previous release. The target of this analysis is to identify the quality characteristics whose levels enable a substantial change in a single version. We expect that low values of coupling and complexity, and high values of cohesion, will correspond to projects in which maintenance production is large and efficient.

  • RQ3: Which quality metrics are related to the productivity of maintenance among two successive releases?

    This research question concerns productivity, which is calculated as the ratio between production and duration (a minimal computation sketch is given after this list). We note that this calculation of productivity deviates from the typical one, because duration in OSS development might include idle time, when developers are inactive. However, to the best of our knowledge, the actual development time cannot be retrieved from source code repositories. The objective is to confirm whether high levels of quality lead to increased productivity and, to some extent, to eliminate the bias caused by change load.
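As referenced above, the following sketch shows how the three maintenance indices could be computed for a single transition between two successive releases; the Release structure and field names are illustrative assumptions, not the actual extraction scripts used in the study.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Release:
    version: str
    released: date   # upload date of the release
    loc: int         # total lines of code of the release

def maintenance_indices(prev: Release, curr: Release):
    """Return (duration in days, production in LOC added, productivity)."""
    duration = (curr.released - prev.released).days             # RQ1: days between releases
    production = curr.loc - prev.loc                            # RQ2: LOC added
    productivity = production / duration if duration else None  # RQ3: ratio of the two
    return duration, production, productivity

# Hypothetical usage on two successive releases
r1 = Release("1.0", date(2014, 3, 1), 120_000)
r2 = Release("1.1", date(2014, 4, 15), 123_600)
print(maintenance_indices(r1, r2))   # (45, 3600, 80.0)
```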

4.2 Case Selection and Data Processing

To collect subjects for our case study we retrieved a set of popular and active OSS projects. In particular, we selected only projects that were active in 2014 (i.e., published at least one version) and that, when sorted by popularity (based on the default sourceforge.net measure), were among the top-ranked ones. Additionally, we selected projects containing more than 300 classes, to ensure that sufficient units of analysis would participate in the study and that toy applications would be excluded. We therefore downloaded and collected metrics data from 20 popular OSS projects, so as to build a dataset adequate for statistical analysis. Some demographics for these projects are presented in Table 2.

Table 2. Case study subjects

The case study is an embedded multiple-case study, in the sense that for every OSS project (i.e., case) we analyze every transition between two successive versions (i.e., unit of analysis) [23]. For each release of the projects presented in Table 2, we collected three sets of data: the structural quality metrics presented in Table 1 (see Sect. 3), the targeted managerial maintenance indices (see Sect. 4.1), and some process metrics. The obtained process metrics include the date each release was uploaded, the sequential release number, which indicates the maturity of the project (i.e., whether we are at the beginning of the project lifecycle or not), and the number of downloads recorded for each release (considering the previous version each time) from the date it was launched until 01/01/2015. The resulting project releases were then further filtered to exclude releases with negative values of the production variable (deletion of functionality) and releases with zero duration (releases uploaded all at once). After this pre-processing phase the remaining project releases were selected for analysis. Descriptive statistics for all dependent variables of our analysis are presented in Table 3.
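The filtering step described above could look roughly as follows, assuming the per-transition records have been gathered into a pandas DataFrame with hypothetical column names.

```python
import pandas as pd

# Hypothetical per-transition records: one row per transition between
# two successive releases of a project (column names are illustrative).
transitions = pd.DataFrame({
    "project":    ["A", "A", "B", "B"],
    "production": [3600, -250, 900, 1200],   # LOC added since previous release
    "duration":   [45, 30, 0, 12],           # days between releases
})

# Exclude transitions with negative production (deletion of functionality)
# and zero duration (releases uploaded all at once).
filtered = transitions[(transitions["production"] >= 0) &
                       (transitions["duration"] > 0)]
print(filtered)
```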

Table 3. Descriptive statistics for maintenance indices

To apply the BBN analysis we had to further process the gathered metrics, since quality metrics take continuous arithmetic values, and these had to be transformed into categorical ones in order to be utilized in the BBN analysis. The transformation method used was equal-frequency binning [28]. This method automatically sets the boundaries of each bin (category) so as to ensure that all bins contain an equal number of observations. Each metric is thus represented in categorical form with five distinct categories (VL—Very Low, L—Low, A—Average, H—High, and VH—Very High).
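As an illustration, equal-frequency binning into the five categories can be approximated with pandas’ quantile-based discretization; this is a sketch of the transformation under hypothetical data, not necessarily the exact implementation of [28].

```python
import pandas as pd

# Hypothetical continuous metric values (e.g., MPC) for a set of versions
mpc = pd.Series([3.1, 4.7, 5.0, 6.2, 7.8, 8.1, 9.4, 10.3, 12.0, 15.6])

# Quantile-based discretization: bin boundaries are chosen so that each of
# the five bins (VL, L, A, H, VH) holds roughly the same number of observations.
labels = ["VL", "L", "A", "H", "VH"]
mpc_binned = pd.qcut(mpc, q=5, labels=labels)
print(mpc_binned.value_counts())   # each category holds ~2 observations here
```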

4.3 Data Analysis

Upon completion of the aforementioned process we applied the Bayesian Network analysis and utilized heatmaps to graphically represent our data. Our goal was to better depict the influence of the independent variables on the dependent ones.

The first step of the analysis includes the selection and tuning of the variables that affect the structure of the BBN. In this step, expert input is required to set the order of the variables, which affects the causal relationships that the model will define. The metric order was selected by taking into account two considerations: (a) the development phase at which the values of the metrics become available, in the sense that certain design metrics are defined first, while certain source code metrics become available some time afterwards; and (b) the relatedness of metrics, based on their calculation method. For example, coupling metrics such as CBO and MPC are closely related; in the model, CBO precedes MPC, in the sense that the value of CBO “includes” the value of MPC.
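The following sketch illustrates how such an expert ordering can be turned into structural constraints for the learning step: any arc pointing from a later variable to an earlier one in the ordering is forbidden before structure learning is run. The ordering shown is partial and purely illustrative, not the exact ordering used in the study.

```python
from itertools import combinations

# Illustrative (partial) expert ordering: design-time metrics first,
# then source-code metrics, then the maintenance indices.
ordering = ["DIT", "CBO", "MPC", "AMS", "DURATION", "PRODUCTION", "PRODUCTIVITY"]

# Forbid every arc that would point "backwards" in the ordering, so that the
# structure-learning step can only add arcs consistent with expert knowledge.
forbidden_edges = [(later, earlier)
                   for earlier, later in combinations(ordering, 2)]
print(len(forbidden_edges), "arcs are excluded from the search space")
```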

After extracting the model we employed heatmaps to visualize the obtained outcome. Heatmaps are graphical representations that use color intensity to denote occurrence frequency. While applying heatmaps to our result set, we developed a matrix in which each row represents a value of the dependent variable (e.g., low maintenance duration) and each column a value of the independent variable (e.g., systems with a very low level of polymorphism). Each cell shows the percentage of cases with a given value of the independent variable that exhibit a given value of the dependent variable (e.g., 10 % of the cases with a very low level of polymorphism are maintained very quickly, i.e., exhibit low maintenance duration).
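A heatmap of this kind can be reproduced roughly as follows, assuming a DataFrame with one categorical column per variable and one row per version transition; the use of seaborn and the example data are assumptions rather than the tooling and data of the original study.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical categorical data: one row per version transition.
df = pd.DataFrame({
    "MPC":      ["VL", "L", "A", "H", "VH", "VL", "A", "VH", "H", "L"],
    "DURATION": ["L",  "L", "A", "H", "VH", "H",  "A", "H",  "H", "A"],
})

# Column-normalized cross-tabulation: each cell is the percentage of cases
# with a given MPC level that fall into a given DURATION level.
# (Category order is alphabetical here; an ordered Categorical dtype could
# enforce the VL..VH ordering.)
table = pd.crosstab(df["DURATION"], df["MPC"], normalize="columns") * 100

sns.heatmap(table, annot=True, fmt=".0f", cmap="Blues",
            cbar_kws={"label": "% of cases"})
plt.title("Maintenance duration vs. MPC (illustrative data)")
plt.show()
```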

5 Results and Discussion

In this section we present the results obtained from the BBN analysis, organized by research question. The extracted network including all structural quality metrics, process measures, and managerial maintenance indices is presented in Fig. 2. We note that since this study aims to investigate the interdependencies between the nodes of the graph, the estimation accuracy of the network is not explored. A discussion/interpretation of the most important relationships presented in the network is provided later in this section, while answering each research question.

Fig. 2. Maintenance indices Bayesian belief network

5.1 Duration

In this section we identify the structural properties that are related to maintenance duration (RQ1). From Fig. 2, we can observe that the variables directly connected to DURATION are the number of downloads, Depth of Inheritance Tree (DIT), and Message Passing Coupling (MPC). The heatmaps visualizing the relationships of these variables to the duration of the maintenance activities are presented in Table 4.

Table 4. Parameters influencing maintenance duration

Interpreting the left-most part of the heatmap, we observe that as the number of downloads increases, the maintenance duration becomes longer. This result can be intuitively interpreted by the fact that projects with many downloads are expected to serve larger communities, leading to more feature requests and bug-fixing activities; the number of extra requirements requested in each maintenance cycle increases its duration. Based on the central part of the heatmap, our results suggest that an average depth of inheritance tree offers shorter maintenance cycles, whereas extensive or limited inheritance leads to longer maintenance cycles. Again, this can be characterized as an expected result, in the sense that: (a) very large inheritance trees decrease software understandability, since the full list of variables and methods accessible from a class can spread across many classes, and (b) limited inheritance does not make use of one of the most important benefits of the object-oriented paradigm, and at the same time hinders the applicability of well-known object-oriented principles (e.g., the open-closed principle [17]) and design patterns (e.g., template method, state/strategy, etc. [11]) that have been reported as beneficial to maintainability.

Finally, very high coupling leads to high or very high maintenance duration (see the right-most part of the heatmap). This result is expected in the sense that, according to the high-coupling principle [17], dense inter-connection of classes reduces their understandability, increases the probability of initiating a ripple effect, and reduces their changeability. On the other hand, focusing on very low coupling yields interesting results. Intuitively, one would expect that very low coupling would increase modularity (low ripple effect, increased changeability) and therefore be connected to quick maintenance cycles, but this is only the second most frequent event; the most frequent event is that very low coupling is connected to long maintenance cycles. This result, although initially surprising, can be interpreted by the inherent relationship/trade-off between coupling and cohesion (as confirmed by the network of Fig. 2). Specifically, low coupling can also result from bad design, in cases where specific classes grow so large that they do not need to call methods from other classes (they are self-contained). This leads to reduced coupling, but such classes are probably assigned more than one responsibility. This violates the Single Responsibility Principle, which, according to Martin [17], in turn leads to maintainability problems, especially regarding understandability and change proneness.

5.2 Production

In this section, we explore the BBN for relationships related to RQ2, i.e., metrics related to maintenance production, measured as the lines of code added from one version to the next. Similarly to Sect. 5.1, in Table 5 we present the heatmap for the two variables that are directly related to maintenance production. Interestingly, the two most important parameters are a complexity metric (namely AMS) and the same coupling metric as in Table 4 (namely MPC). This is considered important from two perspectives:

  (a) Average Method Size (AMS) has, until now, not been explored as a possible maintainability predictor, although it appears to be related to maintenance production. This result can be considered intuitive, in the sense that the size of a method is related to one of the most persistent and frequently occurring bad smells, namely Long Method [10]. According to the literature, methods of large size are very difficult to understand, modify, and extend, leading to difficulties in increasing maintenance production.

  (b) Message Passing Coupling (MPC) is consistently the coupling metric that can be used as the optimal maintainability index. The superiority of MPC in assessing ripple effects (a significant aspect of maintainability) compared to the other examined coupling metrics is also discussed by Arvanitou et al. [5]. In particular, MPC is the only one among the investigated metrics that captures both coupling volume (the number of relationships) and coupling intensity (how closely connected two classes are). An additional important characteristic of MPC is that it counts coupling intensity using the discrete count function, and is therefore not biased by the number of times a single method is called [4] (see the illustration below).
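The difference between a raw occurrence count and the discrete count function mentioned above can be illustrated with a small sketch; the invocation data are hypothetical.

```python
# Hypothetical list of external method invocations made by one class:
# the same external method may be invoked several times.
calls = ["B.render", "B.render", "B.render", "C.load", "C.load", "D.save"]

raw_count = len(calls)            # 6: inflated by repeated calls to B.render
discrete_count = len(set(calls))  # 3: each distinct external method counted once

print(raw_count, discrete_count)
```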

Table 5. Parameters influencing maintenance production

5.3 Productivity

Similarly to RQ1 and RQ2, in Table 6 we present the heatmap that visualizes the parameters most closely related to maintenance productivity. By comparing the results of Table 6 with those of Tables 4 and 5, we can observe that productivity is more closely related to maintenance production than to maintenance duration.

Table 6. Parameters influencing maintenance productivity

In particular, we observe that the only metric that is consistently among the optimal maintenance predictors is MPC, whereas AMS is related only to maintenance production and productivity. Concerning the range of values of these variables, we can reach the following conclusions. According to Martin [17], the average method size should not exceed 5 lines of code. Based on the findings of this case study, very small methods provide maximum productivity in 42 % of the cases, which is a rather intuitive result. The counterpoint, as discussed by Fowler [10] (i.e., that long methods are hard to maintain), cannot be assessed based on our results, since such an investigation would require counting the number of Long Method bad smells rather than using an aggregated/average measure. Concerning MPC, we observe that tightly coupled systems exhibit low and very low productivity rates, a result that is expected based on the negative effect of coupling on maintainability.

5.4 Implications to Practitioners and Researchers

The results of our case study provide interesting insights and implications for both practitioners and researchers. Concerning practitioners, our study provides evidence that structural quality metrics indeed affect maintenance indices. However, of the vast number of available metrics, only a limited number are highly influential. As expected, the most significant metrics cover, directly or indirectly, the most frequently quantified quality attributes, i.e., complexity, cohesion, coupling, and inheritance. In particular, the study pointed out that practitioners should pay attention to balancing the values of the following metrics: Average Method Size (AMS), Message Passing Coupling (MPC), and Depth of Inheritance Tree (DIT). Thus, practitioners should not only include these metrics in their quality dashboards, but also manage their natural trade-offs (e.g., optimizing coupling at the expense of cohesion).

Furthermore, the outcome of this study indicates some interesting future research directions. First, a replication of the case study can be performed that includes metrics of different categories and suites (e.g., architecture-level metrics), as well as metrics that describe the communication and message exchange within the OSS developers’ community. Second, the development of a complete estimation model that additionally takes into account factors other than software structure is expected to more fully capture and predict all aspects of software maintenance productivity. Finally, tailoring the proposed model to an industrial context is necessary for the adoption of such approaches in closed-source software development.

6 Threats to Validity

In this section we discuss the threats to the validity of our case study, based on the categorization described by Runeson et al. [23]. With regard to internal validity, we need to clarify that the relationships obtained through the network may be influenced by confounding factors. For example, the load of high-level changes from one version of the system to another (e.g., new features, number of fixed bugs, etc.) is not considered in this study. However, we believe that the large size of our dataset leads to a normal distribution of actual changes that minimizes the effect of this confounding factor. Additionally, we note that the investigation of the strength of the relationships is out of the scope of this study, which aims only at their identification.

Concerning external validity, we have identified two threats to generalization. First, since all the examined projects come from a single source code repository, the results cannot be generalized to the entire OSS population. Second, since we explored only Java projects (due to a limitation of the tool used), the results cannot be generalized to other programming languages. However, we believe that project popularity and size are indicative factors of successful OSS projects.

Regarding reliability, given that our study is purely quantitative and the developed protocol (see Sect. 4) is thoroughly described, we believe that the process is easily replicable by other researchers. However, concerning construct validity, Bayesian network model construction by its nature involves many subjective decisions made by the modeler. Parameters of the model, such as the choice of the binning method for discretizing the values of the continuous variables and the ordering of the independent variables, may also affect the extracted model. This provides an interesting direction for future studies, which could additionally include a sensitivity analysis of the model parameters.

7 Conclusions

Long-term monitoring of maintenance effort requires a measurement process for the attributes that affect software maintenance and a model to represent the interrelations of the gathered attributes. Open source software is a suitable paradigm of continuous, sustainable maintenance effort. Understanding why each system requires more or less maintenance effort is necessary for better monitoring of the software maintenance process and of release launches. In this study we performed an empirical study on the factors that affect software maintenance duration, production, and productivity in 20 open source software projects. We suggest Bayesian Networks as a tool to model the attributes and factors that can affect maintenance effort. The created model includes structural quality metrics, process metrics, and managerial indices. The results indicate that maintenance duration is affected by the popularity of a project, by DIT (a design metric), and by MPC (a source code metric). On the other hand, production and productivity are affected solely by the quality metrics AMS and MPC.