Abstract
To improve software quality and reduce maintenance cost, cross-project fault prediction (CPFP) identifies faulty software components in a particular project (the target project) using the historical fault data of other projects (the source or reference projects). Although several diverse approaches/models have been proposed in the past, there is still room for improvement in prediction performance. Further, these approaches did not consider effort-based evaluation metrics (EBEMs), which are important to ensure a model's applicability in industry under the realistic constraint of limited inspection effort. Moreover, they validated their respective approaches on a limited number of datasets. Addressing these issues, we propose an improved CPFP model and validate it on a large corpus of 62 datasets in terms of EBEMs (PIM@20%, Cost-effectiveness@20%, and IFA) and machine learning-based evaluation metrics (MLBEMs) such as PF, G-measure, and MCC. The reference data and the target data are first normalized to reduce the distribution divergence between them, and then the relevant training data is selected from the reference data using the KNN algorithm. Based on the experimental and statistical test results, we demonstrate the efficacy of our proposed model over state-of-the-art CPFP models, namely the Turhan-Filter and the Cruz model, comprehensively. The proposed CPFP model thus provides an effective solution for predicting faulty software components, enabling practitioners to develop quality software at a lower maintenance cost.
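For illustration, the sketch below shows the two-step pipeline the abstract describes: per-dataset normalization of the reference (source) and target data, followed by KNN-based selection of relevant training instances from the reference data. The z-score normalization, the value of k, and the logistic-regression classifier are assumptions made for this sketch; the paper's exact normalization scheme and modeling choices may differ.

```python
# Minimal sketch of the CPFP pipeline described in the abstract, assuming
# z-score normalization and a Turhan-style KNN relevancy filter. The value
# of k and the base classifier are illustrative assumptions, not the
# paper's confirmed choices.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

def select_training_data(X_src, y_src, X_tgt, k=10):
    """Normalize source and target data, then keep the k nearest source
    instances for each target instance (duplicates removed)."""
    # z-score normalization applied per dataset, to reduce the
    # distribution divergence between reference and target projects
    z = lambda X: (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    X_src_n, X_tgt_n = z(X_src), z(X_tgt)

    # For each target instance, find its k nearest neighbours in the source data
    nn = NearestNeighbors(n_neighbors=k).fit(X_src_n)
    _, idx = nn.kneighbors(X_tgt_n)
    keep = np.unique(idx.ravel())  # union of all selected source rows
    return X_src_n[keep], y_src[keep], X_tgt_n

# Usage with synthetic data standing in for reference/target project metrics:
rng = np.random.default_rng(0)
X_src, y_src = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
X_tgt = rng.normal(loc=0.5, size=(100, 20))

X_train, y_train, X_test = select_training_data(X_src, y_src, X_tgt)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
fault_proneness = model.predict_proba(X_test)[:, 1]  # ranking for inspection
```

The relevancy filter discards source instances that are dissimilar to every target instance, so the classifier trains only on cross-project data that resembles the project under prediction; the predicted probabilities can then be used to rank components for inspection under a limited-effort budget, which is what EBEMs such as PIM@20% evaluate.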
References
Arar ÖF, Ayan K (2015) Software defect prediction using cost-sensitive neural network. Appl Soft Comput 33:263–277. https://doi.org/10.1016/j.asoc.2015.04.045
Basili VR, Briand LC, Melo WL (1996) A validation of object-oriented design metrics as quality indicators. IEEE Trans Software Eng 22(10):751–761. https://doi.org/10.1109/32.544352
Bisi M, Goyal NK (2016) An ANN-PSO-based model to predict fault-prone modules in software. Int J Reliab Saf 10(3):243–264. https://doi.org/10.1504/IJRS.2016.081611
Bowes D, Hall T, Petrić J (2018) Software defect prediction: do different classifiers find the same defects? Softw Qual J 26(2):525–552. https://doi.org/10.1007/s11219-016-9353-3
Briand LC, Melo WL, Wüst J (2002) Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Trans Softw Eng 28(7):706–720. https://doi.org/10.1109/TSE.2002.1019484
Canfora G, Lucia AD, Penta MD, Oliveto R, Panichella A, Panichella S (2015) Defect prediction as a multiobjective optimization problem. Softw Test Verif Reliab 25(4):426–459. https://doi.org/10.1002/stvr.1570
Chen L, Fang B, Shang Z, Tang Y (2015) Negative samples reduction in cross-company software defects prediction. Inf Softw Technol 62(1):67–77. https://doi.org/10.1016/j.infsof.2015.01.014
Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27. https://doi.org/10.1109/TIT.1967.1053964
Cruz AEC, Ochimizu K (2009) Towards logistic regression models for predicting fault-prone code across software projects. In: 2009 3rd International symposium on empirical software engineering and measurement, pp 460–463. https://doi.org/10.1109/ESEM.2009.5316002
D’Ambros M, Lanza M, Robbes R (2012) Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir Softw Eng 17(4–5):531–577. https://doi.org/10.1007/s10664-011-9173-9
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
He P, Li B, Liu X, Chen J, Ma Y (2015) An empirical study on software defect prediction with a simplified metric set. Inf Softw Technol 59:170–190. https://doi.org/10.1016/j.infsof.2014.11.006
Herbold S, Trautsch A, Grabowski J (2018) A comparative study to benchmark cross-project defect prediction approaches. IEEE Trans Softw Eng 44(9):811–833. https://doi.org/10.1109/TSE.2017.2724538
Herbold S (2013) Training data selection for cross-project defect prediction. In: ACM international conference proceeding series, Part F1288, pp 1–10. https://doi.org/10.1145/2499393.2499397
Hosseini S, Turhan B, Gunarathna D (2019) A systematic literature review and meta-analysis on cross project defect prediction. IEEE Trans Softw Eng 45(2):111–147. https://doi.org/10.1109/TSE.2017.2770124
Huang Q, Xia X, Lo D (2018) Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction. Empir Softw Eng 24(5):2823–2862. https://doi.org/10.1007/s10664-018-9661-2
Jureczko M, Madeyski L (2010) Towards identifying software project clusters with regard to defect prediction. In: ACM international conference proceeding series, pp 1–10. https://doi.org/10.1145/1868328.1868342
Jureczko M, Spinellis D (2010) Using object-oriented design metrics to predict software defects. In: Models and methods of system dependability. Oficyna Wydawnicza Politechniki Wrocławskiej, pp 69–81. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.2285
Kassab M, Defranco JF, Laplante PA (2017) Software testing: the state of the practice. IEEE Softw 34(5):46–52. https://doi.org/10.1109/MS.2017.3571582
Kawata K, Amasaki S, Yokogawa T (2015) Improving relevancy filter methods for cross-project defect prediction. In: Proceedings—3rd international conference on applied computing and information technology and 2nd international conference on computational science and intelligence, ACIT-CSI 2015, pp 2–7. https://doi.org/10.1109/ACIT-CSI.2015.104
Khatri Y, Singh SK (2021) Cross project defect prediction: a comprehensive survey with its SWOT analysis. Innov Syst Softw Eng. https://doi.org/10.1007/s11334-020-00380-5
Khatri Y, Singh SK (2022) Towards building a pragmatic cross-project defect prediction model combining non-effort based and effort-based performance measures for a balanced evaluation. Inf Softw Technol 150:106980. https://doi.org/10.1016/j.infsof.2022.106980
Kochhar PS, Xia X, Lo D, Li S (2016) Practitioners’ expectations on automated fault localization. In: Proceedings of the 25th international symposium on software testing and analysis, pp 165–176. https://doi.org/10.1145/2931037.2931051
Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496. https://doi.org/10.1109/TSE.2008.35
Liu Y, Khoshgoftaar TM, Seliya N (2010) Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans Softw Eng 36(6):852–864. https://doi.org/10.1109/TSE.2010.51
Lu H, Cukic B, Culp M (2012) Software defect prediction using semi-supervised learning with dimension reduction. In: 2012 27th IEEE/ACM international conference on automated software engineering, ASE 2012—Proceedings, pp 314–317. https://doi.org/10.1145/2351676.2351734
Ma Y, Luo G, Zeng X, Chen A (2012) Transfer learning for cross-company software defect prediction. Inf Softw Technol 54(3):248–256. https://doi.org/10.1016/j.infsof.2011.09.007
Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13. https://doi.org/10.1109/TSE.2007.256941
Meyer AN, Fritz T, Murphy GC, Zimmermann T (2014) Software developers’ perceptions of productivity. In: Proceedings of the ACM SIGSOFT symposium on the foundations of software engineering, 16–21 November, pp 19–29. https://doi.org/10.1145/2635868.2635892
Nam J, Pan SJ, Kim S (2013) Transfer defect learning. In: Proceedings—international conference on software engineering, pp 382–391. https://doi.org/10.1109/ICSE.2013.6606584
Ostrand TJ, Weyuker EJ, Bell RM (2005) Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng 31(4):340–355. https://doi.org/10.1109/TSE.2005.49
Pan SJ, Tsang IW, Kwok JT, Yang Q (2011) Domain adaptation via transfer component analysis. IEEE Trans Neural Netw 22(2):199–210. https://doi.org/10.1109/TNN.2010.2091281
Pelayo L, Dick S (2007) Applying novel resampling strategies to software defect prediction. In: Annual conference of the North American fuzzy information processing society—NAFIPS, pp 69–72. https://doi.org/10.1109/NAFIPS.2007.383813
Peng L, Yang B, Chen Y, Abraham A (2009) Data gravitation based classification. Inf Sci 179(6):809–819. https://doi.org/10.1016/j.ins.2008.11.007
Ryu D, Jang JI, Baik J (2017) A transfer cost-sensitive boosting approach for cross-project defect prediction. Softw Qual J 25(1):235–272. https://doi.org/10.1007/s11219-015-9287-1
Subramanyam R, Krishnan MS (2003) Empirical analysis of CK metrics for object-oriented design complexity: implications for software defects. IEEE Trans Softw Eng 29(4):297–310. https://doi.org/10.1109/TSE.2003.1191795
Turhan B, Menzies T, Bener AB, Di Stefano J (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Softw Eng 14:540–578. https://doi.org/10.1007/s10664-008-9103-7
Wang T, Zhang Z, Jing X, Zhang L (2015) Multiple kernel ensemble learning for software defect prediction. Autom Softw Eng 23(4):569–590. https://doi.org/10.1007/s10515-015-0179-1
Watanabe S, Kaiya H, Kaijiri K (2008) Adapting a fault prediction model to allow inter language reuse. In: Proceedings—international conference on software engineering, pp 19–24. https://doi.org/10.1145/1370788.1370794
Zhou Y, Yang Y, Lu H, Chen L, Li Y, Zhao Y, Qian J, Xu B (2018) How far we have progressed in the journey? An examination of cross-project defect prediction. ACM Trans Softw Eng Methodol 27(1):1–51. https://doi.org/10.1145/3183339
Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: ESEC-FSE’09—Proceedings of the joint 12th European software engineering conference and 17th ACM SIGSOFT symposium on the foundations of software engineering, pp 91–100. https://doi.org/10.1145/1595696.1595713
Funding
No funds, grants, or other support was received.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Human and/or animal participants
This study does not involve any human or animal participants.
Informed consent
This study does not involve any human or animal participants; therefore, no informed consent was required.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Khatri, Y., Singh, S.K. Predictive software maintenance utilizing cross-project data. Int J Syst Assur Eng Manag 15, 1503–1518 (2024). https://doi.org/10.1007/s13198-023-01957-6