On Using Instance Hardness Measures to Select Training Data for Software Defect Prediction
Benyamin Langgu Sinaga (1, 2, *), Sabrina Ahmad (2), Zuraida Abal Abas (2)
1). Department of Informatics, Universitas Atma Jaya Yogyakarta, Jalan Babarsari 43, Yogyakarta 55281, Indonesia
2). Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka, Hang Tuah Jaya, 76100 Durian Tunggal, Melaka, Malaysia
*) benyamin.sinaga[at]uajy.ac.id
Abstract
Software defect prediction models are a popular means of letting software quality assurance teams focus their testing effort on highly defect-prone modules. However, learning a prediction model directly from cross-project datasets yields unsatisfactory predictive performance, so the selection of training data is critical. Most training data selection operates at the instance level, using kNN with Euclidean distance to measure the similarity between source and target data. Such an approach, however, is susceptible to noise. Defect datasets are complex: they exhibit class imbalance, label noise, and class overlap. Yet selection criteria are predominantly based on the distance between the source and target datasets, ignoring these data-complexity factors, which causes several machine learning algorithms to underperform. This study proposes a filter that selects training-data instances with those complexity factors taken into account. The filter is constructed from four instance hardness measures related to defect-dataset complexity, namely noisy instances and the overlap between instance classes in cross-project data. The proposed approach was evaluated on 14 datasets with six classification algorithms. The findings indicate that using instance hardness measures for data selection can improve the predictive performance of the defect prediction model.
Keywords: Training data selection, software defect prediction, instance hardness measures
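To make concrete the baseline the abstract contrasts with, the following is a minimal, hypothetical sketch (not the authors' implementation) of instance-level training data selection in the style of a kNN filter: for each target-project instance, the k nearest source-project instances under Euclidean distance are retained as training data. Function and variable names are illustrative assumptions.

```python
import numpy as np

def knn_training_data_selection(source_X, target_X, k=10):
    """Sketch of a kNN-based filter: for each target instance, keep the
    indices of its k nearest source instances by Euclidean distance."""
    selected = set()
    for t in target_X:
        # Euclidean distance from this target instance to every source instance
        dists = np.linalg.norm(source_X - t, axis=1)
        # indices of the k closest source instances
        nearest = np.argsort(dists)[:k]
        selected.update(nearest.tolist())
    return sorted(selected)

# Toy example: six source instances and two target instances,
# each described by two (hypothetical) software metrics.
src = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0],
                [5.1, 5.0], [9.0, 9.0], [0.5, 0.2]])
tgt = np.array([[0.2, 0.1], [5.0, 5.1]])
idx = knn_training_data_selection(src, tgt, k=2)
```

Because selection here depends only on distances, a mislabeled or overlapping source instance that happens to lie close to the target data is still selected, which is exactly the weakness that complexity-aware measures such as instance hardness aim to address.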