Enhanced Detection of Online Loan Fraud using Cost-Sensitive Weighted Random Forest with Recursive Feature Elimination and Cross Validation
Karina Agustina, Kartika Fithriasari, Dedy Dwi Prastyo
Department of Statistics, Institut Teknologi Sepuluh Nopember, Surabaya 60111, Indonesia
karinagustina123[at]gmail.com
kartika_f[at]statistika.its.ac.id
dedy-dp[at]statistika.its.ac.id
Abstract
Online loans are an innovation that combines credit distribution with digital technology so that credit can be accessed easily, quickly, and efficiently. This ease of access not only drives year-on-year growth in online loan applications but also increases the risk of fraudulent transactions (fraud). A fraud detection system is therefore essential for credit institutions to minimize the risk of losses. This study compares the Random Forest and Cost-Sensitive Weighted Random Forest models for handling the class imbalance problem in online loan fraud data. Cost-Sensitive Weighted Random Forest is an extension of the Random Forest model that uses a cost function based on the misclassification rate of instances in both the majority and minority classes to improve the predictive ability of each tree and the overall performance of the ensemble. Each tree is weighted according to its error: trees with lower error rates receive higher weights. This cost-driven learning scheme places more emphasis on learning the minority-class instances. In addition, the Recursive Feature Elimination and Cross Validation method is used to remove unimportant features that may bias the classification results and to speed up data processing. The proposed methods are tested on real online loan application datasets obtained from a private bank. The results show that information about the device used to submit a loan application has a considerable influence on whether the application is classified as fraud. The findings also show that Cost-Sensitive Weighted Random Forest performs better than Random Forest, achieving higher accuracy, F1 score, and AUC-ROC.
Keywords: Cost-Sensitive Weighted Random Forest, Imbalanced Dataset, Fraud Detection
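
The tree-weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cost values (a missed fraud costing five times a false alarm), the weight formula (one minus the normalized cost-weighted error), and the toy predictions are all assumptions chosen for illustration. For brevity, the weights are computed on the same labels used for voting, whereas a real pipeline would use held-out or out-of-bag data.

```python
from typing import List, Sequence


def cost_weighted_error(y_true: Sequence[int], y_pred: Sequence[int],
                        cost_fn: float = 5.0, cost_fp: float = 1.0) -> float:
    """Misclassification cost of one tree, scaled to [0, 1].

    Label 1 = fraud (minority), 0 = legitimate (majority).
    cost_fn / cost_fp are hypothetical costs for a missed fraud / false alarm.
    """
    incurred = 0.0  # cost actually paid by this tree
    worst = 0.0     # cost if every instance were misclassified
    for t, p in zip(y_true, y_pred):
        worst += cost_fn if t == 1 else cost_fp
        if t == 1 and p == 0:
            incurred += cost_fn
        elif t == 0 and p == 1:
            incurred += cost_fp
    return incurred / worst if worst else 0.0


def tree_weights(y_true: Sequence[int], tree_preds: List[Sequence[int]],
                 cost_fn: float = 5.0, cost_fp: float = 1.0) -> List[float]:
    """Assumed weighting rule: lower cost-weighted error gives higher weight."""
    return [1.0 - cost_weighted_error(y_true, p, cost_fn, cost_fp)
            for p in tree_preds]


def weighted_vote(tree_preds: List[Sequence[int]],
                  weights: Sequence[float]) -> List[int]:
    """Predict fraud when the fraud-voting trees hold more than half the weight."""
    total = sum(weights)
    n_instances = len(tree_preds[0])
    out = []
    for i in range(n_instances):
        fraud_score = sum(w for p, w in zip(tree_preds, weights) if p[i] == 1)
        out.append(1 if fraud_score > total / 2 else 0)
    return out


# Toy data: four legitimate applications followed by two fraudulent ones.
y_true = [0, 0, 0, 0, 1, 1]
preds = [
    [0, 0, 0, 0, 0, 0],  # tree A: always predicts the majority class
    [0, 1, 0, 0, 1, 1],  # tree B: catches both frauds, one false alarm
    [1, 0, 0, 0, 1, 0],  # tree C: one false alarm, one missed fraud
]
weights = tree_weights(y_true, preds)
ensemble = weighted_vote(preds, weights)
print(weights)
print(ensemble)
```

In this toy case the tree that ignores the minority class receives the lowest weight, so the last fraudulent application, which an unweighted majority vote would miss, is flagged by the weighted ensemble at the price of one extra false alarm.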