International Journal of Advanced Technology and Engineering Exploration (IJATEE) ISSN (P): 2394-5443 ISSN (O): 2394-7454 Vol - 8, Issue - 85, December 2021
A two-phase feature selection technique using mutual information and XGB-RFE for credit card fraud detection

C. Victoria Priscilla and D. Padma Prabha

Abstract

With the rapid increase in online transactions, credit card fraud has become a serious menace. Machine Learning (ML) algorithms are useful for building models that detect fraudulent transactions. High-dimensional and imbalanced datasets are a hindrance in real-world applications such as credit card fraud detection. To overcome this issue, feature selection, a pre-processing technique, is adopted with both classification performance and computational efficiency in mind. This paper proposes a new two-phase feature selection approach that integrates filter and wrapper methods to identify the significant feature subsets. In the first phase, Mutual Information (MI) is adopted for its computational efficiency to rank the features by importance. However, ranking alone cannot drop the less important features. Thus, a second phase is added to eliminate the redundant features using Recursive Feature Elimination (RFE), a wrapper method, with 5-fold cross-validation. eXtreme Gradient Boosting (XGBoost), with adjusted class weights, is adopted as the estimator for RFE. The optimal features obtained from the proposed method were used in four boosting algorithms, namely XGBoost, Gradient Boosting Machine (GBM), Categorical Boosting (CatBoost) and Light Gradient Boosting Machine (LGBM), to analyse classification performance. The proposed approach has been applied to the IEEE-CIS credit card fraud detection dataset, whose binary target class is imbalanced. The experimental results are promising in terms of Geometric mean (G-Mean) for XGBoost (84.8%) and LGBM (83.7%); the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) increased from 79.8% to 85.5% for XGBoost, and the computation time for training the classifiers was also reduced.
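The first-phase ranking and the G-Mean metric mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes discrete features and estimates MI from simple frequency counts, whereas real transaction data with continuous columns would need a library estimator. The function names, the toy columns and the confusion-matrix counts are all hypothetical, and the second phase (RFE with an XGBoost estimator) is not shown.

```python
import math
from collections import Counter


def mutual_information(feature, target):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    px = Counter(feature)
    py = Counter(target)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) written in terms of raw counts
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi


def rank_features(columns, target):
    """Return (name, MI score) pairs sorted from most to least informative."""
    scores = {name: mutual_information(values, target)
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Toy, imbalanced binary target (hypothetical data, 2 positives out of 8).
target = [0, 0, 0, 0, 0, 0, 1, 1]
columns = {
    "constant": [1, 1, 1, 1, 1, 1, 1, 1],         # carries no information
    "noisy": [0, 0, 0, 1, 0, 0, 1, 1],            # mostly follows the target
    "copy_of_target": [0, 0, 0, 0, 0, 0, 1, 1],   # perfectly informative
}

ranking = rank_features(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.4f}")

# Hypothetical confusion-matrix counts, not results from the paper.
print(f"G-Mean: {g_mean(tp=16, fn=9, tn=81, fp=19):.2f}")
```

The ranking only orders the features; deciding how many to keep is left to a later step, which is the role the abstract assigns to the wrapper phase. G-Mean is used for evaluation because it rewards a classifier only when both the minority and the majority class are recognised well.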

Keywords

Recursive feature elimination, Hyper-parameter optimization, Class imbalance, XGBoost, Binary classification.

Cite this article

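Priscilla CV, Prabha DP. A two-phase feature selection technique using mutual information and XGB-RFE for credit card fraud detection. International Journal of Advanced Technology and Engineering Exploration. 2021; 8(85).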
