International Journal of Advanced Technology and Engineering Exploration (IJATEE), ISSN (P): 2394-5443, ISSN (O): 2394-7454, Vol. 10, Issue 107, October 2023
Performance analysis of samplers and calibrators with various classifiers for asymmetric hydrological data

C. Kaleeswari, K. Kuppusamy and A. Senthilrajan

Abstract

Asymmetric data classification presents a significant challenge in machine learning (ML). While ML algorithms are known for their ability to classify symmetric data effectively, addressing data asymmetry remains an ongoing concern in classification tasks. This research paper aims to select an appropriate method for classifying and predicting asymmetric data, focusing on label and probability predictions. To achieve this, various ML classifiers, calibration techniques, and sampling methods are systematically analyzed. The classifiers under consideration include logistic regression (LR), k-nearest neighbour (KNN), Gaussian naive Bayes (GNB), random forest (RF), decision tree (DT), and support vector classifier (SVC). The calibration techniques explored encompass isotonic regression (IR) and Platt scaling (PS), while the sampling techniques comprise the synthetic minority oversampling technique (SMOTE), T-link (Tomek), adaptive synthetic sampling (AdaSyn), the integration of SMOTE and edited nearest neighbour (SMOTEENN), and the integration of SMOTE and T-link (SMOTETomek). Simulation results for label prediction consistently favour the SMOTEENN approach, with the RF classifier combined with SMOTEENN providing outstanding performance, achieving a balanced random accuracy (BRA) of 98.07%, sensitivity of 98.02%, specificity of 99.01%, an area under the curve (AUC) of 0.98, and a geometric mean (G-mean) of 98.50%. In terms of probability prediction, IR calibration consistently excels. Specifically, the GNB classifier combined with IR produces the best performance, yielding a low Brier score (BS), expected calibration error (ECE), and maximum calibration error (MCE). Furthermore, it achieves perfect calibration as demonstrated by the reliability curve. In light of these findings, this study recommends the use of SMOTEENN for data resampling and IR calibration for probability prediction as superior methods for addressing data asymmetry. The comparative analysis presented in this research offers valuable insights for selecting appropriate techniques in the context of asymmetric data classification.
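The recommended pipeline can be illustrated with a short Python sketch, assuming scikit-learn and imbalanced-learn: SMOTEENN resampling feeding a random forest for label prediction, and isotonic-regression calibration of Gaussian naive Bayes for probability prediction. The synthetic dataset, the hyperparameters, and the use of balanced_accuracy_score as a stand-in for the paper's balanced random accuracy are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming scikit-learn and imbalanced-learn; the synthetic data
# and all hyperparameters are illustrative, not the authors' configuration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import (balanced_accuracy_score, recall_score,
                             roc_auc_score, brier_score_loss)
from imblearn.combine import SMOTEENN
from imblearn.metrics import geometric_mean_score

# Asymmetric two-class data standing in for the hydrological dataset.
X, y = make_classification(n_samples=5000, n_features=9,
                           weights=[0.85, 0.15], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Label prediction: resample the training split with SMOTEENN, then fit RF.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_tr, y_tr)
rf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
y_pred = rf.predict(X_te)
print("Balanced accuracy:", balanced_accuracy_score(y_te, y_pred))
print("Sensitivity      :", recall_score(y_te, y_pred, pos_label=1))
print("Specificity      :", recall_score(y_te, y_pred, pos_label=0))
print("AUC              :", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
print("G-mean           :", geometric_mean_score(y_te, y_pred))

# Probability prediction: calibrate GNB with isotonic regression (IR).
gnb_ir = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
gnb_ir.fit(X_res, y_res)
proba = gnb_ir.predict_proba(X_te)[:, 1]
print("Brier score      :", brier_score_loss(y_te, proba))
# Points of the reliability curve (observed vs. predicted frequencies).
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
```

Swapping the calibration method to "sigmoid" in CalibratedClassifierCV would give the Platt scaling (PS) variant compared in the study; ECE and MCE are not built into scikit-learn and would need to be computed from the binned reliability-curve points above.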

Keywords

Machine learning, Calibration, Asymmetric data, Classification, Probability, Prediction.

Cite this article

Kaleeswari C, Kuppusamy K, Senthilrajan A. Performance analysis of samplers and calibrators with various classifiers for asymmetric hydrological data. International Journal of Advanced Technology and Engineering Exploration. 2023; 10(107).
