Enhancing big data classification accuracy through integration of k-means clustering and logistic regression
Zeyaul Mustfa and Sujeet Gautam
Abstract
The exponential growth of data necessitates effective classification techniques capable of handling large and complex datasets. This work integrates k-means clustering with logistic regression (LR) in a hybrid method, KM-LR, to enhance classification accuracy. The process begins with data normalization, followed by the initialization of k-means parameters. Data points are assigned to clusters, and centroids are updated iteratively until convergence. The enriched dataset, incorporating cluster assignments, is then used to train an LR model. Evaluations on big data show that KM-LR significantly improves accuracy, precision, and recall compared to standalone k-means and fuzzy c-means (FCM) algorithms. KM-LR achieves an accuracy of 96%, precision of 95%, and recall of 95%, demonstrating its effectiveness in managing large volumes of data efficiently and accurately. The hybrid approach uses unsupervised clustering to structure the data and supervised learning for classification, making it well suited to big data environments.
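The pipeline described in the abstract (normalize, cluster with k-means, enrich the features with cluster assignments, train logistic regression) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not state the normalization scheme, the value of k, how cluster assignments are encoded, or the dataset, so the min-max scaling, k = 2, one-hot cluster encoding, plain gradient descent, and the synthetic two-class data below are all assumptions made for illustration. All function names are hypothetical.

```python
import numpy as np

def normalize(X):
    """Min-max scale each feature to [0, 1] (assumed normalization)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X - lo) / span

def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign points, update centroids, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centroids)
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return assign(X, centroids), centroids

def enrich(X, labels, k):
    """Append one-hot cluster membership (assumed encoding) to the features."""
    return np.hstack([X, np.eye(k)[labels]])

def add_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])

def train_logreg(X, y, lr=0.5, n_iter=2000):
    """Binary logistic regression fitted by batch gradient descent."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(X, w):
    return (add_bias(X) @ w > 0).astype(int)

# Synthetic two-class data (illustrative only, not the paper's dataset).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(100, 2)),
    rng.normal([3.0, 3.0], 0.5, size=(100, 2)),
])
y = np.repeat([0, 1], 100)

k = 2
X_norm = normalize(X)
labels, centroids = kmeans(X_norm, k)
X_enriched = enrich(X_norm, labels, k)
w = train_logreg(X_enriched, y)
accuracy = (predict(X_enriched, w) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The sketch evaluates on its own training data for brevity. A realistic evaluation would hold out a test set and fit both the normalization statistics and the centroids on the training split only.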
Keywords
K-means, Logistic regression, KM-LR, Fuzzy c-means.
Cite this article
Mustfa Z, Gautam S. Enhancing big data classification accuracy through integration of k-means clustering and logistic regression. ACCENTS Transactions on Image Processing and Computer Vision. 2024; 10(28):14-19. DOI:10.19101/TIPCV.2024.1026002