An efficient distance estimation and centroid selection based on k-means clustering for small and large dataset
Girdhar Gopal Ladha and Ravi Kumar Singh Pippal
Abstract
This paper presents an efficient distance estimation and centroid selection approach based on k-means clustering for small and large datasets. The PIMA Indian diabetes dataset was used for the complete study and analysis. The dataset was pre-processed first, after which distance and centroid estimation was performed. Initial centroids were selected at random and then updated until the specified number of iterations (epochs) was reached. The distance measures used are Euclidean distance (Ed), Pearson coefficient distance (PCd), Chebyshev distance (Csd) and Canberra distance (Cad). The results indicate that all the distance measures performed reasonably well in terms of clustering, but in terms of computation time Cad outperformed the others.
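The procedure summarized above (random initial centroids, then repeated assignment and centroid updates for a fixed number of epochs, with an interchangeable distance measure) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the toy data, the `epochs` and `seed` parameters, and the definition of Pearson coefficient distance as one minus the correlation coefficient are all assumptions made for illustration.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Terms where |x| + |y| == 0 are skipped (treated as 0) by convention.
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total

def pearson_distance(a, b):
    # Assumed definition: 1 - Pearson correlation coefficient.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # correlation undefined; treated here as uncorrelated
    return 1.0 - cov / math.sqrt(var_a * var_b)

def kmeans(points, k, distance, epochs=20, seed=0):
    """Random initial centroids, then a fixed number of update epochs."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(epochs):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical toy data: two well-separated groups of 3-D points.
points = [(1.0, 2.0, 1.0), (1.5, 1.8, 1.2), (1.2, 2.2, 0.9),
          (8.0, 9.0, 8.0), (8.5, 9.5, 8.2), (9.0, 8.8, 7.9)]

for name, dist in [("Ed", euclidean), ("PCd", pearson_distance),
                   ("Csd", chebyshev), ("Cad", canberra)]:
    centroids, labels = kmeans(points, k=2, distance=dist)
    print(name, labels)
```

Passing the distance as a function keeps the clustering loop identical across the four measures, so any difference in clustering quality or running time comes from the measure itself. Note that the update step uses the arithmetic mean for every measure, which is a simplification: the mean is the natural centroid only for Euclidean distance.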
Keywords
K-means, Distance estimation, Centroid selection, Distance methods.
Cite this article
Ladha GG, Pippal RK. An efficient distance estimation and centroid selection based on k-means clustering for small and large dataset. International Journal of Advanced Technology and Engineering Exploration. 2020;7(73):234-240. DOI:10.19101/IJATEE.2020.762109