International Journal of Advanced Technology and Engineering Exploration (IJATEE) ISSN (P): 2394-5443 ISSN (O): 2394-7454 Vol - 10, Issue - 106, September 2023
A hybrid approach for generative process model with topic modelling towards efficient and dynamic document clustering

Gugulothu Venkanna and K.F Bharati

Abstract

Clustering text documents has a wide range of applications across many domains. Because textual data is diverse and growing rapidly, clustering a given text corpus has become increasingly challenging. Several existing approaches to text document clustering rely on natural language processing (NLP) and text similarity measures, but they lack a generative process model that handles text corpora systematically and progressively. A hybrid approach that improves clustering performance is also needed. It is therefore important to build a model for a given text corpus and update it dynamically as new documents arrive, rather than restarting clustering from scratch. This paper proposes a framework called the hybrid approach for dynamic document clustering (HADDC). The framework is realized through two algorithms that work together to achieve dynamic document clustering. The first algorithm, similar document identification (SDI), uses the WordNet lexical dictionary and similarity measures to identify similar documents. The second algorithm, topic modelling for efficient and dynamic document clustering (TM-EDDC), is a dynamic process model based on latent Dirichlet allocation (LDA) that clusters documents incrementally as new ones become available. The framework and its underlying algorithms were evaluated on the 20 Newsgroups dataset. Experimental results show that the proposed methods outperform existing ones, as evidenced by a lower mean absolute error (MAE). The empirical study indicates improved utility and efficiency, which makes the framework suitable for organizations to integrate into their existing applications.
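
The idea of updating clusters as documents arrive, instead of re-clustering the whole corpus, can be illustrated with a minimal sketch. This is not the authors' SDI or TM-EDDC algorithm: WordNet-based similarity and LDA are replaced here by plain term-frequency cosine similarity, and the `IncrementalClusterer` class, its `threshold` value, and the sample sentences are all hypothetical names and data chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class IncrementalClusterer:
    """Hypothetical helper: puts each new document into the most similar
    existing cluster, or opens a new cluster if none is similar enough."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed cut-off, not taken from the paper
        self.centroids = []         # summed term frequencies, one per cluster
        self.members = []           # document texts, one list per cluster

    def add(self, text):
        vec = Counter(tokenize(text))
        best_id, best_sim = None, 0.0
        for cid, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is None or best_sim < self.threshold:
            # No sufficiently similar cluster: start a new one.
            self.centroids.append(vec)
            self.members.append([text])
            return len(self.centroids) - 1
        # Fold the document into the existing cluster without touching others.
        self.centroids[best_id].update(vec)
        self.members[best_id].append(text)
        return best_id


clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.add("rocket launch reached orbit")
second = clusterer.add("hockey team won playoff game")
third = clusterer.add("new rocket reached orbit today")
print(first, second, third)
```

Only the cluster that receives the new document is updated, which is the property the abstract describes as dynamic clustering. A topic-model-based variant would replace the term-frequency vectors with per-document topic distributions.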

Keywords

Document clustering, Natural language processing, Generative process model, Document similarity, Dynamic document clustering.

Cite this article

Venkanna G, Bharati K
