A hybrid approach for generative process model with topic modelling towards efficient and dynamic document clustering

Gugulothu Venkanna and K.F Bharati


Clustering text documents has applications across many domains. However, the diversity and rapid growth of textual data make clustering a given text corpus increasingly challenging. Many existing approaches to text document clustering rely on natural language processing (NLP) and text similarity measures. What is still needed is a generative process model that handles text corpora systematically and progressively, combined with a hybrid approach that improves clustering performance. In particular, a model built for a given corpus should be updated dynamically as new documents arrive, rather than restarting clustering from scratch. In this paper, a framework called the hybrid approach for dynamic document clustering (HADDC) is proposed. The framework is realized through two algorithms that work together to achieve dynamic document clustering. The first, similar document identification (SDI), uses the WordNet lexical dictionary and similarity measures to identify similar documents. The second, topic modelling for efficient and dynamic document clustering (TM-EDDC), is a dynamic process model based on latent Dirichlet allocation (LDA) that clusters documents incrementally as new ones become available. The framework and its underlying algorithms were evaluated on the newsgroups dataset. Experimental results show that the proposed methods outperform existing ones, as evidenced by a lower mean absolute error (MAE). The empirical study indicates that the framework offers improved utility and efficiency and can be integrated by organizations into their existing applications.


Document clustering, Natural language processing, Generative process model, Document similarity, Dynamic document clustering.
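
The abstract's idea of assigning newly arriving documents to existing clusters, instead of re-clustering the whole corpus, can be illustrated with a minimal sketch. This is not the authors' SDI or TM-EDDC algorithm: the LDA topic model is replaced here by a plain cosine-similarity comparison against cluster centroids, and the WordNet lookup is replaced by a tiny hand-written synonym table. The names SYNONYMS, bag_of_words, add_document, and the threshold value are assumptions made purely for illustration. The last function shows the standard definition of MAE; how the paper applies it to clustering output is not stated in the abstract.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: maps a word to one canonical synonym.
SYNONYMS = {"automobile": "car", "vehicle": "car", "film": "movie"}


def bag_of_words(text):
    """Lowercase, split on whitespace, and collapse known synonyms into one term."""
    return Counter(SYNONYMS.get(token, token) for token in text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def add_document(clusters, text, threshold=0.5):
    """Place one new document without re-clustering the existing ones.

    `clusters` is a list of Counter centroids (summed term counts per cluster).
    The document joins the most similar cluster if the similarity reaches the
    (assumed) threshold; otherwise it opens a new cluster.
    Returns the index of the cluster the document ended up in.
    """
    vector = bag_of_words(text)
    best_index, best_similarity = -1, 0.0
    for index, centroid in enumerate(clusters):
        similarity = cosine(vector, centroid)
        if similarity > best_similarity:
            best_index, best_similarity = index, similarity
    if best_index >= 0 and best_similarity >= threshold:
        clusters[best_index].update(vector)  # Counter.update adds the counts
        return best_index
    clusters.append(vector)
    return len(clusters) - 1


def mean_absolute_error(predicted, actual):
    """Standard MAE: the average of |predicted - actual| over paired values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

In this sketch each call to add_document touches only the cluster centroids, so the cost of handling a new document grows with the number of clusters and not with the number of documents already processed. A topic-model-based variant would compare topic distributions in place of raw term counts.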

