International Journal of Advanced Computer Research (IJACR) ISSN (P): 2249-7277 ISSN (O): 2277-7970 Vol - 7, Issue - 30, May 2017
A subject identification method based on term frequency technique

Nurul Syafidah Jamil, Ku Ruhana Ku-Mahamud, Aniza Mohamed Din, Faudziah Ahmad, Noraziah ChePa, Wan Hussain Wan Ishak, Roshidi Din and Farzana Kabir Ahmad

Abstract

Analyzing and extracting important information from text documents is crucial and has attracted interest in text mining and information retrieval. This process is used to draw attention to particular content in a text. Readers tend to read almost everything in a text document to find specific information, but reading a document takes time, and extracting information from it takes additional time. Classifying text by subject can therefore guide a person to relevant information. This paper proposes a subject identification method based on term frequency that categorizes groups of text into a particular subject. Because term frequency tends to ignore the semantics of a document, a term extraction algorithm is introduced to improve the relevance of the terms extracted from the text. An evaluation of the extracted terms shows that the proposed method outperforms other extraction techniques.
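
As a rough illustration of the general idea of term-frequency-based subject identification, the following minimal Python sketch filters stop words, counts the remaining terms, and assigns the subject whose vocabulary covers the most term occurrences. It is not the authors' algorithm: the stop-word list, the subject vocabularies, and the function names `term_frequencies` and `identify_subject` are all hypothetical and chosen only for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list and subject vocabularies (illustration only,
# not taken from the paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "at"}

SUBJECT_TERMS = {
    "sports": {"match", "team", "goal", "player", "score"},
    "finance": {"bank", "market", "stock", "loan", "interest"},
}


def term_frequencies(text):
    """Count lowercase word tokens after dropping stop words (term filtering)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def identify_subject(text):
    """Return the subject whose vocabulary has the highest summed term frequency.

    Returns None when no vocabulary term occurs in the text.
    """
    tf = term_frequencies(text)
    scores = {
        subject: sum(tf[term] for term in terms)
        for subject, terms in SUBJECT_TERMS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


sample = "The team scored a late goal and the player celebrated the goal."
print(identify_subject(sample))
```

A purely frequency-based score like this one ignores word meaning, which is the limitation the abstract attributes to term frequency; the sketch does not attempt the additional term extraction step that the paper introduces to address it.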

Keywords

Subject identification, Text classification, Term frequency, Term filtering, Text document.
