International Journal of Advanced Computer Research (IJACR) ISSN (P): 2249-7277 ISSN (O): 2277-7970 Vol - 6, Issue - 25, July 2016
Keyword extraction from single documents using mean word intermediate distance

Sifatullah Siddiqi and Aditi Sharan

Abstract

Keyword extraction is an important task in text mining. This paper proposes a novel, unsupervised, domain-independent and language-independent approach for automatic keyword extraction from single documents. The approach uses the word intermediate distance vector and its mean value to extract keywords. We compare it against the standard deviation of intermediate distances approach, taken as the baseline, and find heavy overlap between the results of the two approaches. Our approach has the advantage of being faster, especially for long documents, because it removes the need to compute the standard deviation of the word intermediate distance vector. Two well-known works, “Origin of Species” and “A Brief History of Time”, are used to demonstrate the experimental results. Experiments show that the proposed approach works almost as well as the standard deviation approach, and that the overlap between the top 30 keywords extracted by the two approaches is more than 50%.
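
The quantities named in the abstract can be illustrated with a short Python sketch. It is not the authors' implementation: the abstract does not say how the mean value is turned into a keyword ranking, so the scoring functions, the `min_gaps` filter, and all function names below are assumptions made for illustration only. The sketch shows how an intermediate distance vector could be built for each word, how a mean-based score and a standard-deviation-based score could be computed from it, and how the percentage overlap between two top-k keyword lists could be measured.

```python
from collections import defaultdict
from statistics import mean, pstdev


def intermediate_distances(tokens):
    """Map each word to the gaps between its successive occurrences."""
    positions = defaultdict(list)
    for index, word in enumerate(tokens):
        positions[word].append(index)
    return {
        word: [later - earlier for earlier, later in zip(pos, pos[1:])]
        for word, pos in positions.items()
    }


def mean_distance(gaps):
    """Assumed score: the plain mean of the distance vector."""
    return mean(gaps)


def normalized_std(gaps):
    """Assumed baseline score: standard deviation divided by the mean."""
    return pstdev(gaps) / mean(gaps)


def top_words(tokens, score, k=30, min_gaps=2, reverse=True):
    """Rank words by score(gaps), skipping words with fewer than min_gaps gaps.

    The ranking direction (reverse) is a free choice here, not taken
    from the paper.
    """
    vectors = intermediate_distances(tokens)
    scored = {w: score(g) for w, g in vectors.items() if len(g) >= min_gaps}
    # Ties are broken by the word itself so the output is deterministic.
    return sorted(scored, key=lambda w: (scored[w], w), reverse=reverse)[:k]


def overlap_percentage(first, second):
    """Percentage of items in `first` that also appear in `second`."""
    if not first:
        return 0.0
    return 100.0 * len(set(first) & set(second)) / len(first)
```

Under these assumptions, calling `top_words(tokens, mean_distance)` and `top_words(tokens, normalized_std)` on the same tokenized document and passing the two lists to `overlap_percentage` would give a figure comparable in kind to the top-30 overlap reported above. The mean-based score needs only one pass over each distance vector, which is the source of the speed advantage the abstract mentions.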

Keywords

Keyword extraction, Mean word intermediate distance, Clustering, Standard deviation.
