International Journal of Advanced Computer Research (IJACR) ISSN (P): 2249-7277 ISSN (O): 2277-7970 Vol - 10, Issue - 49, July 2020
Review of feature selection methods for text classification

Muhammad Iqbal, Malik Muneeb Abid, Muhammad Noman Khalid and Amir Manzoor

Abstract

For the last three decades, the World Wide Web (WWW) has been one of the most widely used platforms, generating an immense amount of heterogeneous data every day. Many organizations now aim to process their domain data to make quick decisions and improve organizational performance. However, high dimensionality in datasets is a major obstacle for researchers and domain engineers seeking to achieve the desired performance from their chosen machine learning (ML) algorithms. In ML, feature selection is a core technique for selecting the most relevant features of high-dimensional data and thereby improving the performance of the trained learning model. Moreover, by eliminating inappropriate and redundant features, feature selection also shrinks computational time. Owing to its significance and applications, feature selection has become a well-researched area of ML. Today, it plays a vital role in most effective spam detection systems, pattern recognition systems, automated document organization and management, and information retrieval systems. Since selecting relevant features is the most important task for accurate classification, this study begins with an overview of text classification, followed by a survey of the feature selection methods commonly used for text classification and their applications. The focus of this study is three feature selection algorithms: Principal Component Analysis (PCA), Chi-Square (CS), and Information Gain (IG). This study is helpful for researchers looking for suitable criteria to decide which technique to use and to better understand classifier performance. Experiments were conducted on the web spam uk2007 dataset.
Ten, twenty, thirty, and forty features were selected as optimal subsets from the web spam uk2007 dataset. Among the three feature selection algorithms, CS and IG achieved the highest F1-score (F-measure = 0.911), but at the cost of longer model-building times.
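The comparison described above can be sketched in a few lines. This is a minimal illustration using scikit-learn rather than the authors' WEKA setup, and it runs on synthetic data standing in for the web spam uk2007 dataset; `SelectKBest` with `chi2` and `mutual_info_classif` approximates the Chi-Square and Information Gain rankings, and `k` mirrors the 10/20/30/40-feature subsets evaluated in the study.

```python
# Hypothetical sketch of the three feature selection methods surveyed:
# PCA, Chi-Square (CS), and Information Gain (IG). Synthetic data is used
# in place of the web spam uk2007 dataset.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X_pos = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

k = 10  # the study evaluates subsets of 10, 20, 30, and 40 features
X_cs = SelectKBest(chi2, k=k).fit_transform(X_pos, y)             # Chi-Square
X_ig = SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)  # Information Gain
X_pca = PCA(n_components=k).fit_transform(X)                      # PCA projection

print(X_cs.shape, X_ig.shape, X_pca.shape)  # each reduced to (200, 10)
```

Note that CS and IG keep a subset of the original features (scores are computed per feature against the class label), while PCA instead projects the data onto new components, which is one reason the methods can behave differently in classification accuracy and model-building time.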

Keywords

Feature selection, Binary classification, Feature selection algorithms.

Cite this article

Iqbal M, Abid MM, Khalid MN, Manzoor A. Review of feature selection methods for text classification. International Journal of Advanced Computer Research. 2020; 10(49).

References

[1] Dasgupta A, Drineas P, Harb B, Josifovski V, Mahoney MW. Feature selection methods for text classification. In proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining 2007 (pp. 230-9).

[2] Shang W, Huang H, Zhu H, Lin Y, Qu Y, Wang Z. A novel feature selection algorithm for text categorization. Expert Systems with Applications. 2007; 33(1):1-5.

[3] Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: a new perspective. Neurocomputing. 2018; 300:70-9.

[4] Khalifi H, Elqadi A, Ghanou Y. Support vector machines for a new hybrid information retrieval system. Procedia Computer Science. 2018; 127:139-45.

[5] Salo F, Nassif AB, Essex A. Dimensionality reduction with IG-PCA and ensemble classifier for network intrusion detection. Computer Networks. 2019; 148:164-75.

[6] Tadist K, Najah S, Nikolov NS, Mrabti F, Zahi A. Feature selection methods and genomic big data: a systematic review. Journal of Big Data. 2019.

[7] Ramesh B, Sathiaseelan JG. An advanced multi class instance selection based support vector machine for text classification. Procedia Computer Science. 2015; 57:1124-30.

[8] Caggiano A, Angelone R, Napolitano F, Nele L, Teti R. Dimensionality reduction of sensorial features by principal component analysis for ANN machine learning in tool condition monitoring of CFRP drilling. Procedia CIRP. 2018; 78:307-12.

[9] Gibert D, Mateu C, Planes J. The rise of machine learning for detection and classification of malware: research developments, trends and challenges. Journal of Network and Computer Applications. 2020.

[10] Almuallim H, Dietterich TG. Learning with many irrelevant features. In AAAI 1991 (pp. 547-52).

[11] Swets DL, Weng JJ. Efficient content-based image retrieval using automatic feature selection. In proceedings of international symposium on computer vision-ISCV 1995 (pp. 85-90). IEEE.

[12] Lee W, Stolfo SJ, Mok KW. Adaptive intrusion detection: a data mining approach. Artificial Intelligence Review. 2000; 14(6):533-67.

[13] Forman G. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research. 2003; 3:1289-305.

[14] Rill S, Reinel D, Scheidt J, Zicari RV. Politwi: early detection of emerging political topics on twitter and the impact on concept-level sentiment analysis. Knowledge-Based Systems. 2014; 69:24-33.

[15] Idris I, Selamat A. Improved email spam detection model with negative selection algorithm and particle swarm optimization. Applied Soft Computing. 2014; 22:11-27.

[16] Uysal AK, Gunal S, Ergin S, Gunal ES. The impact of feature extraction and selection on SMS spam filtering. Elektronika ir Elektrotechnika. 2013; 19(5):67-72.

[17] Zhang C, Wu X, Niu Z, Ding W. Authorship identification from unstructured texts. Knowledge-Based Systems. 2014; 66:99-111.

[18] Saraç E, Özel SA. An ant colony optimization based feature selection for web page classification. The Scientific World Journal. 2014.

[19] Medhat W, Hassan A, Korashy H. Sentiment analysis algorithms and applications: a survey. Ain Shams Engineering Journal. 2014; 5(4):1093-113.

[20] Xu S, Chan HK. Forecasting medical device demand with online search queries: a big data and machine learning approach. Procedia Manufacturing. 2019; 39:32-9.

[21] Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys (CSUR). 2002; 34(1):1-47.

[22] Kohavi R, John GH. Wrappers for feature subset selection. Artificial Intelligence. 1997; 97(1-2):273-324.

[23] Donoser M, Wagner S, Bischof H. Context information from search engines for document recognition. Pattern Recognition Letters. 2010; 31(8):750-4.

[24] Guyon I, Gunn S, Nikravesh M, Zadeh LA, editors. Feature extraction: foundations and applications. Springer; 2008.

[25] Ruiz FE, Pérez PS, Bonev BI. Information theory in computer vision and pattern recognition. Springer Science & Business Media; 2009.

[26] Chen J, Huang H, Tian S, Qu Y. Feature selection for text classification with Naïve Bayes. Expert Systems with Applications. 2009; 36(3):5432-5.

[27] Hegde J, Rokseth B. Applications of machine learning methods for engineering risk assessment–a review. Safety Science. 2020; 122.

[28] Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. In ICML 1997.

[29] Kumar V, Minz S. Feature selection: a literature review. Smart Computing Review. 2014; 4(3):211-29.

[30] Uysal AK. An improved global feature selection scheme for text classification. Expert Systems with Applications. 2016; 43:82-92.

[31] Dash M, Liu H. Feature selection for classification. Intelligent Data Analysis. 1997; 1(3):131-56.

[32] Wang F, Liang J. An efficient feature selection algorithm for hybrid data. Neurocomputing. 2016; 193:33-41.

[33] Bommert A, Sun X, Bischl B, Rahnenführer J, Lang M. Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis. 2020; 143:106839.

[34] Taşcı Ş, Güngör T. Comparison of text feature selection policies and using an adaptive framework. Expert Systems with Applications. 2013; 40(12):4871-86.

[35] Chen G, Chen J. A novel wrapper method for feature selection and its applications. Neurocomputing. 2015; 159:219-26.

[36] Liu H, Motoda H, Setiono R, Zhao Z. Feature selection: an ever evolving frontier in data mining. In feature selection in data mining 2010 (pp. 4-13).

[37] Lee SJ, Xu Z, Li T, Yang Y. A novel bagging C4.5 algorithm based on wrapper feature selection for supporting wise clinical decision making. Journal of Biomedical Informatics. 2018; 78:144-55.

[38] Sosa-Cabrera G, García-Torres M, Gómez-Guerrero S, Schaerer CE, Divina F. A multivariate approach to the symmetrical uncertainty measure: application to feature selection problem. Information Sciences. 2019; 494:1-20.

[39] Kumar V, Minz S. Poem classification using machine learning approach. In proceedings of the second international conference on soft computing for problem solving (SocProS 2012), 2014 (pp. 675-82). Springer, New Delhi.

[40] Kumar V, Minz S. Mood classification of lyrics using SentiWordNet. In international conference on computer communication and informatics 2013 (pp. 1-5). IEEE.

[41] Kumar V, Minz S. Multi-view ensemble learning for poem data classification using SentiWordNet. In Advanced Computing, Networking and Informatics 2014:57-66. Springer, Cham.

[42] Jia X, Kuo BC, Crawford MM. Feature mining for hyperspectral image classification. Proceedings of the IEEE. 2013; 101(3):676-97.

[43] Kuo BC, Landgrebe DA. Nonparametric weighted feature extraction for classification. IEEE Transactions on Geoscience and Remote Sensing. 2004; 42(5):1096-105.

[44] Zhao YQ, Zhang L, Kong SG. Band-subset-based clustering and fusion for hyperspectral imagery classification. IEEE Transactions on Geoscience and Remote Sensing. 2010; 49(2):747-56.

[45] Dua M. Attribute selection and ensemble classifier based novel approach to intrusion detection system. Procedia Computer Science. 2020; 167:2191-9.

[46] Mishra S, Panda M. Medical image retrieval using self-organising map on texture features. Future Computing and Informatics Journal. 2018; 3(2):359-70.

[47] Da Silva SF, Ribeiro MX, Neto JD, Traina-Jr C, Traina AJ. Improving the ranking quality of medical image retrieval using a genetic feature selection method. Decision Support Systems. 2011; 51(4):810-20.

[48] Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. New York: Springer Series in Statistics; 2001.

[49] Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering. 2005; 17(4):491-502.

[50] Rodriguez-Galiano VF, Luque-Espinar JA, Chica-Olmo M, Mendes MP. Feature selection approaches for predictive modelling of groundwater nitrate pollution: an evaluation of filters, embedded and wrapper methods. Science of the Total Environment. 2018; 624:661-72.

[51] Das S. Filters, wrappers and a boosting-based hybrid for feature selection. In ICML 2001: 74-81.

[52] Das S. Filters, wrappers and a boosting-based hybrid for feature selection. In ICML 2001: 74-81.

[53] Fahad A, Tari Z, Khalil I, Habib I, Alnuweiri H. Toward an efficient and scalable feature selection approach for internet traffic classification. Computer Networks. 2013; 57(9):2040-57.

[54] Jiarpakdee J, Tantithamthavorn C, Hassan AE. The impact of correlated metrics on the interpretation of defect models. IEEE Transactions on Software Engineering. 2019.

[55] SL SD, Jaidhar CD. Windows malware detector using convolutional neural network based on visualization images. IEEE Transactions on Emerging Topics in Computing. 2019.

[56] Artoni F, Delorme A, Makeig S. A visual working memory dataset collection with bootstrap independent component analysis for comparison of electroencephalographic preprocessing pipelines. Data in Brief. 2019; 22:787-93.

[57] http://www.cs.waikato.ac.nz/ml/weka. Accessed 10 March 2020.

[58] http://chato.cl/webspam/datasets/. Accessed 10 March 2020.

[59] http://www.cs.waikato.ac.nz/~mhall/thesis.pdf. Accessed 10 March 2020.