International Journal of Advanced Computer Research (IJACR) ISSN (P): 2249-7277 ISSN (O): 2277-7970 Vol - 10, Issue - 49, July 2020
Review of feature selection methods for text classification

Muhammad Iqbal, Malik Muneeb Abid, Muhammad Noman Khalid and Amir Manzoor

Abstract

For the last three decades, the World Wide Web (WWW) has been one of the most widely used platforms, generating an immense amount of heterogeneous data every day. Many organizations now aim to process their domain data to make quick decisions and improve organizational performance. However, high dimensionality in datasets is a major obstacle for researchers and domain engineers seeking to achieve the desired performance from their chosen machine learning (ML) algorithms. In ML, feature selection is a core technique for selecting the most relevant features of high-dimensional data and thereby improving the performance of the trained learning model. Moreover, by eliminating inappropriate and redundant features, feature selection also shrinks computational time. Owing to its significance and applications, feature selection has become a well-researched area of ML. Today, it plays a vital role in most effective spam detection systems, pattern recognition systems, automated document organization and management, and information retrieval systems. Since selecting relevant features is the most important task for accurate classification, this study begins with an overview of text classification, followed by a survey of the feature selection methods commonly used for text classification and their applications. The focus of this study is three feature selection algorithms: Principal Component Analysis (PCA), Chi-Square (CS), and Information Gain (IG). This study is helpful for researchers looking for suitable criteria to decide which technique to use and to better understand classifier performance. Experiments were conducted on the web spam uk2007 dataset.
Ten, twenty, thirty, and forty features were selected as optimal subsets from the web spam uk2007 dataset. Among the three feature selection algorithms, CS and IG achieved the highest F1-score (F-measure = 0.911), but at the cost of longer model-building times.
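The comparison described above can be sketched in a few lines. This is a minimal illustration using scikit-learn rather than the authors' WEKA setup, and it runs on synthetic data standing in for the web spam uk2007 dataset; `SelectKBest` with `chi2` and `mutual_info_classif` approximates the Chi-Square and Information Gain rankings, and `k` mirrors the 10/20/30/40-feature subsets evaluated in the study.

```python
# Hypothetical sketch of the three feature selection methods surveyed:
# PCA, Chi-Square (CS), and Information Gain (IG). Synthetic data is used
# in place of the web spam uk2007 dataset.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X_pos = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

k = 10  # the study evaluates subsets of 10, 20, 30, and 40 features
X_cs = SelectKBest(chi2, k=k).fit_transform(X_pos, y)             # Chi-Square
X_ig = SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)  # Information Gain
X_pca = PCA(n_components=k).fit_transform(X)                      # PCA projection

print(X_cs.shape, X_ig.shape, X_pca.shape)  # each reduced to (200, 10)
```

Note that CS and IG keep a subset of the original features (scores are computed per feature against the class label), while PCA instead projects the data onto new components, which is one reason the methods can behave differently in classification accuracy and model-building time.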

Keywords

Feature selection, Binary classification, Feature selection algorithms.

Cite this article

Iqbal M, Abid MM, Khalid MN, Manzoor A. Review of feature selection methods for text classification. International Journal of Advanced Computer Research. 2020; 10(49).

References

[1] Dasgupta A, Drineas P, Harb B, Josifovski V, Mahoney MW. Feature selection methods for text classification. In proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining 2007 (pp. 230-9).

[2] Shang W, Huang H, Zhu H, Lin Y, Qu Y, Wang Z. A novel feature selection algorithm for text categorization. Expert Systems with Applications. 2007; 33(1):1-5.

[3] Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: a new perspective. Neurocomputing. 2018; 300:70-9.

[4] Khalifi H, Elqadi A, Ghanou Y. Support vector machines for a new hybrid information retrieval system. Procedia Computer Science. 2018; 127:139-45.

[5] Salo F, Nassif AB, Essex A. Dimensionality reduction with IG-PCA and ensemble classifier for network intrusion detection. Computer Networks. 2019; 148:164-75.

[6] Tadist K, Najah S, Nikolov NS, Mrabti F, Zahi A. Feature selection methods and genomic big data: a systematic review. Journal of Big Data. 2019.

[7] Ramesh B, Sathiaseelan JG. An advanced multi class instance selection based support vector machine for text classification. Procedia Computer Science. 2015; 57:1124-30.

[8] Caggiano A, Angelone R, Napolitano F, Nele L, Teti R. Dimensionality reduction of sensorial features by principal component analysis for ANN machine learning in tool condition monitoring of CFRP drilling. Procedia CIRP. 2018; 78:307-12.

[9] Gibert D, Mateu C, Planes J. The rise of machine learning for detection and classification of malware: research developments, trends and challenges. Journal of Network and Computer Applications. 2020.

[10] Almuallim H, Dietterich TG. Learning with many irrelevant features. In AAAI 1991 (pp. 547-52).

[11] Swets DL, Weng JJ. Efficient content-based image retrieval using automatic feature selection. In proceedings of international symposium on computer vision-ISCV 1995 (pp. 85-90). IEEE.

[12] Lee W, Stolfo SJ, Mok KW. Adaptive intrusion detection: a data mining approach. Artificial Intelligence Review. 2000; 14(6):533-67.

[13] Forman G. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research. 2003; 3:1289-305.

[14] Rill S, Reinel D, Scheidt J, Zicari RV. Politwi: early detection of emerging political topics on twitter and the impact on concept-level sentiment analysis. Knowledge-Based Systems. 2014; 69:24-33.

[15] Idris I, Selamat A. Improved email spam detection model with negative selection algorithm and particle swarm optimization. Applied Soft Computing. 2014; 22:11-27.

[16] Uysal AK, Gunal S, Ergin S, Gunal ES. The impact of feature extraction and selection on SMS spam filtering. Elektronika ir Elektrotechnika. 2013; 19(5):67-72.

[17] Zhang C, Wu X, Niu Z, Ding W. Authorship identification from unstructured texts. Knowledge-Based Systems. 2014; 66:99-111.

[18] Saraç E, Özel SA. An ant colony optimization based feature selection for web page classification. The Scientific World Journal. 2014.

[19] Medhat W, Hassan A, Korashy H. Sentiment analysis algorithms and applications: a survey. Ain Shams Engineering Journal. 2014; 5(4):1093-113.

[20] Xu S, Chan HK. Forecasting medical device demand with online search queries: a big data and machine learning approach. Procedia Manufacturing. 2019; 39:32-9.

[21] Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys (CSUR). 2002; 34(1):1-47.

[22] Kohavi R, John GH. Wrappers for feature subset selection. Artificial Intelligence. 1997; 97(1-2):273-324.

[23] Donoser M, Wagner S, Bischof H. Context information from search engines for document recognition. Pattern Recognition Letters. 2010; 31(8):750-4.

[24] Guyon I, Gunn S, Nikravesh M, Zadeh LA, editors. Feature extraction: foundations and applications. Springer; 2008.

[25] Ruiz FE, Pérez PS, Bonev BI. Information theory in computer vision and pattern recognition. Springer Science & Business Media; 2009.

[26] Chen J, Huang H, Tian S, Qu Y. Feature selection for text classification with Naïve Bayes. Expert Systems with Applications. 2009; 36(3):5432-5.

[27] Hegde J, Rokseth B. Applications of machine learning methods for engineering risk assessment–a review. Safety Science. 2020; 122.

[28] Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. In ICML 1997.

[29] Kumar V, Minz S. Feature selection: a literature review. Smart Computing Review. 2014; 4(3):211-29.

[30] Uysal AK. An improved global feature selection scheme for text classification. Expert Systems with Applications. 2016; 43:82-92.

[31] Dash M, Liu H. Feature selection for classification. Intelligent Data Analysis. 1997; 1(3):131-56.

[32] Wang F, Liang J. An efficient feature selection algorithm for hybrid data. Neurocomputing. 2016; 193:33-41.

[33] Bommert A, Sun X, Bischl B, Rahnenführer J, Lang M. Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis. 2020; 143:106839.

[34] Taşcı Ş, Güngör T. Comparison of text feature selection policies and using an adaptive framework. Expert Systems with Applications. 2013; 40(12):4871-86.

[35] Chen G, Chen J. A novel wrapper method for feature selection and its applications. Neurocomputing. 2015; 159:219-26.

[36] Liu H, Motoda H, Setiono R, Zhao Z. Feature selection: an ever evolving frontier in data mining. In feature selection in data mining 2010 (pp. 4-13).

[37] Lee SJ, Xu Z, Li T, Yang Y. A novel bagging C4.5 algorithm based on wrapper feature selection for supporting wise clinical decision making. Journal of Biomedical Informatics. 2018; 78:144-55.

[38] Sosa-Cabrera G, García-Torres M, Gómez-Guerrero S, Schaerer CE, Divina F. A multivariate approach to the symmetrical uncertainty measure: application to feature selection problem. Information Sciences. 2019; 494:1-20.

[39] Kumar V, Minz S. Poem classification using machine learning approach. In proceedings of the second international conference on soft computing for problem solving (SocProS 2012), 2014 (pp. 675-82). Springer, New Delhi.

[40] Kumar V, Minz S. Mood classification of lyrics using SentiWordNet. In international conference on computer communication and informatics 2013 (pp. 1-5). IEEE.

[41] Kumar V, Minz S. Multi-view ensemble learning for poem data classification using SentiWordNet. In Advanced Computing, Networking and Informatics 2014:57-66. Springer, Cham.

[42] Jia X, Kuo BC, Crawford MM. Feature mining for hyperspectral image classification. Proceedings of the IEEE. 2013; 101(3):676-97.

[43] Kuo BC, Landgrebe DA. Nonparametric weighted feature extraction for classification. IEEE Transactions on Geoscience and Remote Sensing. 2004; 42(5):1096-105.

[44] Zhao YQ, Zhang L, Kong SG. Band-subset-based clustering and fusion for hyperspectral imagery classification. IEEE Transactions on Geoscience and Remote Sensing. 2010; 49(2):747-56.

[45] Dua M. Attribute selection and ensemble classifier based novel approach to intrusion detection system. Procedia Computer Science. 2020; 167:2191-9.

[46] Mishra S, Panda M. Medical image retrieval using self-organising map on texture features. Future Computing and Informatics Journal. 2018; 3(2):359-70.

[47] Da Silva SF, Ribeiro MX, Neto JD, Traina-Jr C, Traina AJ. Improving the ranking quality of medical image retrieval using a genetic feature selection method. Decision Support Systems. 2011; 51(4):810-20.

[48] Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. New York: Springer Series in Statistics; 2001.

[49] Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering. 2005; 17(4):491-502.

[50] Rodriguez-Galiano VF, Luque-Espinar JA, Chica-Olmo M, Mendes MP. Feature selection approaches for predictive modelling of groundwater nitrate pollution: an evaluation of filters, embedded and wrapper methods. Science of the Total Environment. 2018; 624:661-72.

[51] Das S. Filters, wrappers and a boosting-based hybrid for feature selection. In ICML 2001: 74-81.

[52] Das S. Filters, wrappers and a boosting-based hybrid for feature selection. In ICML 2001: 74-81.

[53] Fahad A, Tari Z, Khalil I, Habib I, Alnuweiri H. Toward an efficient and scalable feature selection approach for internet traffic classification. Computer Networks. 2013; 57(9):2040-57.

[54] Jiarpakdee J, Tantithamthavorn C, Hassan AE. The impact of correlated metrics on the interpretation of defect models. IEEE Transactions on Software Engineering. 2019.

[55] SL SD, Jaidhar CD. Windows malware detector using convolutional neural network based on visualization images. IEEE Transactions on Emerging Topics in Computing. 2019.

[56] Artoni F, Delorme A, Makeig S. A visual working memory dataset collection with bootstrap independent component analysis for comparison of electroencephalographic preprocessing pipelines. Data in Brief. 2019; 22:787-93.

[57] http://www.cs.waikato.ac.nz/ml/weka. Accessed 10 March 2020.

[58] http://chato.cl/webspam/datasets/. Accessed 10 March 2020.

[59] http://www.cs.waikato.ac.nz/~mhall/thesis.pdf. Accessed 10 March 2020.