International Journal of Advanced Technology and Engineering Exploration (IJATEE) ISSN (Print): 2394-5443 ISSN (Online): 2394-7454 Volume - 10 Issue - 99 February - 2023

Improvisation in opinion mining using data preprocessing techniques based on consumer’s review

Kartika Makkar, Pardeep Kumar, Monika Poriye and Shalini Aggarwal

Abstract

In today's digital age, an enormous volume of data is generated daily from internet sources such as social media sites, emails, and consumer reviews. As competition rises, it has become essential for organizations to understand their customers' needs and preferences. Sentiment analysis is an effective method for extracting meaningful insights from human language data, such as reviews, and for understanding consumer perceptions. This research article presents a text preprocessing approach consisting of three stages: data collection, cleaning, and transformation. The approach was applied to three datasets (restaurant, cell phone, and garments) and evaluated for sentiment prediction using five machine learning classifiers: support vector machine (SVM), logistic regression (LR), decision tree (DT), random forest (RF), and Naïve Bayes (NB). A first comparison was made between two sets of techniques: set1 employed data cleaning and transformation with stemming, while set2 used data cleaning and transformation with lemmatization. The results indicated that set2 performed better. Specifically, with set2, SVM, LR, RF, and NB performed better on the restaurant dataset; DT, LR, and RF performed better on the cell phone dataset; and LR, DT, and RF performed better on the garments dataset. Set2 was therefore chosen as the preprocessing technique for the subsequent comparison. A second comparison was made between two further sets of techniques: set3 included text cleaning, transformation with lemmatization, and unigram features, while the other set included text cleaning, transformation with lemmatization, and bigram features. Both sets were evaluated using the same machine learning classifiers, and the results revealed that set3 performed better with most classifiers.
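The two comparisons described above (stemming versus lemmatization, then unigram versus bigram features) can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the suffix rules in `toy_stem`, the `LEMMAS` lookup table, and the sample review are all hypothetical stand-ins chosen so that the example runs with the Python standard library alone. A real pipeline would typically rely on an established stemmer and lemmatizer.

```python
import re
from collections import Counter

# Hypothetical stop-word list and lemma table, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was", "were", "and", "of", "to"}
LEMMAS = {"dishes": "dish", "waiters": "waiter", "loved": "love", "serving": "serve"}


def clean(text):
    """Cleaning stage: lowercase, replace non-letters with spaces, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def toy_stem(token):
    """Crude suffix stripping (assumed rules); the result may not be a real word."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_lemmatize(token):
    """Dictionary lookup (assumed table); unknown tokens are returned unchanged."""
    return LEMMAS.get(token, token)


def ngrams(tokens, n):
    """Count contiguous n-grams: n=1 gives unigrams, n=2 gives bigrams."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


review = "The dishes were amazing and the waiters loved serving us!"
tokens = clean(review)

stemmed = [toy_stem(t) for t in tokens]          # set1-style transformation
lemmatized = [toy_lemmatize(t) for t in tokens]  # set2-style transformation

unigrams = ngrams(lemmatized, 1)  # set3-style features
bigrams = ngrams(lemmatized, 2)   # bigram alternative

print("stemmed:   ", stemmed)
print("lemmatized:", lemmatized)
print("unigrams:  ", sorted(unigrams))
print("bigrams:   ", sorted(bigrams))
```

In this toy example, stemming truncates some tokens to non-words while the lookup-based lemmatizer returns dictionary forms. Bigram features are also sparser than unigram features for a short review, since each bigram tends to occur only once. Whether either effect explains the reported results is not something this sketch can establish.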

Keywords

Support vector machine (SVM), Random forest (RF), Decision tree (DT), Logistic regression (LR), Naïve Bayes (NB).

Cite this article

Makkar K, Kumar P, Poriye M, Aggarwal S. Improvisation in opinion mining using data preprocessing techniques based on consumer’s review. International Journal of Advanced Technology and Engineering Exploration. 2023;10(99):257-277. DOI:10.19101/IJATEE.2021.875886
