International Journal of Advanced Technology and Engineering Exploration (IJATEE) ISSN (P): 2394-5443 ISSN (O): 2394-7454 Vol - 9, Issue - 91, June 2022
Identification and extraction of multiword expressions from Hindi & Urdu language in natural language processing

Vaishali Gupta and Nisheeth Joshi

Abstract

Statistical machine translation can translate text from one language to another, but gaps remain in the translations because language resources are scarce. Building a linguistic corpus requires the extraction of multiword expressions (MWEs). An MWE is a group of words with idiomatic properties. Because the meaning of an MWE is non-compositional, that is, it cannot be derived from the meanings of its individual words, identifying and extracting MWEs is a time-consuming task. To address this, an automated system has been developed for extracting MWEs from Hindi and Urdu text. The process comprises tagging, pattern matching, an identification algorithm, and extraction of MWEs from the data. Each word is first tagged with a part-of-speech tag, and the tagged text serves as input to the pattern-matching algorithm. Pattern matching selects tag sequences of specific patterns as MWE candidates, and the algorithm for automatic MWE detection is built on top of these patterns. A conditional random field model (CRF++) was used to extract the MWEs from the data automatically. The proposed system was evaluated automatically using a confusion matrix. The overall accuracy is 96.82% for Hindi and 96.62% for Urdu.
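The pattern-matching and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag patterns, the transliterated example words, the tag names, and the function names `candidate_bigrams` and `accuracy` are all assumptions made for illustration, since the abstract does not list the actual patterns or tagset. The sketch only shows the general idea of selecting adjacent word pairs by their part-of-speech tags and of computing overall accuracy from the four cells of a confusion matrix.

```python
from typing import Iterable, List, Set, Tuple

TaggedWord = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical tag patterns, chosen only for illustration; the abstract
# does not state which patterns the authors selected.
CANDIDATE_PATTERNS: Set[Tuple[str, str]] = {
    ("NN", "NN"),  # noun + noun
    ("JJ", "NN"),  # adjective + noun
    ("NN", "VM"),  # noun + main verb
}


def candidate_bigrams(
    tagged: List[TaggedWord],
    patterns: Iterable[Tuple[str, str]] = CANDIDATE_PATTERNS,
) -> List[Tuple[str, str]]:
    """Return adjacent word pairs whose tag pair matches one of the patterns."""
    allowed = set(patterns)
    matches = []
    # Walk over every pair of neighbouring tokens in the sentence.
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in allowed:
            matches.append((w1, w2))
    return matches


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Overall accuracy from a 2x2 confusion matrix: (TP + TN) / total."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


if __name__ == "__main__":
    # Transliterated toy sentence with assumed tags, not data from the paper.
    sentence = [
        ("kaam", "NN"),
        ("karna", "VM"),
        ("bahut", "INTF"),
        ("accha", "JJ"),
        ("ladka", "NN"),
    ]
    print(candidate_bigrams(sentence))
    print(accuracy(tp=50, fp=5, fn=5, tn=40))
```

In a pipeline of the kind the abstract describes, candidates selected this way would then be labelled by a sequence model such as a conditional random field; that step is omitted here because the feature templates are not given.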

Keywords

Bigrams, Tags, Multiword expression (MWE), Conditional random field (CRF), Confusion matrix.

Cite this article

Gupta V, Joshi N
