International Journal of Advanced Technology and Engineering Exploration (IJATEE) ISSN (Print): 2394-5443 ISSN (Online): 2394-7454 Volume 11, Issue 119, October 2024

Analyzing lip dynamics using sparrow search optimized BiLSTM classifier

Shilpa Sonawane and P. Malathi

Abstract

Applications based on voice-driven automatic speech recognition (ASR) have recently gained popularity, but they fail in noisy backgrounds, with overlapping speech, and when the speech signal is severely distorted. In such cases, speech information can be recovered from the mouth region and facial expressions. Visual speech synthesis (VSS) is an effective alternative to audio-only ASR because it infers the uttered word from lip dynamics. The proposed methodology generates speech directly from lip motion without text as an intermediate representation. A visual-voice embedding is introduced to store vital acoustic knowledge, enabling the production of audio from different speakers. The proposed sparrow search optimized bidirectional long short-term memory (BiLSTM) model takes lip movements and the corresponding acoustic information as input, both of which are utilized during training. The major contributions are: (1) a visual-voice embedding that provides additional audio information and enhances the visual features, thus generating superior speech from lip movements; (2) the sparrow search algorithm (SSA), employed to search the solution space for the best parameters when generating audio samples, with the aim of reducing the loss; and (3) an autoregressive model that produces speech from silent video without requiring audio transcription. The effectiveness of the model is evaluated on the GRID corpus. Performance is analyzed by comparing the generated speech with the ground-truth signals in terms of mean squared error (MSE), root mean square error (RMSE), signal-to-noise ratio (SNR), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). The proposed methodology outperforms the compared approaches on the PESQ and STOI metrics: the PESQ score shows a significant improvement of 4.06 over the generative adversarial network (GAN) baseline, while the STOI score improves by 0.202.
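To make the pipeline concrete, the sketch below shows a minimal BiLSTM that maps a sequence of per-frame lip features to mel-spectrogram frames, trained with an MSE loss against the ground-truth spectrogram. This is an illustrative reduction under stated assumptions, not the authors' implementation: the visual front-end, the visual-voice embedding, the autoregressive decoding, and the SSA search are omitted, and all names and dimensions (512-dimensional visual features, 80 mel bins, 75 frames for a 3-second GRID clip at 25 fps) are hypothetical choices.

    # Minimal sketch (PyTorch assumed): a BiLSTM mapping per-frame lip
    # features to mel-spectrogram frames. The visual front-end, the
    # visual-voice embedding, the autoregressive decoder, and the SSA
    # search described in the abstract are deliberately omitted.
    import torch
    import torch.nn as nn

    class LipToSpeechBiLSTM(nn.Module):
        def __init__(self, visual_dim=512, hidden_dim=256, n_mels=80):
            super().__init__()
            # Bidirectional LSTM over per-frame visual features
            # (e.g. CNN embeddings of cropped mouth regions).
            self.bilstm = nn.LSTM(visual_dim, hidden_dim, num_layers=2,
                                  batch_first=True, bidirectional=True)
            # Project concatenated forward/backward states to mel bins.
            self.proj = nn.Linear(2 * hidden_dim, n_mels)

        def forward(self, lip_feats):        # (batch, frames, visual_dim)
            out, _ = self.bilstm(lip_feats)  # (batch, frames, 2*hidden_dim)
            return self.proj(out)            # (batch, frames, n_mels)

    model = LipToSpeechBiLSTM()
    lip_feats = torch.randn(4, 75, 512)  # hypothetical: 75 frames = 3 s GRID clip at 25 fps
    mel_pred = model(lip_feats)
    mel_true = torch.randn(4, 75, 80)    # placeholder ground-truth mel-spectrogram
    loss = nn.functional.mse_loss(mel_pred, mel_true)  # MSE, one of the reported metrics

Under this framing, the SSA step described in the abstract can be read as a population-based search over quantities such as hidden_dim and the learning rate, with the validation loss as the fitness function; that wrapper is not shown here.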

Keywords

Visual speech synthesis, Automatic speech recognition, Lip dynamics, Sparrow search algorithm, Bidirectional long short-term memory.

Cite this article

Sonawane S, Malathi P. Analyzing lip dynamics using sparrow search optimized BiLSTM classifier. International Journal of Advanced Technology and Engineering Exploration. 2024; 11(119):1430-1448. DOI: 10.19101/IJATEE.2024.111100169.
