Analyzing lip dynamics using sparrow search optimized BiLSTM classifier
Shilpa Sonawane and P. Malathi
Abstract
Applications based on automatic speech recognition (ASR) have recently gained popularity. Voice-based applications, however, fail in noisy backgrounds, when speech overlaps, and when the speech signal is completely distorted. In such cases, speech information can be recovered from the mouth region and facial emotions. Visual speech synthesis (VSS) is an effective alternative to audio-based ASR because it derives information about the uttered words from lip dynamics. The proposed methodology aims to generate speech directly from lip motion, without text as an intermediate representation. A visual-voice embedding is introduced to store vital acoustic knowledge, enabling the production of audio from different speakers. The proposed sparrow search optimized bidirectional long short-term memory (BiLSTM) model takes lip movements as input, together with the associated acoustic information, which is used during training. The major contributions are: (1) a visual-voice embedding that provides additional audio information and enhances the visual aspects, thus generating superior speech from lip movements; (2) the use of the sparrow search algorithm (SSA) to search the solution space for the best solution when generating audio samples, with the aim of reducing loss; and (3) an autoregressive model that produces speech from silent video without requiring a transcription of the audio. The effectiveness of the model is evaluated on the GRID corpus by comparing the generated speech with the ground-truth signals in terms of mean squared error (MSE), root mean square error (RMSE), signal-to-noise ratio (SNR), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). The proposed methodology is observed to perform better in terms of PESQ and STOI: the PESQ score shows a significant improvement of 4.06 over the generative adversarial network (GAN), while the STOI score improves by 0.202.
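As an illustration of the simpler signal-level metrics named above, the following minimal sketch computes MSE, RMSE, and SNR between a ground-truth waveform and a generated one. It is not the authors' evaluation code: the function names, the toy sample values, and the SNR definition (reference power divided by the power of the residual, in dB) are assumptions made for this example. STOI and PESQ are omitted because they depend on standardized perceptual models rather than a short formula.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equal-length sample sequences."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def rmse(reference, estimate):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(reference, estimate))


def snr_db(reference, estimate):
    """SNR in dB, assuming noise = reference - estimate (an assumed definition)."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_power == 0:
        return math.inf    # identical signals: no residual noise
    if signal_power == 0:
        return -math.inf   # silent reference with a non-silent estimate
    return 10.0 * math.log10(signal_power / noise_power)


# Toy waveforms (hypothetical values, not taken from the GRID corpus).
ground_truth = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
generated = [0.0, 0.4, 0.9, 0.6, 0.1, -0.5, -0.9, -0.4]

print("MSE :", mse(ground_truth, generated))
print("RMSE:", rmse(ground_truth, generated))
print("SNR :", snr_db(ground_truth, generated), "dB")
```

Lower MSE and RMSE and a higher SNR indicate that the generated waveform is closer to the ground truth. In practice the two signals would first need to be time-aligned and resampled to a common rate.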
Keywords
Visual speech synthesis, Automatic speech recognition, Lip dynamics, Sparrow search algorithm, Bidirectional long short-term memory.
Cite this article
Sonawane S, Malathi P. Analyzing lip dynamics using sparrow search optimized BiLSTM classifier. International Journal of Advanced Technology and Engineering Exploration. 2024;11(119):1430-1448. DOI:10.19101/IJATEE.2024.111100169