International Journal of Advanced Technology and Engineering Exploration (IJATEE) ISSN (Print): 2394-5443 ISSN (Online): 2394-7454 Volume - 11 Issue - 121 December - 2024

Lip2Voice: a sequence-to-sequence visual speech recognition system for predicting speech from silent video inputs

Aathira Pillai, Bhavana Mache and Supriya Kelkar

Abstract

Lip reading is the understanding of speech without reliance on auditory input; it benefits individuals with speech impairments by enabling them to participate in social activities. In this work, a visual speech recognition (VSR) system was developed using the sequence-to-sequence (seq2seq) learning paradigm. The proposed model used a spatio-temporal encoder to capture the sequence of lip movements, complemented by a decoder that generated high-quality speech. The predicted mel spectrogram was converted back to a waveform using the Griffin-Lim algorithm. In addition, an inference module enabled the generation of fixed-length speech from input videos of varying lengths. A training method termed "alternative training" was adopted so that the model prioritized the sentence content over speaker-specific characteristics, leading to faster convergence. The model achieved a training loss of 36.6% on the dual-speaker dataset and reduced the word error rate (WER) by 10% compared with the Vid2Speech model. A human subjective evaluation was conducted on five audio sets using two metrics, audibility and mispronunciation; the results showed that Lip2Voice had a lower overall percentage error than Vid2Speech. A comparative analysis of the proposed and existing models, based on audio spectrograms and frequency-domain waveforms evaluated with power spectral density (PSD), demonstrated the similarity between the spectrograms generated by the Lip2Voice model and the original audio. This research indicates that computer-based lip-reading systems for people with speech impairments are attainable.
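As a concrete illustration of the reconstruction step described above, the sketch below inverts a predicted mel spectrogram to a waveform with the Griffin-Lim algorithm. It assumes the librosa library; the sampling rate, FFT size, hop length, and iteration count are placeholder values, not the paper's settings.

```python
import librosa

def mel_to_waveform(mel_db, sr=16000, n_fft=512, hop_length=160, n_iter=60):
    """Invert a log-power mel spectrogram (n_mels x frames), such as a
    seq2seq decoder might predict, back to audio via Griffin-Lim.
    All parameter values here are illustrative assumptions."""
    mel_power = librosa.db_to_power(mel_db)  # undo log compression
    # mel_to_audio maps the mel spectrogram back to a linear-frequency
    # magnitude spectrogram and runs Griffin-Lim phase estimation.
    return librosa.feature.inverse.mel_to_audio(
        mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=n_iter)
```

The PSD-based comparison used in the evaluation can likewise be sketched with Welch's method from SciPy; `original` and `generated` are hypothetical names for two waveforms at the same sampling rate, and the correlation score is one plausible similarity measure, not necessarily the paper's.

```python
import numpy as np
from scipy.signal import welch

def psd_similarity(original, generated, sr=16000, nperseg=1024):
    """Correlate the power spectral densities of two waveforms,
    yielding a rough similarity score in [-1, 1]."""
    n = min(len(original), len(generated))
    _, psd_ref = welch(original[:n], fs=sr, nperseg=nperseg)
    _, psd_gen = welch(generated[:n], fs=sr, nperseg=nperseg)
    # Compare log-PSDs so low-energy bands are not swamped.
    a, b = np.log10(psd_ref + 1e-12), np.log10(psd_gen + 1e-12)
    return float(np.corrcoef(a, b)[0, 1])
```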

Keywords

Lip reading, Visual speech recognition, Spatio-temporal encoder, Alternative training, Speech impairments, Word error rate.

Cite this article

Pillai A, Mache B, Kelkar S. Lip2Voice: a sequence-to-sequence visual speech recognition system for predicting speech from silent video inputs. International Journal of Advanced Technology and Engineering Exploration. 2024;11(121):1747-1767. DOI: 10.19101/IJATEE.2023.10102613

References

[1]Prajwal KR, Mukhopadhyay R, Namboodiri VP, Jawahar CV. Learning individual speaking styles for accurate lip to speech synthesis. In proceedings of the conference on computer vision and pattern recognition 2020 (pp. 13796-805). IEEE.

[2]Hassanat AB. Visual speech recognition. Speech and Language Technologies. 2011; 1:279-303.

[3]Chung JS, Senior A, Vinyals O, Zisserman A. Lip reading sentences in the wild. In proceedings of the conference on computer vision and pattern recognition 2017 (pp. 6447-56). IEEE.

[4]Fernandez-Lopez A, Karaali A, Harte N, Sukno FM. CoGANs for unsupervised visual speech adaptation to new speakers. In international conference on acoustics, speech and signal processing (ICASSP) 2020 (pp. 6294-8). IEEE.

[5]Hao M, Mamut M, Yadikar N, Aysa A, Ubul K. A survey of research on lipreading technology. IEEE Access. 2020; 8:204518-44.

[6]Ma P, Wang Y, Shen J, Petridis S, Pantic M. Lip-reading with densely connected temporal convolutional networks. In proceedings of the winter conference on applications of computer vision 2021 (pp. 2857-66). IEEE.

[7]Afouras T, Chung JS, Zisserman A. ASR is all you need: cross-modal distillation for lip reading. In international conference on acoustics, speech and signal processing 2020 (pp. 2143-7). IEEE.

[8]Ephrat A, Halperin T, Peleg S. Improved speech reconstruction from silent video. In proceedings of the international conference on computer vision workshops 2017 (pp. 455-62). IEEE.

[9]Gao W, Hashemi-Sakhtsari A, McDonnell MD. End-to-end phoneme recognition using models from semantic image segmentation. In international joint conference on neural networks 2020 (pp. 1-7). IEEE.

[10]Sarhan AM, Elshennawy NM, Ibrahim DM. HLR-net: a hybrid lip-reading model based on deep convolutional neural networks. Computers, Materials and Continua. 2021; 68(2):1531-49.

[11]Guan C, Wang S, Liew AW. Lip image segmentation based on a fuzzy convolutional neural network. IEEE Transactions on Fuzzy Systems. 2019; 28(7):1242-51.

[12]Tsourounis D, Kastaniotis D, Fotopoulos S. Lip reading by alternating between spatiotemporal and spatial convolutions. Journal of Imaging. 2021; 7(5):1-17.

[13]Tao F, Busso C. End-to-end audiovisual speech recognition system with multitask learning. IEEE Transactions on Multimedia. 2020; 23:1-11.

[14]Xu K, Li D, Cassimatis N, Wang X. LCANet: end-to-end lipreading with cascaded attention-CTC. In 13th international conference on automatic face & gesture recognition 2018 (pp. 548-55). IEEE.

[15]Burchi M, Timofte R. Audio-visual efficient conformer for robust speech recognition. In proceedings of the winter conference on applications of computer vision 2023 (pp. 2258-67). IEEE.

[16]Serdyuk D, Braga O, Siohan O. Audio-visual speech recognition is worth 32×32×8 voxels. In automatic speech recognition and understanding workshop 2021 (pp. 796-802). IEEE.

[17]Shilaskar S, Iramani H. CTC-CNN-bidirectional LSTM based lip reading system. In international conference on emerging smart computing and informatics 2024 (pp. 1-6). IEEE.

[18]Kuriakose LK, Sinciya PO, Joseph MR, Namita R, Nabi S, Lone TA. Dip into: a novel method for visual speech recognition using deep learning. In annual international conference on emerging research areas: international conference on intelligent systems 2023 (pp. 1-6). IEEE.

[19]Burchi M, Puvvada KC, Balam J, Ginsburg B, Timofte R. Multilingual audio-visual speech recognition with hybrid CTC/RNN-T fast conformer. In international conference on acoustics, speech and signal processing 2024 (pp. 10211-5). IEEE.

[20]Liu X, Lakomkin E, Vougioukas K, Ma P, Chen H, Xie R, et al. SynthVSR: scaling up visual speech recognition with synthetic supervision. In proceedings of the conference on computer vision and pattern recognition 2023 (pp. 18806-15). IEEE.

[21]Zhang JX, Ling ZH, Liu LJ, Jiang Y, Dai LR. Sequence-to-sequence acoustic modeling for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019; 27(3):631-44.

[22]Zhang X, Cheng F, Wang S. Spatio-temporal fusion based convolutional sequence learning for lip reading. In proceedings of the international conference on computer vision 2019 (pp. 713-22). IEEE.

[23]Afouras T, Chung JS, Senior A, Vinyals O, Zisserman A. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018; 44(12):8717-27.

[24]Sterpu G, Saam C, Harte N. Attention-based audio-visual fusion for robust automatic speech recognition. In proceedings of the 20th international conference on multimodal interaction 2018 (pp. 111-5). ACM.

[25]Luo M, Yang S, Shan S, Chen X. Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading. In 15th international conference on automatic face and gesture recognition 2020 (pp. 273-80). IEEE.

[26]Ephrat A, Peleg S. Vid2speech: speech reconstruction from silent video. In international conference on acoustics, speech and signal processing 2017 (pp. 5095-9). IEEE.

[27]Arthur FV, Csapó TG. Towards a practical lip-to-speech conversion system using deep neural networks and mobile application frontend. In the international conference on artificial intelligence and computer vision 2021 (pp. 441-50). Cham: Springer International Publishing.

[28]Akbari H, Arora H, Cao L, Mesgarani N. Lip2audspec: speech reconstruction from silent lip movements video. In international conference on acoustics, speech and signal processing (ICASSP) 2018 (pp. 2516-20). IEEE.

[29]Shandiz AH, Tóth L, Gosztolya G, Markó A, Csapó TG. Improving neural silent speech interface models by adversarial training. In the international conference on artificial intelligence and computer vision 2021 (pp. 430-40). Cham: Springer International Publishing.

[30]Prajwal KR, Afouras T, Zisserman A. Sub-word level lip reading with visual attention. In proceedings of the conference on computer vision and pattern recognition 2022 (pp. 5162-72). IEEE.

[31]Ivanko D, Ryumin D, Markitantov M. End-to-end visual speech recognition for human-robot interaction. In AIP conference proceedings 2024. AIP Publishing.

[32]Bhaskar S, Thasleema TM. LSTM model for visual speech recognition through facial expressions. Multimedia Tools and Applications. 2023; 82(4):5455-72.

[33]Ajitha D, Dutta D, Saha F, Giri P, Kant R. AI LipReader-transcribing speech from lip movements. In international conference on emerging smart computing and informatics 2024 (pp. 1-6). IEEE.

[34]Adeel A, Gogate M, Hussain A, Whitmer WM. Lip-reading driven deep learning approach for speech enhancement. IEEE Transactions on Emerging Topics in Computational Intelligence. 2019; 5(3):481-90.

[35]Hou JC, Wang SS, Lai YH, Tsao Y, Chang HW, Wang HM. Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in Computational Intelligence. 2018; 2(2):117-28.

[36]Thimmaraja YG, Nagaraja BG, Jayanna HS. Speech enhancement and encoding by combining SS-VAD and LPC. International Journal of Speech Technology. 2021; 24(1):165-72.

[37]Sadeghi M, Leglaive S, Alameda-Pineda X, Girin L, Horaud R. Audio-visual speech enhancement using conditional variational auto-encoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2020; 28:1788-800.

[38]Patilkulkarni S. Visual speech recognition for small scale dataset using VGG16 convolution neural network. Multimedia Tools and Applications. 2021; 80(19):28941-52.

[39]Xiao J, Yang S, Zhang Y, Shan S, Chen X. Deformation flow based two-stream network for lip reading. In 15th international conference on automatic face and gesture recognition 2020 (pp. 364-70). IEEE.

[40]Lu Y, Li H. Automatic lip-reading system based on deep convolutional neural network and attention-based long short-term memory. Applied Sciences. 2019; 9(8):1-12.

[41]Thangthai K, Harvey RW. Building large-vocabulary speaker-independent lipreading systems. In Interspeech 2018 (pp. 2648-52).

[42]Liu L, Feng G, Beautemps D, Zhang XP. Re-synchronization using the hand preceding model for multi-modal fusion in automatic continuous cued speech recognition. IEEE Transactions on Multimedia. 2020; 23:292-305.

[43]Cuervo S, Grabias M, Chorowski J, Ciesielski G, Łańcucki A, Rychlikowski P, et al. Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words. In international conference on acoustics, speech and signal processing (ICASSP) 2022 (pp. 3189-93). IEEE.

[44]Lin B, Wang L. Learning acoustic frame labeling for phoneme segmentation with regularized attention mechanism. In international conference on acoustics, speech and signal processing 2022 (pp. 7882-6). IEEE.

[45]Wang Y. Research on automatic generation algorithm of phoneme conversion learning corpus based on KNN algorithm. In international conference on image processing and computer applications 2023 (pp. 1624-8). IEEE.

[46]Cooke M, Barker J, Cunningham S, Shao X. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America. 2006; 120(5):2421-4.

[47]Wang X, Mi J, Li B, Zhao Y, Meng J. CATNet: cross-modal fusion for audio–visual speech recognition. Pattern Recognition Letters. 2024; 178:216-22.

[48]Yeo JH, Kim M, Choi J, Kim DH, Ro YM. AKVSR: audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model. IEEE Transactions on Multimedia. 2024; 26:6462-74.

[49]Liu ZT, Rehman A, Wu M, Cao WH, Hao M. Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence. Information Sciences. 2021; 563:309-25.