International Journal of Advanced Technology and Engineering Exploration (IJATEE) ISSN (Print): 2394-5443 ISSN (Online): 2394-7454 Volume - 11 Issue - 121 December - 2024

Lip2Voice: a sequence-to-sequence visual speech recognition system for predicting speech from silent video inputs

Aathira Pillai, Bhavana Mache and Supriya Kelkar

Abstract

Lip reading is the understanding of speech without reliance on auditory input; it benefits individuals with speech impairments by enabling them to participate in social activities. In this work, a visual speech recognition (VSR) system was developed using the sequence-to-sequence (seq2seq) learning paradigm. The proposed model used a spatio-temporal encoder to capture the sequence of lip movements, complemented by a decoder that generated high-quality speech. The predicted mel spectrogram was converted back to a waveform using the Griffin-Lim algorithm. In addition, an inference module enabled the generation of fixed-length speech from input videos of varying lengths. A training method termed "alternative training" was adopted so that the model prioritized the sentence content over speaker-specific characteristics, leading to faster convergence. The model achieved a training loss of 36.6% on the dual-speaker dataset and reduced the word error rate (WER) by 10% compared with the Vid2Speech model. A human subjective evaluation was conducted on five audio sets using two metrics, audibility and mispronunciation; the results showed that Lip2Voice had a lower overall percentage error than Vid2Speech. A comparative analysis of the proposed and existing models, based on audio spectrograms and frequency-domain waveforms evaluated with power spectral density (PSD), demonstrated the similarity between the spectrograms generated by the Lip2Voice model and the original audio. This research indicates that computer-based lip-reading systems for people with speech impairments are attainable.
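As a concrete illustration of the reconstruction step described above, the sketch below inverts a predicted mel spectrogram to a waveform with the Griffin-Lim algorithm. It assumes the librosa library; the sampling rate, FFT size, hop length, and iteration count are placeholder values, not the paper's settings.

```python
import librosa

def mel_to_waveform(mel_db, sr=16000, n_fft=512, hop_length=160, n_iter=60):
    """Invert a log-power mel spectrogram (n_mels x frames), such as a
    seq2seq decoder might predict, back to audio via Griffin-Lim.
    All parameter values here are illustrative assumptions."""
    mel_power = librosa.db_to_power(mel_db)  # undo log compression
    # mel_to_audio maps the mel spectrogram back to a linear-frequency
    # magnitude spectrogram and runs Griffin-Lim phase estimation.
    return librosa.feature.inverse.mel_to_audio(
        mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=n_iter)
```

The PSD-based comparison used in the evaluation can likewise be sketched with Welch's method from SciPy; `original` and `generated` are hypothetical names for two waveforms at the same sampling rate, and the correlation score is one plausible similarity measure, not necessarily the paper's.

```python
import numpy as np
from scipy.signal import welch

def psd_similarity(original, generated, sr=16000, nperseg=1024):
    """Correlate the power spectral densities of two waveforms,
    yielding a rough similarity score in [-1, 1]."""
    n = min(len(original), len(generated))
    _, psd_ref = welch(original[:n], fs=sr, nperseg=nperseg)
    _, psd_gen = welch(generated[:n], fs=sr, nperseg=nperseg)
    # Compare log-PSDs so low-energy bands are not swamped.
    a, b = np.log10(psd_ref + 1e-12), np.log10(psd_gen + 1e-12)
    return float(np.corrcoef(a, b)[0, 1])
```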

Keywords

Lip reading, Visual speech recognition, Spatio-temporal encoder, Alternative training, Speech impairments, Word error rate.

Cite this article

Pillai A, Mache B, Kelkar S. Lip2Voice: a sequence-to-sequence visual speech recognition system for predicting speech from silent video inputs. International Journal of Advanced Technology and Engineering Exploration. 2024;11(121):1747-1767. DOI: 10.19101/IJATEE.2023.10102613

References

[1]Prajwal KR, Mukhopadhyay R, Namboodiri VP, Jawahar CV. Learning individual speaking styles for accurate lip to speech synthesis. In proceedings of the conference on computer vision and pattern recognition 2020 (pp. 13796-805). IEEE.

[2]Hassanat AB. Visual speech recognition. Speech and Language Technologies. 2011; 1:279-303.

[3]Chung JS, Senior A, Vinyals O, Zisserman A. Lip reading sentences in the wild. In proceedings of the conference on computer vision and pattern recognition 2017 (pp. 6447-56). IEEE.

[4]Fernandez-Lopez A, Karaali A, Harte N, Sukno FM. CoGANs for unsupervised visual speech adaptation to new speakers. In international conference on acoustics, speech and signal processing (ICASSP) 2020 (pp. 6294-8). IEEE.

[5]Hao M, Mamut M, Yadikar N, Aysa A, Ubul K. A survey of research on lipreading technology. IEEE Access. 2020; 8:204518-44.

[6]Ma P, Wang Y, Shen J, Petridis S, Pantic M. Lip-reading with densely connected temporal convolutional networks. In proceedings of the winter conference on applications of computer vision 2021 (pp. 2857-66). IEEE.

[7]Afouras T, Chung JS, Zisserman A. ASR is all you need: cross-modal distillation for lip reading. In international conference on acoustics, speech and signal processing 2020 (pp. 2143-7). IEEE.

[8]Ephrat A, Halperin T, Peleg S. Improved speech reconstruction from silent video. In proceedings of the international conference on computer vision workshops 2017 (pp. 455-62). IEEE.

[9]Gao W, Hashemi-Sakhtsari A, McDonnell MD. End-to-end phoneme recognition using models from semantic image segmentation. In international joint conference on neural networks 2020 (pp. 1-7). IEEE.

[10]Sarhan AM, Elshennawy NM, Ibrahim DM. HLR-net: a hybrid lip-reading model based on deep convolutional neural networks. Computers, Materials and Continua. 2021; 68(2):1531-49.

[11]Guan C, Wang S, Liew AW. Lip image segmentation based on a fuzzy convolutional neural network. IEEE Transactions on Fuzzy Systems. 2019; 28(7):1242-51.

[12]Tsourounis D, Kastaniotis D, Fotopoulos S. Lip reading by alternating between spatiotemporal and spatial convolutions. Journal of Imaging. 2021; 7(5):1-17.

[13]Tao F, Busso C. End-to-end audiovisual speech recognition system with multitask learning. IEEE Transactions on Multimedia. 2020; 23:1-11.

[14]Xu K, Li D, Cassimatis N, Wang X. LCANet: end-to-end lipreading with cascaded attention-CTC. In 13th international conference on automatic face & gesture recognition 2018 (pp. 548-55). IEEE.

[15]Burchi M, Timofte R. Audio-visual efficient conformer for robust speech recognition. In proceedings of the winter conference on applications of computer vision 2023 (pp. 2258-67). IEEE.

[16]Serdyuk D, Braga O, Siohan O. Audio-visual speech recognition is worth 32×32×8 voxels. In automatic speech recognition and understanding workshop 2021 (pp. 796-802). IEEE.

[17]Shilaskar S, Iramani H. CTC-CNN-bidirectional LSTM based lip reading system. In international conference on emerging smart computing and informatics 2024 (pp. 1-6). IEEE.

[18]Kuriakose LK, Sinciya PO, Joseph MR, Namita R, Nabi S, Lone TA. Dip into: a novel method for visual speech recognition using deep learning. In annual international conference on emerging research areas: international conference on intelligent systems 2023 (pp. 1-6). IEEE.

[19]Burchi M, Puvvada KC, Balam J, Ginsburg B, Timofte R. Multilingual audio-visual speech recognition with hybrid CTC/RNN-T fast conformer. In international conference on acoustics, speech and signal processing 2024 (pp. 10211-5). IEEE.

[20]Liu X, Lakomkin E, Vougioukas K, Ma P, Chen H, Xie R, et al. SynthVSR: scaling up visual speech recognition with synthetic supervision. In proceedings of the conference on computer vision and pattern recognition 2023 (pp. 18806-15). IEEE.

[21]Zhang JX, Ling ZH, Liu LJ, Jiang Y, Dai LR. Sequence-to-sequence acoustic modeling for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019; 27(3):631-44.

[22]Zhang X, Cheng F, Wang S. Spatio-temporal fusion based convolutional sequence learning for lip reading. In proceedings of the international conference on computer vision 2019 (pp. 713-22). IEEE.

[23]Afouras T, Chung JS, Senior A, Vinyals O, Zisserman A. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018; 44(12):8717-27.

[24]Sterpu G, Saam C, Harte N. Attention-based audio-visual fusion for robust automatic speech recognition. In proceedings of the 20th international conference on multimodal interaction 2018 (pp. 111-5). ACM.

[25]Luo M, Yang S, Shan S, Chen X. Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading. In 15th international conference on automatic face and gesture recognition 2020 (pp. 273-80). IEEE.

[26]Ephrat A, Peleg S. Vid2speech: speech reconstruction from silent video. In international conference on acoustics, speech and signal processing 2017 (pp. 5095-9). IEEE.

[27]Arthur FV, Csapó TG. Towards a practical lip-to-speech conversion system using deep neural networks and mobile application frontend. In the international conference on artificial intelligence and computer vision 2021 (pp. 441-50). Cham: Springer International Publishing.

[28]Akbari H, Arora H, Cao L, Mesgarani N. Lip2audspec: speech reconstruction from silent lip movements video. In international conference on acoustics, speech and signal processing (ICASSP) 2018 (pp. 2516-20). IEEE.

[29]Shandiz AH, Tóth L, Gosztolya G, Markó A, Csapó TG. Improving neural silent speech interface models by adversarial training. In the international conference on artificial intelligence and computer vision 2021 (pp. 430-40). Cham: Springer International Publishing.

[30]Prajwal KR, Afouras T, Zisserman A. Sub-word level lip reading with visual attention. In proceedings of the conference on computer vision and pattern recognition 2022 (pp. 5162-72). IEEE.

[31]Ivanko D, Ryumin D, Markitantov M. End-to-end visual speech recognition for human-robot interaction. In AIP conference proceedings 2024. AIP Publishing.

[32]Bhaskar S, Thasleema TM. LSTM model for visual speech recognition through facial expressions. Multimedia Tools and Applications. 2023; 82(4):5455-72.

[33]Ajitha D, Dutta D, Saha F, Giri P, Kant R. AI LipReader-transcribing speech from lip movements. In international conference on emerging smart computing and informatics 2024 (pp. 1-6). IEEE.

[34]Adeel A, Gogate M, Hussain A, Whitmer WM. Lip-reading driven deep learning approach for speech enhancement. IEEE Transactions on Emerging Topics in Computational Intelligence. 2019; 5(3):481-90.

[35]Hou JC, Wang SS, Lai YH, Tsao Y, Chang HW, Wang HM. Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in Computational Intelligence. 2018; 2(2):117-28.

[36]Thimmaraja YG, Nagaraja BG, Jayanna HS. Speech enhancement and encoding by combining SS-VAD and LPC. International Journal of Speech Technology. 2021; 24(1):165-72.

[37]Sadeghi M, Leglaive S, Alameda-Pineda X, Girin L, Horaud R. Audio-visual speech enhancement using conditional variational auto-encoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2020; 28:1788-800.

[38]Patilkulkarni S. Visual speech recognition for small scale dataset using VGG16 convolution neural network. Multimedia Tools and Applications. 2021; 80(19):28941-52.

[39]Xiao J, Yang S, Zhang Y, Shan S, Chen X. Deformation flow based two-stream network for lip reading. In 15th international conference on automatic face and gesture recognition 2020 (pp. 364-70). IEEE.

[40]Lu Y, Li H. Automatic lip-reading system based on deep convolutional neural network and attention-based long short-term memory. Applied Sciences. 2019; 9(8):1-12.

[41]Thangthai K, Harvey RW. Building large-vocabulary speaker-independent lipreading systems. In Interspeech 2018 (pp. 2648-52).

[42]Liu L, Feng G, Beautemps D, Zhang XP. Re-synchronization using the hand preceding model for multi-modal fusion in automatic continuous cued speech recognition. IEEE Transactions on Multimedia. 2020; 23:292-305.

[43]Cuervo S, Grabias M, Chorowski J, Ciesielski G, Łańcucki A, Rychlikowski P, et al. Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words. In international conference on acoustics, speech and signal processing (ICASSP) 2022 (pp. 3189-93). IEEE.

[44]Lin B, Wang L. Learning acoustic frame labeling for phoneme segmentation with regularized attention mechanism. In international conference on acoustics, speech and signal processing 2022 (pp. 7882-6). IEEE.

[45]Wang Y. Research on automatic generation algorithm of phoneme conversion learning corpus based on KNN algorithm. In international conference on image processing and computer applications 2023 (pp. 1624-8). IEEE.

[46]Cooke M, Barker J, Cunningham S, Shao X. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America. 2006; 120(5):2421-4.

[47]Wang X, Mi J, Li B, Zhao Y, Meng J. CATNet: cross-modal fusion for audio–visual speech recognition. Pattern Recognition Letters. 2024; 178:216-22.

[48]Yeo JH, Kim M, Choi J, Kim DH, Ro YM. AKVSR: audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model. IEEE Transactions on Multimedia. 2024; 26:6462-74.

[49]Liu ZT, Rehman A, Wu M, Cao WH, Hao M. Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence. Information Sciences. 2021; 563:309-25.