International Journal of Advanced Computer Research (IJACR), ISSN (Print): 2249-7277, ISSN (Online): 2277-7970, Vol. 10, Issue 47, March 2020
Noise robust speech recognition system using multimodal audio-visual approach using different deep learning classification techniques

Eslam E. El Maghraby, Amr M. Gody and Mohamed Hesham Farouk

Abstract

Multimodal speech recognition has proved to be one of the most promising approaches for designing robust speech recognition systems, especially when the audio signal is corrupted by noise. The visual signal can provide complementary information that enhances recognition accuracy under noisy conditions, since its reliability is not affected by acoustic noise. The critical stages in designing a robust speech recognition system are the choice of an appropriate feature extraction method for both the audio and visual signals and the choice of a reliable classification method from the large variety of existing techniques. This paper proposes an Audio-Visual Speech Recognition (AV-ASR) system that uses both audio and visual speech modalities to improve recognition accuracy in clean and noisy environments. The contributions of this paper are twofold. The first is a methodology for choosing the visual features: different feature extraction methods, such as the discrete cosine transform (DCT), blocked DCT, and histograms of oriented gradients combined with local binary patterns (HOG+LBP), are compared, and different dimensionality reduction techniques, such as principal component analysis (PCA), auto-encoders, linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE), are applied to find the most effective feature vector size. These visual features are then early-integrated with audio features obtained from Mel-frequency cepstral coefficients (MFCCs) and fed into the classification process. The second contribution is a methodology for developing the classification process using deep learning, comparing different deep neural network (DNN) architectures, such as bidirectional long short-term memory (BiLSTM) and convolutional neural networks (CNN), with traditional hidden Markov models (HMM). The effectiveness of the proposed model is demonstrated on two multi-speaker AV-ASR benchmark datasets, AVletters and GRID, at different SNRs. The experiments are speaker-independent for the AVletters dataset and speaker-dependent for the GRID dataset. The experimental results show that early integration of audio features obtained by MFCC with visual features obtained by DCT yields higher recognition accuracy when used with a BiLSTM classifier than the other feature extraction and classification techniques. For GRID, the integrated audio-visual features achieved the highest recognition accuracies of 99.13% and 98.47%, improvements of up to 9.28% and 12.05% over audio-only recognition for clean and noisy data respectively. For AVletters, the highest recognition accuracy is 93.33%, an improvement of up to 8.33% over audio-only recognition. These results improve upon previously reported audio-visual recognition accuracies on GRID and AVletters, and they demonstrate the robustness of our BiLSTM AV-ASR model compared with CNN and HMM, because the BiLSTM takes into account the sequential characteristics of the speech signal.
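
To make the early-integration scheme concrete, the following is a minimal sketch of the pipeline the abstract describes: 2-D DCT coefficients from grayscale mouth regions are reduced by PCA, aligned to the audio frame rate, concatenated with MFCC vectors, and passed to a BiLSTM classifier. The library choices (SciPy, scikit-learn, PyTorch), array shapes, frame-alignment strategy, and layer sizes are illustrative assumptions for exposition, not the authors' implementation (the paper itself references MATLAB and HTK).

```python
# Illustrative sketch of early audio-visual integration, not the authors' code.
import numpy as np
from scipy.fftpack import dct
from sklearn.decomposition import PCA
import torch
import torch.nn as nn

def visual_features(mouth_rois, keep=35):
    """2-D DCT of each grayscale mouth ROI; keep a low-frequency block
    of coefficients (block size and `keep` are assumed values)."""
    feats = []
    for roi in mouth_rois:                      # roi: (H, W) grayscale frame
        c = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
        feats.append(c[:8, :8].ravel()[:keep])  # top-left 8x8 block
    return np.stack(feats)                      # (T_video, keep)

def early_integration(mfcc, dct_feats, n_components=20):
    """Reduce visual features with PCA, upsample video frames to the audio
    frame rate by repetition, then concatenate frame-by-frame."""
    vis = PCA(n_components=n_components).fit_transform(dct_feats)
    rep = int(np.ceil(len(mfcc) / len(vis)))    # assumed integer rate ratio
    vis = np.repeat(vis, rep, axis=0)[:len(mfcc)]
    return np.hstack([mfcc, vis])               # (T_audio, n_mfcc + n_comp)

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over the fused feature sequence, one label per
    utterance (e.g. 26 letter classes for AVletters)."""
    def __init__(self, input_dim, hidden=128, n_classes=26):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (B, T, input_dim)
        h, _ = self.lstm(x)                      # h: (B, T, 2*hidden)
        return self.out(h[:, -1])                # last time step -> logits
```

Under this kind of frame-level concatenation, a single sequence model can weigh the noise-robust visual stream against the corrupted audio stream, which is what motivates the early-integration design studied in the paper.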

Keywords

AV-ASR, DCT, Blocked DCT, PCA, MFCC, HMM, BiLSTM, CNN, AVletters, GRID.

Cite this article

El Maghraby EE, Gody AM, Farouk MH. Noise robust speech recognition system using multimodal audio-visual approach using different deep learning classification techniques. International Journal of Advanced Computer Research. 2020; 10(47).

References

[1] Tao F, Busso C. Lipreading approach for isolated digits recognition under whisper and neutral speech. In fifteenth annual conference of the international speech communication association 2014 (pp. 1154-8).

[2] Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002; 24(2):198-213.

[3] Zhao G, Barnard M, Pietikainen M. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia. 2009; 11(7):1254-65.

[4] Petajan ED. Automatic lipreading to enhance speech recognition (Speech Reading). 1985.

[5] Neti C, Potamianos G, Luettin J, Matthews I, Glotin H, Vergyri D, et al. Audio-visual speech recognition final workshop report. Center for Language and Speech Processing, Johns Hopkins University, Baltimore. 2000.

[6] Potamianos G, Graf HP, Cosatto E. An image transform approach for HMM based automatic lipreading. In proceedings of the international conference on image processing, ICIP98 (Cat. No. 98CB36269) 1998 (pp. 173-7). IEEE.

[7] Potamianos G, Neti C, Iyengar G, Senior AW, Verma A. A cascade visual front end for speaker independent automatic speechreading. International Journal of Speech Technology. 2001; 4(3-4):193-208.

[8] Noda K, Yamaguchi Y, Nakadai K, Okuno HG, Ogata T. Lipreading using convolutional neural network. In fifteenth annual conference of the international speech communication association 2014 (pp. 1149-53).

[9] Chowdhary CL. Application of object recognition with shape-index identification and 2D scale invariant feature transform for key-point detection. In feature dimension reduction for content-based image identification 2018 (pp. 218-31). IGI Global.

[10] Chan MT. HMM-based audio-visual speech recognition integrating geometric- and appearance-based visual features. In fourth workshop on multimedia signal processing (Cat. No. 01TH8564) 2001 (pp. 9-14). IEEE.

[11] McGurk H, MacDonald J. Hearing lips and seeing voices. Nature. 1976; 264:746-8.

[12] Potamianos G, Neti C, Gravier G, Garg A, Senior AW. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE. 2003; 91(9):1306-26.

[13] El Maghraby EE, Gody AM, Farouk MH. Enhancing quality and accuracy of speech recognition system by using multimodal audio-visual speech signal. In international computer engineering conference 2016 (pp. 219-29). IEEE.

[14] Salama ES, El-Khoribi RA, Shoman ME. Audio-visual speech recognition for people with speech disorders. International Journal of Computer Applications. 2014; 96(2):51-6.

[15] Schmidhuber J. Deep learning in neural networks: an overview. Neural Networks. 2015; 61:85-117.

[16] Hinton G, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine. 2012; 29(6):82-97.

[17] Petridis S, Pantic M. Deep complementary bottleneck features for visual speech recognition. In international conference on acoustics, speech and signal processing 2016 (pp. 2304-8). IEEE.

[18] Zhang F, Li W, Zhang Y, Feng Z. Data driven feature selection for machine learning algorithms in computer vision. IEEE Internet of Things Journal. 2018; 5(6):4262-72.

[19] Koller O, Ney H, Bowden R. Deep learning of mouth shapes for sign language. In proceedings of the international conference on computer vision workshops 2015 (pp. 477-83).

[20] Goldschen AJ, Garcia ON, Petajan ED. Continuous automatic speech recognition by lipreading. In motion-based recognition 1997 (pp. 321-43). Springer, Dordrecht.

[21] Tamura S, Ninomiya H, Kitaoka N, Osuga S, Iribe Y, Takeda K, et al. Audio-visual speech recognition using deep bottleneck features and high-performance lipreading. In Asia-Pacific signal and information processing association annual summit and conference 2015 (pp. 575-82). IEEE.

[22] Galatas G, Potamianos G, Makedon F. Audio-visual speech recognition incorporating facial depth information captured by the Kinect. In proceedings of the European signal processing conference 2012 (pp. 2714-7). IEEE.

[23] Noda K, Yamaguchi Y, Nakadai K, Okuno HG, Ogata T. Audio-visual speech recognition using deep learning. Applied Intelligence. 2015; 42:722-37.

[24] Mroueh Y, Marcheret E, Goel V. Deep multimodal learning for audio-visual speech recognition. In international conference on acoustics, speech and signal processing 2015 (pp. 2130-4). IEEE.

[25] Chowdhary CL, Darwish A, Hassanien AE. Cognitive deep learning: future direction in intelligent retrieval. In handbook of research on deep learning innovations and trends 2019 (pp. 220-31). IGI Global.

[26] Petridis S, Stafylakis T, Ma P, Cai F, Tzimiropoulos G, Pantic M. End-to-end audiovisual speech recognition. In international conference on acoustics, speech and signal processing 2018 (pp. 6548-52). IEEE.

[27] Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105. 2017.

[28] Feng W, Guan N, Li Y, Zhang X, Luo Z. Audio visual speech recognition with multimodal recurrent neural networks. In international joint conference on neural networks 2017 (pp. 681-8). IEEE.

[29] Ephrat A, Peleg S. Vid2speech: speech reconstruction from silent video. In IEEE international conference on acoustics, speech and signal processing 2017 (pp. 5095-9). IEEE.

[30] Cooke M, Barker J, Cunningham S, Shao X. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America. 2006; 120(5):2421-4.

[31] James PE, Mun HK, Vaithilingam CA. A hybrid spoken language processing system for smart device troubleshooting. Electronics. 2019; 8(6):1-16.

[32] Graves A, Fernández S, Schmidhuber J. Bidirectional LSTM networks for improved phoneme classification and recognition. In international conference on artificial neural networks 2005 (pp. 799-804). Springer, Berlin, Heidelberg.

[33] Wand M, Koutník J, Schmidhuber J. Lipreading with long short-term memory. In international conference on acoustics, speech and signal processing 2016 (pp. 6115-9). IEEE.

[34] Chung JS, Senior A, Vinyals O, Zisserman A. Lip reading sentences in the wild. In conference on computer vision and pattern recognition 2017 (pp. 3444-53). IEEE.

[35] Thanda A, Venkatesan SM. Audio visual speech recognition using deep recurrent neural networks. In IAPR workshop on multimodal pattern recognition of social signals in human-computer interaction 2016 (pp. 98-109). Springer, Cham.

[36] Assael YM, Shillingford B, Whiteson S, de Freitas N. LipNet: sentence-level lipreading. In GPU technology conference 2016.

[37] Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In proceedings of the international conference on machine learning 2006 (pp. 369-76).

[38] Barker J, Vincent E, Ma N, Christensen H, Green P. The PASCAL CHiME speech separation and recognition challenge. Computer Speech & Language. 2013; 27(3):621-33.

[39] Gan T, Menzel W, Yang S. An audio-visual speech recognition framework based on articulatory features. In auditory-visual speech processing 2007.

[40] Cornu TL, Milner B. Reconstructing intelligible audio speech from visual speech features. In sixteenth annual conference of the international speech communication association 2015 (pp. 3355-9).

[41] Bear HL, Harvey R. Decoding visemes: improving machine lip-reading. In international conference on acoustics, speech and signal processing 2016 (pp. 2009-13). IEEE.

[42] http://www.mathworks.com. Accessed 20 October 2019.

[43] https://www.phon.ucl.ac.uk/. Accessed 20 October 2019.

[44] http://www.opencv.org/. Accessed 20 October 2019.

[45] Jensen OH. Implementing the Viola-Jones face detection algorithm. Master's thesis, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark. 2008.

[46] Potamianos G, Scanlon P. Exploiting lower face symmetry in appearance-based automatic speechreading. In AVSP 2005 (pp. 79-84).

[47] Estellers V, Thiran JP. Multi-pose lipreading and audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing. 2012.

[48] Nefian AV, Liang L, Pi X, Liu X, Murphy K. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing. 2002.

[49] Lowe DG. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 2004; 60(2):91-110.

[50] Albiol A, Monzo D, Martin A, Sastre J, Albiol A. Face recognition using HOG–EBGM. Pattern Recognition Letters. 2008; 29(10):1537-43.

[51] Ghorbani M, Targhi AT, Dehshibi MM. HOG and LBP: towards a robust face recognition system. In tenth international conference on digital information management 2015 (pp. 138-41). IEEE.

[52] Tiwari V. MFCC and its applications in speaker recognition. International Journal on Emerging Technologies. 2010; 1(1):19-22.

[53] Young S, Evermann G, Gales M, Hain T, Kershaw D, Liu X, et al. The HTK book. Cambridge University Engineering Department. 2006.

[54] Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks. 1994; 5(2):157-66.

[55] Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing. 1997; 45(11):2673-81.

[56] Graves A, Schmidhuber J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks. 2005; 18(5-6):602-10.

[57] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997; 9(8):1735-80.

[58] Graves A, Mohamed AR, Hinton G. Speech recognition with deep recurrent neural networks. In international conference on acoustics, speech and signal processing 2013 (pp. 6645-9). IEEE.

[59] Graves A, Liwicki M, Bunke H, Schmidhuber J, Fernández S. Unconstrained on-line handwriting recognition with recurrent neural networks. In advances in neural information processing systems 2008 (pp. 577-84).

[60] Graves A, Liwicki M, Fernández S, Bertolami R, Bunke H, Schmidhuber J. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2008; 31(5):855-68.

[61] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In advances in neural information processing systems 2012 (pp. 1097-105).