International Journal of Advanced Computer Research (IJACR) ISSN (Print): 2249-7277 ISSN (Online): 2277-7970 Volume - 10 Issue - 47 March - 2020
Noise robust speech recognition system using multimodal audio-visual approach using different deep learning classification techniques

Eslam E. El Maghraby, Amr M. Gody and Mohamed Hesham Farouk

Abstract

Multimodal speech recognition has proved to be one of the most promising approaches to designing robust speech recognition systems, especially when the audio signal is corrupted by noise. The visual signal provides additional information that improves recognition accuracy in noisy conditions, because its reliability is not affected by acoustic noise. The critical stages in designing a robust speech recognition system are the choice of an appropriate feature extraction method for both the audio and visual signals and the choice of a reliable classification method from the large variety of existing techniques. This paper proposes an audio-visual speech recognition (AV-ASR) system that uses both audio and visual speech modalities to improve recognition accuracy in clean and noisy environments. The contributions of this paper are twofold. The first is a methodology for choosing the visual features: different feature extraction methods, namely discrete cosine transform (DCT), blocked DCT, and histograms of oriented gradients with local binary patterns (HOG+LBP), are compared, and different dimension reduction techniques, namely principal component analysis (PCA), auto-encoder, linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE), are applied to find the most effective feature vector size. These features are then integrated early with audio features obtained by Mel-frequency cepstral coefficients (MFCCs) and fed into the classification process. The second contribution is a methodology for developing the classification process using deep learning, comparing different deep neural network (DNN) architectures, namely bidirectional long short-term memory (BiLSTM) and convolutional neural network (CNN), with the traditional hidden Markov model (HMM). The effectiveness of the proposed model is demonstrated on two multi-speaker AV-ASR benchmark datasets, AVletters and GRID, at different signal-to-noise ratios (SNRs).
The experiments are speaker-independent on the AVletters dataset and speaker-dependent on the GRID dataset. The experimental results show that early integration of audio features obtained by MFCC with visual features obtained by DCT yields higher recognition accuracy when used with a BiLSTM classifier than the other feature extraction methods and classification techniques. For GRID, the integrated audio-visual features achieved the highest recognition accuracies of 99.13% and 98.47%, an enhancement of up to 9.28% and 12.05% over audio-only for clean and noisy data, respectively. For AVletters, the highest recognition accuracy is 93.33%, an enhancement of up to 8.33% over audio-only. These results improve on previously reported audio-visual recognition accuracies on GRID and AVletters and demonstrate the robustness of our BiLSTM-AV-ASR model compared with CNN and HMM, because BiLSTM takes into account the sequential characteristics of the speech signal.
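
The early integration step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 13 MFCCs per frame, the 6 x 6 block of retained DCT coefficients, the region-of-interest size, and the nearest-frame alignment between video and audio rates are all assumptions chosen for illustration, and random arrays stand in for real MFCCs and mouth images.

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale factor
    return m

def visual_dct_features(frame, keep=6):
    """2-D DCT of a grayscale mouth region; keep the low-frequency
    top-left keep x keep block as the visual feature vector.
    (The block size is an assumption, not taken from the paper.)"""
    h, w = frame.shape
    coeffs = dct2_matrix(h) @ frame @ dct2_matrix(w).T
    return coeffs[:keep, :keep].ravel()

def early_fusion(audio_feats, visual_feats):
    """Concatenate audio and visual features frame by frame.
    Video usually has fewer frames than audio, so each audio frame is
    paired with a video frame by index mapping (an assumed alignment)."""
    n_audio = audio_feats.shape[0]
    idx = np.floor(
        np.linspace(0, visual_feats.shape[0] - 1, n_audio)
    ).astype(int)
    return np.hstack([audio_feats, visual_feats[idx]])

# Stand-in data: 100 audio frames of 13 MFCCs, 25 video frames of 32 x 48 pixels.
rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 13))
frames = rng.random((25, 32, 48))

visual = np.stack([visual_dct_features(f) for f in frames])
fused = early_fusion(audio, visual)
print(fused.shape)  # one fused vector per audio frame
```

Each row of `fused` is a single audio-visual feature vector that a sequence classifier such as a BiLSTM could consume. A dimension reduction step such as PCA would be applied to the visual part before concatenation if a smaller vector is wanted.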

Keywords

AV-ASR, DCT, Blocked DCT, PCA, MFCC, HMM, BiLSTM, CNN, AVletters and GRID.

Cite this article

El Maghraby EE, Gody AM, Farouk MH. Noise robust speech recognition system using multimodal audio-visual approach using different deep learning classification techniques. International Journal of Advanced Computer Research. 2020;10(47):51-71. DOI:10.19101/IJACR.2019.940134
