The Development and Status of Speech Recognition: An Overview

Research Article
Open access

Longwei Xiao 1*
  • 1 Beijing University of Posts and Telecommunications
  • *Corresponding author: 2020213690@bupt.edu.cn
Published on 1 August 2023 | https://doi.org/10.54254/2755-2721/8/20230127
ACE Vol.8
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
ISBN (Print): 978-1-915371-63-8
ISBN (Online): 978-1-915371-64-5

Abstract

Speech recognition has made remarkable progress in the last two decades, gradually moving from the laboratory into a wide range of application scenarios. At its core, speech recognition comprises two processes, encoding and decoding; the decoding process in turn relies on two components, an acoustic model and a language model. After more than seventy years of development, speech recognition can be broadly divided into three phases in terms of technical direction: the GMM-HMM era, the DNN-HMM era, and the end-to-end era. Finally, this paper summarizes the open problems and future development directions of speech recognition through a comparative study.
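
For context, the acoustic-model/language-model split mentioned above corresponds to the standard Bayesian decoding formulation used throughout the field (stated here as the textbook framework, not an equation reproduced from this paper): given an acoustic observation sequence $X$, the recognizer searches for the word sequence

$$
W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W),
$$

where the acoustic model supplies $P(X \mid W)$, the language model supplies $P(W)$, and the constant evidence term $P(X)$ is dropped from the maximization.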

Keywords:

speech recognition, DNN, HMM, deep learning

Cite this article

Xiao, L. (2023). The Development and Status of Speech Recognition: An Overview. Applied and Computational Engineering, 8, 192-204.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of the 2023 International Conference on Software Engineering and Machine Learning

ISBN: 978-1-915371-63-8 (Print) / 978-1-915371-64-5 (Online)
Editors: Anil Fernando, Marwan Omar
Conference website: http://www.confseml.org
Conference date: 19 April 2023
Series: Applied and Computational Engineering
Volume number: Vol.8
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish in this series agree to the following terms:
1. Authors retain copyright and grant the series the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the work published in this series (e.g., posting it to an institutional repository or publishing it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as this can lead to productive exchanges, as well as earlier and greater citation of published work (see the Open access policy for details).
