
Speech emotion detection based on MFCC and CNN-LSTM architecture
1 Department of Information and Software Engineering, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, West Hi-tech Zone, Chengdu, Sichuan 611731, China
* Author to whom correspondence should be addressed.
Abstract
Emotion detection techniques have been applied in a variety of settings, drawing mainly on facial image features and vocal audio features; the latter remains the more debated direction, owing both to the complexity of speech audio processing and to the difficulty of extracting appropriate features. For this study, subsets of the SAVEE and RAVDESS datasets are combined into a single corpus of thousands of samples covering seven common emotions (happy, neutral, sad, anger, disgust, fear, and surprise). Using the Librosa package, this paper processes the raw audio into waveplots and spectrograms for analysis and extracts multiple features, with MFCC as the main target. A hybrid CNN-LSTM architecture, consisting mainly of four convolutional layers and three long short-term memory layers, is adopted for its strength in handling sequential data and time series. The architecture achieves an overall accuracy of 61.07% on the test set, with anger and neutral reaching 75.31% and 71.70% respectively. The results also indicate that classification accuracy depends to some extent on the properties of the emotion itself: frequently used emotions with distinctive features are less likely to be misclassified into other categories. Emotions such as surprise, whose meaning depends on the specific context, are more easily confused with positive or negative emotions, and negative emotions also tend to be confused with one another.
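As a concrete illustration of the preprocessing described above, the sketch below loads a clip with Librosa and extracts its MFCC matrix. The file path, sampling rate, number of coefficients, and the time-averaging step are illustrative assumptions, not the exact settings used in the paper.

import librosa
import numpy as np

def extract_mfcc(path, sr=22050, n_mfcc=40):
    # Load the clip (the same waveform also backs the waveplot and
    # spectrogram views) and compute an (n_mfcc x frames) MFCC matrix.
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

# Clips differ in length, so one common simplification (assumed here,
# not taken from the paper) is to average MFCCs over time frames:
features = np.mean(extract_mfcc("happy_sample.wav"), axis=1)  # shape (40,)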
Keywords
emotion detection, CNN-LSTM, MFCC, audio processing.
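The abstract fixes only the layer counts of the hybrid model: four convolutional layers and three LSTM layers feeding a seven-way classifier. The Keras sketch below is one plausible realization under those constraints; all filter counts, kernel sizes, pooling, and optimizer settings are assumptions, since the paper's exact hyperparameters are not given here.

from tensorflow.keras import layers, models

def build_cnn_lstm(time_steps, n_mfcc=40, n_classes=7):
    # Four Conv1D layers scan the MFCC frames for local spectral
    # patterns; three LSTM layers then model the longer-range dynamics.
    model = models.Sequential([
        layers.Input(shape=(time_steps, n_mfcc)),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),  # final LSTM emits a single summary vector
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model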
References
[1]. Ruiz L Z et al. 2017 Human emotion detection through facial expressions for commercial analysis 2017 IEEE 9th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM) pp 1-6
[2]. Felnhofer A et al. 2015 Is virtual reality emotionally arousing? Investigating five emotion inducing virtual park scenarios International Journal of Human-Computer Studies 82 pp 48-56
[3]. Yu J 2014 A video, text, and speech-driven realistic 3-D virtual head for human-machine interface IEEE Transactions on Cybernetics 45(5) pp 991-1002
[4]. Adeyanju I A et al. 2015 Performance evaluation of different support vector machine kernels for face emotion recognition 2015 SAI Intelligent Systems Conference (IntelliSys) pp 804-806
[5]. Vydana H K et al. 2015 Improved emotion recognition using GMM-UBMs 2015 International Conference on Signal Processing and Communication Engineering Systems pp 53-57
[6]. Shahin I et al. 2019 Emotion recognition using hybrid Gaussian mixture model and deep neural network IEEE Access 7 pp 26777-26787
[7]. Eu J L 2019 Surrey Audio-Visual Expressed Emotion (SAVEE) (accessed 2022) https://www.kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee
[8]. Steven R L 2018 RAVDESS Emotional speech audio (accessed 2022) https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio
[9]. McFee B et al. 2015 librosa: Audio and music signal analysis in Python Proceedings of the 14th Python in Science Conference vol 8 pp 18-25
[10]. Logan B 2000 Mel frequency cepstral coefficients for music modeling International Symposium on Music Information Retrieval
[11]. Donahue J et al. 2015 Long-term recurrent convolutional networks for visual recognition and description Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp 2625-2634
[12]. Palo H K et al. 2015 Classification of emotions of angry and disgust SmartCR 5(3) pp 151-158
Cite this article
Ouyang, Q. (2023). Speech emotion detection based on MFCC and CNN-LSTM architecture. Applied and Computational Engineering, 5, 243-249.
Data availability
The datasets used and/or analyzed during the current study are available from the authors upon reasonable request.
Disclaimer/Publisher's Note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
About volume
Volume title: Proceedings of the 3rd International Conference on Signal Processing and Machine Learning
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see Open access policy for details).