A Bi-Lingual Speech Emotion Recognition Model through Image Processing on Spectral Features

Author: Xiaoyu Chen

Chen, Xiaoyu, 2023 A Bi-Lingual Speech Emotion Recognition Model through Image Processing on Spectral Features, Flinders University, College of Science and Engineering

Terms of Use: This electronic version is (or will be) made publicly available by Flinders University in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. You may use this material for uses permitted under the Copyright Act 1968. If you are the owner of any included third party copyright material and/or you believe that any material has been made available without permission of the copyright owner please contact copyright@flinders.edu.au with the details.


Speech Emotion Recognition is an emerging research field due to its potential applications in medical fields, commercial settings, and voice assistance development. This thesis provides a comprehensive investigation into Speech Emotion Recognition (SER) with a focus on the influence of various image processing techniques and feature fusion. The study begins with a Background chapter, introducing the significance of SER in human-computer interaction and its potential applications in diverse fields. In the Literature Review, existing research on SER and its challenges is discussed, leading to the identification of gaps: the need for cross-lingual robustness, feature enhancement, and reduced model bias toward specific emotions.

The Aim and Hypothesis chapters articulate the study's objectives and the hypothesis that different image processing techniques and feature fusion can enhance SER model performance. The methodology utilised various tools and packages, a diverse dataset, and specific acoustic features, such as Mel spectrograms and MFCCs, alongside image processing techniques including DoG, Sobel, and CLAHE filters. The SER model used in this thesis was TIM-Net, trained on ESD, a parallel bilingual dataset, using 10-fold cross-validation. Performance was compared using accuracy, average recall, and confusion matrices, with statistical significance indicated by a p-value below 0.05 between baseline and other results.
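The acoustic front end described above can be sketched in a few lines. The following is a minimal NumPy/SciPy illustration of Mel-spectrogram and MFCC extraction, not the thesis implementation; the sampling rate, frame length, hop size, and filter-bank size are assumed values chosen for the example.

```python
import numpy as np
from scipy.fftpack import dct

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Log-Mel spectrogram: STFT power followed by a triangular Mel filter bank."""
    # Frame the signal and apply a Hann window before the FFT
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2

    # Triangular Mel filter bank (HTK-style Mel scale)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return np.log(spec @ fb.T + 1e-10)                        # (frames, n_mels)

def mfcc(signal, sr=16000, n_mfcc=13):
    """MFCCs: DCT-II of the log-Mel spectrogram, keeping the low-order terms."""
    return dct(mel_spectrogram(signal, sr), type=2, axis=1, norm="ortho")[:, :n_mfcc]

# Example: 1 s synthetic 440 Hz tone in place of a speech recording
t = np.linspace(0, 1, 16000, endpoint=False)
feats = mfcc(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (frames, 13)
```

Both representations are two-dimensional time-frequency arrays, which is what allows them to be treated as images and passed through the filters discussed in this thesis.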

The Results chapter reveals significant findings. Visual analysis highlights spectral differences between emotional and neutral speech, particularly across languages. DoG filtering enhances model accuracy but also increases confusion between certain emotions. In contrast, Sobel and CLAHE filtering reduce accuracy and widen language-based disparities, possibly because over-processing removes subtle but relevant spectral information.
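The three filters compared above can be illustrated on a toy spectrogram-like array. This is a generic SciPy sketch, not the thesis code: DoG and Sobel are standard implementations, while the CLAHE step is simplified here to global histogram equalisation (true CLAHE adds tiling and a contrast clip limit, as in OpenCV's cv2.createCLAHE).

```python
import numpy as np
from scipy import ndimage

def dog(img, sigma_lo=1.0, sigma_hi=2.0):
    """Difference-of-Gaussians: a band-pass that accentuates blobs and edges."""
    return ndimage.gaussian_filter(img, sigma_lo) - ndimage.gaussian_filter(img, sigma_hi)

def sobel_magnitude(img):
    """Sobel gradient magnitude: responds strongly at sharp spectral transitions."""
    gx = ndimage.sobel(img, axis=1)   # gradient along time
    gy = ndimage.sobel(img, axis=0)   # gradient along frequency
    return np.hypot(gx, gy)

def hist_equalise(img, n_bins=256):
    """Global histogram equalisation (simplified stand-in for CLAHE)."""
    flat = img.ravel()
    hist, edges = np.histogram(flat, bins=n_bins)
    cdf = hist.cumsum() / flat.size
    centers = (edges[:-1] + edges[1:]) / 2
    return np.interp(flat, centers, cdf).reshape(img.shape)

# Toy "spectrogram": a smooth energy ramp plus one sharp horizontal band
spec = np.tile(np.linspace(0, 1, 128), (64, 1))
spec[30:34, :] += 1.0

band_img = dog(spec)              # emphasises the band's extent
edges_img = sobel_magnitude(spec)  # emphasises the band's boundaries
eq_img = hist_equalise(spec)       # redistributes intensity values
```

The trade-off noted in the results is visible even in this sketch: the edge-based filters amplify the sharp band but flatten the smooth ramp, which is one plausible mechanism for the loss of subtle spectral cues under heavier processing.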

In conclusion, the model trained with the fused features displayed the best average accuracies and more balanced predictions across emotions and languages. As future directions, the spectral features of different emotions could be analysed in more depth, and each filter could be fine-tuned to balance feature enhancement against its potential drawbacks. Building on this visual understanding of the spectral features, a multi-stage customised adaptive filter could also be developed to enhance the most relevant features. Before applying the model to different applications, it should be trained with real-world data to further improve its robustness to noise and its cross-lingual generalisability.

Keywords: Speech Emotion Recognition, MFCC, Edge Detection, Mel Spectrogram, DoG, Sobel, CLAHE

Subject: Medical Biotechnology thesis

Thesis type: Masters
Completed: 2023
School: College of Science and Engineering
Supervisor: Russell Brinkworth