Ensemble Learning with CNN-LSTM Combination for Speech Emotion Recognition
Abstract
Speech plays the most significant role in communication between people. In addition to carrying emotion, the voice allows a speaker's unique characteristics to be mapped to biometric properties. Emotion is conveyed through many non-linguistic signals that humans use to express themselves. Recognizing emotion in human speech is a challenging task with applications in fields such as healthcare, services, telecommunications, video conferencing, and human-computer interaction (HCI). Deep learning techniques have become a major focus of recent research in the speech emotion recognition (SER) domain. In this paper, we present an ensemble learning approach based on combinations of CNN and LSTM networks to address the limitations of existing SER models. The proposed system is evaluated on the RAVDESS dataset, where the LSTM, CNN, and combined CNN-LSTM models achieve accuracies of 0.64, 0.73, and 0.71, respectively. The simulation results confirm that an ensemble of the three deep models contributes to the effectiveness of SER.
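
The abstract does not give implementation details, so the following is only a minimal sketch of how such an ensemble could be assembled in Python with Keras. It assumes MFCC-style sequence inputs of shape (time_steps, n_features), eight RAVDESS emotion classes, and simple soft voting (averaging class probabilities) across the LSTM, CNN, and CNN-LSTM members; the layer sizes and the voting scheme are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of a CNN/LSTM/CNN-LSTM soft-voting ensemble for SER.
# Input shapes, layer sizes, and the averaging scheme are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

TIME_STEPS, N_FEATURES, N_CLASSES = 128, 40, 8  # assumed feature/label shapes

def build_lstm():
    # Recurrent member: stacked LSTMs over the MFCC sequence.
    return models.Sequential([
        layers.Input(shape=(TIME_STEPS, N_FEATURES)),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])

def build_cnn():
    # Convolutional member: 1-D convolutions over the time axis.
    return models.Sequential([
        layers.Input(shape=(TIME_STEPS, N_FEATURES)),
        layers.Conv1D(64, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])

def build_cnn_lstm():
    # Combined member: convolutional feature extraction followed by an LSTM.
    return models.Sequential([
        layers.Input(shape=(TIME_STEPS, N_FEATURES)),
        layers.Conv1D(64, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])

def ensemble_predict(members, x):
    # Soft voting: average the softmax outputs of all members, then argmax.
    probs = np.mean([m.predict(x, verbose=0) for m in members], axis=0)
    return probs.argmax(axis=1)

if __name__ == "__main__":
    members = [build_lstm(), build_cnn(), build_cnn_lstm()]
    for m in members:
        m.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
        # m.fit(...) would be called here with RAVDESS features and labels.
    # Placeholder batch standing in for extracted RAVDESS features.
    x = np.random.rand(4, TIME_STEPS, N_FEATURES).astype("float32")
    print(ensemble_predict(members, x))

In this sketch each member is trained independently and the ensemble decision is the argmax of the averaged class probabilities; other combination rules (e.g., majority voting or weighted averaging) could be substituted without changing the overall structure.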












