Acoustic Characteristics of Emotional Speech Using Spectrogram Image Classification

Melissa Stola, M. Lech, R. Bolia, Michael Skinner
{"title":"Acoustic Characteristics of Emotional Speech Using Spectrogram Image Classification","authors":"Melissa Stola, M. Lech, R. Bolia, Michael Skinner","doi":"10.1109/ICSPCS.2018.8631752","DOIUrl":null,"url":null,"abstract":"One of the problems limiting the accuracy of speech emotion recognition (SER) is difficulty in the differentiation between acoustically-similar emotions. Since it is not clear how emotions differ in acoustic terms, it is difficult to design new, more efficient SER strategies. In this study, amplitude-frequency analysis of emotional speech was performed to determine relative differences between seven emotional categories of speech in the Berlin Emotional Speech (EMO-DB) database. The analysis transformed short J-second blocks of speech into RGB images of spectrograms using four different frequency scales. The images were used to train a convolutional neural network (CNN) to recognize emotions. By training the network with different combinations of frequency scales and color components of the RGB images that emphasized different frequency and spectral amplitude values, links between different emotions and corresponding amplitude-frequency characteristics of speech were determined.","PeriodicalId":179948,"journal":{"name":"2018 12th International Conference on Signal Processing and Communication Systems (ICSPCS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 12th International Conference on Signal Processing and Communication Systems (ICSPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSPCS.2018.8631752","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14

Abstract

One of the problems limiting the accuracy of speech emotion recognition (SER) is the difficulty of differentiating between acoustically similar emotions. Since it is not clear how emotions differ in acoustic terms, it is difficult to design new, more efficient SER strategies. In this study, amplitude-frequency analysis of emotional speech was performed to determine relative differences between seven emotional categories of speech in the Berlin Emotional Speech (EMO-DB) database. The analysis transformed short J-second blocks of speech into RGB spectrogram images using four different frequency scales. The images were used to train a convolutional neural network (CNN) to recognize emotions. By training the network with different combinations of frequency scales and color components of the RGB images that emphasized different frequency and spectral amplitude values, links between different emotions and the corresponding amplitude-frequency characteristics of speech were determined.
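The core transformation the abstract describes, converting a short block of speech into a spectrogram image suitable for CNN input, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Hann-windowed short-time FFT, uses the mel scale as a stand-in for one of the paper's four (unspecified here) frequency scales, and renders a single grayscale channel rather than the paper's RGB images.

```python
import numpy as np

def mel_scale(f):
    """Hz -> mel, one common perceptual frequency scale (the paper's
    exact four scales are not specified here; this is an assumption)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def spectrogram(signal, win_len=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time FFT.
    Returns an array of shape (freq_bins, time_frames)."""
    window = np.hanning(win_len)
    frames = [signal[i:i + win_len] * window
              for i in range(0, len(signal) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Toy 1-second "speech block": a 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
block = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(block)

# Compress dynamic range and map to 0-255 so the array can be written
# out as one image channel for a CNN.
img = np.log1p(spec)
img = (255 * (img - img.min()) / (img.max() - img.min())).astype(np.uint8)
```

Emphasizing different amplitude ranges, as the paper does through the RGB color components, would amount to applying different nonlinear mappings (in place of the single `log1p` above) before quantizing each channel.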