CoordViT: A Novel Method of Improve Vision Transformer-Based Speech Emotion Recognition using Coordinate Information Concatenate

2023 International Conference on Electronics, Information, and Communication (ICEIC). Pub Date: 2023-02-05. DOI: 10.1109/ICEIC57457.2023.10049941

Jeongho Kim, Seung-Ho Lee

Citations: 1

Abstract

Recently, in speech emotion recognition, a Transformer-based method that uses spectrogram images instead of raw sound data has shown higher accuracy than Convolutional Neural Networks (CNNs). Vision Transformer (ViT), a Transformer-based method, achieves high classification accuracy by dividing the input image into patches, but it has a drawback: pixel position information is not retained through embedding layers such as the linear projection. In this paper, we therefore propose a novel method for improving ViT-based speech emotion recognition by concatenating coordinate information. Because the proposed method retains pixel position information by concatenating coordinate information to the input image, accuracy on CREMA-D improves markedly, with a reported figure of 82.96% relative to the state of the art on that dataset. These results show that the coordinate information concatenation proposed in this paper is effective not only for CNNs but also for Transformers.
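The abstract does not spell out how the coordinates are encoded, so the following is only a minimal sketch of the general idea of concatenating coordinate information to an input image before it is split into ViT patches. The function names `add_coord_channels` and `to_patches`, the [-1, 1] coordinate scaling, the 224 x 224 x 3 input size, and the patch size of 16 are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def add_coord_channels(image: np.ndarray) -> np.ndarray:
    """Append row and column coordinate channels to an H x W x C image.

    Assumption: coordinates are scaled linearly to [-1, 1].
    """
    h, w = image.shape[:2]
    ys = np.linspace(-1.0, 1.0, h, dtype=image.dtype)  # row coordinate
    xs = np.linspace(-1.0, 1.0, w, dtype=image.dtype)  # column coordinate
    yy, xx = np.meshgrid(ys, xs, indexing="ij")        # each is H x W
    # Stack the two coordinate maps as extra channels: H x W x (C + 2).
    return np.concatenate([image, yy[..., None], xx[..., None]], axis=-1)


def to_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an H x W x C image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Reorder to (patch_row, patch_col, y, x, channel), then flatten each patch.
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)


# Stand-in for a spectrogram image (random values, illustration only).
spec = np.random.default_rng(0).random((224, 224, 3), dtype=np.float32)
with_coords = add_coord_channels(spec)  # shape (224, 224, 5)
patches = to_patches(with_coords, 16)   # shape (196, 1280)
```

In this sketch every flattened patch carries the absolute position of each of its pixels, so that information is available to the linear projection that follows, rather than relying on the patch-level position embedding alone.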
