{"title":"比较数据驱动和手工制作的维度情感识别特征","authors":"Bogdan Vlasenko, Sargam Vyas, Mathew Magimai.-Doss","doi":"10.1109/icassp48485.2024.10446134","DOIUrl":null,"url":null,"abstract":"Speech Emotion Recognition (SER) has garnered significant attention over the past two decades. In the early stages of SER technology, ’brute force’-based techniques led to a significant expansion in knowledge-based acoustic feature representation (FR) for modeling sparse emotional data. However, as deep learning techniques have become more powerful, their direct application has been limited by the scarcity of well-annotated emotional data. As a result, pre-trained neural embeddings on large speech corpora have gained popularity for SER tasks. These embeddings leverage existing transfer learning methods suitable for general-purpose self-supervised learning (SSL) representations. Recent studies on downstream SSL techniques for dimensional SER have shown promising results. In this research, we aim to evaluate the emotion-discriminative characteristics of neural embeddings in general cases (out-of-domain) and when fine-tuned for SER (in-domain). Given that most SSL techniques are pre-trained primarily on English speech, we plan to use speech emotion corpora in both language-matched and mismatched conditions. We will assess the discriminative characteristics of both handcrafted and standalone neural embeddings as FRs.","PeriodicalId":517764,"journal":{"name":"ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"62 8","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparing data-Driven and Handcrafted Features for Dimensional Emotion Recognition\",\"authors\":\"Bogdan Vlasenko, Sargam Vyas, Mathew Magimai.-Doss\",\"doi\":\"10.1109/icassp48485.2024.10446134\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speech Emotion Recognition (SER) has garnered significant attention over the past two decades. In the early stages of SER technology, ’brute force’-based techniques led to a significant expansion in knowledge-based acoustic feature representation (FR) for modeling sparse emotional data. However, as deep learning techniques have become more powerful, their direct application has been limited by the scarcity of well-annotated emotional data. As a result, pre-trained neural embeddings on large speech corpora have gained popularity for SER tasks. These embeddings leverage existing transfer learning methods suitable for general-purpose self-supervised learning (SSL) representations. Recent studies on downstream SSL techniques for dimensional SER have shown promising results. In this research, we aim to evaluate the emotion-discriminative characteristics of neural embeddings in general cases (out-of-domain) and when fine-tuned for SER (in-domain). Given that most SSL techniques are pre-trained primarily on English speech, we plan to use speech emotion corpora in both language-matched and mismatched conditions. 
We will assess the discriminative characteristics of both handcrafted and standalone neural embeddings as FRs.\",\"PeriodicalId\":517764,\"journal\":{\"name\":\"ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)\",\"volume\":\"62 8\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/icassp48485.2024.10446134\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icassp48485.2024.10446134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Speech Emotion Recognition (SER) has garnered significant attention over the past two decades. In the early stages of SER technology, 'brute force'-based techniques led to a significant expansion in knowledge-based acoustic feature representation (FR) for modeling sparse emotional data. However, as deep learning techniques have become more powerful, their direct application has been limited by the scarcity of well-annotated emotional data. As a result, neural embeddings pre-trained on large speech corpora have gained popularity for SER tasks. These embeddings leverage existing transfer learning methods suitable for general-purpose self-supervised learning (SSL) representations. Recent studies on downstream SSL techniques for dimensional SER have shown promising results. In this research, we aim to evaluate the emotion-discriminative characteristics of neural embeddings in general cases (out-of-domain) and when fine-tuned for SER (in-domain). Given that most SSL techniques are pre-trained primarily on English speech, we plan to use speech emotion corpora in both language-matched and mismatched conditions. We will assess the discriminative characteristics of both handcrafted features and standalone neural embeddings as FRs.
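
The abstract contrasts two families of feature representation: handcrafted acoustic descriptors and data-driven SSL embeddings. As a purely illustrative sketch (not the authors' pipeline), the Python snippet below extracts one utterance-level example of each FR, assuming the eGeMAPSv02 functionals from the opensmile package as the handcrafted set and a mean-pooled wav2vec 2.0 checkpoint as the data-driven one; the file path, checkpoint name, and pooling strategy are assumptions made for illustration.

```python
# Hypothetical sketch of the two FR families discussed in the abstract:
# a handcrafted set (eGeMAPSv02 functionals via openSMILE) and a
# data-driven SSL embedding (wav2vec 2.0, mean-pooled over time).
# Checkpoint, audio path, and pooling are illustrative assumptions.

import torch
import torchaudio
import opensmile
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

AUDIO_PATH = "utterance.wav"               # hypothetical mono utterance
SSL_CHECKPOINT = "facebook/wav2vec2-base"  # assumed English-pretrained SSL model

# --- Handcrafted FR: 88 eGeMAPSv02 functionals per utterance ---
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
handcrafted_fr = smile.process_file(AUDIO_PATH).to_numpy().squeeze()  # shape: (88,)

# --- Data-driven FR: mean-pooled wav2vec 2.0 hidden states ---
waveform, sr = torchaudio.load(AUDIO_PATH)
if sr != 16000:  # the SSL model expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

extractor = Wav2Vec2FeatureExtractor.from_pretrained(SSL_CHECKPOINT)
ssl_model = Wav2Vec2Model.from_pretrained(SSL_CHECKPOINT).eval()

inputs = extractor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = ssl_model(**inputs).last_hidden_state   # (1, frames, 768)
ssl_fr = hidden.mean(dim=1).squeeze(0).numpy()       # utterance-level vector, (768,)

print(handcrafted_fr.shape, ssl_fr.shape)
```

Either vector could then feed a simple downstream regressor (e.g., ridge regression or a small MLP) predicting continuous arousal/valence labels, which is one common way to compare FRs for dimensional SER under in-domain vs. out-of-domain and language-matched vs. mismatched conditions; the paper's actual downstream models and evaluation protocol are not specified in the abstract.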