视听影响相关关系的时间条件Wasserstein gan

2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) Pub Date : 2021-09-28 DOI:10.1109/aciiw52867.2021.9666277

C. Athanasiadis, E. Hortal, Stelios Asteriadis

{"title":"视听影响相关关系的时间条件Wasserstein gan","authors":"C. Athanasiadis, E. Hortal, Stelios Asteriadis","doi":"10.1109/aciiw52867.2021.9666277","DOIUrl":null,"url":null,"abstract":"Emotion recognition through audio is a rather challenging task that entails proper feature extraction and classification. Meanwhile, state-of-the-art classification strategies are usually based on deep learning architectures. Training complex deep learning networks normally requires very large audiovisual corpora with available emotion annotations. However, such availability is not always guaranteed since harvesting and annotating such datasets is a time-consuming task. In this work, temporal conditional Wasserstein Generative Adversarial Networks (tc-wGANs) are introduced to generate robust audio data by leveraging information from a face modality. Having as input temporal facial features extracted using a dynamic deep learning architecture (based on 3dCNN, LSTM and Transformer networks) and, additionally, conditional information related to annotations, our system manages to generate realistic spectrograms that represent audio clips corresponding to specific emotional context. As proof of their validity, apart from three quality metrics (Frechet Inception Distance, Inception Score and Structural Similarity index), we verified the generated samples applying an audio-based emotion recognition schema. When the generated samples are fused with the initial real ones, an improvement between 3.5 to 5.5% was achieved in audio emotion recognition performance for two state-of-the-art datasets.","PeriodicalId":105376,"journal":{"name":"2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Temporal conditional Wasserstein GANs for audio-visual affect-related ties\",\"authors\":\"C. Athanasiadis, E. Hortal, Stelios Asteriadis\",\"doi\":\"10.1109/aciiw52867.2021.9666277\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotion recognition through audio is a rather challenging task that entails proper feature extraction and classification. Meanwhile, state-of-the-art classification strategies are usually based on deep learning architectures. Training complex deep learning networks normally requires very large audiovisual corpora with available emotion annotations. However, such availability is not always guaranteed since harvesting and annotating such datasets is a time-consuming task. In this work, temporal conditional Wasserstein Generative Adversarial Networks (tc-wGANs) are introduced to generate robust audio data by leveraging information from a face modality. Having as input temporal facial features extracted using a dynamic deep learning architecture (based on 3dCNN, LSTM and Transformer networks) and, additionally, conditional information related to annotations, our system manages to generate realistic spectrograms that represent audio clips corresponding to specific emotional context. As proof of their validity, apart from three quality metrics (Frechet Inception Distance, Inception Score and Structural Similarity index), we verified the generated samples applying an audio-based emotion recognition schema. When the generated samples are fused with the initial real ones, an improvement between 3.5 to 5.5% was achieved in audio emotion recognition performance for two state-of-the-art datasets.\",\"PeriodicalId\":105376,\"journal\":{\"name\":\"2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-09-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/aciiw52867.2021.9666277\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/aciiw52867.2021.9666277","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

通过音频进行情感识别是一项相当具有挑战性的任务，需要适当的特征提取和分类。同时，最先进的分类策略通常基于深度学习架构。训练复杂的深度学习网络通常需要非常大的视听语料库和可用的情感注释。然而，这种可用性并不总是得到保证，因为收集和注释这样的数据集是一项耗时的任务。在这项工作中，引入了时间条件Wasserstein生成对抗网络(tc- wgan)，通过利用来自面部模态的信息来生成鲁棒的音频数据。使用动态深度学习架构(基于3dCNN、LSTM和Transformer网络)提取时间面部特征作为输入，此外，还有与注释相关的条件信息，我们的系统设法生成逼真的频谱图，代表与特定情感背景对应的音频片段。为了证明它们的有效性，除了三个质量指标(Frechet Inception Distance, Inception Score和Structural Similarity index)外，我们还使用基于音频的情感识别模式验证了生成的样本。当生成的样本与初始真实样本融合时，两个最先进的数据集的音频情感识别性能提高了3.5 - 5.5%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Temporal conditional Wasserstein GANs for audio-visual affect-related ties

Emotion recognition through audio is a rather challenging task that entails proper feature extraction and classification. Meanwhile, state-of-the-art classification strategies are usually based on deep learning architectures. Training complex deep learning networks normally requires very large audiovisual corpora with available emotion annotations. However, such availability is not always guaranteed since harvesting and annotating such datasets is a time-consuming task. In this work, temporal conditional Wasserstein Generative Adversarial Networks (tc-wGANs) are introduced to generate robust audio data by leveraging information from a face modality. Having as input temporal facial features extracted using a dynamic deep learning architecture (based on 3dCNN, LSTM and Transformer networks) and, additionally, conditional information related to annotations, our system manages to generate realistic spectrograms that represent audio clips corresponding to specific emotional context. As proof of their validity, apart from three quality metrics (Frechet Inception Distance, Inception Score and Structural Similarity index), we verified the generated samples applying an audio-based emotion recognition schema. When the generated samples are fused with the initial real ones, an improvement between 3.5 to 5.5% was achieved in audio emotion recognition performance for two state-of-the-art datasets.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)

自引率

0.00%

发文量