Unsupervised Representation Learning with Attention and Sequence to Sequence Autoencoders to Predict Sleepiness From Speech

Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop Pub Date : 2020-10-15 DOI:10.1145/3423327.3423670

S. Amiriparian, Pawel Winokurow, Vincent Karas, Sandra Ottl, Maurice Gerczuk, Björn Schuller

{"title":"Unsupervised Representation Learning with Attention and Sequence to Sequence Autoencoders to Predict Sleepiness From Speech","authors":"S. Amiriparian, Pawel Winokurow, Vincent Karas, Sandra Ottl, Maurice Gerczuk, Björn Schuller","doi":"10.1145/3423327.3423670","DOIUrl":null,"url":null,"abstract":"Motivated by the attention mechanism of the human visual system and recent developments in the field of machine translation, we introduce our attention-based and recurrent sequence to sequence autoencoders for fully unsupervised representation learning from audio files. In particular, we test the efficacy of our novel approach on the task of speech-based sleepiness recognition. We evaluate the learnt representations from both autoencoders, and conduct an early fusion to ascertain possible complementarity between them. In our frameworks, we first extract Mel-spectrograms from raw audio. Second, we train recurrent autoencoders on these spectrograms which are considered as time-dependent frequency vectors. Afterwards, we extract the activations of specific fully connected layers of the autoencoders which represent the learnt features of spectrograms for the corresponding audio instances. Finally, we train support vector regressors on these representations to obtain the predictions. On the development partition of the data, we achieve Spearman's correlation coefficients of .324, .283, and .320 with the targets on the Karolinska Sleepiness Scale by utilising attention and non-attention autoencoders, and the fusion of both autoencoders' representations, respectively. In the same order, we achieve .311, .359, and .367 Spearman's correlation coefficients on the test data, indicating the suitability of our proposed fusion strategy.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"298 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3423327.3423670","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

Abstract

Motivated by the attention mechanism of the human visual system and recent developments in the field of machine translation, we introduce our attention-based and recurrent sequence to sequence autoencoders for fully unsupervised representation learning from audio files. In particular, we test the efficacy of our novel approach on the task of speech-based sleepiness recognition. We evaluate the learnt representations from both autoencoders, and conduct an early fusion to ascertain possible complementarity between them. In our frameworks, we first extract Mel-spectrograms from raw audio. Second, we train recurrent autoencoders on these spectrograms which are considered as time-dependent frequency vectors. Afterwards, we extract the activations of specific fully connected layers of the autoencoders which represent the learnt features of spectrograms for the corresponding audio instances. Finally, we train support vector regressors on these representations to obtain the predictions. On the development partition of the data, we achieve Spearman's correlation coefficients of .324, .283, and .320 with the targets on the Karolinska Sleepiness Scale by utilising attention and non-attention autoencoders, and the fusion of both autoencoders' representations, respectively. In the same order, we achieve .311, .359, and .367 Spearman's correlation coefficients on the test data, indicating the suitability of our proposed fusion strategy.

查看原文本刊更多论文

基于注意和序列到序列自编码器的无监督表示学习预测语音困倦程度

受人类视觉系统的注意机制和机器翻译领域的最新发展的启发，我们引入了基于注意和循环序列的序列自编码器，用于从音频文件中进行完全无监督表示学习。特别是，我们测试了我们的新方法在基于语音的困倦识别任务上的有效性。我们评估从两个自编码器学习到的表示，并进行早期融合以确定它们之间可能的互补性。在我们的框架中，我们首先从原始音频中提取mel谱图。其次，我们在这些频谱图上训练循环自编码器，这些频谱图被认为是与时间相关的频率矢量。然后，我们提取了代表相应音频实例的频谱图学习特征的自编码器的特定全连接层的激活。最后，我们在这些表示上训练支持向量回归器以获得预测。在数据的发展划分上，我们分别利用注意和非注意自编码器以及两种自编码器表示的融合，实现了与卡罗林斯卡嗜睡量表目标的Spearman相关系数为0.324、0.283和0.320。在相同的顺序下，我们在测试数据上获得了。311，。359和。367的Spearman相关系数，表明我们提出的融合策略的适用性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop

自引率

0.00%

发文量