Exploring the Intersection Between Speaker Verification and Emotion Recognition

Michelle I Bancroft, Reza Lotfian, J. Hansen, C. Busso
{"title":"Exploring the Intersection Between Speaker Verification and Emotion Recognition","authors":"Michelle I Bancroft, Reza Lotfian, J. Hansen, C. Busso","doi":"10.1109/ACIIW.2019.8925044","DOIUrl":null,"url":null,"abstract":"Many scenarios in practical applications require the use of speaker verification systems using audio with high emotional content (e.g., calls from 911, forensic analysis of threatening recordings). For these cases, it is important to explore the intersection between speaker and emotion recognition tasks. A key challenge to address this problem is the lack of resources, since current emotional databases are commonly limited in size and number of speakers. This paper (1) creates the infrastructure to study this challenging problems, and (2) presents an exploratory analysis to evaluate the accuracy of state-of-the-art speaker and emotion recognition systems to automatically retrieve specific emotional behaviors from target speakers. We collected a pool of sentences from multiple speakers (132,930 segments), where some of these speaking turns belong to 146 speakers in the MSP-Podcast database. Our framework trains speaking verification models, which are used to retrieve candidate speaking turns from the pool of sentences. The emotional content in these sentences are detected using state-of-the-art emotion recognition algorithms. The experimental evaluation provides promising results, where most of the retrieved sentences belong to the target speakers and has the target emotion. The results highlight the need for emotional compensation in speaker recognition systems, especially if these models are intended for commercial applications.","PeriodicalId":193568,"journal":{"name":"2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)","volume":"433 ","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ACIIW.2019.8925044","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Many scenarios in practical applications require the use of speaker verification systems on audio with high emotional content (e.g., 911 calls, forensic analysis of threatening recordings). For these cases, it is important to explore the intersection between speaker and emotion recognition tasks. A key challenge in addressing this problem is the lack of resources, since current emotional databases are commonly limited in size and number of speakers. This paper (1) creates the infrastructure to study this challenging problem, and (2) presents an exploratory analysis to evaluate the accuracy of state-of-the-art speaker and emotion recognition systems in automatically retrieving specific emotional behaviors from target speakers. We collected a pool of sentences from multiple speakers (132,930 segments), where some of these speaking turns belong to 146 speakers in the MSP-Podcast database. Our framework trains speaker verification models, which are used to retrieve candidate speaking turns from the pool of sentences. The emotional content in these sentences is detected using state-of-the-art emotion recognition algorithms. The experimental evaluation provides promising results, where most of the retrieved sentences belong to the target speakers and have the target emotion. The results highlight the need for emotional compensation in speaker recognition systems, especially if these models are intended for commercial applications.
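The abstract describes a two-stage retrieval pipeline: a speaker verification model first ranks the pool of speaking turns against a target speaker, and an emotion recognition model then filters the retrieved candidates by the target emotion. The sketch below is a minimal illustration of that workflow, not the paper's implementation; it assumes speaker embeddings scored with cosine similarity and a hypothetical utterance-level emotion classifier (`emotion_clf`), both stand-ins for whatever systems the authors actually used.

```python
# Minimal sketch of the retrieve-then-filter pipeline described in the abstract.
# Assumptions (not from the paper): precomputed speaker embeddings, cosine scoring,
# and a placeholder emotion classifier that maps an utterance id to an emotion label.
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between the target speaker's enrollment embedding and a test embedding."""
    return float(np.dot(enroll_emb, test_emb) /
                 (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb) + 1e-12))

def retrieve_emotional_turns(enroll_emb, pool_embs, pool_ids,
                             emotion_clf, target_emotion, top_k=100):
    """Rank the pool by speaker similarity, then keep turns predicted to carry the target emotion."""
    # Stage 1: speaker verification scores over the whole pool of speaking turns.
    scores = [cosine_score(enroll_emb, emb) for emb in pool_embs]
    ranked = sorted(zip(pool_ids, scores), key=lambda pair: pair[1], reverse=True)
    candidates = ranked[:top_k]

    # Stage 2: emotion recognition filters the speaker-matched candidates.
    return [(uid, score) for uid, score in candidates
            if emotion_clf(uid) == target_emotion]
```

In this toy setup, the paper's finding that emotional speech degrades speaker verification would surface as genuine turns from the target speaker falling below the top-k cutoff when their emotional content shifts the embedding, which is the motivation for the emotional compensation the authors call for.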