Exploring the Intersection Between Speaker Verification and Emotion Recognition
Michelle I Bancroft, Reza Lotfian, J. Hansen, C. Busso
2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), September 2019. DOI: 10.1109/ACIIW.2019.8925044
Many practical applications require speaker verification systems that operate on audio with high emotional content (e.g., 911 calls, forensic analysis of threatening recordings). For these cases, it is important to explore the intersection between speaker and emotion recognition tasks. A key challenge in addressing this problem is the lack of resources, since current emotional databases are commonly limited in size and number of speakers. This paper (1) creates the infrastructure to study this challenging problem, and (2) presents an exploratory analysis evaluating the accuracy of state-of-the-art speaker and emotion recognition systems in automatically retrieving specific emotional behaviors from target speakers. We collected a pool of sentences from multiple speakers (132,930 segments), where some of these speaking turns belong to 146 speakers in the MSP-Podcast database. Our framework trains speaker verification models, which are used to retrieve candidate speaking turns from the pool of sentences. The emotional content in these sentences is detected using state-of-the-art emotion recognition algorithms. The experimental evaluation provides promising results, where most of the retrieved sentences belong to the target speakers and carry the target emotion. The results highlight the need for emotional compensation in speaker recognition systems, especially if these models are intended for commercial applications.
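The retrieval pipeline the abstract describes (verify the speaker, then filter by emotion) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the embedding extractor, emotion recognizer, file names, and verification threshold are hypothetical stand-ins for trained models.

```python
import numpy as np

def speaker_embedding(audio_id):
    # Stand-in for a trained speaker-embedding extractor (e.g., i-/x-vectors).
    # Here: a deterministic pseudo-random vector so the sketch runs end to end.
    rng = np.random.default_rng(abs(hash(audio_id)) % (2**32))
    return rng.standard_normal(128)

def predict_emotion(audio_id):
    # Stand-in for an emotion recognition model; returns a fixed label here.
    return "neutral"

def cosine_score(a, b):
    # Cosine similarity, a common scoring rule in speaker verification.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(enrollment_ids, target_emotion, pool_ids, threshold=0.5):
    """Return pool segments that (1) score above the verification threshold
    against the target speaker's enrollment embedding and (2) are labeled
    with the target emotion by the emotion recognizer."""
    enroll = np.mean([speaker_embedding(a) for a in enrollment_ids], axis=0)
    hits = []
    for audio_id in pool_ids:
        if cosine_score(speaker_embedding(audio_id), enroll) >= threshold:
            if predict_emotion(audio_id) == target_emotion:
                hits.append(audio_id)
    return hits

# Hypothetical usage: enroll a target speaker, then search the candidate pool.
pool = [f"segment_{i:06d}.wav" for i in range(1000)]
enrollment = ["target_spk_enroll_01.wav", "target_spk_enroll_02.wav"]
matches = retrieve(enrollment, "neutral", pool, threshold=0.8)
print(f"{len(matches)} segments retrieved for the target speaker/emotion")
```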