Mismatched feature detection with finer granularity for emotional speaker recognition

Journal of Zhejiang University-Science C-Computers & Electronics. Pub Date: 2014-10-01. DOI: 10.1631/jzus.C1400002

Li Chen, Yingchun Yang, Zhaohui Wu


Citations: 3

Abstract

The shapes of a speaker's vocal organs change with emotional state, so the acoustic space of short-time features under emotion deviates from the neutral acoustic space, and speaker recognition performance degrades. Features that deviate greatly from the neutral acoustic space are considered mismatched features, and they negatively affect speaker recognition systems. Emotion variation deforms features differently for different phonemes, so it is reasonable to build a finer model that detects mismatched features under each phoneme. However, given the difficulty of phoneme recognition, three kinds of acoustic class recognition (phoneme classes, a Gaussian mixture model (GMM) tokenizer, and a probabilistic GMM tokenizer) are proposed to replace phoneme recognition. We propose feature pruning and feature regulation methods to process the mismatched features and improve speaker recognition performance. For feature regulation, the transformation matrix that regulates the mismatched features is trained by maximizing the between-class distance and minimizing the within-class distance. Experiments on the Mandarin affective speech corpus (MASC) show that our feature pruning and feature regulation methods increase the identification rate (IR) by 3.64% and 6.77%, respectively, compared with the baseline GMM-UBM (universal background model) algorithm. Corresponding IR increases of 2.09% and 3.32% are obtained when our methods are applied to the state-of-the-art i-vector algorithm.
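The abstract does not state the exact detection rule, so the following is only a minimal sketch of how a hard GMM tokenizer combined with feature pruning might look. It assumes a diagonal-covariance GMM trained on neutral speech, assigns each frame to its most likely component (its acoustic class), and drops frames whose score falls below a per-class threshold. The function names, the log-likelihood score, and the `thresholds` array are illustrative assumptions, not the authors' method.

```python
import numpy as np


def component_log_likelihoods(frames, weights, means, variances):
    """Return log(w_k * N(x_t | mu_k, diag(var_k))) for every frame t and component k.

    frames: (T, D) feature vectors; weights: (K,); means, variances: (K, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                        # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)    # (K,)
    mahalanobis = np.sum(diff ** 2 / variances[None, :, :], axis=2)      # (T, K)
    return np.log(weights)[None, :] + log_norm[None, :] - 0.5 * mahalanobis


def prune_mismatched(frames, weights, means, variances, thresholds):
    """Tokenize frames with the neutral GMM, then prune low-scoring frames.

    thresholds: (K,) hypothetical per-class cutoffs on the winning score.
    Returns the kept frames, the class token of every frame, and the keep mask.
    """
    scores = component_log_likelihoods(frames, weights, means, variances)
    tokens = np.argmax(scores, axis=1)               # hard tokenizer: best component
    best = scores[np.arange(len(frames)), tokens]    # score under the assigned class
    keep = best >= thresholds[tokens]                # class-specific decision
    return frames[keep], tokens, keep
```

A probabilistic tokenizer would presumably replace the hard `argmax` with the component posteriors; that variant is not shown here because the abstract gives no details about it.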
