Multi-task learning for X-vector based speaker recognition

Yingjie Zhang, Liu Liu
{"title":"Multi-task learning for X-vector based speaker recognition","authors":"Yingjie Zhang, Liu Liu","doi":"10.1007/s10772-023-10058-5","DOIUrl":null,"url":null,"abstract":"Abstract In this paper, we propose a speaker recognition system that leverages multi-task learning and features integration (MTFI), to improve the performance of x-vector based speaker recognition models. It is important to integrate complementary information from different features such as MFCC, Fbank, spectrogram and LPCC, as often a single feature usually cannot cover all information about a speaker and generalization is insufficient. Since the x-vector model outputs affine transformation values with the penultimate hidden layer in the trained model, the parameter distribution of this layer should be stable and should not be affected by tasks that are not current branches when switching tasks. Therefore, we propose a shared unit (SU) in multi-task learning, which is useful for sharing common representations and other auxiliary tasks. Then, an attention mechanism is designed to calculate the frame weight in the statistical pooling layer, so as to enhance the key frame information. The proposed system had an EER of 0.98% in voxceleb1 and the average score fusion obtained the EER of 0.65%.","PeriodicalId":14305,"journal":{"name":"International Journal of Speech Technology","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Speech Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s10772-023-10058-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Arts and Humanities","Score":null,"Total":0}
引用次数: 0

Abstract

In this paper, we propose a speaker recognition system that leverages multi-task learning and feature integration (MTFI) to improve the performance of x-vector based speaker recognition models. It is important to integrate complementary information from different features such as MFCC, Fbank, spectrogram, and LPCC, because a single feature usually cannot cover all of the information about a speaker and generalizes insufficiently. Since the x-vector model outputs affine transformation values from the penultimate hidden layer of the trained model, the parameter distribution of this layer should remain stable and should not be affected by branches other than the current task when switching tasks. Therefore, we propose a shared unit (SU) for multi-task learning, which is useful for sharing common representations with auxiliary tasks. Then, an attention mechanism is designed to calculate frame weights in the statistical pooling layer, so as to enhance key-frame information. The proposed system achieves an EER of 0.98% on VoxCeleb1, and the average score fusion obtains an EER of 0.65%.
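To make the attention-weighted statistical pooling mentioned in the abstract concrete, the sketch below shows a generic attentive-statistics-pooling layer in PyTorch: per-frame weights are computed by a small attention network and used to form a weighted mean and standard deviation over frame-level features. This is a minimal illustration of the general technique, not the authors' exact layer; the module name AttentiveStatsPooling, the single tanh bottleneck, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Illustrative attention-weighted statistics pooling (not the paper's exact layer).

    Frame-level features of shape (batch, channels, frames) are reduced to one
    utterance-level vector by concatenating an attention-weighted mean and
    standard deviation, instead of the plain mean/std of standard x-vector pooling.
    """

    def __init__(self, channels: int, attention_dim: int = 128):
        super().__init__()
        # Small bottleneck network that scores each frame (sizes are illustrative).
        self.attention = nn.Sequential(
            nn.Conv1d(channels, attention_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(attention_dim, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        alpha = torch.softmax(self.attention(x), dim=2)      # frame weights
        mean = torch.sum(alpha * x, dim=2)                    # weighted mean
        var = torch.sum(alpha * x ** 2, dim=2) - mean ** 2    # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))                 # numerical safety
        return torch.cat([mean, std], dim=1)                  # (batch, 2 * channels)


if __name__ == "__main__":
    # Example: pool 1500-dim frame-level features from 200-frame utterances.
    pooling = AttentiveStatsPooling(channels=1500)
    frames = torch.randn(4, 1500, 200)
    utterance = pooling(frames)
    print(utterance.shape)  # torch.Size([4, 3000])
```

The concatenated weighted mean and standard deviation then feed the segment-level layers from which the x-vector embedding is taken.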
Source journal: International Journal of Speech Technology (Engineering, Electrical & Electronic)
CiteScore: 5.00
Self-citation rate: 0.00%
Articles published: 65
Journal description

The International Journal of Speech Technology is a research journal that focuses on speech technology and its applications. It promotes research and description on all aspects of speech input and output, including theory, experiment, testing, base technology, and applications. The journal is an international forum for the dissemination of research related to the applications of speech technology as well as to the technology itself as it relates to real-world applications. Articles describing original work in all aspects of speech technology are included. Sample topics include, but are not limited to, the following:

- applications employing digitized speech, synthesized speech or automatic speech recognition
- technological issues of speech input or output
- human factors, intelligent interfaces, robust applications
- integration of aspects of artificial intelligence and natural language processing
- international and local language implementations of speech synthesis and recognition
- development of new algorithms
- interface description techniques, tools and languages
- testing of intelligibility, naturalness and accuracy
- computational issues in speech technology
- software development tools
- speech-enabled robotics
- speech technology as a diagnostic tool for treating language disorders
- voice technology for managing serious laryngeal disabilities
- the use of speech in multimedia

This is the only journal which presents papers on both the base technology and theory as well as all varieties of applications. It encompasses all aspects of the three major technologies: text-to-speech synthesis, automatic speech recognition and stored (digitized) speech.