{"title":"Multi-task learning for X-vector based speaker recognition","authors":"Yingjie Zhang, Liu Liu","doi":"10.1007/s10772-023-10058-5","DOIUrl":null,"url":null,"abstract":"Abstract In this paper, we propose a speaker recognition system that leverages multi-task learning and features integration (MTFI), to improve the performance of x-vector based speaker recognition models. It is important to integrate complementary information from different features such as MFCC, Fbank, spectrogram and LPCC, as often a single feature usually cannot cover all information about a speaker and generalization is insufficient. Since the x-vector model outputs affine transformation values with the penultimate hidden layer in the trained model, the parameter distribution of this layer should be stable and should not be affected by tasks that are not current branches when switching tasks. Therefore, we propose a shared unit (SU) in multi-task learning, which is useful for sharing common representations and other auxiliary tasks. Then, an attention mechanism is designed to calculate the frame weight in the statistical pooling layer, so as to enhance the key frame information. The proposed system had an EER of 0.98% in voxceleb1 and the average score fusion obtained the EER of 0.65%.","PeriodicalId":14305,"journal":{"name":"International Journal of Speech Technology","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Speech Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s10772-023-10058-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Arts and Humanities","Score":null,"Total":0}
Abstract
In this paper, we propose a speaker recognition system that leverages multi-task learning and feature integration (MTFI) to improve the performance of x-vector based speaker recognition models. Integrating complementary information from different features such as MFCC, Fbank, spectrogram and LPCC is important, since a single feature usually cannot capture all of the information about a speaker and generalizes insufficiently. Since the x-vector is taken from the affine transformation output of the penultimate hidden layer of the trained model, the parameter distribution of this layer should remain stable and should not be perturbed by the non-current task branches when tasks are switched. We therefore propose a shared unit (SU) for multi-task learning, which shares common representations with the auxiliary tasks. An attention mechanism is then designed to compute frame weights in the statistical pooling layer, emphasizing information from key frames. The proposed system achieves an EER of 0.98% on VoxCeleb1, and average score fusion further reduces the EER to 0.65%.
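The attention-weighted statistical pooling mentioned in the abstract can be illustrated with a minimal PyTorch sketch. This is an assumed, generic formulation (the class name, layer sizes feat_dim and attn_dim, and the tanh-based scoring network are illustrative choices), not the authors' published implementation or their multi-task SU setup: an attention branch scores each frame, and the normalized weights are used to compute a weighted mean and standard deviation over time.

```python
# Hypothetical sketch of attention-weighted statistical pooling for an
# x-vector style network. Layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    def __init__(self, feat_dim: int, attn_dim: int = 128):
        super().__init__()
        # Small attention network mapping each frame to a scalar score.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) frame-level features from the frame-level layers.
        scores = self.attention(x)                   # (batch, time, 1)
        weights = torch.softmax(scores, dim=1)       # normalize over frames
        mean = torch.sum(weights * x, dim=1)         # weighted mean, (batch, feat_dim)
        var = torch.sum(weights * (x - mean.unsqueeze(1)) ** 2, dim=1)
        std = torch.sqrt(var.clamp(min=1e-8))        # weighted standard deviation
        return torch.cat([mean, std], dim=1)         # (batch, 2 * feat_dim)


# Example usage with random frame-level features.
pooling = AttentiveStatsPooling(feat_dim=512)
frames = torch.randn(4, 200, 512)                    # 4 utterances, 200 frames each
utterance_embedding = pooling(frames)                # shape: (4, 1024)
```

Compared with plain statistical pooling, which averages all frames equally, the learned weights let the pooling layer emphasize frames that are more informative about speaker identity.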
Journal introduction:
The International Journal of Speech Technology is a research journal that focuses on speech technology and its applications. It promotes research and description on all aspects of speech input and output, including theory, experiment, testing, base technology, and applications. The journal is an international forum for the dissemination of research related to the applications of speech technology as well as to the technology itself as it relates to real-world applications. Articles describing original work in all aspects of speech technology are included. Sample topics include, but are not limited to, the following:
- applications employing digitized speech, synthesized speech or automatic speech recognition
- technological issues of speech input or output
- human factors, intelligent interfaces, robust applications
- integration of aspects of artificial intelligence and natural language processing
- international and local language implementations of speech synthesis and recognition
- development of new algorithms
- interface description techniques, tools and languages
- testing of intelligibility, naturalness and accuracy
- computational issues in speech technology
- software development tools
- speech-enabled robotics
- speech technology as a diagnostic tool for treating language disorders
- voice technology for managing serious laryngeal disabilities
- the use of speech in multimedia
This is the only journal which presents papers on both the base technology and theory as well as all varieties of applications. It encompasses all aspects of the three major technologies: text-to-speech synthesis, automatic speech recognition and stored (digitized) speech.