Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

IF 3.1 · Zone 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi
Journal: Computer Speech and Language, vol. 86, Article 101619
Published: 2024-01-18
DOI: 10.1016/j.csl.2024.101619
URL: https://www.sciencedirect.com/science/article/pii/S0885230824000020
Citations: 0

Abstract


Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring. However, the sequential optimization of the front-end and back-end models may lead to a local minimum, which theoretically prevents the whole system from achieving the best optimization. Although some methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and NPLDA E2E model, most of these methods have not fully investigated how to model the intra-relationship between multiple enrollment utterances. In this paper, we propose a new E2E joint method for speaker verification especially designed for the practical scenario of multiple enrollment utterances. To leverage the intra-relationship among multiple enrollment utterances, our model comes equipped with frame-level and utterance-level attention mechanisms. Additionally, focal loss is utilized to balance the importance of positive and negative samples within a mini-batch and focus on the difficult samples during the training process. We also utilize several data augmentation techniques, including conventional noise augmentation using MUSAN and RIRs datasets and a unique speaker embedding-level mixup strategy for better optimization.
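
The abstract names focal loss and an embedding-level mixup strategy but gives no formulas. The sketch below is only an illustration under stated assumptions: it uses the widely used binary focal loss formulation and the generic mixup interpolation, written with NumPy. The function names, the default values of `gamma` and `alpha`, and the way labels are mixed are all hypothetical choices and should not be read as the authors' implementation.

```python
import numpy as np


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Generic binary focal loss, averaged over a mini-batch.

    probs  : predicted probability that each trial is a same-speaker trial
    labels : 1 for same-speaker (positive) trials, 0 for negative trials
    gamma  : focusing parameter (assumed default); 0 gives weighted cross-entropy
    alpha  : weight on positive trials (assumed default); negatives get 1 - alpha
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), 1e-7, 1.0 - 1e-7)
    labels = np.asarray(labels, dtype=np.float64)
    # p_t is the probability the model assigns to the correct class.
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    # The factor (1 - p_t) ** gamma shrinks the loss of easy, well-classified trials,
    # so difficult trials dominate the average.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def mixup_embeddings(emb_a, emb_b, label_a, label_b, lam):
    """Generic mixup applied to two embedding vectors and their labels.

    lam is the interpolation weight in [0, 1]; in practice it is often drawn
    from a Beta distribution, which is an assumption not stated in the abstract.
    """
    mixed_emb = lam * np.asarray(emb_a) + (1.0 - lam) * np.asarray(emb_b)
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_emb, mixed_label


if __name__ == "__main__":
    # A confident correct prediction is penalized far less than a confident wrong one.
    print(binary_focal_loss([0.9], [1]), binary_focal_loss([0.1], [1]))
    print(mixup_embeddings([1.0, 0.0], [0.0, 1.0], 1.0, 0.0, lam=0.25))
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half the ordinary cross-entropy, which is a convenient sanity check when experimenting with the focusing parameter.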

Source journal
Computer Speech and Language (Engineering & Technology / Computer Science: Artificial Intelligence)
CiteScore: 11.30
Self-citation rate: 4.70%
Articles published: 80
Review time: 22.9 weeks
About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.