Multi-Speaker Text-to-Speech Training With Speaker Anonymized Data

IF 3.2 | CAS Tier 2 (Engineering Technology) | JCR Q2, ENGINEERING, ELECTRICAL & ELECTRONIC
Wen-Chin Huang;Yi-Chiao Wu;Tomoki Toda
Journal: IEEE Signal Processing Letters, vol. 31, pp. 2995-2999
DOI: 10.1109/LSP.2024.3482701
Published: 2024-10-17 (Journal Article)
Article page: https://ieeexplore.ieee.org/document/10720809/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10720809
Citations: 0

Abstract

The trend of scaling up speech generation models poses the threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this letter, we investigate the training of multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which is further used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor model, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of the downstream TTS performance. We summarize insights in the hope of helping future researchers determine the usefulness of the SA system for multi-speaker TTS training.
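The abstract singles out GVD (gain of voice distinctiveness) as a good predictor of downstream TTS performance. The sketch below is illustrative only, not the paper's code: it computes a GVD-style score from a voice-similarity matrix over speaker embeddings, loosely following the metric's definition in the VoicePrivacy Challenge. The embeddings here are synthetic stand-ins; in practice they would come from a speaker verification model applied to original and anonymized speech.

```python
# Illustrative GVD sketch (assumption: embeddings from a speaker encoder;
# here they are synthetic). A GVD near 0 dB means anonymization preserved
# how distinguishable speakers are from one another; a large negative GVD
# means the anonymized voices collapsed together.
import numpy as np

def similarity_matrix(embs):
    """M[i, j] = mean cosine similarity between utterances of speakers i and j.

    embs: list of (n_utterances, dim) arrays, one per speaker.
    """
    normed = [e / np.linalg.norm(e, axis=1, keepdims=True) for e in embs]
    n = len(normed)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            M[i, j] = (normed[i] @ normed[j].T).mean()
    return M

def diagonal_dominance(M):
    """Gap between same-speaker (diagonal) and cross-speaker mean similarity."""
    n = M.shape[0]
    diag = np.trace(M) / n
    off = (M.sum() - np.trace(M)) / (n * (n - 1))
    return abs(diag - off)

def gvd(M_orig, M_anon):
    """Gain of voice distinctiveness in dB: 0 = preserved, negative = lost."""
    return 10 * np.log10(diagonal_dominance(M_anon) / diagonal_dominance(M_orig))

# Demo: three clearly distinct "original" speakers vs. "anonymized" voices
# that have all collapsed toward the same point in embedding space.
rng = np.random.default_rng(0)
orig = [rng.normal(mu, 0.1, size=(10, 8)) for mu in np.eye(8)[:3] * 5]
anon = [rng.normal(np.ones(8), 0.1, size=(10, 8)) for _ in range(3)]
M_o = similarity_matrix(orig)
M_a = similarity_matrix(anon)
print(gvd(M_o, M_o))  # 0.0: distinctiveness unchanged
print(gvd(M_o, M_a))  # negative: anonymization destroyed distinctiveness
```

This simplified version averages over all utterance pairs (including an utterance with itself on the diagonal); the challenge's official recipe is more elaborate, but the qualitative reading of the score is the same.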
Source journal
IEEE Signal Processing Letters (Engineering Technology – Electronic & Electrical Engineering)
CiteScore: 7.40
Self-citation rate: 12.80%
Annual articles: 339
Review time: 2.8 months
About the journal: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.