SpoofCeleb: Speech Deepfake Detection and SASV in the Wild

IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC
Jee-weon Jung;Yihan Wu;Xin Wang;Ji-Hoon Kim;Soumi Maiti;Yuta Matsunaga;Hye-jin Shim;Jinchuan Tian;Nicholas Evans;Joon Son Chung;Wangyou Zhang;Seyun Um;Shinnosuke Takamichi;Shinji Watanabe
{"title":"SpoofCeleb:语音深度假检测和SASV在野外","authors":"Jee-weon Jung;Yihan Wu;Xin Wang;Ji-Hoon Kim;Soumi Maiti;Yuta Matsunaga;Hye-jin Shim;Jinchuan Tian;Nicholas Evans;Joon Son Chung;Wangyou Zhang;Seyun Um;Shinnosuke Takamichi;Shinji Watanabe","doi":"10.1109/OJSP.2025.3529377","DOIUrl":null,"url":null,"abstract":"This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with different levels of noise to be trained. However, current datasets typically include clean, high-quality recordings (bona fide data) due to the requirements for TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Current SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline we developed that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We present the baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at <uri>https://jungjee.github.io/spoofceleb</uri>.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"68-77"},"PeriodicalIF":2.9000,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10839331","citationCount":"0","resultStr":"{\"title\":\"SpoofCeleb: Speech Deepfake Detection and SASV in the Wild\",\"authors\":\"Jee-weon Jung;Yihan Wu;Xin Wang;Ji-Hoon Kim;Soumi Maiti;Yuta Matsunaga;Hye-jin Shim;Jinchuan Tian;Nicholas Evans;Joon Son Chung;Wangyou Zhang;Seyun Um;Shinnosuke Takamichi;Shinji Watanabe\",\"doi\":\"10.1109/OJSP.2025.3529377\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with different levels of noise to be trained. However, current datasets typically include clean, high-quality recordings (bona fide data) due to the requirements for TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Current SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline we developed that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. 
The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We present the baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at <uri>https://jungjee.github.io/spoofceleb</uri>.\",\"PeriodicalId\":73300,\"journal\":{\"name\":\"IEEE open journal of signal processing\",\"volume\":\"6 \",\"pages\":\"68-77\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2025-01-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10839331\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE open journal of signal processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10839331/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of signal processing","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10839331/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with different levels of noise to be trained. However, current datasets typically include clean, high-quality recordings (bona fide data) due to the requirements for TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Current SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline we developed that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We present the baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb.
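The baseline results mentioned above cover both the SDD and SASV tasks, which are commonly reported in terms of the equal error rate (EER): the operating point at which the false acceptance rate equals the false rejection rate. As a minimal, hedged sketch only (the toy scores, label convention, and function name below are illustrative assumptions and are not taken from the SpoofCeleb release or its baseline code), the following Python snippet shows how an EER could be computed from a set of detection scores and bona fide/spoof labels.

```python
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(scores, labels):
    """Equal error rate for a binary detection task.

    scores: higher values mean "more likely bona fide".
    labels: 1 = bona fide trial, 0 = spoofed trial.
    """
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    # The EER lies where the false positive and false negative rates cross;
    # take the threshold index that minimizes their gap.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)


if __name__ == "__main__":
    # Toy scores for illustration only; in practice these would come from an
    # SDD or SASV system scored against an evaluation protocol such as the
    # one distributed with SpoofCeleb.
    rng = np.random.default_rng(0)
    bona = rng.normal(1.0, 1.0, 1000)    # scores for bona fide trials
    spoof = rng.normal(-1.0, 1.0, 1000)  # scores for spoofed trials
    scores = np.concatenate([bona, spoof])
    labels = np.concatenate([np.ones(1000), np.zeros(1000)])
    print(f"EER: {compute_eer(scores, labels):.2%}")
```

This is only meant to clarify the evaluation metric behind the reported baselines; the dataset's own protocols and baseline recipes are available at the project page linked above.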