{"title":"悟空语音:深度伪语音检测基准测试","authors":"Ziwei Yan, Yanjie Zhao, Haoyu Wang","doi":"arxiv-2409.06348","DOIUrl":null,"url":null,"abstract":"With the rapid advancement of technologies like text-to-speech (TTS) and\nvoice conversion (VC), detecting deepfake voices has become increasingly\ncrucial. However, both academia and industry lack a comprehensive and intuitive\nbenchmark for evaluating detectors. Existing datasets are limited in language\ndiversity and lack many manipulations encountered in real-world production\nenvironments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate\nthe performance of deepfake voice detectors. To build the dataset, we first\ncollected deepfake voices generated by 19 advanced and widely recognized\ncommercial tools and 15 open-source tools. We then created 38 data variants\ncovering six types of manipulations, constructing the evaluation dataset for\ndeepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200\nChinese deepfake voice samples. Using VoiceWukong, we evaluated 12\nstate-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of\n13.50%, while all others exceeded 20%. Our findings reveal that these detectors\nface significant challenges in real-world applications, with dramatically\ndeclining performance. In addition, we conducted a user study with more than\n300 participants. The results are compared with the performance of the 12\ndetectors and a multimodel large language model (MLLM), i.e., Qwen2-Audio,\nwhere different detectors and humans exhibit varying identification\ncapabilities for deepfake voices at different deception levels, while the LALM\ndemonstrates no detection ability at all. Furthermore, we provide a leaderboard\nfor deepfake voice detection, publicly available at\n{https://voicewukong.github.io}.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"39 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VoiceWukong: Benchmarking Deepfake Voice Detection\",\"authors\":\"Ziwei Yan, Yanjie Zhao, Haoyu Wang\",\"doi\":\"arxiv-2409.06348\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the rapid advancement of technologies like text-to-speech (TTS) and\\nvoice conversion (VC), detecting deepfake voices has become increasingly\\ncrucial. However, both academia and industry lack a comprehensive and intuitive\\nbenchmark for evaluating detectors. Existing datasets are limited in language\\ndiversity and lack many manipulations encountered in real-world production\\nenvironments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate\\nthe performance of deepfake voice detectors. To build the dataset, we first\\ncollected deepfake voices generated by 19 advanced and widely recognized\\ncommercial tools and 15 open-source tools. We then created 38 data variants\\ncovering six types of manipulations, constructing the evaluation dataset for\\ndeepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200\\nChinese deepfake voice samples. Using VoiceWukong, we evaluated 12\\nstate-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of\\n13.50%, while all others exceeded 20%. Our findings reveal that these detectors\\nface significant challenges in real-world applications, with dramatically\\ndeclining performance. 
In addition, we conducted a user study with more than\\n300 participants. The results are compared with the performance of the 12\\ndetectors and a multimodel large language model (MLLM), i.e., Qwen2-Audio,\\nwhere different detectors and humans exhibit varying identification\\ncapabilities for deepfake voices at different deception levels, while the LALM\\ndemonstrates no detection ability at all. Furthermore, we provide a leaderboard\\nfor deepfake voice detection, publicly available at\\n{https://voicewukong.github.io}.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":\"39 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.06348\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06348","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
With the rapid advancement of technologies like text-to-speech (TTS) and
voice conversion (VC), detecting deepfake voices has become increasingly
crucial. However, both academia and industry lack a comprehensive and intuitive
benchmark for evaluating detectors. Existing datasets are limited in language
diversity and lack many manipulations encountered in real-world production
environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate
the performance of deepfake voice detectors. To build the dataset, we first
collected deepfake voices generated by 19 advanced and widely recognized
commercial tools and 15 open-source tools. We then created 38 data variants
covering six types of manipulations, constructing the evaluation dataset for
deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200
Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12
state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of
13.50%, while all others exceeded 20%. Our findings reveal that these detectors
face significant challenges in real-world applications, with their performance
declining dramatically. In addition, we conducted a user study with more than
300 participants. We compared the results with the performance of the 12
detectors and a multimodal large language model (MLLM), namely Qwen2-Audio.
Different detectors and humans exhibit varying identification capabilities
for deepfake voices at different deception levels, while the MLLM demonstrates
no detection ability at all. Furthermore, we provide a publicly available
leaderboard for deepfake voice detection at https://voicewukong.github.io.
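
For readers unfamiliar with the metric, the sketch below shows one common way to compute the equal error rate (EER) reported above: the operating point where the false rejection rate equals the false acceptance rate. This is a minimal illustration, not the paper's evaluation code; the threshold sweep, the score convention (higher means more likely deepfake), and the toy data are all assumptions made for the example.

```python
import numpy as np

def compute_eer(scores, labels):
    """Approximate the equal error rate (EER) from detector scores.

    scores: higher values mean "more likely deepfake".
    labels: 1 for deepfake (positive), 0 for bona fide (negative).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    pos = labels == 1
    neg = ~pos

    # Sweep every distinct score as a candidate decision threshold.
    thresholds = np.unique(scores)
    frr = []  # false rejection rate: deepfakes scored below the threshold
    far = []  # false acceptance rate: bona fide clips scored at/above it
    for t in thresholds:
        frr.append(np.mean(scores[pos] < t))
        far.append(np.mean(scores[neg] >= t))
    frr, far = np.array(frr), np.array(far)

    # The EER lies where the two error curves cross; with a finite
    # threshold grid we take the midpoint at the closest approach.
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

# Toy usage: scores for 4 deepfake and 4 bona fide clips.
eer = compute_eer([0.9, 0.8, 0.35, 0.7, 0.2, 0.4, 0.1, 0.3],
                  [1,   1,   1,    1,   0,   0,   0,   0])
print(f"EER = {eer:.2%}")  # prints EER = 25.00%
```

Under this convention, AASIST2's reported 13.50% EER means that at its best-balanced threshold it both misses about 13.5% of deepfakes and falsely flags about 13.5% of genuine voices.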