Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0
Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xiaopeng Wang, Yuankun Xie, Xin Qi, Shuchen Shi, Yi Lu, Yukun Liu, Chenxing Li, Xuefei Liu, Guanjun Li
arXiv - EE - Audio and Speech Processing, 2024-09-18
DOI: arxiv-2409.11909 (https://doi.org/arxiv-2409.11909)
Citations: 0
Abstract
Speech synthesis technology poses a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods rely on pretrained models, and integrating features from multiple layers of a pretrained model further improves detection performance. However, most previously proposed fusion methods require fine-tuning the pretrained model, which leads to excessively long training times and hinders model iteration when new speech synthesis technologies emerge. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts. While the pretrained model stays frozen, the method extracts and integrates features relevant to fake audio detection from the layer features, guided by a gating network that takes the last-layer feature as input. Experiments on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves performance competitive with methods that require fine-tuning.
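
The gated fusion described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes one linear expert per layer, utterance-level (already time-pooled) feature vectors, and a softmax gate computed from the last-layer feature. Random vectors stand in for the hidden states of a frozen wav2vec 2.0 model, and all names (`MoEFusion`, `linear`, `random_matrix`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weight, x):
    """Matrix-vector product; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MoEFusion:
    """Toy Mixture-of-Experts fusion over per-layer features.

    Assumption: one linear expert per layer. Only the experts and the
    gate would be trained; the feature extractor stays frozen.
    """

    def __init__(self, num_layers, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.experts = [random_matrix(out_dim, in_dim, rng)
                        for _ in range(num_layers)]
        # The gate maps the last-layer feature to one score per expert.
        self.gate = random_matrix(num_layers, in_dim, rng)

    def __call__(self, layer_feats):
        # Gating weights depend only on the last-layer feature.
        gate_weights = softmax(linear(self.gate, layer_feats[-1]))
        # Each expert transforms the feature of its own layer.
        expert_outs = [linear(w, h)
                       for w, h in zip(self.experts, layer_feats)]
        out_dim = len(expert_outs[0])
        # Fused feature: gate-weighted sum of the expert outputs.
        fused = [sum(g * out[d] for g, out in zip(gate_weights, expert_outs))
                 for d in range(out_dim)]
        return fused, gate_weights


# Random stand-ins for 4 frozen layer features of dimension 8.
rng = random.Random(1)
layer_feats = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

fusion = MoEFusion(num_layers=4, in_dim=8, out_dim=3)
fused, gate_weights = fusion(layer_feats)
```

In this sketch `fused` would feed a downstream classifier. Because the upstream features are treated as fixed inputs, only the expert and gate weights would receive gradient updates, which is the source of the training-time savings the abstract describes.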