Fake Speech Detection Using Residual Network with Transformer Encoder

Zhenyu Zhang, Xiaowei Yi, Xianfeng Zhao
Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Security, June 17, 2021
DOI: 10.1145/3437880.3460408
Fake speech detection aims to distinguish fake speech from natural speech. This paper presents an effective fake speech detection scheme based on a residual network with a transformer encoder (TE-ResNet) to improve detection performance. First, considering the inter-frame correlation of the speech signal, we use a transformer encoder to extract contextual representations of the acoustic features. Then, a residual network processes these deep features and computes a score indicating how likely the speech is to be fake. In addition, to increase the quantity of training data, we apply five speech data augmentation techniques to the training dataset. Finally, we fuse the different fake speech detection models at the score level by logistic regression, compensating for the shortcomings of each individual model. The proposed scheme is evaluated on two public speech datasets. Our experiments demonstrate that the proposed TE-ResNet outperforms existing state-of-the-art methods on both the development and evaluation datasets. Moreover, the proposed fused model improves detection of speech produced by unseen fake speech technologies, achieving equal error rates (EERs) of 3.99% and 5.89% on the evaluation sets of the FoR-normal dataset and the ASVspoof 2019 LA dataset, respectively.
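The two evaluation-side ideas in the abstract, score-level fusion by logistic regression and the equal error rate (EER) metric, can be sketched in plain Python. The sketch below is illustrative only: the abstract does not describe the paper's actual fusion training or EER procedure, so the gradient-descent trainer, the function names `fuse_scores` and `eer`, and the toy hyperparameters are all assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_scores(model_scores, labels, lr=0.5, epochs=2000):
    """Learn logistic-regression fusion weights by batch gradient descent
    (a simple stand-in for any logistic-regression fitter).

    model_scores: one score list per detector (higher = more likely fake).
    labels: 1 for fake speech, 0 for natural speech.
    Returns (weights, bias); the fused score of trial i is
    sigmoid(sum_m weights[m] * model_scores[m][i] + bias).
    """
    n_models, n = len(model_scores), len(labels)
    w, b = [0.0] * n_models, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * n_models, 0.0
        for i, y in enumerate(labels):
            z = sum(w[m] * model_scores[m][i] for m in range(n_models)) + b
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for m in range(n_models):
                gw[m] += err * model_scores[m][i]
            gb += err
        w = [w[m] - lr * gw[m] / n for m in range(n_models)]
        b -= lr * gb / n
    return w, b

def eer(scores, labels):
    """Equal error rate: the operating point where the false-acceptance
    rate (natural speech scored as fake) equals the false-rejection rate
    (fake speech scored as natural), approximated by sweeping every
    observed score as a threshold."""
    n_fake = sum(labels)
    n_natural = len(labels) - n_fake
    best, gap = 1.0, float("inf")
    for t in sorted(set(scores)):
        far = sum(s >= t for s, y in zip(scores, labels) if y == 0) / n_natural
        frr = sum(s < t for s, y in zip(scores, labels) if y == 1) / n_fake
        if abs(far - frr) < gap:
            gap, best = abs(far - frr), (far + frr) / 2
    return best
```

On a toy set of trials where one detector separates the classes cleanly and another is noisy, the learned fusion weights favor the stronger detector, which is the point of score-level fusion: the regression downweights models whose scores carry little discriminative information.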