SiamNet: Siamese CNN Based Similarity Model for Adversarially Generated Environmental Sounds

Aswathy Madhu, S. Kumaraswamy
2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), 25 October 2021. DOI: 10.1109/mlsp52302.2021.9596435

Abstract: Recently, Generative Adversarial Networks (GANs) have been used extensively in machine learning applications for the synthetic generation of image and audio samples. However, efficient methods for evaluating the quality of GAN-generated samples are not yet available. Moreover, most existing evaluation metrics were developed exclusively for images and may not work well with other types of data, such as audio. Evaluation metrics developed specifically for audio are rare, which makes generating perceptually acceptable audio with GANs difficult. In this work, we address this problem. We propose a Siamese CNN that simultaneously learns a feature representation and a similarity measure to evaluate the quality of synthetic audio generated by a GAN. The proposed method estimates the perceptual proximity between the original and generated samples. Our similarity model is trained on two standard datasets of environmental sounds. The pre-trained model is evaluated on environmental sounds generated using a GAN. The predicted mean similarity scores of SiamNet are highly correlated with human ratings at the class level. This indicates that our model successfully captures the perceptual similarity between the generated and original audio samples.
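The core Siamese idea in the abstract (a single network with shared weights embeds both the original and the generated sample, and a similarity function then compares the two embeddings) can be sketched in plain Python. The linear embedding, toy weights, and cosine metric below are illustrative assumptions for the sketch, not the CNN architecture or the learned similarity measure from the paper.

```python
import math

# Sketch of the Siamese constraint: the SAME embedding function with the
# SAME weights is applied to both inputs, and a similarity score is
# computed on the resulting embeddings. All values here are toy data.

def embed(x, weights):
    """Shared 'branch': a fixed linear map from features to an embedding."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def cosine_similarity(a, b):
    """Compare two embeddings; 1.0 means identical direction."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

# Toy feature vectors standing in for audio features of an original clip,
# a perceptually close generated clip, and an unrelated clip.
original = [1.0, 0.5, -0.2]
generated = [0.9, 0.6, -0.1]
unrelated = [-1.0, 0.3, 0.8]

# One weight matrix shared by both branches (the Siamese property).
W = [[0.2, -0.1, 0.4],
     [0.5, 0.3, -0.2]]

e_orig = embed(original, W)
score_close = cosine_similarity(e_orig, embed(generated, W))
score_far = cosine_similarity(e_orig, embed(unrelated, W))
print(score_close, score_far)  # the close pair scores higher
```

In the paper's setting, `embed` would be the trained CNN and the similarity measure would itself be learned from the two environmental-sound datasets, but the weight-sharing structure is the same.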