Mandarin Singing Synthesis Based on Generative Adversarial Network
Yun Zhou, Hongwu Yang, Ziyan Chen, Yajing Yan
2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP), published 2020-09-01
DOI: 10.1109/ICICSP50920.2020.9232118 (https://doi.org/10.1109/ICICSP50920.2020.9232118)
This paper proposes a statistical parametric singing synthesis method in which the acoustic model is trained with a generative adversarial network (GAN). Within the GAN framework, the acoustic model is trained to minimize a weighted sum of the conventional minimum generation loss and an adversarial loss. The adversarial loss minimizes the distance between the parameters of natural and generated samples, thereby effectively addressing the over-smoothing problem. For the experiments, we built a singing voice corpus of 60 songs, which were recorded, labeled, and divided into about 1000 sentences, 950 of which were used to train the model. In a mean opinion score (MOS) test with 10 listeners, songs generated by the proposed method scored 3.12, better than the 2.81 obtained by the HMM-based method.
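
The weighted-sum training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss definitions, so everything here is an assumption for illustration: the generation loss is taken to be a mean squared error over acoustic parameters, the adversarial loss is taken to be the negative mean log of the discriminator's output on generated samples, and the names `generation_loss`, `adversarial_loss`, `combined_loss`, and `adv_weight` are hypothetical rather than the authors' own.

```python
import math


def generation_loss(natural, generated):
    """Assumed form of the generation loss: mean squared error between
    natural and generated acoustic parameters (flat lists of floats)."""
    if len(natural) != len(generated):
        raise ValueError("natural and generated must have the same length")
    return sum((n - g) ** 2 for n, g in zip(natural, generated)) / len(natural)


def adversarial_loss(disc_probs, eps=1e-12):
    """Assumed form of the generator-side adversarial loss: -mean(log D(x)),
    where disc_probs are the discriminator's probabilities that each
    generated sample is natural. eps guards against log(0)."""
    return -sum(math.log(max(p, eps)) for p in disc_probs) / len(disc_probs)


def combined_loss(natural, generated, disc_probs, adv_weight=1.0):
    """Weighted sum of the two losses; adv_weight is a hypothetical
    hyperparameter controlling the adversarial term."""
    return generation_loss(natural, generated) + adv_weight * adversarial_loss(disc_probs)


if __name__ == "__main__":
    natural = [1.0, 2.0]
    generated = [1.0, 4.0]
    disc_probs = [0.5, 0.5]  # discriminator is undecided about both samples
    print(combined_loss(natural, generated, disc_probs, adv_weight=2.0))
```

With `adv_weight` set to 0 the objective reduces to the generation loss alone; raising it pushes the generated parameters toward samples the discriminator cannot tell apart from natural ones, which is the mechanism the abstract credits with reducing over-smoothing.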