Voice-to-voice conversion using transformer network*

Phonetics and Speech Sciences Pub Date : 2020-09-01 DOI:10.13064/ksss.2020.12.3.055

June-Woo Kim, H. Jung


Citations: 0

Abstract

Voice conversion can be applied to a variety of voice processing applications, and it can also play an important role in data augmentation for speech recognition. The conventional method uses a voice conversion architecture coupled with speech synthesis, with the Mel filter bank as the main parameter. Mel filter bank features are well suited to fast neural network computation, but they cannot be converted into a high-quality waveform without the aid of a vocoder. Further, they are not effective for obtaining data for speech recognition. In this paper, we focus on performing voice-to-voice conversion using only the raw spectrum. We propose a deep learning model based on the transformer network, which quickly learns the voice conversion properties using an attention mechanism between source and target spectral components. The experiments were performed on TIDIGITS data, a series of numbers spoken by an English speaker. The converted voices were evaluated for naturalness and similarity using the mean opinion score (MOS) obtained from 30 participants. Our final results yielded 3.52±0.22 for naturalness and 3.89±0.19 for similarity.
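The abstract does not spell out the model, so the following is only a minimal sketch of the general idea of attention between source and target spectral components, written as plain scaled dot-product cross-attention in NumPy. The frame counts, the 257-bin spectrum size, the model width, the random projection matrices, and the function names `softmax` and `cross_attention` are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(target_frames, source_frames, w_q, w_k, w_v):
    """Scaled dot-product attention from target frames to source frames.

    target_frames: (T_tgt, n_bins) spectral frames of the target voice
    source_frames: (T_src, n_bins) spectral frames of the source voice
    w_q, w_k, w_v: (n_bins, d_model) projections (hypothetical, random here)
    """
    queries = target_frames @ w_q   # (T_tgt, d_model)
    keys = source_frames @ w_k      # (T_src, d_model)
    values = source_frames @ w_v    # (T_src, d_model)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1 over source frames
    return weights @ values, weights


# Assumed toy dimensions: 257 spectrum bins, 50 source and 40 target frames.
rng = np.random.default_rng(0)
n_bins, d_model = 257, 64
source = rng.standard_normal((50, n_bins))
target = rng.standard_normal((40, n_bins))
w_q, w_k, w_v = (rng.standard_normal((n_bins, d_model)) * 0.05 for _ in range(3))

context, weights = cross_attention(target, source, w_q, w_k, w_v)
```

Each row of `weights` shows how strongly one target frame attends to every source frame. In a trained model the projections would be learned rather than random, and the attention would sit inside a full encoder-decoder stack.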
