Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher

Heyang Xue, Shan Yang, Yinjiao Lei, Lei Xie, Xiulin Li
{"title":"Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher","authors":"Heyang Xue, Shan Yang, Yinjiao Lei, Lei Xie, Xiulin Li","doi":"10.1109/SLT48900.2021.9383585","DOIUrl":null,"url":null,"abstract":"Singing voice synthesis has been paid rising attention with the rapid development of speech synthesis area. In general, a studio-level singing corpus is usually necessary to produce a natural singing voice from lyrics and music-related transcription. However, such a corpus is difficult to collect since it’s hard for many of us to sing like a professional singer. In this paper, we propose an approach – Learn2Sing that only needs a singing teacher to generate the target speakers’ singing voice without their singing voice data. In our approach, a teacher’s singing corpus and speech from multiple target speakers are trained in a frame-level auto-regressive acoustic model where singing and speaking share the common speaker embedding and style tag embedding. Meanwhile, since there is no music-related transcription for the target speaker, we use log-scale fundamental frequency (LF0) as an auxiliary feature as the inputs of the acoustic model for building a unified input representation. In order to enable the target speaker to sing without singing reference audio in the inference stage, a duration model and an LF0 prediction model are also trained. Particularly, we employ domain adversarial training (DAT) in the acoustic model, which aims to enhance the singing performance of target speakers by disentangling style from acoustic features of singing and speaking data. Our experiments indicate that the proposed approach is capable of synthesizing singing voice for target speaker given only their speech samples.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"670 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT48900.2021.9383585","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Singing voice synthesis has received rising attention with the rapid development of the speech synthesis field. In general, a studio-level singing corpus is necessary to produce a natural singing voice from lyrics and music-related transcriptions. However, such a corpus is difficult to collect, since few people can sing like a professional singer. In this paper, we propose Learn2Sing, an approach that needs only a singing teacher to generate target speakers' singing voices without any singing data from those speakers. In our approach, a teacher's singing corpus and speech from multiple target speakers are trained jointly in a frame-level autoregressive acoustic model in which singing and speaking share a common speaker embedding and style tag embedding. Meanwhile, since no music-related transcription is available for the target speakers, we use log-scale fundamental frequency (LF0) as an auxiliary input feature of the acoustic model to build a unified input representation. To enable a target speaker to sing without singing reference audio at the inference stage, a duration model and an LF0 prediction model are also trained. In particular, we employ domain adversarial training (DAT) in the acoustic model, which aims to enhance the singing performance of target speakers by disentangling style from the acoustic features of the singing and speaking data. Our experiments indicate that the proposed approach can synthesize a singing voice for a target speaker given only their speech samples.
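The most distinctive component described above is the DAT branch: a style classifier attached to the acoustic model through a gradient reversal layer (GRL), so that the encoder is pushed toward style-invariant features while shared speaker and style-tag embeddings carry identity and speak/sing information. The sketch below is a minimal PyTorch illustration of that idea, assuming a GRL-based discriminator in the style of Ganin & Lempitsky (2015); every module name, dimension, and the GRU encoder are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda on the way back."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lambd * grad_out, None


class DATAcousticModel(nn.Module):
    """Hypothetical skeleton: a frame-level encoder conditioned on shared speaker
    and style-tag embeddings plus an auxiliary LF0 stream, with an adversarial
    style classifier attached through gradient reversal."""

    def __init__(self, n_phones=100, n_speakers=10, n_styles=2,
                 emb_dim=256, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.phone_emb = nn.Embedding(n_phones, emb_dim)
        self.speaker_emb = nn.Embedding(n_speakers, emb_dim)  # shared by speech and singing
        self.style_emb = nn.Embedding(n_styles, emb_dim)      # style tag: speak vs. sing
        self.lf0_proj = nn.Linear(1, emb_dim)                 # auxiliary LF0 input
        self.encoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.mel_out = nn.Linear(emb_dim, 80)                 # e.g. 80-dim mel frames
        self.style_clf = nn.Linear(emb_dim, n_styles)         # adversarial branch

    def forward(self, phones, lf0, speaker_id, style_id):
        # phones: (B, T) frame-level phone ids; lf0: (B, T, 1);
        # speaker_id, style_id: (B,) ids broadcast over all frames.
        x = (self.phone_emb(phones)
             + self.lf0_proj(lf0)
             + self.speaker_emb(speaker_id).unsqueeze(1)
             + self.style_emb(style_id).unsqueeze(1))
        h, _ = self.encoder(x)
        mel = self.mel_out(h)
        # The classifier learns to predict style; the reversed gradient pushes
        # the encoder toward style-invariant features.
        style_logits = self.style_clf(GradReverse.apply(h, self.lambd))
        return mel, style_logits
```

In training, the style classifier's cross-entropy loss is simply added to the reconstruction loss; the GRL is what makes its gradient adversarial with respect to the encoder:

```python
mel, style_logits = model(phones, lf0, speaker_id, style_id)
loss = F.l1_loss(mel, mel_target) \
     + F.cross_entropy(style_logits.transpose(1, 2), style_labels)  # labels: (B, T)
loss.backward()
```

Because the speaker embedding is shared across the speech and singing data, the identity learned from a target speaker's speech can be combined at inference with the singing style tag and predicted LF0/durations, which is what lets the model sing in a voice it has only heard speaking.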