Audio representations for deep learning in sound synthesis: A review

Anastasia Natsiou, Seán O'Leary
DOI: 10.1109/AICCSA53542.2021.9686838
Published in: 2021 IEEE/ACS 18th International Conference on Computer Systems and Applications (AICCSA), November 2021
Citations: 7

Abstract

The rise of deep learning algorithms has led many researchers to withdraw from using classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation. The choice of architecture is tightly coupled to the audio representation. A sound's original waveform can be too dense and rich for deep learning models to deal with efficiently, and this complexity increases training time and computational cost. Moreover, the raw waveform does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio is transformed into a compressed and more meaningful form using downsampling, feature extraction, or even a higher-level representation of the waveform. Furthermore, depending on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.
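To make the motivation concrete: a common transformation from the dense raw waveform into a compact time-frequency representation is the short-time Fourier transform (STFT) magnitude spectrogram. The sketch below is an illustration, not a method from the paper; the function name, frame size, and hop length are assumptions chosen for the example.

```python
import numpy as np

def stft_magnitude(signal, frame_size=1024, hop=256):
    """Magnitude spectrogram of a 1-D signal via a Hann-windowed STFT."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    # Slice the waveform into overlapping windowed frames.
    frames = np.stack([signal[i * hop : i * hop + frame_size] * window
                       for i in range(n_frames)])
    # One real FFT per frame; keep magnitudes only.
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_size // 2 + 1)

# A one-second 440 Hz sine at 16 kHz: 16000 raw samples become
# a (59, 513) time-frequency grid with energy near bin 440 * 1024 / 16000 ≈ 28.
sr = 16000
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))
```

Representations like this (and perceptually weighted variants such as mel spectrograms) trade temporal resolution for a sparser, more perceptually aligned input that is easier for a model to learn from.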