Audio representations for deep learning in sound synthesis: A review

Anastasia Natsiou, Seán O'Leary
DOI: 10.1109/AICCSA53542.2021.9686838
Published in: 2021 IEEE/ACS 18th International Conference on Computer Systems and Applications (AICCSA), November 2021
Citations: 7

Abstract

The rise of deep learning algorithms has led many researchers to withdraw from using classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation. The choice of architecture is tightly coupled to the audio representation. A sound's original waveform can be too dense and rich for deep learning models to deal with efficiently, and this complexity increases training time and computational cost. Moreover, the raw waveform does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio is transformed into a compressed and more meaningful form using downsampling, feature extraction, or even a higher-level representation of the waveform. Furthermore, depending on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.
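To make the motivation concrete: a common transformation from the dense raw waveform into a compact time-frequency representation is the short-time Fourier transform (STFT) magnitude spectrogram. The sketch below is an illustration, not a method from the paper; the function name, frame size, and hop length are assumptions chosen for the example.

```python
import numpy as np

def stft_magnitude(signal, frame_size=1024, hop=256):
    """Magnitude spectrogram of a 1-D signal via a Hann-windowed STFT."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    # Slice the waveform into overlapping windowed frames.
    frames = np.stack([signal[i * hop : i * hop + frame_size] * window
                       for i in range(n_frames)])
    # One real FFT per frame; keep magnitudes only.
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_size // 2 + 1)

# A one-second 440 Hz sine at 16 kHz: 16000 raw samples become
# a (59, 513) time-frequency grid with energy near bin 440 * 1024 / 16000 ≈ 28.
sr = 16000
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))
```

Representations like this (and perceptually weighted variants such as mel spectrograms) trade temporal resolution for a sparser, more perceptually aligned input that is easier for a model to learn from.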