Comparative Study of Recurrent Neural Networks for Virtual Analog Audio Effects Modeling
Riccardo Simionato, Stefano Fasciani
arXiv - CS - Sound, published 2024-05-07
DOI: https://doi.org/arxiv-2405.04124
Citations: 0
Abstract
Analog electronic circuits are at the core of an important category of musical devices. The nonlinear behavior of their electronic components gives analog musical devices a distinctive timbre and sound quality, making them highly desirable. Artificial neural networks, particularly recurrent networks, have rapidly gained popularity for emulating analog audio effects circuits. While neural approaches have been successful in accurately modeling distortion circuits, they require architectural improvements that account for parameter conditioning and low-latency response. In this article, we explore the application of recent machine learning advances to virtual analog modeling. We compare State Space models and Linear Recurrent Units against the more common Long Short-Term Memory networks. The former have shown promise in sequence-to-sequence modeling tasks, with notable improvements in encoding signal history. Our comparative study applies these black-box neural modeling techniques to a variety of audio effects. We evaluate their performance and limitations using multiple metrics that assess the models' ability to accurately replicate energy envelopes, frequency content, and transients in the audio signal. To incorporate control parameters, we employ the Feature-wise Linear Modulation method. Long Short-Term Memory networks exhibit better accuracy in emulating distortions and equalizers, while the State Space model, followed by Long Short-Term Memory networks integrated in an encoder-decoder structure, outperforms the others in emulating saturation and compression. When considering long time-variant characteristics, the State Space model demonstrates the greatest accuracy. The Long Short-Term Memory and, in particular, the Linear Recurrent Unit networks show a greater tendency to introduce audio artifacts.
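The two core ideas named in the abstract, a linear recurrence over the signal history (as in State Space models and Linear Recurrent Units) and Feature-wise Linear Modulation (FiLM) for injecting control parameters, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; all shapes, the diagonal decay value, and the "knob" conditioning are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(x, gamma, beta):
    # Feature-wise Linear Modulation: per-channel scale and shift,
    # where gamma and beta are predicted from the control parameters.
    return gamma * x + beta

def linear_recurrence(u, lam, B, C):
    """Minimal diagonal linear recurrent layer (LRU/SSM-style):
    h[t] = lam * h[t-1] + B @ u[t];  y[t] = C @ h[t]."""
    h = np.zeros(lam.shape[0])
    ys = []
    for u_t in u:
        h = lam * h + B @ np.atleast_1d(u_t)  # state carries signal history
        ys.append(C @ h)
    return np.array(ys)

# Hypothetical sizes and parameters, for illustration only.
state_size = 4
lam = np.full(state_size, 0.9)            # stable diagonal recurrence (|lam| < 1)
B = rng.standard_normal((state_size, 1))  # input projection
C = rng.standard_normal((1, state_size))  # output projection

u = np.sin(np.linspace(0, 2 * np.pi, 32))  # toy mono "audio" input
y = linear_recurrence(u, lam, B, C)

# FiLM-condition the output on a control parameter (e.g. a normalized knob),
# using a tiny random linear map in place of a learned conditioning network.
knob = np.array([0.5])
gamma = knob @ rng.standard_normal((1, 1)) + 1.0
beta = knob @ rng.standard_normal((1, 1))
y_conditioned = film(y, gamma, beta)
```

In the actual models, `lam`, `B`, `C`, and the FiLM conditioning maps would be learned from recorded input/output pairs of the target analog device, and the recurrence would typically be computed with an efficient parallel scan rather than a Python loop.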