Determination of Low-Level Audio Descriptors of a Musical Instrument Sound Using Neural Network
Maciej Blaszke, Damian Koszewski
2020 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), published 2020-09-23
DOI: 10.23919/spa50552.2020.9241264

Audio files and the audio channel of video files can be described with temporal, spectral, cepstral, and perceptual audio descriptors. The so-called low-level descriptors are closely related to the signal characteristics. At least three levels of extraction granularity can be discerned: at any point in the signal, in small arbitrary regions (i.e., frames), and in longer pre-segmented regions. Even though tools for computing these descriptors are available (e.g., MIRToolbox, Python/libROSA), the resulting feature vector is always redundant, as it contains many highly correlated descriptors, and the performance of these tools is limited. That is why this study proposes a method for obtaining these descriptors with an Artificial Neural Network (ANN) with a deep structure, i.e., a Deep Neural Network (DNN). In this scheme, the raw audio signal representing a given musical instrument is fed to the DNN input. Such a network can be used as a standalone module or as a pre-trained part of a larger architecture. The performance of the deep network in deriving MPEG-7 descriptors is reported, along with the convergence and behavior of the loss function.
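To illustrate the kind of frame-level low-level descriptor the abstract refers to, here is a minimal NumPy sketch of one of them, the spectral centroid (the magnitude-weighted mean frequency of a frame). This is illustrative code, not the authors' implementation and not the exact MPEG-7 definition; it mirrors what tools such as MIRToolbox or libROSA compute per frame.

```python
import numpy as np

def spectral_centroid(frame: np.ndarray, sr: int) -> float:
    """Magnitude-weighted mean frequency of one frame, in Hz.

    A hand-rolled sketch of one low-level descriptor; libROSA's
    spectral_centroid computes the same quantity per frame.
    """
    windowed = frame * np.hanning(len(frame))    # taper to limit spectral leakage
    mag = np.abs(np.fft.rfft(windowed))          # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = mag.sum()
    return float((freqs * mag).sum() / total) if total > 0.0 else 0.0

# Frame-level extraction over a synthetic 500 Hz tone: the centroid of a
# pure sine sits at the tone's frequency.
sr = 16000
t = np.arange(sr) / sr                           # one second of audio
tone = np.sin(2 * np.pi * 500.0 * t)

frame_len = 1024
centroids = [
    spectral_centroid(tone[i:i + frame_len], sr)
    for i in range(0, len(tone) - frame_len + 1, frame_len)
]
```

Collecting such per-frame values for many descriptors yields the redundant, highly correlated feature vector the abstract mentions.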
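The proposed scheme feeds a raw audio signal to a DNN that outputs descriptor values. The abstract does not specify the network architecture, so the following is a hypothetical minimal stand-in: a one-hidden-layer network trained by plain gradient descent to regress the normalized spectral centroid directly from raw frames. The frame size, layer width, learning rate, and training data are all illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME, sr = 256, 8000

def centroid_target(frame: np.ndarray) -> float:
    """Descriptor value the network should learn, in normalized frequency [0, 1]."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(FRAME, d=1.0 / sr)
    return float((freqs * mag).sum() / (mag.sum() + 1e-12) / (sr / 2))

# Toy training data: raw sine frames at random frequencies, labeled with
# the descriptor computed analytically (the DSP tool the DNN replaces).
f0 = rng.uniform(200.0, 3000.0, size=512)
t = np.arange(FRAME) / sr
X = np.sin(2 * np.pi * f0[:, None] * t[None, :])          # (512, FRAME) raw audio
y = np.array([centroid_target(x) for x in X])             # (512,) targets

# One-hidden-layer MLP (ReLU), trained on the MSE loss.
W1 = rng.normal(0.0, 0.1, (FRAME, 32)); b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.1, (32, 1));     b2 = np.zeros(1)

losses, lr = [], 0.01
for _ in range(200):
    h = np.maximum(X @ W1 + b1, 0.0)                      # hidden activations
    pred = (h @ W2 + b2).ravel()                          # predicted descriptor
    err = pred - y
    losses.append(float((err ** 2).mean()))
    # Backpropagation of the mean-squared-error gradient.
    g_pred = 2.0 * err[:, None] / len(X)
    gW2, gb2 = h.T @ g_pred, g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (h > 0)
    gW1, gb1 = X.T @ g_h, g_h.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```

Tracking `losses` over the epochs corresponds to the loss-function convergence and behavior the paper reports; a trained module like this could likewise serve standalone or as a pre-trained part of a larger architecture.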