Determination of Low-Level Audio Descriptors of a Musical Instrument Sound Using Neural Network
Maciej Blaszke, Damian Koszewski
2020 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), September 23, 2020
DOI: 10.23919/spa50552.2020.9241264
Cited by: 1
Abstract
Audio files and the audio channel of video files can be described with temporal, spectral, cepstral, and perceptual audio descriptors. The so-called low-level descriptors are closely related to the signal characteristics. One can discern at least three levels of extraction granularity from the signal: at any point in the signal, in small arbitrary regions (i.e., frames), and in longer pre-segmented regions. Even though there are tools (e.g., MIRToolbox, Python/libROSA) available for computing these descriptors, the resulting feature vector is always redundant, as it contains many highly correlated descriptors, and these tools also have performance limitations. That is why, in this study, a method for obtaining those descriptors using an Artificial Neural Network (ANN) with a deep structure (i.e., a DNN) is proposed. In such a scheme, the raw audio signal representing a given musical instrument is fed to the DNN input. Such a network can be used as a standalone module or as a pre-trained part of a larger architecture. The results of the deep network's performance in the context of MPEG-7 descriptor derivation are shown, along with the convergence and behavior of the loss function.
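To make the notion of a frame-level low-level descriptor concrete, the following is a minimal NumPy sketch (not the authors' DNN method, and not using libROSA) of one of the standard MPEG-7-style spectral descriptors, the spectral centroid, computed on a single windowed audio frame. The frame length, sample rate, and test tone are illustrative choices, not values from the paper.

```python
import numpy as np

def spectral_centroid(frame: np.ndarray, sr: int) -> float:
    """Spectral centroid of one audio frame: the magnitude-weighted
    mean frequency of its spectrum, in Hz."""
    mag = np.abs(np.fft.rfft(frame))           # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    if mag.sum() == 0.0:                        # silent frame: centroid undefined
        return 0.0
    return float((freqs * mag).sum() / mag.sum())

# Illustrative check: a 440 Hz sine sampled at 16 kHz. A Hann window
# limits spectral leakage, so the centroid lands close to 440 Hz.
sr = 16000
n = 1024
t = np.arange(n) / sr
frame = np.hanning(n) * np.sin(2 * np.pi * 440.0 * t)
c = spectral_centroid(frame, sr)
```

A DNN trained as in the paper's scheme would regress such per-frame descriptor values directly from the raw waveform, replacing this explicit FFT-based computation.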