Determination of Low-Level Audio Descriptors of a Musical Instrument Sound Using Neural Network
Maciej Blaszke, Damian Koszewski
2020 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), published 2020-09-23
DOI: 10.23919/spa50552.2020.9241264

Audio files and the audio channel of video files can be described with temporal, spectral, cepstral, and perceptual audio descriptors. The so-called low-level descriptors are closely related to the signal characteristics. At least three levels of extraction granularity can be discerned: at any point in the signal, in small arbitrary regions (i.e., frames), and in longer pre-segmented regions. Even though tools for computing these descriptors are available (e.g., MIRToolbox, Python/libROSA), the resulting feature vector is always redundant, as it contains many highly correlated descriptors, and the performance of these tools is limited. That is why this study proposes a method for obtaining these descriptors with an Artificial Neural Network (ANN) with a deep structure, i.e., a Deep Neural Network (DNN). In this scheme, the raw audio signal representing a given musical instrument is fed to the DNN input. Such a network can be used as a standalone module or as a pre-trained part of a larger architecture. The performance of the deep network in deriving MPEG-7 descriptors is reported, along with the convergence and behavior of the loss function.
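To illustrate the kind of frame-level low-level descriptor the abstract refers to, here is a minimal NumPy sketch of one of them, the spectral centroid (the magnitude-weighted mean frequency of a frame). This is illustrative code, not the authors' implementation and not the exact MPEG-7 definition; it mirrors what tools such as MIRToolbox or libROSA compute per frame.

```python
import numpy as np

def spectral_centroid(frame: np.ndarray, sr: int) -> float:
    """Magnitude-weighted mean frequency of one frame, in Hz.

    A hand-rolled sketch of one low-level descriptor; libROSA's
    spectral_centroid computes the same quantity per frame.
    """
    windowed = frame * np.hanning(len(frame))    # taper to limit spectral leakage
    mag = np.abs(np.fft.rfft(windowed))          # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = mag.sum()
    return float((freqs * mag).sum() / total) if total > 0.0 else 0.0

# Frame-level extraction over a synthetic 500 Hz tone: the centroid of a
# pure sine sits at the tone's frequency.
sr = 16000
t = np.arange(sr) / sr                           # one second of audio
tone = np.sin(2 * np.pi * 500.0 * t)

frame_len = 1024
centroids = [
    spectral_centroid(tone[i:i + frame_len], sr)
    for i in range(0, len(tone) - frame_len + 1, frame_len)
]
```

Collecting such per-frame values for many descriptors yields the redundant, highly correlated feature vector the abstract mentions.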
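The proposed scheme feeds a raw audio signal to a DNN that outputs descriptor values. The abstract does not specify the network architecture, so the following is a hypothetical minimal stand-in: a one-hidden-layer network trained by plain gradient descent to regress the normalized spectral centroid directly from raw frames. The frame size, layer width, learning rate, and training data are all illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME, sr = 256, 8000

def centroid_target(frame: np.ndarray) -> float:
    """Descriptor value the network should learn, in normalized frequency [0, 1]."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(FRAME, d=1.0 / sr)
    return float((freqs * mag).sum() / (mag.sum() + 1e-12) / (sr / 2))

# Toy training data: raw sine frames at random frequencies, labeled with
# the descriptor computed analytically (the DSP tool the DNN replaces).
f0 = rng.uniform(200.0, 3000.0, size=512)
t = np.arange(FRAME) / sr
X = np.sin(2 * np.pi * f0[:, None] * t[None, :])          # (512, FRAME) raw audio
y = np.array([centroid_target(x) for x in X])             # (512,) targets

# One-hidden-layer MLP (ReLU), trained on the MSE loss.
W1 = rng.normal(0.0, 0.1, (FRAME, 32)); b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.1, (32, 1));     b2 = np.zeros(1)

losses, lr = [], 0.01
for _ in range(200):
    h = np.maximum(X @ W1 + b1, 0.0)                      # hidden activations
    pred = (h @ W2 + b2).ravel()                          # predicted descriptor
    err = pred - y
    losses.append(float((err ** 2).mean()))
    # Backpropagation of the mean-squared-error gradient.
    g_pred = 2.0 * err[:, None] / len(X)
    gW2, gb2 = h.T @ g_pred, g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (h > 0)
    gW1, gb1 = X.T @ g_h, g_h.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```

Tracking `losses` over the epochs corresponds to the loss-function convergence and behavior the paper reports; a trained module like this could likewise serve standalone or as a pre-trained part of a larger architecture.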