Music Boundary Detection using Convolutional Neural Networks: A comparative analysis of combined input features

Int. J. Interact. Multim. Artif. Intell. | Pub Date: 2020-08-17 | DOI: 10.9781/ijimai.2021.10.005

Carlos Hernandez-Olivan, J. R. Beltrán, David Diaz-Guerra


Citations: 9

Abstract

Analyzing the structure of musical pieces remains a challenge for Artificial Intelligence, especially in the field of Deep Learning. It requires first identifying the structural boundaries of a piece. Structural boundary analysis has recently been studied with unsupervised methods and with end-to-end techniques such as Convolutional Neural Networks (CNNs) that take Mel-scaled log-magnitude spectrograms (MLS), Self-Similarity Matrices (SSM) or Self-Similarity Lag Matrices (SSLM) as inputs and are trained with human annotations. The published studies, both unsupervised and end-to-end, pre-process their data in different ways, using different distance metrics and audio features, so a generalized pre-processing method for computing model inputs is missing. The objective of this work is to establish a general method for pre-processing these inputs by comparing inputs computed with different pooling strategies, distance metrics and audio features, also taking into account the time needed to compute them. We also identify the most effective combination of inputs to the CNN, in order to find the most efficient way to extract the structural boundaries of music pieces. With an adequate combination of input matrices and pooling strategies we obtain an $F_1$ score of 0.411, which outperforms the current result obtained under the same conditions.
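To make the input types named in the abstract more concrete, the following is a minimal sketch of how a pooled feature sequence could be turned into an SSM and an SSLM. It is not the authors' pipeline: the abstract does not say which pooling strategy or distance metric performed best, so max-pooling over time and cosine distance are assumptions made here for illustration, as are all function names and the choice to pad unavailable lags with the maximum distance 1.0.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def max_pool_time(frames, factor):
    """Max-pool non-overlapping groups of `factor` frames, per feature bin.

    Trailing frames that do not fill a whole group are dropped (assumption).
    """
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        pooled.append([max(column) for column in zip(*group)])
    return pooled


def self_similarity_matrix(frames):
    """SSM[i][j] = distance between frame i and frame j."""
    n = len(frames)
    return [[cosine_distance(frames[i], frames[j]) for j in range(n)]
            for i in range(n)]


def self_similarity_lag_matrix(frames, max_lag):
    """SSLM[i][l - 1] = distance between frame i and frame i - l, for l = 1..max_lag.

    Lags that reach before the first frame are padded with 1.0 (assumption).
    """
    sslm = []
    for i in range(len(frames)):
        row = []
        for lag in range(1, max_lag + 1):
            j = i - lag
            row.append(cosine_distance(frames[i], frames[j]) if j >= 0 else 1.0)
        sslm.append(row)
    return sslm


# Toy feature sequence (hypothetical): two frames of one "section", then two of another.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
ssm = self_similarity_matrix(frames)
sslm = self_similarity_lag_matrix(frames, max_lag=2)
```

In the toy sequence, the change of section between the second and third frames appears as a block edge in `ssm` and as a row of high distances in `sslm`, which is the kind of pattern a CNN trained on annotated boundaries would learn to pick up.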
