{"title":"MusicTalk:识别乐器的微服务方法","authors":"Yi-Bing Lin;Chang-Chieh Cheng;Shih-Chuan Chiu","doi":"10.1109/OJCS.2024.3476416","DOIUrl":null,"url":null,"abstract":"Musical instrument recognition is the process of using machine learning or audio signal processing to identify and classify different musical instruments from an audio recording. This capability enables more precise analysis of musical pieces, aiding in tasks like transcription, music recommendation, and automated composition. The challenges include (1) recognition models not being accurate enough, (2) the need to retrain the entire model when a new instrument is added, and (3) differences in audio formats that prevent direct usage. To address these challenges, this article introduces MusicTalk, a microservice based musical instrument (MI) detection system, with several key contributions. Firstly, MusicTalk introduces a novel patchout mechanism named Brightness Characteristic Based Patchout for the ViT algorithm, which enhances MI detection accuracy compared to existing solutions. Secondly, MusicTalk integrates individual MI detectors as microservices, facilitating efficient interaction with other microservices. Thirdly, MusicTalk incorporates an audio shaper that unifies diverse music open datasets such as Audioset, Openmic-2018, MedleyDB, URMP, and INSTDB. By employing Grad-CAM analysis on Mel-Spectrograms, we elucidate the characteristics of the MI detection model. This analysis allows us to optimize ensemble combinations of ViT with patchout and CNNs within MusicTalk, resulting in high accuracy rates. For instance, the system achieves precision and recall rates of 96.17% and 95.77% respectively for violin detection, which are the highest among previous approaches. An additional advantage of MusicTalk lies in its microservice-driven visualization capabilities. By integrating MI detectors as microservices, MusicTalk enables seamless visualization of songs using animated avatars. In a case study featuring “Peter and the Wolf,” we demonstrate that improved MI detection accuracy enhances the visual storytelling impact of music. The overall F1-score improvement of MusicTalk over previous approaches for this song is up to 12%.","PeriodicalId":13205,"journal":{"name":"IEEE Open Journal of the Computer Society","volume":"5 ","pages":"612-623"},"PeriodicalIF":0.0000,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10709650","citationCount":"0","resultStr":"{\"title\":\"MusicTalk: A Microservice Approach for Musical Instrument Recognition\",\"authors\":\"Yi-Bing Lin;Chang-Chieh Cheng;Shih-Chuan Chiu\",\"doi\":\"10.1109/OJCS.2024.3476416\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Musical instrument recognition is the process of using machine learning or audio signal processing to identify and classify different musical instruments from an audio recording. This capability enables more precise analysis of musical pieces, aiding in tasks like transcription, music recommendation, and automated composition. The challenges include (1) recognition models not being accurate enough, (2) the need to retrain the entire model when a new instrument is added, and (3) differences in audio formats that prevent direct usage. To address these challenges, this article introduces MusicTalk, a microservice based musical instrument (MI) detection system, with several key contributions. 
Firstly, MusicTalk introduces a novel patchout mechanism named Brightness Characteristic Based Patchout for the ViT algorithm, which enhances MI detection accuracy compared to existing solutions. Secondly, MusicTalk integrates individual MI detectors as microservices, facilitating efficient interaction with other microservices. Thirdly, MusicTalk incorporates an audio shaper that unifies diverse music open datasets such as Audioset, Openmic-2018, MedleyDB, URMP, and INSTDB. By employing Grad-CAM analysis on Mel-Spectrograms, we elucidate the characteristics of the MI detection model. This analysis allows us to optimize ensemble combinations of ViT with patchout and CNNs within MusicTalk, resulting in high accuracy rates. For instance, the system achieves precision and recall rates of 96.17% and 95.77% respectively for violin detection, which are the highest among previous approaches. An additional advantage of MusicTalk lies in its microservice-driven visualization capabilities. By integrating MI detectors as microservices, MusicTalk enables seamless visualization of songs using animated avatars. In a case study featuring “Peter and the Wolf,” we demonstrate that improved MI detection accuracy enhances the visual storytelling impact of music. The overall F1-score improvement of MusicTalk over previous approaches for this song is up to 12%.\",\"PeriodicalId\":13205,\"journal\":{\"name\":\"IEEE Open Journal of the Computer Society\",\"volume\":\"5 \",\"pages\":\"612-623\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-10-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10709650\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Open Journal of the Computer Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10709650/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Computer Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10709650/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Musical instrument recognition is the process of using machine learning or audio signal processing to identify and classify different musical instruments from an audio recording. This capability enables more precise analysis of musical pieces, aiding in tasks such as transcription, music recommendation, and automated composition. The challenges include (1) recognition models that are not accurate enough, (2) the need to retrain the entire model whenever a new instrument is added, and (3) differences in audio formats that prevent direct use of existing datasets. To address these challenges, this article introduces MusicTalk, a microservice-based musical instrument (MI) detection system, with several key contributions. First, MusicTalk introduces a novel patchout mechanism for the ViT algorithm, named Brightness Characteristic Based Patchout, which improves MI detection accuracy over existing solutions. Second, MusicTalk integrates individual MI detectors as microservices, facilitating efficient interaction with other microservices. Third, MusicTalk incorporates an audio shaper that unifies diverse open music datasets such as Audioset, Openmic-2018, MedleyDB, URMP, and INSTDB. By applying Grad-CAM analysis to Mel-spectrograms, we elucidate the characteristics of the MI detection model. This analysis allows us to optimize ensemble combinations of ViT with patchout and CNNs within MusicTalk, resulting in high accuracy. For instance, the system achieves precision and recall of 96.17% and 95.77%, respectively, for violin detection, the highest among previous approaches. An additional advantage of MusicTalk lies in its microservice-driven visualization capabilities: by integrating MI detectors as microservices, MusicTalk enables seamless visualization of songs using animated avatars. In a case study featuring “Peter and the Wolf,” we demonstrate that improved MI detection accuracy enhances the visual storytelling impact of music. For this song, MusicTalk improves the overall F1-score over previous approaches by up to 12%.
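The abstract does not detail how the Brightness Characteristic Based Patchout decides which Mel-spectrogram patches to keep. The sketch below is one plausible reading, assuming "brightness" is approximated by each patch's mean log-Mel magnitude and that only the brightest fraction of patches is retained as ViT input tokens; the function name, patch size, and keep ratio are all illustrative assumptions, not the paper's actual design.

```python
import numpy as np
import librosa

def brightness_patchout(y, sr, n_mels=128, patch_size=16, keep_ratio=0.5):
    """Rank non-overlapping Mel-spectrogram patches by a simple brightness
    score and keep only the brightest fraction (hypothetical criterion)."""
    # Log-scaled Mel-spectrogram, shape (n_mels, frames).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # Trim so the spectrogram tiles exactly into patch_size x patch_size patches.
    n_f = (log_mel.shape[0] // patch_size) * patch_size
    n_t = (log_mel.shape[1] // patch_size) * patch_size
    log_mel = log_mel[:n_f, :n_t]

    # Split into patches; score each by its mean magnitude as a stand-in
    # "brightness" statistic (the paper's exact measure may differ).
    patches = (log_mel.reshape(n_f // patch_size, patch_size,
                               n_t // patch_size, patch_size)
                      .transpose(0, 2, 1, 3))
    scores = patches.mean(axis=(2, 3)).ravel()

    # Keep the brightest patches; these would become the ViT's input tokens,
    # with the remaining (dim) patches "patched out".
    n_keep = max(1, int(keep_ratio * scores.size))
    keep_idx = np.argsort(scores)[-n_keep:]
    return patches.reshape(-1, patch_size, patch_size)[keep_idx], keep_idx
```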
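The second contribution, packaging each per-instrument detector as a microservice, can be illustrated with a minimal HTTP endpoint. The route, payload format, and dummy model below are assumptions for illustration; the paper's actual service interface is not described in the abstract.

```python
import io
import librosa
from flask import Flask, request, jsonify

app = Flask(__name__)

class DummyDetector:
    """Stand-in for one trained per-instrument model (e.g., a ViT/CNN ensemble)."""
    def predict(self, y):
        return 0.5  # placeholder presence probability

model = DummyDetector()

@app.route("/detect/violin", methods=["POST"])
def detect_violin():
    # Expect raw audio bytes in the request body; decode and resample
    # to a fixed rate so the detector sees a uniform input format.
    y, sr = librosa.load(io.BytesIO(request.data), sr=16000, mono=True)
    prob = float(model.predict(y))
    return jsonify({"instrument": "violin", "probability": prob})

if __name__ == "__main__":
    app.run(port=5001)
```

Because each instrument lives behind its own endpoint, supporting a new instrument means deploying one more service rather than retraining a monolithic classifier, which is the retraining problem the abstract calls out as challenge (2).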
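Likewise, the "audio shaper" that reconciles heterogeneous datasets can be sketched as a normalization step; the target sample rate and clip length below are illustrative choices, not values taken from the paper.

```python
import numpy as np
import librosa

def shape_audio(path, target_sr=16000, target_secs=10.0):
    """Load any supported audio file, convert to mono at target_sr, and
    pad or trim to a fixed duration so clips from Audioset, Openmic-2018,
    MedleyDB, URMP, and INSTDB all share one input format."""
    y, _ = librosa.load(path, sr=target_sr, mono=True)
    n_target = int(target_sr * target_secs)
    if len(y) < n_target:
        y = np.pad(y, (0, n_target - len(y)))  # zero-pad short clips
    return y[:n_target]                         # trim long clips
```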