Comparative Analysis of Deep Learning Architectures and Vision Transformers for Musical Key Estimation

Impact factor: 2.4 | JCR Q3 (Computer Science, Information Systems)
Manav Garg, Pranshav Gajjar, Pooja Shah, Madhu Shukla, Biswaranjan Acharya, Vassilis C. Gerogiannis, Andreas Kanavos
Journal: Information (Switzerland)
Published: 2023-09-28
DOI: 10.3390/info14100527 (https://doi.org/10.3390/info14100527)

Abstract

The musical key is a central element of a piece: it indicates the tonal center, harmonic structure, and chord progressions, and it enables tasks such as transposition and arrangement. Accurate key estimation also has practical applications in music recommendation systems and automatic music transcription, which makes it relevant to both academia and industry. This paper compares standard deep learning architectures with emerging vision transformers, which have succeeded in a range of other domains. We evaluate six deep learning models on a subset of the GTZAN dataset. DenseNet, a conventional deep learning architecture, achieves an accuracy of 91.64%, outperforming the vision transformers. We then analyze the temporal characteristics of each model. The vision transformer and the SWIN transformer show a slight decrease in overall performance (1.82% and 2.29%, respectively), yet they outperform the DenseNet architecture on temporal metrics. These findings contribute to musical key estimation, a field in which both accurate and efficient algorithms matter. Examining the strengths and weaknesses of deep learning architectures and vision transformers yields insights for practical implementations, particularly in music recommendation systems and automatic music transcription, and provides a foundation for further work in this area.
Source journal: Information (Switzerland)
Subject area: Computer Science-Information Systems
CiteScore: 6.90
Self-citation rate: 0.00%
Articles published: 515
Review time: 11 weeks