PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley
{"title":"PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing","authors":"Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley","doi":"arxiv-2409.10831","DOIUrl":null,"url":null,"abstract":"The recent explosion of generative AI-Music systems has raised numerous\nconcerns over data copyright, licensing music from musicians, and the conflict\nbetween open-source AI and large prestige companies. Such issues highlight the\nneed for publicly available, copyright-free musical data, in which there is a\nlarge shortage, particularly for symbolic music data. To alleviate this issue,\nwe present PDMX: a large-scale open-source dataset of over 250K public domain\nMusicXML scores collected from the score-sharing forum MuseScore, making it the\nlargest available copyright-free symbolic music dataset to our knowledge. PDMX\nadditionally includes a wealth of both tag and user interaction metadata,\nallowing us to efficiently analyze the dataset and filter for high quality\nuser-generated scores. Given the additional metadata afforded by our data\ncollection process, we conduct multitrack music generation experiments\nevaluating how different representative subsets of PDMX lead to different\nbehaviors in downstream models, and how user-rating statistics can be used as\nan effective measure of data quality. Examples can be found at\nhttps://pnlong.github.io/PDMX.demo/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10831","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, in which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it the largest available copyright-free symbolic music dataset to our knowledge. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at https://pnlong.github.io/PDMX.demo/.
PDMX:用于符号音乐处理的大规模公共领域音乐 XML 数据集
最近,人工智能音乐生成系统的爆炸式发展引发了许多关于数据版权、音乐家音乐授权以及开源人工智能与大型知名公司之间冲突的问题。这些问题凸显了对可公开获取、无版权限制的音乐数据的需求,而这方面的数据尤其是符号音乐数据非常缺乏。为了缓解这一问题,我们推出了 PDMX:一个大型开源数据集,其中包含从乐谱共享论坛 MuseScore 收集的超过 25 万个公共领域的 MusicXML 乐谱,是我们所知的最大的可用无版权符号音乐数据集。PDMX 还包括大量标签和用户交互元数据,使我们能够高效地分析数据集,并筛选出高质量的用户生成乐谱。鉴于我们的数据收集过程提供了额外的元数据,我们进行了多轨音乐生成实验,评估 PDMX 的不同代表性子集如何导致下游模型的不同行为,以及用户评分统计如何用作数据质量的有效衡量标准。示例可在https://pnlong.github.io/PDMX.demo/。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信