Time-frequency scattering accurately models auditory similarities between instrumental playing techniques.

IF 1.7 3区 计算机科学 Q2 ACOUSTICS
Vincent Lostanlen, Christian El-Hajj, Mathias Rossignol, Grégoire Lafay, Joakim Andén, Mathieu Lagrange
{"title":"Time-frequency scattering accurately models auditory similarities between instrumental playing techniques.","authors":"Vincent Lostanlen,&nbsp;Christian El-Hajj,&nbsp;Mathias Rossignol,&nbsp;Grégoire Lafay,&nbsp;Joakim Andén,&nbsp;Mathieu Lagrange","doi":"10.1186/s13636-020-00187-z","DOIUrl":null,"url":null,"abstract":"<p><p>Instrumentalplaying techniques such as vibratos, glissandos, and trills often denote musical expressivity, both in classical and folk contexts. However, most existing approaches to music similarity retrieval fail to describe timbre beyond the so-called \"ordinary\" technique, use instrument identity as a proxy for timbre quality, and do not allow for customization to the perceptual idiosyncrasies of a new subject. In this article, we ask 31 human participants to organize 78 isolated notes into a set of timbre clusters. Analyzing their responses suggests that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone. In addition, we propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques. Our model relies on joint time-frequency scattering features to extract spectrotemporal modulations as acoustic features. Furthermore, it minimizes triplet loss in the cluster graph by means of the large-margin nearest neighbor (LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we report a state-of-the-art average precision at rank five (AP@5) of 99<i>.</i>0<i>%</i>±1. An ablation study demonstrates that removing either the joint time-frequency scattering transform or the metric learning algorithm noticeably degrades performance.</p>","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"2021 1","pages":"3"},"PeriodicalIF":1.7000,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13636-020-00187-z","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eurasip Journal on Audio Speech and Music Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s13636-020-00187-z","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2021/1/11 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 10

Abstract

Instrumentalplaying techniques such as vibratos, glissandos, and trills often denote musical expressivity, both in classical and folk contexts. However, most existing approaches to music similarity retrieval fail to describe timbre beyond the so-called "ordinary" technique, use instrument identity as a proxy for timbre quality, and do not allow for customization to the perceptual idiosyncrasies of a new subject. In this article, we ask 31 human participants to organize 78 isolated notes into a set of timbre clusters. Analyzing their responses suggests that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone. In addition, we propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques. Our model relies on joint time-frequency scattering features to extract spectrotemporal modulations as acoustic features. Furthermore, it minimizes triplet loss in the cluster graph by means of the large-margin nearest neighbor (LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we report a state-of-the-art average precision at rank five (AP@5) of 99.0%±1. An ablation study demonstrates that removing either the joint time-frequency scattering transform or the metric learning algorithm noticeably degrades performance.

Abstract Image

Abstract Image

Abstract Image

时频散射准确地模拟了乐器演奏技术之间的听觉相似性。
器乐演奏技巧,如颤音、滑音和颤音,在古典和民间语境中通常表示音乐表现力。然而,大多数现有的音乐相似性检索方法无法描述所谓的“普通”技术之外的音色,使用乐器身份作为音质的代理,并且不允许定制新主题的感知特性。在本文中,我们要求31名人类参与者将78个孤立的音符组织成一组音色簇。分析他们的反应表明,音色感知在一种更灵活的分类中运作,而不仅仅是由乐器或演奏技术提供的。此外,我们提出了一个机器聆听模型来恢复跨乐器,静音和技术的听觉相似性的聚类图。我们的模型依赖于联合时频散射特征来提取作为声学特征的光谱时间调制。在此基础上,利用大边界最近邻(LMNN)度量学习算法最小化聚类图中的三元组损失。在9346个独立音符的数据集上,我们报告了排名5 (AP@5)的最先进的平均精度为99.0%±1。一项消融研究表明,去除联合时频散射变换或度量学习算法都会显著降低性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Eurasip Journal on Audio Speech and Music Processing
Eurasip Journal on Audio Speech and Music Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore
4.10
自引率
4.20%
发文量
0
审稿时长
12 months
期刊介绍: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信