International Society for Music Information Retrieval Conference: Latest Publications

Mel Spectrogram Inversion with Stable Pitch
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-26 | DOI: 10.48550/arXiv.2208.12782
Bruno Di Giorgi, M. Levy, Richard Sharp
Abstract: Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform. Modern speech generation pipelines use a vocoder as their final component. Recent vocoder models developed for speech achieve a high degree of realism, such that it is natural to wonder how they would perform on music signals. Compared to speech, the heterogeneity and structure of the musical sound texture offer new challenges. In this work we focus on one specific artifact that some vocoder models designed for speech tend to exhibit when applied to music: the perceived instability of pitch when synthesizing sustained notes. We argue that the characteristic sound of this artifact is due to the lack of horizontal phase coherence, which is often the result of using a time-domain target space with a model that is invariant to time-shifts, such as a convolutional neural network. We propose a new vocoder model that is specifically designed for music. Key to improving the pitch stability is the choice of a shift-invariant target space that consists of the magnitude spectrum and the phase gradient. We discuss the reasons that inspired us to re-formulate the vocoder task, outline a working example, and evaluate it on musical signals. Our method results in 60% and 10% improved reconstruction of sustained notes and chords with respect to existing models, using a novel harmonic error metric.
Citations: 4
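For context on the representation this kind of vocoder has to invert, here is a minimal sketch of computing a mel spectrogram with librosa. It is not the authors' model; the test tone and parameter values (FFT size, hop length, number of mel bands) are illustrative assumptions.

```python
import numpy as np
import librosa

sr = 22050
# A sustained test tone stands in for real music audio.
y = librosa.tone(440.0, sr=sr, duration=2.0)

# The low-dimensional spectral representation a vocoder maps back to a waveform.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```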
Concept-Based Techniques for "Musicologist-friendly" Explanations in a Deep Music Classifier
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-26 | DOI: 10.48550/arXiv.2208.12485
Francesco Foscarin, Katharina Hoedt, Verena Praher, A. Flexer, G. Widmer
Abstract: Current approaches for explaining deep learning systems applied to musical data provide results in a low-level feature space, e.g., by highlighting potentially relevant time-frequency bins in a spectrogram or time-pitch bins in a piano roll. This can be difficult to understand, particularly for musicologists without technical knowledge. To address this issue, we focus on more human-friendly explanations based on high-level musical concepts. Our research targets trained systems (post-hoc explanations) and explores two approaches: a supervised one, where the user can define a musical concept and test if it is relevant to the system; and an unsupervised one, where musical excerpts containing relevant concepts are automatically selected and given to the user for interpretation. We demonstrate both techniques on an existing symbolic composer classification system, showcase their potential, and highlight their intrinsic limitations.
Citations: 3
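As a loose illustration of the supervised setting described above (testing whether a user-defined musical concept is relevant to a trained classifier), one common approach is to fit a linear probe on the classifier's internal embeddings. The embeddings and labels below are synthetic placeholders; this is a generic concept probe, not the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder embeddings: internal activations of a trained music classifier for
# excerpts that contain the concept (label 1) vs. excerpts that do not (label 0).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128))
has_concept = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, has_concept, test_size=0.3, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# If the probe generalizes well above chance, the concept is linearly decodable
# from the model's representation, i.e. plausibly relevant to its decisions.
print("concept probe accuracy:", probe.score(X_test, y_test))
```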
Automatic music mixing with deep learning and out-of-domain data
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-24 | DOI: 10.48550/arXiv.2208.11428
Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, S. Uhlich, Chihiro Nagashima, Yuki Mitsufuji
Abstract: Music mixing traditionally involves recording instruments in the form of clean, individual tracks and blending them into a final mixture using audio effects and expert knowledge (e.g., a mixing engineer). The automation of music production tasks has become an emerging field in recent years, where rule-based methods and machine learning approaches have been explored. Nevertheless, the lack of dry or clean instrument recordings limits the performance of such models, which is still far from professional human-made mixes. We explore whether we can use out-of-domain data such as wet or processed multitrack music recordings and repurpose it to train supervised deep learning models that can bridge the current gap in automatic mixing quality. To achieve this we propose a novel data preprocessing method that allows the models to perform automatic music mixing. We also redesigned a listening test method for evaluating music mixing systems. We validate our results through such subjective tests using highly experienced mixing engineers as participants.
Citations: 10
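The abstract does not detail the preprocessing method, but a typical normalization step when working with already-processed ("wet") multitracks is loudness normalization. The sketch below uses pyloudnorm on synthetic audio purely as an illustration of that kind of preprocessing; it is not the paper's pipeline, and the target level is an arbitrary assumption.

```python
import numpy as np
import pyloudnorm as pyln

sr = 44100
# Synthetic stand-in for a processed ("wet") instrument track.
track = 0.1 * np.random.randn(sr * 5)

meter = pyln.Meter(sr)                        # ITU-R BS.1770 loudness meter
loudness = meter.integrated_loudness(track)   # integrated loudness in LUFS
normalized = pyln.normalize.loudness(track, loudness, -23.0)  # normalize to -23 LUFS
print(loudness, meter.integrated_loudness(normalized))
```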
Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-24 | DOI: 10.48550/arXiv.2208.11671
Yixiao Zhang, Junyan Jiang, Gus G. Xia, S. Dixon
Abstract: Lyric interpretations can help people understand songs and their lyrics quickly, and can also make it easier to manage, retrieve and discover songs efficiently from the growing mass of music archives. In this paper we propose BART-fusion, a novel model for generating lyric interpretations from lyrics and music audio that combines a large-scale pre-trained language model with an audio encoder. We employ a cross-modal attention module to incorporate the audio representation into the lyrics representation to help the pre-trained language model understand the song from an audio perspective, while preserving the language model's original generative performance. We also release the Song Interpretation Dataset, a new large-scale dataset for training and evaluating our model. Experimental results show that the additional audio information helps our model to understand words and music better, and to generate precise and fluent interpretations. An additional experiment on cross-modal music retrieval shows that interpretations generated by BART-fusion can also help people retrieve music more accurately than with the original BART.
Citations: 6
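To make the cross-modal attention idea concrete, here is a minimal PyTorch sketch in which lyric token embeddings attend over audio frame embeddings. The dimensions and tensors are made up, and this is only a generic cross-attention layer, not BART-fusion itself.

```python
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

# Placeholder inputs: 1 song, 100 lyric tokens, 300 audio frames.
lyric_tokens = torch.randn(1, 100, d_model)   # queries (text side)
audio_frames = torch.randn(1, 300, d_model)   # keys/values (audio side)

# Each lyric token gathers information from the audio representation.
fused, attn_weights = cross_attn(query=lyric_tokens, key=audio_frames, value=audio_frames)
print(fused.shape)  # torch.Size([1, 100, 512])
```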
Parameter Sensitivity of Deep-Feature based Evaluation Metrics for Audio Textures
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-23 | DOI: 10.48550/arXiv.2208.10743
Chitralekha Gupta, Yize Wei, Zequn Gong, Purnima Kamath, Zhuoyao Li, L. Wyse
Abstract: Standard evaluation metrics such as the Inception score and Fréchet Audio Distance provide a general audio quality distance metric between the synthesized audio and reference clean audio. However, the sensitivity of these metrics to variations in the statistical parameters that define an audio texture is not well studied. In this work, we provide a systematic study of the sensitivity of some of the existing audio quality evaluation metrics to parameter variations in audio textures. Furthermore, we also study three more potentially parameter-sensitive metrics for audio texture synthesis, (a) a Gram matrix based distance, (b) an Accumulated Gram metric using a summarized version of the Gram matrices, and (c) a cochlear-model based statistical features metric. These metrics use deep features that summarize the statistics of any given audio texture, thus being inherently sensitive to variations in the statistical parameters that define an audio texture. We study and evaluate the sensitivity of existing standard metrics as well as Gram matrix and cochlear-model based metrics to control-parameter variations in audio textures across a wide range of texture and parameter types, and validate with subjective evaluation. We find that each of the metrics is sensitive to different sets of texture-parameter types. This is the first step towards investigating objective metrics for assessing parameter sensitivity in audio textures.
Citations: 4
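A minimal sketch of the Gram-matrix idea: summarize a set of feature frames by their channel-by-channel correlations, then compare two textures via the Frobenius distance between their Gram matrices. This shows the generic notion of a "Gram matrix based distance" on random placeholder features, not the paper's exact metric or feature extractor.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """features: (channels, frames) deep-feature map of an audio texture."""
    channels, frames = features.shape
    return features @ features.T / frames  # (channels, channels) second-order statistics

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(64, 500))   # placeholder features for texture A
feats_b = rng.normal(size=(64, 500))   # placeholder features for texture B

# Small when the two textures share similar feature statistics, large otherwise.
distance = np.linalg.norm(gram_matrix(feats_a) - gram_matrix(feats_b), ord="fro")
print(distance)
```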
Musika! Fast Infinite Waveform Music Generation
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-18 | DOI: 10.48550/arXiv.2208.08706
Marco Pasini, Jan Schlüter
Abstract: Fast and user-controllable music generation could enable novel ways of composing or performing music. However, state-of-the-art music generation systems require large amounts of data and computational resources for training, and are slow at inference. This makes them impractical for real-time interactive use. In this work, we introduce Musika, a music generation system that can be trained on hundreds of hours of music using a single consumer GPU, and that allows for much faster than real-time generation of music of arbitrary length on a consumer CPU. We achieve this by first learning a compact invertible representation of spectrogram magnitudes and phases with adversarial autoencoders, then training a Generative Adversarial Network (GAN) on this representation for a particular music domain. A latent coordinate system enables generating arbitrarily long sequences of excerpts in parallel, while a global context vector allows the music to remain stylistically coherent through time. We perform quantitative evaluations to assess the quality of the generated samples and showcase options for user control in piano and techno music generation. We release the source code and pretrained autoencoder weights at github.com/marcoppasini/musika, such that a GAN can be trained on a new music domain with a single GPU in a matter of hours.
Citations: 15
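As a toy illustration of how a latent coordinate system plus a global context vector can yield arbitrarily long output, the snippet below decodes a sequence of latent coordinates in parallel with a shared context vector and concatenates the excerpts. The decoder is a dummy stand-in module and does not reflect Musika's actual architecture or training.

```python
import torch
import torch.nn as nn

class DummyExcerptDecoder(nn.Module):
    """Stand-in for a trained generator mapping (latent, global context) to an excerpt."""
    def __init__(self, latent_dim=64, context_dim=64, samples_per_excerpt=2048):
        super().__init__()
        self.net = nn.Linear(latent_dim + context_dim, samples_per_excerpt)

    def forward(self, latents, context):
        context = context.expand(latents.shape[0], -1)
        return torch.tanh(self.net(torch.cat([latents, context], dim=-1)))

decoder = DummyExcerptDecoder()
global_context = torch.randn(1, 64)   # keeps the piece stylistically coherent over time
latent_coords = torch.randn(16, 64)   # one latent coordinate per excerpt, decoded in parallel

with torch.no_grad():
    excerpts = decoder(latent_coords, global_context)  # (16, 2048)
waveform = excerpts.reshape(-1)                        # concatenate into one long signal
print(waveform.shape)
```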
Representation Learning for the Automatic Indexing of Sound Effects Libraries
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-18 | DOI: 10.48550/arXiv.2208.09096
Alison B. Ma, Alexander Lerch
Abstract: Labeling and maintaining a commercial sound effects library is a time-consuming task exacerbated by databases that continually grow in size and undergo taxonomy updates. Moreover, sound search and taxonomy creation are complicated by non-uniform metadata, an unrelenting problem even with the introduction of a new industry standard, the Universal Category System. To address these problems and overcome dataset-dependent limitations that inhibit the successful training of deep learning models, we pursue representation learning to train generalized embeddings that can be used for a wide variety of sound effects libraries and are a taxonomy-agnostic representation of sound. We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size, outperforming established representations such as OpenL3. Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness.
Citations: 0
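One standard ingredient of the metric learning approaches mentioned above is a triplet loss, which pulls an anchor embedding toward a positive example of the same sound class and away from a negative example. The sketch below is generic PyTorch with placeholder embeddings, not the paper's training setup.

```python
import torch
import torch.nn as nn

embedding_dim = 128
triplet_loss = nn.TripletMarginLoss(margin=0.2)

# Placeholder embeddings produced by an audio encoder for a batch of sound effects:
# anchor and positive share a label (e.g., "door slam"); negative has a different label.
anchor = torch.randn(32, embedding_dim, requires_grad=True)
positive = torch.randn(32, embedding_dim)
negative = torch.randn(32, embedding_dim)

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # in a real training loop, these gradients would update the encoder
print(loss.item())
```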
DDX7: Differentiable FM Synthesis of Musical Instrument Sounds
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-12 | DOI: 10.48550/arXiv.2208.06169
Franco Caspe, Andrew McPherson, M. Sandler
Abstract: FM Synthesis is a well-known algorithm used to generate complex timbre from a compact set of design primitives. Typically featuring a MIDI interface, it is usually impractical to control it from an audio source. On the other hand, Differentiable Digital Signal Processing (DDSP) has enabled nuanced audio rendering by Deep Neural Networks (DNNs) that learn to control differentiable synthesis layers from arbitrary sound inputs. The training process involves a corpus of audio for supervision, and spectral reconstruction loss functions. Such functions, while being great to match spectral amplitudes, present a lack of pitch direction which can hinder the joint optimization of the parameters of FM synthesizers. In this paper, we take steps towards enabling continuous control of a well-established FM synthesis architecture from an audio input. Firstly, we discuss a set of design constraints that ease spectral optimization of a differentiable FM synthesizer via a standard reconstruction loss. Next, we present Differentiable DX7 (DDX7), a lightweight architecture for neural FM resynthesis of musical instrument sounds in terms of a compact set of parameters. We train the model on instrument samples extracted from the URMP dataset, and quantitatively demonstrate its comparable audio quality against selected benchmarks.
Citations: 15
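For readers unfamiliar with FM synthesis, here is a minimal two-operator example in PyTorch (a single modulator phase-modulating a carrier), with a toy spectral loss to show the chain is differentiable. DDX7 itself controls a multi-operator DX7-style patch with learned envelopes, so this only illustrates the underlying principle; the frequencies and modulation index are arbitrary assumptions.

```python
import math
import torch

sr = 16000
t = torch.arange(sr) / sr          # one second of time samples

f_carrier = 440.0                  # carrier frequency (Hz), arbitrary
f_modulator = 220.0                # modulator frequency (Hz), arbitrary
index = torch.tensor(2.0, requires_grad=True)  # modulation index; differentiable, so a
                                               # neural network could predict it

# Two-operator FM: the modulator's output phase-modulates the carrier.
modulator = torch.sin(2 * math.pi * f_modulator * t)
signal = torch.sin(2 * math.pi * f_carrier * t + index * modulator)

# Toy magnitude-spectrum reconstruction loss against a plain sine target.
target = torch.sin(2 * math.pi * f_carrier * t)
loss = torch.mean((torch.abs(torch.fft.rfft(signal)) - torch.abs(torch.fft.rfft(target))) ** 2)
loss.backward()
print(index.grad)  # gradient of the spectral loss w.r.t. the modulation index
```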
Symbolic Music Loop Generation with Neural Discrete Representations
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-11 | DOI: 10.48550/arXiv.2208.05605
Sangjun Han, H. Ihm, Moontae Lee, Woohyung Lim
Abstract: Since most music has repetitive structures from motifs to phrases, repeating musical ideas can be a basic operation for music composition. The basic block that we focus on is conceptualized as loops which are essential ingredients of music. Furthermore, meaningful note patterns can be formed in a finite space, so it is sufficient to represent them with combinations of discrete symbols as done in other domains. In this work, we propose symbolic music loop generation via learning discrete representations. We first extract loops from MIDI datasets using a loop detector and then learn an autoregressive model trained by discrete latent codes of the extracted loops. We show that our model outperforms well-known music generative models in terms of both fidelity and diversity, evaluating on random space. Our code and supplementary materials are available at https://github.com/sjhan91/Loop_VQVAE_Official.
Citations: 3
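The "discrete latent codes" referred to above come from vector quantization: each continuous loop encoding is snapped to its nearest codebook vector, and the resulting integer indices are what an autoregressive prior models. Below is a minimal, generic nearest-neighbour VQ lookup in PyTorch with a random codebook and random encodings (no straight-through gradient estimator), not the authors' VQ-VAE.

```python
import torch

num_codes, code_dim = 512, 64
codebook = torch.randn(num_codes, code_dim)   # learned jointly with the encoder in a real VQ-VAE

# Placeholder encoder outputs for one loop (8 vectors to quantize).
z_continuous = torch.randn(8, code_dim)

# Nearest codebook entry per vector (Euclidean distance).
distances = torch.cdist(z_continuous, codebook)   # (8, num_codes)
indices = distances.argmin(dim=-1)                # discrete symbols
z_quantized = codebook[indices]                   # quantized latents fed to the decoder

print(indices)  # the integer codes an autoregressive model would be trained on
```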
DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-09 | DOI: 10.48550/arXiv.2208.04756
Da-Yi Wu, Wen-Yi Hsiao, Fu-Rong Yang, Oscar D. Friedman, Warren Jackson, Scott Bruzenak, Yi-Wen Liu, Yi-Hsuan Yang
Abstract: A vocoder is a conditional audio generation model that converts acoustic features such as mel-spectrograms into waveforms. Taking inspiration from Differentiable Digital Signal Processing (DDSP), we propose a new vocoder named SawSing for singing voices. SawSing synthesizes the harmonic part of singing voices by filtering a sawtooth source signal with a linear time-variant finite impulse response filter whose coefficients are estimated from the input mel-spectrogram by a neural network. As this approach enforces phase continuity, SawSing can generate singing voices without the phase-discontinuity glitch of many existing vocoders. Moreover, the source-filter assumption provides an inductive bias that allows SawSing to be trained on a small amount of data. Our experiments show that SawSing converges much faster and outperforms state-of-the-art generative adversarial network and diffusion-based vocoders in a resource-limited scenario with only 3 training recordings and a 3-hour training time.
Citations: 14
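To illustrate the subtractive source-filter idea: generate a sawtooth at the target fundamental frequency and shape it with an FIR filter. SawSing's filter is linear time-variant, with coefficients predicted per frame from the mel spectrogram by a neural network; the sketch below uses a single static low-pass FIR and an arbitrary f0 purely for simplicity.

```python
import numpy as np
from scipy import signal

sr = 22050
f0 = 220.0                                   # illustrative fundamental frequency (Hz)
t = np.arange(sr) / sr
saw = signal.sawtooth(2 * np.pi * f0 * t)    # harmonic-rich source signal

# Static low-pass FIR as a stand-in for SawSing's frame-wise predicted,
# time-variant filter coefficients.
fir = signal.firwin(numtaps=101, cutoff=2000, fs=sr)
voiced = signal.lfilter(fir, 1.0, saw)

print(voiced.shape)
```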