International Society for Music Information Retrieval Conference: Latest Publications

Mel Spectrogram Inversion with Stable Pitch
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-26 | DOI: 10.48550/arXiv.2208.12782
Bruno Di Giorgi, M. Levy, Richard Sharp
Abstract: Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform. Modern speech generation pipelines use a vocoder as their final component. Recent vocoder models developed for speech achieve a high degree of realism, such that it is natural to wonder how they would perform on music signals. Compared to speech, the heterogeneity and structure of the musical sound texture offer new challenges. In this work we focus on one specific artifact that some vocoder models designed for speech tend to exhibit when applied to music: the perceived instability of pitch when synthesizing sustained notes. We argue that the characteristic sound of this artifact is due to the lack of horizontal phase coherence, which is often the result of using a time-domain target space with a model that is invariant to time-shifts, such as a convolutional neural network. We propose a new vocoder model that is specifically designed for music. Key to improving the pitch stability is the choice of a shift-invariant target space that consists of the magnitude spectrum and the phase gradient. We discuss the reasons that inspired us to re-formulate the vocoder task, outline a working example, and evaluate it on musical signals. Our method results in 60% and 10% improved reconstruction of sustained notes and chords with respect to existing models, using a novel harmonic error metric.
Citations: 4
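For context on the representation this kind of vocoder has to invert, here is a minimal sketch of computing a mel spectrogram with librosa. It is not the authors' model; the test tone and parameter values (FFT size, hop length, number of mel bands) are illustrative assumptions.

```python
import numpy as np
import librosa

sr = 22050
# A sustained test tone stands in for real music audio.
y = librosa.tone(440.0, sr=sr, duration=2.0)

# The low-dimensional spectral representation a vocoder maps back to a waveform.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```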
Concept-Based Techniques for "Musicologist-friendly" Explanations in a Deep Music Classifier
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-26 | DOI: 10.48550/arXiv.2208.12485
Francesco Foscarin, Katharina Hoedt, Verena Praher, A. Flexer, G. Widmer
Abstract: Current approaches for explaining deep learning systems applied to musical data provide results in a low-level feature space, e.g., by highlighting potentially relevant time-frequency bins in a spectrogram or time-pitch bins in a piano roll. This can be difficult to understand, particularly for musicologists without technical knowledge. To address this issue, we focus on more human-friendly explanations based on high-level musical concepts. Our research targets trained systems (post-hoc explanations) and explores two approaches: a supervised one, where the user can define a musical concept and test if it is relevant to the system; and an unsupervised one, where musical excerpts containing relevant concepts are automatically selected and given to the user for interpretation. We demonstrate both techniques on an existing symbolic composer classification system, showcase their potential, and highlight their intrinsic limitations.
Citations: 3
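As a loose illustration of the supervised setting described above (testing whether a user-defined musical concept is relevant to a trained classifier), one common approach is to fit a linear probe on the classifier's internal embeddings. The embeddings and labels below are synthetic placeholders; this is a generic concept probe, not the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder embeddings: internal activations of a trained music classifier for
# excerpts that contain the concept (label 1) vs. excerpts that do not (label 0).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128))
has_concept = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, has_concept, test_size=0.3, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# If the probe generalizes well above chance, the concept is linearly decodable
# from the model's representation, i.e. plausibly relevant to its decisions.
print("concept probe accuracy:", probe.score(X_test, y_test))
```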
Automatic music mixing with deep learning and out-of-domain data
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-24 | DOI: 10.48550/arXiv.2208.11428
Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, S. Uhlich, Chihiro Nagashima, Yuki Mitsufuji
Abstract: Music mixing traditionally involves recording instruments in the form of clean, individual tracks and blending them into a final mixture using audio effects and expert knowledge (e.g., a mixing engineer). The automation of music production tasks has become an emerging field in recent years, where rule-based methods and machine learning approaches have been explored. Nevertheless, the lack of dry or clean instrument recordings limits the performance of such models, which is still far from professional human-made mixes. We explore whether we can use out-of-domain data such as wet or processed multitrack music recordings and repurpose it to train supervised deep learning models that can bridge the current gap in automatic mixing quality. To achieve this we propose a novel data preprocessing method that allows the models to perform automatic music mixing. We also redesigned a listening test method for evaluating music mixing systems. We validate our results through such subjective tests using highly experienced mixing engineers as participants.
Citations: 10
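The abstract does not detail the preprocessing method, but a typical normalization step when working with already-processed ("wet") multitracks is loudness normalization. The sketch below uses pyloudnorm on synthetic audio purely as an illustration of that kind of preprocessing; it is not the paper's pipeline, and the target level is an arbitrary assumption.

```python
import numpy as np
import pyloudnorm as pyln

sr = 44100
# Synthetic stand-in for a processed ("wet") instrument track.
track = 0.1 * np.random.randn(sr * 5)

meter = pyln.Meter(sr)                        # ITU-R BS.1770 loudness meter
loudness = meter.integrated_loudness(track)   # integrated loudness in LUFS
normalized = pyln.normalize.loudness(track, loudness, -23.0)  # normalize to -23 LUFS
print(loudness, meter.integrated_loudness(normalized))
```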
Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-24 | DOI: 10.48550/arXiv.2208.11671
Yixiao Zhang, Junyan Jiang, Gus G. Xia, S. Dixon
Abstract: Lyric interpretations can help people understand songs and their lyrics quickly, and can also make it easier to manage, retrieve and discover songs efficiently from the growing mass of music archives. In this paper we propose BART-fusion, a novel model for generating lyric interpretations from lyrics and music audio that combines a large-scale pre-trained language model with an audio encoder. We employ a cross-modal attention module to incorporate the audio representation into the lyrics representation to help the pre-trained language model understand the song from an audio perspective, while preserving the language model's original generative performance. We also release the Song Interpretation Dataset, a new large-scale dataset for training and evaluating our model. Experimental results show that the additional audio information helps our model to understand words and music better, and to generate precise and fluent interpretations. An additional experiment on cross-modal music retrieval shows that interpretations generated by BART-fusion can also help people retrieve music more accurately than with the original BART.
Citations: 6
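To make the cross-modal attention idea concrete, here is a minimal PyTorch sketch in which lyric token embeddings attend over audio frame embeddings. The dimensions and tensors are made up, and this is only a generic cross-attention layer, not BART-fusion itself.

```python
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

# Placeholder inputs: 1 song, 100 lyric tokens, 300 audio frames.
lyric_tokens = torch.randn(1, 100, d_model)   # queries (text side)
audio_frames = torch.randn(1, 300, d_model)   # keys/values (audio side)

# Each lyric token gathers information from the audio representation.
fused, attn_weights = cross_attn(query=lyric_tokens, key=audio_frames, value=audio_frames)
print(fused.shape)  # torch.Size([1, 100, 512])
```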
Parameter Sensitivity of Deep-Feature based Evaluation Metrics for Audio Textures
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-23 | DOI: 10.48550/arXiv.2208.10743
Chitralekha Gupta, Yize Wei, Zequn Gong, Purnima Kamath, Zhuoyao Li, L. Wyse
Abstract: Standard evaluation metrics such as the Inception score and Fréchet Audio Distance provide a general audio quality distance metric between the synthesized audio and reference clean audio. However, the sensitivity of these metrics to variations in the statistical parameters that define an audio texture is not well studied. In this work, we provide a systematic study of the sensitivity of some of the existing audio quality evaluation metrics to parameter variations in audio textures. Furthermore, we also study three more potentially parameter-sensitive metrics for audio texture synthesis, (a) a Gram matrix based distance, (b) an Accumulated Gram metric using a summarized version of the Gram matrices, and (c) a cochlear-model based statistical features metric. These metrics use deep features that summarize the statistics of any given audio texture, thus being inherently sensitive to variations in the statistical parameters that define an audio texture. We study and evaluate the sensitivity of existing standard metrics as well as Gram matrix and cochlear-model based metrics to control-parameter variations in audio textures across a wide range of texture and parameter types, and validate with subjective evaluation. We find that each of the metrics is sensitive to different sets of texture-parameter types. This is the first step towards investigating objective metrics for assessing parameter sensitivity in audio textures.
Citations: 4
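A minimal sketch of the Gram-matrix idea: summarize a set of feature frames by their channel-by-channel correlations, then compare two textures via the Frobenius distance between their Gram matrices. This shows the generic notion of a "Gram matrix based distance" on random placeholder features, not the paper's exact metric or feature extractor.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """features: (channels, frames) deep-feature map of an audio texture."""
    channels, frames = features.shape
    return features @ features.T / frames  # (channels, channels) second-order statistics

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(64, 500))   # placeholder features for texture A
feats_b = rng.normal(size=(64, 500))   # placeholder features for texture B

# Small when the two textures share similar feature statistics, large otherwise.
distance = np.linalg.norm(gram_matrix(feats_a) - gram_matrix(feats_b), ord="fro")
print(distance)
```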
Musika! Fast Infinite Waveform Music Generation
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-18 | DOI: 10.48550/arXiv.2208.08706
Marco Pasini, Jan Schlüter
Abstract: Fast and user-controllable music generation could enable novel ways of composing or performing music. However, state-of-the-art music generation systems require large amounts of data and computational resources for training, and are slow at inference. This makes them impractical for real-time interactive use. In this work, we introduce Musika, a music generation system that can be trained on hundreds of hours of music using a single consumer GPU, and that allows for much faster than real-time generation of music of arbitrary length on a consumer CPU. We achieve this by first learning a compact invertible representation of spectrogram magnitudes and phases with adversarial autoencoders, then training a Generative Adversarial Network (GAN) on this representation for a particular music domain. A latent coordinate system enables generating arbitrarily long sequences of excerpts in parallel, while a global context vector allows the music to remain stylistically coherent through time. We perform quantitative evaluations to assess the quality of the generated samples and showcase options for user control in piano and techno music generation. We release the source code and pretrained autoencoder weights at github.com/marcoppasini/musika, such that a GAN can be trained on a new music domain with a single GPU in a matter of hours.
Citations: 15
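As a toy illustration of how a latent coordinate system plus a global context vector can yield arbitrarily long output, the snippet below decodes a sequence of latent coordinates in parallel with a shared context vector and concatenates the excerpts. The decoder is a dummy stand-in module and does not reflect Musika's actual architecture or training.

```python
import torch
import torch.nn as nn

class DummyExcerptDecoder(nn.Module):
    """Stand-in for a trained generator mapping (latent, global context) to an excerpt."""
    def __init__(self, latent_dim=64, context_dim=64, samples_per_excerpt=2048):
        super().__init__()
        self.net = nn.Linear(latent_dim + context_dim, samples_per_excerpt)

    def forward(self, latents, context):
        context = context.expand(latents.shape[0], -1)
        return torch.tanh(self.net(torch.cat([latents, context], dim=-1)))

decoder = DummyExcerptDecoder()
global_context = torch.randn(1, 64)   # keeps the piece stylistically coherent over time
latent_coords = torch.randn(16, 64)   # one latent coordinate per excerpt, decoded in parallel

with torch.no_grad():
    excerpts = decoder(latent_coords, global_context)  # (16, 2048)
waveform = excerpts.reshape(-1)                        # concatenate into one long signal
print(waveform.shape)
```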
Representation Learning for the Automatic Indexing of Sound Effects Libraries
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-18 | DOI: 10.48550/arXiv.2208.09096
Alison B. Ma, Alexander Lerch
Abstract: Labeling and maintaining a commercial sound effects library is a time-consuming task exacerbated by databases that continually grow in size and undergo taxonomy updates. Moreover, sound search and taxonomy creation are complicated by non-uniform metadata, an unrelenting problem even with the introduction of a new industry standard, the Universal Category System. To address these problems and overcome dataset-dependent limitations that inhibit the successful training of deep learning models, we pursue representation learning to train generalized embeddings that can be used for a wide variety of sound effects libraries and are a taxonomy-agnostic representation of sound. We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size, outperforming established representations such as OpenL3. Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness.
Citations: 0
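One standard ingredient of the metric learning approaches mentioned above is a triplet loss, which pulls an anchor embedding toward a positive example of the same sound class and away from a negative example. The sketch below is generic PyTorch with placeholder embeddings, not the paper's training setup.

```python
import torch
import torch.nn as nn

embedding_dim = 128
triplet_loss = nn.TripletMarginLoss(margin=0.2)

# Placeholder embeddings produced by an audio encoder for a batch of sound effects:
# anchor and positive share a label (e.g., "door slam"); negative has a different label.
anchor = torch.randn(32, embedding_dim, requires_grad=True)
positive = torch.randn(32, embedding_dim)
negative = torch.randn(32, embedding_dim)

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # in a real training loop, these gradients would update the encoder
print(loss.item())
```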
DDX7: Differentiable FM Synthesis of Musical Instrument Sounds
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-12 | DOI: 10.48550/arXiv.2208.06169
Franco Caspe, Andrew McPherson, M. Sandler
Abstract: FM Synthesis is a well-known algorithm used to generate complex timbre from a compact set of design primitives. Typically featuring a MIDI interface, it is usually impractical to control it from an audio source. On the other hand, Differentiable Digital Signal Processing (DDSP) has enabled nuanced audio rendering by Deep Neural Networks (DNNs) that learn to control differentiable synthesis layers from arbitrary sound inputs. The training process involves a corpus of audio for supervision, and spectral reconstruction loss functions. Such functions, while being great to match spectral amplitudes, present a lack of pitch direction which can hinder the joint optimization of the parameters of FM synthesizers. In this paper, we take steps towards enabling continuous control of a well-established FM synthesis architecture from an audio input. Firstly, we discuss a set of design constraints that ease spectral optimization of a differentiable FM synthesizer via a standard reconstruction loss. Next, we present Differentiable DX7 (DDX7), a lightweight architecture for neural FM resynthesis of musical instrument sounds in terms of a compact set of parameters. We train the model on instrument samples extracted from the URMP dataset, and quantitatively demonstrate its comparable audio quality against selected benchmarks.
Citations: 15
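For readers unfamiliar with FM synthesis, here is a minimal two-operator example in PyTorch (a single modulator phase-modulating a carrier), with a toy spectral loss to show the chain is differentiable. DDX7 itself controls a multi-operator DX7-style patch with learned envelopes, so this only illustrates the underlying principle; the frequencies and modulation index are arbitrary assumptions.

```python
import math
import torch

sr = 16000
t = torch.arange(sr) / sr          # one second of time samples

f_carrier = 440.0                  # carrier frequency (Hz), arbitrary
f_modulator = 220.0                # modulator frequency (Hz), arbitrary
index = torch.tensor(2.0, requires_grad=True)  # modulation index; differentiable, so a
                                               # neural network could predict it

# Two-operator FM: the modulator's output phase-modulates the carrier.
modulator = torch.sin(2 * math.pi * f_modulator * t)
signal = torch.sin(2 * math.pi * f_carrier * t + index * modulator)

# Toy magnitude-spectrum reconstruction loss against a plain sine target.
target = torch.sin(2 * math.pi * f_carrier * t)
loss = torch.mean((torch.abs(torch.fft.rfft(signal)) - torch.abs(torch.fft.rfft(target))) ** 2)
loss.backward()
print(index.grad)  # gradient of the spectral loss w.r.t. the modulation index
```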
Symbolic Music Loop Generation with Neural Discrete Representations
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-11 | DOI: 10.48550/arXiv.2208.05605
Sangjun Han, H. Ihm, Moontae Lee, Woohyung Lim
Abstract: Since most music has repetitive structures from motifs to phrases, repeating musical ideas can be a basic operation for music composition. The basic block that we focus on is conceptualized as loops which are essential ingredients of music. Furthermore, meaningful note patterns can be formed in a finite space, so it is sufficient to represent them with combinations of discrete symbols as done in other domains. In this work, we propose symbolic music loop generation via learning discrete representations. We first extract loops from MIDI datasets using a loop detector and then learn an autoregressive model trained by discrete latent codes of the extracted loops. We show that our model outperforms well-known music generative models in terms of both fidelity and diversity, evaluating on random space. Our code and supplementary materials are available at https://github.com/sjhan91/Loop_VQVAE_Official.
Citations: 3
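The "discrete latent codes" referred to above come from vector quantization: each continuous loop encoding is snapped to its nearest codebook vector, and the resulting integer indices are what an autoregressive prior models. Below is a minimal, generic nearest-neighbour VQ lookup in PyTorch with a random codebook and random encodings (no straight-through gradient estimator), not the authors' VQ-VAE.

```python
import torch

num_codes, code_dim = 512, 64
codebook = torch.randn(num_codes, code_dim)   # learned jointly with the encoder in a real VQ-VAE

# Placeholder encoder outputs for one loop (8 vectors to quantize).
z_continuous = torch.randn(8, code_dim)

# Nearest codebook entry per vector (Euclidean distance).
distances = torch.cdist(z_continuous, codebook)   # (8, num_codes)
indices = distances.argmin(dim=-1)                # discrete symbols
z_quantized = codebook[indices]                   # quantized latents fed to the decoder

print(indices)  # the integer codes an autoregressive model would be trained on
```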
DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation
International Society for Music Information Retrieval Conference | Pub Date: 2022-08-09 | DOI: 10.48550/arXiv.2208.04756
Da-Yi Wu, Wen-Yi Hsiao, Fu-Rong Yang, Oscar D. Friedman, Warren Jackson, Scott Bruzenak, Yi-Wen Liu, Yi-Hsuan Yang
Abstract: A vocoder is a conditional audio generation model that converts acoustic features such as mel-spectrograms into waveforms. Taking inspiration from Differentiable Digital Signal Processing (DDSP), we propose a new vocoder named SawSing for singing voices. SawSing synthesizes the harmonic part of singing voices by filtering a sawtooth source signal with a linear time-variant finite impulse response filter whose coefficients are estimated from the input mel-spectrogram by a neural network. As this approach enforces phase continuity, SawSing can generate singing voices without the phase-discontinuity glitch of many existing vocoders. Moreover, the source-filter assumption provides an inductive bias that allows SawSing to be trained on a small amount of data. Our experiments show that SawSing converges much faster and outperforms state-of-the-art generative adversarial network and diffusion-based vocoders in a resource-limited scenario with only 3 training recordings and a 3-hour training time.
Citations: 14
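To illustrate the subtractive source-filter idea: generate a sawtooth at the target fundamental frequency and shape it with an FIR filter. SawSing's filter is linear time-variant, with coefficients predicted per frame from the mel spectrogram by a neural network; the sketch below uses a single static low-pass FIR and an arbitrary f0 purely for simplicity.

```python
import numpy as np
from scipy import signal

sr = 22050
f0 = 220.0                                   # illustrative fundamental frequency (Hz)
t = np.arange(sr) / sr
saw = signal.sawtooth(2 * np.pi * f0 * t)    # harmonic-rich source signal

# Static low-pass FIR as a stand-in for SawSing's frame-wise predicted,
# time-variant filter coefficients.
fir = signal.firwin(numtaps=101, cutoff=2000, fs=sr)
voiced = signal.lfilter(fir, 1.0, saw)

print(voiced.shape)
```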