Mel Spectrogram Inversion with Stable Pitch

Bruno Di Giorgi, M. Levy, Richard Sharp
{"title":"Mel Spectrogram Inversion with Stable Pitch","authors":"Bruno Di Giorgi, M. Levy, Richard Sharp","doi":"10.48550/arXiv.2208.12782","DOIUrl":null,"url":null,"abstract":"Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform. Modern speech generation pipelines use a vocoder as their final component. Recent vocoder models developed for speech achieve a high degree of realism, such that it is natural to wonder how they would perform on music signals. Compared to speech, the heterogeneity and structure of the musical sound texture offers new challenges. In this work we focus on one specific artifact that some vocoder models designed for speech tend to exhibit when applied to music: the perceived instability of pitch when synthesizing sustained notes. We argue that the characteristic sound of this artifact is due to the lack of horizontal phase coherence, which is often the result of using a time-domain target space with a model that is invariant to time-shifts, such as a convolutional neural network. We propose a new vocoder model that is specifically designed for music. Key to improving the pitch stability is the choice of a shift-invariant target space that consists of the magnitude spectrum and the phase gradient. We discuss the reasons that inspired us to re-formulate the vocoder task, outline a working example, and evaluate it on musical signals. Our method results in 60% and 10% improved reconstruction of sustained notes and chords with respect to existing models, using a novel harmonic error metric.","PeriodicalId":309903,"journal":{"name":"International Society for Music Information Retrieval Conference","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Society for Music Information Retrieval Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2208.12782","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform. Modern speech generation pipelines use a vocoder as their final component. Recent vocoder models developed for speech achieve a high degree of realism, such that it is natural to wonder how they would perform on music signals. Compared to speech, the heterogeneity and structure of the musical sound texture offer new challenges. In this work we focus on one specific artifact that some vocoder models designed for speech tend to exhibit when applied to music: the perceived instability of pitch when synthesizing sustained notes. We argue that the characteristic sound of this artifact is due to the lack of horizontal phase coherence, which is often the result of using a time-domain target space with a model that is invariant to time-shifts, such as a convolutional neural network. We propose a new vocoder model that is specifically designed for music. Key to improving the pitch stability is the choice of a shift-invariant target space that consists of the magnitude spectrum and the phase gradient. We discuss the reasons that inspired us to re-formulate the vocoder task, outline a working example, and evaluate it on musical signals. Our method results in 60% and 10% improved reconstruction of sustained notes and chords with respect to existing models, using a novel harmonic error metric.
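
The central idea of the abstract, a shift-invariant target space consisting of the magnitude spectrum and the phase gradient, can be illustrated with a short sketch. The snippet below is not the authors' implementation; it only shows, under assumed STFT parameters (n_fft, hop) and illustrative function names, how such a target could be computed from a mono signal with NumPy and SciPy. The time derivative of the phase relates to instantaneous frequency (whose stability over frames is what matters for sustained notes), while the frequency derivative relates to local group delay.

```python
# A minimal sketch (not the paper's model) of a magnitude + phase-gradient
# target computed from the STFT. Parameter values and names are illustrative.
import numpy as np
from scipy.signal import stft


def magnitude_and_phase_gradient(audio, sr=22050, n_fft=1024, hop=256):
    """Return (|S|, dphi/dt, dphi/df) for a mono signal `audio`."""
    _, _, S = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(S)
    phase = np.angle(S)

    # Wrap phase differences back into (-pi, pi] so the finite-difference
    # gradient is insensitive to 2*pi jumps.
    def princarg(x):
        return np.angle(np.exp(1j * x))

    # Gradient along time (frame axis) and frequency (bin axis).
    dphi_dt = princarg(np.diff(phase, axis=1, prepend=phase[:, :1]))
    dphi_df = princarg(np.diff(phase, axis=0, prepend=phase[:1, :]))
    return mag, dphi_dt, dphi_df
```

Unlike the raw waveform or the raw phase, all three of these quantities are (approximately) unchanged when the input signal is shifted in time by a whole number of hops, which is why such a representation is a natural target for a time-shift-invariant network.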