Joint Chord and Key Estimation Based on a Hierarchical Variational Autoencoder with Multi-task Learning

Impact factor: 3.2 (JCR Q1, Computer Science)

APSIPA Transactions on Signal and Information Processing. Pub Date: 2022-01-01. DOI: 10.1561/116.00000052

Yiming Wu, Kazuyoshi Yoshii


Citation count: 0

Abstract

This paper describes a deep generative approach to joint chord and key estimation for music signals. The limited amount of music signals with complete annotations has been the major bottleneck in supervised multi-task learning of a classification model. To overcome this limitation, we integrate the supervised multi-task learning approach with the unsupervised autoencoding approach in a mutually complementary manner. Considering the typical process of music composition, we formulate a hierarchical latent variable model that sequentially generates keys, chords, and chroma vectors. The keys and chords are assumed to follow a language model that represents their relationships and dynamics. In the framework of amortized variational inference (AVI), we introduce a classification model that jointly infers discrete chord and key labels and a recognition model that infers continuous latent features. These models are combined to form a variational autoencoder (VAE) and are trained jointly in a (semi-)supervised manner, where the generative and language models act as regularizers for the classification model. We comprehensively investigate three different architectures for the chord and key classification model, and three different architectures for the language model. Experimental results demonstrate that the VAE-based multi-task learning improves chord estimation as well as key estimation.
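As a rough illustration of the structure described in the abstract, the hierarchical model and its training objective might be written as follows. This is only a sketch under stated assumptions: the symbols $X$ (chroma vectors), $S$ (chord labels), $K$ (key labels), and $Z$ (continuous latent features), the specific factorizations, and the weight $\lambda$ are not given in the abstract and are introduced here for illustration; the paper's actual formulation may differ.

```latex
% Assumed notation (not from the abstract):
%   X: chroma vectors, S: chord labels, K: key labels, Z: continuous latent features
% Assumed generative factorization: keys -> chords -> chroma
\begin{align}
p(X, S, K, Z) &= p(X \mid S, Z)\, p(S \mid K)\, p(K)\, p(Z), \\
% Assumed variational posterior: classification model times recognition model
q(S, K, Z \mid X) &= q(S, K \mid X)\, q(Z \mid X), \\
% Evidence lower bound implied by the two assumptions above
\log p(X) &\ge \mathbb{E}_{q(S, K, Z \mid X)}\!\left[\log p(X \mid S, Z)\right]
  - \mathrm{KL}\!\left(q(S, K \mid X) \,\|\, p(S \mid K)\, p(K)\right)
  - \mathrm{KL}\!\left(q(Z \mid X) \,\|\, p(Z)\right).
\end{align}
% For annotated data, a supervised term such as
%   \lambda \log q(S = S^{*}, K = K^{*} \mid X)
% could be added, where S^{*}, K^{*} are ground-truth labels and \lambda is a
% hypothetical weighting hyperparameter.
```

Under these assumptions, the first KL term is where the language model $p(S \mid K)\, p(K)$ would act as a regularizer on the classification model $q(S, K \mid X)$, and the reconstruction term is where the generative model would do so, which matches the role the abstract assigns to them.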


Source journal