Learning Source Disentanglement in Neural Audio Codec
Xiaoyu Bie, Xubo Liu, Gaël Richard
arXiv - EE - Audio and Speech Processing, 2024-09-17
https://doi.org/arxiv-2409.11228
Citation count: 0
Abstract
Neural audio codecs have significantly advanced audio compression by
efficiently converting continuous audio signals into discrete tokens. These
codecs preserve high-quality sound and enable sophisticated sound generation
through generative models trained on these tokens. However, existing neural
codec models are typically trained on large, undifferentiated audio datasets,
neglecting the essential discrepancies between sound domains like speech,
music, and environmental sound effects. This oversight complicates data
modeling and poses additional challenges to the controllability of sound
generation. To tackle these issues, we introduce the Source-Disentangled Neural
Audio Codec (SD-Codec), a novel approach that combines audio coding and source
separation. By jointly learning audio resynthesis and separation, SD-Codec
explicitly assigns audio signals from different domains to distinct codebooks
(sets of discrete representations). Experimental results indicate that SD-Codec
not only maintains competitive resynthesis quality but also, supported by the
separation results, demonstrates successful disentanglement of different
sources in the latent space, thereby enhancing the interpretability of audio codecs
and potentially providing finer control over the audio generation process.
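
The idea of assigning each sound domain to its own codebook can be illustrated with a toy example. The sketch below is not the SD-Codec implementation: the 2-D codebook values, the helper names (`nearest_code`, `quantize_sources`, `mix`), and the choice to form a mixture latent by summing the quantized source latents are all assumptions made for illustration, and a plain nearest-neighbour lookup stands in for whatever quantizer the codec actually uses.

```python
# Illustrative sketch only: toy per-domain vector quantization.
# All names and numbers here are hypothetical, not taken from SD-Codec.
from typing import Dict, List, Tuple

Vector = List[float]


def nearest_code(vec: Vector, codebook: List[Vector]) -> Tuple[int, Vector]:
    """Return (index, codeword) of the entry closest to vec in squared L2 distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))
    return idx, codebook[idx]


# One hypothetical codebook per sound domain (values are made up).
CODEBOOKS: Dict[str, List[Vector]] = {
    "speech": [[0.0, 0.0], [1.0, 0.0]],
    "music": [[0.0, 1.0], [0.0, 2.0]],
    "sfx": [[-1.0, 0.0], [-2.0, 0.0]],
}


def quantize_sources(latents: Dict[str, Vector]) -> Tuple[Dict[str, int], Dict[str, Vector]]:
    """Quantize each source latent with the codebook of its own domain."""
    tokens: Dict[str, int] = {}
    quantized: Dict[str, Vector] = {}
    for domain, vec in latents.items():
        idx, code = nearest_code(vec, CODEBOOKS[domain])
        tokens[domain] = idx
        quantized[domain] = code
    return tokens, quantized


def mix(quantized: Dict[str, Vector]) -> Vector:
    """Assumption: the mixture latent is the elementwise sum of the source latents."""
    dims = len(next(iter(quantized.values())))
    return [sum(code[d] for code in quantized.values()) for d in range(dims)]


# Hypothetical encoder outputs for one frame, one latent per source.
latents = {"speech": [0.9, 0.1], "music": [0.1, 1.8], "sfx": [-1.2, 0.1]}
tokens, quantized = quantize_sources(latents)
mixture = mix(quantized)
```

Because each domain has its own token stream, a single source can in principle be dropped or edited (for example, by omitting the `"music"` entry before calling `mix`) without touching the others, which is the kind of control the abstract alludes to.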