{"title":"PromptCodec:使用基于分离表征学习的自适应特征感知提示编码器的高保真神经语音编解码器","authors":"Yu Pan, Lei Ma, Jianjun Zhao","doi":"arxiv-2404.02702","DOIUrl":null,"url":null,"abstract":"Neural speech codec has recently gained widespread attention in generative\nspeech modeling domains, like voice conversion, text-to-speech synthesis, etc.\nHowever, ensuring high-fidelity audio reconstruction of speech codecs under\nhigh compression rates remains an open and challenging issue. In this paper, we\npropose PromptCodec, a novel end-to-end neural speech codec model using\ndisentangled representation learning based feature-aware prompt encoders. By\nincorporating additional feature representations from prompt encoders,\nPromptCodec can distribute the speech information requiring processing and\nenhance its capabilities. Moreover, a simple yet effective adaptive feature\nweighted fusion approach is introduced to integrate features of different\nencoders. Meanwhile, we propose a novel disentangled representation learning\nstrategy based on cosine distance to optimize PromptCodec's encoders to ensure\ntheir efficiency, thereby further improving the performance of PromptCodec.\nExperiments on LibriTTS demonstrate that our proposed PromptCodec consistently\noutperforms state-of-the-art neural speech codec models under all different\nbitrate conditions while achieving impressive performance with low bitrates.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders\",\"authors\":\"Yu Pan, Lei Ma, Jianjun Zhao\",\"doi\":\"arxiv-2404.02702\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Neural speech codec has recently gained widespread attention in generative\\nspeech modeling domains, like voice conversion, text-to-speech synthesis, etc.\\nHowever, ensuring high-fidelity audio reconstruction of speech codecs under\\nhigh compression rates remains an open and challenging issue. In this paper, we\\npropose PromptCodec, a novel end-to-end neural speech codec model using\\ndisentangled representation learning based feature-aware prompt encoders. By\\nincorporating additional feature representations from prompt encoders,\\nPromptCodec can distribute the speech information requiring processing and\\nenhance its capabilities. Moreover, a simple yet effective adaptive feature\\nweighted fusion approach is introduced to integrate features of different\\nencoders. 
Meanwhile, we propose a novel disentangled representation learning\\nstrategy based on cosine distance to optimize PromptCodec's encoders to ensure\\ntheir efficiency, thereby further improving the performance of PromptCodec.\\nExperiments on LibriTTS demonstrate that our proposed PromptCodec consistently\\noutperforms state-of-the-art neural speech codec models under all different\\nbitrate conditions while achieving impressive performance with low bitrates.\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2404.02702\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2404.02702","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders
Neural speech codecs have recently gained widespread attention in generative speech modeling domains such as voice conversion and text-to-speech synthesis. However, ensuring high-fidelity audio reconstruction by speech codecs at high compression rates remains an open and challenging problem. In this paper, we propose PromptCodec, a novel end-to-end neural speech codec model using disentangled representation learning based feature-aware prompt encoders. By incorporating the additional feature representations produced by the prompt encoders, PromptCodec distributes the speech information that must be processed across its encoders and enhances its modeling capability.
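To make the multi-encoder idea concrete, the following PyTorch sketch shows a main encoder alongside additional prompt encoders that each produce a feature stream. The convolutional architecture, the dimensions, and the choice to feed the same signal to every encoder are illustrative assumptions; the abstract does not specify them.

```python
# Minimal sketch of the multi-encoder arrangement described above: a main
# encoder processes the input speech while extra "prompt" encoders contribute
# additional feature representations. All architectural details here are
# assumptions made for illustration, not PromptCodec's actual design.
import torch
import torch.nn as nn


class SimpleEncoder(nn.Module):
    """A toy 1-D convolutional encoder mapping waveforms to feature frames."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=7, stride=2, padding=3),
            nn.ELU(),
            nn.Conv1d(dim, dim, kernel_size=7, stride=2, padding=3),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) -> (batch, dim, frames)
        return self.net(wav)


main_encoder = SimpleEncoder()
prompt_encoders = nn.ModuleList([SimpleEncoder() for _ in range(2)])

wav = torch.randn(4, 1, 16000)  # a batch of 1-second, 16 kHz clips
# One feature stream per encoder; these are later fused (see the next sketch).
features = [main_encoder(wav)] + [enc(wav) for enc in prompt_encoders]
```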
Moreover, a simple yet effective adaptive feature-weighted fusion approach is introduced to integrate the features of the different encoders.
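Below is a minimal sketch of what such an adaptive feature-weighted fusion could look like: learnable per-encoder logits are normalized with a softmax and used to scale and sum the encoder features. The abstract does not describe the actual mechanism, so this formulation is an assumption for illustration only.

```python
# Sketch of one plausible adaptive feature-weighted fusion: learnable scalar
# weights, normalized with a softmax, scale each encoder's features before
# they are summed into a single representation.
import torch
import torch.nn as nn


class AdaptiveWeightedFusion(nn.Module):
    def __init__(self, num_encoders: int):
        super().__init__()
        # One learnable logit per encoder; the softmax keeps the fused scale stable.
        self.logits = nn.Parameter(torch.zeros(num_encoders))

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # features: list of (batch, dim, frames) tensors, one per encoder.
        weights = torch.softmax(self.logits, dim=0)
        return sum(w * f for w, f in zip(weights, features))


fusion = AdaptiveWeightedFusion(num_encoders=3)
feats = [torch.randn(4, 256, 100) for _ in range(3)]
fused = fusion(feats)  # (4, 256, 100), e.g. passed on to the quantizer/decoder
```

In such a scheme the fusion weights are learned jointly with the rest of the codec, letting the model decide how much each encoder's features contribute to the fused representation.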
Meanwhile, we propose a novel disentangled representation learning strategy based on cosine distance to optimize PromptCodec's encoders and ensure their efficiency, thereby further improving the model's performance.
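The sketch below illustrates one way a cosine-distance-based disentanglement objective could be formulated: penalizing the pairwise cosine similarity of time-averaged encoder representations so that different encoders capture complementary information. The exact objective used by PromptCodec is not given in the abstract, so this formulation is an assumption.

```python
# Sketch of a cosine-distance-based disentanglement loss over encoder outputs.
# Whether PromptCodec shapes this quantity in exactly this way is an assumption;
# the abstract only states that the strategy is based on cosine distance.
import torch
import torch.nn.functional as F


def disentanglement_loss(features: list[torch.Tensor]) -> torch.Tensor:
    """features: list of (batch, dim, frames) tensors, one per encoder."""
    pooled = [f.mean(dim=-1) for f in features]  # (batch, dim) per encoder
    loss = features[0].new_zeros(())
    count = 0
    for i in range(len(pooled)):
        for j in range(i + 1, len(pooled)):
            # Cosine similarity lies in [-1, 1]; its absolute value is penalized
            # so that different encoders are pushed toward orthogonal, i.e.
            # complementary, representations.
            loss = loss + F.cosine_similarity(pooled[i], pooled[j], dim=-1).abs().mean()
            count += 1
    return loss / max(count, 1)


feats = [torch.randn(4, 256, 100) for _ in range(3)]
aux_loss = disentanglement_loss(feats)  # added to the codec's training loss
```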
Experiments on LibriTTS demonstrate that the proposed PromptCodec consistently outperforms state-of-the-art neural speech codec models across all tested bitrate conditions, and that it achieves impressive performance at low bitrates in particular.