Audio Decoding by Inverse Problem Solving
Pedro J. Villasana T., Lars Villemoes, Janusz Klejsa, Per Hedelin
arXiv - EE - Audio and Speech Processing, 2024-09-12, arXiv:2409.07858
We consider audio decoding as an inverse problem and solve it through
diffusion posterior sampling. Explicit conditioning functions are developed for
input signal measurements provided by an example of a transform-domain
perceptual audio codec. Viability is demonstrated by evaluating arbitrary
pairings of a set of bitrates and task-agnostic prior models. For instance, we
observe significant improvements on piano while maintaining speech performance
when a speech model is replaced by a joint model trained on both speech and
piano. With a more general music model, improved decoding compared to legacy
methods is obtained for a broad range of content types and bitrates. The noisy
mean model, underlying the proposed derivation of conditioning, enables a
significant reduction of gradient evaluations for diffusion posterior sampling,
compared to methods based on Tweedie's mean. Combining Tweedie's mean with our
conditioning functions improves the objective performance. An audio demo is
available at https://dpscodec-demo.github.io/.
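
For readers unfamiliar with the Tweedie's mean mentioned above, the following is a minimal sketch of the general formula, not the authors' implementation. It assumes an additive Gaussian noise model x_t = x_0 + sigma * eps, which the abstract does not specify, and a one-dimensional Gaussian prior so that the score is available in closed form. All names and numeric values are hypothetical and chosen only for illustration. The abstract does not say how the noisy mean model differs, so that is not shown here.

```python
import numpy as np

def gaussian_score(x_t, mu, s2, sigma2):
    """Score (gradient of the log density) of the noisy marginal N(mu, s2 + sigma2)."""
    return -(x_t - mu) / (s2 + sigma2)

def tweedie_mean(x_t, score, sigma2):
    """Tweedie's formula: E[x_0 | x_t] = x_t + sigma^2 * score(x_t)."""
    return x_t + sigma2 * score

# Toy values (assumptions for illustration, not taken from the paper).
mu, s2, sigma2 = 0.5, 2.0, 0.5      # prior mean, prior variance, noise variance
x_t = np.array([-1.0, 0.0, 3.0])    # noisy samples

denoised = tweedie_mean(x_t, gaussian_score(x_t, mu, s2, sigma2), sigma2)

# Closed-form posterior mean for a Gaussian prior, used as a cross-check.
closed_form = (s2 * x_t + sigma2 * mu) / (s2 + sigma2)
```

In this toy case `denoised` and `closed_form` agree, because the score is exact. In a diffusion model the score would instead come from a trained network, which is why conditioning through Tweedie's mean typically requires gradient evaluations through that network.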