TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

arXiv - EE - Audio and Speech Processing. Pub Date: 2024-09-12. DOI: arxiv-2409.07841

Beilong Tang, Bang Zeng, Ming Li

Abstract

We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
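
The classification framing in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes continuous frames are turned into tokens by nearest-centroid lookup in a codebook (the abstract does not say how the WavLM layers are discretized), and the names `quantize`, `cross_entropy`, `codebook`, `clean_frames`, and `logits` are all hypothetical toy stand-ins. The point is only that once targets are token indices, training reduces to a cross-entropy loss over a finite vocabulary rather than a regression on waveform or spectrogram values.

```python
import math

def quantize(frame, codebook):
    """Return the index of the codebook vector nearest to `frame`
    (squared Euclidean distance). Assumed discretization scheme."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits),
    computed with the log-sum-exp trick for numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical toy codebook: 3 entries of dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Hypothetical clean-speech feature frames; their token indices are the targets.
clean_frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]
targets = [quantize(f, codebook) for f in clean_frames]

# Hypothetical model outputs: one logit vector over the vocabulary per frame.
logits = [[2.0, 0.1, 0.1], [0.0, 3.0, 0.5], [0.2, 0.1, 1.5]]

# Mean per-frame cross-entropy: the classification loss.
loss = sum(cross_entropy(l, t) for l, t in zip(logits, targets)) / len(targets)
print(targets, loss)
```

In a full system the logits would come from a language model conditioned on the mixture tokens and the target speaker reference, and the predicted tokens would be passed to a vocoder such as HiFi-GAN; both are outside the scope of this sketch.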
