LORT: Locally refined convolution and Taylor transformer for monaural speech enhancement

IF 3.0, Zone 3 (Computer Science), JCR Q2 (Acoustics)

Speech Communication Pub Date : 2025-09-30 DOI:10.1016/j.specom.2025.103314

Junyu Wang , Zizhen Lin , Tianrui Wang , Meng Ge , Longbiao Wang , Jianwu Dang

Speech Communication, Volume 175, Article 103314

https://www.sciencedirect.com/science/article/pii/S0167639325001293

Citations: 0

Abstract

Achieving superior enhancement performance while maintaining a low parameter count and computational complexity remains a challenge in the field of speech enhancement. In this paper, we introduce LORT, a novel architecture that integrates spatial-channel enhanced Taylor Transformer and locally refined convolution for efficient and robust speech enhancement. We propose a Taylor multi-head self-attention (T-MSA) module enhanced with spatial-channel enhancement attention (SCEA), designed to facilitate inter-channel information exchange and alleviate the spatial attention limitations inherent in Taylor-based Transformers. To complement global modeling, we further present a locally refined convolution (LRC) block that integrates convolutional feed-forward layers, time–frequency dense local convolutions, and gated units to capture fine-grained local details. Built upon a U-Net-like encoder–decoder structure with only 16 output channels in the encoder, LORT processes noisy inputs through multi-resolution T-MSA modules using alternating downsampling and upsampling operations. The enhanced magnitude and phase spectra are decoded independently and optimized through a composite loss function that jointly considers magnitude, complex, phase, discriminator, and consistency objectives. Experimental results on the VCTK+DEMAND and DNS Challenge datasets demonstrate that LORT achieves competitive or superior performance to state-of-the-art (SOTA) models with only 0.96M parameters, highlighting its effectiveness for real-world speech enhancement applications with limited computational resources.
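For illustration only: the abstract does not specify how T-MSA is computed, so the following is a minimal sketch of the general idea behind Taylor-based attention, assuming the common first-order approximation exp(q·k) ≈ 1 + q·k with L2-normalized queries and keys. The function names, the normalization step, and the `EPS` constant are assumptions of this sketch, not details taken from the paper; SCEA and the LRC block are not modeled here.

```python
import math

EPS = 1e-8  # hypothetical guard against a zero normalizer


def l2_normalize(vec):
    """Scale a vector to unit length so that q.k lies in [-1, 1]."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def taylor_attention_quadratic(queries, keys, values):
    """Reference form: builds all N x N weights 1 + q_i.k_j explicitly."""
    q = [l2_normalize(row) for row in queries]
    k = [l2_normalize(row) for row in keys]
    dv = len(values[0])
    out = []
    for qi in q:
        weights = [1.0 + dot(qi, kj) for kj in k]  # non-negative by construction
        total = sum(weights) + EPS
        out.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(dv)
        ])
    return out


def taylor_attention_linear(queries, keys, values):
    """Same result, but with cost linear in the sequence length N.

    Because the weight 1 + q.k is linear in q, the sums over keys and
    values can be computed once and reused for every query.
    """
    q = [l2_normalize(row) for row in queries]
    k = [l2_normalize(row) for row in keys]
    n, d, dv = len(k), len(k[0]), len(values[0])

    sum_v = [sum(v[c] for v in values) for c in range(dv)]        # sum_j v_j
    sum_k = [sum(kj[a] for kj in k) for a in range(d)]            # sum_j k_j
    kv = [[sum(kj[a] * v[c] for kj, v in zip(k, values))          # sum_j k_j v_j^T
           for c in range(dv)] for a in range(d)]

    out = []
    for qi in q:
        total = n + dot(qi, sum_k) + EPS
        out.append([
            (sum_v[c] + sum(qi[a] * kv[a][c] for a in range(d))) / total
            for c in range(dv)
        ])
    return out
```

The linear form never materializes the N × N weight matrix, which is the usual motivation for Taylor-style attention in low-complexity models. Both functions return the same values up to floating-point error, so the quadratic version can serve as a check on the linear one.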


Source journal: Speech Communication (Engineering and Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 6.80

Self-citation rate: 6.20%

Review time: 19.2 weeks

Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.