Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement

Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li
arXiv:2409.11725 — arXiv - EE - Audio and Speech Processing, published 2024-09-18
Citations: 0

Abstract

Speech enhancement aims to improve speech quality and intelligibility in noisy environments. Recent advancements have concentrated on deep neural networks, particularly employing the Two-Stage (TS) architecture to enhance feature extraction. However, the complexity and size of these models remain significant, which limits their applicability in resource-constrained scenarios. Designing models suitable for edge devices presents its own set of challenges. Narrow lightweight models often encounter performance bottlenecks due to uneven loss landscapes. Additionally, advanced operators such as Transformers or Mamba may lack the practical adaptability and efficiency that convolutional neural networks (CNNs) offer in real-world deployments. To address these challenges, we propose Dense-TSNet, an innovative ultra-lightweight speech enhancement network. Our approach employs a novel Dense Two-Stage (Dense-TS) architecture, which, compared to the classic Two-Stage architecture, ensures more robust refinement of the objective function in the later training stages. This leads to improved final performance, addressing the early convergence limitations of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which enhances feature extraction by incorporating global, channel, and local perspectives through convolutional neural networks (CNNs). Furthermore, we discuss how the choice of loss function impacts perceptual quality. Dense-TSNet demonstrates promising performance with a compact model size of around 14K parameters, making it particularly well-suited for deployment in resource-constrained environments.
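The paper does not publish code here, but the Multi-View Gaze Block idea — fusing global, channel, and local views of a spectrogram feature map with plain CNN operators — can be sketched. The layer choices below (depthwise convolution for the local view, squeeze-and-excitation-style gating for the channel view, a pooled-and-broadcast summary for the global view) are assumptions for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn


class MultiViewBlock(nn.Module):
    """Hypothetical sketch of a multi-view CNN block in the spirit of the
    MVGB described in the abstract. Input/output: (batch, channels, T, F).
    All concrete layer choices are assumptions, not the published design."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Local view: depthwise 3x3 conv captures time-frequency neighborhoods.
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Channel view: squeeze-and-excitation-style gating over channels.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Global view: a pooled summary, projected and broadcast back.
        self.global_proj = nn.Conv2d(channels, channels, 1)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.local(x)
        chan = x * self.channel(x)           # gated by per-channel weights
        glob = self.global_proj(x.mean(dim=(2, 3), keepdim=True)).expand_as(x)
        return self.fuse(local + chan + glob) + x  # residual fusion


x = torch.randn(2, 16, 10, 32)               # (batch, channels, time, freq)
y = MultiViewBlock(16)(x)
print(tuple(y.shape))                        # (2, 16, 10, 32)
```

Because every sub-view is built from small convolutions and 1x1 projections, a block like this stays within a very small parameter budget, which is consistent with the abstract's emphasis on CNN operators over Transformers or Mamba for edge deployment.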