Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement

Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li
arXiv:2409.11725 — arXiv - EE - Audio and Speech Processing, published 2024-09-18
Citations: 0

Abstract

Speech enhancement aims to improve speech quality and intelligibility in noisy environments. Recent advancements have concentrated on deep neural networks, particularly employing the Two-Stage (TS) architecture to enhance feature extraction. However, the complexity and size of these models remain significant, which limits their applicability in resource-constrained scenarios. Designing models suitable for edge devices presents its own set of challenges. Narrow lightweight models often encounter performance bottlenecks due to uneven loss landscapes. Additionally, advanced operators such as Transformers or Mamba may lack the practical adaptability and efficiency that convolutional neural networks (CNNs) offer in real-world deployments. To address these challenges, we propose Dense-TSNet, an innovative ultra-lightweight speech enhancement network. Our approach employs a novel Dense Two-Stage (Dense-TS) architecture, which, compared to the classic Two-Stage architecture, ensures more robust refinement of the objective function in the later training stages. This leads to improved final performance, addressing the early convergence limitations of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which enhances feature extraction by incorporating global, channel, and local perspectives through convolutional neural networks (CNNs). Furthermore, we discuss how the choice of loss function impacts perceptual quality. Dense-TSNet demonstrates promising performance with a compact model size of around 14K parameters, making it particularly well-suited for deployment in resource-constrained environments.
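The paper does not publish code here, but the Multi-View Gaze Block idea — fusing global, channel, and local views of a spectrogram feature map with plain CNN operators — can be sketched. The layer choices below (depthwise convolution for the local view, squeeze-and-excitation-style gating for the channel view, a pooled-and-broadcast summary for the global view) are assumptions for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn


class MultiViewBlock(nn.Module):
    """Hypothetical sketch of a multi-view CNN block in the spirit of the
    MVGB described in the abstract. Input/output: (batch, channels, T, F).
    All concrete layer choices are assumptions, not the published design."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Local view: depthwise 3x3 conv captures time-frequency neighborhoods.
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Channel view: squeeze-and-excitation-style gating over channels.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Global view: a pooled summary, projected and broadcast back.
        self.global_proj = nn.Conv2d(channels, channels, 1)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.local(x)
        chan = x * self.channel(x)           # gated by per-channel weights
        glob = self.global_proj(x.mean(dim=(2, 3), keepdim=True)).expand_as(x)
        return self.fuse(local + chan + glob) + x  # residual fusion


x = torch.randn(2, 16, 10, 32)               # (batch, channels, time, freq)
y = MultiViewBlock(16)(x)
print(tuple(y.shape))                        # (2, 16, 10, 32)
```

Because every sub-view is built from small convolutions and 1x1 projections, a block like this stays within a very small parameter budget, which is consistent with the abstract's emphasis on CNN operators over Transformers or Mamba for edge deployment.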