Hardware Efficient Speech Enhancement With Noise Aware Multi-Target Deep Learning
Authors: Salinna Abdullah; Majid Zamani; Andreas Demosthenous
Journal: IEEE Open Journal of Circuits and Systems (impact factor 2.4; JCR Q2, Engineering, Electrical & Electronic)
Published: 2024-04-16
DOI: 10.1109/OJCAS.2024.3389100
URL: https://ieeexplore.ieee.org/document/10500889/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10500889
Abstract
This paper describes a supervised speech enhancement (SE) method utilising a noise-aware four-layer deep neural network and training target switching. For optimal speech denoising, the SE system, trained with multiple-target joint learning, switches between mapping-based, masking-based, or complementary processing, depending on the level of noise contamination detected. Optimisation techniques, including ternary quantisation, structural pruning, efficient sparse matrix representation and cost-effective approximations for complex computations, were implemented to reduce area, memory, and power requirements. Up to 19.1x compression was obtained, and all weights could be stored on the on-chip memory. When processing NOISEX-92 noises, the system achieved average short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) scores of 0.81 and 1.62, respectively, outperforming SE algorithms trained with only a single learning target. The proposed SE processor was implemented on a field programmable gate array (FPGA) for proof of concept. Mapping the design on a 65-nm CMOS process led to a chip core area of $3.88~\mathrm{mm}^{2}$ and a power consumption of 1.91 mW when operating at a 10 MHz clock frequency.
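As a rough illustration of the ternary quantisation mentioned in the abstract, the sketch below maps a float weight matrix to codes in {-1, 0, +1} plus a single scale factor. The abstract does not say how the thresholds or scales were chosen, so the threshold rule used here (0.7 times the mean absolute weight, a common heuristic for ternary weight networks) is an assumption, and the function names `ternarize`, `dequantize` and `dense_ternary_bits` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def ternarize(weights: np.ndarray, delta_factor: float = 0.7):
    """Map float weights to codes in {-1, 0, +1} plus one float scale.

    Assumed threshold heuristic (not taken from the paper):
    delta = delta_factor * mean(|w|); weights with |w| <= delta become 0.
    """
    abs_w = np.abs(weights)
    delta = delta_factor * abs_w.mean()
    codes = np.zeros(weights.shape, dtype=np.int8)
    codes[weights > delta] = 1
    codes[weights < -delta] = -1
    mask = codes != 0
    # Scale = mean magnitude of the surviving weights (0.0 if none survive).
    scale = float(abs_w[mask].mean()) if mask.any() else 0.0
    return codes, scale


def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    """Rebuild an approximate float matrix from ternary codes."""
    return codes.astype(np.float32) * scale


def dense_ternary_bits(codes: np.ndarray) -> int:
    """Storage cost in bits if every ternary code is stored in 2 bits."""
    return 2 * codes.size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.1, size=(64, 32)).astype(np.float32)
    codes, scale = ternarize(w)
    ratio = (32 * w.size) / dense_ternary_bits(codes)
    print("nonzero fraction:", float((codes != 0).mean()))
    print("compression vs float32:", ratio)  # 16.0 by construction (32 bits / 2 bits)
```

Two-bit dense storage alone gives a fixed 16x reduction against 32-bit floats. That number should not be read as an explanation of the 19.1x reported above, which also reflects structural pruning and sparse matrix storage, measured against a baseline the abstract does not state.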