Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time

arXiv - EE - Audio and Speech Processing. Pub Date: 2024-09-10. DOI: arxiv-2409.06274

Yue Li, Koen V. Hindriks, Florian A. Kunneman



Abstract


Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time

Spectral subtraction, widely used for its simplicity, has been employed to address the Robot Ego Speech Filtering (RESF) problem: detecting the speech content of human interruptions in a robot's single-channel microphone recordings while the robot is speaking. However, this approach suffers from oversubtraction in the fundamental frequency range (FFR), which degrades speech content recognition. To address this, we propose a Two-Mask Conformer-based Metric Generative Adversarial Network (CMGAN) to enhance the detected speech and improve recognition results. Our model compensates for oversubtracted FFR values using high-frequency information and long-term features, and then denoises the resulting spectrogram. In addition, we introduce an incremental processing method that allows semi-real-time audio processing of streaming input with a network trained on long fixed-length input. Evaluations on two datasets, including one with unseen noise, demonstrate significant improvements in recognition accuracy and confirm the effectiveness of the proposed two-mask approach and incremental processing, enhancing the robustness of the proposed RESF pipeline in real-world HRI scenarios.
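The oversubtraction effect mentioned in the abstract can be illustrated with a toy example. The sketch below is a generic textbook magnitude spectral subtraction, not the authors' RESF pipeline; the function name `spectral_subtract`, its parameters `alpha` and `floor`, and the four-bin magnitudes are all assumptions chosen for illustration.

```python
import numpy as np


def spectral_subtract(mixture_mag, ego_mag, alpha=1.0, floor=0.0):
    """Magnitude spectral subtraction with a spectral floor.

    mixture_mag: magnitudes of the microphone signal (human + robot speech)
    ego_mag:     estimated magnitudes of the robot's own speech
    alpha:       subtraction factor (hypothetical knob; > 1 subtracts more)
    floor:       fraction of the mixture magnitude kept as a lower bound
    """
    mixture_mag = np.asarray(mixture_mag, dtype=float)
    ego_mag = np.asarray(ego_mag, dtype=float)
    residual = mixture_mag - alpha * ego_mag
    # Clip so that no bin drops below the floor (and never below zero).
    return np.maximum(residual, floor * mixture_mag)


# Toy single frame with four frequency bins, ordered low to high (made-up values).
human = np.array([0.8, 0.6, 0.3, 0.2])  # human speech, strongest in the low bins
robot = np.array([1.0, 0.9, 0.1, 0.0])  # robot ego speech, in the same low bins

# Magnitudes do not add linearly when phases differ. As a simplification,
# the mixture is modeled here as the square root of the summed powers.
mixture = np.sqrt(human**2 + robot**2)

estimate = spectral_subtract(mixture, robot)
loss = human - estimate  # positive values mean human speech energy was removed

print("estimate:", np.round(estimate, 3))
print("loss:    ", np.round(loss, 3))
```

In this toy frame the bins where the robot and the human overlap lose the most human energy, while the bin with no robot energy is left untouched. That is the kind of low-frequency damage a later enhancement stage would need to compensate for; how the paper's two-mask CMGAN does so is not shown here.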
