Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time
Authors: Yue Li, Koen V. Hindriks, Florian A. Kunneman
arXiv - EE - Audio and Speech Processing, 2024-09-10
DOI: arxiv-2409.06274 (https://doi.org/arxiv-2409.06274)
Spectral subtraction, widely used for its simplicity, has been employed to
address the Robot Ego Speech Filtering (RESF) problem: recovering the speech
content of a human interruption from a robot's single-channel microphone
recording while the robot itself is speaking. However, this approach suffers from
oversubtraction in the fundamental frequency range (FFR), leading to degraded
speech content recognition. To address this, we propose a Two-Mask
Conformer-based Metric Generative Adversarial Network (CMGAN) to enhance the
detected speech and improve recognition results. Our model compensates for
oversubtracted FFR values using high-frequency information and long-term
features, and then denoises the resulting spectrogram. In addition, we introduce an
incremental processing method that allows semi-real-time audio processing with
streaming input on a network trained on long fixed-length input. Evaluations on
two datasets, including one with unseen noise, demonstrate significant
improvements in recognition accuracy and confirm the effectiveness of the proposed
two-mask approach and incremental processing, enhancing the robustness of the
proposed RESF pipeline in real-world human-robot interaction (HRI) scenarios.
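
The oversubtraction effect described above can be illustrated with a minimal sketch of magnitude-domain spectral subtraction. This is a generic textbook form, not the authors' pipeline: the function name, the `alpha` and `floor` parameters, and the toy magnitudes are all assumptions chosen for illustration, and the toy case assumes that magnitudes add exactly and that the robot's ego speech is strongest in the lowest bin.

```python
def spectral_subtract(mixture_mag, ego_mag, alpha=1.0, floor=0.0):
    """Subtract an ego-speech magnitude estimate from one mixture frame.

    mixture_mag, ego_mag: per-bin magnitudes for one frame, lowest bin first.
    alpha: subtraction factor (illustrative name); alpha > 1 oversubtracts.
    floor: fraction of the mixture magnitude kept as a lower bound.
    """
    return [max(m - alpha * e, floor * m) for m, e in zip(mixture_mag, ego_mag)]


# Toy frame with four frequency bins (hypothetical values).
human = [2.0, 1.0, 0.5, 0.25]   # speech we want to recover
ego = [3.0, 0.5, 0.1, 0.0]      # robot's own speech, dominant in the lowest bin
mixture = [h + e for h, e in zip(human, ego)]  # idealised: magnitudes add exactly

exact = spectral_subtract(mixture, ego, alpha=1.0)
over = spectral_subtract(mixture, ego, alpha=1.5)

# Fraction of the human magnitude that survives in each bin.
retained = [o / h for o, h in zip(over, human)]
print([round(r, 2) for r in retained])
```

With `alpha=1.0` the toy frame is recovered exactly, which only holds because the ego estimate here is perfect. With `alpha=1.5` the lowest bin, where the ego speech is strongest, loses most of its human-speech magnitude while the highest bin is untouched. This is the kind of loss that an enhancement stage would have to compensate for.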