{"title":"Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction","authors":"Yue Li, Koen V. Hindriks, Florian Kunneman","doi":"10.1145/3648536.3648539","DOIUrl":null,"url":null,"abstract":"In this paper, we study how well human speech can automatically be filtered when this overlaps with the voice and fan noise of a social robot, Pepper. We ultimately aim for an HRI scenario where the microphone can remain open when the robot is speaking, enabling a more natural turn-taking scheme where the human can interrupt the robot. To respond appropriately, the robot would need to understand what the interlocutor said in the overlapping part of the speech, which can be accomplished by target speech extraction (TSE). To investigate how well TSE can be accomplished in the context of the popular social robot Pepper, we set out to manufacture a datase composed of a mixture of recorded speech of Pepper itself, its fan noise (which is close to the microphones), and human speech as recorded by the Pepper microphone, in a room with low reverberation and high reverberation. Comparing a signal processing approach, with and without post-filtering, and a convolutional recurrent neural network (CRNN) approach to a state-of-the-art speaker identification-based TSE model, we found that the signal processing approach without post-filtering yielded the best performance in terms of Word Error Rate on the overlapping speech signals with low reverberation, while the CRNN approach is more robust for reverberation. These results show that estimating the human voice in overlapping speech with a robot is possible in real-life application, provided that the room reverberation is low and the human speech has a high volume or high pitch.","PeriodicalId":513202,"journal":{"name":"ArXiv","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ArXiv","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3648536.3648539","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper, we study how well human speech can automatically be filtered when it overlaps with the voice and fan noise of a social robot, Pepper. We ultimately aim for an HRI scenario where the microphone can remain open while the robot is speaking, enabling a more natural turn-taking scheme in which the human can interrupt the robot. To respond appropriately, the robot would need to understand what the interlocutor said in the overlapping part of the speech, which can be accomplished by target speech extraction (TSE). To investigate how well TSE can be accomplished in the context of the popular social robot Pepper, we constructed a dataset composed of a mixture of recorded speech of Pepper itself, its fan noise (which is close to the microphones), and human speech as recorded by the Pepper microphone, in rooms with low and high reverberation. Comparing a signal processing approach, with and without post-filtering, and a convolutional recurrent neural network (CRNN) approach against a state-of-the-art speaker-identification-based TSE model, we found that the signal processing approach without post-filtering yielded the best performance in terms of Word Error Rate on overlapping speech signals with low reverberation, while the CRNN approach is more robust to reverberation. These results show that estimating the human voice in speech overlapping with a robot is feasible in real-life applications, provided that the room reverberation is low and the human speech has a high volume or high pitch.
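As a rough illustration of the kind of single-channel processing the abstract refers to, the sketch below suppresses the robot's own (known) ego-speech from the microphone mixture using spectral subtraction. This is not necessarily the paper's exact method: the function name, STFT settings, and subtraction parameters are illustrative assumptions, and it presumes the robot's playback signal is available and roughly time-aligned with the recording.

# Minimal sketch of a spectral-subtraction-style ego-speech filter (assumed
# parameters; not the paper's exact signal processing pipeline).
import numpy as np
from scipy.signal import stft, istft

def filter_ego_speech(mixture, ego_signal, fs=16000, over_subtract=1.5, floor=0.05):
    """Suppress the robot's own (known) speech in a single-channel mixture."""
    _, _, M = stft(mixture, fs=fs, nperseg=512)     # mixture spectrogram
    _, _, E = stft(ego_signal, fs=fs, nperseg=512)  # ego-speech spectrogram
    frames = min(M.shape[1], E.shape[1])
    M, E = M[:, :frames], E[:, :frames]
    # Subtract the ego-speech magnitude (with over-subtraction), keep a small
    # spectral floor, and reuse the mixture phase for resynthesis.
    mag = np.maximum(np.abs(M) - over_subtract * np.abs(E), floor * np.abs(M))
    _, estimate = istft(mag * np.exp(1j * np.angle(M)), fs=fs, nperseg=512)
    return estimate

The estimated human speech produced by such a filter would then be transcribed by an ASR system and scored with Word Error Rate against the reference transcript, which is the evaluation measure used in the abstract.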