Gaze and filled pause detection for smooth human-robot conversations

Miriam Bilac, Marine Chamoux, Angelica Lim
Published in: 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), 2017-11-01
DOI: 10.1109/HUMANOIDS.2017.8246889 (https://doi.org/10.1109/HUMANOIDS.2017.8246889)
Citations: 9

Abstract

Let the human speak! Interactive robots and voice interfaces such as Pepper, Amazon Alexa, and OK Google are becoming increasingly popular, allowing for more natural interaction than screens or keyboards. One issue with voice interfaces is that they tend to require a “robotic” flow of human speech. Humans must be careful not to produce disfluencies, such as hesitations or extended pauses between words. If they do, the agent may assume that the human has finished their speaking turn and interrupt them mid-thought. Interactive robots often rely on the same limited dialogue technology built for speech interfaces. Yet humanoid robots have the potential to also use their vision systems to determine when the human has finished their speaking turn. In this paper, we introduce HOMAGE (Human-rObot Multimodal Audio and Gaze End-of-turn), a multimodal turn-taking system for conversational humanoid robots. We created a dataset of humans spontaneously hesitating when responding to a robot's open-ended questions, such as “What was your favorite moment this year?” Our analyses found that users both produced auditory filled pauses, such as “uhhh”, and gazed away from the robot to keep their speaking turn. We then trained a machine learning system to detect the auditory filled pauses and integrated it, along with gaze, into the Pepper humanoid robot's real-time dialogue system. Experiments with 28 naive users revealed that adding auditory filled pause detection and gaze tracking significantly reduced robot interruptions. Furthermore, user turns were 2.1 times longer (without repetitions), suggesting that this strategy lets humans express themselves more fully, with less time pressure and a better robot listener.
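
The turn-holding idea described in the abstract can be illustrated with a small sketch. This is not the HOMAGE implementation: the abstract does not give the decision rule, so the `Frame` fields, the `end_of_turn_index` function, the frame-based timeout, and the rule that either cue fully resets the silence counter are all assumptions made for illustration. The sketch compares a silence-only baseline with a variant that treats a detected filled pause or gaze away from the robot as a sign that the user is keeping the turn.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """One time step of hypothetical perception output (names are assumptions)."""
    speech: bool         # voice activity detected in this frame
    filled_pause: bool   # a classifier flagged an "uhhh"-like sound
    gaze_on_robot: bool  # the user is looking at the robot


def end_of_turn_index(frames: List[Frame],
                      timeout_frames: int = 10,
                      use_cues: bool = True) -> Optional[int]:
    """Return the index of the frame where the turn is judged finished, or None.

    With use_cues=False this is a silence-only baseline: the turn ends after
    `timeout_frames` consecutive frames without speech. With use_cues=True,
    a filled pause or gaze away from the robot also counts as holding the turn.
    """
    quiet = 0  # consecutive frames in which the user is not holding the turn
    for i, f in enumerate(frames):
        holding = f.speech
        if use_cues:
            holding = holding or f.filled_pause or not f.gaze_on_robot
        quiet = 0 if holding else quiet + 1
        if quiet >= timeout_frames:
            return i
    return None


# Illustrative input: the user speaks, looks away silently while thinking,
# then looks back at the robot and stays silent.
frames = (
    [Frame(speech=True, filled_pause=False, gaze_on_robot=True)] * 3
    + [Frame(speech=False, filled_pause=False, gaze_on_robot=False)] * 5
    + [Frame(speech=False, filled_pause=False, gaze_on_robot=True)] * 3
)
baseline = end_of_turn_index(frames, timeout_frames=3, use_cues=False)
multimodal = end_of_turn_index(frames, timeout_frames=3, use_cues=True)
print(baseline, multimodal)
```

In this toy sequence the silence-only baseline ends the turn while the user is still looking away, whereas the cue-aware variant waits until the user has looked back at the robot and remained silent for the full timeout. A real system would likely bound how long a cue can hold the turn; that refinement is left out here.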