Neuro-Symbolic Reasoning for Multimodal Referring Expression Comprehension in HMI Systems

IF 2.0 | CAS Tier 4 (Computer Science) | JCR Q3 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)
Aman Jain, Anirudh Reddy Kondapally, Kentaro Yamada, Hitomi Yanaka
{"title":"用于人机界面系统中多模态参考表达理解的神经符号推理","authors":"Aman Jain, Anirudh Reddy Kondapally, Kentaro Yamada, Hitomi Yanaka","doi":"10.1007/s00354-024-00243-8","DOIUrl":null,"url":null,"abstract":"<p>Conventional Human–Machine Interaction (HMI) interfaces have predominantly relied on GUI and voice commands. However, natural human communication also consists of non-verbal communication, including hand gestures like pointing. Thus, recent works in HMI systems have tried to incorporate pointing gestures as an input, making significant progress in recognizing and integrating them with voice commands. However, existing approaches often treat these input modalities independently, limiting their capacity to handle complex multimodal instructions requiring intricate reasoning of language and gestures. On the other hand, multimodal tasks requiring complex reasoning are being challenged in the language and vision domain, but these typically do not include gestures like pointing. To bridge this gap, we explore one of the challenging multimodal tasks, called Referring Expression Comprehension (REC), within multimodal HMI systems incorporating pointing gestures. We present a virtual setup in which a robot shares an environment with a user and is tasked with identifying objects based on the user’s language and gestural instructions. Furthermore, to address this challenge, we propose a hybrid neuro-symbolic model combining deep learning’s versatility with symbolic reasoning’s interpretability. Our contributions include a challenging multimodal REC dataset for HMI systems, an interpretable neuro-symbolic model, and an assessment of its ability to generalize the reasoning to unseen environments, complemented by an in-depth qualitative analysis of the model’s inner workings.</p>","PeriodicalId":54726,"journal":{"name":"New Generation Computing","volume":"41 1","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Neuro-Symbolic Reasoning for Multimodal Referring Expression Comprehension in HMI Systems\",\"authors\":\"Aman Jain, Anirudh Reddy Kondapally, Kentaro Yamada, Hitomi Yanaka\",\"doi\":\"10.1007/s00354-024-00243-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Conventional Human–Machine Interaction (HMI) interfaces have predominantly relied on GUI and voice commands. However, natural human communication also consists of non-verbal communication, including hand gestures like pointing. Thus, recent works in HMI systems have tried to incorporate pointing gestures as an input, making significant progress in recognizing and integrating them with voice commands. However, existing approaches often treat these input modalities independently, limiting their capacity to handle complex multimodal instructions requiring intricate reasoning of language and gestures. On the other hand, multimodal tasks requiring complex reasoning are being challenged in the language and vision domain, but these typically do not include gestures like pointing. To bridge this gap, we explore one of the challenging multimodal tasks, called Referring Expression Comprehension (REC), within multimodal HMI systems incorporating pointing gestures. We present a virtual setup in which a robot shares an environment with a user and is tasked with identifying objects based on the user’s language and gestural instructions. 
Furthermore, to address this challenge, we propose a hybrid neuro-symbolic model combining deep learning’s versatility with symbolic reasoning’s interpretability. Our contributions include a challenging multimodal REC dataset for HMI systems, an interpretable neuro-symbolic model, and an assessment of its ability to generalize the reasoning to unseen environments, complemented by an in-depth qualitative analysis of the model’s inner workings.</p>\",\"PeriodicalId\":54726,\"journal\":{\"name\":\"New Generation Computing\",\"volume\":\"41 1\",\"pages\":\"\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2024-02-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"New Generation Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s00354-024-00243-8\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"New Generation Computing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00354-024-00243-8","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Conventional Human–Machine Interaction (HMI) interfaces have predominantly relied on GUIs and voice commands. However, natural human communication also includes non-verbal cues, such as hand gestures like pointing. Recent work on HMI systems has therefore tried to incorporate pointing gestures as an input, making significant progress in recognizing them and integrating them with voice commands. However, existing approaches often treat these input modalities independently, limiting their capacity to handle complex multimodal instructions that require intricate reasoning over language and gestures. Meanwhile, multimodal tasks requiring complex reasoning are being tackled in the language and vision domain, but these typically do not include gestures like pointing. To bridge this gap, we explore one such challenging multimodal task, Referring Expression Comprehension (REC), within multimodal HMI systems that incorporate pointing gestures. We present a virtual setup in which a robot shares an environment with a user and is tasked with identifying objects based on the user's language and gestural instructions. To address this challenge, we propose a hybrid neuro-symbolic model that combines the versatility of deep learning with the interpretability of symbolic reasoning. Our contributions include a challenging multimodal REC dataset for HMI systems, an interpretable neuro-symbolic model, and an assessment of its ability to generalize its reasoning to unseen environments, complemented by an in-depth qualitative analysis of the model's inner workings.
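The abstract does not detail the model's internals, but the core neuro-symbolic idea it names (a neural front end parses the instruction into a symbolic program, which is then executed step by interpretable step over a scene representation that also contains the pointing gesture) can be sketched as follows. This is a minimal illustration in Python under assumed names (Obj, point_score, execute) and an assumed two-operation program format; it is not the authors' implementation.

    # Hypothetical sketch of a neuro-symbolic REC pipeline: in the full system,
    # a neural parser would map "that blue mug" plus a pointing gesture to the
    # symbolic program below; here the parse is hard-coded for illustration.
    from dataclasses import dataclass
    import math

    @dataclass
    class Obj:
        name: str
        color: str
        x: float
        y: float

    def point_score(obj, origin, direction):
        # Cosine similarity between the pointing ray and the origin-to-object
        # vector: higher means the object lies closer to where the user points.
        vx, vy = obj.x - origin[0], obj.y - origin[1]
        norm = math.hypot(vx, vy) * math.hypot(*direction)
        return (vx * direction[0] + vy * direction[1]) / norm if norm else 0.0

    def execute(program, scene, origin, direction):
        # Each symbolic step filters or re-ranks the candidate set, so every
        # intermediate result stays inspectable (the interpretability claim).
        candidates = list(scene)
        for op, arg in program:
            if op == "filter_color":
                candidates = [o for o in candidates if o.color == arg]
            elif op == "pointed_at":
                candidates = sorted(candidates,
                                    key=lambda o: point_score(o, origin, direction),
                                    reverse=True)[:1]
        return candidates

    scene = [Obj("mug", "red", 2.0, 1.0), Obj("mug", "blue", 4.0, 3.0)]
    # Hypothetical parse of "that blue mug" accompanied by a pointing gesture:
    program = [("filter_color", "blue"), ("pointed_at", None)]
    print(execute(program, scene, origin=(0.0, 0.0), direction=(1.0, 0.8)))

Because each step returns an explicit candidate set, a wrong prediction can be traced to the exact step that discarded the correct object, which is the kind of inner-workings analysis the abstract alludes to.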

Source Journal
New Generation Computing (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 5.90
Self-citation rate: 15.40%
Articles published: 47
Review time: >12 weeks
About the journal: The journal is specially intended to support the development of new computational and cognitive paradigms stemming from the cross-fertilization of various research fields. These fields include, but are not limited to, programming (logic, constraint, functional, object-oriented), distributed/parallel computing, knowledge-based systems, agent-oriented systems, and cognitive aspects of human embodied knowledge. It also encourages theoretical and/or practical papers concerning all types of learning, knowledge discovery, evolutionary mechanisms, human cognition and learning, and emergent systems that can lead to key technologies enabling us to build more complex and intelligent systems. The editorial board hopes that New Generation Computing will work as a catalyst among active researchers with broad interests by ensuring a smooth publication process.