Using Features at Multiple Temporal and Spatial Resolutions to Predict Human Behavior in Real Time
L. Zhang, Justin Lieffers, A. Pyarelal
ToM for Teams, published 2022-11-12
DOI: 10.48550/arXiv.2211.06721 (https://doi.org/10.48550/arXiv.2211.06721)
Citations: 1
Abstract
When performing complex tasks, humans naturally reason at multiple temporal and spatial resolutions simultaneously. We contend that for an artificially intelligent agent to effectively model human teammates, i.e., demonstrate computational theory of mind (ToM), it should do the same. In this paper, we present an approach for integrating high- and low-resolution spatial and temporal information to predict human behavior in real time, and we evaluate it on data collected from human subjects performing simulated urban search and rescue (USAR) missions in a Minecraft-based environment. Our model composes neural networks for high- and low-resolution feature extraction with a neural network for behavior prediction, and all three networks are trained simultaneously. The high-resolution extractor robustly encodes dynamically changing goals by taking as input the Manhattan distance difference between the humans' Minecraft avatars and candidate goals in the environment for the latest few actions, computed from a high-resolution gridworld representation. In contrast, the low-resolution extractor encodes participants' historical behavior using a historical state matrix computed from a low-resolution graph representation. Through supervised learning, our model acquires a robust prior for human behavior prediction and can effectively deal with long-term observations. Our experimental results demonstrate that our method significantly improves prediction accuracy compared to approaches that use only high-resolution information.

[...] layer combined with a batch normalization layer as a basic building block for our three neural networks. The output FC layers in the prediction network $g(e_{lr}, e_{hr})$ are passed through softmax and sigmoid functions to obtain the probabilities of the agent's goal ($O_{gp}$) and the likelihood that the next victim is triaged ($O_{vp}$), respectively.
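To make the high-resolution input and the two output heads more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that the "Manhattan distance difference" means the step-to-step change in Manhattan distance from the avatar to each candidate goal, and that positions are 2D grid cells. The function names (`distance_difference_features`, `prediction_heads`) and the example coordinates are hypothetical.

```python
import numpy as np


def distance_difference_features(positions, goals):
    """Change in Manhattan distance to each candidate goal between steps.

    Assumed interpretation, not taken from the paper's code:
    positions is a sequence of T avatar grid cells (x, z), goals is a
    sequence of G candidate goal cells. Returns a (T-1, G) array where a
    negative entry means the avatar moved closer to that goal.
    """
    pos = np.asarray(positions)                      # shape (T, 2)
    tgt = np.asarray(goals)                          # shape (G, 2)
    # Pairwise Manhattan (L1) distances, shape (T, G).
    dists = np.abs(pos[:, None, :] - tgt[None, :, :]).sum(axis=-1)
    # Difference between consecutive time steps, shape (T-1, G).
    return np.diff(dists, axis=0)


def softmax(x):
    """Numerically stable softmax over a 1D array of logits."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def prediction_heads(goal_logits, triage_logit):
    """Map raw outputs to goal probabilities and a triage likelihood.

    Mirrors the softmax/sigmoid split described in the text; the logits
    here stand in for the outputs of the prediction network's FC layers.
    """
    goal_probs = softmax(np.asarray(goal_logits, dtype=float))
    triage_prob = sigmoid(float(triage_logit))
    return goal_probs, triage_prob


# Hypothetical trajectory: the avatar walks along x toward the first goal.
features = distance_difference_features(
    positions=[(0, 0), (1, 0), (2, 0)],
    goals=[(5, 0), (0, 5)],
)
goal_probs, triage_prob = prediction_heads([0.0, 0.0], 0.0)
```

In this toy trajectory each step reduces the distance to the first goal by one cell and increases the distance to the second by one cell, so the sign pattern of the features already hints at the intended goal. With equal logits, the softmax head assigns each goal the same probability, and a zero logit gives a triage likelihood of 0.5.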