{"title":"Scene-constrained spatial-temporal graph convolutional network for pedestrian trajectory prediction","authors":"Chen Haodong, Ji Qingge","doi":"10.11834/jig.221027","DOIUrl":null,"url":null,"abstract":"目的 针对行人轨迹预测问题,已有的几种结合场景信息的方法基于合并操作通过神经网络隐式学习场景与行人运动的关联,无法直观地解释场景对单个行人运动的调节作用。除此之外,基于图注意力机制的时空图神经网络旨在学习全局模式下行人之间的社会交互,在人群拥挤场景下精度不佳。鉴于此,本文提出一种场景限制时空图卷积神经网络(scene-constrained spatial-temporal graph convolutional neural network,Scene-STGCNN)。方法 Scene-STGCNN由运动模块、基于场景的微调模块、时空卷积和时空外推卷积组成。运动模块以时空图卷积提取局部行人时空特征,避免了时空图神经网络在全局模式下学习交互的局限性。基于场景的微调模块将场景信息嵌入为掩模矩阵,用来调节运动模块生成的中间运动特征,具备实际场景下的物理解释性。通过最小化核密度估计下真实轨迹的负对数似然,增强Scene-STGCNN输出的多模态性,减少预测误差。结果 实验在公开数据集ETH (包含ETH和HOTEL)和UCY (包含UNIV、ZARA1和ZARA2)上与其他7种主流方法进行比较,就平均值而言,相对于性能第2的模型,平均位移误差(average displacement error,ADE)值减少了12%,最终位移误差(final displacement error,FDE)值减少了9%。在同样的数据集上进行了消融实验以验证基于场景的微调模块的有效性,结果表明基于场景的微调模块能有效建模场景对行人轨迹的调节作用,从而减小算法的预测误差。结论 本文提出的场景限制时空图卷积网络能有效融合场景和行人运动,在学习局部模式下行人交互的同时基于场景特征对轨迹特征做实时性调节,相比于其他主流方法,具有更优的性能。;Objective Pedestrian trajectory prediction is essential for such domains like unmanned vehicles,security surveillance,and social robotics nowadays. Trajectory prediction is beneficial for computer systems to perform better decision making and planning to some extent. Current methods are focused on pedestrian trajectory information,and scene elements-related spatial constraints on pedestrian motion in the same space are challenged to explain human-to-human social interactions further,in which future location of pedestrians cannot be located in building walls,and pedestrians at building corners undergo large velocity direction deflections due to cornering behavior. The pathways can be focused on the integrated scene information,for which the scene image is melted into a one-dimensional vector and merged with the trajectory information. Two-dimensional spatial signal of the scene will be distorted and it cannot be intuitively explained according to the modulating effect of the scene on pedestrian motion. To build a spatiotemporal graph representation of pedestrians,recent graph neural network(GNN) is used to develop a method based on graph attention network(GAT),in which pedestrians are as the graph nodes,trajectory features as the node attributes,and pedestrians-between spatial interactions are as the edges in the graph. These sorts of methods can be used to focus on pedestrians-between social interactions in the global scale. However,for crowded scenes,graph attention mechanism may not be able to assign appropriate weights to each pedestrian accurately,resulting in poor algorithm accuracy. To resolve the two problems mentioned above,we develop a scene constraints-based spatiotemporal graph convolutional network,called Scene-STGCNN,which aggregates pedestrian motion status with a graph convolutional neural network for local interactions,and it achieves accurate aggregation of pedestrian motion status with a small number of parameters. At the same time,we design a scene-based fine-tuning module to explicitly model the modulating effect of scenes on pedestrian motion with the information of neighboring scene changes as input. Method Scene-STGCNN consists of a motion module,a scene-based fine-tuning module,spatiotemporal convolution,and spatiotemporal extrapolation convolution. For the motion module,the graph convolution is a 1 × 1 coresized convolutional neural network(CNN) layer for embedding pedestrian velocity information. The residual convolution is composed of CNN layer of 1 × 1 kernel size and BatchNorm(BN) layer. 
Temporal convolution is organized of BN layer, PReLU layer,3 × 1 core-sized CNN layer,BN layer and Dropout layer as well. The motion module takes the pedestrian velocity spatiotemporal graph and the scene mask matrix as input,in which CNN-based pedestrian velocity spatiotemporal graph is encoded and the pedestrian spatiotemporal features of existing multiple frames are fused. For the scene-based finetuning module,temporal neighboring scene change information is first introduced to generate the scene-based pedestrian spatiotemporal map,and the embedding of the pedestrian spatiotemporal map by scene convolution is then performed to obtain the scene mask matrix,which is used to make Hadamard products with the intermediate motion features in the motion module. The real-time regulation role of the scene on pedestrians can be explicitly modeling further. Spatiotemporal convolution as a transition coding network consists of two temporal gating units and a spatial convolution,which is used to enhance the temporal correlation and contextual spatial dependence of pedestrian motion. A two-dimensional Gaussian distribution-related trajectory distribution is generated in terms of temporal extrapolation convolution. The kernel density estimation-based negative log-likelihood as the loss function will enhance the multimodality of the Scene-STGCNN prediction distribution while the prediction loss is optimized. Result Experiments are carried out to compare with the other related seven popular methods on the publicly available datasets ETH(including ETH and HOTEL) and UCY(including UNIV, ZARA1,and ZARA2). The average displacement error(ADE) values are optimized by 12%,and the final displacement error(FDE) values are optimized by 9% in terms of average values. Ablation experiments are used to verify the effectiveness of the scene-based fine-tuning module,and the results demonstrate that the scene-based fine-tuning module can effectively model the modulation effect of the scene on pedestrian trajectory,and the prediction error of the algorithm is optimized as well. In addition,qualitative analysis is focused on the issues of Scene-STGCNN-captured inherent patterns of pedestrian motion and the involved prediction distribution. The visualization results show that Scene-STGCNN can be used to learn the pedestrian motion patterns effectively while maintaining accurate predictions. Conclusion we facilitate a pedestrian trajectory prediction model,called Scene-STGCNN,which can fuse scene information with trajectory features effectively through a scene-based fine-tuning module. Furthermore,Scene-STGCNN potentials can be focused on scene information-related pedestrian trajectory prediction method to a certain extent via modeling the modulation effect of scene on pedestrian motion.","PeriodicalId":36336,"journal":{"name":"中国图象图形学报","volume":"141 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"中国图象图形学报","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.11834/jig.221027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Computer Science","Score":null,"Total":0}
Abstract
Objective: Pedestrian trajectory prediction is essential in domains such as autonomous vehicles, security surveillance, and social robotics, where accurate forecasts help systems make better decisions and plans. Most existing methods focus on trajectory information alone, yet scene elements impose spatial constraints on pedestrian motion that social interaction alone cannot explain: a pedestrian's future position cannot lie inside a building wall, and pedestrians rounding a building corner undergo large deflections in velocity direction. Several methods do incorporate scene information, but they flatten the scene image into a one-dimensional vector, merge it with the trajectory features, and let a neural network learn the scene-motion association implicitly; this distorts the two-dimensional spatial signal of the scene and cannot intuitively explain how the scene modulates the motion of an individual pedestrian. Meanwhile, spatial-temporal graph neural networks based on the graph attention network (GAT) represent pedestrians as graph nodes, trajectory features as node attributes, and pairwise spatial interactions as edges, and learn social interaction among pedestrians in a global pattern; in crowded scenes, however, the attention mechanism may fail to assign appropriate weights to individual pedestrians, degrading accuracy. To address these two problems, this paper proposes a scene-constrained spatial-temporal graph convolutional neural network (Scene-STGCNN). Scene-STGCNN aggregates pedestrian motion states through graph convolution over local interactions, achieving accurate aggregation with few parameters, and introduces a scene-based fine-tuning module that explicitly models the modulating effect of the scene on pedestrian motion, taking neighboring scene-change information as input.

Method: Scene-STGCNN consists of a motion module, a scene-based fine-tuning module, a spatial-temporal convolution, and a spatial-temporal extrapolation convolution. In the motion module, the graph convolution is a convolutional layer with 1 × 1 kernels that embeds pedestrian velocity information; the residual convolution comprises a 1 × 1 convolutional layer followed by a batch normalization (BN) layer; and the temporal convolution stacks a BN layer, a PReLU activation, a 3 × 1 convolutional layer, a second BN layer, and a dropout layer. The module takes the pedestrian velocity spatial-temporal graph and the scene mask matrix as input, encodes the velocity graph with these convolutions, and fuses pedestrian spatial-temporal features across the observed frames. The scene-based fine-tuning module first uses temporally neighboring scene-change information to build a scene-based pedestrian spatial-temporal map, then embeds this map through a scene convolution to obtain the scene mask matrix, which modulates the intermediate motion features of the motion module via a Hadamard product; the real-time regulating role of the scene on pedestrians is thus modeled explicitly and retains physical interpretability in real scenes. The spatial-temporal convolution, a transitional encoding network composed of two temporal gating units and one spatial convolution, strengthens the temporal correlation and contextual spatial dependence of pedestrian motion. The spatial-temporal extrapolation convolution then generates a trajectory distribution parameterized as a two-dimensional Gaussian. Training minimizes the negative log-likelihood of the ground-truth trajectory under kernel density estimation, which enhances the multimodality of the predicted distribution while reducing prediction error.

Result: Scene-STGCNN is compared with seven mainstream methods on the public datasets ETH (containing the ETH and HOTEL scenes) and UCY (containing the UNIV, ZARA1, and ZARA2 scenes). On average, relative to the second-best model, the average displacement error (ADE) decreases by 12% and the final displacement error (FDE) decreases by 9%. Ablation experiments on the same datasets verify the effectiveness of the scene-based fine-tuning module: it models the modulating effect of the scene on pedestrian trajectories and thereby reduces the prediction error of the algorithm. A qualitative analysis further examines the inherent motion patterns captured by Scene-STGCNN and its predicted distributions; the visualizations show that the model learns pedestrian motion patterns effectively while maintaining accurate predictions.

Conclusion: The proposed Scene-STGCNN effectively fuses scene information with pedestrian motion. It learns pedestrian interaction in a local pattern while adjusting trajectory features in real time according to scene features, and it outperforms the compared mainstream methods.
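To make the graph representation concrete, the following sketch builds a pedestrian velocity spatial-temporal graph of the kind described above: pedestrians are nodes, frame-to-frame velocities are node attributes, and spatial interactions are weighted edges. The inverse-distance edge weights and symmetric normalization are borrowed from Social-STGCNN-style models as a plausible assumption; the abstract does not specify the exact weighting.

```python
import torch

def velocity_graph(positions, eps=1e-6):
    """Build node attributes (velocities) and normalized adjacency matrices.

    positions: (T, N, 2) observed positions of N pedestrians over T frames.
    Returns vel (T-1, N, 2) and adj (T-1, N, N).
    """
    # Node attributes: frame-to-frame velocities.
    vel = positions[1:] - positions[:-1]                       # (T-1, N, 2)
    # Pairwise displacements and distances at each frame.
    diff = positions[1:, :, None] - positions[1:, None, :]     # (T-1, N, N, 2)
    dist = diff.norm(dim=-1)                                   # (T-1, N, N)
    # Edge weights: inverse distance (an assumed, Social-STGCNN-style choice).
    adj = 1.0 / (dist + eps)
    adj.diagonal(dim1=-2, dim2=-1).zero_()                     # no self-loops
    # Symmetric normalization D^{-1/2} A D^{-1/2}.
    dinv = adj.sum(-1).clamp_min(eps).pow(-0.5)
    adj = dinv[..., :, None] * adj * dinv[..., None, :]
    return vel, adj
```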
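The motion-module layers are enumerated explicitly in the abstract: a 1 × 1 graph-embedding convolution, a 1 × 1 residual convolution with BN, and a BN-PReLU-(3 × 1 Conv)-BN-Dropout temporal block. Below is a minimal PyTorch sketch; the channel sizes, dropout rate, and the exact residual wiring are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """BN -> PReLU -> 3x1 temporal convolution -> BN -> Dropout."""
    def __init__(self, channels, dropout=0.1):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            # kernel (3, 1): convolve over time, keep the pedestrian axis intact
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
            nn.BatchNorm2d(channels),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.block(x)

class MotionModule(nn.Module):
    """Embeds pedestrian velocities and fuses features across observed frames."""
    def __init__(self, in_channels=2, hidden=32):
        super().__init__()
        # 1x1 convolution embedding per-node velocity into hidden channels.
        self.graph_embed = nn.Conv2d(in_channels, hidden, kernel_size=1)
        # 1x1 residual convolution with BatchNorm, as listed in the abstract.
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
        )
        self.temporal = TemporalBlock(hidden)

    def forward(self, v):
        # v: (batch, 2, T_obs, N) velocity spatial-temporal graph.
        return self.temporal(self.graph_embed(v)) + self.residual(v)
```

In the full model, the output of this module would then be gated by the scene mask (next sketch) before the spatial-temporal convolution.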
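The scene-based fine-tuning step reduces to an element-wise (Hadamard) modulation: a scene convolution embeds the scene-change map into a mask with the same shape as the intermediate motion features, and the two are multiplied. The abstract specifies only the embedding and the Hadamard product; the 1 × 1 kernel and the sigmoid squashing of mask values into (0, 1) are assumptions in this sketch.

```python
import torch
import torch.nn as nn

class SceneFineTuning(nn.Module):
    """Embed scene-change information into a mask and gate motion features."""
    def __init__(self, scene_channels, feature_channels):
        super().__init__()
        self.scene_conv = nn.Sequential(
            nn.Conv2d(scene_channels, feature_channels, kernel_size=1),
            nn.Sigmoid(),  # assumed: keep mask values in (0, 1)
        )

    def forward(self, motion_features, scene_graph):
        # motion_features: (B, C, T, N); scene_graph: (B, S, T, N)
        mask = self.scene_conv(scene_graph)
        return motion_features * mask  # Hadamard product
```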
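The spatial-temporal convolution is described only as "two temporal gating units and a spatial convolution". A common realization of that sandwich is the gated temporal unit of STGCN-style models; the sketch below assumes a GLU-style gate and a one-hop graph convolution in the middle, which may differ from the authors' design.

```python
import torch
import torch.nn as nn

class GatedTemporalUnit(nn.Module):
    """GLU-style temporal gate: half the channels gate the other half."""
    def __init__(self, channels, kernel_t=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels,
                              kernel_size=(kernel_t, 1),
                              padding=(kernel_t // 2, 0))

    def forward(self, x):                       # x: (B, C, T, N)
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)

class STConv(nn.Module):
    """Temporal gate -> graph (spatial) convolution -> temporal gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate1 = GatedTemporalUnit(channels)
        self.gate2 = GatedTemporalUnit(channels)
        self.theta = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x, adj):                  # adj: (N, N) or (B, T, N, N)
        h = self.gate1(x)
        # Propagate features along graph edges, then mix channels with 1x1 conv.
        h = torch.einsum('bctn,nm->bctm', h, adj) if adj.dim() == 2 \
            else torch.einsum('bctn,btnm->bctm', h, adj)
        return self.gate2(self.theta(h))
```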
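For the loss, the abstract prescribes the negative log-likelihood of the ground-truth trajectory under kernel density estimation. A differentiable version can be written over trajectories sampled (with reparameterization) from the predicted bivariate Gaussians; the isotropic Gaussian kernel and the fixed bandwidth below are simplifying assumptions, not values from the paper.

```python
import math
import torch

def kde_nll(samples, gt, bandwidth=0.05):
    """NLL of the ground truth under a Gaussian KDE fitted to sampled trajectories.

    samples: (K, T, 2) trajectories drawn (reparameterised, so gradients flow)
             from the predicted two-dimensional Gaussian distribution.
    gt:      (T, 2) ground-truth trajectory.
    """
    K = samples.shape[0]
    diff = samples - gt.unsqueeze(0)                           # (K, T, 2)
    # Log of an isotropic 2-D Gaussian kernel centred on each sample.
    log_kernel = (-diff.pow(2).sum(-1) / (2 * bandwidth ** 2)
                  - math.log(2 * math.pi * bandwidth ** 2))    # (K, T)
    # KDE density is the mean of kernels; take its log stably with logsumexp.
    log_density = torch.logsumexp(log_kernel, dim=0) - math.log(K)
    return -log_density.mean()
```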
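The two reported metrics have standard definitions: ADE is the mean Euclidean distance between predicted and ground-truth positions over all predicted timesteps and pedestrians, and FDE is that distance at the final predicted timestep. A minimal sketch:

```python
import torch

def ade_fde(pred, gt):
    # pred, gt: (T_pred, N, 2) predicted and ground-truth positions.
    dist = torch.linalg.norm(pred - gt, dim=-1)   # (T_pred, N)
    ade = dist.mean().item()       # average over all timesteps and pedestrians
    fde = dist[-1].mean().item()   # displacement at the final timestep only
    return ade, fde
```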
中国图象图形学报 (Journal of Image and Graphics), category: Computer Science, Computer Graphics and Computer-Aided Design
CiteScore: 1.20
Self-citation rate: 0.00%
Articles published: 6776
About the journal:
Journal of Image and Graphics (ISSN 1006-8961, CN 11-3758/TB, CODEN ZTTXFZ) is an authoritative academic journal supervised by the Chinese Academy of Sciences and co-sponsored by the Institute of Space and Astronautical Information Innovation of the Chinese Academy of Sciences (ISIAS), the Chinese Society of Image and Graphics (CSIG), and the Beijing Institute of Applied Physics and Computational Mathematics (BIAPM). The journal integrates high-tech theory, technical methods, and the industrialisation of applied research results in computer image and graphics, and mainly publishes innovative, high-level scientific papers on basic and applied research in image and graphics science and its closely related fields. Paper formats include reviews, technical reports, project progress, academic news, new-technology reviews, new-product introductions, and industrialisation research. The content covers fields such as image analysis and recognition, image understanding and computer vision, computer graphics, virtual reality and augmented reality, system simulation, and animation, with special columns organized around research hotspots and cutting-edge topics.
Journal of Image and Graphics reaches a wide readership, including scientific and technical personnel, enterprise managers, and postgraduate and undergraduate students engaged in fields such as national defence, military, aviation, aerospace, communications, electronics, automotive, agriculture, meteorology, environmental protection, remote sensing, mapping, oil fields, construction, transportation, finance, telecommunications, education, medical care, film and television, and art.
Journal of Image and Graphics is indexed in many important domestic and international scientific literature databases, including the EBSCO database (United States), the JST database (Japan), the Scopus database (Netherlands), China Science and Technology Paper Statistics and Analysis (Annual Research Report), the China Science Citation Database (CSCD), the China Academic Journal Network Publishing Database (CAJD), China Academic Journal Abstracts, Chinese Science Abstracts (Series A), China Electronic Science Abstracts, Chinese Core Journals Abstracts, Chinese Academic Journals on CD-ROM, and the China Academic Journals Comprehensive Evaluation Database.