Journal: Computer Vision and Image Understanding (Q2, Computer Science, Artificial Intelligence)
DOI: 10.1016/j.cviu.2024.104107
Published: 2024-08-08 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S1077314224001887
End-to-end pedestrian trajectory prediction via Efficient Multi-modal Predictors
Pedestrian trajectory prediction plays a key role in understanding human behavior and guiding autonomous driving. It is a difficult task due to the multi-modal nature of human motion. Recent advances have mainly focused on modeling this multi-modality, either with implicit generative models or with explicit pre-defined anchors. However, the former is limited by the sampling problem, while the latter imposes a strong prior on the data, and both require extra tricks to achieve good performance. To address these issues, we propose a simple yet effective framework called Efficient Multi-modal Predictors (EMP), which casts off the generative paradigm and predicts multi-modal trajectories in an end-to-end manner. This is achieved by combining a set of parallel predictors with a model-error-based sparse selector. During training, the set of parallel multi-modal predictors converges into disjoint subsets, with each subset specializing in one mode, thus obtaining multi-modal prediction with no human prior and mitigating the problems of both approaches above. Experiments on the SDD/ETH-UCY/NBA datasets show that EMP achieves state-of-the-art performance with the highest inference speed. Additionally, we show that by replacing their multi-modal modules with EMP, state-of-the-art works outperform their own baselines, which further validates the versatility of EMP. Moreover, we formally prove that EMP can alleviate the problem of mode collapse and has a low test error bound.
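The model-error-based sparse selector described in the abstract can be sketched as a winner-take-all selection over the parallel predictors: each predictor emits a candidate trajectory, and only the one closest to the ground truth is selected (and, during training, updated), so predictors gradually specialize in distinct modes. A minimal illustration under assumed details (the function name `wta_select` and the mean-L2 error metric are not from the paper):

```python
import numpy as np

def wta_select(candidates, ground_truth):
    """Winner-take-all selection among K parallel predictors.

    candidates: (K, T, 2) array, one predicted trajectory per predictor.
    ground_truth: (T, 2) array, the observed future trajectory.
    Returns the index of the predictor with the lowest mean L2 error,
    together with the per-predictor errors.
    """
    # Per-timestep Euclidean distance, averaged over the horizon: shape (K,)
    errors = np.linalg.norm(candidates - ground_truth, axis=-1).mean(axis=-1)
    return int(np.argmin(errors)), errors

# Toy example: 3 predictors, a 4-step future, constant offsets from the truth.
gt = np.zeros((4, 2))
cands = np.stack([gt + 0.1, gt + 0.5, gt + 2.0])
best, errs = wta_select(cands, gt)
# Predictor 0 has the smallest offset, so it is selected (best == 0).
```

At inference time, all K candidate trajectories would be kept as the multi-modal prediction; the selection only gates which predictor receives the gradient during training.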
Journal Introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems