Multi-modal transformer with language modality distillation for early pedestrian action anticipation

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding Pub Date : 2024-09-10 DOI:10.1016/j.cviu.2024.104144

Nada Osman, Guglielmo Camporese, Lamberto Ballan

{"title":"Multi-modal transformer with language modality distillation for early pedestrian action anticipation","authors":"Nada Osman, Guglielmo Camporese, Lamberto Ballan","doi":"10.1016/j.cviu.2024.104144","DOIUrl":null,"url":null,"abstract":"<div><p>Language-vision integration has become an increasingly popular research direction within the computer vision field. In recent years, there has been a growing recognition of the importance of incorporating linguistic information into visual tasks, particularly in domains such as action anticipation. This integration allows anticipation models to leverage textual descriptions to gain deeper contextual understanding, leading to more accurate predictions. In this work, we focus on pedestrian action anticipation, where the objective is the early prediction of pedestrians’ future actions in urban environments. Our method relies on a multi-modal transformer model that encodes past observations and produces predictions at different anticipation times, employing a learned mask technique to filter out redundancy in the observed frames. Instead of relying solely on visual cues extracted from images or videos, we explore the impact of integrating textual information in enriching the input modalities of our pedestrian action anticipation model. We investigate various techniques for generating descriptive captions corresponding to input images, aiming to enhance the anticipation performance. Evaluation results on available public benchmarks demonstrate the effectiveness of our method in improving the prediction performance at different anticipation times compared to previous works. Additionally, incorporating the language modality in our anticipation model proved significant improvement, reaching a 29.5% increase in the F1 score at 1-second anticipation and a 16.66% increase at 4-second anticipation. These results underscore the potential of language-vision integration in advancing pedestrian action anticipation in complex urban environments.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104144"},"PeriodicalIF":4.3000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S107731422400225X/pdfft?md5=56f12e2679069b787f5e626421a0e104&pid=1-s2.0-S107731422400225X-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S107731422400225X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Language-vision integration has become an increasingly popular research direction within the computer vision field. In recent years, there has been a growing recognition of the importance of incorporating linguistic information into visual tasks, particularly in domains such as action anticipation. This integration allows anticipation models to leverage textual descriptions to gain deeper contextual understanding, leading to more accurate predictions. In this work, we focus on pedestrian action anticipation, where the objective is the early prediction of pedestrians’ future actions in urban environments. Our method relies on a multi-modal transformer model that encodes past observations and produces predictions at different anticipation times, employing a learned mask technique to filter out redundancy in the observed frames. Instead of relying solely on visual cues extracted from images or videos, we explore the impact of integrating textual information in enriching the input modalities of our pedestrian action anticipation model. We investigate various techniques for generating descriptive captions corresponding to input images, aiming to enhance the anticipation performance. Evaluation results on available public benchmarks demonstrate the effectiveness of our method in improving the prediction performance at different anticipation times compared to previous works. Additionally, incorporating the language modality in our anticipation model proved significant improvement, reaching a 29.5% increase in the F1 score at 1-second anticipation and a 16.66% increase at 4-second anticipation. These results underscore the potential of language-vision integration in advancing pedestrian action anticipation in complex urban environments.

查看原文本刊更多论文

多模态转换器与语言模态提炼，用于早期行人行动预测

语言-视觉整合已成为计算机视觉领域日益流行的研究方向。近年来，越来越多的人认识到将语言信息融入视觉任务的重要性，尤其是在动作预测等领域。这种整合使预测模型能够利用文本描述获得更深入的上下文理解，从而做出更准确的预测。在这项工作中，我们的重点是行人行动预测，目标是尽早预测行人在城市环境中的未来行动。我们的方法依赖于多模态转换器模型，该模型可对过去的观察结果进行编码，并在不同的预测时间进行预测，同时采用学习掩码技术来过滤观察帧中的冗余信息。我们没有单纯依赖从图像或视频中提取的视觉线索，而是探索了整合文本信息对丰富行人行动预测模型输入模式的影响。我们研究了生成与输入图像相对应的描述性标题的各种技术，旨在提高预测性能。现有公共基准的评估结果表明，与之前的研究相比，我们的方法能有效提高不同预测时间的预测性能。此外，在我们的预测模型中加入语言模式后，效果显著，在 1 秒钟预测时间内，F1 分数提高了 29.5%，在 4 秒钟预测时间内，F1 分数提高了 16.66%。这些结果凸显了语言-视觉整合在复杂城市环境中提高行人行动预测能力的潜力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Computer Vision and Image Understanding 工程技术-工程：电子与电气

CiteScore

7.80

自引率

4.40%

发文量

112

审稿时长

79 days

期刊介绍： The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research Areas Include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems