IntSTR: An integrated spatio-temporal relation transformer for video object detection
Wentao Zheng, Hong Zheng, Yuquan Sun, Ying Jing
Neurocomputing, Volume 658, Article 131704 (published 2025-10-02). DOI: 10.1016/j.neucom.2025.131704
Abstract
In recent years, Transformer-based video object detection (VOD) methods have achieved remarkable progress by replacing the hand-crafted components traditionally used in CNN-based detectors. However, many existing approaches rely on staged spatio-temporal modeling strategies, which increase model complexity and restrict early interaction between spatial and temporal information. To overcome these limitations, we propose IntSTR, a novel framework for unified spatio-temporal modeling. At its core, the spatio-temporal relation encoder (STRE) integrates spatio-temporal feature processing within a single encoder through cascaded attention modules. To strengthen temporal consistency, the temporal query relation (TQR) module explicitly captures geometric relations between object queries across adjacent frames with minimal computational overhead. In addition, the Temporal Feature Memory (TFM) maintains a dynamic memory bank that caches temporal contexts, enabling effective feature aggregation and efficient online processing. Extensive experiments on the ImageNet VID dataset validate the effectiveness of our approach. IntSTR achieves an excellent trade-off between accuracy and efficiency, reaching a competitive 87.2% mAP50 with the ResNet-101 backbone while maintaining real-time performance at 33.4 FPS.
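The abstract's description of the TFM suggests a familiar pattern for online VOD: a fixed-capacity FIFO cache of per-frame query features that the current frame attends over. The sketch below is one illustrative reading of that idea, not the paper's implementation; the class name `TemporalFeatureMemory`, the `write`/`read` API, the capacity of four frames, and the single-head dot-product aggregation are all assumptions made for the example.

```python
from collections import deque

import numpy as np


class TemporalFeatureMemory:
    """Hypothetical sketch of a TFM-style memory bank.

    Caches per-frame query features in a bounded FIFO buffer and
    aggregates the cached temporal context into the current frame's
    features with one softmax-attention step.
    """

    def __init__(self, capacity: int = 4, dim: int = 256):
        self.bank = deque(maxlen=capacity)  # oldest frame evicted first
        self.dim = dim

    def write(self, frame_features: np.ndarray) -> None:
        """Cache the query features (num_queries, dim) of a processed frame."""
        assert frame_features.shape[-1] == self.dim
        self.bank.append(frame_features)

    def read(self, current: np.ndarray) -> np.ndarray:
        """Fuse cached context into `current` (num_queries, dim)."""
        if not self.bank:
            return current  # first frame: nothing cached yet
        memory = np.concatenate(list(self.bank), axis=0)  # (M, dim)
        scores = current @ memory.T / np.sqrt(self.dim)   # (N, M) similarities
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
        return current + weights @ memory                 # residual fusion
```

In an online setting, a detector would call `read` before decoding the current frame and `write` after it, so the bank always holds the most recent frames' context at a memory cost bounded by the chosen capacity.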
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. The journal covers neurocomputing theory, practice, and applications.