Video Object Detection Considering Dynamic Neighborhood Feature Multiplexing

IF 8.6 | CAS Tier 1 (Computer Science) | JCR Q1, AUTOMATION & CONTROL SYSTEMS
Jiahui Yu;Yifan Chen;Xuna Wang;Long Chen;Hang Chen;Dalin Zhou;Yingke Xu;Zhaojie Ju
DOI: 10.1109/TSMC.2025.3572123
Journal: IEEE Transactions on Systems Man Cybernetics-Systems, vol. 55, no. 8, pp. 5451-5463
Published: 2025-06-05
URL: https://ieeexplore.ieee.org/document/11025159/
Citations: 0

Abstract

Video object detection is essential for human-interaction applications, including bimanual manipulation sensing (BMS). However, the performance of video detection in practical applications remains limited by the difficulty of analyzing long-range spatiotemporal dependencies. How do humans sense bimanual manipulation in videos, especially in deteriorated clips? We argue that humans analyze the current clip based on earlier memory, namely long-term spatial and temporal dependencies (LTSTD). Most existing methods have yet to report significant results because they explore these dependencies only to a limited extent. For future applications, an easy-to-integrate module is generally preferable to a complex end-to-end framework. Therefore, we propose DNFM, a dynamic neighborhood feature multiplexing mechanism for online video object detection that learns LTSTD in a flexible and robust way and boosts existing detection results. Specifically, we develop dynamic memory enhancement neural networks for better long-term feature aggregation at negligible additional computation cost, and we multiplex each frame's features to aggregate key enhanced representations under the guidance of dynamic memory recall. DNFM improves various well-known detectors on BMS and other challenging detection tasks, with particular attention devoted to "low-quality" frame detection. Experimental results show that, while achieving state-of-the-art detection performance, DNFM clearly demonstrates an easy-to-integrate operation for boosting video object detection results.
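The abstract describes enhancing each frame's features by recalling a memory of earlier frames. The paper's actual DNFM architecture is not detailed here, so the following is only a minimal illustrative sketch of the general idea, under assumed details: a bounded memory bank of past frame feature vectors and a similarity-weighted (attention-style) recall added residually to the current frame. The class name, capacity, and aggregation rule are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class DynamicMemoryAggregator:
    """Illustrative sketch (not the authors' implementation): keep a
    bounded memory of past frame features and enhance the current
    frame via similarity-weighted recall over that memory."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.memory = []  # list of (D,) feature vectors from earlier frames

    def enhance(self, frame_feat):
        # frame_feat: (D,) feature vector of the current frame
        if self.memory:
            mem = np.stack(self.memory)       # (M, D) memory bank
            scores = mem @ frame_feat         # similarity of memory to current frame
            weights = softmax(scores)         # recall weights over memory entries
            recalled = weights @ mem          # (D,) aggregated memory feature
            enhanced = frame_feat + recalled  # residual enhancement
        else:
            enhanced = frame_feat             # no memory yet: pass through
        # update memory, dropping the oldest entry when full
        self.memory.append(frame_feat)
        if len(self.memory) > self.capacity:
            self.memory.pop(0)
        return enhanced
```

Because the aggregation is a weighted sum over a fixed-size memory, the per-frame overhead stays small, which is consistent with the abstract's claim of negligible additional computation cost.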
Source Journal

IEEE Transactions on Systems Man Cybernetics-Systems (AUTOMATION & CONTROL SYSTEMS; COMPUTER SCIENCE, CYBERNETICS)

CiteScore: 18.50
Self-citation rate: 11.50%
Annual articles: 812
Review time: 6 months
Journal description: The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.