Hybrid multi-attention transformer for robust video object detection

IF 7.5 · Zone 2 (Computer Science) · JCR Q1 (Automation & Control Systems)

Engineering Applications of Artificial Intelligence · Pub Date: 2024-11-15 · DOI: 10.1016/j.engappai.2024.109606

Sathishkumar Moorthy, Sachin Sakthi K.S., Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo

Volume 139, Article 109606. Full text: https://www.sciencedirect.com/science/article/pii/S0952197624017640

Citations: 0

Abstract

Video object detection (VOD) is the task of detecting objects in videos, a challenge due to the changing appearance of objects over time, leading to potential detection errors. Recent research has addressed this by aggregating features from neighboring frames and incorporating information from distant frames to mitigate appearance deterioration. However, relying solely on object candidate regions in distant frames, independent of object position, has limitations, as it depends heavily on the performance of these regions and struggles with deteriorated appearances. To overcome these challenges, we propose a novel Hybrid Multi-Attention Transformer (HyMAT) module as our main contribution. HyMAT enhances relevant correlations while suppressing flawed information by searching for an agreement between whole correlation vectors. This module is designed for flexibility and can be integrated into both self- and cross-attention blocks to significantly improve detection accuracy. Additionally, we introduce a simplified Transformer-based object detection framework, named Hybrid Multi-Attention Object Detection (HyMATOD), which leverages competent feature reprocessing and target-background embeddings to more effectively utilize temporal references. Our approach demonstrates state-of-the-art performance, as evaluated on the ImageNet video object detection benchmark (ImageNet VID) and the University at Albany DEtection and TRACking (UA-DETRAC) benchmarks. Specifically, our HyMATOD model achieves an impressive 86.7% mean Average Precision (mAP) on the ImageNet VID dataset, establishing its superiority and practicality for video object detection tasks. These results underscore the significance of our contributions to advancing the field of VOD.
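The abstract describes HyMAT only at a high level: correlations are strengthened or suppressed according to how well whole correlation vectors agree with one another. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. The function names, the cosine-similarity agreement measure, the `temperature` parameter, and the residual update are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_correlations(corr, temperature=0.1):
    """Refine a correlation map by agreement between whole correlation vectors.

    corr has shape (num_queries, num_keys). Column j is the correlation
    vector of key j, i.e. its scores against every query. Each vector is
    updated with an agreement-weighted average of all vectors, so vectors
    that agree with many others are reinforced (assumed scheme).
    """
    vecs = corr.T                                        # (num_keys, num_queries)
    unit = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)
    agreement = softmax(unit @ unit.T / temperature)     # (num_keys, num_keys)
    residual = agreement @ vecs                          # (num_keys, num_queries)
    return corr + residual.T


def consensus_attention(q, k, v, temperature=0.1):
    """Scaled dot-product attention with the refinement step inserted."""
    corr = q @ k.T / np.sqrt(q.shape[-1])                # raw correlation map
    corr = refine_correlations(corr, temperature)
    weights = softmax(corr, axis=-1)                     # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # hypothetical current-frame features
k = rng.normal(size=(6, 8))   # hypothetical reference-frame features
v = rng.normal(size=(6, 8))
out, weights = consensus_attention(q, k, v)
print(out.shape, weights.shape)
```

In this sketch the same function covers both uses mentioned in the abstract: passing features from one frame as `q`, `k`, and `v` gives a self-attention block, while taking `k` and `v` from a reference frame gives a cross-attention block.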


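Source journal

Engineering Applications of Artificial Intelligence (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 9.60
Self-citation rate: 10.00%
Papers published: 505
Review time: 68 days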

About the journal: Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.