Bridging the gap between object detection in close-up and high-resolution wide shots

IF 4.3 | Zone 3, Computer Science | JCR Q2, Computer Science, Artificial Intelligence

Computer Vision and Image Understanding | Pub Date: 2024-09-23 | DOI: 10.1016/j.cviu.2024.104181

Wenxi Li, Yuchen Guo, Jilai Zheng, Haozhe Lin, Chao Ma, Lu Fang, Xiaokang Yang

Citations: 0

Abstract

Recent years have seen a significant rise in gigapixel-level image/video capture systems and benchmarks with high-resolution wide (HRW) shots. Unlike close-up shots such as those in MS COCO, the higher resolution and wider field of view raise new research and application problems, such as how to perform accurate and efficient object detection on such large inputs on low-power edge devices like UAVs. HRW shots pose several unique challenges. (1) Sparse information: the objects of interest cover less of the image area. (2) Varied scale: object scale changes by 10 to 100× within a single image. (3) Incomplete objects: the sliding-window strategy used to handle the large input truncates objects at window edges. (4) Multi-scale information: it is unclear how to use multi-scale information in training and inference. Consequently, directly using a close-up detector leads to inaccuracy and inefficiency. In this paper, we systematically investigate this problem and bridge the gap between object detection in close-up and HRW shots by introducing a novel sparse architecture that can be integrated with common networks such as ConvNets and Transformers. It leverages alternative sparse learning to complementarily fuse coarse-grained and fine-grained features to (1) adaptively extract valuable information from (2) different object scales. We also propose a novel Cross-window Non-Maximum Suppression (C-NMS) algorithm to (3) improve the merging of boxes from different windows. Furthermore, we propose a (4) simple yet effective multi-scale training and inference strategy to improve accuracy. Experiments on two benchmarks with HRW shots, PANDA and DOTA-v1.0, demonstrate that our methods significantly improve accuracy (up to 5.8%) and speed (up to 3×) over state-of-the-art methods (SotAs), for both ConvNet- and Transformer-based detectors, on edge devices. Our code is open-sourced and available at https://github.com/liwenxi/SparseFormer.
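Challenge (3) in the abstract can be illustrated with a minimal Python sketch of the generic sliding-window baseline: tile the large image, map each window's boxes back to global coordinates, and merge them with ordinary greedy NMS. This is not the paper's C-NMS, whose details the abstract does not give. The function names, the box format, and the window size, stride, and IoU threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# (x1, y1, x2, y2, score) in pixels; a hypothetical box format for this sketch.
Box = Tuple[float, float, float, float, float]


def window_origins(length: int, window: int, stride: int) -> List[int]:
    """Start offsets of sliding windows that cover [0, length) along one axis."""
    if length <= window:
        return [0]
    origins = list(range(0, length - window + 1, stride))
    if origins[-1] + window < length:
        origins.append(length - window)  # last window sits flush with the border
    return origins


def tile_image(width: int, height: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Top-left (x, y) corners of all square windows over the image."""
    return [(x, y)
            for y in window_origins(height, window, stride)
            for x in window_origins(width, window, stride)]


def to_global(box: Box, origin: Tuple[int, int]) -> Box:
    """Shift a window-local box into full-image coordinates."""
    x1, y1, x2, y2, score = box
    ox, oy = origin
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy, score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], iou_threshold: float = 0.5) -> List[Box]:
    """Plain greedy NMS: keep a box unless it overlaps a higher-scoring kept box."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


if __name__ == "__main__":
    # An object spanning x in [370, 450] straddles the right edge (x = 400)
    # of the 400-pixel window at origin (0, 0), so that window sees only a fragment.
    fragment = to_global((370, 100, 400, 200, 0.6), (0, 0))
    # The overlapping window at origin (300, 0) sees the whole object.
    complete = to_global((70, 100, 150, 200, 0.9), (300, 0))
    print("IoU(fragment, complete) =", iou(fragment, complete))
    print("boxes kept by plain NMS =", len(nms([fragment, complete])))
```

In this toy case the truncated fragment overlaps the complete box with an IoU of only 0.375, below the assumed 0.5 threshold, so plain NMS keeps both boxes for a single object. That residual duplicate is the kind of cross-window merging error the abstract says C-NMS is designed to reduce.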

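Source journal

Computer Vision and Image Understanding (Engineering and Technology: Engineering, Electrical and Electronic)

CiteScore: 7.80
Self-citation rate: 4.40%
Articles published: 112
Review time: 79 days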

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems