Transformer-CNN for small image object detection

IF 3.4 · Region 3, Engineering & Technology · Q2, ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication Pub Date : 2024-08-21 DOI:10.1016/j.image.2024.117194

Yan-Lin Chen , Chun-Liang Lin , Yu-Chen Lin , Tzu-Chun Chen

Volume 129, Article 117194 · Journal Article · https://www.sciencedirect.com/science/article/pii/S092359652400095X

Citations: 0

Abstract

Object recognition has been an active research area in computer vision in recent years. Although detection of regular-sized objects has achieved impressive results, small object detection (SOD) remains challenging. On the Microsoft Common Objects in Context (MS COCO) public dataset, the detection rate for small objects is typically half that for regular-sized objects. The main reason is that successive convolution and pooling layers strip small objects of the detail needed to distinguish them from the background or from similar objects, resulting in poor recognition rates or missed detections. This paper presents a network architecture, Transformer-CNN, that combines a self-attention-based transformer with a convolutional neural network (CNN) to improve the recognition rate of SOD. It captures global information through the transformer and uses the translation invariance and translation equivariance of the CNN to retain as much global and local feature information as possible while improving the reliability and robustness of SOD. Our experiments show that the proposed model improves the small object recognition rate by 2–5% over general transformer architectures.
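The two CNN properties the abstract relies on can be illustrated with a toy example. The sketch below is not the paper's model: it is a minimal 1D demonstration in plain Python, with a hypothetical signal, filter, and helper names chosen only for illustration. It checks that shifting the input shifts the convolution output by the same amount (usually called translation equivariance), and that a global max pool over the features returns the same value wherever the object sits (translation invariance). Circular padding is assumed so that the equality is exact at the borders.

```python
def circular_conv1d(signal, kernel):
    """Cross-correlate signal with kernel using wrap-around (circular) padding."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Circularly shift signal k positions to the right."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def global_max_pool(features):
    """Collapse a feature map to its single largest response."""
    return max(features)


# Hypothetical data: a small "object" two samples wide on an empty background.
signal = [0, 0, 3, 1, 0, 0, 0, 0]
kernel = [1, -1, 2]  # arbitrary illustrative filter, not from the paper
k = 3                # shift amount

features = circular_conv1d(signal, kernel)
conv_then_shift = shift(features, k)
shift_then_conv = circular_conv1d(shift(signal, k), kernel)

# Equivariance: convolving a shifted input equals shifting the convolved output.
equivariant = conv_then_shift == shift_then_conv

# Invariance: the pooled response does not depend on where the object is.
invariant = global_max_pool(features) == global_max_pool(shift_then_conv)

print(equivariant, invariant)
```

With zero padding instead of circular padding, the equivariance holds only away from the borders, which is one reason the property is approximate in practical CNNs.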

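Source journal

Signal Processing-Image Communication (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.40

Self-citation rate: 2.90%

Articles published: 138

Review time: 5.2 months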

About the journal: Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following: To present a forum for the advancement of theory and practice of image communication. To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems. To contribute to a rapid information exchange between the industrial and academic environments.

The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.

Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments. Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.