An effective CNN and Transformer fusion network for camouflaged object detection

IF 3.5 · Zone 3 (Computer Science) · Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Computer Vision and Image Understanding · Pub Date: 2025-06-21 · DOI: 10.1016/j.cviu.2025.104431

Dongdong Zhang, Chunping Wang, Huiying Wang, Qiang Fu, Zhaorui Li

Volume 259, Article 104431 · Full text: https://www.sciencedirect.com/science/article/pii/S1077314225001547

Citations: 0

Abstract

Camouflaged object detection (COD) aims to identify concealed objects in images. Both global context and local spatial details are crucial for this task. Convolutional neural networks (CNNs) excel at capturing fine-grained local features, while Transformers are adept at modeling global contextual information. To leverage their respective strengths, we propose a novel CNN-Transformer fusion network (CTF-Net) for more accurate COD. Our approach employs parallel CNN and Transformer branches as an encoder to extract complementary features. We then propose a cross-domain fusion module (CDFM) that fuses these features through cross-modulation. Additionally, we develop a boundary-aware module (BAM) that combines low-level edge details with high-level global context to extract the edge features of camouflaged objects. Furthermore, we design a feature enhancement module (FEM) to mitigate background and noise interference during cross-layer feature fusion, thereby highlighting camouflaged object regions for precise predictions. Extensive experiments show that CTF-Net outperforms 16 existing state-of-the-art methods on four widely used COD datasets. In particular, compared with all of these models, CTF-Net significantly improves performance by about 5.1% (F-measure) on the NC4K dataset, showing that CTF-Net can accurately detect camouflaged objects. Our code is publicly available at https://github.com/zcc0616/CTF-Net.
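The abstract reports its headline gain as an F-measure, so a small illustration of how such a score is usually computed may help. The sketch below is not taken from the CTF-Net repository: the function name `f_measure`, the fixed binarization threshold of 0.5, and the default `beta_sq = 0.3` (a value often used in saliency-style evaluation) are all assumptions, and the paper may well use a weighted or adaptive-threshold variant instead.

```python
from typing import Sequence


def f_measure(pred: Sequence[float], gt: Sequence[int],
              threshold: float = 0.5, beta_sq: float = 0.3) -> float:
    """F_beta score between a flattened prediction map and a binary mask.

    Hypothetical helper for illustration only; `threshold` and `beta_sq`
    are assumed defaults, not values confirmed by the paper.
    """
    if len(pred) != len(gt):
        raise ValueError("prediction and ground truth must have the same length")

    # Binarize the predicted probabilities.
    binary = [1 if p >= threshold else 0 for p in pred]

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for b, g in zip(binary, gt) if b == 1 and g == 1)
    fp = sum(1 for b, g in zip(binary, gt) if b == 1 and g == 0)
    fn = sum(1 for b, g in zip(binary, gt) if b == 0 and g == 1)

    # With no true positives, precision and recall are zero or undefined.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Four pixels: two object pixels found, one background pixel misfired.
    print(f_measure([0.9, 0.8, 0.2, 0.6], [1, 1, 0, 0]))
```

With `beta_sq` below 1 the score weights precision more heavily than recall, which is why a single false positive in the toy example lowers the result noticeably even though every object pixel is found.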


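Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80
Self-citation rate: 4.40%
Articles published: 112
Review time: 79 days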

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems