RelFormer: Advancing contextual relations for transformer-based dense captioning

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding Pub Date : 2025-02-01 DOI:10.1016/j.cviu.2025.104300

Weiqi Jin , Mengxue Qu , Caijuan Shi , Yao Zhao , Yunchao Wei

Volume 252, Article 104300. Publisher page: https://www.sciencedirect.com/science/article/pii/S1077314225000232

Citations: 0

Abstract

Dense captioning aims to detect regions in images and generate a natural language description for each identified region. For this task, contextual modeling is crucial for generating accurate descriptions, since regions in an image can interact with each other. Previous efforts primarily focused on modeling relations between categorized object regions, which are extracted by pre-trained object detectors, e.g., Fast R-CNN. However, they overlook contextual modeling for non-object regions, e.g., sky, rivers, and grass, commonly referred to as “stuff”. In this paper, we propose the RelFormer framework to enhance the contextual relation modeling of Transformer-based dense captioning. Specifically, we design a clip-assisted region feature extraction module to extract rich contextual features of regions, including stuff regions. We then introduce a straightforward relation encoder based on self-attention to effectively model relationships between regional features. To accurately extract candidate regions in dense images while minimizing redundant proposals, we further introduce amplified decay non-maximum suppression, which amplifies the decay applied to redundant proposals so that they can be removed while preserving the detection of small regions under a low confidence threshold. The experimental results indicate that, by enhancing contextual interactions, our model exhibits a good understanding of regions and attains state-of-the-art performance on dense captioning tasks. Our method achieves 17.52% mAP on VG V1.0, 16.59% on VG V1.2, and 15.49% on VG-COCO. Code is available at https://github.com/Wykay/Relformer.
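
The abstract does not give the formula for amplified decay non-maximum suppression, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the standard Gaussian Soft-NMS decay and adds a hypothetical `amplify` multiplier that strengthens the decay applied to overlapping proposals; the names `decay_nms`, `sigma`, `amplify`, and `score_thresh` are all assumptions made for this example.

```python
import math


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def decay_nms(boxes, scores, sigma=0.5, amplify=1.0, score_thresh=0.05):
    """Gaussian Soft-NMS with a hypothetical amplification factor.

    amplify=1.0 is plain Gaussian Soft-NMS; larger values (an assumption,
    not the paper's formula) decay overlapping proposals more strongly.
    Returns the indices of kept boxes in selection order.
    """
    scores = list(scores)  # copy so the caller's scores are not modified
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < score_thresh:
            break  # everything left has decayed below the threshold
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            # Non-overlapping boxes (overlap == 0) keep their score unchanged.
            scores[i] *= math.exp(-amplify * overlap ** 2 / sigma)
    return keep


# Two heavily overlapping proposals plus one small, low-confidence region.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 55, 55)]
scores = [0.9, 0.8, 0.1]
print(decay_nms(boxes, scores, amplify=1.0))  # redundant box 1 survives
print(decay_nms(boxes, scores, amplify=4.0))  # redundant box 1 is removed
```

In this toy input, the stronger decay drops the near-duplicate proposal while the small isolated region is kept even though its score is close to the low threshold, which is the behavior the abstract describes qualitatively.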




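Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days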

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems