DGTM: Deriving Graph from transformer with Mamba for panoptic scene graph generation

IF 2.3 Q2 COMPUTER SCIENCE, THEORY & METHODS

Array Pub Date : 2025-04-09 DOI:10.1016/j.array.2025.100394

Youxuan Sun, Yunliang Chen, Xiaohui Huang, Yuewei Wang, Shaoqian Chen, Kangfei Yao, Ao Yang

Array, Volume 26, Article 100394

https://www.sciencedirect.com/science/article/pii/S2590005625000219

Citations: 0

Abstract

DGTM: Deriving Graph from transformer with Mamba for panoptic scene graph generation

Scene Graph Generation (SGG) transforms an image into a structured graph representation that captures the objects in the image, their attributes, and the relationships among them. Such graph representations support visual content understanding and reasoning in tasks such as image captioning, question answering, and human-computer interaction (HCI). Panoptic Scene Graph Generation (PSG) extends the object detection step of SGG with panoptic segmentation, which places greater demands on the model’s capacity to comprehend images. Existing approaches often rely on intricate modeling techniques to predict relationships between objects, while neglecting the inherent connections among object queries that are learned through multi-head self-attention in object detectors. This oversight not only leads to a significant increase in parameter count but also complicates model design and hinders transferability. This paper proposes a new single-stage panoptic scene graph generator called DGTM (Deriving Graph from transformer with Mamba). DGTM uses the by-products of the multi-head self-attention layers in transformers, treating queries and keys as subjects and objects, respectively, to extract relationship information between objects. A Mamba module efficiently integrates multi-level and multi-scale feature information, helping the model grasp intricate relationships. In addition, a Kolmogorov–Arnold Network (KAN) is incorporated to help the model better distinguish between subjects and objects, enriching the feature representation. Experimental results show that DGTM improves mR@20, mR@50, and mR@100 by at least 25%, 15%, and 15%, respectively, over the baseline, demonstrating notable gains in the precision and comprehensiveness of PSG.
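The idea of reading self-attention between object queries as a subject-to-object relation graph can be illustrated with a small sketch. This is not the DGTM implementation: it assumes a single attention head, omits the Mamba fusion and the KAN entirely, and the function names and 2-D embeddings are hypothetical, chosen only for illustration. Each query row is treated as a subject, each key column as an object, and the softmax-normalized scaled dot product as a relation strength.

```python
import math


def attention_relation_scores(queries, keys):
    """Scaled dot-product attention weights.

    Row i is read as subject i, column j as object j (an illustrative
    reading of attention weights, not the paper's exact formulation).
    """
    dim = len(queries[0])
    scores = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        scores.append([e / total for e in exps])
    return scores


def top_pairs(scores, k):
    """Return the k highest-scoring (subject, object) index pairs, skipping self-pairs."""
    pairs = [
        (s, i, j)
        for i, row in enumerate(scores)
        for j, s in enumerate(row)
        if i != j
    ]
    pairs.sort(key=lambda t: -t[0])  # stable sort: ties keep row-major order
    return [(i, j) for _, i, j in pairs[:k]]


# Hypothetical 2-D embeddings for three object queries (illustrative values only).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = attention_relation_scores(queries, keys)
print(top_pairs(scores, 2))
```

Because the score matrix is indexed as (subject, object), it is directional: the pair (i, j) and the pair (j, i) generally receive different strengths, which is what allows a relation such as "person riding horse" to be distinguished from its reverse.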


Source journal

Array Computer Science-General Computer Science

CiteScore

4.40

Self-citation rate

0.00%

Number of articles published

Review time

45 days