{"title":"有限数据下三维兴趣区域标注的多模态自我感知增强大语言模型","authors":"Lu Shi;Shichao Kan;Yi Jin;Linna Zhang;Yigang Cen","doi":"10.1109/TMM.2025.3557703","DOIUrl":null,"url":null,"abstract":"3D Region-of-Interest (RoI) Captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advancements in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detailed information for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure LLMs receive more precise and sufficient information, Self-RoI incorporates Implicit Textual Info. Perception to construct a multi-modal vision-language information. This module utilizes a simple mapping network to generate textual information about basic properties of RoI from vision-following response of LLMs. This textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for LLMs. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and employing language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, the Implicit Textual Info. Perception can be integrated into other multi-modal LLMs for performance enhancement. We will make our code available for further research.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2935-2948"},"PeriodicalIF":8.4000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Modal Self-Perception Enhanced Large Language Model for 3D Region-of-Interest Captioning With Limited Data\",\"authors\":\"Lu Shi;Shichao Kan;Yi Jin;Linna Zhang;Yigang Cen\",\"doi\":\"10.1109/TMM.2025.3557703\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"3D Region-of-Interest (RoI) Captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advancements in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detailed information for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure LLMs receive more precise and sufficient information, Self-RoI incorporates Implicit Textual Info. Perception to construct a multi-modal vision-language information. This module utilizes a simple mapping network to generate textual information about basic properties of RoI from vision-following response of LLMs. 
This textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for LLMs. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and employing language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, the Implicit Textual Info. Perception can be integrated into other multi-modal LLMs for performance enhancement. We will make our code available for further research.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"2935-2948\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10948342/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948342/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Multi-Modal Self-Perception Enhanced Large Language Model for 3D Region-of-Interest Captioning With Limited Data
3D Region-of-Interest (RoI) Captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advances in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detail for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure that LLMs receive more precise and sufficient information, Self-RoI incorporates an Implicit Textual Info. Perception module to construct multi-modal vision-language information. This module uses a simple mapping network to generate textual information about the basic properties of the RoI from the LLM's vision-following response. The textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for the LLM. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and a language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, the Implicit Textual Info. Perception module can be integrated into other multi-modal LLMs to enhance their performance. We will make our code available for further research.
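The abstract describes the pipeline only at a high level: a simple mapping network turns the LLM's vision-following response into implicit textual information, which is fused with the RoI's visual representation into a multi-modal instruction, and the second training stage combines a contrastive term with a language modeling loss. The PyTorch sketch below illustrates one plausible reading of that description; the class and function names (ImplicitTextualPerception, build_multimodal_instruction, stage2_loss), the dimensions, the attention-pooling scheme, and the loss weighting are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of the Self-RoI instruction construction described in the
# abstract. Everything below is an assumed design, not the authors' code.
import torch
import torch.nn as nn


class ImplicitTextualPerception(nn.Module):
    """Maps hidden states of the LLM's vision-following response to a fixed
    number of pseudo-text tokens describing basic RoI properties (assumed)."""

    def __init__(self, llm_dim: int = 4096, num_text_tokens: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_text_tokens, llm_dim))
        self.mapper = nn.Sequential(              # the "simple mapping network"
            nn.Linear(llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, response_hidden: torch.Tensor) -> torch.Tensor:
        # response_hidden: (B, T, llm_dim) hidden states of the LLM's
        # vision-following response; pool them into the learned query slots.
        attn = torch.softmax(
            torch.einsum("qd,btd->bqt", self.queries, response_hidden), dim=-1
        )
        pooled = torch.einsum("bqt,btd->bqd", attn, response_hidden)
        return self.mapper(pooled)                # (B, num_text_tokens, llm_dim)


def build_multimodal_instruction(
    roi_visual_tokens: torch.Tensor,    # (B, V, llm_dim) projected 3D RoI features
    implicit_text_tokens: torch.Tensor  # (B, Q, llm_dim) output of the module above
) -> torch.Tensor:
    """Concatenate visual and implicit textual tokens into one instruction
    sequence prepended to the caption prompt fed to the LLM (assumed fusion)."""
    return torch.cat([roi_visual_tokens, implicit_text_tokens], dim=1)


def stage2_loss(lm_loss: torch.Tensor, contrastive_loss: torch.Tensor,
                weight: float = 1.0) -> torch.Tensor:
    # Stage-2 objective (assumed weighting): language-modeling loss for caption
    # accuracy plus the contrastive term regularizing the implicit textual tokens.
    return lm_loss + weight * contrastive_loss


if __name__ == "__main__":
    B, T, V, D = 2, 32, 16, 4096
    perception = ImplicitTextualPerception(llm_dim=D)
    text_tokens = perception(torch.randn(B, T, D))
    instruction = build_multimodal_instruction(torch.randn(B, V, D), text_tokens)
    print(instruction.shape)  # torch.Size([2, 24, 4096])
```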
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.