{"title":"有限数据下三维兴趣区域标注的多模态自我感知增强大语言模型","authors":"Lu Shi;Shichao Kan;Yi Jin;Linna Zhang;Yigang Cen","doi":"10.1109/TMM.2025.3557703","DOIUrl":null,"url":null,"abstract":"3D Region-of-Interest (RoI) Captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advancements in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detailed information for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure LLMs receive more precise and sufficient information, Self-RoI incorporates Implicit Textual Info. Perception to construct a multi-modal vision-language information. This module utilizes a simple mapping network to generate textual information about basic properties of RoI from vision-following response of LLMs. This textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for LLMs. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and employing language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, the Implicit Textual Info. Perception can be integrated into other multi-modal LLMs for performance enhancement. We will make our code available for further research.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2935-2948"},"PeriodicalIF":8.4000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Modal Self-Perception Enhanced Large Language Model for 3D Region-of-Interest Captioning With Limited Data\",\"authors\":\"Lu Shi;Shichao Kan;Yi Jin;Linna Zhang;Yigang Cen\",\"doi\":\"10.1109/TMM.2025.3557703\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"3D Region-of-Interest (RoI) Captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advancements in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detailed information for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure LLMs receive more precise and sufficient information, Self-RoI incorporates Implicit Textual Info. Perception to construct a multi-modal vision-language information. This module utilizes a simple mapping network to generate textual information about basic properties of RoI from vision-following response of LLMs. 
This textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for LLMs. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and employing language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, the Implicit Textual Info. Perception can be integrated into other multi-modal LLMs for performance enhancement. We will make our code available for further research.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"2935-2948\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10948342/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948342/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Multi-Modal Self-Perception Enhanced Large Language Model for 3D Region-of-Interest Captioning With Limited Data
3D Region-of-Interest (RoI) Captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advances in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detail for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure that LLMs receive more precise and sufficient information, Self-RoI incorporates an Implicit Textual Info. Perception module to construct multi-modal vision-language information. This module uses a simple mapping network to generate textual information about the basic properties of the RoI from the LLM's vision-following response. The textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for the LLM. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and a language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, the Implicit Textual Info. Perception module can be integrated into other multi-modal LLMs to enhance their performance. We will make our code available for further research.
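The abstract describes the pipeline only at a high level: a simple mapping network turns the LLM's vision-following response into implicit textual information, which is fused with the RoI's visual representation into a multi-modal instruction, and the second training stage combines a contrastive term with a language modeling loss. The PyTorch sketch below illustrates one plausible reading of that description; the class and function names (ImplicitTextualPerception, build_multimodal_instruction, stage2_loss), the dimensions, the attention-pooling scheme, and the loss weighting are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of the Self-RoI instruction construction described in the
# abstract. Everything below is an assumed design, not the authors' code.
import torch
import torch.nn as nn


class ImplicitTextualPerception(nn.Module):
    """Maps hidden states of the LLM's vision-following response to a fixed
    number of pseudo-text tokens describing basic RoI properties (assumed)."""

    def __init__(self, llm_dim: int = 4096, num_text_tokens: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_text_tokens, llm_dim))
        self.mapper = nn.Sequential(              # the "simple mapping network"
            nn.Linear(llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, response_hidden: torch.Tensor) -> torch.Tensor:
        # response_hidden: (B, T, llm_dim) hidden states of the LLM's
        # vision-following response; pool them into the learned query slots.
        attn = torch.softmax(
            torch.einsum("qd,btd->bqt", self.queries, response_hidden), dim=-1
        )
        pooled = torch.einsum("bqt,btd->bqd", attn, response_hidden)
        return self.mapper(pooled)                # (B, num_text_tokens, llm_dim)


def build_multimodal_instruction(
    roi_visual_tokens: torch.Tensor,    # (B, V, llm_dim) projected 3D RoI features
    implicit_text_tokens: torch.Tensor  # (B, Q, llm_dim) output of the module above
) -> torch.Tensor:
    """Concatenate visual and implicit textual tokens into one instruction
    sequence prepended to the caption prompt fed to the LLM (assumed fusion)."""
    return torch.cat([roi_visual_tokens, implicit_text_tokens], dim=1)


def stage2_loss(lm_loss: torch.Tensor, contrastive_loss: torch.Tensor,
                weight: float = 1.0) -> torch.Tensor:
    # Stage-2 objective (assumed weighting): language-modeling loss for caption
    # accuracy plus the contrastive term regularizing the implicit textual tokens.
    return lm_loss + weight * contrastive_loss


if __name__ == "__main__":
    B, T, V, D = 2, 32, 16, 4096
    perception = ImplicitTextualPerception(llm_dim=D)
    text_tokens = perception(torch.randn(B, T, D))
    instruction = build_multimodal_instruction(torch.randn(B, V, D), text_tokens)
    print(instruction.shape)  # torch.Size([2, 24, 4096])
```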
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.