Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-Labeling

Impact Factor: 8.4; Zone 1 (Computer Science); JCR Q1 (Computer Science, Information Systems)

IEEE Transactions on Multimedia, vol. 26, pp. 11164-11175. Pub Date: 2024-08-16. DOI: 10.1109/TMM.2024.3443670. URL: https://ieeexplore.ieee.org/document/10638255/

Xu Wang; Yifan Li; Qiudan Zhang; Wenhui Wu; Mark Junjie Li; Lin Ma; Jianmin Jiang


Citations: 0

Abstract


Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-Labeling

Learning to build 3D scene graphs is essential for real-world perception in a structured and rich fashion. However, previous 3D scene graph generation methods utilize a fully supervised learning manner and require a large amount of entity-level annotation data of objects and relations, which is extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby achieving alignment of 3D point clouds and 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. The pseudo labels for objects and relations are then produced for 3D-VLAP model training by calculating the similarity between visual embeddings and textual category embeddings of objects and relations encoded by the visual-linguistic model, respectively. Ultimately, we design an edge self-attention based graph neural network to generate scene graphs of 3D point clouds. Experiments demonstrate that our 3D-VLAP achieves comparable results with current fully supervised methods, meanwhile alleviating the data annotation pressure.
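The two pseudo-labeling steps described in the abstract (3D-to-2D alignment through camera parameters, then label assignment by embedding similarity) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a standard pinhole camera with the world-to-camera convention x_cam = R x_world + t, and it assumes that the pseudo label is the category with the highest cosine similarity; the abstract only says that similarity is calculated. The function names `project_point` and `pseudo_label` are hypothetical, and the embeddings are toy vectors standing in for the outputs of a visual-linguistic encoder.

```python
import math
from typing import Optional, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def project_point(p_world: Vector, K: Matrix, R: Matrix, t: Vector) -> Optional[Tuple[float, float]]:
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    Assumed convention (not stated in the abstract): x_cam = R @ p_world + t,
    followed by [u*w, v*w, w] = K @ x_cam.
    Returns None when the point is at or behind the camera plane.
    """
    # Extrinsics: world frame -> camera frame.
    x_cam = [sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)]
    if x_cam[2] <= 0.0:
        return None
    # Intrinsics: camera frame -> homogeneous pixel coordinates.
    uvw = [sum(K[i][j] * x_cam[j] for j in range(3)) for i in range(3)]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pseudo_label(visual_emb: Vector, text_embs: Sequence[Vector], labels: Sequence[str]) -> Tuple[str, float]:
    """Pick the category whose text embedding is most similar to visual_emb.

    Assumption: argmax over cosine similarity. Returns (label, score).
    """
    if not labels or len(labels) != len(text_embs):
        raise ValueError("labels and text_embs must be non-empty and the same length")
    scores = [cosine_similarity(visual_emb, emb) for emb in text_embs]
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best], scores[best]


if __name__ == "__main__":
    # Toy camera: focal length 500 px, principal point (320, 240), identity pose.
    K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    t = [0.0, 0.0, 0.0]
    print(project_point([0.5, -0.25, 2.0], K, R, t))

    # Toy embeddings; in practice these would come from the image and text encoders.
    categories = ["chair", "table", "lamp"]
    text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(pseudo_label([0.9, 0.1, 0.0], text_embs, categories))
```

In a pipeline of this kind, the projected pixels of a 3D instance would select an image crop, the crop would be encoded into a visual embedding, and `pseudo_label` would then be applied once with object category texts and once with relation category texts.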


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.