To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Ke Su; Xingxing Zhang; Siyang Zhang; Jun Zhu; Bo Zhang

IEEE Transactions on Image Processing, vol. 33, pp. 5370-5381, published 2024-09-18.
DOI: 10.1109/TIP.2024.3459800
https://ieeexplore.ieee.org/document/10684038/
Recently, there has been increasing research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with the surrounding 3D environment. A new challenge therein is that many unseen objects may appear, owing to the large number of object categories in 3D scenes. This makes it necessary to develop models with strong zero-shot generalization to new objects. Existing work tries to achieve this goal by providing embodied agents with massive, high-quality human annotations closely related to the task to be learned, but this is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training, which can encode common sense as general prior knowledge. To further improve performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where task-specific knowledge is learned via iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% in answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes with a large number of new objects at test time), achieving new state-of-the-art performance.
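The abstract does not give implementation details, so the following is only a rough illustration of what a masked scene graph modeling objective with iterative message passing could look like in PyTorch. All module names, dimensions, the masking scheme, and the feature-reconstruction loss are assumptions made for illustration; they are not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): a minimal masked scene graph
# modeling (MSGM) objective, assuming per-object features from a pre-trained
# vision-language encoder, a row-normalized adjacency over scene objects, and a
# feature-reconstruction loss on masked nodes.
import torch
import torch.nn as nn


class MessagePassingLayer(nn.Module):
    """One round of message passing over a (soft) scene graph adjacency."""

    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)      # transform neighbor features into messages
        self.update = nn.GRUCell(dim, dim)  # fuse aggregated messages into node states

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, D) node features; adj: (N, N) row-normalized adjacency weights
        messages = adj @ self.msg(x)        # aggregate messages from neighbors
        return self.update(messages, x)     # update each node state


class MaskedSceneGraphModel(nn.Module):
    """Self-supervised MSGM: mask some object nodes, reconstruct them from context."""

    def __init__(self, dim: int = 256, steps: int = 3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.layers = nn.ModuleList([MessagePassingLayer(dim) for _ in range(steps)])
        self.decoder = nn.Linear(dim, dim)  # predict the original (masked) features

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor, mask: torch.Tensor):
        # nodes: (N, D) pre-trained object features; mask: (N,) bool, True = masked
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(nodes), nodes)
        for layer in self.layers:           # iterative message passing
            x = layer(x, adj)
        pred = self.decoder(x[mask])        # reconstruct only the masked nodes
        target = nodes[mask].detach()
        return nn.functional.mse_loss(pred, target)


# Toy usage: 8 objects with 256-d features and a uniform fully connected graph.
if __name__ == "__main__":
    n, d = 8, 256
    feats = torch.randn(n, d)
    adj = torch.full((n, n), 1.0 / n)
    mask = torch.zeros(n, dtype=torch.bool)
    mask[:3] = True                         # mask the first three object nodes
    loss = MaskedSceneGraphModel(dim=d)(feats, adj, mask)
    print(loss.item())
```

The design choice illustrated here is that reconstruction targets come from the frozen pre-trained features, so the self-supervised objective rectifies the representation toward the structure of the specific 3D scenes without any task labels.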