SegLD: Achieving universal, zero-shot and open-vocabulary segmentation through multimodal fusion via latent diffusion processes

IF 14.7 · Zone 1, Computer Science · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Information Fusion · Pub Date: 2024-06-05 · DOI: 10.1016/j.inffus.2024.102509

Hongtao Zheng, Yifei Ding, Zilong Wang, Xinyan Huang

Article link: https://www.sciencedirect.com/science/article/pii/S1566253524002872

Citations: 0

Abstract

Open-vocabulary learning can identify categories marked during training (seen categories) and generalize to categories not annotated in the training set (unseen categories). It could theoretically extend segmentation systems to more universal applications. However, current open-vocabulary segmentation frameworks are primarily suited for specific tasks or require retraining according to the task, and they significantly underperform in inferring seen categories compared to fully supervised frameworks. Therefore, we introduce a universal open-vocabulary segmentation framework based on the latent diffusion process (SegLD), which requires only a single training session on a panoptic dataset to achieve inference across all open-vocabulary segmentation tasks, and reaches SOTA segmentation performance for both seen and unseen categories in every task. Specifically, SegLD comprises two stages: in the first stage, we deploy two parallel latent diffusion processes to deeply fuse the text (image caption or category labels) and image information, further aggregating the multi-scale features output from both latent diffusion processes on a scale basis. In the second stage, we introduce text queries, text list queries, and task queries, facilitating the learning of inter-category and inter-task differences through the computation of contrastive losses between them. Text queries are then further fed into a Transformer Decoder to obtain category-agnostic segmentation masks. Then we establish classification loss functions for the type of text input during training, whether image captions or category labels, to help assign a category label from the open vocabulary to each predicted binary mask. 
Experimental results show that, with just a single training session, SegLD significantly outperforms other contemporary SOTA fully supervised segmentation frameworks and open-vocabulary segmentation frameworks across almost all evaluation metrics for both known and unknown categories on the ADE20K, Cityscapes, and COCO datasets. This highlights SegLD’s capability as a universal segmentation framework, with the potential to replace other segmentation frameworks and adapt to various segmentation domains. The project link for SegLD is https://zht-segld.github.io/.
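
The final step described in the abstract, assigning a category label from the open vocabulary to each predicted binary mask, can be illustrated with a minimal sketch. This is not SegLD's implementation: it assumes each mask and each candidate category name has already been embedded into a shared vector space, and it uses plain cosine similarity followed by a temperature-scaled softmax, a common choice in open-vocabulary segmentation. The function names, the toy embeddings, and the temperature of 0.07 are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=0.07):
    """Temperature-scaled softmax (temperature value is an assumption)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def assign_labels(mask_embeddings, text_embeddings, vocabulary):
    """Pick, for each mask embedding, the most similar category name."""
    results = []
    for emb in mask_embeddings:
        sims = [cosine(emb, t) for t in text_embeddings]
        probs = softmax(sims)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((vocabulary[best], probs[best]))
    return results

# Toy data: three category names and two predicted masks (hypothetical).
vocabulary = ["road", "car", "tree"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]

labels = assign_labels(mask_embeddings, text_embeddings, vocabulary)
for name, prob in labels:
    print(name, round(prob, 3))
```

Because the vocabulary is just a list of names with embeddings, categories unseen during training can be added at inference time by appending to that list, which is the property the abstract refers to as open-vocabulary.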

Source journal

Information Fusion (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 33.20

Self-citation rate: 4.30%

Articles published: 161

Review time: 7.9 months

Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.