Muxin Liao, Jiayang Wang, Hong Deng, Yingqiong Peng, Hua Yin, Yinglong Wang, Guoguang Hua
{"title":"语义分割视觉基础模型中的领域广义令牌链接","authors":"Muxin Liao , Jiayang Wang , Hong Deng , Yingqiong Peng , Hua Yin , Yinglong Wang , Guoguang Hua","doi":"10.1016/j.knosys.2025.114497","DOIUrl":null,"url":null,"abstract":"<div><div>[S U M M A R Y] Vision Foundation Models (VFMs) achieve remarkable performance compared with traditional methods based on convolutional neural networks and vision transformer networks in Domain-Generalized Semantic Segmentation (DGSS). These VFM-based DGSS methods focus on adopting efficient parameter fine-tuning strategies that use a set of learnable tokens to fine-tune VFMs to the downstream DGSS task, yet struggle to mine domain-invariant information from VFMs since the backbone of VFMs is frozen during the fine-tuning stage. To address this issue, a Domain-Generalized Token Linking (DGTL) approach is proposed to mine domain-invariant information from VFMs for improving the performance in unseen target domains, which contains a Text-guided Dual Token Linking (TDTL) module and a Text-guided Distribution Normalization (TDN) strategy. For the TDTL module, first, a set of learnable tokens is linked to the text embeddings for building the relations between the learnable tokens and text embeddings, which is beneficial for learning domain-invariant tokens since the text embeddings generated from the CLIP model are domain-invariant. Second, the feature-level and mask-level linking strategies are proposed to link the learned domain-invariant tokens to the features and masks to guide the mining of domain-invariant information from the VFM. For the TDN strategy, the pairwise similarity between the predictive masks associated with the learnable tokens and the text embeddings is utilized to explicitly align the semantic distribution of visual features in the learnable tokens with the text embeddings. Extensive experiments demonstrate that the DGTL approach achieves superior performance to recent methods across multiple DGSS benchmarks. 
The code is released on GitHub:<span><span>https://github.com/seabearlmx/DGTL</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"330 ","pages":"Article 114497"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Domain-generalized token linking in vision foundation models for semantic segmentation\",\"authors\":\"Muxin Liao , Jiayang Wang , Hong Deng , Yingqiong Peng , Hua Yin , Yinglong Wang , Guoguang Hua\",\"doi\":\"10.1016/j.knosys.2025.114497\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>[S U M M A R Y] Vision Foundation Models (VFMs) achieve remarkable performance compared with traditional methods based on convolutional neural networks and vision transformer networks in Domain-Generalized Semantic Segmentation (DGSS). These VFM-based DGSS methods focus on adopting efficient parameter fine-tuning strategies that use a set of learnable tokens to fine-tune VFMs to the downstream DGSS task, yet struggle to mine domain-invariant information from VFMs since the backbone of VFMs is frozen during the fine-tuning stage. To address this issue, a Domain-Generalized Token Linking (DGTL) approach is proposed to mine domain-invariant information from VFMs for improving the performance in unseen target domains, which contains a Text-guided Dual Token Linking (TDTL) module and a Text-guided Distribution Normalization (TDN) strategy. For the TDTL module, first, a set of learnable tokens is linked to the text embeddings for building the relations between the learnable tokens and text embeddings, which is beneficial for learning domain-invariant tokens since the text embeddings generated from the CLIP model are domain-invariant. 
Second, the feature-level and mask-level linking strategies are proposed to link the learned domain-invariant tokens to the features and masks to guide the mining of domain-invariant information from the VFM. For the TDN strategy, the pairwise similarity between the predictive masks associated with the learnable tokens and the text embeddings is utilized to explicitly align the semantic distribution of visual features in the learnable tokens with the text embeddings. Extensive experiments demonstrate that the DGTL approach achieves superior performance to recent methods across multiple DGSS benchmarks. The code is released on GitHub:<span><span>https://github.com/seabearlmx/DGTL</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"330 \",\"pages\":\"Article 114497\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-09-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705125015369\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125015369","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Domain-generalized token linking in vision foundation models for semantic segmentation
Citations: 0
Abstract
Vision Foundation Models (VFMs) achieve remarkable performance in Domain-Generalized Semantic Segmentation (DGSS) compared with traditional methods based on convolutional neural networks and vision transformers. Existing VFM-based DGSS methods adopt parameter-efficient fine-tuning strategies that use a set of learnable tokens to adapt VFMs to the downstream DGSS task, yet they struggle to mine domain-invariant information from VFMs because the VFM backbone is frozen during fine-tuning. To address this issue, a Domain-Generalized Token Linking (DGTL) approach is proposed to mine domain-invariant information from VFMs and improve performance in unseen target domains. It comprises a Text-guided Dual Token Linking (TDTL) module and a Text-guided Distribution Normalization (TDN) strategy. In the TDTL module, a set of learnable tokens is first linked to the text embeddings to build relations between them, which benefits the learning of domain-invariant tokens since the text embeddings generated by the CLIP model are domain-invariant. Second, feature-level and mask-level linking strategies link the learned domain-invariant tokens to the features and masks, guiding the mining of domain-invariant information from the VFM. The TDN strategy uses the pairwise similarity between the predictive masks associated with the learnable tokens and the text embeddings to explicitly align the semantic distribution of visual features in the learnable tokens with the text embeddings. Extensive experiments demonstrate that DGTL outperforms recent methods across multiple DGSS benchmarks. The code is released on GitHub: https://github.com/seabearlmx/DGTL.
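The token-to-text linking idea described in the abstract can be illustrated as a similarity-based alignment between learnable tokens and fixed class text embeddings. The following is a minimal NumPy sketch for exposition only, not the authors' implementation: the function names, dimensions, and the temperature-scaled cross-entropy formulation are assumptions, and the paper's TDN strategy operates on predictive masks rather than on raw token features as done here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def text_guided_alignment_loss(token_features, text_embeddings, class_ids, tau=0.07):
    """Cross-entropy over token-to-text cosine similarities: each learnable
    token is pulled toward the (frozen, domain-invariant) text embedding of
    its class and pushed away from the other classes' embeddings."""
    logits = cosine_similarity(token_features, text_embeddings) / tau
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(class_ids)), class_ids].mean()

# Toy example: 4 learnable tokens, 3 classes, 16-dim embedding space.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16))   # stand-in for learnable tokens
texts = rng.standard_normal((3, 16))    # stand-in for CLIP text embeddings
labels = np.array([0, 1, 2, 0])         # class assignment per token
loss = text_guided_alignment_loss(tokens, texts, labels)
```

Minimizing such a loss with the text embeddings held fixed is one plausible way to make the tokens inherit the domain-invariance of the text side; the actual feature-level and mask-level linking in the paper is richer than this single objective.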
Journal description:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial-intelligence techniques. The journal aims to support human prediction and decision-making through data science and computational techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.