VisionHub: Learning Task-Plugins for Efficient Universal Vision Model
Haolin Wang, Yixuan Zhu, Wenliang Zhao, Jie Zhou, Jiwen Lu
IEEE Transactions on Image Processing, published 2025-09-25. DOI: 10.1109/tip.2025.3611645
{"title":"VisionHub:高效通用视觉模型的学习任务插件。","authors":"Haolin Wang,Yixuan Zhu,Wenliang Zhao,Jie Zhou,Jiwen Lu","doi":"10.1109/tip.2025.3611645","DOIUrl":null,"url":null,"abstract":"Building on the success of universal language models in natural language processing (NLP), researchers have recently sought to develop methods capable of tackling a broad spectrum of visual tasks within a unified foundation framework. However, existing universal vision models face significant challenges when adapting to the rapidly expanding scope of downstream tasks. These challenges stem not only from the prohibitive computational and storage expenses associated with training such models but also from the complexity of their workflows, which makes efficient adaptations difficult. Moreover, these models often fail to deliver the required performance and versatility for a broad spectrum of applications, largely due to their incomplete visual generation and perception capabilities, limiting their generalizability and effectiveness in diverse settings. In this paper, we present VisionHub, a novel universal vision model designed to concurrently manage multiple visual restoration and perception tasks, while offering streamlined transferability to downstream tasks. Our model leverages the frozen denoising U-Net architecture from Stable Diffusion as the backbone, fully exploiting its inherent potential for both visual restoration and perception. To further enhance the model's flexibility, we propose the incorporation of lightweight task-plugins and the task router, which are seamlessly integrated onto the U-Net backbone. This architecture enables VisionHub to efficiently handle various vision tasks according to user-provided natural language instructions, all while maintaining minimal storage costs and operational overhead. Extensive experiments across 11 different vision tasks showcase both the efficiency and effectiveness of our approach. Remarkably, VisionHub achieves competitive performance across a variety of benchmarks, including 53.3% mIoU on ADE20K semantic segmentation, 0.253 RMSE on NYUv2 depth estimation, and 74.2 AP on MS-COCO pose estimation.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"91 1","pages":""},"PeriodicalIF":13.7000,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VisionHub: Learning Task-Plugins for Efficient Universal Vision Model.\",\"authors\":\"Haolin Wang,Yixuan Zhu,Wenliang Zhao,Jie Zhou,Jiwen Lu\",\"doi\":\"10.1109/tip.2025.3611645\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Building on the success of universal language models in natural language processing (NLP), researchers have recently sought to develop methods capable of tackling a broad spectrum of visual tasks within a unified foundation framework. However, existing universal vision models face significant challenges when adapting to the rapidly expanding scope of downstream tasks. These challenges stem not only from the prohibitive computational and storage expenses associated with training such models but also from the complexity of their workflows, which makes efficient adaptations difficult. Moreover, these models often fail to deliver the required performance and versatility for a broad spectrum of applications, largely due to their incomplete visual generation and perception capabilities, limiting their generalizability and effectiveness in diverse settings. 
In this paper, we present VisionHub, a novel universal vision model designed to concurrently manage multiple visual restoration and perception tasks, while offering streamlined transferability to downstream tasks. Our model leverages the frozen denoising U-Net architecture from Stable Diffusion as the backbone, fully exploiting its inherent potential for both visual restoration and perception. To further enhance the model's flexibility, we propose the incorporation of lightweight task-plugins and the task router, which are seamlessly integrated onto the U-Net backbone. This architecture enables VisionHub to efficiently handle various vision tasks according to user-provided natural language instructions, all while maintaining minimal storage costs and operational overhead. Extensive experiments across 11 different vision tasks showcase both the efficiency and effectiveness of our approach. Remarkably, VisionHub achieves competitive performance across a variety of benchmarks, including 53.3% mIoU on ADE20K semantic segmentation, 0.253 RMSE on NYUv2 depth estimation, and 74.2 AP on MS-COCO pose estimation.\",\"PeriodicalId\":13217,\"journal\":{\"name\":\"IEEE Transactions on Image Processing\",\"volume\":\"91 1\",\"pages\":\"\"},\"PeriodicalIF\":13.7000,\"publicationDate\":\"2025-09-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Image Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1109/tip.2025.3611645\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Image Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tip.2025.3611645","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
VisionHub: Learning Task-Plugins for Efficient Universal Vision Model.
Abstract: Building on the success of universal language models in natural language processing (NLP), researchers have recently sought to develop methods capable of tackling a broad spectrum of visual tasks within a unified foundation framework. However, existing universal vision models face significant challenges when adapting to the rapidly expanding scope of downstream tasks. These challenges stem not only from the prohibitive computational and storage costs of training such models but also from the complexity of their workflows, which makes efficient adaptation difficult. Moreover, these models often fail to deliver the performance and versatility required across a broad spectrum of applications, largely because their visual generation and perception capabilities are incomplete, which limits their generalizability and effectiveness in diverse settings. In this paper, we present VisionHub, a novel universal vision model designed to handle multiple visual restoration and perception tasks concurrently while offering streamlined transferability to downstream tasks. Our model uses the frozen denoising U-Net from Stable Diffusion as its backbone, fully exploiting its inherent potential for both visual restoration and perception. To further enhance the model's flexibility, we introduce lightweight task-plugins and a task router that are seamlessly integrated into the U-Net backbone. This architecture enables VisionHub to efficiently handle a variety of vision tasks according to user-provided natural-language instructions while keeping storage costs and operational overhead minimal. Extensive experiments across 11 different vision tasks demonstrate both the efficiency and effectiveness of our approach. Notably, VisionHub achieves competitive performance across a variety of benchmarks, including 53.3% mIoU on ADE20K semantic segmentation, 0.253 RMSE on NYUv2 depth estimation, and 74.2 AP on MS-COCO pose estimation.
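The abstract describes the architecture only at a high level, and the paper's implementation is not reproduced here. The following is a minimal, hypothetical PyTorch sketch of the general pattern it names: a frozen backbone, small per-task adapter modules ("task-plugins"), and a router that maps an instruction embedding to one plugin. All class and parameter names (TaskPlugin, TaskRouter, VisionHubSketch, hidden=64) are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of the task-plugin pattern described in the abstract.
# Module names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class TaskPlugin(nn.Module):
    """Lightweight bottleneck adapter; the only trainable part per task."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual adapter on top of frozen backbone features.
        return feats + self.up(torch.relu(self.down(feats)))


class TaskRouter(nn.Module):
    """Scores each registered plugin against an instruction embedding."""

    def __init__(self, text_dim: int, num_tasks: int):
        super().__init__()
        self.score = nn.Linear(text_dim, num_tasks)

    def forward(self, instruction_emb: torch.Tensor) -> torch.Tensor:
        return self.score(instruction_emb).argmax(dim=-1)


class VisionHubSketch(nn.Module):
    def __init__(self, backbone: nn.Module, text_dim: int, dim: int, task_names):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # backbone stays frozen
            p.requires_grad_(False)
        self.plugins = nn.ModuleList(TaskPlugin(dim) for _ in task_names)
        self.router = TaskRouter(text_dim, len(task_names))

    def forward(self, x: torch.Tensor, instruction_emb: torch.Tensor):
        task_id = int(self.router(instruction_emb))
        feats = self.backbone(x)             # shared frozen features
        return self.plugins[task_id](feats)  # cheap per-task adaptation
```

In the paper itself the plugins attach to Stable Diffusion's denoising U-Net rather than a generic backbone, and routing is driven by natural-language instructions; the sketch only illustrates why per-task storage stays small under this design: each new task adds just the adapter's weights, while the backbone is shared and frozen.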
About the journal:
The IEEE Transactions on Image Processing covers groundbreaking theories, algorithms, and architectures concerning the generation, acquisition, manipulation, transmission, analysis, and presentation of images, video, and multidimensional signals across diverse applications. Topics span the mathematical, statistical, and perceptual aspects of the field, encompassing the modeling, representation, formation, coding, filtering, enhancement, restoration, rendering, halftoning, search, and analysis of images, video, and multidimensional signals. Pertinent applications range from image and video communications to electronic imaging, biomedical imaging, image and video systems, and remote sensing.