VisionHub: Learning Task-Plugins for Efficient Universal Vision Model
Haolin Wang, Yixuan Zhu, Wenliang Zhao, Jie Zhou, Jiwen Lu
IEEE Transactions on Image Processing, published 2025-09-25. DOI: 10.1109/tip.2025.3611645
{"title":"VisionHub:高效通用视觉模型的学习任务插件。","authors":"Haolin Wang,Yixuan Zhu,Wenliang Zhao,Jie Zhou,Jiwen Lu","doi":"10.1109/tip.2025.3611645","DOIUrl":null,"url":null,"abstract":"Building on the success of universal language models in natural language processing (NLP), researchers have recently sought to develop methods capable of tackling a broad spectrum of visual tasks within a unified foundation framework. However, existing universal vision models face significant challenges when adapting to the rapidly expanding scope of downstream tasks. These challenges stem not only from the prohibitive computational and storage expenses associated with training such models but also from the complexity of their workflows, which makes efficient adaptations difficult. Moreover, these models often fail to deliver the required performance and versatility for a broad spectrum of applications, largely due to their incomplete visual generation and perception capabilities, limiting their generalizability and effectiveness in diverse settings. In this paper, we present VisionHub, a novel universal vision model designed to concurrently manage multiple visual restoration and perception tasks, while offering streamlined transferability to downstream tasks. Our model leverages the frozen denoising U-Net architecture from Stable Diffusion as the backbone, fully exploiting its inherent potential for both visual restoration and perception. To further enhance the model's flexibility, we propose the incorporation of lightweight task-plugins and the task router, which are seamlessly integrated onto the U-Net backbone. This architecture enables VisionHub to efficiently handle various vision tasks according to user-provided natural language instructions, all while maintaining minimal storage costs and operational overhead. Extensive experiments across 11 different vision tasks showcase both the efficiency and effectiveness of our approach. Remarkably, VisionHub achieves competitive performance across a variety of benchmarks, including 53.3% mIoU on ADE20K semantic segmentation, 0.253 RMSE on NYUv2 depth estimation, and 74.2 AP on MS-COCO pose estimation.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"91 1","pages":""},"PeriodicalIF":13.7000,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VisionHub: Learning Task-Plugins for Efficient Universal Vision Model.\",\"authors\":\"Haolin Wang,Yixuan Zhu,Wenliang Zhao,Jie Zhou,Jiwen Lu\",\"doi\":\"10.1109/tip.2025.3611645\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Building on the success of universal language models in natural language processing (NLP), researchers have recently sought to develop methods capable of tackling a broad spectrum of visual tasks within a unified foundation framework. However, existing universal vision models face significant challenges when adapting to the rapidly expanding scope of downstream tasks. These challenges stem not only from the prohibitive computational and storage expenses associated with training such models but also from the complexity of their workflows, which makes efficient adaptations difficult. Moreover, these models often fail to deliver the required performance and versatility for a broad spectrum of applications, largely due to their incomplete visual generation and perception capabilities, limiting their generalizability and effectiveness in diverse settings. 
In this paper, we present VisionHub, a novel universal vision model designed to concurrently manage multiple visual restoration and perception tasks, while offering streamlined transferability to downstream tasks. Our model leverages the frozen denoising U-Net architecture from Stable Diffusion as the backbone, fully exploiting its inherent potential for both visual restoration and perception. To further enhance the model's flexibility, we propose the incorporation of lightweight task-plugins and the task router, which are seamlessly integrated onto the U-Net backbone. This architecture enables VisionHub to efficiently handle various vision tasks according to user-provided natural language instructions, all while maintaining minimal storage costs and operational overhead. Extensive experiments across 11 different vision tasks showcase both the efficiency and effectiveness of our approach. Remarkably, VisionHub achieves competitive performance across a variety of benchmarks, including 53.3% mIoU on ADE20K semantic segmentation, 0.253 RMSE on NYUv2 depth estimation, and 74.2 AP on MS-COCO pose estimation.\",\"PeriodicalId\":13217,\"journal\":{\"name\":\"IEEE Transactions on Image Processing\",\"volume\":\"91 1\",\"pages\":\"\"},\"PeriodicalIF\":13.7000,\"publicationDate\":\"2025-09-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Image Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1109/tip.2025.3611645\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Image Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tip.2025.3611645","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
VisionHub: Learning Task-Plugins for Efficient Universal Vision Model.
Abstract: Building on the success of universal language models in natural language processing (NLP), researchers have recently sought to develop methods capable of tackling a broad spectrum of visual tasks within a unified foundation framework. However, existing universal vision models face significant challenges when adapting to the rapidly expanding scope of downstream tasks. These challenges stem not only from the prohibitive computational and storage costs of training such models but also from the complexity of their workflows, which makes efficient adaptation difficult. Moreover, these models often fail to deliver the performance and versatility required across a broad spectrum of applications, largely because their visual generation and perception capabilities are incomplete, which limits their generalizability and effectiveness in diverse settings. In this paper, we present VisionHub, a novel universal vision model designed to handle multiple visual restoration and perception tasks concurrently while offering streamlined transferability to downstream tasks. Our model uses the frozen denoising U-Net from Stable Diffusion as its backbone, fully exploiting its inherent potential for both visual restoration and perception. To further enhance the model's flexibility, we introduce lightweight task-plugins and a task router that are seamlessly integrated into the U-Net backbone. This architecture enables VisionHub to efficiently handle a variety of vision tasks according to user-provided natural-language instructions while keeping storage costs and operational overhead minimal. Extensive experiments across 11 different vision tasks demonstrate both the efficiency and effectiveness of our approach. Notably, VisionHub achieves competitive performance across a variety of benchmarks, including 53.3% mIoU on ADE20K semantic segmentation, 0.253 RMSE on NYUv2 depth estimation, and 74.2 AP on MS-COCO pose estimation.
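The abstract describes the architecture only at a high level, and the paper's implementation is not reproduced here. The following is a minimal, hypothetical PyTorch sketch of the general pattern it names: a frozen backbone, small per-task adapter modules ("task-plugins"), and a router that maps an instruction embedding to one plugin. All class and parameter names (TaskPlugin, TaskRouter, VisionHubSketch, hidden=64) are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of the task-plugin pattern described in the abstract.
# Module names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class TaskPlugin(nn.Module):
    """Lightweight bottleneck adapter; the only trainable part per task."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual adapter on top of frozen backbone features.
        return feats + self.up(torch.relu(self.down(feats)))


class TaskRouter(nn.Module):
    """Scores each registered plugin against an instruction embedding."""

    def __init__(self, text_dim: int, num_tasks: int):
        super().__init__()
        self.score = nn.Linear(text_dim, num_tasks)

    def forward(self, instruction_emb: torch.Tensor) -> torch.Tensor:
        return self.score(instruction_emb).argmax(dim=-1)


class VisionHubSketch(nn.Module):
    def __init__(self, backbone: nn.Module, text_dim: int, dim: int, task_names):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # backbone stays frozen
            p.requires_grad_(False)
        self.plugins = nn.ModuleList(TaskPlugin(dim) for _ in task_names)
        self.router = TaskRouter(text_dim, len(task_names))

    def forward(self, x: torch.Tensor, instruction_emb: torch.Tensor):
        task_id = int(self.router(instruction_emb))
        feats = self.backbone(x)             # shared frozen features
        return self.plugins[task_id](feats)  # cheap per-task adaptation
```

In the paper itself the plugins attach to Stable Diffusion's denoising U-Net rather than a generic backbone, and routing is driven by natural-language instructions; the sketch only illustrates why per-task storage stays small under this design: each new task adds just the adapter's weights, while the backbone is shared and frozen.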
About the journal:
The IEEE Transactions on Image Processing covers groundbreaking theories, algorithms, and architectures concerning the generation, acquisition, manipulation, transmission, analysis, and presentation of images, video, and multidimensional signals across diverse applications. Topics span the mathematical, statistical, and perceptual aspects of the field, encompassing the modeling, representation, formation, coding, filtering, enhancement, restoration, rendering, halftoning, search, and analysis of images, video, and multidimensional signals. Pertinent applications range from image and video communications to electronic imaging, biomedical imaging, image and video systems, and remote sensing.