Parameter Efficient Fine-Tuning for Multi-modal Generative Vision Models with Möbius-Inspired Transformation

IF 9.3 | Zone 2, Computer Science | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2025-03-13 DOI:10.1007/s11263-025-02398-3

Haoran Duan, Shuai Shao, Bing Zhai, Tejal Shah, Jungong Han, Rajiv Ranjan


Citations: 0

Abstract

The rapid development of multimodal generative vision models has attracted considerable scientific interest. Notable advances, such as OpenAI’s ChatGPT and Stable Diffusion, demonstrate the potential of combining multimodal data to generate content. Nonetheless, customising these models for specific domains or tasks is challenging because of computational costs and data requirements. Conventional fine-tuning methods consume excessive computational resources, motivating the development of parameter-efficient fine-tuning techniques such as adapter modules, low-rank factorization and orthogonal fine-tuning. These techniques update only a subset of model parameters, reducing training cost while maintaining high-quality results. Orthogonal fine-tuning, regarded as a reliable technique, preserves semantic relationships in weight space but is limited in its expressive power. To overcome this limitation, we propose a simple yet effective transformation inspired by Möbius geometry, which replaces the conventional orthogonal transformation in parameter-efficient fine-tuning. This strategy improves the adaptability and expressiveness of fine-tuning, allowing it to capture more data patterns. Our strategy, supported by theoretical analysis and empirical validation, outperforms existing approaches, demonstrating competitive improvements in generation quality on key generative tasks.
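For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal NumPy sketch of orthogonal fine-tuning. It assumes the common Cayley parameterization of the orthogonal matrix and a weight matrix whose columns are neurons; the function names and toy dimensions are illustrative and not taken from the paper. The abstract does not specify the Möbius-inspired transformation itself, so it is not implemented here.

```python
import numpy as np


def cayley_orthogonal(params: np.ndarray) -> np.ndarray:
    """Map an unconstrained square matrix to an orthogonal matrix (Cayley transform)."""
    skew = params - params.T  # skew-symmetric part: skew.T == -skew
    identity = np.eye(params.shape[0])
    # R = (I + S)(I - S)^{-1}; I - S is invertible for any skew-symmetric S
    return (identity + skew) @ np.linalg.inv(identity - skew)


def orthogonal_finetune(weight: np.ndarray, params: np.ndarray) -> np.ndarray:
    """Rotate the frozen weight matrix; only `params` would be trained."""
    return cayley_orthogonal(params) @ weight


rng = np.random.default_rng(0)
d, n = 6, 4  # assumed toy sizes: d input features, n neurons (columns)
W = rng.standard_normal((d, n))             # stands in for a frozen pretrained weight
theta = 0.1 * rng.standard_normal((d, d))   # stands in for the trainable parameters

R = cayley_orthogonal(theta)
W_new = orthogonal_finetune(W, theta)

# The Gram matrix of the neurons (pairwise inner products, hence norms and
# angles) is unchanged by the rotation, even though the weights themselves move.
gram_before = W.T @ W
gram_after = W_new.T @ W_new
print(np.allclose(R.T @ R, np.eye(d)))
print(np.allclose(gram_before, gram_after))
```

With `params` set to zero the transform reduces to the identity, so fine-tuning starts exactly from the pretrained weights. The preserved Gram matrix is one way to read the abstract's statement that orthogonal fine-tuning keeps relationships in weight space intact, and the same constraint is what limits its expressiveness.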


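Source journal

International Journal of Computer Vision (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months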

Journal description: The International Journal of Computer Vision (IJCV) is a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal accepts several types of articles. Regular articles, of up to 25 journal pages, report significant technical advances of broad interest to the field. Short articles, limited to 10 pages, offer a faster publication path for novel research results. Survey articles, of up to 30 pages, provide critical evaluations of the state of the art in computer vision or tutorial presentations of relevant topics. In addition to technical articles, the journal publishes book reviews, position papers, and editorials by prominent scientific figures, which complement the technical content. Authors are encouraged to include supplementary material online, such as images, video sequences, data sets, and software, to improve the understanding and reproducibility of the published research.