An empirical analysis of feature fusion task heads of ViT pre-trained models on OOD classification tasks
Authors: Mingxing Zhang, Jun Ai, Tao Shi
Journal: Journal of Systems and Software, volume 223, Article 112358
Publication date: 2025-01-30
DOI: 10.1016/j.jss.2025.112358
URL: https://www.sciencedirect.com/science/article/pii/S0164121225000263
Impact factor: 3.7 (JCR Q1, Computer Science, Software Engineering)
Citations: 0
Abstract
Pre-trained ViT models are widely used in downstream tasks, and the structure of the task head has a significant impact on downstream performance. Although it is common practice to concatenate the cls tokens of the last few ViT layers for classification, a choice usually made empirically, there is limited research on whether the feature fusion structure matters for the model. This paper examines the impact of attention-based fusion structures on the backbone network and on classification performance. We first examine the relationship between the dataset and the feature fusion task head, then explore how the location of the intermediate layers used for fusion affects model performance and how the feature fusion task head influences the backbone network itself. Finally, we characterize the task head through the loss of models built on the feature fusion structure. Based on these empirical findings, we identify five important insights and provide recommendations for model structure during downstream fine-tuning.
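The two head styles the abstract contrasts, concatenating cls tokens and attention-based fusion, can be illustrated with a small pure-Python sketch. This is not the authors' architecture: the function names, the single fixed `query` vector (which would be learned in practice), and the toy per-layer vectors are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def concat_fusion(cls_tokens: List[Vector], last_k: int) -> Vector:
    """Concatenate the cls vectors of the last `last_k` layers into one feature."""
    fused: Vector = []
    for vec in cls_tokens[-last_k:]:
        fused.extend(vec)
    return fused


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(cls_tokens: List[Vector], query: Vector) -> Vector:
    """Weighted sum of per-layer cls vectors.

    Weights are a softmax over scaled dot products between a (hypothetical)
    query vector and each layer's cls vector.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, vec)) / math.sqrt(dim)
        for vec in cls_tokens
    ]
    weights = softmax(scores)
    return [
        sum(w * vec[i] for w, vec in zip(weights, cls_tokens))
        for i in range(dim)
    ]


# Toy cls vectors from three hypothetical layers, each of dimension 2.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
concat = concat_fusion(layers, last_k=2)             # feature of length 2 * 2
fused = attention_fusion(layers, query=[0.0, 0.0])   # zero query gives uniform weights
```

Concatenation grows the feature size with the number of fused layers, whereas the attention-weighted sum keeps the original dimension and lets the weights decide how much each layer contributes.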
About the journal:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.