KFF: K-feature fusion token merging for vision transformer
Authors: Yu Yang, Yue Zhou, Xiaofang Hu, Shukai Duan
Journal: Expert Systems with Applications, volume 288, Article 128206 (Journal Article)
Published: 2025-05-18
DOI: 10.1016/j.eswa.2025.128206
URL: https://www.sciencedirect.com/science/article/pii/S0957417425018263
Vision Transformers (ViTs) have recently outperformed Convolutional Neural Networks (CNNs) in various vision applications. However, they are usually more computationally expensive than CNNs and face challenges in training and inference efficiency. Token merging is an effective, training-free way to reduce model complexity. However, because few tokens are exactly the same, prevalent similarity-based merging methods struggle to avoid feature information loss and accuracy degradation. To address this issue, we propose a novel K-feature fusion (KFF) token merging algorithm that significantly reduces both the similarity metric error and the token merging error with almost no accuracy loss. Specifically, we first show that similarity measurement errors and merging strategies have a significant impact on the performance of token merging algorithms, and that the currently popular K-based similarity method causes obvious feature shifts during merging. Based on this observation, we present a new feature-enhanced K-feature fusion method for computing token similarity. By combining the keys (K), which summarize the information contained in each token, with the more detailed intermediate features, the similarity measurement error is greatly reduced. We then design a similarity-weighted average token merging algorithm that is faster and more accurate than ordinary average token merging. Extensive experiments show that, without extra training, our approach yields better model performance at a comparable reduction in computation and improvement in throughput. For example, for ViT-B on ImageNet, our method reduces the number of tokens by 49.58% and improves throughput by 30% with only a 0.44% drop in accuracy.
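The abstract names two ingredients, a similarity computed from both the keys (K) and intermediate features, and a similarity-weighted average for merging, but it gives neither the fusion formula nor the matching scheme. The following is a minimal sketch under stated assumptions, not the authors' implementation: the fusion is assumed to be a convex combination of two cosine similarities (the weight `alpha` is hypothetical), tokens are assumed to be split into alternating source and destination sets as in bipartite token merging, and the merge weight is assumed to be the clamped similarity score. All function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def fused_similarity(k_a, x_a, k_b, x_b, alpha=0.5):
    """Assumed fusion: weighted mix of key similarity and feature similarity."""
    return alpha * cosine(k_a, k_b) + (1.0 - alpha) * cosine(x_a, x_b)


def merge_tokens(tokens, keys, feats, r, alpha=0.5):
    """Merge the r most similar source tokens into their best destination.

    Assumption: even indices are sources, odd indices are destinations.
    Each merged destination becomes a weighted average in which the
    destination has weight 1 and each source has weight max(similarity, 0).
    """
    src_idx = list(range(0, len(tokens), 2))
    dst_idx = list(range(1, len(tokens), 2))

    # For every source token, find its most similar destination token.
    best = []
    for i in src_idx:
        score, j = max(
            (fused_similarity(keys[i], feats[i], keys[j], feats[j], alpha), j)
            for j in dst_idx
        )
        best.append((score, i, j))

    # Keep only the r highest-scoring (source, destination) pairs.
    best.sort(reverse=True)
    to_merge = best[:r]
    merged_src = {i for _, i, _ in to_merge}

    # Accumulate weighted sums and total weights per destination.
    acc = {j: (list(tokens[j]), 1.0) for j in dst_idx}
    for score, i, j in to_merge:
        w = max(score, 0.0)
        vec, total = acc[j]
        acc[j] = ([a + w * b for a, b in zip(vec, tokens[i])], total + w)

    # Rebuild the sequence in original order, dropping merged sources.
    out = []
    for idx in range(len(tokens)):
        if idx in merged_src:
            continue
        if idx in acc:
            vec, total = acc[idx]
            out.append([a / total for a in vec])
        else:
            out.append(list(tokens[idx]))
    return out


if __name__ == "__main__":
    toks = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Keys and features are set equal to the tokens purely for illustration.
    print(merge_tokens(toks, toks, toks, r=1))
```

Compared with a plain average, this weighting lets a weakly matched source token pull the destination less than a strongly matched one, which is one plausible reading of "similarity-weighted average"; the paper's actual weighting may differ.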
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.