KFF: K-feature fusion token merging for vision transformer

IF 7.5 · CAS Zone 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Expert Systems with Applications · Pub Date: 2025-05-18 · DOI: 10.1016/j.eswa.2025.128206

Yu Yang, Yue Zhou, Xiaofang Hu, Shukai Duan


Citations: 0

Abstract



Recently, Vision Transformers (ViTs) have achieved better performance than Convolutional Neural Networks (CNNs) in various vision applications. However, they are usually more computationally expensive than CNNs and face challenges in training and inference efficiency. Token merging is an effective, training-free way to reduce model complexity. Because few tokens are exactly the same, however, prevalent similarity-based merging methods struggle to avoid feature information loss and accuracy degradation. To address this issue, we propose a novel K-feature fusion (KFF) token merging algorithm that significantly reduces both the similarity metric error and the token merging error with almost no accuracy loss. Specifically, we first show that similarity measurement errors and merging strategies have a significant impact on the performance of token merging algorithms, and that the currently popular K-based similarity method causes obvious feature shifts during the merging process. Based on this observation, we present a new feature-enhanced K-feature fusion method for computing token similarity. By combining the keys (K), which summarize the information contained in each token, with the more detailed intermediate features, the similarity measurement error is greatly reduced. Then, we design a similarity-weighted average token merging algorithm that is faster and more accurate than ordinary average token merging. Extensive experiments show that our approach yields better model performance at a comparable reduction in computation and gain in throughput, without extra training. For example, for ViT-B on ImageNet, our method removes 49.58% of tokens and improves throughput by 30% with only a 0.44% drop in accuracy.
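The two ideas in the abstract, a similarity computed from keys fused with intermediate features and a similarity-weighted average for merging, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how keys and features are fused or how tokens are paired, so the weighted concatenation in `fuse`, the mixing weight `alpha`, the even/odd split into source and destination tokens, and the function names are all assumptions made for illustration.

```python
import math


def normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v)) + eps
    return [x / n for x in v]


def fuse(key, feat, alpha=0.5):
    """Assumed fusion: weighted concatenation of the unit-norm key and feature."""
    k, f = normalize(key), normalize(feat)
    return normalize([alpha * x for x in k] + [(1.0 - alpha) * x for x in f])


def merge_tokens(tokens, keys, feats, r, alpha=0.5):
    """Merge up to r tokens into their most similar partner, weighted by similarity."""
    fused = [fuse(k, f, alpha) for k, f in zip(keys, feats)]
    src = list(range(0, len(tokens), 2))  # even positions: candidates to merge away
    dst = list(range(1, len(tokens), 2))  # odd positions: tokens that absorb merges
    if not dst:
        return [list(t) for t in tokens]

    # For each source token, find the destination with the highest cosine similarity.
    matches = []
    for i in src:
        sims = [sum(a * b for a, b in zip(fused[i], fused[j])) for j in dst]
        best = max(range(len(dst)), key=sims.__getitem__)
        matches.append((sims[best], i, dst[best]))

    # Keep only the r most similar (source, destination) pairs.
    matches.sort(key=lambda m: m[0], reverse=True)
    merged = matches[:max(r, 0)]
    merged_src = {i for _, i, _ in merged}

    # Weighted sums: each destination has weight 1, each merged source its similarity.
    num = {j: list(tokens[j]) for j in dst}
    den = {j: 1.0 for j in dst}
    for sim, i, j in merged:
        w = max(sim, 0.0)  # ignore negative similarities
        num[j] = [n + w * x for n, x in zip(num[j], tokens[i])]
        den[j] += w

    # Rebuild the sequence in its original order, skipping merged-away tokens.
    out = []
    for idx in range(len(tokens)):
        if idx in merged_src:
            continue
        if idx in num:
            out.append([n / den[idx] for n in num[idx]])
        else:
            out.append(list(tokens[idx]))
    return out
```

Calling `merge_tokens(tokens, keys, feats, r=2)` on a sequence of eight tokens returns six. Compared with a plain average, a source token that matches its destination only weakly contributes less to the merged result, which is one plausible reading of how similarity weighting limits feature shift. Special handling of the class token is omitted here.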


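Source journal

Expert Systems with Applications (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months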

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.