Efficient and performant Transformer private inference with heterogeneous attention mechanisms

IF 7.2 | CAS Q1 (Computer Science) | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Peng Hu, Lei Sun, Cuiyun Hu, Xiuqing Mao, Song Guo, Jingwen Wang, Miao Yu
{"title":"具有异构注意机制的高效、高性能的Transformer私有推理","authors":"Peng Hu ,&nbsp;Lei Sun ,&nbsp;Cuiyun Hu ,&nbsp;Xiuqing Mao ,&nbsp;Song Guo ,&nbsp;Jingwen Wang ,&nbsp;Miao Yu","doi":"10.1016/j.asoc.2025.113150","DOIUrl":null,"url":null,"abstract":"<div><div>With the development of large-scale models, Transformer architectures have gained widespread adoption. However, privacy concerns become critical when model inference involves separate ownership of data and model parameters. Existing MPC-based methods for private inference suffer from significant overhead and high latency, where replacing traditional Softmax attention mechanisms with faster alternatives serves as a promising research direction. To achieve a better balance between Transformer model performance and inference speed, we explore the impact of attention mechanisms and attention heads on model performance. First, we found that the performance of the attention mechanism is closely related to the downstream task dataset, and the attention mechanism that is faster on specific datasets can actually achieve better model performance. Additionally, we discovered that for attention mechanisms that experience a performance decline, appropriately restoring the attention heads of the Softmax mechanism can significantly enhance performance. We further observed that the selection of key attention heads under different mechanisms is consistent, providing a basis for designing search strategies adapted to different scenarios. Based on these findings, we propose an MPC-friendly attention mechanism replacement method that enables Transformer private inference to be more efficient and performant. This method incorporates two strategies for selecting and replacing attention mechanisms to address diverse scenario requirements, and the resulting heterogeneous attention mechanism significantly improves the speed of private inference while maximizing model performance. With experiments on different downstream tasks, we demonstrated that our method improves average model performance by 1.94 % compared to standard pre-training models, with an inference speed increase of approximately 3 × . Compared to the state-of-the-art methods, our approach enhances model performance by 1.01–8.03 %, with faster inference speeds. Additionally, when evaluated using comprehensive metrics, our method shows improvements of 4.15 × to 8.97 × compared to other approaches.</div></div>","PeriodicalId":50737,"journal":{"name":"Applied Soft Computing","volume":"176 ","pages":"Article 113150"},"PeriodicalIF":7.2000,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient and performant Transformer private inference with heterogeneous attention mechanisms\",\"authors\":\"Peng Hu ,&nbsp;Lei Sun ,&nbsp;Cuiyun Hu ,&nbsp;Xiuqing Mao ,&nbsp;Song Guo ,&nbsp;Jingwen Wang ,&nbsp;Miao Yu\",\"doi\":\"10.1016/j.asoc.2025.113150\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>With the development of large-scale models, Transformer architectures have gained widespread adoption. However, privacy concerns become critical when model inference involves separate ownership of data and model parameters. Existing MPC-based methods for private inference suffer from significant overhead and high latency, where replacing traditional Softmax attention mechanisms with faster alternatives serves as a promising research direction. 
To achieve a better balance between Transformer model performance and inference speed, we explore the impact of attention mechanisms and attention heads on model performance. First, we found that the performance of the attention mechanism is closely related to the downstream task dataset, and the attention mechanism that is faster on specific datasets can actually achieve better model performance. Additionally, we discovered that for attention mechanisms that experience a performance decline, appropriately restoring the attention heads of the Softmax mechanism can significantly enhance performance. We further observed that the selection of key attention heads under different mechanisms is consistent, providing a basis for designing search strategies adapted to different scenarios. Based on these findings, we propose an MPC-friendly attention mechanism replacement method that enables Transformer private inference to be more efficient and performant. This method incorporates two strategies for selecting and replacing attention mechanisms to address diverse scenario requirements, and the resulting heterogeneous attention mechanism significantly improves the speed of private inference while maximizing model performance. With experiments on different downstream tasks, we demonstrated that our method improves average model performance by 1.94 % compared to standard pre-training models, with an inference speed increase of approximately 3 × . Compared to the state-of-the-art methods, our approach enhances model performance by 1.01–8.03 %, with faster inference speeds. Additionally, when evaluated using comprehensive metrics, our method shows improvements of 4.15 × to 8.97 × compared to other approaches.</div></div>\",\"PeriodicalId\":50737,\"journal\":{\"name\":\"Applied Soft Computing\",\"volume\":\"176 \",\"pages\":\"Article 113150\"},\"PeriodicalIF\":7.2000,\"publicationDate\":\"2025-04-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Soft Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1568494625004612\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Soft Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1568494625004612","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

With the development of large-scale models, Transformer architectures have gained widespread adoption. However, privacy concerns become critical when model inference involves separate ownership of data and model parameters. Existing MPC-based methods for private inference suffer from significant overhead and high latency, where replacing traditional Softmax attention mechanisms with faster alternatives serves as a promising research direction. To achieve a better balance between Transformer model performance and inference speed, we explore the impact of attention mechanisms and attention heads on model performance. First, we found that the performance of the attention mechanism is closely related to the downstream task dataset, and the attention mechanism that is faster on specific datasets can actually achieve better model performance. Additionally, we discovered that for attention mechanisms that experience a performance decline, appropriately restoring the attention heads of the Softmax mechanism can significantly enhance performance. We further observed that the selection of key attention heads under different mechanisms is consistent, providing a basis for designing search strategies adapted to different scenarios. Based on these findings, we propose an MPC-friendly attention mechanism replacement method that enables Transformer private inference to be more efficient and performant. This method incorporates two strategies for selecting and replacing attention mechanisms to address diverse scenario requirements, and the resulting heterogeneous attention mechanism significantly improves the speed of private inference while maximizing model performance. With experiments on different downstream tasks, we demonstrated that our method improves average model performance by 1.94% compared to standard pre-training models, with an inference speed increase of approximately 3×. Compared to the state-of-the-art methods, our approach enhances model performance by 1.01-8.03%, with faster inference speeds. Additionally, when evaluated using comprehensive metrics, our method shows improvements of 4.15× to 8.97× compared to other approaches.
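
The abstract does not give implementation details, but its core idea, a single multi-head attention layer in which some heads keep the standard Softmax while the remaining heads use a cheaper, MPC-friendly scoring function, can be sketched roughly as follows. This is a minimal illustration under assumptions: the ReLU-plus-normalization substitute, the fixed head split, and the class name HeterogeneousAttention are placeholders for illustration, not the authors' exact construction.

```python
# A minimal sketch of "heterogeneous attention" as described in the abstract:
# within one attention layer, a chosen subset of heads keeps Softmax, while the
# rest use a cheaper scoring function that avoids exponentials (ReLU scores with
# sum-normalization here, chosen only as an illustrative MPC-friendly stand-in).
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, softmax_heads: set):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.softmax_heads = softmax_heads  # indices of heads that keep Softmax
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, d_head)
        shape = (b, t, self.n_heads, self.d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        outputs = []
        for h in range(self.n_heads):
            s = scores[:, h]
            if h in self.softmax_heads:
                # "restored" Softmax head: accurate but expensive under MPC
                attn = F.softmax(s, dim=-1)
            else:
                # hypothetical MPC-friendly head: ReLU scores normalized by
                # their row sums, so no exponentials are evaluated
                pos = F.relu(s)
                attn = pos / (pos.sum(dim=-1, keepdim=True) + 1e-6)
            outputs.append(attn @ v[:, h])

        y = torch.stack(outputs, dim=1).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)


if __name__ == "__main__":
    layer = HeterogeneousAttention(d_model=64, n_heads=8, softmax_heads={0, 3})
    print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```

In the paper's setting, which heads retain Softmax would presumably be determined by the two selection strategies the authors propose rather than fixed by hand as in this toy example.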
Source journal
Applied Soft Computing (Engineering & Technology: Computer Science, Interdisciplinary Applications)
CiteScore: 15.80
Self-citation rate: 6.90%
Articles published: 874
Review time: 10.9 months
Journal description: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real life problems. The focus is to publish the highest quality research in application and convergence of the areas of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the web site will continuously be updated with new articles and the publication time will be short.