Peng Hu, Lei Sun, Cuiyun Hu, Xiuqing Mao, Song Guo, Jingwen Wang, Miao Yu
{"title":"具有异构注意机制的高效、高性能的Transformer私有推理","authors":"Peng Hu , Lei Sun , Cuiyun Hu , Xiuqing Mao , Song Guo , Jingwen Wang , Miao Yu","doi":"10.1016/j.asoc.2025.113150","DOIUrl":null,"url":null,"abstract":"<div><div>With the development of large-scale models, Transformer architectures have gained widespread adoption. However, privacy concerns become critical when model inference involves separate ownership of data and model parameters. Existing MPC-based methods for private inference suffer from significant overhead and high latency, where replacing traditional Softmax attention mechanisms with faster alternatives serves as a promising research direction. To achieve a better balance between Transformer model performance and inference speed, we explore the impact of attention mechanisms and attention heads on model performance. First, we found that the performance of the attention mechanism is closely related to the downstream task dataset, and the attention mechanism that is faster on specific datasets can actually achieve better model performance. Additionally, we discovered that for attention mechanisms that experience a performance decline, appropriately restoring the attention heads of the Softmax mechanism can significantly enhance performance. We further observed that the selection of key attention heads under different mechanisms is consistent, providing a basis for designing search strategies adapted to different scenarios. Based on these findings, we propose an MPC-friendly attention mechanism replacement method that enables Transformer private inference to be more efficient and performant. This method incorporates two strategies for selecting and replacing attention mechanisms to address diverse scenario requirements, and the resulting heterogeneous attention mechanism significantly improves the speed of private inference while maximizing model performance. With experiments on different downstream tasks, we demonstrated that our method improves average model performance by 1.94 % compared to standard pre-training models, with an inference speed increase of approximately 3 × . Compared to the state-of-the-art methods, our approach enhances model performance by 1.01–8.03 %, with faster inference speeds. Additionally, when evaluated using comprehensive metrics, our method shows improvements of 4.15 × to 8.97 × compared to other approaches.</div></div>","PeriodicalId":50737,"journal":{"name":"Applied Soft Computing","volume":"176 ","pages":"Article 113150"},"PeriodicalIF":7.2000,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient and performant Transformer private inference with heterogeneous attention mechanisms\",\"authors\":\"Peng Hu , Lei Sun , Cuiyun Hu , Xiuqing Mao , Song Guo , Jingwen Wang , Miao Yu\",\"doi\":\"10.1016/j.asoc.2025.113150\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>With the development of large-scale models, Transformer architectures have gained widespread adoption. However, privacy concerns become critical when model inference involves separate ownership of data and model parameters. Existing MPC-based methods for private inference suffer from significant overhead and high latency, where replacing traditional Softmax attention mechanisms with faster alternatives serves as a promising research direction. 
To achieve a better balance between Transformer model performance and inference speed, we explore the impact of attention mechanisms and attention heads on model performance. First, we found that the performance of the attention mechanism is closely related to the downstream task dataset, and the attention mechanism that is faster on specific datasets can actually achieve better model performance. Additionally, we discovered that for attention mechanisms that experience a performance decline, appropriately restoring the attention heads of the Softmax mechanism can significantly enhance performance. We further observed that the selection of key attention heads under different mechanisms is consistent, providing a basis for designing search strategies adapted to different scenarios. Based on these findings, we propose an MPC-friendly attention mechanism replacement method that enables Transformer private inference to be more efficient and performant. This method incorporates two strategies for selecting and replacing attention mechanisms to address diverse scenario requirements, and the resulting heterogeneous attention mechanism significantly improves the speed of private inference while maximizing model performance. With experiments on different downstream tasks, we demonstrated that our method improves average model performance by 1.94 % compared to standard pre-training models, with an inference speed increase of approximately 3 × . Compared to the state-of-the-art methods, our approach enhances model performance by 1.01–8.03 %, with faster inference speeds. Additionally, when evaluated using comprehensive metrics, our method shows improvements of 4.15 × to 8.97 × compared to other approaches.</div></div>\",\"PeriodicalId\":50737,\"journal\":{\"name\":\"Applied Soft Computing\",\"volume\":\"176 \",\"pages\":\"Article 113150\"},\"PeriodicalIF\":7.2000,\"publicationDate\":\"2025-04-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Soft Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1568494625004612\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Soft Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1568494625004612","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Efficient and performant Transformer private inference with heterogeneous attention mechanisms
With the development of large-scale models, Transformer architectures have gained widespread adoption. However, privacy becomes a critical concern when model inference involves separate ownership of the data and the model parameters. Existing MPC-based methods for private inference suffer from significant overhead and high latency, making the replacement of the traditional Softmax attention mechanism with faster alternatives a promising research direction. To achieve a better balance between Transformer model performance and inference speed, we explore the impact of attention mechanisms and attention heads on model performance. First, we find that the performance of an attention mechanism is closely related to the downstream task dataset, and that a mechanism which is faster on a specific dataset can actually achieve better model performance. Additionally, we discover that for attention mechanisms that suffer a performance decline, appropriately restoring Softmax attention in selected heads can significantly enhance performance. We further observe that the selection of key attention heads is consistent across different mechanisms, which provides a basis for designing search strategies adapted to different scenarios. Based on these findings, we propose an MPC-friendly attention-mechanism replacement method that makes Transformer private inference more efficient and performant. The method incorporates two strategies for selecting and replacing attention mechanisms to address diverse scenario requirements, and the resulting heterogeneous attention mechanism significantly improves the speed of private inference while maximizing model performance. Through experiments on different downstream tasks, we demonstrate that our method improves average model performance by 1.94% compared to standard pre-trained models, with an inference speed increase of approximately 3×. Compared to state-of-the-art methods, our approach enhances model performance by 1.01%–8.03% while achieving faster inference. Additionally, when evaluated with comprehensive metrics, our method shows improvements of 4.15× to 8.97× over other approaches.
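To make the heterogeneous attention idea concrete, the sketch below shows one way to mix exact Softmax heads with cheaper heads inside a single multi-head block, mirroring the observation that restoring Softmax in a few key heads recovers most of the lost performance. The ReLU-based scoring used for the non-Softmax heads and the hard-coded `softmax_heads` set are illustrative assumptions, not the paper's actual replacement mechanism or head-selection strategy, and no MPC protocol is shown here.

```python
# Hedged sketch of a heterogeneous multi-head attention layer (plaintext PyTorch only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousAttention(nn.Module):
    """Multi-head self-attention in which selected heads keep exact Softmax scoring
    while the remaining heads use a cheaper, MPC-friendlier scoring rule."""

    def __init__(self, d_model: int, n_heads: int, softmax_heads: set):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.softmax_heads = softmax_heads  # indices of heads restored to Softmax
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each projection to (batch, heads, tokens, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5

        head_outputs = []
        for h in range(self.n_heads):
            s = scores[:, h]
            if h in self.softmax_heads:
                attn = F.softmax(s, dim=-1)  # exact Softmax head (costly under MPC)
            else:
                pos = F.relu(s)  # hypothetical MPC-friendly head:
                attn = pos / (pos.sum(dim=-1, keepdim=True) + 1e-6)  # ReLU scores, linear normalization
            head_outputs.append(attn @ v[:, h])
        out = torch.stack(head_outputs, dim=1).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)


# Example: a 12-head layer where only heads 0 and 3 keep Softmax.
layer = HeterogeneousAttention(d_model=768, n_heads=12, softmax_heads={0, 3})
y = layer(torch.randn(2, 16, 768))  # output shape: (batch=2, tokens=16, d_model=768)
```

In a private-inference setting, the appeal of such a layer is that only the heads in `softmax_heads` pay the full cost of evaluating Softmax under MPC, while the remaining heads use operations (ReLU and a single division) that are comparatively cheap in secret-shared arithmetic.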
Journal introduction:
Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. Its focus is to publish the highest quality research in the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real-world complexities.
Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. The website is therefore continuously updated with new articles, and publication times are short.