{"title":"CoDA:多功能高效注意力加速器的协同设计框架","authors":"Wenjie Li;Aokun Hu;Ningyi Xu;Guanghui He","doi":"10.1109/TC.2024.3398488","DOIUrl":null,"url":null,"abstract":"As a primary component of Transformers, attention mechanism suffers from quadratic computational complexity. To achieve efficient implementations, its hardware accelerator designs have aroused great research interest. However, most existing accelerators only support a single type of application and a single type of attention, making it difficult to meet the demands of diverse application scenarios. Additionally, they mainly focus on the dynamic pruning of attention matrices, which requires the deployment of pre-processing units, thereby reducing overall hardware efficiency. This paper presents CoDA which is an algorithm, dataflow and architecture co-design framework for versatile and efficient attention accelerators. The designed accelerator supports both NLP and CV applications, and can be configured into the mode supporting low-rank attention or low-rank plus sparse attention. We apply algorithmic transformations to low-rank attention to significantly reduce computational complexity. To prevent an increase in storage overhead resulting from the proposed algorithmic transformations, we carefully design the dataflows and adopt a block-wise fashion. Down-scaling softmax is further supported by architecture and dataflow co-design. Moreover, we propose a softmax sharing strategy to reduce the area cost. Our experiment results demonstrate that the proposed accelerator outperforms the state-of-the-art designs in terms of throughput, area efficiency and energy efficiency.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 8","pages":"1924-1938"},"PeriodicalIF":3.6000,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CoDA: A Co-Design Framework for Versatile and Efficient Attention Accelerators\",\"authors\":\"Wenjie Li;Aokun Hu;Ningyi Xu;Guanghui He\",\"doi\":\"10.1109/TC.2024.3398488\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As a primary component of Transformers, attention mechanism suffers from quadratic computational complexity. To achieve efficient implementations, its hardware accelerator designs have aroused great research interest. However, most existing accelerators only support a single type of application and a single type of attention, making it difficult to meet the demands of diverse application scenarios. Additionally, they mainly focus on the dynamic pruning of attention matrices, which requires the deployment of pre-processing units, thereby reducing overall hardware efficiency. This paper presents CoDA which is an algorithm, dataflow and architecture co-design framework for versatile and efficient attention accelerators. The designed accelerator supports both NLP and CV applications, and can be configured into the mode supporting low-rank attention or low-rank plus sparse attention. We apply algorithmic transformations to low-rank attention to significantly reduce computational complexity. To prevent an increase in storage overhead resulting from the proposed algorithmic transformations, we carefully design the dataflows and adopt a block-wise fashion. Down-scaling softmax is further supported by architecture and dataflow co-design. Moreover, we propose a softmax sharing strategy to reduce the area cost. Our experiment results demonstrate that the proposed accelerator outperforms the state-of-the-art designs in terms of throughput, area efficiency and energy efficiency.\",\"PeriodicalId\":13087,\"journal\":{\"name\":\"IEEE Transactions on Computers\",\"volume\":\"73 8\",\"pages\":\"1924-1938\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2024-03-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Computers\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10527401/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10527401/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
CoDA: A Co-Design Framework for Versatile and Efficient Attention Accelerators
As a primary component of Transformers, attention mechanism suffers from quadratic computational complexity. To achieve efficient implementations, its hardware accelerator designs have aroused great research interest. However, most existing accelerators only support a single type of application and a single type of attention, making it difficult to meet the demands of diverse application scenarios. Additionally, they mainly focus on the dynamic pruning of attention matrices, which requires the deployment of pre-processing units, thereby reducing overall hardware efficiency. This paper presents CoDA which is an algorithm, dataflow and architecture co-design framework for versatile and efficient attention accelerators. The designed accelerator supports both NLP and CV applications, and can be configured into the mode supporting low-rank attention or low-rank plus sparse attention. We apply algorithmic transformations to low-rank attention to significantly reduce computational complexity. To prevent an increase in storage overhead resulting from the proposed algorithmic transformations, we carefully design the dataflows and adopt a block-wise fashion. Down-scaling softmax is further supported by architecture and dataflow co-design. Moreover, we propose a softmax sharing strategy to reduce the area cost. Our experiment results demonstrate that the proposed accelerator outperforms the state-of-the-art designs in terms of throughput, area efficiency and energy efficiency.
期刊介绍:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.