Spatial- and time- division multiplexing in CNN accelerator

Impact factor: 2.0; Zone 4 (Computer Science); JCR Q2, COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing, vol. 111, Article 102922. Pub Date: 2022-07-01. DOI: 10.1016/j.parco.2022.102922

Tetsuro Nakamura, Shogo Saito, Kei Fujimoto, Masashi Kaneko, Akinori Shiraga


Citations: 1

Abstract


Spatial- and time- division multiplexing in CNN accelerator

With the widespread use of real-time data analysis by artificial intelligence (AI), the integration of accelerators is attracting attention for their low power consumption and low latency. The objective of this research is to increase accelerator resource efficiency and further reduce power consumption by sharing accelerators among multiple users while maintaining real-time performance. To realize such an accelerator-sharing system, we define three requirements: high device utilization, fair device utilization among users, and real-time performance. Targeting the AI inference use case, this paper proposes a system that shares a field-programmable gate array (FPGA) among multiple users by switching the convolutional neural network (CNN) models stored in the FPGA's device memory, while satisfying the three requirements. The proposed system uses different behavioral models for workloads with predictable and unpredictable data arrival timing. For workloads with predictable data arrival timing, the system uses spatial-division multiplexing of the FPGA device memory to achieve real-time performance and high device utilization. Specifically, the system's FPGA device memory controller transparently preloads and caches the CNN models in the FPGA device memory before the data arrives. For workloads with unpredictable data arrival timing, the system transfers CNN models to the FPGA device memory upon data arrival, using time-division multiplexing of the FPGA device memory. In this latter case, the cost of switching between CNN models is too large to ignore if real-time performance and high device utilization are to be achieved, so the system integrates a new scheduling algorithm that takes the switch time of the CNN models into account. For both predictable and unpredictable workloads, user fairness is achieved by an ageing technique in the scheduling algorithm that raises the priority of a job as its waiting time grows.
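The abstract does not give the scheduling rule itself, so the following is only a minimal sketch of how a switch-time-aware scheduler with ageing might look. The `Job` class, the `pick_next` function, the linear scoring formula, and the `switch_cost` and `ageing_rate` parameters are all illustrative assumptions, not the authors' algorithm.

```python
from dataclasses import dataclass


@dataclass
class Job:
    user: str        # who submitted the job
    model: str       # CNN model the job needs in device memory
    arrival: float   # arrival time, in seconds


def pick_next(queue, loaded_model, now, switch_cost, ageing_rate):
    """Pick the queued job with the highest priority score.

    Hypothetical scoring rule (not taken from the paper):
    score = ageing_rate * waiting_time, minus switch_cost when the
    job needs a model other than the one currently loaded.
    """
    def score(job):
        waiting = now - job.arrival
        penalty = 0.0 if job.model == loaded_model else switch_cost
        return ageing_rate * waiting - penalty

    return max(queue, key=score)


# Model "A" is loaded. The job for "B" has waited longer, but not long
# enough to outweigh the swap penalty, so the scheduler stays on "A".
early = [Job("u1", "B", arrival=0.0), Job("u2", "A", arrival=0.5)]
print(pick_next(early, "A", now=1.0, switch_cost=1.0, ageing_rate=1.0).model)  # A

# Later, the "B" job has aged past the penalty and wins over a fresh "A" job.
late = [Job("u1", "B", arrival=0.0), Job("u3", "A", arrival=2.5)]
print(pick_next(late, "A", now=3.0, switch_cost=1.0, ageing_rate=1.0).model)  # B
```

In this sketch the switch penalty keeps the device on the loaded model while competing jobs are young, and ageing ensures that a job needing a different model is not starved indefinitely.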
The evaluation results show that the scheduling overhead of the proposed system is negligible for both predictable and unpredictable workloads, providing practical real-time performance. For unpredictable workloads, the new scheduling algorithm improves fairness by 24%–94% and resource efficiency by 31%–33% compared to traditional first-come first-served or round-robin algorithms. For predictable workloads, the system improves fairness by 50.5% compared to first-come first-served and achieves 99.5% resource efficiency.
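The abstract does not say how fairness is measured. As an illustration only, one widely used measure is Jain's fairness index over the per-user shares of device time; the sketch below assumes that metric, which may differ from the one used in the paper. The function name `jain_index` is hypothetical.

```python
def jain_index(shares):
    """Jain's fairness index of per-user shares: (sum x)^2 / (n * sum x^2).

    Returns 1.0 when all users receive equal shares and 1/n when a
    single user receives everything.
    """
    n = len(shares)
    if n == 0:
        raise ValueError("need at least one user")
    total = sum(shares)
    squares = sum(x * x for x in shares)
    if squares == 0:
        return 1.0  # nobody received anything, so all shares are equal
    return (total * total) / (n * squares)


print(jain_index([1.0, 1.0, 1.0, 1.0]))  # 1.0
print(jain_index([4.0, 0.0, 0.0, 0.0]))  # 0.25
```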


Source journal

Parallel Computing (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore

3.50

Self-citation rate

7.10%

Articles published

Review time

4.5 months

About the journal: Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing from single homogeneous or heterogeneous computing nodes to large-scale multi-node systems. Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results. Particular technical areas of interest include, but are not limited to:

- System software for parallel computer systems, including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing).
- Enabling software, including debuggers, performance tools, and system and numeric libraries.
- General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems.
- Software engineering and productivity as it relates to parallel computing.
- Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism.
- Performance measurement results on state-of-the-art systems.
- Approaches to effectively utilize large-scale parallel computing, including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures.
- Parallel I/O systems, both hardware and software.
- Networking technology for support of high-speed computing, demonstrating the impact of high-speed computation on parallel applications.