FPGA-assisted Design Space Exploration of Parameterized AI Accelerators: A Quickloop Approach
Kashif Inayat, Fahad Bin Muslim, Tayyeb Mahmood, Jaeyong Chung
Journal of Systems Architecture, Vol. 155, Article 103260, published 2024-08-19. DOI: 10.1016/j.sysarc.2024.103260
FPGAs facilitate prototyping and debugging and, thanks to their rapid turnaround time (TAT), have recently been used to accelerate full-stack simulations. However, even this TAT becomes restrictive in exhaustive design space exploration of parameterized RTL generators, especially DNN accelerators, whose full-stack search space grows explosively. This paper presents Quickloop, an efficient and scalable framework for FPGA-accelerated exploration. Quickloop first abstracts away the cumbersome flow of RTL generation, software stack, FPGA toolflow, workload execution, and metrics extraction by wrapping these stages into isolated Quicksteps that are cascadable, scalable, and replayable. We then analytically minimize the FPGA toolflow TAT with a novel, data-driven strategy that reuses build fragments from previous iterations, improving loop efficiency while lowering the toolflow's compute utilization.
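As a rough illustration of cascadable, replayable stages, the sketch below chains hypothetical `ToyQuickstep` objects and memoizes each stage's output, keyed on a hash of only the inputs that stage depends on. When a new design point leaves the RTL unchanged, the RTL and build stages are replayed from the cache instead of being rerun. This is plain memoization under our own assumptions, not the paper's data-driven fragment-reuse strategy, and every name in it (`ToyQuickstep`, `ToyQuickloop`, `dim`, `workload`) is illustrative.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ToyQuickstep:
    """One isolated stage (hypothetical API, not the paper's)."""
    name: str
    deps: Tuple[str, ...]                   # input keys this stage depends on
    fn: Callable[..., Dict[str, Any]]       # maps those inputs to new outputs


@dataclass
class ToyQuickloop:
    steps: List[ToyQuickstep]
    # Cached stage outputs ("build fragments"), keyed by stage name + input hash.
    cache: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    executed: List[str] = field(default_factory=list)  # stages actually run

    @staticmethod
    def _key(name: str, inputs: Dict[str, Any]) -> str:
        blob = json.dumps(inputs, sort_keys=True).encode()
        return name + ":" + hashlib.sha256(blob).hexdigest()

    def run(self, params: Dict[str, Any]) -> Dict[str, Any]:
        data = dict(params)
        for step in self.steps:  # cascade: later stages see earlier outputs
            inputs = {k: data[k] for k in step.deps}
            key = self._key(step.name, inputs)
            if key not in self.cache:          # miss: actually run the stage
                self.cache[key] = step.fn(**inputs)
                self.executed.append(step.name)
            data.update(self.cache[key])       # hit: replay cached outputs
        return data


# Illustrative stages: string placeholders stand in for RTL, bitstreams, metrics.
STEPS = [
    ToyQuickstep("rtl", ("dim",), lambda dim: {"rtl": f"gemmini_dim{dim}.v"}),
    ToyQuickstep("build", ("rtl",),
                 lambda rtl: {"bitstream": rtl.replace(".v", ".bit")}),
    ToyQuickstep("run", ("bitstream", "workload"),
                 lambda bitstream, workload: {"metric": f"{workload}@{bitstream}"}),
]

loop = ToyQuickloop(steps=STEPS)
first = loop.run({"dim": 16, "workload": "resnet50"})   # runs rtl, build, run
second = loop.run({"dim": 16, "workload": "mobilenet"})  # replays rtl and build
third = loop.run({"dim": 8, "workload": "resnet50"})     # new RTL: runs all three
fourth = loop.run({"dim": 16, "workload": "resnet50"})   # fully replayed
```

Keying each stage on its declared dependencies, rather than on the whole parameter set, is what lets a workload-only change skip the expensive build.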
Quickloop is built around the OpenAI Gym environment framework and therefore supports drop-in regression and reinforcement learning explorations. Wrapping a Quickloop around Berkeley's Gemmini DNN accelerator as a reference design, we exhaustively explore its parameter space and uncover complex performance patterns, using full-stack simulation of ImageNet benchmarks as the workload. We further show that, compared to a conventional FPGA toolflow, Quickloop reduces episode time by more than 30% as episodes approach realistic lengths.
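A minimal sketch of how such a loop could be exposed through a Gym-style interface, assuming the Gymnasium convention in which `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. The parameter grid (`dim`, `sp_kb`), the `toy_cost` model, and its area penalty are invented stand-ins for an FPGA build plus full-stack simulation; they do not reflect Gemmini's real parameters or any measured results. To stay self-contained, the class does not import `gymnasium`; a real environment would subclass `gymnasium.Env` and declare its `action_space` and `observation_space`.

```python
import itertools
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical design space; names and values are illustrative only.
PARAM_GRID: Dict[str, List[int]] = {"dim": [8, 16, 32], "sp_kb": [128, 256]}
DESIGN_POINTS: List[Dict[str, int]] = [
    dict(zip(PARAM_GRID, values)) for values in itertools.product(*PARAM_GRID.values())
]


def toy_cost(cfg: Dict[str, int]) -> float:
    """Invented stand-in for build + full-stack simulation of one design point."""
    compute = 1e6 / cfg["dim"] ** 2          # larger array -> fewer compute cycles
    memory = 2e4 * 256 / cfg["sp_kb"]        # smaller scratchpad -> more stalls
    area = cfg["dim"] ** 2 + cfg["sp_kb"]    # crude resource proxy
    return compute + memory + 20 * area      # latency plus weighted area penalty


class ToyAcceleratorEnv:
    """Gymnasium-style env: each action selects one design point to evaluate."""

    def __init__(self, max_steps: int) -> None:
        self.max_steps = max_steps
        self.t = 0

    def reset(self, *, seed: Optional[int] = None,
              options: Optional[dict] = None) -> Tuple[int, Dict[str, Any]]:
        self.t = 0
        return 0, {}

    def step(self, action: int):
        cfg = DESIGN_POINTS[action]
        reward = -toy_cost(cfg)               # lower cost -> higher reward
        self.t += 1
        truncated = self.t >= self.max_steps  # episode ends after max_steps
        # Observation is simply the last action index, for simplicity.
        return action, reward, False, truncated, {"config": cfg}


# Exhaustive exploration: step through every design point exactly once.
env = ToyAcceleratorEnv(max_steps=len(DESIGN_POINTS))
env.reset()
results = []
for action in range(len(DESIGN_POINTS)):
    _, reward, terminated, truncated, info = env.step(action)
    results.append((reward, info["config"]))
best_reward, best_cfg = max(results, key=lambda r: r[0])
```

Here the exhaustive sweep is just one pass over all actions; an RL or regression explorer would instead pick actions from observations and rewards, which is why a Gym-style wrapper makes the explorer interchangeable.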
About the journal:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not part of this journal's scope, hardware/software co-design methods that consider the interplay between software and hardware components with an emphasis on software are also relevant here.