{"title":"HLScope: High-Level Performance Debugging for FPGA Designs","authors":"Young-kyu Choi, J. Cong","doi":"10.1109/FCCM.2017.44","DOIUrl":null,"url":null,"abstract":"In their quest for further optimization, field-programmable gate array (FPGA) designers often spend considerable time trying to identify the performance bottleneck in a current design. But since FPGAs do not have built-in high-level probes for performance analysis, manual effort is required to insert custom hardware monitors. This, however, is a time-consuming process which calls for automation. Previous work automates the process of inserting hardware monitors into the communication channels or the finite-state machine, but the instrumentation is applied in low-level hardware description languages (HDL) which limits the comprehensibility in identifying the root cause of stalls. Instead, we propose a performance debugging methodology based on high-level synthesis (HLS). High-level analysis allows tracing the cause of stalls on a function or loop level, which provides a more intuitive feedback that can be used to pinpoint the performance bottleneck. In this paper we propose HLScope, a source-to-source transformation framework based on Vivado HLS for automated performance analysis. We present a method for analyzing the information collected from the software simulation to estimate the stall rate and its cause without the need for FPGA bitstream generation. For detailed analysis, an in-FPGA analysis method is proposed that can be natively integrated into the HLS environment. Experiments show that the parameter extraction from the simulation process is orders of magnitude faster than bitstream generation, with a 2.2% cycle difference on average. In-FPGA flow consumes only about 170 LUTs and a BRAM per monitored module and provides cycle-accurate results.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2017.44","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 26
Abstract
In their quest for further optimization, field-programmable gate array (FPGA) designers often spend considerable time trying to identify the performance bottleneck in a current design. But since FPGAs do not have built-in high-level probes for performance analysis, manual effort is required to insert custom hardware monitors. This, however, is a time-consuming process which calls for automation. Previous work automates the process of inserting hardware monitors into the communication channels or the finite-state machine, but the instrumentation is applied in low-level hardware description languages (HDL) which limits the comprehensibility in identifying the root cause of stalls. Instead, we propose a performance debugging methodology based on high-level synthesis (HLS). High-level analysis allows tracing the cause of stalls on a function or loop level, which provides a more intuitive feedback that can be used to pinpoint the performance bottleneck. In this paper we propose HLScope, a source-to-source transformation framework based on Vivado HLS for automated performance analysis. We present a method for analyzing the information collected from the software simulation to estimate the stall rate and its cause without the need for FPGA bitstream generation. For detailed analysis, an in-FPGA analysis method is proposed that can be natively integrated into the HLS environment. Experiments show that the parameter extraction from the simulation process is orders of magnitude faster than bitstream generation, with a 2.2% cycle difference on average. In-FPGA flow consumes only about 170 LUTs and a BRAM per monitored module and provides cycle-accurate results.