Jianyi Cheng, Lana Josipović, John Wickerson, George A. Constantinides
DOI: https://dl.acm.org/doi/10.1145/3599973
Journal: ACM Transactions on Reconfigurable Technology and Systems (TRETS), Impact Factor 3.1, JCR Q2 (Computer Science, Hardware & Architecture)
Published: 2023-05-31 (journal article)
Parallelising Control Flow in Dynamic-Scheduling High-Level Synthesis
Recently, there has been a trend towards using high-level synthesis (HLS) tools to generate dynamically scheduled hardware. The generated hardware is made up of components connected by handshake signals, which schedule the components at run time as their inputs become available. Such approaches promise superior performance on ‘irregular’ source programs, such as those whose control flow depends on input data, at the cost of additional area. Current dynamic scheduling techniques are well able to exploit parallelism among instructions within each basic block (BB) of the source program, but parallelism between BBs is under-explored, owing to the complexity of run-time control flow and memory dependencies. Existing tools allow some of the operations of different BBs to overlap, but, to simplify the analysis required at compile time, they require the BBs to start in strict program order, limiting the achievable parallelism and overall performance.
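To make the notion of an ‘irregular’ program concrete, here is a minimal, hypothetical C kernel of the kind the abstract describes (the function name and data are illustrative, not taken from the paper). Because whether the multiply-accumulate executes depends on the input data, a statically scheduled pipeline must budget for the worst-case path on every iteration, whereas dynamically scheduled hardware fires each operation only when its handshake inputs actually arrive.

```c
/* Hypothetical irregular kernel: the loop body's work is data-dependent,
 * so its latency varies from iteration to iteration. */
int sparse_accumulate(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] != 0)           /* control flow depends on input data */
            sum += a[i] * a[i];  /* longer path, taken only sometimes */
    }
    return sum;
}
```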
We formulate a general dependency model suitable for comparing the ability of different dynamic scheduling approaches to extract maximal parallelism at run time. Using this model, we explore a variety of mechanisms for run-time scheduling, incorporating and generalising existing approaches. In particular, we precisely identify the restrictions in existing scheduling implementations and define possible optimisation solutions. We identify two particularly promising cases where the compile-time overhead is small, the area overhead is minimal, and yet execution time can be significantly reduced: (1) parallelising consecutive independent loops; and (2) parallelising independent inner-loop instances of a nested loop as individual threads. Using benchmark sets from related works, we compare our proposed toolflow against a state-of-the-art dynamic-scheduling HLS tool called Dynamatic. Our results show that, on average, our toolflow yields a 4× speedup from (1) and a 2.9× speedup from (2), with negligible area overhead. This increases to a 14.3× average speedup when (1) and (2) are combined.
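The two loop patterns named in the abstract can be sketched with hypothetical C kernels (the function names and shapes are illustrative, not from the paper). In (1), the two loops touch disjoint arrays, so hardware need not wait for the first loop to finish before starting the second; in (2), each outer iteration works on a disjoint row, so the inner-loop instances could in principle run as individual threads.

```c
/* (1) Consecutive independent loops: loop 2 reads and writes only b,
 *     so its start need not wait for loop 1, which touches only a. */
void consecutive(int *a, int *b, int n) {
    for (int i = 0; i < n; i++) a[i] = a[i] * 2;  /* loop 1: only a */
    for (int j = 0; j < n; j++) b[j] = b[j] + 1;  /* loop 2: only b */
}

/* (2) Nested loop with independent inner-loop instances: each outer
 *     iteration fills a disjoint row, so no instance depends on another. */
void nested(int m[4][4], int rows) {
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < 4; c++)
            m[r][c] = r * 4 + c;  /* row r never touches row r' */
}
```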
Journal introduction:
TRETS is the top journal focusing on research in, on, and with reconfigurable systems and on their underlying technology. The scope, rationale, and coverage by other journals are often limited to particular aspects of reconfigurable technology or reconfigurable systems. TRETS is a journal that covers reconfigurability in its own right.
Topics appropriate for TRETS include all levels of reconfigurable system abstraction and all aspects of reconfigurable technology, including platforms, programming environments, and application successes that support these systems for computing or other applications:
-The board and systems architectures of a reconfigurable platform.
-Programming environments for reconfigurable systems, especially those that lead to increased programmer productivity.
-Languages and compilers for reconfigurable systems.
-Logic synthesis and related tools, as they relate to reconfigurable systems.
-Applications on which success can be demonstrated.
-The underlying technology from which reconfigurable systems are developed. (Currently this technology is that of FPGAs, but research on the nature and use of follow-on technologies is appropriate for TRETS.)
In considering whether a paper is suitable for TRETS, the foremost question should be whether reconfigurability has been essential to success. Topics such as architecture, programming languages, compilers, and environments, logic synthesis, and high performance applications are all suitable if the context is appropriate. For example, an architecture for an embedded application that happens to use FPGAs is not necessarily suitable for TRETS, but an architecture using FPGAs for which the reconfigurability of the FPGAs is an inherent part of the specifications (perhaps due to a need for re-use on multiple applications) would be appropriate for TRETS.