{"title":"Kernel Normalised Least Mean Squares with Delayed Model Adaptation","authors":"Nicholas J. Fraser, P. Leong","doi":"10.1145/3376924","DOIUrl":"https://doi.org/10.1145/3376924","url":null,"abstract":"Kernel adaptive filters (KAFs) are non-linear filters which can adapt temporally and have the additional benefit of being computationally efficient through use of the “kernel trick”. In a number of real-world applications, such as channel equalisation, the non-linear mapping provides significant improvements over conventional linear techniques such as the least mean squares (LMS) and recursive least squares (RLS) algorithms. Prior works have focused mainly on the theory and accuracy of KAFs, with little research on their implementations. This article proposes several variants of algorithms based on the kernel normalised least mean squares (KNLMS) algorithm which utilise a delayed model update to minimise dependencies. Subsequently, this work proposes corresponding hardware architectures which utilise this delayed model update to achieve high sample rates and low latency while also providing high modelling accuracy. The resultant delayed KNLMS (DKNLMS) algorithms can achieve clock rates up to 12× higher than the standard KNLMS algorithm, with minimal impact on accuracy and stability. A system implementation achieves 250 GOps/s and a throughput of 187.4 MHz on an Ultra96 board with 1.8× higher throughput than previous state of the art.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132808479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In-Circuit Debugging with Dynamic Reconfiguration of FPGA Interconnects","authors":"Alexandra Kourfali, D. Stroobandt","doi":"10.1145/3375459","DOIUrl":"https://doi.org/10.1145/3375459","url":null,"abstract":"In this work, a novel method for in-circuit debugging on FPGAs is introduced that allows the insertion of low-overhead debugging infrastructure by exploiting the technique of parameterized configurations. This allows the parameterization of the LUTs and the routing infrastructure to create a virtual network of debugging multiplexers. It aims to facilitate debugging, to increase the internal signal observability, and to reduce the debugging (area and reconfiguration) overhead. Signal ranking techniques are also introduced that classify signals that can be traced during debug. Finally, the results of the method are presented and compared with a commercial tool. The area and time results and the tradeoffs between internal signal observability and area and reconfiguration overhead are also explored.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122462131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feel Free to Interrupt","authors":"Sameh Attia, Vaughn Betz","doi":"10.1145/3372491","DOIUrl":"https://doi.org/10.1145/3372491","url":null,"abstract":"Saving and restoring an FPGA task state in an orderly manner is essential to enable hardware checkpointing, which is highly desirable to improve the ability to debug cloud-scale hardware services, and context switching, which allows multiple users to share FPGA resources. However, these features require task interruption, and stopping a task at an arbitrary time can cause several hazards including deadlock and data loss. In this article, we build a context saving and restoring simulator to simulate and identify these hazards. In addition, we derive design rules that should be followed to achieve safe task interruption. Finally, we propose task wrappers that can be placed around an FPGA task to implement these rules. The timing and area overheads added by these wrappers are very small; they add 1.8% area and no timing overhead to a full Memcached system. Taken together, these design rules and wrappers enable safe checkpointing and context switching in a wide variety of FPGA tasks, including those with multiple clocks, multi-cycle I/O transactions, and interface dependencies.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"22 12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116552950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Al-Shahna Jamal, Eli Cahill, Jeffrey B. Goeders, S. Wilton
{"title":"Fast Turnaround HLS Debugging Using Dependency Analysis and Debug Overlays","authors":"Al-Shahna Jamal, Eli Cahill, Jeffrey B. Goeders, S. Wilton","doi":"10.1145/3372490","DOIUrl":"https://doi.org/10.1145/3372490","url":null,"abstract":"High-level synthesis (HLS) has gained considerable traction over recent years, as it allows for faster development and verification of hardware accelerators than traditional RTL design. While HLS allows for most bugs to be caught during software verification, certain non-deterministic or data-dependent bugs still require debugging the actual hardware system during execution. Recent work has focused on techniques to allow designers to perform in-system debug of HLS circuits in the context of the original software code; however, like RTL debug, the user must still determine the root cause of a bug using small execution traces, with lengthy debug turns. In this work, we demonstrate techniques aimed at reducing the time HLS designers spend performing in-system debug. Our approaches consist of performing data dependency analysis to guide the user in selecting which variables are observed by the debug instrumentation, as well as an associated debug overlay that allows for rapid reconfiguration of the debug logic, enabling rapid switching of variable observation between debug iterations. In addition, our overlay provides additional debug capability, such as selective function tracing and conditional buffer freeze points. We explore the area overhead of these different overlay features, showing a basic overlay with only a 1.7% increase in area overhead from the baseline debug instrumentation, while a deluxe variant offers 2×--7× improvement in trace buffer memory utilization with conditional buffer freeze support.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130457659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FOS","authors":"Anuj Vaishnav, K. Pham, Joseph Powell, Dirk Koch","doi":"10.1145/3405794","DOIUrl":"https://doi.org/10.1145/3405794","url":null,"abstract":"With FPGAs now being deployed in the cloud and at the edge, there is a need for scalable design methods that can incorporate the heterogeneity present in the hardware and software components of FPGA systems. Moreover, these FPGA systems need to be maintainable and adaptable to changing workloads while improving accessibility for the application developers. However, current FPGA systems fail to achieve modularity and support for multi-tenancy due to dependencies between system components and the lack of standardised abstraction layers. To solve this, we introduce a modular FPGA operating system – FOS, which adopts a modular FPGA development flow to allow each system component to be changed and be agnostic to the heterogeneity of EDA tool versions, hardware and software layers. Further, to dynamically maximise the utilisation transparently from the users, FOS employs resource-elastic scheduling to arbitrate the FPGA resources in both time and spatial domain for any type of accelerators. Our evaluation on different FPGA boards shows that FOS can provide performance improvements in both single-tenant and multi-tenant environments while substantially reducing the development time and, at the same time, improving flexibility.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"2018 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131234756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikolaos S. Alachiotis, Charalampos Vatsolakis, Grigorios Chrysos, D. Pnevmatikatos
{"title":"RAiSD-X","authors":"Nikolaos S. Alachiotis, Charalampos Vatsolakis, Grigorios Chrysos, D. Pnevmatikatos","doi":"10.1145/3364225","DOIUrl":"https://doi.org/10.1145/3364225","url":null,"abstract":"Detecting traces of positive selection in genomes carries theoretical significance and has practical applications from shedding light on the forces that drive adaptive evolution to the design of more effective drug treatments. The size of genomic datasets currently grows at an unprecedented pace, fueled by continuous advances in DNA sequencing technologies, leading to ever-increasing compute and memory requirements for meaningful genomic analyses. The majority of existing methods for positive selection detection either are not designed to handle whole genomes or scale poorly with the sample size; they inevitably resort to a runtime versus accuracy tradeoff, raising an alarming concern for the feasibility of future large-scale scans. To this end, we present RAiSD-X, a high-performance system that relies on a decoupled access-execute processing paradigm for efficient FPGA acceleration and couples a novel, to our knowledge, sliding-window algorithm for the recently introduced μ statistic with a mutation-driven hashing technique to rapidly detect patterns in the data. RAiSD-X achieves up to three orders of magnitude faster processing than widely used software implementations, and more importantly, it can exhaustively scan thousands of human chromosomes in minutes, yielding a scalable full-system solution for future studies of positive selection in species of flora and fauna.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129648762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DSL-Based Hardware Generation with Scala","authors":"F. Serre, Markus Püschel","doi":"10.1145/3359754","DOIUrl":"https://doi.org/10.1145/3359754","url":null,"abstract":"We present a hardware generator for computations with regular structure including the fast Fourier transform (FFT), sorting networks, and others. The input of the generator is a high-level description of the algorithm; the output is a token-based, synchronized design in the form of RTL-Verilog. Building on prior work, the generator uses several layers of domain-specific languages (DSLs) to represent and optimize at different levels of abstraction to produce a RAM- and area-efficient hardware implementation. Two of these layers and DSLs are novel. The first one allows the use and domain-specific optimization of state-of-the-art streaming permutations. The second DSL enables the automatic pipelining of a streaming hardware dataflow and the synchronization of its data-independent control signals. The generator including the DSLs are implemented in Scala, leveraging its type system, and uses concepts from lightweight modular staging (LMS) to handle the constraints of streaming hardware. Particularly, these concepts offer genericity over hardware number representation, including seamless switching between fixed-point arithmetic and FloPoCo generated IEEE floating-point operators, while ensuring type-safety. We show benchmarks of generated FFTs, sorting networks, and Walsh-Hadamard transforms that outperform prior generators.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129377673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The FPOA, a Medium-grained Reconfigurable Architecture for High-level Synthesis","authors":"J. Gorski, D. M. Hanna","doi":"10.1145/3340556","DOIUrl":"https://doi.org/10.1145/3340556","url":null,"abstract":"In this article, we present a novel type of medium-grained reconfigurable architecture that we term the Field Programmable Operation Array (FPOA). This device has been designed specifically for the implementation of HLS-generated circuitry. At the core of the FPOA is the OP-block. Unlike a standard LUT, an OP-block performs multi-bit operations through gate-based logic structures, translating into greater speed and efficiency in digital circuit implementation. Our device is not optimized for a specific application domain. Rather, we have created a device that is optimized for a specific circuit structure, namely those generated by HLS. This gives the FPOA a significant advantage as it can be used across all application domains. In this work, we add support for both distributed and block memory to the FPOA architecture. Experimental results show up to a 13.5× reduction in logic area and a 9.5× reduction in critical path delay for circuit implementation using the FPOA compared to a standard FPGA.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133317682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Design Flow Engine for the Support of Customized Dynamic High Level Synthesis Flows","authors":"M. Lattuada, Fabrizio Ferrandi","doi":"10.1145/3356475","DOIUrl":"https://doi.org/10.1145/3356475","url":null,"abstract":"High Level Synthesis is a set of methodologies aimed at generating hardware descriptions starting from specifications written in high-level languages. While these methodologies share different elements with traditional compilation flows, there are characteristics of the addressed problem which require ad hoc management. In particular, differently from most of the traditional compilation flows, the complexity and the execution time of the High Level Synthesis techniques are much less relevant than the quality of the produced results. For this reason, fixed-point analyses, as well as successive refinement optimizations, can be accepted, provided that they can improve the quality of the generated designs. This article presents a design flow engine for the description and the execution of complex and customized synthesis flows. It supports dynamic addition of passes and dependencies, cyclic dependencies, and selective pass invalidation. Experimental results show the benefits of such type of design flows with respect to static linear design flows when applied to High Level Synthesis.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132763501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GraVF-M","authors":"Nina Engelhardt, Hayden Kwok-Hay So","doi":"10.1145/3357596","DOIUrl":"https://doi.org/10.1145/3357596","url":null,"abstract":"Due to the irregular nature of connections in most graph datasets, partitioning graph analysis algorithms across multiple computational nodes that do not share a common memory inevitably leads to large amounts of interconnect traffic. Previous research has shown that FPGAs can outcompete software-based graph processing in shared memory contexts, but it remains an open question if this advantage can be maintained in distributed systems. In this work, we present GraVF-M, a framework designed to ease the implementation of FPGA-based graph processing accelerators for multi-FPGA platforms with distributed memory. Based on a lightweight description of the algorithm kernel, the framework automatically generates optimized RTL code for the whole multi-FPGA design. We exploit an aspect of the programming model to present a familiar message-passing paradigm to the user, while under the hood implementing a more efficient architecture that can reduce the necessary inter-FPGA network traffic by a factor equal to the average degree of the input graph. A performance model based on a theoretical analysis of the factors influencing performance serves to evaluate the efficiency of our implementation. With a throughput of up to 5.8GTEPS (billions of traversed edges per second) on a 4-FPGA system, the designs generated by GraVF-M compare favorably to state-of-the-art frameworks from the literature and reach 94% of the projected performance limit of the system.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"339 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133992954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}