{"title":"Embedding-based placement of processing element networks on FPGAs for physical model simulation","authors":"Bailey Miller, F. Vahid, T. Givargis","doi":"10.1145/2435264.2435297","DOIUrl":"https://doi.org/10.1145/2435264.2435297","url":null,"abstract":"Physical models utilize mathematical equations to model physical systems like airway mechanics, neuron networks, or chemical reactions. Previous work has shown that physical models can execute fast on FPGAs (field-programmable gate arrays). We introduce an approach for implementing physical models on FPGAs that applies graph theoretic techniques to make use of a physical model's natural structure--tree, ring, chain, etc.--resulting in model execution speedups. A first phase of the approach maps physical model equations to a structured virtual PE (processing element) graph using graph theoretic folding techniques. A second phase maps the structured virtual PE graph to physical PE regions on an FPGA using graph embedding theory. We also present a simulated annealing approach with custom cost and neighbor functions that can map any physical model onto an FPGA with low wire costs. Average circuit speedup improvements over previous works for various physical models are 65% using the graph embedding and 35% using the simulated annealing approach. Each approach's more efficient use of FPGA resources also enables larger models to be implemented on an FPGA device.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"15 2","pages":"181-190"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91438105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A latency-optimized hybrid network for clustering FPGAs (abstract only)","authors":"Trevor Bunker, S. Swanson","doi":"10.1145/2435264.2435312","DOIUrl":"https://doi.org/10.1145/2435264.2435312","url":null,"abstract":"The data-intensive applications that will shape computing in the coming decades require scalable architectures that incorporate scalable data and compute resources and can support unstructured (e.g., logs) and semi-structured (e.g., large graph, XML) data sets. To explore the suitability of FPGAs for these computations, we are constructing an FPGA-based system with a memory capacity of 512 GB from a collection of 32 Virtex-5 FPGAs spread across 8 enclosures. This poster describes the system's interconnect that combines inter-enclosure high-speed serial links and wide, single-ended intra-enclosure on-board traces with a network topology that optimizes for latency and bandwidth for small packets. The network uses a multi-level radix-12 router optimized for the asymmetry between the inter- and intra-enclosure links. The system has a peak theoretical bisection bandwidth of 247.2 Gb/s and a total switching capacity of 2.13 Tb/s. Under random traffic, the network sustains an aggregate throughput of 354.3 Gb/s. The channel transceivers and router consume 22% of the FPGAs' Block RAMs and 33% of their FPGA Slices.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"45 1","pages":"266"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90765173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B. Silva, An Braeken, E. D'Hollander, A. Touhafi, Jan G. Cornelis, J. Lemeire
{"title":"Performance and toolchain of a combined GPU/FPGA desktop (abstract only)","authors":"B. Silva, An Braeken, E. D'Hollander, A. Touhafi, Jan G. Cornelis, J. Lemeire","doi":"10.1145/2435264.2435336","DOIUrl":"https://doi.org/10.1145/2435264.2435336","url":null,"abstract":"Low-power, high-performance computing nowadays relies on accelerator cards to speed up the calculations. Combining the power of GPUs with the flexibility of FPGAs enlarges the scope of problems that can be accelerated [2, 3]. We describe the performance analysis of a desktop equipped with a GPU Tesla 2050 and an FPGA Virtex-6 LX240T. First, the balance between the I/O and the raw peak performance is depicted using the roofline model [4]. Next, the performance of a number of image processing algorithms is measured and the results are mapped onto the roofline graph. This allows to compare the GPU and the FPGA and also to optimize the algorithms for both accelerators. A programming toolchain is implemented, consisting of OpenCL for the GPU and several High-Level Synthesis compilers for the FPGA. Our results show that the HLS compilers outperform handwritten code and offer a performance comparable to the GPU. In addition the FPGA compilers reduce the development time by an order of magnitude, at the expense of an increased resource consumption. The roofline model also shows that both accelerators are equally limited by the input/output bandwidth to the host. A well-tuned accelerator-based codesign, identifying the parallelism, the computation and data patterns of different classes of algorithms, will enable to maximize the performance of the combined GPU/FPGA system [1].","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"80 1","pages":"274"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72774431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sensing nanosecond-scale voltage attacks and natural transients in FPGAs","authors":"K. Zick, Meeta Srivastav, Wei Zhang, M. French","doi":"10.1145/2435264.2435283","DOIUrl":"https://doi.org/10.1145/2435264.2435283","url":null,"abstract":"Voltage noise not only detracts from reliability and performance, but has been used to attack system security. Most systems are completely unaware of fluctuations occurring on nanosecond time scales. This paper quantifies the threat to FPGA-based systems and presents a solution approach. Novel measurements of transients on 28nm FPGAs show that extreme activity in the fabric can cause enormous undershoot and overshoot, more than 10× larger than what is allowed by the specification. An existing voltage sensor is evaluated and shown to be insufficient. Lastly, a sensor design using reconfigurable logic is presented; its time-to-digital converter enables sample rates 500× faster than the 28nm Xilinx ADC. This enables quick characterization of transients that would normally go undetected, thereby providing potentially useful data for system optimization and helping to defend against supply voltage attacks.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"101-104"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88299071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimum energy operation for clustered island-style FPGAs","authors":"P. Grossmann, M. Leeser, M. Onabajo","doi":"10.1145/2435264.2435293","DOIUrl":"https://doi.org/10.1145/2435264.2435293","url":null,"abstract":"Despite the advantages offered by field-programmable gate arrays (FPGAs) for low-power systems requiring flexible computing resources, applications with the lowest power budgets still favor microprocessors and application-specific integrated circuits (ASICs). In order for such systems to exploit FPGAs, an FPGA achieving minimum energy operation is needed. Minimum energy points have been found for ASICs and microprocessors to occur at operating voltages that are typically below the transistor threshold voltage. This paper presents two clustered island-style test chips capable of operating with a single supply voltage as low as 260 mV. This supply voltage represents the lowest voltage at which an FPGA has been successfully programmed. Test chip measurements show that the minimum energy point of both circuits is at or below this minimum operating voltage. Operation at 260 mV leads to a 40X power-delay product reduction vs. 1.5V operation. The results demonstrate a clear path forward for fabricating low voltage FPGAs that are fully compatible with existing tool flows.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"40 1","pages":"157-166"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77478781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA meta-data management system for accelerating implementation time with incremental compilation (abstract only)","authors":"A. Love, P. Athanas","doi":"10.1145/2435264.2435323","DOIUrl":"https://doi.org/10.1145/2435264.2435323","url":null,"abstract":"","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"26 1","pages":"269"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87537145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AutoMapper: an automated tool for optimal hardware resource allocation for networking applications on FPGA (abstract only)","authors":"Swapnil Haria, V. Prasanna","doi":"10.1145/2435264.2435335","DOIUrl":"https://doi.org/10.1145/2435264.2435335","url":null,"abstract":"It has now become imperative for routers to support complicated lookup schemes, based on the specific function of the networking hardware. It is no longer possible to ensure an optimal resource utilization using manual organization techniques due to the increasing complexity of lookup schemes, as well as the large number of potential implementation choices. We have developed an automated tool, AutoMapper, which can map lookup schemes onto a particular target architecture optimally, thereby providing a superior alternative to the time-consuming and resource inefficient technique of manual conversion. It is based on an Integer Linear Programming (ILP) formulation that is able to allocate the limited hardware resources for a single lookup scheme, while optimizing any of the three performance metrics of latency, throughput or power consumption. Accurate formulation of the objective function and the constraint equations guarantee optimality in terms of the chosen performance metric. We demonstrate the operation of the developed tool, by successfully mapping complex real world lookup schemes onto a state-of-the art FPGA device, with execution times being under a second on a dual-core computer with 4 GB of RAM, running at 2.40 GHz.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"60 1","pages":"274"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90637573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Are FPGAs suffering from the innovator's dilemna?","authors":"Vaughn Betz, J. Cong","doi":"10.1145/2435264.2435289","DOIUrl":"https://doi.org/10.1145/2435264.2435289","url":null,"abstract":"FPGAs constitute a highly profitable industry, with approximately $5 billion of sales per year. High barriers to entry keep most companies away, and enable high profit margins for the incumbents. The industry has grown greatly over the years, but still constitutes a small portion of the overall semiconductor market. This raises the question to be addressed by this panel: is the FPGA community innovating as much as it should, or is a bias to maintain high profit margins and protect the cash flow of the current FPGA market holding us back from exploring new ideas and products that could greatly expand the appeal of and market for FPGA-related technology? This would be a classic case of the innovator's dilemma defined by Clayton Christensen: it is difficult for a company to engage in creative destruction of a cash cow product.\u0000 Our distinguished panel of experts will discuss whether we are seeing major innovation in architectures, design flows and applications, or only incremental improvements. We will also discuss if any new (possibly low margin) application domain is left out by the FPGA industry, what radical ideas should be explored, and whether large incumbents, new startups, academia or some combination are best able to attack these new areas.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"42 1","pages":"135-136"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85548962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Circuit optimizations to minimize energy in the global interconnect of a low-power--FPGA (abstract only)","authors":"Oluseyi A. Ayorinde, B. Calhoun","doi":"10.1145/2435264.2435341","DOIUrl":"https://doi.org/10.1145/2435264.2435341","url":null,"abstract":"We compare circuit and architecture choices in the global interconnect of an FPGA in order to find the minimum energy design for low voltage operation. We look at switch box topology, number of repeaters, receiver circuit topology, and dynamic voltage selection, all with the intent of minimizing energy consumption. The results show that using a pass gate switchbox topology with repeaters in the interconnect and a custom receiver lowers delay by up to 63% and energy by up to 87% from the standard FPGA circuit choices. This work also identifies the optimal VDD choices to maximize performance under energy constraints or vice versa.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"18 1","pages":"277"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74089841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Defect recovery in nanodevice-based programmable interconnects (abstract only)","authors":"J. Cong, Bingjun Xiao","doi":"10.1145/2435264.2435343","DOIUrl":"https://doi.org/10.1145/2435264.2435343","url":null,"abstract":"This work focuses on defect tolerance for nanodevice-based programmable interconnects of FPGAs. A single nanodevice can function as a routing switch in place of a pass transistor and its six-transistor SRAM cell in conventional FPGAs. Defects of nanodevices in programmable interconnects are manifested as losses of configurability and can be categorized into stuck- open defect and stuck- closed defect. First, we show that the stuck-closed defects of nanodevices have a much higher impact than the stuck-open defects. Instead of simply avoiding the stuck-closed defects, we recover them by treating them as shorting constraints in the routing. We develop a scalable algorithm to perform timing-driven routing under these extra constraints. We extend the idea of the resource negotiation to balance the goals of timing and routability under shorting constraints. We also develop several techniques to guide the router to map the shorting clusters to those nets with more shared paths for better utilization of routing resources while automatically balancing it with circuit performance. We also enhance the placement algorithm to recover logic blocks which become virtually unusable due to shorted pins. Simulation results show that at the up-to-date level of nanodevice defects (108-1011x higher than CMOS), compared to the simple avoidance method, our approach reduces the degradation of resource usage by 87%, improves the routability by 37%, and reduce the degradation of circuit performance by 36%, at a negligible overhead of tool runtime.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"2 1","pages":"277-278"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90214192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}