{"title":"Embedding-based placement of processing element networks on FPGAs for physical model simulation","authors":"Bailey Miller, F. Vahid, T. Givargis","doi":"10.1145/2435264.2435297","DOIUrl":"https://doi.org/10.1145/2435264.2435297","url":null,"abstract":"Physical models utilize mathematical equations to model physical systems like airway mechanics, neuron networks, or chemical reactions. Previous work has shown that physical models can execute fast on FPGAs (field-programmable gate arrays). We introduce an approach for implementing physical models on FPGAs that applies graph theoretic techniques to make use of a physical model's natural structure--tree, ring, chain, etc.--resulting in model execution speedups. A first phase of the approach maps physical model equations to a structured virtual PE (processing element) graph using graph theoretic folding techniques. A second phase maps the structured virtual PE graph to physical PE regions on an FPGA using graph embedding theory. We also present a simulated annealing approach with custom cost and neighbor functions that can map any physical model onto an FPGA with low wire costs. Average circuit speedup improvements over previous works for various physical models are 65% using the graph embedding and 35% using the simulated annealing approach. Each approach's more efficient use of FPGA resources also enables larger models to be implemented on an FPGA device.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"15 2","pages":"181-190"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91438105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A latency-optimized hybrid network for clustering FPGAs (abstract only)","authors":"Trevor Bunker, S. Swanson","doi":"10.1145/2435264.2435312","DOIUrl":"https://doi.org/10.1145/2435264.2435312","url":null,"abstract":"The data-intensive applications that will shape computing in the coming decades require scalable architectures that incorporate scalable data and compute resources and can support unstructured (e.g., logs) and semi-structured (e.g., large graph, XML) data sets. To explore the suitability of FPGAs for these computations, we are constructing an FPGA-based system with a memory capacity of 512 GB from a collection of 32 Virtex-5 FPGAs spread across 8 enclosures. This poster describes the system's interconnect that combines inter-enclosure high-speed serial links and wide, single-ended intra-enclosure on-board traces with a network topology that optimizes for latency and bandwidth for small packets. The network uses a multi-level radix-12 router optimized for the asymmetry between the inter- and intra-enclosure links. The system has a peak theoretical bisection bandwidth of 247.2 Gb/s and a total switching capacity of 2.13 Tb/s. Under random traffic, the network sustains an aggregate throughput of 354.3 Gb/s. The channel transceivers and router consume 22% of the FPGAs' Block RAMs and 33% of their FPGA Slices.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"45 1","pages":"266"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90765173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B. Silva, An Braeken, E. D'Hollander, A. Touhafi, Jan G. Cornelis, J. Lemeire
{"title":"Performance and toolchain of a combined GPU/FPGA desktop (abstract only)","authors":"B. Silva, An Braeken, E. D'Hollander, A. Touhafi, Jan G. Cornelis, J. Lemeire","doi":"10.1145/2435264.2435336","DOIUrl":"https://doi.org/10.1145/2435264.2435336","url":null,"abstract":"Low-power, high-performance computing nowadays relies on accelerator cards to speed up the calculations. Combining the power of GPUs with the flexibility of FPGAs enlarges the scope of problems that can be accelerated [2, 3]. We describe the performance analysis of a desktop equipped with a GPU Tesla 2050 and an FPGA Virtex-6 LX240T. First, the balance between the I/O and the raw peak performance is depicted using the roofline model [4]. Next, the performance of a number of image processing algorithms is measured and the results are mapped onto the roofline graph. This allows to compare the GPU and the FPGA and also to optimize the algorithms for both accelerators. A programming toolchain is implemented, consisting of OpenCL for the GPU and several High-Level Synthesis compilers for the FPGA. Our results show that the HLS compilers outperform handwritten code and offer a performance comparable to the GPU. In addition the FPGA compilers reduce the development time by an order of magnitude, at the expense of an increased resource consumption. The roofline model also shows that both accelerators are equally limited by the input/output bandwidth to the host. A well-tuned accelerator-based codesign, identifying the parallelism, the computation and data patterns of different classes of algorithms, will enable to maximize the performance of the combined GPU/FPGA system [1].","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"80 1","pages":"274"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72774431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sensing nanosecond-scale voltage attacks and natural transients in FPGAs","authors":"K. Zick, Meeta Srivastav, Wei Zhang, M. French","doi":"10.1145/2435264.2435283","DOIUrl":"https://doi.org/10.1145/2435264.2435283","url":null,"abstract":"Voltage noise not only detracts from reliability and performance, but has been used to attack system security. Most systems are completely unaware of fluctuations occurring on nanosecond time scales. This paper quantifies the threat to FPGA-based systems and presents a solution approach. Novel measurements of transients on 28nm FPGAs show that extreme activity in the fabric can cause enormous undershoot and overshoot, more than 10× larger than what is allowed by the specification. An existing voltage sensor is evaluated and shown to be insufficient. Lastly, a sensor design using reconfigurable logic is presented; its time-to-digital converter enables sample rates 500× faster than the 28nm Xilinx ADC. This enables quick characterization of transients that would normally go undetected, thereby providing potentially useful data for system optimization and helping to defend against supply voltage attacks.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"101-104"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88299071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimum energy operation for clustered island-style FPGAs","authors":"P. Grossmann, M. Leeser, M. Onabajo","doi":"10.1145/2435264.2435293","DOIUrl":"https://doi.org/10.1145/2435264.2435293","url":null,"abstract":"Despite the advantages offered by field-programmable gate arrays (FPGAs) for low-power systems requiring flexible computing resources, applications with the lowest power budgets still favor microprocessors and application-specific integrated circuits (ASICs). In order for such systems to exploit FPGAs, an FPGA achieving minimum energy operation is needed. Minimum energy points have been found for ASICs and microprocessors to occur at operating voltages that are typically below the transistor threshold voltage. This paper presents two clustered island-style test chips capable of operating with a single supply voltage as low as 260 mV. This supply voltage represents the lowest voltage at which an FPGA has been successfully programmed. Test chip measurements show that the minimum energy point of both circuits is at or below this minimum operating voltage. Operation at 260 mV leads to a 40X power-delay product reduction vs. 1.5V operation. The results demonstrate a clear path forward for fabricating low voltage FPGAs that are fully compatible with existing tool flows.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"40 1","pages":"157-166"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77478781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA meta-data management system for accelerating implementation time with incremental compilation (abstract only)","authors":"A. Love, P. Athanas","doi":"10.1145/2435264.2435323","DOIUrl":"https://doi.org/10.1145/2435264.2435323","url":null,"abstract":"","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"26 1","pages":"269"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87537145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AutoMapper: an automated tool for optimal hardware resource allocation for networking applications on FPGA (abstract only)","authors":"Swapnil Haria, V. Prasanna","doi":"10.1145/2435264.2435335","DOIUrl":"https://doi.org/10.1145/2435264.2435335","url":null,"abstract":"It has now become imperative for routers to support complicated lookup schemes, based on the specific function of the networking hardware. It is no longer possible to ensure an optimal resource utilization using manual organization techniques due to the increasing complexity of lookup schemes, as well as the large number of potential implementation choices. We have developed an automated tool, AutoMapper, which can map lookup schemes onto a particular target architecture optimally, thereby providing a superior alternative to the time-consuming and resource inefficient technique of manual conversion. It is based on an Integer Linear Programming (ILP) formulation that is able to allocate the limited hardware resources for a single lookup scheme, while optimizing any of the three performance metrics of latency, throughput or power consumption. Accurate formulation of the objective function and the constraint equations guarantee optimality in terms of the chosen performance metric. We demonstrate the operation of the developed tool, by successfully mapping complex real world lookup schemes onto a state-of-the art FPGA device, with execution times being under a second on a dual-core computer with 4 GB of RAM, running at 2.40 GHz.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"60 1","pages":"274"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90637573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Are FPGAs suffering from the innovator's dilemna?","authors":"Vaughn Betz, J. Cong","doi":"10.1145/2435264.2435289","DOIUrl":"https://doi.org/10.1145/2435264.2435289","url":null,"abstract":"FPGAs constitute a highly profitable industry, with approximately $5 billion of sales per year. High barriers to entry keep most companies away, and enable high profit margins for the incumbents. The industry has grown greatly over the years, but still constitutes a small portion of the overall semiconductor market. This raises the question to be addressed by this panel: is the FPGA community innovating as much as it should, or is a bias to maintain high profit margins and protect the cash flow of the current FPGA market holding us back from exploring new ideas and products that could greatly expand the appeal of and market for FPGA-related technology? This would be a classic case of the innovator's dilemma defined by Clayton Christensen: it is difficult for a company to engage in creative destruction of a cash cow product.\u0000 Our distinguished panel of experts will discuss whether we are seeing major innovation in architectures, design flows and applications, or only incremental improvements. We will also discuss if any new (possibly low margin) application domain is left out by the FPGA industry, what radical ideas should be explored, and whether large incumbents, new startups, academia or some combination are best able to attack these new areas.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"42 1","pages":"135-136"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85548962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving bitstream compression by modifying FPGA architecture","authors":"S. A. Razavi, M. S. Zamani","doi":"10.1145/2435264.2435294","DOIUrl":"https://doi.org/10.1145/2435264.2435294","url":null,"abstract":"The size of configuration bitstreams of field-programmable gate arrays (FPGA) is increasing rapidly. Compression techniques are used to decrease the size of bitstreams. In this paper, an appropriate bitstream format and variable symbol lengths are proposed to utilize the routing patterns for enhancing the compression efficiency. An order of inputs of multiplexers in switch modules is also proposed to improve the symbol statistics and hence, the compression efficiency. A framework to generate the bitstream and hardware description of FPGAs is developed as well. Experimental results over 20 MCNC benchmarks show that by applying the proposed approaches, the compression rate is improved by 46% on average compared to the methods with fixed symbol lengths without any area and performance degradation.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"167-170"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86492640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architecture support for custom instructions with memory operations","authors":"J. Cong, Karthik Gururaj","doi":"10.1145/2435264.2435303","DOIUrl":"https://doi.org/10.1145/2435264.2435303","url":null,"abstract":"Customized instructions (CIs) implemented using custom functional units (CFUs) have been proposed as a way of improving performance and energy efficiency of software while minimizing cost of designing and verifying accelerators from scratch. However, previous work allows CIs to only communicate with the processor through registers or with limited memory operations. In this work we propose an architecture that allows CIs to seamlessly execute memory operations without any special synchronization operations to guarantee program order of instructions. Our results show that our architecture can provide 24% energy savings with 14% performance improvement for 2-issue and 4-issue superscalar processor cores.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"231-234"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81276494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}