Benjamin Gojman, Sirisha Nalmela, Nikil Mehta, N. Howarth, A. DeHon. "GROK-LAB: generating real on-chip knowledge for intra-cluster delays using timing extraction." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, pp. 81-90. DOI: 10.1145/2435264.2435281
Abstract: Timing Extraction identifies the delay of fine-grained components within an FPGA. From these computed delays, the delay of any path can be calculated. Moreover, a comparison of the fine-grained delays allows a detailed understanding of the amount and type of process variation that exists in the FPGA. To obtain these delays, Timing Extraction measures, using only resources already available in the FPGA, the delay of a small subset of the total paths in the FPGA. We apply Timing Extraction to the Logic Array Block (LAB) on an Altera Cyclone III FPGA to obtain a view of the delay down to near individual LUT granularity, characterizing components with delays on the order of a few hundred picoseconds with a resolution of ±3.2 ps. This information reveals that the 65 nm process used has, on average, random variation of σ/μ = 4.0%, with components having an average maximum spread of 83 ps. Timing Extraction also shows that as VDD decreases from 1.2 V to 0.9 V in a Cyclone IV 60 nm FPGA, paths slow down and variation increases from σ/μ = 4.3% to σ/μ = 5.8%, a clear indication that lowering VDD magnifies the impact of random variation.
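The core idea of Timing Extraction can be illustrated in a few lines: each measurable path delay is the sum of the component delays along it, so measuring a small set of overlapping paths yields a linear system whose solution is the per-component delays. The sketch below is a hypothetical toy example (three components, three paths, made-up picosecond values), not the paper's actual measurement infrastructure or solver:

```python
def solve_component_delays(paths, measurements):
    """Solve the square linear system (incidence matrix) * delays = measurements
    by Gaussian elimination. `paths` lists, for each measured path, the indices
    of the components it traverses."""
    n = len(measurements)
    # Incidence matrix: row i has a 1 for every component on path i.
    a = [[0.0] * n for _ in range(n)]
    for i, comps in enumerate(paths):
        for c in comps:
            a[i][c] = 1.0
    b = list(measurements)
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x

# Three paths through components A, B, C (indices 0, 1, 2), delays in ps:
# A+B = 500, B+C = 600, A+C = 700  =>  A = 300, B = 200, C = 400.
delays = solve_component_delays([[0, 1], [1, 2], [0, 2]], [500.0, 600.0, 700.0])
```

With enough independent path measurements the same scheme recovers delays at near-LUT granularity, which is what lets the paper report per-component variation statistics.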
P. Zemčík, Roman Juránek, Petr Musil, M. Musil, Michal Hradiš. "High performance architecture for object detection in streamed video (abstract only)." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, p. 268. DOI: 10.1145/2435264.2435319
Abstract: Object detection is one of the key tasks in computer vision. It is computationally intensive, and it is therefore reasonable to accelerate it in hardware. The possible benefits of acceleration are a reduced computational load on the host computer system, higher overall application performance, and lower power consumption. We present a novel architecture for multi-scale object detection in video streams. The architecture uses scanning-window classifiers produced by the WaldBoost learning algorithm together with simple image features. It employs a small image buffer for the data under processing, and on-the-fly scaling units enable detection of objects at multiple scales. The whole processing chain is pipelined, so multiple image windows are processed in parallel. We implemented the engine in a Spartan 6 FPGA and show that it can process 640x480 pixel video streams at over 160 frames per second without external memory. The design takes only a fraction of the resources of comparable state-of-the-art approaches.
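The scanning-window approach the abstract describes can be sketched in software; the hardware pipeline keeps many of these window evaluations in flight at once. This is an illustrative toy (the classifier below is a made-up mean threshold, not WaldBoost):

```python
def scan(image, w, h, classify):
    """Slide a w-by-h window over a 2-D list and yield the top-left corners
    of windows the classifier accepts."""
    rows, cols = len(image), len(image[0])
    for y in range(rows - h + 1):
        for x in range(cols - w + 1):
            # In the FPGA engine these windows come from a small image buffer
            # and are evaluated in a pipeline, many positions in parallel.
            window = [row[x:x + w] for row in image[y:y + h]]
            if classify(window):
                yield (x, y)

# Toy "classifier": accept 2x2 windows whose mean brightness exceeds 5.
img = [[0, 0, 9, 9],
       [0, 0, 9, 9],
       [0, 0, 0, 0]]
hits = list(scan(img, 2, 2, lambda win: sum(map(sum, win)) / 4 > 5))
```

Multi-scale detection then amounts to running the same scan over on-the-fly rescaled copies of the frame, which is exactly what the paper's scaling units provide.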
Udit Dhawan, A. DeHon. "Area-efficient near-associative memories on FPGAs." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, pp. 191-200. DOI: 10.1145/2435264.2435298
Abstract: Associative memories can map sparsely used keys to values with low latency but can incur heavy area overheads. The lack of customized hardware for associative memories in today's mainstream FPGAs exacerbates the cost of building these memories out of fixed-address-match BRAMs. In this paper, we develop a new, FPGA-friendly memory architecture based on a multiple-hash scheme that achieves near-associative performance (less than 5% of evictions due to conflicts) without the area overheads of a fully associative memory on FPGAs. Using the proposed architecture as a 64KB L1 data cache, we show that it achieves near-associative miss rates for a set of benchmark programs from the SPEC2006 suite while consuming 6-7× less FPGA memory resources than fully associative memories generated by the Xilinx Coregen tool. Benefits increase with match width, allowing area reduction up to 100×. At the same time, the new architecture has lower latency than the fully associative memory: 3.7 ns for a 1024-entry flat version, or 6.1 ns for an area-efficient version, compared to 8.8 ns for a fully associative memory with a 64b key.
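The multiple-hash scheme can be sketched behaviorally: each key has one candidate slot in each of d direct-mapped banks (one BRAM-friendly table per hash), so a conflict in one bank is usually absorbed by a free slot in another, and only when all d candidates are occupied does an eviction occur. The following is a hypothetical software model, not the paper's exact design (the per-bank hash here is a toy multiplier):

```python
class MultiHashMemory:
    """d direct-mapped banks, each indexed by a different hash of the key;
    a key may live in any bank, so most single-bank conflicts are absorbed."""

    def __init__(self, banks=4, slots_per_bank=256):
        self.slots = slots_per_bank
        self.banks = [dict() for _ in range(banks)]  # slot index -> (key, value)
        self.evictions = 0

    def _index(self, key, bank):
        # Toy per-bank hash (odd multiplier); a real design uses stronger hashes.
        return (key * (2 * bank + 1)) % self.slots

    def insert(self, key, value):
        # Prefer any bank whose slot is free or already holds this key.
        for b, bank in enumerate(self.banks):
            i = self._index(key, b)
            if i not in bank or bank[i][0] == key:
                bank[i] = (key, value)
                return
        # All d candidate slots are taken by other keys: conflict eviction.
        self.evictions += 1
        self.banks[0][self._index(key, 0)] = (key, value)

    def lookup(self, key):
        # In hardware all banks are probed in parallel; the key (tag) compare
        # selects the matching bank's value.
        for b, bank in enumerate(self.banks):
            entry = bank.get(self._index(key, b))
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

m = MultiHashMemory(banks=2, slots_per_bank=8)
for k in range(10):
    m.insert(k, 2 * k)
```

Note that keys 8 and 9 collide with keys 0 and 1 in bank 0 but land in bank 1, so no evictions occur; a single direct-mapped table of the same total capacity would have evicted both.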
L. Pouchet, Peng Zhang, P. Sadayappan, J. Cong. "Polyhedral-based data reuse optimization for configurable computing." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, pp. 29-38. DOI: 10.1145/2435264.2435273
Abstract: Many applications, such as medical imaging, generate intensive data traffic between the FPGA and off-chip memory. Significant improvements in execution time can be achieved with effective utilization of on-chip (scratchpad) memories, associated with careful software-based data reuse and communication scheduling techniques. We present a fully automated C-to-FPGA framework to address this problem. Our framework effectively implements data reuse through aggressive loop-transformation-based program restructuring. In addition, our proposed framework automatically implements critical optimizations for performance such as task-level parallelization, loop pipelining, and data prefetching. We leverage the power and expressiveness of the polyhedral compilation model to develop a multi-objective optimization system for off-chip communications management. Our technique can satisfy hardware resource constraints (scratchpad size) while still aggressively exploiting data reuse. Our approach can also be used to reduce the on-chip buffer size subject to a bandwidth constraint. We also implement a fast design space exploration technique for effective optimization of program performance using the Xilinx high-level synthesis tool.
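The kind of reuse transformation the framework automates can be shown on a deliberately tiny case. This is an illustrative hand-written sketch (a 1-D 3-point stencil with a hypothetical block size), not output of the paper's tool: each block of the input, plus a one-element halo, is copied into a scratchpad once, and all three reads per output point are then served on-chip instead of from off-chip memory.

```python
def stencil_with_reuse(a, block):
    """1-D 3-point stencil; each block is fetched once (with halo) into a
    scratchpad list, and all reuses are served from that local copy."""
    n = len(a)
    out = [0] * n
    offchip_reads = 0
    for start in range(0, n, block):
        # One burst transfer per block: the block plus a one-element halo.
        lo, hi = max(0, start - 1), min(n, start + block + 1)
        scratch = a[lo:hi]
        offchip_reads += hi - lo
        # All three reads per point now hit the scratchpad.
        for i in range(start, min(n, start + block)):
            j = i - lo
            left = scratch[j - 1] if i > 0 else 0
            right = scratch[j + 1] if i < n - 1 else 0
            out[i] = left + scratch[j] + right
    return out, offchip_reads

out, reads = stencil_with_reuse(list(range(8)), block=4)
# Naively, each of the 8 points would issue its own 2-3 off-chip reads;
# with blocking, only 10 off-chip element reads occur in total.
```

Choosing the block size is exactly the trade-off the paper optimizes: a larger block means fewer halo re-fetches but a bigger scratchpad, subject to the on-chip memory budget.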
Jingfei Jiang, Rongdong Hu, M. Luján. "Effect of fixed-point arithmetic on deep belief networks (abstract only)." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, p. 273. DOI: 10.1145/2435264.2435331
Abstract: Deep Belief Networks (DBNs) are state-of-the-art learning algorithms built from stacked Restricted Boltzmann Machines (RBMs). DBNs are computationally intensive, raising the question of whether they can be accelerated on FPGAs. Fixed-point arithmetic can have an important influence on both the execution time and the prediction accuracy of a DBN. Previous studies have focused only on customized RBM accelerators with a fixed data-width. Our experiments demonstrate that variable data-widths can obtain similar performance levels, and that the most suitable data-widths for different types of DBN are neither unique nor fixed. From this we conclude that a DBN accelerator should support various data-widths rather than the single fixed width used in previous work. The processing performance of DBN accelerators on FPGAs is almost always constrained not by the capacity of the processing units but by on-chip RAM capacity and speed. We propose an efficient memory subsystem combining junction and padding methods to reduce bandwidth usage in DBN accelerators, which shows that supporting various data-widths is not as difficult as it may sound: the hardware cost is small and does not affect the critical path. We also design a generation tool that lets users flexibly reconfigure the memory subsystem for arbitrary data-widths; the tool can further serve as an advanced IP core generator above the FPGA memory controller, supporting parallel memory access at irregular data-widths for other applications.
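The data-width trade-off the abstract studies is easy to demonstrate: quantizing weights to a signed fixed-point format with fewer fractional bits saves storage and bandwidth at the cost of precision. A minimal sketch, with made-up weight values and Q-formats chosen purely for illustration (not the paper's DBN configurations):

```python
def to_fixed(x, int_bits, frac_bits):
    """Round x to a signed fixed-point value with (int_bits + frac_bits)
    total bits (sign included in int_bits), saturating at the range limits."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits - 1))
    hi = (1 << (int_bits + frac_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale))) / scale

weights = [0.1, -0.375, 0.8125, -1.2]
w8  = [to_fixed(w, 2, 6)  for w in weights]   # 8-bit format  (Q2.6)
w16 = [to_fixed(w, 2, 14) for w in weights]   # 16-bit format (Q2.14)

# Worst-case absolute quantization error at each width.
err8  = max(abs(a - b) for a, b in zip(weights, w8))
err16 = max(abs(a - b) for a, b in zip(weights, w16))
```

An accelerator fixed at one width pays either the 16-bit bandwidth everywhere or the 8-bit error everywhere; supporting several widths, as the paper argues, lets each network layer pick the narrowest format that preserves its prediction accuracy.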
Meng Yang, J. Tong, A. Almaini. "Indirect connection aware attraction for FPGA clustering (abstract only)." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, p. 265. DOI: 10.1145/2435264.2435310
Abstract: An indirect-connection-aware attraction clustering algorithm is proposed for the clustered FPGA architecture model to optimize several performance metrics simultaneously. A new cost function considers the attraction of candidate basic logic elements (BLEs) to the selected cluster, the number of pins already used in the cluster, and the critical-path delay. The attractions of BLEs that are both directly and indirectly connected to the selected cluster are taken into account. As a result, more external nets are absorbed into clusters, and fewer pins per cluster and fewer clusters are required; hence a smaller channel width suffices for routing and the speed of the design improves. Detailed performance comparisons are carried out against the state-of-the-art clustering techniques iRAC (interconnect resource aware clustering) and MO-Pack (many-objective clustering). Results show that the proposed algorithm outperforms these two approaches, reducing channel width by 38.8% and 42.2%, respectively, and the number of external nets by 40.1% and 44.8%, with no critical-path or area overhead.
Satoshi Jo, A. M. Gharehbaghi, Takeshi Matsumoto, M. Fujita. "Rectification of advanced microprocessors without changing routing on FPGAs (abstract only)." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, p. 279. DOI: 10.1145/2435264.2435347
Abstract: We propose a method for rectifying bugs in microprocessors implemented on FPGAs by changing only the configuration of LUTs, without any modification to the routing. Correcting a bug therefore does not require resynthesis, which can take very long for complex microprocessors due to possible timing closure problems; and since the structure of the circuit is preserved, the correction does not affect the timing of the circuit. In the design phase, we may add extra LUTs to the original circuit so that they are available in the correction phase. After a bug is found, we perform two tasks. First, we find the candidate control signals as well as the change required to correct their behavior, using symbolic simulation and equivalence checking between the formal specification and the erroneous formal model of the processor. Then, we map the corrected functionality onto the existing LUT structure. This is done by a novel method that formulates the problem as a QBF (Quantified Boolean Formula) problem and solves it by repeatedly applying normal SAT solvers, instead of QBF solvers, under a CEGAR (Counter-Example Guided Abstraction Refinement) paradigm. We show the effectiveness of our method by correcting bugs in two complex out-of-order superscalar processors with two different timing-error recovery mechanisms.
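The second task above, remapping a corrected function onto fixed wiring, has a simple core: the routing decides which k signals feed a LUT, and the question is whether some setting of the LUT's 2^k configuration bits realizes the corrected function of those signals. The paper answers this with QBF/SAT under CEGAR; the exhaustive toy search below (hypothetical, k = 2 only) just illustrates the shape of the problem:

```python
from itertools import product

def remap_lut(k, target):
    """Return LUT configuration bits (tuple of 2**k entries) that implement
    `target` on the LUT's fixed inputs, or None if the wiring cannot express it."""
    for config in product((0, 1), repeat=2 ** k):
        # The input pattern, read as a binary number, addresses the config bit.
        if all(config[int("".join(map(str, ins)), 2)] == target(*ins)
               for ins in product((0, 1), repeat=k)):
            return config
    return None

# Corrected control signal: the fix needs XOR where the buggy LUT computed OR.
cfg = remap_lut(2, lambda a, b: a ^ b)
```

Because the search space is 2^(2^k) configurations per LUT, and candidate fixes span many LUTs, the brute force above does not scale; encoding "exists a configuration that works for all inputs" as a QBF instance and refining with plain SAT solvers, as the paper does, is what makes the approach practical.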
Xinyu Niu, T. Chau, Qiwei Jin, W. Luk, Qiang Liu. "Automating resource optimisation in reconfigurable design (abstract only)." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, p. 275. DOI: 10.1145/2435264.2435338
Abstract: A design approach is proposed to automatically identify and exploit run-time reconfiguration opportunities while optimising resource utilisation. We introduce the Configuration Data Flow Graph, a hierarchical graph structure that enables reconfigurable designs to be synthesised in three steps: function analysis, configuration organisation, and run-time solution generation. Three applications, based on barrier option pricing, particle filtering, and reverse time migration, are used to evaluate the proposed approach. The run-time solutions approximate the theoretical performance by eliminating idle functions, and are 1.61 to 2.19 times faster than optimised static designs. FPGA designs developed with the proposed approach are up to 28.8 times faster than optimised CPU reference designs and 1.55 times faster than optimised GPU designs.
J. Fowers, G. Stitt. "Dynafuse: dynamic dependence analysis for FPGA pipeline fusion and locality optimizations." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, pp. 201-210. DOI: 10.1145/2435264.2435300
Abstract: Although high-level synthesis improves FPGA productivity by enabling designers to use high-level code, the resulting performance is often significantly worse than that of register-transfer-level designs. One cause of this limited optimization is that high-level synthesis tools are restricted by multiple possible dependencies arising from the undecidability of alias analysis. In this paper, we introduce the Dynafuse optimization, which analyzes dependencies dynamically to resolve aliases and enable runtime circuit optimizations. To resolve aliases, Dynafuse provides a specialized software data structure that dynamically determines definition-use chains between FPGA functions. In addition, Dynafuse statically creates a reconfigurable overlay network that uses the detected dependencies to dynamically adjust connections between functions and memories in order to fuse pipelines and exploit data locality. Experimental results show that Dynafuse sped up two existing FPGA applications by 1.6-1.8x when exploiting locality and by 3-5x when fusing pipelines. Furthermore, the speedup from pipeline fusion increases linearly with the number of fused functions, which suggests larger applications will see larger improvements.
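The definition-use tracking idea is worth making concrete. A hypothetical minimal model (the names `DepTracker`, `on_write`, `on_read`, and the buffer/function names are illustrative, not Dynafuse's API): a runtime table records the last function to write each buffer, so when another function later reads that buffer, a producer-consumer edge is known exactly, even though static alias analysis could not prove the two accesses touch the same memory. Such edges are what tell an overlay which pipelines are safe to fuse.

```python
class DepTracker:
    """Track definition-use chains between functions at runtime."""

    def __init__(self):
        self.last_writer = {}  # buffer identity -> name of last writing function
        self.def_use = []      # discovered (producer, consumer) edges

    def on_write(self, func, buf):
        # Record this function as the current definition of the buffer.
        self.last_writer[id(buf)] = func

    def on_read(self, func, buf):
        # A read of a buffer some other function defined is a def-use edge.
        w = self.last_writer.get(id(buf))
        if w is not None and w != func:
            self.def_use.append((w, func))

deps = DepTracker()
frame = bytearray(16)
deps.on_write("blur", frame)       # "blur" produces the buffer
deps.on_read("threshold", frame)   # "threshold" consumes it: a fusion candidate
```

Once the edge ("blur", "threshold") is observed, the two stages can be connected directly through the overlay, streaming data producer-to-consumer instead of round-tripping through memory.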
Yuanjie Huang, P. Ienne, O. Temam, Yunji Chen, Chengyong Wu. "Elastic CGRAs." FPGA: ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2013, pp. 171-180. DOI: 10.1145/2435264.2435296
Abstract: Vital technology trends such as voltage scaling and homogeneous multicore scaling have reached their limits, and architects are turning to alternate computing paradigms, such as heterogeneous and domain-specialized solutions. Coarse-Grain Reconfigurable Arrays (CGRAs) promise the performance of massively spatial computing while offering interesting trade-offs between flexibility and energy efficiency. Yet configuring and scheduling execution for CGRAs generally runs into the classic difficulties that have hampered Very Long Instruction Word (VLIW) architectures: efficient schedules are difficult to generate, especially for applications with complex control flow and data structures, and they are inherently static, and thus ill-adapted to variable-latency components (such as the read ports of caches). Over the years, VLIWs have been relegated to important but specific application domains where such issues are more under the designers' control; similarly, statically scheduled CGRAs may prove inadequate for future general-purpose computing systems. In this paper, we introduce Elastic CGRAs, the superscalar processors of computing fabrics: no complex schedule needs to be computed at configuration time, and operations execute dynamically in the CGRA when their data are ready, exploiting the data parallelism an application offers. We designed, down to a manufacturable layout, a simple CGRA in which we demonstrated and optimized our elastic control circuitry. We also built a complete compilation toolchain that transforms arbitrary C code into a configuration for the array. The area overhead (26.2%), critical-path overhead (8.2%), and energy overhead (53.6%) of Elastic CGRAs over non-elastic CGRAs are significantly lower than the overheads of superscalar processors over VLIWs, while providing the same benefits. At such moderate costs, elasticity may prove to be one of the key enablers for widespread adoption of CGRAs.