{"title":"FPGA Gaussian Random Number Generators with Guaranteed Statistical Accuracy","authors":"David B. Thomas","doi":"10.1109/FCCM.2014.47","DOIUrl":"https://doi.org/10.1109/FCCM.2014.47","url":null,"abstract":"Many types of stochastic algorithms, such as Monte-Carlo simulations and Bit-Error-Rate testing, require very high run-times and are often trivially parallelisable, so are natural candidates for execution using FPGAs. However, the applications are reliant on Gaussian Random Number Generators (GRNGs) with good statistical properties, as very small biases over trillions of random samples can lead to incorrect results. Previous hardware GRNGs have focussed on area-efficient algorithms to produce Gaussian distributions under idealised assumptions, but do not make statements about the actual distribution coming out of real fixed-point hardware. In this paper, we present a new type of GRNG called a Piecewise-CLT, which uses a weighted blend of many small smooth distributions to approximate the Gaussian. By adjusting the weights, it is possible to directly target the distribution of the Gaussian, resulting in a circuit with an exactly quantified output distribution. Three members of the PwCLT family are presented, ranging from medium-area with good quality, up to a generator providing guaranteed statistical accuracy out to 12-sigma. We also show that PwCLT provides a better area-accuracy tradeoff than all existing high-speed scalar FPGA GRNGs, and can provide extremely high levels of statistical quality not possible in any previous methods.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127760237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hierarchical Memory Architecture with NoC Support for MPSoC on FPGAs","authors":"Shiming Li, Miaoqing Huang, Hongyuan Ding, Sen Ma","doi":"10.1109/FCCM.2014.55","DOIUrl":"https://doi.org/10.1109/FCCM.2014.55","url":null,"abstract":"This work presents a memory hierarchy with the support of network-on-chip (NoC) for MPSoC systems. The memory hierarchy consists of a shared global memory and private local memories as shown in Figure 1. Each core in the system is equipped with two local memories, one for instructions and one for data. The MicroBlaze soft core used in this work connects the main bus through the PLB interface and connects the local memory modules through the LMB interface. Further it connects to a 4x4 mesh NoC through the FSL interface, as shown in Figure 2(a). We built the generic NoC (NoC-g) using the open-source router designed by the Concurrent VLSI Architecture group at the Stanford University [2]. Each router has 5 input ports and 5 output ports. Each input physical channel and each output physical channel is connected to 4 input virtual channels and 4 output virtual channels, respectively. The 40 virtual channels are connected to an internal crossbar switch for routing. We designed the adapter to connect the MicroBlaze processor to the router.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115047505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA Implementation of Optical Flow Algorithm Based on Cost Aggregation","authors":"Y. Tanabe, T. Maruyama","doi":"10.1109/FCCM.2014.57","DOIUrl":"https://doi.org/10.1109/FCCM.2014.57","url":null,"abstract":"The computational complexity of the optical flow estimation is very high, and many hardware systems have been proposed. In these systems, Lucas-Kanade, tensor-based, and phase-based method have been widely used. Census-transform, which is widely used in the stereo vision systems, was also implemented in several FPGA systems. In these systems, only one clock cycle is required for calculating one flow as their throughput, and their processing speed is fast enough for real-time processing of high resolution images. GPUs have also been used, and it was reported that the acceleration by FPGAs and GPUs is comparable[1][2]. The main problem in these systems is their low accuracy. The methods described above show high accuracy for the regions with high changes of brightness, but show poor results for uniform regions. This is the common problem with the stereo vision, and the approaches used in the stereo vision can be applied to the optical flow estimation. In this paper, we extend a cost aggregation algorithm[3] for the optical flow estimation, and implement it on FPGA.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133063503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-phase Clock Time-to-Digital Convertor Based on ISERDES Architecture","authors":"Tian Xiang, Lei Zhao, Xi Jin, Tianqi Wang, S. Chu, C. Ma, Shubin Liu, Q. An, Xue Ben","doi":"10.1109/FCCM.2014.22","DOIUrl":"https://doi.org/10.1109/FCCM.2014.22","url":null,"abstract":"The time-to-digital converter (TDC) aims to record an accurate timestamp at the moment an input signal arrives. Multi-phase clock sampling is a common way to map a TDC onto an FPGA. Traditionally, this method provides medium accuracy with low resource usage. In this paper, we present a new TDC architecture based on the 2-ISERDES in the SelectIO, rather than on Slice resources as in the traditional approach. The ISERDES-based TDC is equivalent to a TDC with 8 equidistant phase-shifted clocks, with a maximum clock frequency of 900 MHz. The least significant bit (LSB) is 139 ps, which is 445% better than the traditional architecture.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134297609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Timing Fault Detection in FPGA-Based Circuits","authors":"Edward A. Stott, Joshua M. Levine, P. Cheung, Nachiket Kapre","doi":"10.1109/FCCM.2014.32","DOIUrl":"https://doi.org/10.1109/FCCM.2014.32","url":null,"abstract":"The operation of FPGA systems, like most VLSI technology, is traditionally governed by static timing analysis, whereby safety margins for operating and manufacturing uncertainty are factored in at design-time. If we operate FPGA designs beyond these conservative margins we can obtain substantial energy and performance improvements. However, doing this carelessly would cause unacceptable impacts to reliability, lifespan and yield - issues which are growing more severe with continuing process scaling. Fortunately, the flexibility of FPGA architecture allows us to monitor and control reliability problems with a variety of runtime instrumentation and adaptation techniques. In this paper we develop a system for detecting timing faults in arbitrary FPGA circuits based on Razor-like shadow register insertion. Through a combination of calibration, timing constraint and adaptation of the CAD flow, we deliver low-overhead, trustworthy fault detection for FPGA-based circuits.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"47 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129360991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Abstract: Shared L2 Cache Management in Multicore Real-Time System","authors":"Gang Chen, Biao Hu, Kai Huang, A. Knoll, Di Liu","doi":"10.1109/FCCM.2014.52","DOIUrl":"https://doi.org/10.1109/FCCM.2014.52","url":null,"abstract":"In multicore system, shared cache interference has been recognized as one of the major factors that degrade the average performance as well as predictability of system. How to manage the shared cache in order to optimize the system performance while guaranteeing the system predictability is still an open issue. State-of-the-art techniques on this topic use page coloring to partition the shared cache at OS level. In this paper, we present a shared cache management scheme for multicore system. This shared cache management scheme supports way-based cache partitioning at hardware level, building task-level time-triggered reconfigurable-cache multicore system. We evaluated the proposed scheme w.r.t. different numbers of cores and cache modules and prototyped the constructed MPSoCs on FPGA.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127542409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Outer Loop Parallelism of Nested Loop on Coarse-Grained Reconfigurable Architectures","authors":"Dajiang Liu, S. Yin, Leibo Liu, Shaojun Wei","doi":"10.1109/FCCM.2014.19","DOIUrl":"https://doi.org/10.1109/FCCM.2014.19","url":null,"abstract":"A coarse-grained reconfigurable architecture is a promising architecture with high power efficiency, which is typically composed of a host controller and a processing element array (PEA). Loops are often mapped onto PEAs for acceleration. In previous work, the innermost loop is pipelined, and the maximal number of concurrently executable operators (CEOs) in the kernel is limited by the inner loop. Consider the loop body DFG of an input 2D nested loop with an inner-loop-carried dependence ([0,1]) and an outer-loop-carried dependence ([1,1]). We would map this loop onto a 4×4 PEA with pipelining. We assume that the latency of executing one loop iteration is Lb, and the number of iterations involved at one cycle in the kernel phase of pipelining is Wk. As there is an inner-loop dependence ([0,1]), the initiation interval (IIi) of inner loop pipelining could be minimized to 1 and we get Wk = 4. We also note that the angle α is contained by two sides in Figure 1(b), which can be written as follows: tan(α) = Wk/Lb = 1/IIi.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128422938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA Acceleration for Simultaneous Medical Image Reconstruction and Segmentation","authors":"Peng Li, Thomas Page, Guojie Luo, Wentai Zhang, Pei Wang, Peng Zhang, P. Maass, M. Jiang, J. Cong","doi":"10.1109/FCCM.2014.54","DOIUrl":"https://doi.org/10.1109/FCCM.2014.54","url":null,"abstract":"The conventional approach of computed tomography (CT) is to solve each image processing task individually in sequence. An obvious drawback is that the measured data is only used once at the first step, and the possible errors, from noises in the measured data, inappropriate modeling, or inappropriate parameters, are not easy to be corrected and will be propagated into the later steps. As a consequence, approaches that combine the reconstruction and the specific processing task have become popular. This work adopts an iterative algorithm with simultaneous reconstruction and segmentation using the Mumford-Shah model, which can be applied not only to regularize the ill-posedness of the tomographic reconstruction problem, but also to compute segmentation directly from the measured data. The Mumford-Shah model is both mathematically and computationally difficult. In this paper, we accelerated this computation and data intensive application by FPGA devices and achieved 9.24X speedup over the conventional CPU implementation.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126805169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Signal Processors on FPGAs","authors":"Di Wu, Andreas Moshovos","doi":"10.1109/FCCM.2014.58","DOIUrl":"https://doi.org/10.1109/FCCM.2014.58","url":null,"abstract":"An Image Signal Processor (ISP) converts raw imaging sensor data into a format appropriate for further processing and human inspection. This work explores FPGA-based ISP designs considering specialized and programmable implementations and proposes an ISP using a programmable generic processing unit with comparable performance versus the dedicated implementations.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115510570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerator of Stacked Convolutional Independent Subspace Analysis for Deep Learning-Based Action Recognition","authors":"Lu He, Yan Luo, Yu Cao","doi":"10.1109/FCCM.2014.37","DOIUrl":"https://doi.org/10.1109/FCCM.2014.37","url":null,"abstract":"Action recognition has been a research challenge in multimedia computing and machine vision. Recent advances in deep learning combined with stacked convolutional Independent Subspace Analysis (ISA) have achieved performance superior to all previously published results on several publicly available data sets. Unfortunately, one major issue in large-scale deployment of this new deep learning-based approach is the unacceptable latency of training with high-dimensional data. In this paper, we propose a new hardware accelerator that can reduce the training time substantially for deep learning-based action recognition. Specifically, our proposed approach focuses on accelerating the convolutional stacked ISA algorithm, the core component of the deep learning-based action recognition algorithms. We design parallel pipelines, data parallelism and look-up tables to speed up the algorithm. With an embedded heterogeneous platform consisting of a general purpose processor and an FPGA, we are able to achieve up to 10X speedup for stacked ISA training compared to a software-only implementation.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128588776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}