Nasibeh Nasiri, Oren Segal, M. Margala, W. Vanderbauwhede, S. R. Chalamalasetti
{"title":"High Level Programming of Document Classification Systems for Heterogeneous Environments using OpenCL (Abstract Only)","authors":"Nasibeh Nasiri, Oren Segal, M. Margala, W. Vanderbauwhede, S. R. Chalamalasetti","doi":"10.1145/2684746.2689136","DOIUrl":"https://doi.org/10.1145/2684746.2689136","url":null,"abstract":"Document classification is at the heart of several of the applications that have been driving the proliferation of the internet in our daily lives. The ever-growing amounts of data and the need for higher-throughput, more energy-efficient document classification solutions motivated us to investigate alternatives to the traditional homogeneous CPU-based implementations. We investigate a heterogeneous system where CPUs are combined with FPGAs as system accelerators. Incorporating FPGAs as accelerators in a heterogeneous computing environment allows for the creation of flexible custom hardware solutions that can potentially offer increased power efficiency and performance gains. One of the main issues delaying widespread adoption of FPGAs as standard heterogeneous system accelerators is the difficulty in programming them. The OpenCL standard offers a unified C programming model for any device that adheres to its standards. An Altera OpenCL FPGA-based implementation of a document classification system is investigated in which a stream of HTML documents is scored according to a profile on a document-by-document basis. The results show that the throughput of the document classification application with and without Bloom Filters is 312MB/s and 343MB/s respectively, when running on CPU, and 354MB/s and 452MB/s respectively, when running on an FPGA. Our results also show up to 32% power efficiency improvement for the FPGA implementation over the CPU implementation. We would like to thank Davor Capalija from Altera for his invaluable advice during our work on the FPGA version of the algorithm.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129468389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient and Flexible FPGA Implementation of a Face Detection System (Abstract Only)","authors":"Hichem Ben Fekih, A. Elhossini, B. Juurlink","doi":"10.1145/2684746.2689095","DOIUrl":"https://doi.org/10.1145/2684746.2689095","url":null,"abstract":"Robust and rapid face detection systems are constantly gaining more interest, since they represent the first step for many challenging tasks in the field of computer vision. In this paper a software-hardware co-design approach is presented that enables the detection of frontal faces in real time. A complete hardware implementation of all components taking part in the face detection is introduced. This work is based on the object detection framework of Viola and Jones, which makes use of a cascade of classifiers to reduce the computation time. The proposed architecture is flexible, as it allows the use of multiple instances of the face detector. This makes developers free to choose the speed range and reserved resources for this task. The current implementation runs on the Zynq SoC and receives images over an IP network, which allows exposing the face detection task as a remote service that can be consumed from any device connected to the network. We performed several measurements for the final detector and the software equivalent. Using three Evaluator cores, the ZedBoard system achieves a maximal average frame rate of 13.4 FPS when analysing an image containing 640x480 pixels. This is an improvement of 5.25 times over the software solution and represents acceptable results for most real-time systems. On the ZC706 system, a higher frame rate of 16.58 FPS is achieved. The proposed hardware solution achieved 92% accuracy, which is lower than the software solution (97%) due to a different scaling algorithm. The proposed solution achieved a higher frame rate than other solutions found in the literature.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130651399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eddie Hung, Joshua M. Levine, Edward A. Stott, G. Constantinides, W. Luk
{"title":"Delay-Bounded Routing for Shadow Registers","authors":"Eddie Hung, Joshua M. Levine, Edward A. Stott, G. Constantinides, W. Luk","doi":"10.1145/2684746.2689075","DOIUrl":"https://doi.org/10.1145/2684746.2689075","url":null,"abstract":"The on-chip timing behaviour of synchronous circuits can be quantified at run-time by adding shadow registers, which allow designers to sample the most critical paths of a circuit at a different point in time than the user register would normally. In order to sample these paths precisely, the path skew between the user and the shadow register must be tightly controlled and consistent across all paths that are shadowed. Unlike a custom IC, FPGAs contain prefabricated resources from which composing an arbitrary routing delay is not trivial. This paper presents a method for inserting shadow registers with a minimum skew bound, whilst also reducing the maximum skew. To preserve circuit timing, we apply this to FPGA circuits post place-and-route, using only the spare resources left behind. We find that our techniques can achieve an average STA reported delay bound of +/-200ps on a Xilinx device despite incomplete timing information, and achieve <1ps accuracy against our own delay model.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121531813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Platform-Independent Gigabit Communication for Low-Cost FPGAs (Abstract Only)","authors":"R. Salomon, R. Joost, Matthias Hinkfoth","doi":"10.1145/2684746.2689150","DOIUrl":"https://doi.org/10.1145/2684746.2689150","url":null,"abstract":"Among other things, field-programmable gate arrays (FPGAs) available today contain numerous bit-serial transceivers for communication purposes. Unlike analog modulation schemes, such as quadrature amplitude modulation, bit-serial communication is relatively easy to implement in digital hardware, and is thus usually used for inter-FPGA communication. In this view, only the data rate and frequency limit the bandwidth of the circuit. In order to overcome the bandwidth limit, this research proposes a pulse-width modulation (PWM) scheme for data transmission. The information is coded by modulating the length of the high and low voltage parts of the pulse. Although this approach is not new, existing PWM modulators have unsatisfactory data rates due to their synchronous implementation. Therefore, this research implements both the modulator and demodulator by using asynchronous logic. The result is a proof-of-concept comprising two Terasic DE2-70 development boards and a 1 m coaxial cable. Both the PWM modulator and demodulator run at 333 MHz, and pulses are transmitted every 3 ns. Each pulse carries 3 to 4 bits of data. The experimental results indicate an achievable data rate of one gigabit per second, which is about 50% higher than the FPGA's handbook states.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131975091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shane T. Fleming, David B. Thomas, G. Constantinides, D. Ghica
{"title":"System-level Linking of Synthesised Hardware and Compiled Software Using a Higher-order Type System","authors":"Shane T. Fleming, David B. Thomas, G. Constantinides, D. Ghica","doi":"10.1145/2684746.2689089","DOIUrl":"https://doi.org/10.1145/2684746.2689089","url":null,"abstract":"Devices with tightly coupled CPUs and FPGA logic allow for the implementation of heterogeneous applications which combine multiple components written in hardware and software languages, including first-party source code and third-party IP. Flexibility in component relationships is important, so that the system designer can move components between software and hardware as the application design evolves. This paper presents a system-level type system and linker, which allows functions in software and hardware components to be directly linked at link time, without requiring any modification or recompilation of the components. The type system is designed to be language agnostic, and exhibits higher-order features, to enable design patterns such as notifications and callbacks to software from within hardware functions. We demonstrate the system through a number of case studies which link compiled software against synthesised hardware in the Xilinx Zynq platform.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130038533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Martinianos Papadopoulos, Christos Ttofis, C. Kyrkou, T. Theocharides
{"title":"Real-Time Obstacle Avoidance for Mobile Robots via Stereoscopic Vision Using Reconfigurable Hardware (Abstract Only)","authors":"Martinianos Papadopoulos, Christos Ttofis, C. Kyrkou, T. Theocharides","doi":"10.1145/2684746.2689099","DOIUrl":"https://doi.org/10.1145/2684746.2689099","url":null,"abstract":"An embedded, real-time, and low power obstacle avoidance system is a critical component towards fully autonomous robots that can be used in safety missions, space exploration, and transportation systems among others. In this paper a complete prototyping platform for the evaluation of obstacle avoidance systems and autonomous robots is realized on reconfigurable hardware. An efficient stereo vision algorithm for producing the necessary 3D and an obstacle avoidance subsystem were both implemented on an ATLYS Spartan-6 FPGA board equipped with a VmodCam stereo camera module. A modified FDX Vantage 1/10 electric car platform was used for testing the proposed architecture in indoor and outdoor real-world scenes. The system receives stereo image data from the VmodCam module and a decision-making algorithm is applied on a specified Region of Interest (RoI) on the produced disparity map. The algorithm outputs the direction that the robot should move to in order to avoid any obstacles present. Experimental evaluation results indicate that the FPGA-based robotic platform can avoid obstacles in real-time (i.e. can process and identify obstacles within a 1/30th of a second that a stereo image takes to be processed) in both indoor and outdoor environments, with 91.7% accuracy, equivalent to software implementations. The overall power consumption of the proposed architecture, excluding the electronic car platform, is 6 W, making it ideal for use on mobile robots, without becoming a significant drain on its battery life.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133378332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lei Chen, Yuanfu Zhao, Zhiping Wen, Jing Zhou, Xuewu Li, Yanlong Zhang, Huabo Sun
{"title":"300 Thousand Gates Single Event Effect Hardened SRAM-based FPGA for Space Application (Abstract Only)","authors":"Lei Chen, Yuanfu Zhao, Zhiping Wen, Jing Zhou, Xuewu Li, Yanlong Zhang, Huabo Sun","doi":"10.1145/2684746.2689120","DOIUrl":"https://doi.org/10.1145/2684746.2689120","url":null,"abstract":"SRAM-based FPGAs have been widely used in space engineering. However, the configuration memory in SRAM-based FPGAs is susceptible to single event effects (SEE), which can disrupt the communication or control functions of the spacecraft. To mitigate SEE in SRAM-based FPGAs used in the space radiation environment, Beijing Microelectronics Technology Institute (BMTI) developed a 300 thousand gates Single Event Effect hardened SRAM-based FPGA -- BQVR300RH. The BQVR300RH employs the Radiation Hardened by Design (RHBD) technique. A hardened standard cell library based on the Adaptive SRAM (ASRAM) structure is established. For especially sensitive and important resources, other assistant techniques are also adopted. The experimental results show that the BQVR300RH substantially improves the anti-SEU characteristic compared with the Xilinx 300 thousand gates space-grade SRAM-based FPGA (XQVR300). The SEU threshold of BQVR300RH is 19.06 MeV⋅cm²/mg. The anti-SEU characteristic improves by three orders of magnitude over the XQVR300. The improvement of anti-SEU behavior expands the usage of SRAM-based FPGAs in aerospace applications. Currently, BQVR300RH has been used in the space field in China.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"222 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124405781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA","authors":"Ren Chen, Sruja Siriyal, V. Prasanna","doi":"10.1145/2684746.2689068","DOIUrl":"https://doi.org/10.1145/2684746.2689068","url":null,"abstract":"Parallel sorting networks are widely employed in hardware implementations for sorting due to their high data parallelism and low control overhead. In this paper, we propose an energy and memory efficient mapping methodology for implementing a bitonic sorting network on FPGA. Using this methodology, the proposed sorting architecture can be built for a given data parallelism while supporting continuous data streams. We propose a streaming permutation network (SPN) by \"folding\" the classic Clos network. We prove that the SPN is programmable to realize all the interconnection patterns in the bitonic sorting network. A low cost design for sorting with minimal resource usage is obtained by reusing one SPN. We also demonstrate a high throughput design by trading off area for performance. With a data parallelism of p (2 ≤ p ≤ N/log2 N), the high throughput design sorts an N-key sequence with latency O(N/p), throughput (# of keys sorted per cycle) O(p) and uses O(N) memory. This achieves optimal memory efficiency (defined as the ratio of throughput to the amount of on-chip memory used by the design) of O(p/N). Another noteworthy feature of the high throughput design is that only single-port memory rather than dual-port memory is required for processing continuous data streams. This results in a 50% reduction in memory consumption. Post place-and-route results show that our architecture demonstrates 1.3x ∼ 1.6x improvement in energy efficiency and 1.5x ∼ 5.3x better memory efficiency compared with the state-of-the-art designs.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114975767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Y. Tamiya, Yoshinori Tomita, Toshiyuki Ichiba, Kaoru Kawamura
{"title":"Sequence-based In-Circuit Breakpoints for Post-Silicon Debug (Abstract Only)","authors":"Y. Tamiya, Yoshinori Tomita, Toshiyuki Ichiba, Kaoru Kawamura","doi":"10.1145/2684746.2689102","DOIUrl":"https://doi.org/10.1145/2684746.2689102","url":null,"abstract":"Simulation and formal verification in the pre-silicon phase can no longer accomplish whole system-level verification with exhaustive input data and run-time, because they lack sufficient speed and logic capacity. Consequently, post-silicon validation, such as in-circuit debugging, becomes increasingly important. In this paper we propose a novel breakpoint mechanism, which improves controllability of in-circuit debugging. Our contributions are summarized as follows: (1) A basic concept of a new breakpoint method is proposed, which stops the target hardware by detecting a data sequence of arbitrary length, (2) The breakpoint is shown to be implemented in an efficient pipelined hardware, which works \"at-speed\", in real time and with small area overheads using CRC (Cyclic Redundancy Check), and (3) Our experimental results of detecting a data sequence in pseudo-random stream data show that false positives can be suppressed by the CRC width and the number of sub-sequences. Since changing breakpoint conditions does not require re-implementation of the hardware, it is expected to substantially reduce debugging effort in post-silicon validation.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126171855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Technical Session 8: Applications","authors":"K. Bazargan","doi":"10.1145/3251658","DOIUrl":"https://doi.org/10.1145/3251658","url":null,"abstract":"","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130161811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}