Yongfu He, Shaojun Wang, Yu Peng, Y. Pang, Ning Ma, Jingyue Pang
{"title":"High performance relevance vector machine on HMPSoC","authors":"Yongfu He, Shaojun Wang, Yu Peng, Y. Pang, Ning Ma, Jingyue Pang","doi":"10.1109/FPT.2014.7082812","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082812","url":null,"abstract":"Relevance Vector Machine (RVM), with its ability to express uncertainty, has spawned broad applications in Prognostic and Health Management (PHM). However, the computationally intensive nature of RVM greatly limits its usage. This paper presents a software and hardware co-design approach based on HMPSoC technology, which efficiently exploits the sequential and parallel nature of RVM. A multi-channel, pipelined hardware architecture is proposed to accelerate kernel formulation and the calculation of intermediate values. The hardware, wrapped with an AXI-Stream interface, is integrated into the HMPSoC as an acceleration engine. We implement the design on an on-board PHM prototype platform with a Xilinx Zynq XC7Z020 AP SoC. The experimental results show speedups of 5.3× and 46.8× in execution time over RVM running on a PC with a Xeon 5620 processor and on an ARM Cortex A9 processor, respectively. The energy consumption is reduced by 153.0× and 37.3×, respectively.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"35 1","pages":"334-337"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81126948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the reliability of RO PUF using frequency offset","authors":"Bin Tang, Yaping Lin, Jiliang Zhang","doi":"10.1109/FPT.2014.7082813","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082813","url":null,"abstract":"Physical unclonable function (PUF) is a promising hardware security primitive that can be applied to various security related areas. The ring oscillator (RO) PUF is one of the most popular PUFs; it generates a volatile key by comparing the frequencies of ROs. Previous RO PUFs incur unacceptable hardware overheads to improve reliability by eliminating the effect of environmental factors. In this paper, we propose a frequency offset algorithm (FOA) to enhance the reliability and lower the hardware overhead. The key idea is to make the frequency difference larger than a given threshold by offsetting the frequencies of RO pairs. Experimental results show that our proposed FOA method has better reliability and lower hardware overhead than the temperature-aware cooperative (TAC) method. In particular, our proposed method can achieve 100% utilization of ROs.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"60 1","pages":"338-341"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81084593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"No zero padded sparse matrix-vector multiplication on FPGAs","authors":"Jiasen Huang, Junyan Ren, Wenbo Yin, Lingli Wang","doi":"10.1109/FPT.2014.7082800","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082800","url":null,"abstract":"Sparse Matrix-Vector Multiplication (SpMxV) algorithms suffer heavy performance penalties due to irregular memory accesses. In this paper, we introduce a novel compressed element storage (CES) format, in which the additional data structures for indexing are abandoned, and each location associated with the non-zero element of the matrix is now indicated by the name of a variable multiplied by the corresponding element of the vector. To ensure fastest access and parallel access without data hazards, on-chip registers are used exclusively to replace the BRAM or off-chip DRAM/SRAM to hold all the SpMxV data. On-chip DSP resources are fully utilized so as to ensure a maximum number of multipliers concurrently working.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"28 4 1","pages":"290-291"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78570045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A pure-CMOS nonvolatile multi-context configuration memory for dynamically reconfigurable FPGAs","authors":"K. Tatsumura, Masato Oda, S. Yasuda","doi":"10.1109/FPT.2014.7082778","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082778","url":null,"abstract":"Multi-context configuration memory stores multiple sets of configuration data and changes the entire configuration of FPGA quickly, enabling enhancement of hardware utilization with dynamic reconfiguration architectures. The memory area for one set of configuration data should be much smaller than the computational resource it controls. In this paper, we propose a pure-CMOS, nonvolatile, and small-footprint multi-context configuration memory. The multi-context memory includes multiple 2Tr nonvolatile memory elements, which are programmed by channel hot-electron injection, and allows context switching in a single clock cycle. A primitive dynamically reconfigurable device having a lookup table and minimum interconnect backed by 16-bit 8-context configuration memory was fabricated by a 0.18 um CMOS process and its functionality was demonstrated. The 2Tr nonvolatile memory element is more than 4 times denser than 6Tr SRAM, enabling achievement of greater logic density. The pure-CMOS and nonvolatile features would enhance the attractiveness of the technology in many applications.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"41 1","pages":"215-222"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88066363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA implementation of Blokus Duo player using hardware/software co-design","authors":"A. Kojima","doi":"10.1109/FPT.2014.7082825","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082825","url":null,"abstract":"Blokus Duo is an abstract strategy game for two players. In this paper, we describe our FPGA implementation of a Blokus Duo player for the ICFPT2014 design contest, which is a revised version of our previous design for the ICFPT2013 design contest. Our design consists of a hardware logic part and a software part running on a soft IP processor. The hardware logic part calculates the evaluation value of the board status, which is a heavy task for the software part. The software part implements recursive Alpha-Beta pruning and iterative deepening, which are complex to implement as hardware logic circuits. The current version of our implementation on Xilinx Artix7 can run at 142MHz. The hardware logic part evaluates about 90,000 nodes in one second at the beginning of the game.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"378-381"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84034239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Albert Kwon, Kaiyu Zhang, P. L. Lim, Yuchen Pan, Jonathan M. Smith, A. DeHon
{"title":"RotoRouter: Router support for endpoint-authorized decentralized traffic filtering to prevent DoS attacks","authors":"Albert Kwon, Kaiyu Zhang, P. L. Lim, Yuchen Pan, Jonathan M. Smith, A. DeHon","doi":"10.1109/FPT.2014.7082774","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082774","url":null,"abstract":"RotoRouter addresses Denial-of-Service (DoS) attacks on networks with a novel protocol and router implementation. Sets of RotoRouters cooperate in detecting and filtering out invalid network traffic before it reaches network endpoints; a new router-enforceable connection protocol queries destination endpoints to authorize traffic flows and uses per-packet digital signatures to distinguish allowed from disallowed connections. A RotoRouter prototype was implemented on a four-port 1000BASE-T NetFPGA-10G platform and supports 1024 simultaneous active connections using 74 BRAMs (less than one quarter of the available NetFPGA-10G BRAMs). It is able to sustain 800 Mbps per port throughputs for 1500B packets with less than 0.3 μs latency, even during a DoS attack. With additional logic and memory resources, the required validation and switching operations scale to port speeds in excess of 10 Gbps and links with more than 10,000 active flows.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"39 1","pages":"183-190"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84325900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architectural synthesis of computational pipelines with decoupled memory access","authors":"Shaoyi Cheng, J. Wawrzynek","doi":"10.1109/FPT.2014.7082758","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082758","url":null,"abstract":"As high level synthesis (HLS) moves towards mainstream adoption among FPGA designers, it has proven to be an effective method for rapid hardware generation. However, in the context of offloading compute intensive software kernels to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. In this paper, we present an automatic flow to refactor and restructure processor-centric software implementations, making them better suited for FPGA platforms. The methodology generates pipelines that decouple memory operations and data access from computation. The resulting pipelines have much better throughput due to their efficient use of the memory bandwidth and improved tolerance to data access latency. The methodology complements existing work in high-level synthesis, easing the creation of heterogeneous systems with high performance accelerators and general purpose processors. With this approach, for a set of non-regular algorithm kernels written in C, a performance improvement of 3.3 to 9.1x is observed over direct C-to-Hardware mapping using a state-of-the-art HLS tool.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"18 1","pages":"83-90"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85643289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sukjin Kim, Jason Wong, P. Kane, Dylan Wang, Xiaolong Xie
{"title":"Industrial session","authors":"Sukjin Kim, Jason Wong, P. Kane, Dylan Wang, Xiaolong Xie","doi":"10.1109/fpt.2014.7082827","DOIUrl":"https://doi.org/10.1109/fpt.2014.7082827","url":null,"abstract":"Xilinx has developed even more advanced FPGAs and 2nd generation SoCs and 3D ICs to stay a generation ahead, and deliver an extra node worth of performance, power, and integration. The UltraScale architecture was developed to scale from 20nm planar through 16nm and beyond FinFET (FF) technologies, and from monolithic through 3D ICs. In this talk, we will study the cases about Xilinx FPGA in cutting edge applications, also the advantages of UltraScale architecture 2nd generation SoCs, and design tools. IoT and Wearable Applications Enabled by Bluetooth Low Energy (BLE) Solutions Patrick Kane, Cypress Abstract: The Internet of things is happening right now. The newest standard is Bluetooth Low Energy or BLE. This may or may not be the long term answer to IoT communication, but it is certainly in the race to become the leading IoT communication standard.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"42 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83589874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Time sharing of Runtime Coarse-Grain Reconfigurable Architectures processing elements in multi-process systems","authors":"Benjamin Carrión Schäfer","doi":"10.1109/FPT.2014.7082757","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082757","url":null,"abstract":"","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"39 1","pages":"76-82"},"PeriodicalIF":0.0,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83458446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why Put FPGAs in your CPU socket?","authors":"P. Chow","doi":"10.1109/FPT.2013.6718320","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718320","url":null,"abstract":"Summary form only given. Ever since FPGAs were invented, there has been great interest in using them as computing devices, and with the logic densities of today's devices, many interesting functions have been shown to have significant performance and energy benefits when implemented in FPGAs. However, when an application requires the combination of a high-performance CPU and an FPGA accelerator, the effectiveness of the FPGA is highly determined by the latency and bandwidth between the CPU, the CPU memory system and the FPGA and its memory system. Putting FPGAs into the CPU socket is one way to address this issue. This talk will present the history, the advantages and disadvantages, the challenges, architectures, programming models and applications of \"in-socket\" accelerator systems.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"17 1","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81476395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}