{"title":"A dataflow system for anomaly detection and analysis","authors":"A. Bara, Xinyu Niu, W. Luk","doi":"10.1109/FPT.2014.7082793","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082793","url":null,"abstract":"This paper proposes DeADA, a dataflow architecture incorporating an automated, unsupervised and online learning algorithm. Compared with 24 core software implementations, DeADA achieves up to 6.17 times lower data drop rate and 10.7 times higher power efficiency. More importantly, experimental results for the Heartbleed case study suggest that DeADA is capable of detecting unknown attacks under network speeds of at least 18Mbps, a feature which is essential for modern network intrusion detection.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"63 1","pages":"276-279"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85170374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-accelerated Monte-Carlo integration using stratified sampling and Brownian bridges","authors":"M. D. Jong, V. Sima, K. Bertels, David B. Thomas","doi":"10.1109/FPT.2014.7082755","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082755","url":null,"abstract":"Monte-Carlo Integration (MCI) is a numerical technique for evaluating integrals which have no closed form solution. Naive MCI randomly samples the integrand at uniformly distributed points. This naive approach converges very slowly. Stratified sampling can be used to concentrate the samples on segments of the integration domain where the integrand has the highest variance. Even with stratified sampling, MCI converges very slowly for multidimensional integrals. In this work, we implement an FPGA-accelerated design for MISER, a widely used adaptive MCI algorithm applying stratified sampling. We show how to eliminate the recursion from MISER and partition the algorithm between CPUs and FPGAs. The CPUs manage the control-heavy stratification strategy, while the FPGA is responsible for sampling the integrand. The integrand is compiled into a deep pipeline on the FPGA, producing one function evaluation per clock cycle. We demonstrate the FPGA-accelerated design by pricing a path dependent financial derivative called an Asian option. To make optimal use of the stratification, we implement a Brownian bridge on the FPGA that produces one entire bridge per clock cycle. The FPGA-accelerated design is up to 880 times faster compared to a software reference using the GSL implementation of MISER. Compared to naive MCI in software, our design even requires up to 3572 times less execution time to achieve the same accuracy.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"40 19 1","pages":"68-75"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89238331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using C to implement high-efficient computation of dense optical flow on FPGA-accelerated heterogeneous platforms","authors":"Zhilei Chai, Haojie Zhou, Zhibin Wang, Dong Wu","doi":"10.1109/FPT.2014.7082789","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082789","url":null,"abstract":"High-quality algorithms for dense optical flow computation are computationally intensive. To compute them with high speed and low power is vital to make optical flow computation applicable in real-world applications. In contrast to only the Horn-Schunck model being studied on FPGA-based systems today, one of the best linear variational methods for dense optical flow computation, Combine-Brightness-Gradient, is implemented on FPGA-accelerated heterogeneous platforms in this paper. C instead of HDLs is employed and optimizing techniques based on the algorithmic parallelism and hardware architecture are introduced. Experimental results show that 30-110x improvement of the computing efficiency over CPUs was achieved. The FPGA-accelerated version is able to process 640 × 480 image at 12 fps with 0.38 J per frame, while it is 0.8 fps and around 40 J on CPUs. Through demonstrating high performance and low power of dense optical flow algorithm on FPGA-based heterogeneous platforms implemented in C, this paper shows that the off-the-shelf commodity FPGAs coupled with High-Level-Synthesis (HLS) tools could provide an available option when computational efficiency together with development speed are required.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"63 1","pages":"260-263"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75331649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Network recorder and player: FPGA-based network traffic capture and replay","authors":"Siyi Qiao, Chen Xu, Lei Xie, Ji Yang, Chengchen Hu, X. Guan, Jianhua Zou","doi":"10.1109/FPT.2014.7082815","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082815","url":null,"abstract":"An appropriate tool to generate real network traffic plays an important role in testing network system. Traditionally, such a tool relies on software solutions that copies data back and forth between different part of memory to capture or replay network traffic. In this paper, we propose an FPGA-centric approach using parallel logic, which can ensure high accuracy of time and high throughput. We first design an FPGA add-on board dealing with the multifarious work like adding content or calculate statistical value. The system is implemented on an own designed off-the-shelf FPGA network add-on card to demonstrate the viability of our assumption. Experiments demonstrate reasonable performance improvement (higher throughput and replay time precision) when compared with software based solutions.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"45 2 1","pages":"342-345"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78137756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient FPGA implementation of QR decomposition using a novel systolic array architecture based on enhanced vectoring CORDIC","authors":"Jianfeng Zhang, P. Chow, Hengzhu Liu","doi":"10.1109/FPT.2014.7082764","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082764","url":null,"abstract":"Multiple input multiple output (MIMO) - Orthogonal frequency division multiplexing (OFDM) systems typically use Orthogonal-triangular (QR) decomposition. In this paper, we present a novel systolic array architecture to realize QR decomposition based on the Givens rotation method for a 4 × 4 real matrix. The coordinate rotation digital computer (CORDIC) algorithm is adopted and modified to speed up and simplify the Givens rotation. To verify the function and evaluate the performance, the proposed architectures are validated on a Virtex 5 FPGA development platform. Compared to a commercial implementation of vectoring CORDIC, an enhanced vectoring CORDIC is presented that uses 37.7% less hardware resources, dissipates 76.8% less power and provides a 1.8 times speed-up while maintaining the same computation accuracy. The novel QR systolic array architecture based on the enhanced vectoring CORDIC saves 5% in hardware and the throughput is improved by a factor of two with no accuracy penalty when compared with the best previous version of the QR systolic array.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"35 1","pages":"123-130"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85145372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zero latency encryption with FPGAs for secure time-triggered automotive networks","authors":"Shanker Shreejith, Suhaib A. Fahmy","doi":"10.1109/FPT.2014.7082788","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082788","url":null,"abstract":"Security has emerged as a key concern in increasingly complex embedded automotive networks. The distributed architecture and broadcast transmission characteristics mean they are vulnerable and provide little resistance to intrusive and non-intrusive attack mechanisms. Incorporating data security using traditional approaches introduces significant latency which can be problematic in the presence of real-time deadlines. We demonstrate how a security layer can be added within the network communication controller in modern time-triggered systems, without introducing additional latency or processing overheads. This allows critical communications to be secured in a manner that is transparent to the processors in the electronic control units (ECUs), while also safeguarding network communication properties.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"33 1","pages":"256-259"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80431527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collaborative processing of Least-Square Monte Carlo for American options","authors":"Jinzhe Yang, Ce Guo, W. Luk, Terence Nahar","doi":"10.1109/FPT.2014.7082753","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082753","url":null,"abstract":"American options are popularly traded in the financial market, so pricing those options becomes crucial in practice. In reality, many popular pricing models do not have analytical solutions. Hence techniques such as Monte Carlo are often used in practice. This paper presents a CPU-FPGA collaborative accelerator using state-of-the-art Least-Square Monte Carlo method, for pricing American options. We provide a new sequence of generating the Monte Carlo paths, and a precalculation strategy for the regression process. Our design is customisable for different pricing models, discretisation schemes, and regression functions. The Heston model is used as a case study for evaluating our strategy. Experimental results show that an FPGA-based solution could provide 22 to 64.5 times faster than a single-core CPU implementation.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"08 1","pages":"52-59"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86368483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Size aware placement for island style FPGAs","authors":"Junying Huang, C. Y. Lin, Yang Liu, Zhihua Li, Haigang Yang","doi":"10.1109/FPT.2014.7082749","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082749","url":null,"abstract":"In this paper we first examine the impact of FPGA size on overall performance and run-time of placement and routing in the context of cluster-based island-style FPGAs. Based on the observations, an FPGA placement algorithm, Min-Size, is introduced to alleviate the deterioration of performance and run-time of placement and routing when using a large FPGA to implement a circuit. We achieve this by allowing Min-Size to generate a more compact placement of logic, I/O and hard blocks. Our experimental results have shown a 3X and 4X speedup in placement and routing run-time, a 38% and 41% reduction in wire length, and a 8% and 5% improvement in critical path delay when FPGA size increases 10 times.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"18 5 1","pages":"28-35"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83495011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is high level synthesis ready for business? A computational finance case study","authors":"G. Inggs, Shane T. Fleming, David B. Thomas, W. Luk","doi":"10.1109/FPT.2014.7082747","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082747","url":null,"abstract":"High Level Synthesis (HLS) tools for Field Programmable Gate Arrays (FPGAs) have made considerable progress, and are now sufficiently mature that a novice developer could create functionally correct implementation with limited understanding of the target hardware. In this case study, a novice developer considers a benchmark of financial problems for implementation upon FPGA via HLS. This novice starts by extending an existing implementation for a CPU or GPU using tools such as Xilinx's Vivado HLS, the Altera OpenCL SDK or Maxeler's MaxCompiler. When their direct source code translation inevitably didn't meet performance expectations, this developer then applies optimisations such as exploiting task or pipeline parallelism as well as C-slowing. When a combination of these optimisations are considered for a range of devices and process technologies, an acceleration of up to 220 times is achieved using these tools, the sort of acceleration expected of custom architectures. Compared to the 31 times improvement shown by an optimised Multicore CPU implementation, the 60 times improvement by a GPU and 207 times by a Xeon Phi, these results suggest that HLS is indeed ready for industrial adoption.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"66 1","pages":"12-19"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86139429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware/software co-design architecture for Blokus Duo solver","authors":"N. Sugimoto, H. Amano","doi":"10.1109/FPT.2014.7082820","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082820","url":null,"abstract":"This paper presents a software and hardware design of an FPGA-based Blokus Duo solver. We used Embedded system called ZYNQ-7000 All Programmable SoC to implement the solver. By combining hardware with software, efficient acceleration is performed. Our system searches a game tree by using the miniMax algorithm with alpha-beta pruning. The implemented solver works at 75MHz with Xilinx Zynq-7000 AP SoC XC7Z020-CLG484 on the Digilent ZedBoard. It can search states after three moves in most cases.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"40 1","pages":"358-361"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91201197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}