{"title":"Accelerating Large-Scale Graph Analytics with FPGA and HMC","authors":"Soroosh Khoram, Jialiang Zhang, Maxwell Strange, J. Li","doi":"10.1109/FCCM.2017.58","DOIUrl":"https://doi.org/10.1109/FCCM.2017.58","url":null,"abstract":"Graph analytics that explores the relationship among interconnected entities is becoming increasingly important due to its broad applicability from machine learning to social science. However, one major challenge for graph processing systems is the irregular data access pattern of graph computation which can significantly degrade the performance. The algorithms, software, and hardware that have been tailored for mainstream parallel applications are, as a result, generally not effective for massive-scale sparse graphs from the real world due to their complexity and irregularity. To address the performance issues in large-scale graph analytics, we combine the emerging Hybrid Memory Cube (HMC) with a modern FPGA in order to achieve exceptional random access performance without any loss of flexibility or efficiency in computation. In particular, we develop collaborative software/hardware techniques to perform a level-synchronized breadth first search (BFS) on the FPGA-HMC platform. From the software perspective, we develop an architecture-aware graph clustering algorithm that fully exploits the platform's capability to improve data locality and memory access efficiency. For each input graph, this algorithm provides an efficient data layout that allows the FPGA to coalesce memory requests into the largest possible HMC payload requests so that the number of memory requests, which is the primary factor in runtime, can be minimized. From the hardware perspective, we further improve the FPGA-HMC graph processor architecture by adding a merging unit. The merging unit takes the best advantage of the increased data locality resulting from graph clustering. 
We evaluated the performance of our BFS implementation using the AC-510 development kit from Micron over a set of benchmarks from a wide range of applications. We observed that the combination of the clustering algorithm and the merging hardware achieved 2.8 × average performance improvement compared to the latest FPGA-HMC based graph processing system.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131377641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Customizing Neural Networks for Efficient FPGA Implementation","authors":"Mohammad Samragh, M. Ghasemzadeh, F. Koushanfar","doi":"10.1109/FCCM.2017.43","DOIUrl":"https://doi.org/10.1109/FCCM.2017.43","url":null,"abstract":"We propose a novel end-to-end framework to customize execution of deep neural networks on FPGA platforms. Our framework employs a reconfigurable clustering approach that encodes the parameters of deep neural networks in accordance with the application's accuracy requirement and the underlying platform constraints. The throughput of FPGA-based realizations of neural networks is often bounded by the memory access bandwidth. The use of encoded parameters reduces both the required memory bandwidth and the computational complexity of neural networks, increasing the effective throughput. Our framework enables systematic customization of encoded deep neural networks for different FPGA platforms. Proof-of-concept evaluations on four different applications demonstrate up to 9-fold reduction in memory footprint and 15-fold improvement in the operational throughput while the drop in accuracy remains below 0.1%.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123241513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the Accuracy of Arctan for Face Detection","authors":"Youngsoo Kim, Hossein Shahdoost, Shrikant S. Jadhav, C. Gloster","doi":"10.1109/FCCM.2017.48","DOIUrl":"https://doi.org/10.1109/FCCM.2017.48","url":null,"abstract":"Significant barriers to real-time face detection have been the complexity of computation kernels and the minimal cost and superior accuracy requirements for both software and hardware implementations based on traditional high performance computing. It is desirable to develop a variable-precision face detection block for high dynamic range applications, including night vision and infrared face detection. This paper develops an Arctan function for face detection that supports input ranges up to 360 degrees for Histogram of Oriented Gradients (HOG). Our implementation takes advantage of mathematical identities for the pedestrian HOG computation. We compare our HOG block design to fixed point implementations and find that floating point HOG is not computationally expensive and can accelerate the face detection process.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122836908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Megrez: Parallelizing FPGA Routing with Strictly-Ordered Partitioning","authors":"Minghua Shen, Guojie Luo","doi":"10.1109/FCCM.2017.18","DOIUrl":"https://doi.org/10.1109/FCCM.2017.18","url":null,"abstract":"FPGAs will play a crucial role in the space of customizable accelerators over the next few years. A chief limiting factor is that FPGA CAD tools are cumbersome and time-consuming for most application developers. Routing is the most complex step in the FPGA design flow and an NP-complete problem. The PathFinder routing algorithm is in dominant use in FPGA CAD research. However, PathFinder is sequential in nature and lengthy in runtime. Parallelization has the potential to solve the issue but faces non-trivial challenges. In this work we introduce Megrez, which uses strictly-ordered partitioning to explore the parallelism on GPU. Experimental results show that Megrez achieves an average of 15.13× speedup on GPU with negligible influence on the routing quality.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125829932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relocating Encrypted Partial Bitstreams by Advance Task Address Loading","authors":"Adewale Adetomi, Godwin Enemali, T. Arslan","doi":"10.1109/FCCM.2017.50","DOIUrl":"https://doi.org/10.1109/FCCM.2017.50","url":null,"abstract":"The ability to relocate hardware tasks in FPGAs is an attractive task management technique, especially in reconfigurable operating systems. One method of relocation involves modifying the location address of the task while it is being configured. However, the use of encryption to protect bitstreams requires that decryption be done on-chip before relocation. This usually results in a significant resource overhead, arising from the introduced decryption circuit. This paper presents Advance Task Address Loading (ATAL), a unique solution that involves loading the unencrypted task address ahead of the encrypted task's configuration frame data. We have developed a software tool named Splixbit, which processes the bitstream offline, and a corresponding hardware configuration controller that configures the bitstream on the FPGA. Our results confirm the possibility of avoiding a dedicated on-chip decryption circuit when relocating encrypted partial bitstreams.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126624699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Terabyte Sort on FPGA-Accelerated Flash Storage","authors":"S. Jun, Shuotao Xu, Arvind","doi":"10.1109/FCCM.2017.53","DOIUrl":"https://doi.org/10.1109/FCCM.2017.53","url":null,"abstract":"Sorting is one of the most fundamental and useful applications in computer science, and continues to be an important tool in analyzing large datasets. An important and challenging subclass of sorting problems involves sorting terabyte-scale datasets with hundreds of billions of records. The conventional method of sorting such large amounts of data is to distribute the data and computation over a cluster of machines. Such solutions can be fast but are often expensive and power-hungry. In this paper, we propose a solution based on flash storage connected to a collection of FPGA-based sorting accelerators that perform large-scale merge-sort in storage. The accelerators include highly efficient sorting networks and merge trees that use bitonic sorting to emit multiple sorted values every cycle. We show that by appropriate use of accelerators we can remove all the computation bottlenecks so that the end-to-end sorting performance is limited only by the flash storage bandwidth. We demonstrate that our flash-based system matches the performance of existing distributed-cluster solutions of much larger scale. More importantly, our prototype is able to show almost twice the power efficiency compared to the existing Joulesort record holder. 
An optimized system with less wasteful components is projected to be four times more efficient compared to the current record holder, sorting over 200,000 records per joule of energy.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123766312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Real-Time Embedded FPGA Processor for a Stand-Alone Dual-Mode Assistive Device","authors":"A. Jafari, Maysam Ghovanloo, T. Mohsenin","doi":"10.1109/FCCM.2017.55","DOIUrl":"https://doi.org/10.1109/FCCM.2017.55","url":null,"abstract":"This paper presents a stand-alone Dual-mode Tongue Drive System (sdTDS), which is designed for people with severe disabilities to control their environment using their tongue motion and speech. The sdTDS detects the user's tongue motion using a magnetic tracer placed on the tongue and an array of magnetic sensors embedded in a wireless headset, and at the same time it can capture the user's voice using a small microphone embedded in the same headset. A real-time FPGA-based local processor is proposed that can perform all required signal processing at the sensor side, rather than sending all raw data out to a PC or smartphone. The proposed sdTDS significantly reduces the transmitter power consumption and subsequently increases the battery life.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122386174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication-Aware MCMC Method for Big Data Applications on FPGAs","authors":"Shuanglong Liu, C. Bouganis","doi":"10.1109/FCCM.2017.9","DOIUrl":"https://doi.org/10.1109/FCCM.2017.9","url":null,"abstract":"Markov Chain Monte Carlo (MCMC) based methods have been the main tool for Bayesian Inference for some years now, and recently they have found increasing applications in modern statistics and machine learning. Nevertheless, with the availability of large datasets and the increasing complexity of Bayesian models, MCMC methods are becoming prohibitively expensive for real-world problems. At the heart of these methods lies the computation of likelihood functions, which requires access to all input data points in each iteration of the method. Current approaches, based on data subsampling, aim to accelerate these algorithms by reducing the number of data points used for likelihood evaluations at each MCMC iteration. However, the existing work does not consider the properties of modern memory hierarchies, but treats the memory as one monolithic storage space. This paper proposes a communication-aware MCMC framework that takes into account the underlying performance of the memory subsystem. The framework is based on a novel subsampling algorithm that utilises an unbiased likelihood estimator based on Probability Proportional-to-Size (PPS) sampling, allowing information on the performance of the memory system to be taken into account during the sampling stage. The proposed MCMC sampler is mapped to an FPGA device and its performance is evaluated using the Bayesian logistic regression model on the MNIST dataset. 
The proposed system achieves a 3.37x speed up over a highly optimised traditional FPGA design, therefore the risk in the estimates based on the generated samples is largely decreased.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127901854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"K-Mer Counting Using Bloom Filters with an FPGA-Attached HMC","authors":"Nathaniel McVicar, Chih-Ching Lin, S. Hauck","doi":"10.1109/FCCM.2017.23","DOIUrl":"https://doi.org/10.1109/FCCM.2017.23","url":null,"abstract":"As FPGAs are integrated into the cloud, they become useful in a number of areas where they were not traditionally considered, such as processing genomics data. For many genomics applications, such as K-mer counting, the off-chip DRAM (and sometimes SRAM) memory subsystems provided by most FPGA boards for high capacity storage are not efficient. Recently, new styles of memory have been developed, though their role in reconfigurable computing systems can be unclear. One of the challenges these memory systems present to FPGA designers is identifying how they can be used in current systems, and what new applications become possible. In this paper we describe how and why K-mer counting is one such use for an FPGA-attached Hybrid Memory Cube (HMC). The HMC's high random-access rate is ideal for large Bloom filters, an efficient structure for checking membership in a set, or even counting occurrences. 
Our HMC based counting Bloom filter, useful in a bioinformatics context, achieves a speedup of 13x over traditional FPGA-attached DRAM and 9.31x to 17.6x over multi-core, multi-threaded software on our host system.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125796158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Network Function Virtualization for Heterogeneous Middleboxes","authors":"Xuzhi Zhang, Xiaozhe Shao, George Provelengios, Naveen Kumar Dumpala, Lixin Gao, R. Tessier","doi":"10.1109/FCCM.2017.24","DOIUrl":"https://doi.org/10.1109/FCCM.2017.24","url":null,"abstract":"Over the past decade, a wide-ranging collection of network functions in middleboxes has been used to accommodate the needs of network users. Although the use of general-purpose processors has been shown to be feasible for this purpose, the serial nature of microprocessors limits network functional virtualization (NFV) performance. In this paper, we describe a new heterogeneous hardware-software approach to NFV construction that provides scalability and programmability, while supporting significant hardware-level parallelism and reconfiguration. Our computing platform uses both field-programmable gate arrays (FPGA) and microprocessors to implement numerous NFV operations that can be dynamically customized to specific network flow needs. As the number of required functions and their characteristics change, the hardware in the FPGA is automatically reconfigured to support the updated requirements. Traffic management and hardware reconfiguration functions are performed by a global coordinator which allows for the rapid sharing of middlebox state and continuous evaluation of network function needs. To evaluate our approach, a series of software tools and NFV modules have been implemented. 
Our system is shown to be scalable for collections of network functions exceeding one million shared states.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129896259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}