GraphStep: A System Architecture for Sparse-Graph Algorithms
Michael DeLorimier, Nachiket Kapre, Nikil Mehta, Dominic Rizzo, I. Eslick, Raphael Rubin, Tomás E. Uribe, T. Knight, A. DeHon
2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2006), 24 April 2006. DOI: 10.1109/FCCM.2006.45
Abstract: Many important applications are organized around long-lived, irregular sparse graphs (e.g., data and knowledge bases, CAD optimization, numerical problems, simulations). The graph structures are large, and the applications need regular access to a large, data-dependent portion of the graph for each operation (e.g., the algorithm may need to walk the graph, visiting all nodes, or propagate changes through many nodes in the graph). On conventional microprocessors, the graph structures exceed on-chip cache capacities, making main-memory bandwidth and latency the key performance limiters. To avoid this "memory wall," we introduce a concurrent system architecture for sparse graph algorithms that places graph nodes in small distributed memories paired with specialized graph processing nodes interconnected by a lightweight network. This gives us a scalable way to map these applications so that they can exploit the high-bandwidth and low-latency capabilities of embedded memories (e.g., FPGA Block RAMs). On typical spreading-activation queries on the ConceptNet knowledge base, a sample application, this translates into an order-of-magnitude speedup per FPGA compared to a state-of-the-art Pentium processor.
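The spreading-activation model the abstract refers to can be sketched in software. The sketch below is an illustration of one synchronous "graph step" (active nodes fire decayed updates along their edges; destinations accumulate them), not the paper's FPGA implementation; the `decay` and `threshold` parameters and the toy ConceptNet-style graph are assumptions for the example.

```python
# One synchronous spreading-activation step: every sufficiently active
# node propagates a decayed activation to its neighbours, which
# accumulate the incoming contributions. Parameters are illustrative.

def graph_step(adj, activation, decay=0.5, threshold=0.1):
    """Apply one spreading-activation step over adjacency lists."""
    incoming = {v: 0.0 for v in adj}
    for node, level in activation.items():
        if level < threshold:            # only active nodes fire
            continue
        for nbr in adj[node]:
            incoming[nbr] += decay * level   # accumulate at destination
    # merge the new contributions into the existing activation levels
    return {v: max(activation.get(v, 0.0), incoming[v]) for v in adj}

adj = {"dog": ["pet", "bark"], "pet": ["animal"], "bark": [], "animal": []}
act = {"dog": 1.0, "pet": 0.0, "bark": 0.0, "animal": 0.0}
act = graph_step(adj, act)   # activation reaches "pet" and "bark"
act = graph_step(adj, act)   # activation reaches "animal"
```

In the GraphStep architecture each node's state and edge list live in a small embedded memory next to a processing element, so the inner loop above runs in parallel across the graph rather than serially against a main-memory bottleneck.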
Scalable Hardware Architecture for Real-Time Dynamic Programming Applications
B. Matthews, I. Elhanany
FCCM 2006, 24 April 2006. DOI: 10.1109/FCCM.2006.61
Abstract: This paper introduces a novel architecture for performing the core computations required by dynamic programming (DP) techniques. The latter pertain to a vast range of applications that necessitate an optimal sequence of decisions to be issued. An underlying assumption is that a complete model of the environment is provided, whereby the dynamics are governed by a Markov decision process (MDP). Existing DP implementations have traditionally been realized in software. Here, we present a method for exploiting the data parallelism associated with computing both the value function and optimal action set. An optimal policy is obtained four orders of magnitude faster than traditional software-based schemes, establishing the viability of the approach for real-time applications.
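The value-function computation the abstract describes is the classic value-iteration recurrence for an MDP; the hardware evaluates the max-over-actions update for many states in parallel. A minimal software reference, with a toy two-state MDP as an assumed example:

```python
def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Solve V(s) = max_a [ R(s,a) + gamma * sum_t P(t|s,a) * V(t) ].

    P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the
    immediate reward. Iterates until the value function stops changing.
    """
    n = len(P)
    V = [0.0] * n
    while True:
        V_new = [max(R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
                     for a in range(len(P[s])))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(V, V_new)) < eps:
            return V_new
        V = V_new

# Toy MDP: state 0 can stay (reward 0) or move to the absorbing state 1
# (reward 1); state 1 only stays with reward 0.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)]]]
R = [[0.0, 1.0], [0.0]]
V = value_iteration(P, R)
```

The hardware's advantage comes from evaluating the bracketed expression for all states concurrently per sweep, whereas this software loop visits states one at a time.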
Hardware/Software Integration for FPGA-based All-Pairs Shortest-Paths
Uday Bondhugula, A. Devulapalli, James Dinan, Joseph A. Fernando, P. Wyckoff, E. Stahlberg, P. Sadayappan
FCCM 2006, 24 April 2006. DOI: 10.1109/FCCM.2006.48
Abstract: Field-programmable gate arrays (FPGAs) are being employed in high-performance computing systems owing to their potential to accelerate a wide variety of long-running routines. Parallel FPGA-based designs often yield a very high speedup. Applications using these designs on reconfigurable supercomputers involve software on the system managing computation on the FPGA. To extract maximum performance from an FPGA design at the application level, it becomes necessary to minimize associated data-movement costs on the system. We address this hardware/software integration challenge in the context of the all-pairs shortest-paths (APSP) problem in a directed graph. We employ a parallel FPGA-based design using a blocked algorithm to solve large instances of APSP. With appropriate design choices and optimizations, experimental results on the Cray XD1 show that the FPGA-based implementation sustains an application-level speedup of 15 over an optimized CPU-based implementation.
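The blocked APSP algorithm the abstract mentions is the standard tiled Floyd-Warshall: each round processes a pivot tile, then the pivot row and column, then all remaining tiles as min-plus updates, so large instances stream through a fixed-size compute block. A software sketch of that structure (the tile size and example graph are assumptions, not the paper's configuration):

```python
INF = float("inf")

def tile(D, bi, bj, B):
    """Copy out the B x B tile at block row bi, block column bj."""
    return [row[bj*B:(bj+1)*B] for row in D[bi*B:(bi+1)*B]]

def put(D, bi, bj, B, T):
    for i in range(B):
        D[bi*B + i][bj*B:(bj+1)*B] = T[i]

def fw_kernel(C, A, Bm, B):
    """Min-plus update: C[i][j] = min(C[i][j], A[i][k] + Bm[k][j])."""
    for k in range(B):
        for i in range(B):
            for j in range(B):
                d = A[i][k] + Bm[k][j]
                if d < C[i][j]:
                    C[i][j] = d

def blocked_apsp(D, B):
    """In-place blocked Floyd-Warshall on an n x n distance matrix."""
    nb = len(D) // B
    for kb in range(nb):
        P = tile(D, kb, kb, B)              # phase 1: pivot tile
        fw_kernel(P, P, P, B)
        put(D, kb, kb, B, P)
        for jb in range(nb):                # phase 2: pivot row/column
            if jb == kb:
                continue
            C = tile(D, kb, jb, B); fw_kernel(C, P, C, B); put(D, kb, jb, B, C)
            C = tile(D, jb, kb, B); fw_kernel(C, C, P, B); put(D, jb, kb, B, C)
        for ib in range(nb):                # phase 3: remaining tiles
            if ib == kb:
                continue
            A = tile(D, ib, kb, B)
            for jb in range(nb):
                if jb == kb:
                    continue
                C = tile(D, ib, jb, B)
                fw_kernel(C, A, tile(D, kb, jb, B), B)
                put(D, ib, jb, B, C)
    return D

D = [[0, 1, INF, 10],
     [INF, 0, 1, INF],
     [INF, INF, 0, 1],
     [INF, INF, INF, 0]]
blocked_apsp(D, 2)   # shortest path 0 -> 3 becomes 3 via 0 -> 1 -> 2 -> 3
```

On the FPGA the `fw_kernel` corresponds to the fixed hardware core, and the surrounding tile traffic is exactly the host-to-FPGA data movement the paper sets out to minimize.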
Systematic Characterization of Programmable Packet Processing Pipelines
Michael Attig, G. Brebner
FCCM 2006, 24 April 2006. DOI: 10.1109/FCCM.2006.67
Abstract: This paper considers the elaboration of custom pipelines for network packet processing, built upon flexible programmability of pipeline-stage granularity. A systematic procedure for accurately characterizing the throughput, latency, and FPGA resource requirements of different programmed pipeline variants is presented. This procedure may be exploited at design time, configuration time, or run time, to program pipeline architectures to meet specific networking application requirements. The procedure is illustrated using three case studies drawn from real-life packet processing at different levels of networking protocol. Detailed results are presented, demonstrating that the procedure estimates pipeline characteristics well, thus allowing rapid architecture-space exploration prior to elaboration.
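The kind of characterization the abstract describes can be illustrated with a first-order analytical model: a fully pipelined design clocks at the rate of its slowest stage, latency is the total register depth at that clock, and resources add up. This is an assumed simplification for illustration, not the paper's actual procedure, and the stage names and figures below are invented.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    depth: int        # pipeline registers (cycles of latency)
    fmax_mhz: float   # standalone maximum clock of the stage
    luts: int         # FPGA LUTs consumed

def characterize(stages):
    """First-order pipeline estimate: the clock is limited by the
    slowest stage, latency is total depth at that clock, and resource
    use is the sum over stages (one packet word per cycle assumed)."""
    clk = min(s.fmax_mhz for s in stages)
    depth = sum(s.depth for s in stages)
    return {
        "clock_mhz": clk,
        "latency_ns": depth * 1000.0 / clk,
        "throughput_mwords_s": clk,     # fully pipelined: one per cycle
        "luts": sum(s.luts for s in stages),
    }

# Hypothetical three-stage packet pipeline: parse, lookup, edit.
est = characterize([Stage("parse", 4, 200.0, 900),
                    Stage("lookup", 8, 160.0, 2400),
                    Stage("edit", 5, 180.0, 1100)])
```

A model of this shape is what makes rapid architecture-space exploration possible: variants can be compared arithmetically before any of them is elaborated to hardware.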
A Scalable FPGA-based Multiprocessor
A. Patel, Christopher A. Madill, Manuel Saldaña, C. Comis, R. Pomès, P. Chow
FCCM 2006, 24 April 2006. DOI: 10.1109/FCCM.2006.17
Abstract: It has been shown that a small number of FPGAs can significantly accelerate certain computing tasks by up to two or three orders of magnitude. However, particularly intensive large-scale computing applications, such as molecular dynamics simulations of biological systems, underscore the need for even greater speedups to address relevant length and time scales. In this work, we propose an architecture for a scalable computing machine built entirely from FPGA computing nodes. The machine enables designers to implement large-scale computing applications using a heterogeneous combination of hardware accelerators and embedded microprocessors spread across many FPGAs, all interconnected by a flexible communication network. Parallelism at multiple levels of granularity within an application can be exploited to obtain the maximum computational throughput. By focusing on applications that exhibit a high computation-to-communication ratio, we narrow the extent of this investigation to the development of a suitable communication infrastructure for our machine, as well as an appropriate programming model and design flow for implementing applications. By providing a simple, abstracted communication interface with the objective of being able to scale to thousands of FPGA nodes, the proposed architecture appears to the programmer as a unified, extensible FPGA fabric. A programming model based on the MPI message-passing standard is also presented as a means for partitioning an application into independent computing tasks that can be implemented on our architecture. Finally, we demonstrate the first use of our design flow by developing a simple molecular dynamics simulation application for the proposed machine, which runs on a small platform of development boards.
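The MPI-style programming model amounts to partitioning the application into ranked tasks that interact only through explicit send/receive operations, so each task can be mapped to a processor or a hardware accelerator without changing the program. A toy software analogue of that model, using threads and queues (the `Comm` class and the ring example are illustrative inventions, not the paper's API):

```python
import threading
import queue

class Comm:
    """Toy MPI-like communicator: blocking send/recv between ranks,
    each rank owning one mailbox queue."""
    def __init__(self, nranks):
        self.boxes = [queue.Queue() for _ in range(nranks)]
    def send(self, data, dest):
        self.boxes[dest].put(data)
    def recv(self, rank):
        return self.boxes[rank].get()

def worker(comm, rank, n, values, out):
    """Ring exchange: send my value right, receive from the left,
    combine. Only message passing couples the ranks."""
    comm.send(values[rank], (rank + 1) % n)
    left_val = comm.recv(rank)
    out[rank] = values[rank] + left_val

comm = Comm(4)
vals = [1, 2, 3, 4]
out = [0] * 4
threads = [threading.Thread(target=worker, args=(comm, r, 4, vals, out))
           for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# out[r] = vals[r] + vals[(r - 1) % 4]
```

Because the tasks share no state beyond the communicator, the same decomposition works whether a rank is an embedded microprocessor or a hardware engine behind the abstracted communication interface.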
Parrotfish: Task Distribution in a Low Cost Autonomous ad hoc Sensor Network through Dynamic Runtime Reconfiguration
D. Efstathiou, Konstantinos Kazakos, A. Dollas
FCCM 2006, 24 April 2006. DOI: 10.1109/FCCM.2006.56
Abstract: The Parrotfish project is a low-cost, distributed environment for (partial) reconfiguration of distributed field-programmable systems, e.g. sensor networks. In this paper we present architectures and results in which the wireless nodes of a distributed system can undergo runtime task-reversals under triggering from external conditions, using Bluetooth as a low-cost wireless medium. The project gets its name from the small fish found in Florida waters, which can change gender as needed under group dynamics.
Enabling a Uniform Programming Model Across the Software/Hardware Boundary
E. Anderson, J. Agron, W. Peck, Jim Stevens, Fabrice Baijot, E. Komp, R. Sass, D. Andrews
FCCM 2006, 24 April 2006. DOI: 10.1109/FCCM.2006.40
Abstract: In this paper, we present hthreads, a unifying programming model for specifying application threads running within a hybrid CPU/FPGA system. Threads are specified from a single pthreads multithreaded application program and compiled to run on the CPU or synthesized to run on the FPGA. The hthreads system is unique within the reconfigurable computing community in that it abstracts the CPU/FPGA components into a unified custom threaded multiprocessor architecture platform. To support the abstraction of the CPU/FPGA component boundary, we have created the hardware thread interface (HWTI) component, which frees the designer from having to specify and embed platform-specific instructions to form customized hardware/software interactions. Instead, the hardware thread interface supports the generalized pthreads API semantics and allows passing of abstract data types between hardware and software threads. Thus the hardware thread interface provides an abstract, platform-independent compilation target that enables thread- and instruction-level parallelism across the software/hardware boundary.
Floating-Point Accumulation Circuit for Matrix Applications
M.R. Bodnar, J. Humphrey, P. Curt, J. Durbano, D. Prather
FCCM 2006, 24 April 2006. DOI: 10.1109/FCCM.2006.41
Abstract: Many scientific algorithms require floating-point reduction operations, or accumulations, including matrix-vector multiply (MVM), vector dot-products, and the discrete cosine transform (DCT). Because FPGA implementations of each of these algorithms are desirable, it is clear that a high-performance, floating-point accumulation unit is necessary. However, this type of circuit is difficult to design in an FPGA environment due to the deep pipelining of the floating-point arithmetic units, which is needed in order to attain high-performance designs (Durbano et al., 2004; Leeser and Wang, 2004). A deep pipeline requires special handling in feedback circuits because of the long delay, which is further complicated by a continuous input data stream. Accumulator architectures that overcome such performance bottlenecks are described in Zhuo et al. (2005) and Zhuo and Prasanna (2005). This paper presents a floating-point accumulation circuit that is a natural evolution of this work. The system can handle streams of arbitrary length, requires modest area, and can handle interrupted data inputs. In contrast to the designs proposed by Zhuo et al., the proposed architecture maintains buffers for partial-result storage which utilize significantly less embedded memory resources, while maintaining fixed size and speed characteristics regardless of stream length. The results for both single- and double-precision accumulation architectures were verified in a Virtex-II 8000-4 part clocked at more than 150 MHz, and the power of this design was demonstrated in a computationally intense matrix-matrix-multiply application.
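The feedback problem the abstract describes arises because an adder with an L-cycle pipeline cannot add a new input to a running sum every cycle. A common workaround, shown here as a software model (an illustration of the general technique, not the paper's specific circuit), is to keep L interleaved partial sums so consecutive inputs land in different lanes, then reduce the lanes at the end of the stream:

```python
class PipelinedAccumulator:
    """Models accumulation through an adder with an L-cycle latency by
    maintaining L interleaved partial sums fed round-robin, so no lane
    is reused before its previous sum has left the adder pipeline.
    A hardware version keeps the partials in a small on-chip buffer."""
    def __init__(self, latency):
        self.latency = latency
        self.partials = [0.0] * latency
        self.slot = 0
    def push(self, x):
        # Round-robin lane assignment: by the time a lane is revisited,
        # latency cycles have passed and its sum is available again.
        self.partials[self.slot] += x
        self.slot = (self.slot + 1) % self.latency
    def result(self):
        # Final reduction of the lanes (a log-depth adder tree in HW).
        return sum(self.partials)

acc = PipelinedAccumulator(latency=8)
for x in [0.5, 1.25, -0.75, 2.0, 3.5]:
    acc.push(x)
```

Note that such interleaving reorders the floating-point additions, so results can differ from a strictly sequential sum in the last bits; the fixed number of lanes is what gives the architecture its stream-length-independent size.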
Intrinsic Hardware Evolution of Neural Networks in Reconfigurable Analogue and Digital Devices
John Maher, Brian McGinley, P. Rocke, F. Morgan
FCCM 2006, 24 April 2006. DOI: 10.1109/FCCM.2006.53
Abstract: In this paper, a genetic algorithm (GA) has been developed to evolve a neural network (NN) implementation of a two-input XOR function. This GA is subsequently used to contrast the relative difficulties of implementing the XOR NN on FPGAs and FPAAs, respectively. Two case studies are presented to demonstrate intrinsic evolution of the XOR network on reconfigurable analogue and digital devices. In both cases the GA evolves the synaptic weights and threshold values for an NN implemented on both field-programmable gate array (FPGA) and field-programmable analogue array (FPAA) hardware platforms.
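The GA loop the abstract describes can be sketched in simulation: a population of weight/threshold vectors for a small 2-2-1 network is scored on the four XOR cases and refined by selection and mutation (in the paper the fitness evaluation runs intrinsically on the FPGA/FPAA itself). The network shape, population size, mutation rate, and activation functions below are assumptions for the sketch:

```python
import math
import random

def forward(w, x1, x2):
    """2-2-1 network: tanh hidden units, sigmoid output. w packs
    6 hidden weights/thresholds and 3 output weights/threshold."""
    h1 = math.tanh(w[0]*x1 + w[1]*x2 - w[2])
    h2 = math.tanh(w[3]*x1 + w[4]*x2 - w[5])
    return 1.0 / (1.0 + math.exp(-(w[6]*h1 + w[7]*h2 - w[8])))

XOR = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def error(w):
    """Sum-of-squares error over the four XOR cases (the fitness)."""
    return sum((forward(w, a, b) - t) ** 2 for a, b, t in XOR)

def evolve(pop_size=40, gens=300, sigma=0.4, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(-2, 2) for _ in range(9)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=error)
        parents = pop[:pop_size // 4]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            child = rng.choice(parents)[:]
            child[rng.randrange(9)] += rng.gauss(0, sigma)  # mutate one gene
            children.append(child)
        pop = parents + children               # elitism: parents survive
    return min(pop, key=error)

best = evolve()
```

Intrinsic evolution replaces `error` with a measurement of the physical device configured with the candidate weights, which is why the same GA can drive both the digital and the analogue platform.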
DSynth: A Pipeline Synthesis Environment for FPGAs
M. Wirthlin, Welson Sun
FCCM 2006, 24 April 2006. DOI: 10.1109/FCCM.2006.37
Abstract: A synthesis environment called DSynth has been created for synthesizing high-performance pipelined circuits for FPGAs from synchronous data-flow specifications. The goal of this work is to generate the minimum-size circuit that meets the throughput constraint of the data-flow model. To achieve this constraint efficiently, the approach relies heavily upon a library of pre-characterized pipelined circuit modules. In addition, resource sharing is used extensively to reduce the overall hardware cost.