高性能计算技术 Pub Date: 2009-11-15 DOI: 10.1145/1646461.1646467 Pages: 47-54
Title: Bridging parallel and reconfigurable computing with multilevel PGAS and SHMEM+
Authors: V. Aggarwal, A. George, K. Yalamanchili, C. Yoon, H. Lam, G. Stitt
Abstract: Reconfigurable computing (RC) systems based on FPGAs are becoming an increasingly attractive solution for building the parallel systems of the future. Applications targeting such systems have demonstrated superior performance and reduced energy consumption versus their traditional counterparts based on microprocessors, but most such work has been limited to small system sizes. Unlike traditional HPC systems, these systems lack integrated, system-wide parallel programming models and languages, which presents a significant design challenge for creating applications targeting scalable, reconfigurable HPC systems. In this paper, we introduce and investigate a novel programming model based on the Partitioned Global Address Space (PGAS), which simplifies development of parallel applications for such systems. The new multilevel PGAS programming model captures the unique characteristics of these systems, such as the existence of multiple levels of memory hierarchy and heterogeneous computation resources. To evaluate this multilevel PGAS model, we extend and adapt the SHMEM programming library to become what we call SHMEM+, the first known SHMEM library enabling coordination between FPGAs and CPUs in a reconfigurable, heterogeneous HPC system. Our design of SHMEM+ is highly portable and provides peak communication bandwidth comparable to vendor-proprietary versions of SHMEM. In addition, applications designed with SHMEM+ yield improved developer productivity compared to current methods of multi-device RC design and achieve a high degree of portability.
高性能计算技术 Pub Date: 2009-11-15 DOI: 10.1145/1646461.1646463 Pages: 11-18
Title: An integrated reduction technique for a double precision accumulator
Authors: Krishna K. Nagar, Yan Zhang, J. Bakos
Abstract: The accumulation operation, A_{n+1} = A_n + X, is perhaps one of the most fundamental and widely used operations in numerical mathematics and digital signal processing. However, designing double-precision floating-point accumulators presents a unique set of challenges: double-precision addition is usually deeply pipelined, and without special micro-architectural or data-scheduling techniques, the data hazard that exists between A_{n+1} and A_n requires that each new value of X delivered to the accumulator wait for the latency of the adder. There have been several techniques proposed for alleviating this problem, but each carries significant overheads and/or restrictions on input characteristics. In this paper we present a design for a double-precision accumulator that requires no timing overhead relative to the underlying add operation. We achieve this by integrating a coalescing reduction circuit within the low-level design of a base-converting floating-point adder. To demonstrate our accumulator design, we use it in a sparse matrix-vector multiplication architecture, achieving a throughput of up to 3.7 GFLOPS.
高性能计算技术 Pub Date: 2009-11-15 DOI: 10.1145/1646461.1646464 Pages: 19-28
Title: SCF: a device- and language-independent task coordination framework for reconfigurable, heterogeneous systems
Authors: V. Aggarwal, Rafael García, G. Stitt, A. George, H. Lam
Abstract: Heterogeneous computing systems composed of accelerators such as FPGAs, GPUs, and Cell processors coupled with standard microprocessors are becoming an increasingly popular way to build future computing systems. Although programming languages and tools have evolved to simplify device-level design, programming such systems is still difficult and time-consuming due to system-level challenges involving synchronization and communication between heterogeneous devices, which currently require ad hoc solutions. To solve this problem, this paper presents the System-Level Coordination Framework (SCF), which enables transparent communication and synchronization between tasks running on heterogeneous processing devices in the system. By hiding low-level architectural details from the application designer, SCF can improve application-development productivity, provide higher levels of application portability, and offer rapid design-space exploration of different task/device mappings. In addition, SCF enables custom communication synthesis, which can provide performance improvements over generic solutions employed previously.
高性能计算技术 Pub Date: 2009-11-15 DOI: 10.1145/1646461.1646465 Pages: 29-38
Title: A framework for core-level modeling and design of reconfigurable computing algorithms
Authors: Gongyu Wang, G. Stitt, H. Lam, A. George
Abstract: Reconfigurable computing (RC) is rapidly becoming a vital technology for many applications, from high-performance computing to embedded systems. The inherent advantages of custom-logic hardware devices, such as the FPGA, combined with the versatility of software-driven hardware configuration, often boost performance while reducing power consumption. However, compared to software design tools, the relatively immature state of RC design tools significantly limits productivity and consequently limits widespread adoption of RC. Long and tedious design-translate-execute (DTE) processes for RC applications (e.g., using RTL through HDL) must be repeated in order to meet mission requirements. Novel methods for rapid virtual prototyping and performance prediction can reduce DTE repetitions by providing fast and accurate tradeoff analysis before the design stage. This paper presents a novel core-level modeling and design (CMD) framework for RC algorithms to support fast, accurate, and early design-space exploration (DSE). The framework provides support for core-level modeling, performance prediction, and rapid bridging to design and translation. Core-level modeling enables detailed DSE without the need for coding. Performance prediction, such as maximum clock frequency, supports core-level DSE and can help system-level modeling and design tools achieve more accurate system-level DSE. Finally, core-level models can be used to generate code templates and design constraints that feed translation tools and to rapidly obtain predicted performance.
高性能计算技术 Pub Date: 2009-11-15 DOI: 10.1145/1646461.1646466 Pages: 39-46
Title: Sorting on architecturally diverse computer systems
Authors: R. Chamberlain, N. Ganesan
Abstract: Sorting is an important problem that forms an essential component of many high-performance applications. Here, we explore the design space of sorting algorithms in reconfigurable hardware, looking to maximize the benefit associated with high-bandwidth, multiple-port access to memory. Rather than focus on an individual implementation, we investigate a family of approaches that exploit characteristics fairly unique to reconfigurable hardware.
高性能计算技术 Pub Date: 2008-11-01 DOI: 10.1109/HPRCTA.2008.4745681 Pages: 1-10
Title: Hardware task scheduling optimizations for reconfigurable computing
Authors: Miaoqing Huang, H. Simmler, P. Saha, T. El-Ghazawi
Abstract: Reconfigurable computers (RCs) can provide significant performance improvement for domain applications. However, wide acceptance of today's RCs among domain scientists is hindered by the complexity of design tools and the required hardware-design experience. Recent developments in hardware/software co-design methodologies for these systems provide ease of use, but they are not comparable in performance to manual co-design. This paper aims at improving the overall performance of hardware tasks assigned to the FPGA. In particular, analysis of inter-task communication as well as data dependencies among tasks is used to reduce the number of configurations and to minimize communication overhead and task processing time. This work leverages algorithms developed in the RC and reconfigurable-hardware (RH) domains for efficient use of hardware resources to propose two algorithms, weight-based scheduling (WBS) and highest-priority-first next-fit (HPF-NF). However, traditional resource-based scheduling alone is not sufficient to remove the performance bottleneck, so a more comprehensive algorithm is necessary: the reduced data movement scheduling (RDMS) algorithm is proposed to address dependency analysis and inter-task communication optimizations. Simulation shows that, compared to WBS and HPF-NF, RDMS is able to reduce the number of FPGA configurations needed to schedule randomly generated graphs with heavy-weight nodes by 30% and 11%, respectively. Additionally, a proof-of-concept implementation of a complex 13-node example task graph on the SGI RC100 reconfigurable computer shows that RDMS is not only able to trim the number of necessary configurations from 6 to 4 but also to reduce communication overhead by 48% and hardware processing time by 33%.
高性能计算技术 Pub Date: 2008-11-01 DOI: 10.1109/HPRCTA.2008.4745683 Pages: 1-8
Title: Virtualizing and sharing reconfigurable resources in High-Performance Reconfigurable Computing systems
Authors: E. El-Araby, I. González, T. El-Ghazawi
Abstract: High-performance reconfigurable computers (HPRCs) are parallel computers but with added FPGA chips. Examples of such systems are the Cray XT5h and Cray XD1, the SRC-7 and SRC-6, and the SGI Altix/RASC. The execution of parallel applications on HPRCs mainly follows the single-program multiple-data (SPMD) model, which is largely the case in traditional high-performance computers (HPCs). In addition, the prevailing usage of FPGAs in such systems has been as co-processors. The overall system resources, however, are often underutilized because of the asymmetric distribution of the reconfigurable processors relative to the conventional processors. This asymmetry is often a challenge for using the SPMD programming model on these systems. In this work, we propose a resource virtualization solution based on partial run-time reconfiguration (PRTR). This technique will allow sharing the reconfigurable processors among the underutilized processors. We will present our virtualization infrastructure augmented with an analytical investigation. We will verify our proposed concepts with experimental implementations using the Cray XD1 as a testbed. It will be shown that this approach is quite promising and will allow full exploitation of the system resources with fair sharing of the reconfigurable processors among the microprocessors. Our approach is general and can be applied to any of the available HPRC systems.
高性能计算技术 Pub Date: 2008-11-01 DOI: 10.1109/HPRCTA.2008.4745686 Pages: 1-8
Title: Floating point based Cellular Automata simulations using a dual FPGA-enabled system
Authors: S. Murtaza, A. Hoekstra, P. Sloot
Abstract: With the recent emergence of multicore architectures, the age of multicore computing might have already dawned upon us. This shift might have triggered the evolution of the von Neumann architecture towards a parallel-processing paradigm. Cellular automata, inherently decentralized, spatially extended systems consisting of large numbers of simple and identical components with local connectivity, also proposed by von Neumann in the 1950s, are a potential candidate among the parallel-processing alternatives. The spatial parallelism available on field-programmable gate arrays makes them an ideal platform for investigating cellular automata as a potential parallel-processing paradigm on multicore architectures. The authors have been experimenting with this idea for some time and report their progress from a single- to a dual-FPGA-based cellular automata accelerator implementation. For a D2Q9 Lattice Boltzmann method implementation, we were able to achieve an overall speed-up of 2.3 by moving our Fortran implementation to our single-FPGA-based implementation. Further, with our dual-FPGA-based implementation, we achieved a speed-up close to 1.8 compared to our single-FPGA-based implementation.
高性能计算技术 Pub Date: 2008-11-01 DOI: 10.1109/HPRCTA.2008.4745682 Pages: 1-10
Title: MPI as an abstraction for software-hardware interaction for HPRCs
Authors: Manuel Saldaña, A. Patel, Christopher A. Madill, Daniel Nunes, Danyao Wang, Henry Styles, Andrew Putnam, Ralph Wittig, P. Chow
Abstract: High performance reconfigurable computers (HPRCs) consist of one or more standard microprocessors tightly coupled with one or more reconfigurable FPGAs. HPRCs have been shown to provide good speedups and good cost/performance ratios, but not necessarily ease of use, leading to a slow acceptance of this technology. HPRCs introduce new design challenges, such as the lack of portability across platforms, incompatibilities with legacy code, users reluctant to change their code base, a prolonged learning curve, and the need for a system-level hardware/software co-design development flow. This paper presents the evolution and current work on TMD-MPI, which started as an MPI-based programming model for multiprocessor systems-on-chip implemented in FPGAs, and has now evolved to include multiple x86 processors. TMD-MPI is shown to address current design challenges in HPRC usage, suggesting that the MPI standard has enough syntax and semantics to program these new types of parallel architectures. Also presented is the TMD-MPI ecosystem, which consists of research projects and tools that are developed around TMD-MPI to further improve HPRC usability.
高性能计算技术 Pub Date: 2008-11-01 DOI: 10.1109/HPRCTA.2008.4745679 Pages: 1-9
Title: Scalable FPGA-array for high-performance and power-efficient computation based on difference schemes
Authors: K. Sano, Luzhou Wang, Yoshiaki Hatsuda, S. Yamamoto
Abstract: For numerical computations requiring a relatively high ratio of data access to operations, the scalability of memory bandwidth is key to performance improvement. In this paper, we propose a scalable FPGA array to achieve custom computing machines for high-performance and power-efficient scientific simulations based on difference schemes. With the FPGA array, we construct a systolic computational-memory array (SCMA) by homogeneously partitioning the SCMA among multiple tightly coupled FPGAs. A large SCMA implemented using many FPGAs achieves high-performance computation with memory bandwidth and arithmetic performance that scale with the array size. For feasibility demonstration and quantitative evaluation, we design and implement an SCMA of 192 processing elements over two Altera Stratix II FPGAs. The implemented SCMA running at 106 MHz achieves sustained performance of 32.8 to 36.5 GFLOPS in single precision for three benchmark computations, while the peak performance is 40.7 GFLOPS. In comparison with a 3.4 GHz Pentium 4 processor, the SCMAs consume 70% to 87% of the power and require only 3% to 7% of the energy for the same computations. Based on our requirement model for inter-FPGA bandwidth, we show that SCMAs are fully scalable across currently available high-end to low-end FPGAs, while the SCMA implemented with two FPGAs demonstrates double the performance of the single-FPGA SCMA.