{"title":"Delivering Acceleration: The Potential for Increased HPC Application Performance Using Reconfigurable Logic","authors":"D. Caliga, David Barker","doi":"10.1145/582034.582066","DOIUrl":"https://doi.org/10.1145/582034.582066","url":null,"abstract":"SRC Computers, Inc. has integrated adaptive computing into its SRC-6 high-end server, incorporating reconfigurable processors as peers to the microprocessors. Performance improvements resulting from reconfigurable computing can provide orders of magnitude speedups for a wide variety of algorithms. Reconfigurable logic in Field Programmable Gate Arrays (FPGAs) has shown great advantage to date in special purpose applications and specialty hardware. SRC Computers is working to bring this technology into the general purpose HPC world via an advanced system interconnect and enhanced compiler technology.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115446839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Implementation and Performance of FastDNAml - A Program for Maximum Likelihood Phylogenetic Inference","authors":"C. Stewart, Dave Hart, Donald K. Berry, G. Olsen, E. Wernert, William Fischer","doi":"10.1145/582034.582054","DOIUrl":"https://doi.org/10.1145/582034.582054","url":null,"abstract":"This paper describes the parallel implementation of fastDNAml, a program for the maximum likelihood inference of phylogenetic trees from DNA sequence data. Mathematical means of inferring phylogenetic trees have been made possible by the wealth of DNA data now available. Maximum likelihood analysis of phylogenetic trees is extremely computationally intensive. Availability of computer resources is a key factor limiting use of such analyses. fastDNAml is implemented in serial, PVM, and MPI versions, and may be modified to use other message passing libraries in the future. We have developed a viewer for comparing phylogenies. We tested the scaling behavior of fastDNAml on an IBM RS/6000 SP up to 64 processors. The parallel version of fastDNAml is one of very few computational phylogenetics codes that scale well. fastDNAml is available for download as source code or compiled for Linux or AIX.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114818908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Removing the Overhead from Software-Based Shared Memory","authors":"Z. Radovic, Erik Hagersten","doi":"10.1145/582034.582090","DOIUrl":"https://doi.org/10.1145/582034.582090","url":null,"abstract":"The implementation presented in this paper — DSZOOM-WF — is a sequentially consistent, fine-grained distributed software-based shared memory. It demonstrates a protocol-handling overhead below a microsecond for all the actions involved in a remote load operation, to be compared to the fastest implementation to date of around ten microseconds. The all-software protocol is implemented assuming some basic low-level primitives in the cluster interconnect and an operating system bypass functionality, similar to the emerging InfiniBand standard. All interrupt- and/or poll-based asynchronous protocol processing is completely removed by running the entire coherence protocol in the requesting processor. This not only removes the asynchronous overhead, but also makes use of a processor that otherwise would stall. The technique is applicable to both page-based and fine-grain software-based shared memory. DSZOOM-WF consistently demonstrates performance comparable to hardware-based distributed shared memory implementations.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130052862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increasing Temporal Locality with Skewing and Recursive Blocking","authors":"G. Jin, J. Mellor-Crummey, R. Fowler","doi":"10.1145/582034.582077","DOIUrl":"https://doi.org/10.1145/582034.582077","url":null,"abstract":"We present a strategy, called recursive prismatic time skewing, that increase temporal reuse at all memory hierarchy levels, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions iteration space of multiple loops into skewed prisms with both spatial and temporal (or convergence) dimensions. Novel aspects of this work include: multi-dimensional loop skewing; handling carried data dependences in the skewed loops without additional storage; bi-directional skewing to accommodate periodic boundary conditions; and an analysis and transformation strategy that works inter-procedurally. We combine prismatic skewing with a recursive blocking strategy to boost reuse at all levels in a memory hierarchy. A preliminary evaluation of these techniques shows significant performance improvements compared both to original codes and to methods described previously in the literature. With an inter-procedural application of our techniques, we were able to reduce total primary cache misses of a large application code by 27% and secondary cache misses by 119%.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125434525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Parallel Irregular Reductions Using Partial Array Expansion","authors":"E. Gutiérrez, O. Plata, E. Zapata","doi":"10.1145/582034.582072","DOIUrl":"https://doi.org/10.1145/582034.582072","url":null,"abstract":"Much effort has been devoted recently to efficiently parallelize irregular reductions. In this paper, parallelizing techniques for these computations are analyzed in terms of three performance aspects: parallelism, data locality and memory overhead. These aspects have a strong influence in the overall performance and scalability of the parallel code. We will discuss how the parallelization techniques usually try to optimize some of these aspects, while missing the other(s). We will show that by combining complementary techniques we can improve the overall performance/scalability of the parallel irregular reduction, obtaining an effective solution for large problems on large machines. Specifically, a combination of array expansion and a locality-oriented method (DWA-LIP), named partial array expansion, is introduced. An implementation of the proposed method is discussed, showing that the transformation that the compiler must apply to the irregular reduction code is not excessively complex. Finally, the method is analyzed and experimentally evaluated.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129246660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Page Placement to Improve Locality in CC-NUMA Multiprocessors for TPC-C","authors":"Kenneth M. Wilson, B. Aglietti","doi":"10.1145/582034.582067","DOIUrl":"https://doi.org/10.1145/582034.582067","url":null,"abstract":"The use of CC-NUMA multiprocessors complicates the placement of physical memory pages. Memory closest to a processor provides the best access time, but optimal memory page placement is a difficult problem with process movement, multiple processes requiring access to the same physical memory page, and application behavior changing over execution time. We use dynamic page placement to move memory pages where needed for the database benchmark TPC-C executing on a four node CC-NUMA multiprocessor. Dynamic page placement achieves local memory accesses up to 73% of the time instead of the static page placement results of 34% locality achieved with first touch and 25% with round robin. This can result in a 17% improvement in performance.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"07 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127266591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Graphics and Interactivity with the Scaleable Graphics Engine","authors":"Kenneth A. Perrine, Donald R. Jones","doi":"10.1145/582034.582039","DOIUrl":"https://doi.org/10.1145/582034.582039","url":null,"abstract":"A parallel rendering environment is being developed to utilize the IBM Scaleable Graphics Engine (SGE), a hardware frame buffer for parallel computers. Goals of this software development effort include finding efficient ways of producing and displaying graphics generated on IBM SP nodes and of assisting programmers in adapting or creating scientific simulation applications to use the SGE. Four software development phases discussed utilize the SGE: tunneling, SMP rendering, development of an OpenGL API implementation which utilizes the SGE in parallel environments, and additions to the SGE-enabled OpenGL implementation that uses threads. The performance observed in software tests show that programmers would be able to utilize the SGE to output interactive graphics in a parallel environment.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125739305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Case Study in Application I/O on Linux Clusters","authors":"R. Ross, Daniel Nurmi, A. Cheng, M. Zingale","doi":"10.1145/582034.582045","DOIUrl":"https://doi.org/10.1145/582034.582045","url":null,"abstract":"A critical but often ignored component of system performance is the I/O system. Today’s applications demand a great deal from underlying storage systems and software, and both high-performance distributed storage and high level interfaces have been developed to fill these needs. In this paper we discuss the I/O performance of a parallel scientific application on a Linux cluster, the FLASH astrophysics code. This application relies on three I/O software components to provide high-performance parallel I/O on Linux clusters: the Parallel Virtual File System, the ROMIO MPI-IO implementation, and the Hierarchical Data Format library. Through instrumentation of both the application and underlying system software code we discover the location of major software bottlenecks. We work around the most inhibiting of these bottlenecks, showing substantial performance improvement. We point out similarities between the inefficiencies found here and those found in message passing systems, indicating that research in the message passing field could be leveraged to solve similar problems in high-level I/O interfaces.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131527543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Sun Fireplane System Interconnect","authors":"Alan E. Charlesworth","doi":"10.1145/582034.582041","DOIUrl":"https://doi.org/10.1145/582034.582041","url":null,"abstract":"System interconnect is a key determiner of the cost, performance, and reliability of large cache-coherent, shared-memory multiprocessors. Interconnect implementations have to accommodate ever greater numbers of ever faster processors. This paper describes the Sun™ Fireplane two-level cache-coherency protocol, and its use in the medium and large-sized UltraSPARC-III-based Sun Fire™ servers.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122366633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Using SCALEA for Performance Analysis of Distributed and Parallel Programs","authors":"Hong Linh Truong, T. Fahringer, Georg Madsen, A. Malony, H. Moritsch, S. Shende","doi":"10.1145/582034.582068","DOIUrl":"https://doi.org/10.1145/582034.582068","url":null,"abstract":"In this paper we give an overview of SCALEA, which is a new performance analysis tool for OpenMP, MPI, HPF, and mixed parallel/distributed programs. SCALEA instruments, executes and measures programs and computes a variety of performance overheads based on a novel overhead classification. Source code and HWprofiling is combined in a single system which significantly extends the scope of possible overheads that can be measured and examined, ranging from HW-counters, such as the number of cache misses or floating point operations, to more complex performance metrics, such as control or loss of parallelism. Moreover, SCALEA uses a new representation of code regions, called the dynamic code region call graph, which enables detailed overhead analysis for arbitrary code regions. An instrumentation description file is used to relate performance information to code regions of the input program and to reduce instrumentation overhead. Several experiments with realistic codes that cover MPI, OpenMP, HPF, and mixed OpenMP/MPI codes demonstrate the usefulness of SCALEA.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123198518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}