{"title":"Very High Resolution Simulation of Compressible Turbulence on the IBM-SP System","authors":"A. Mirin, R. Cohen, B. C. Curtis, W. Dannevik, A. Dimits, M. A. Duchauneau, D. Eliason, D. Schikore, S. E. Anderson, D. Porter, P. Woodward, L. Shieh, Steven W. White","doi":"10.1145/331532.331601","DOIUrl":"https://doi.org/10.1145/331532.331601","url":null,"abstract":"Understanding turbulence and mix in compressible flows is of fundamental importance to real-world applications such as chemical combustion and supernova evolution. The ability to run in three dimensions and at very high resolution is required for the simulation to accurately represent the interaction of the various length scales, and consequently, the reactivity of the intermixin species. Toward this end, we have carried out a very high resolution (over 8 billion zones) 3-D simulation of the Richtmyer-Meshkov instability and turbulent mixing on the IBM Sustained Stewardship TeraOp (SST) system, developed under the auspices of the Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI) and located at Lawrence Livermore National Laboratory. We have also undertaken an even higher resolution proof-of-principle calculation (over 24 billion zones) on 5832 processors of the IBM system, which executed for over an hour at a sustained rate of 1.05 Tflop/s, as well as a short calculation with a modified algorithm that achieved a sustained rate of 1.18Tflop/s. The full production scientific simulation, using a further modified algorithm, ran for 27,000 timesteps in slightly over a week of wall time using 3840 processors of the IBM system, clockin a sustained throughput of roughly 0.6 teraflop per second (32-bit arithmetic). Nearly 300,000 graphics files comprising over three terabytes of data were produced and post-processed. The capability of running in 3-D at high resolution enabled us to get a more accurate and detailed picture of the fluid-flow structure - in particular, to simulate the development of fine scale structures from the interactions of long-and short-wavelength phenomena, to elucidate differences between two-dimensional and three-dimensional turbulence, to explore a conjecture regarding the transition from unstable flow to fully developed turbulence with increasing Reynolds number, and to ascertain convergence of the computed solution with respect to mesh resolution.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116459256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SunTM MPI I/O: Efficient I/O for Parallel Applications","authors":"L. Wisniewski, Brad Smisloff, N. Nieuwejaar","doi":"10.1145/331532.331546","DOIUrl":"https://doi.org/10.1145/331532.331546","url":null,"abstract":"Many parallel applications require high-performance I/O to avoid negating some or all of the benefit derived from parallelizing its computation. When these applications are run on a loosely-coupled cluster of SMPs, the limitations of existing hardware and software present even more hurdles to performing high-performance I/O. In this paper, we describe our full implementation of the I/O portion of the MPI-2 specification. In particular, we discuss the limitations inherent in performing high-performance I/O on a cluster of SMPs and demonstrate the benefits of using a cluster-based filesystem over a traditional node-based filesystem.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123047255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Cost-Benefit Scheme for High Performance Predictive Prefetching","authors":"V. Vellanki, A. Chervenak","doi":"10.1145/331532.331582","DOIUrl":"https://doi.org/10.1145/331532.331582","url":null,"abstract":"High-performance computing systems will increasingly rely on prefetching data from disk to overcome long disk access times and maintain high utilization of parallel I/O systems. This paper evaluates a prefetching technique that chooses which blocks to prefetch based on their probability of access and decides whether to prefetch a particular block at a given time using a cost-benefit analysis. The algorithm uses a probability tree to record past accesses and to predict future access patterns. We simulate this prefetching algorithm with a variety of I/O traces. We show that our predictive prefetching scheme combined with simple one-block-lookahead prefetching produces good performance for a variety of workloads. The scheme reduces file cache miss rates by up to 36% for workloads that receive no benefit from sequential prefetching. We showthat the memory requirements for building the probability tree are reasonable, requiring about a megabyte for good performance. The probability tree constructed by the prefetching scheme predicts around 60-70% of the accesses. Next, we discuss ways of improving the performance of the prefetching scheme. Finally, we show that the cost-benefit analysis enables the tree-based prefetching scheme to perform an optimal amount of prefetching.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123390638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Direct Numerical Simulation of Turbulence with a PC/Linux Cluster: Fact or Fiction?","authors":"G. Karamanos, C. Evangelinos, R. C. Boes, R. Kirby, G. Karniadakis","doi":"10.1145/331532.331585","DOIUrl":"https://doi.org/10.1145/331532.331585","url":null,"abstract":"Direct Numerical Simulation (DNS) of turbulence requires many CPU days and Gigabytes of memory. These requirements limit most DNS to using supercomputers, available at supercomputer centres. With the rapid development and low cost of PCs, PC clusters are evaluated as a viable low-cost option for scientific computing. Both low-end and high-end PC clusters, ranging from 2 to 128 processors, are compared to a range of existing supercomputers, such as the IBM SP nodes, Silicon Graphics Origin 2000, Fujitsu AP3000 and Cray T3E. The comparison concentrates on CPU and communication performance. At the kernel level, BLAS libraries are used for CPU performance evaluation. Regarding communication, the free implementations of MPICH and LAM are used on fast-ethernet-based systems and compared to myrinet-based and supercomputer networks. At the application level, serial and parallel simulations are performed on state of the art DNS, such as turbulent wake flows in stationary and moving computational domains.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126561750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization of MPI Collectives on Clusters of Large-Scale SMP’s","authors":"S. Sistare, Rolf vande Vaart, E. Loh","doi":"10.1145/331532.331555","DOIUrl":"https://doi.org/10.1145/331532.331555","url":null,"abstract":"Implementors of message-passing libraries have focused on optimizing point-to-point protocols and have largely ignored the performance of collective operations. In addition, algorithms for collectives have been tuned to run well on networks of uni-processor machines, ignoring the performance that may be gained on large-scale SMP’s in wide-spread use as compute nodes. This is unfortunate, because the high backplane bandwidths and shared-memory capabilities of large SMP’s are a perfect match for the requirements of collectives. We present new algorithms for MPI collective operations that take advantage of the capabilities of fat-node SMP’s and provide models that show the characteristics of the old and new algorithms. Using the SunTM MPI library, we present results on a 64-way StarfireTM SMP and a 4-node cluster of 8-way Sun EnterpriseTM 4000 nodes that show performance improvements ranging typically from 2x to 5x for the collectives we studied.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"796 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113999158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}