S. Attaway, E. Barragy, K. Brown, D. Gardner, B. Hendrickson, S. Plimpton, C. Vaughan. "Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer." ACM/IEEE SC 1997 Conference (SC'97), 1997. doi:10.1109/SC.1997.10054
Abstract: We describe our parallelization of PRONTO, Sandia's transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes on problems involving millions of finite elements. We can simulate models using more than ten million elements in a few tenths of a second per timestep.
{"title":"Parallel Simulation of Parallel File Systems and I/O Programs","authors":"R. Bagrodia, Stephen Docy, Andy Kahn","doi":"10.1145/509593.509640","DOIUrl":"https://doi.org/10.1145/509593.509640","url":null,"abstract":"Efficient I/O implementations can have a significant impact on the performance of parallel applications. This paper describes the design and implementation of PIOSIM, a parallel simulation library for MPI-IO programs. The simulator can be used to predict the performance of existing MPI-IO programs as a function of architectural characteristics, caching algorithms, and alternative implementations of collective I/O operations. We describe the simulator and presents the results of a number of performance studies to evaluate the impact of the preceding factors on a set of MPI-IO benchmarks, including programs from the NAS benchmark suite.","PeriodicalId":315276,"journal":{"name":"ACM/IEEE SC 1997 Conference (SC'97)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124465454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Takefusa, S. Matsuoka, Hirotaka Ogawa, H. Nakada, Hiromitsu Takagi, M. Sato, S. Sekiguchi, U. Nagashima. "Multi-client LAN/WAN Performance Analysis of Ninf: a High-Performance Global Computing System." ACM/IEEE SC 1997 Conference (SC'97), 1997. doi:10.1145/509593.509615
Abstract: The rapid increase in the speed and availability of networks of supercomputers is making high-performance global computing possible, in which computational and data resources in the network are collectively employed to solve large-scale problems. There have been several recent proposals for global computing systems, including our Ninf system. However, critical issues regarding system performance characteristics in global computing have been little investigated, especially under multi-client, multi-site WAN settings. To investigate the feasibility of Ninf and similar systems, we conducted benchmarks with different communication/computation characteristics on a variety of combinations of clients and servers, varying in performance, architecture, etc., under LAN, single-site WAN, and multi-site WAN conditions.
M. Tobis, C. Schafer, Ian T Foster, R. Jacob, John Anderson. "FOAM: Expanding the Horizons of Climate Modeling." ACM/IEEE SC 1997 Conference (SC'97), 1997. doi:10.1145/509593.509620
Abstract: We report here on a project that expands the applicability of dynamic climate modeling to very long time scales. The Fast Ocean Atmosphere Model (FOAM) is a coupled ocean-atmosphere model that incorporates physics of interest in understanding decade to century time scale variability. It addresses the high computational cost of this endeavor with a combination of improved ocean model formulation, low atmosphere resolution, and efficient coupling. It also uses message-passing parallel processing techniques, allowing for the use of cost-effective distributed memory platforms. The resulting model runs over 6000 times faster than real time with good fidelity and has yielded significant results.
{"title":"A Scalable Mark-Sweep Garbage Collector on Large-Scale Shared-Memory Machines","authors":"Toshio Endo, K. Taura, A. Yonezawa","doi":"10.1145/509593.509641","DOIUrl":"https://doi.org/10.1145/509593.509641","url":null,"abstract":"This work describes implementation of a mark-sweep garbage collector (GC) for shared-memory machines and reports its performance. It is a simple 'parallel' collector in which all processors cooperatively traverse objects in the global shared heap. The collector stops the application program during a collection. To achieve scalability, collector performs dynamic load balancing, which exchanges objects to be scanned between processors. However, we observed that the implementation detail affects the performance heavily. For example, large objects, which become a source of significant load imbalance are split into small pieces. With all careful implementation, we achieved 28-fold speed-up on 64 processors.","PeriodicalId":315276,"journal":{"name":"ACM/IEEE SC 1997 Conference (SC'97)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126830157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed High Performance Computation for Remote Sensing","authors":"K. Hawick, H. James","doi":"10.1145/509593.509633","DOIUrl":"https://doi.org/10.1145/509593.509633","url":null,"abstract":"We describe distributed and parallel algorithms for processing remotely sensed data such as geostationary satellite imagery. We have built a distributed data repository based around the client-server computing model across wide-area ATM networks, with embedded parallel and high performance processing modules. We focus on algorithms for classification, georectification, correlation and histogram analysis of the data. We consider characteristics of image data collected from the Japanese GMS5 geostationary meteorological satellite, and some analysis techniques we have applied to it. As well as providing a browsing interface to our data collection, our system provides processing and analysis services on-demand.","PeriodicalId":315276,"journal":{"name":"ACM/IEEE SC 1997 Conference (SC'97)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114279026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MultiMATLAB Integrating MATLAB with High Performance Parallel Computing","authors":"V. Menon, Anne E. Trefethen","doi":"10.1145/509593.509623","DOIUrl":"https://doi.org/10.1145/509593.509623","url":null,"abstract":"MultiMATLAB is an extension of the popular MATLAB environment to distributed memory multiprocessors. We present a MultiMATLAB architecture that provides performance on multiprocessors while maintaining the functionality and usability of MATLAB. This system will enable users to access high performance parallel routines from within MATLAB, to extend MATLAB with new parallel routines, and to use these routines to develop parallel applications with the MATLAB language. We discuss a general MultiMATLAB architecture and present two implementations built upon MPI. Preliminary results indicate that the MultiMATLAB system can offer the full performance of the underlying multiprocessor to the MATLAB environment.","PeriodicalId":315276,"journal":{"name":"ACM/IEEE SC 1997 Conference (SC'97)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121880632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cristina Hristea Seibert, D. Lenoski, John S. Keen. "Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks." ACM/IEEE SC 1997 Conference (SC'97), 1997. doi:10.1145/509593.509638
Abstract: Even with today's large caches, the increasing performance gap between processors and memory systems imposes a memory bottleneck for many important scientific and commercial applications. This bottleneck is intensified in shared-memory multiprocessors by contention and the effects of cache coherency. Under heavy memory contention, the memory latency may increase two or three times. Nonetheless, as more sophisticated techniques are used to hide latency and increase bandwidth, measuring memory performance has become increasingly difficult. Previous simple methods for measuring memory performance can overestimate uniprocessor memory latency and underestimate bandwidth by tens of percent. We introduce a micro-benchmark suite that measures memory hierarchy performance in light of both uniprocessor optimizations and the contention and coherence effects of multiprocessors. The benchmark suite has been used to improve the memory system performance of the SGI Origin multiprocessor.
P. Alpatov, Greg Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, Robert A. van de Geijn, Yuan-Jye J. Wu. "PLAPACK: Parallel Linear Algebra Package Design Overview." ACM/IEEE SC 1997 Conference (SC'97), 1997. doi:10.1145/509593.509622
Abstract: The Parallel Linear Algebra Package (PLAPACK) is a maturing fourth-generation linear algebra infrastructure that uses an application-centric view of vector and matrix distribution, Physically Based Matrix Distribution. It also uses an "MPI-like" programming interface that hides distribution and indexing details in opaque objects, provides a natural layering in the library, and provides a straightforward application interface. We give an overview of the design of PLAPACK.
{"title":"CLIP: A Checkpointing Tool for Message Passing Parallel Programs","authors":"Yuqun Chen, Kai Li, J. Plank","doi":"10.1145/509593.509626","DOIUrl":"https://doi.org/10.1145/509593.509626","url":null,"abstract":"Checkpointing is a useful technique for rollback recovery. We present CLIP, a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. Creating an actual tool for checkpointing a complex machine like the Paragon is not easy, because many issues arise that require careful design decisions to be made. We detail what these decisions are, and how they were made in CLIP. We present performance data when checkpointing several long-running parallel applications. These results show that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer with good performance.","PeriodicalId":315276,"journal":{"name":"ACM/IEEE SC 1997 Conference (SC'97)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127579756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}