S. Sistare, Erica Dorenkamp, Nicholas J. Nevin, E. Loh
{"title":"MPI Support in the Prism™ Programming Environment","authors":"S. Sistare, Erica Dorenkamp, Nicholas J. Nevin, E. Loh","doi":"10.1109/SC.1999.10018","DOIUrl":"https://doi.org/10.1109/SC.1999.10018","url":null,"abstract":"The Prism™ multi-process debugger was designed from its inception to support scalable debugging paradigms, implemented on top of a scalable architecture. Its features provide a good base for developing MPI programs, and we have extended them in a number of ways to improve its support for MPI. We have added more visualization support, including the ability to specify the geometry of globally-distributed arrays, to visualize MPI’s internal message queues, and to visualize linked data structures. We have provided an integrated MPI performance analysis capability in the form of an event-capture and display tool. Lastly, we have further enhanced the Prism architecture to improve scalability and ease-of-use for MPI debugging.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129118062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhiqiang Wang, J. A. Lupo, A. McKenney, R. Pachter
{"title":"Large Scale Molecular Dynamics Simulations with Fast Multipole Implementations","authors":"Zhiqiang Wang, J. A. Lupo, A. McKenney, R. Pachter","doi":"10.1145/331532.331588","DOIUrl":"https://doi.org/10.1145/331532.331588","url":null,"abstract":"We present the performance of the fast molecular dynamics (FMD) code designed for efficient, object-oriented, and scalable large scale molecular simulations. FMD uses an implementation of the three-dimensional fast multipole method, FMM3D, developed in our group. The Fast Multipole Method offers an efficient way (order O(N)) to handle long range electrostatic interactions, thus enabling a more realistic molecular dynamics simulation of large molecular systems. The performance testing was carried out on IBM SP2, SGI Origin 2000, and CRAY T3E systems with the MPI message passing system. Two models, a random charged particle model of up to 100,000 charges with only non-bonded interactions, and a real molecular model of more than 35,000 atoms with full atomic interactions, are used for the order-N and parallel scalability testing. An application to a liquid crystalline material is also discussed.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129839952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nicholas Mitchell, L. Carter, J. Ferrante, D. Tullsen
{"title":"ILP versus TLP on SMT","authors":"Nicholas Mitchell, L. Carter, J. Ferrante, D. Tullsen","doi":"10.1145/331532.331569","DOIUrl":"https://doi.org/10.1145/331532.331569","url":null,"abstract":"By sharing processor resources among threads at a very fine granularity, a simultaneous multithreading processor (SMT) renders thread-level parallelism (TLP) and instruction-level parallelism (ILP) operationally equivalent. Under what circumstances are they performance equivalent? In this paper, we show that operational equivalence does not imply performance equivalence. Rather, for some codes they perform equally well, for others ILP outperforms TLP, and for yet others, the opposite is true. In this paper, we define the performance characteristics that divide codes into one of these three circumstances. We present evidence from three codes to support the factors involved in the model.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124357199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BIP-SMP : High Performance Message Passing over a Cluster of Commodity SMPs","authors":"P. Geoffray, L. Prylli, B. Tourancheau","doi":"10.1145/331532.331552","DOIUrl":"https://doi.org/10.1145/331532.331552","url":null,"abstract":"[Figure 1: The architecture of MPI-BIP.] Each MPI message is implemented with one or several messages of the underlying communication system (BIP in our case). The cost of MPI-BIP is approximately an overhead of 2 µs (mainly CPU) over BIP for the latency on our cluster. Thus, the latency of the non-SMP MPI-BIP is very good, 7 µs, and the bandwidth reaches 110 MB/s.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115300170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frederick C. Wong, R. Martin, Remzi H. Arpaci-Dusseau, D. Culler
{"title":"Architectural Requirements and Scalability of the NAS Parallel Benchmarks","authors":"Frederick C. Wong, R. Martin, Remzi H. Arpaci-Dusseau, D. Culler","doi":"10.1145/331532.331573","DOIUrl":"https://doi.org/10.1145/331532.331573","url":null,"abstract":"We present a study of the architectural requirements and scalability of the NAS Parallel Benchmarks. Through direct measurements and simulations, we identify the factors which affect the scalability of benchmark codes on two relevant and distinct platforms: a cluster of workstations and a ccNUMA SGI Origin 2000. We find that the benefit of increased global cache size is pronounced in certain applications and often offsets the communication cost. By constructing the working set profile of the benchmarks, we are able to visualize the improvement of computational efficiency under constant-problem-size scaling. We also find that, while the Origin MPI has better point-to-point performance, the cluster MPI layer is more scalable with communication load. However, communication performance within the applications is often much lower than what would be achieved by micro-benchmarks. We show that the communication protocols used by the MPI runtime library strongly influence the communication performance of applications, and that the benchmark codes have a wide spectrum of communication requirements.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127006378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Kurç, Chialin Chang, R. Ferreira, A. Sussman, J. Saltz
{"title":"Querying Very Large Multi-dimensional Datasets in ADR","authors":"T. Kurç, Chialin Chang, R. Ferreira, A. Sussman, J. Saltz","doi":"10.1145/331532.331544","DOIUrl":"https://doi.org/10.1145/331532.331544","url":null,"abstract":"Applications that make use of very large scientific datasets have become an increasingly important subset of scientific applications. In these applications, datasets are often multi-dimensional, i.e., data items are associated with points in a multi-dimensional attribute space, and access to data items is described by range queries. The basic processing involves mapping input data items to output data items, and some form of aggregation of all the input data items that project to each output data item. We have developed an infrastructure, called the Active Data Repository (ADR), that integrates storage, retrieval and processing of multi-dimensional datasets on distributed-memory parallel architectures with multiple disks attached to each node. In this paper we address efficient execution of range queries on distributed-memory parallel machines within the ADR framework. We present three potential strategies, and evaluate them under different application scenarios and machine configurations. We present experimental results on the scalability and performance of the strategies on a 128-node IBM SP.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123666256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Hashimoto, Hiroto Tomita, K. Inoue, Katsuhiko Metsugi, K. Murakami, Shinjiro Inabata, S. Yamada, N. Miyakawa, Hajime Takashima, K. Kitamura, Shigeru Obara, T. Amisaki, K. Tanabe, U. Nagashima
{"title":"MOE: A Special-Purpose Parallel Computer for High-Speed, Large-Scale Molecular Orbital Calculation","authors":"K. Hashimoto, Hiroto Tomita, K. Inoue, Katsuhiko Metsugi, K. Murakami, Shinjiro Inabata, S. Yamada, N. Miyakawa, Hajime Takashima, K. Kitamura, Shigeru Obara, T. Amisaki, K. Tanabe, U. Nagashima","doi":"10.1145/331532.331590","DOIUrl":"https://doi.org/10.1145/331532.331590","url":null,"abstract":"We are constructing a high-performance, special-purpose parallel machine for ab initio Molecular Orbital calculations, called MOE (Molecular Orbital calculation Engine). The sequential execution time is O(N^4), where N is the number of basis functions, and most of the time is spent on the calculation of electron repulsion integrals (ERIs). The calculation of ERIs exhibits O(N^4) parallelism, which MOE tries to exploit. This paper discusses the MOE architecture and examines important aspects of the architecture design required to calculate ERIs according to the \"Obara method\". We conclude that n-way parallelization is the most cost-effective, and hence we designed the MOE prototype system with a host computer and many processing nodes. Each processing node includes a 76-bit floating-point MULTIPLY-and-ADD unit and internal memory, etc., and performs ERI computations efficiently. We estimate that the prototype system with 100 processing nodes can calculate the energy of proteins in a few days.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126254480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architecture-Cognizant Divide and Conquer Algorithms","authors":"K. Gatlin, L. Carter","doi":"10.1145/331532.331557","DOIUrl":"https://doi.org/10.1145/331532.331557","url":null,"abstract":"Divide and conquer programs can achieve good performance on parallel computers and computers with deep memory hierarchies. We introduce architecture-cognizant divide and conquer algorithms, and explore how they can achieve even better performance. An architecture-cognizant algorithm has functionally-equivalent variants of the divide and/or combine functions, and a variant policy that specifies which variant to use at each level of recursion. An optimal variant policy is chosen for each target computer via experimentation. With h levels of recursion, an exhaustive search requires Θ(v^h) experiments (where v is the number of variants). We present a method based on dynamic programming that reduces this to Θ(v^c) (where c is typically a small constant) experiments for a class of architecture-cognizant programs. We verify our technique on two kernels (matrix multiply and 2-D Point Jacobi) using three architectures. Our technique improves performance by up to a factor of two, compared to architecture-oblivious divide and conquer implementations. Further, our dynamic programming approach succeeds in selecting the optimal variant policy.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134472922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Informed Prefetching of Collective Input/Output Requests","authors":"T. Madhyastha, Garth A. Gibson, C. Faloutsos","doi":"10.1145/331532.331545","DOIUrl":"https://doi.org/10.1145/331532.331545","url":null,"abstract":"Optimizing collective input/output (I/O) is important for improving throughput of parallel scientific applications. Current research suggests that a specialized collective application programming interface, coupled with system-level optimizations, is necessary to obtain good I/O performance. Unfortunately, collective interfaces require an application to disclose its entire access pattern to fully reorder I/O requests, and cannot flexibly utilize additional memory to improve performance. In this paper we propose and analyze a method of optimizing collective access patterns using informed prefetching that is capable of exploiting any amount of available memory to overlap I/O with computation. We compare this approach to disk-directed I/O, an efficient implementation of a collective I/O interface. Moreover, we prove that under certain conditions, a per-processor prefetch depth equal to the number of drives can guarantee sequential disk accesses for any collectively accessed file. In empirical studies, a prefetch horizon of one to two times the number of disks per processor is sufficient to match the performance of disk-directed I/O for sequentially allocated files. Finally, we develop accurate analytical models to predict the throughput of informed prefetching for collective reads as a function of the per-processor prefetch depth.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132582462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Y. Masumoto, T. Kagimoto, T. Yamagata, M. Yoshida, M. Fukuda, N. Hirose
{"title":"Simulated Circulation in the Indonesian Archipelago from a High Resolution Global Ocean General Circulation Model on the Numerical Wind Tunnel","authors":"Y. Masumoto, T. Kagimoto, T. Yamagata, M. Yoshida, M. Fukuda, N. Hirose","doi":"10.1145/331532.331567","DOIUrl":"https://doi.org/10.1145/331532.331567","url":null,"abstract":"To represent both the basin-scale circulation in the ocean and the local small-scale variations at the same time, a high resolution global ocean general circulation model (OGCM) is essential. Recent progress in parallel computing techniques, together with the increasing capability of the computers themselves, makes it possible to use such a high resolution OGCM for climate variability issues. We have developed and are using a global OGCM based on the Princeton Ocean Model, with 1/6 degree horizontal grid resolution. This model has been running on the \"Numerical Wind Tunnel\" at the National Aerospace Laboratory in Japan, using 64 PEs with 256 MB of memory per PE, and is successful in reproducing very realistic variations of the currents and tracer fields. In the present study, details of the model and some examples of those variations from a 16-year calculation with climatological forcing fields are introduced.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"514 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131642087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}