Compiling and Optimizing for Decoupled Architectures
N. Topham, A. Rawsthorne, C. McLean, M. Mewissen, P. L. Bird
In Proceedings of the IEEE/ACM SC95 Conference, December 1995. DOI: 10.1145/224170.224301

Abstract: Decoupled architectures provide a key to the problem of sustained supercomputer performance through their ability to hide large memory latencies. When a program executes in a decoupled mode, the perceived memory latency at the processor is zero; effectively, the entire physical memory has an access time equivalent to that of the processor's register file, and latency is completely hidden. However, the asynchronous functional units within a decoupled architecture must occasionally synchronize, incurring a high penalty. The goal of compiling and optimizing for decoupled architectures is to partition the program between the asynchronous functional units in such a way that latencies are hidden but synchronization events are executed infrequently. This paper describes a model for decoupled compilation and explains the effectiveness of compilation for decoupled systems. A number of new compiler optimizations are introduced and evaluated quantitatively using the Perfect Club scientific benchmarks. We show that, with a suitable repertoire of optimizations, it is possible to hide large latencies most of the time for most of the programs in the Perfect Club.
{"title":"Message Passing Versus Distributed Shared Memory on Networks of Workstations","authors":"Honghui Lu, S. Dwarkadas, A. Cox, W. Zwaenepoel","doi":"10.1145/224170.224285","DOIUrl":"https://doi.org/10.1145/224170.224285","url":null,"abstract":"The message passing programs are executed with the Parallel Virtual Machine (PVM) library and the shared memory programs are executed using TreadMarks. The programs are Water and Barnes-Hut from the SPLASH benchmark suite; 3-D FFT, Integer Sort (IS) and Embarrassingly Parallel (EP) from the NAS benchmarks; ILINK, a widely used genetic linkage analysis program; and Successive Over-Relaxation (SOR), Traveling Salesman (TSP), and Quicksort (QSORT). Two different input data sets were used for Water (Water-288 and Water-1728), IS (IS-Small and IS-Large), and SOR (SOR-Zero and SOR-NonZero). Our execution environment is a set of eight HP735 workstations connected by a 100Mbits per second FDDI network. For Water-1728, EP, ILINK, SOR-Zero, and SOR-NonZero, the performance of TreadMarks is within 10%of PVM. For IS-Small, Water-288, Barnes-Hut, 3-D FFT, TSP, and QSORT, differences are on the order of 10%to 30%. Finally, for IS-Large, PVM performs two times better than TreadMarks. More messages and more data are sent in TreadMarks, explaining the performance differences. This extra communication is caused by 1) the separation of synchronization and data transfer, 2) extra messages to request updates for data by the invalidate protocol used in TreadMarks, 3) false sharing, and 4) diff accumulation for migratory data in TreadMarks.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116302175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Performance Evaluation of the Convex SPP-1000 Scalable Shared Memory Parallel Computer
T. Sterling, D. Savarese, P. MacNeice, K. Olson, C. Mobarry, B. Fryxell, P. Merkey
In Proceedings of the IEEE/ACM SC95 Conference, December 1995. DOI: 10.1145/224170.285573

Abstract: The Convex SPP-1000 is the first commercial implementation of a new generation of scalable shared memory parallel computers with full cache coherence. It employs a hierarchical structure of processing, communication, and memory name-space management resources to provide a scalable NUMA environment. Ensembles of eight HP PA-RISC 7100 microprocessors employ an internal crossbar switch and a directory-based cache coherence scheme to provide a tightly coupled SMP. Up to 16 processing ensembles are interconnected by a four-ring network incorporating a full hardware implementation of the SCI protocol, for a full system configuration of 128 processors. This paper presents the findings of a set of empirical studies using both synthetic test codes and full applications from the Earth and space sciences to characterize the performance properties of this new architecture. It is shown that the overheads and latencies of global primitive mechanisms, while low in absolute time, are significantly more costly than those of similar functions local to an individual processor ensemble.
{"title":"The Use of Cellular Automata in the Classroom","authors":"H. A. Lilly","doi":"10.1145/224170.224204","DOIUrl":"https://doi.org/10.1145/224170.224204","url":null,"abstract":"The paper explains what a cellular automaton is and why schools would want to integrate the study of cellular automata into their curricula. Examples are given and suggestions for sample exercises follow. Each example is given a title, a discipline to which it relates, a source from which the example or the motivation for the example was taken, and a recommended grade level--middle school or high school. Source code in Microsoft's FORTRAN PowerStation, Version 1.0 is available for all of the examples. Each of the programs show a visualization of a particular cellular automaton over time. A cellular automaton is a modeling tool that can be used in the classroom with either pencil and paper or on computers. Cellular automata can be important in motivating students, reaching students with certain learning styles, helping students develop modeling skills, and in the development of curricula for teaching certain computer technologies.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"01 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127449738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Astrophysical N-Body Simulations on the GRAPE-4 Special-Purpose Computer","authors":"J. Makino, M. Taiji","doi":"10.1145/224170.224400","DOIUrl":"https://doi.org/10.1145/224170.224400","url":null,"abstract":"We report on resent astrophysical N-body simulations performed on the GRAPE-4 (GRAvity PipE 4) system, a special-purpose computer for astrophysical N-body simulations. We first review the astrophysical motivation, the algorithm, the structure of the GRAPE system, and the actual performance. The GRAPE-4 system consists of 1692 pipeline processors. The peak speed of one pipeline processor is 523 Mflops and that of the total system is 884 Gflops. The performance obtained is 529 Gflops for the simulation of two massive black holes in the core of a galaxy with 700,000 stars.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"221 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121464069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multicast Virtual Topologies for Collective Communication in MPCs and ATM Clusters","authors":"Y. Huang, Chengchang Huang, P. McKinley","doi":"10.1145/224170.224188","DOIUrl":"https://doi.org/10.1145/224170.224188","url":null,"abstract":"This paper defines and describes the properties of a multicast virtual topology, the M-array and a resource-efficient variation, the REM-array. It is shown how several collective operations can be implemented efficiently using these virtual topologies, while maintaining low complexity. Because the methods are applicable to any parallel computing environment that supports multicast communication in hardware, they provide a framework for collective communication libraries that are portable and yet take advantage of such low-level hardware functionality. In particular, the paper describes the practical issues of using these methods in wormhole-routed massively parallel computers (MPCs) and in workstation clusters connected by Asynchronous Transfer Mode (ATM) networks. Performance results are given for both environments.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122772680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pittsburgh Supercomputing Center High School Initiative in Computational Science Report on Findings School Years: 1991-92, 1992-93, 1993-4","authors":"C. Porto","doi":"10.1145/224170.224200","DOIUrl":"https://doi.org/10.1145/224170.224200","url":null,"abstract":"The purpose of the Pittsburgh Supercomputing Center's High School Initiative was to motivate students to pursue careers in science, mathematics, engineering and computer science. The initiative generated excitement among teachers and their students by providing them with the opportunity to work on a project of their choosing using the world's fastest supercomputer — the same machine used by leading researchers working on today's most challenging scientific problems. The program gave teachers the means and support to institutionalize their computational science project into the curriculum so that the impact of the program would continue from year to year with each new class of students.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131301639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"I/O Limitations in Parallel Molecular Dynamics","authors":"T. Clark, L. R. Scott, S. Wlodek, J. McCammon","doi":"10.1145/224170.224220","DOIUrl":"https://doi.org/10.1145/224170.224220","url":null,"abstract":"We discuss data production rates and their impact on the performance of scientific applications using parallel computers. On one hand, too high rates of data production can be overwhelming, exceeding logistical capacities for transfer, storage and analysis. On the other hand, the rate limiting step in a computationally-based study should be the human-guided analysis, not the calculation. We present performance data for a biomolecular simulation of the enzyme, acetylcholinesterase, which uses the parallel molecular dynamics program EulerGROMOS. The actual production rates are compared against a typical time frame for results analysis where we show that the rate limiting step is the simulation, and that to overcome this will require improved output rates.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128988791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Level Algorithm For Partitioning Graphs","authors":"B. Hendrickson, R. Leland","doi":"10.1145/224170.224228","DOIUrl":"https://doi.org/10.1145/224170.224228","url":null,"abstract":"The graph partitioning problem is that of dividing the vertices of a graph into sets of specified sizes such that few edges cross between sets. This NP-complete problem arises in many important scientific and engineering problems. Prominent examples include the decomposition of data structures for parallel computation, the placement of circuit elements and the ordering of sparse matrix computations. We present a multilevel algorithm for graph partitioning in which the graph is approximated by a sequence of increasingly smaller graphs. The smallest graph is then partitioned using a spectral method, and this partition is propagated back through the hierarchy of graphs. A variant of the Kernighan-Lin algorithm is applied periodically to refine the partition. The entire algorithm can be implemented to execute in time proportional to the size of the original graph. Experiments indicate that, relative to other advanced methods, the multilevel algorithm produces high quality partitions at low cost.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123779319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Parallel Software Infrastructure for Structured Adaptive Mesh Methods","authors":"S. Kohn, S. Baden","doi":"10.1145/224170.224283","DOIUrl":"https://doi.org/10.1145/224170.224283","url":null,"abstract":"Structured adaptive mesh algorithms dynamically allocate computational resources to accurately resolve interesting portions of a numerical calculation. Such methods are difficult to implement and parallelize because they rely on dynamic, irregular data structures. We have developed an efficient, portable, parallel software infrastructure for adaptive mesh methods; our software provides computational scientists with high-level facilities that hide low-level details of parallelism and resource management. We have applied our software infrastructure to the solution of adaptive eigenvalue problems arising in materials design. We describe our software infrastructure and analyze its performance. We also present computational results which indicate that the uniformity restrictions imposed by a data parallel Fortran implementation of a structured adaptive mesh application would significantly impact performance.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128174520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}