{"title":"Balancing interprocessor communication and computation on torus-connected multicomputers running compiler-parallelized code","authors":"M. Annaratone, R. Rühl","doi":"10.1109/SHPCC.1992.232672","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232672","url":null,"abstract":"The machine model considered in this paper is that of a distributed memory parallel processor (DMPP) with a two-dimensional torus topology. Within this framework, the authors study the relationship between the speedup delivered by compiler-parallelized code and the machine's interprocessor communication speed. It is shown that compiler-parallelized code often exhibits more interprocessor communication than manually parallelized code and that the performance of the former is therefore more sensitive to the machine's interprocessor communication speed. Because of this, a parallelizing compiler developed for a platform not explicitly designed to sustain the increased interprocessor communication will produce, in the general case, code that delivers disappointing speedups. Finally, the study provides the point of diminishing return for the interprocessor communication speed beyond which the DMPP designer should focus on improving other architectural parameters, such as the local memory-processor bandwidth.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134454500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a distributed memory implementation of Sisal","authors":"M. Haines, W. Bohm","doi":"10.1109/SHPCC.1992.232668","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232668","url":null,"abstract":"Sisal is a functional language for scientific applications implemented efficiently on shared memory, vector, and hierarchical memory multiprocessors. The current compiler assumes a flat, shared addressing space, and the runtime system is implemented using locks and shared queues. This paper describes a first implementation of Sisal on the nCUBE 2 distributed memory architecture. Most of the effort is focused on altering the runtime system for execution in a message passing environment and providing the Sisal compiler with a distributed shared memory. The authors give preliminary performance results and outline future work.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126851560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalability of data transport","authors":"H. Jordan","doi":"10.1109/SHPCC.1992.232695","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232695","url":null,"abstract":"Peak floating point rate is a very limited way to characterize high performance computer systems. A better method is to use the bandwidth and latency of data transport for the major components of a system. Bandwidth scales well with increasing system size, but latency does not. The demands placed by a program on data transport determine how well an architecture will execute it. The article discusses two program metrics which describe latency characteristics of programs and shows how they can help optimize program structure.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"166 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115329528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A test suite approach for Fortran90D compilers on MIMD distributed memory parallel computers","authors":"M.-Y. Wu, G. C. Fox","doi":"10.1109/SHPCC.1992.232667","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232667","url":null,"abstract":"Describes a test suite approach for a Fortran90D compiler, a source-to-source parallel compiler for distributed memory systems. Different from Fortran77 parallelizing compilers, a Fortran90D compiler does not parallelize sequential constructs. Only parallelism expressed by Fortran90D parallel constructs is exploited. The authors discuss compiler directives and the methodology of parallelizing Fortran programs. An introductory example of Gaussian elimination is used, among other programs in the test suite, to explain the compilation techniques.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116434245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A parallel scalable approach to short-range molecular dynamics on the CM-5","authors":"R. Giles, P. Tamayo","doi":"10.1109/SHPCC.1992.232636","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232636","url":null,"abstract":"Presents a scalable algorithm for short-range molecular dynamics which minimizes interprocessor communications at the expense of a modest computational redundancy. The method combines Verlet neighbor lists with coarse-grained cells. Each processing node is associated with a cubic volume of space and the particles it owns are those initially contained in the volume. Data structures for 'own' and 'visitor' particle coordinates are maintained in each node. Visitors are particles owned by one of the 26 neighboring cells but lying within an interaction range of a face. The Verlet neighbor list includes pointers to own-own and own-visitor interactions. To communicate, each of the 26 neighbor cells sends a corresponding block of particle coordinates using message-passing cells. The algorithm has the numerical properties of the standard serial Verlet method and is efficient for hundreds to thousands of particles per node, allowing the simulation of large systems with millions of particles. Preliminary results on the new CM-5 supercomputer are described.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116542907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication efficient global load balancing","authors":"D. Nicol","doi":"10.1109/SHPCC.1992.232629","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232629","url":null,"abstract":"Proposes a scalable parallel algorithm, called direct mapping, for balancing workload in a global, synchronous way. Direct mapping is particularly attractive for SIMD architectures, as it makes use of the scan operation. Unlike previously proposed scalable methods for the problem of interest, direct mapping transfers the minimum volume of workload necessary to achieve perfect load balance. This paper describes the algorithm, and studies its performance via simulation in comparison to previously proposed methods.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124590257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Debugging mapped parallel programs","authors":"J. May, F. Berman","doi":"10.1109/SHPCC.1992.232646","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232646","url":null,"abstract":"As more sophisticated tools for parallel programming become available, programmers will inevitably want to use them together. However, some parallel programming tools can interact with each other in ways that make them less useful. In particular, if a mapping tool is used to adapt a parallel program to run on relatively few processors, the information presented by a debugger may become difficult to interpret. The authors examine the problems that can arise when programmers use debuggers to interpret the patterns of message traffic in mapped parallel programs. They also suggest how to avoid these problems and make debugging tools more useful.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131813085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A global synchronization algorithm for the Intel iPSC/860","authors":"S. Seidel, M. Davis","doi":"10.1109/SHPCC.1992.232641","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232641","url":null,"abstract":"Precisely synchronizing the processors of a distributed memory multicomputer provides them with a common baseline from which time can be measured. This amounts to providing the processors with a global clock. This work investigates a global processor synchronization algorithm for the Intel iPSC/860. Previous work has shown that for certain communication problems, such as the one-to-all broadcast and the complete exchange, the most effective use of the iPSC/860 interconnection network is obtained only when communicating pairs of processors are suitably synchronized. For other communication problems, such as the shift operation, global processor synchronization ensures the most effective use of the communication network. This work presents an algorithm that synchronizes processors more closely than the synchronization primitive by Intel. This new synchronization algorithm is used as the basis of an efficient implementation of the shift operation.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124659027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PFP: a scalable parallel programming model","authors":"B. Corda, K. Warren","doi":"10.1109/SHPCC.1992.232653","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232653","url":null,"abstract":"The Parallel Fortran Preprocessor (PFP) is a programming model for multiple instruction multiple data (MIMD) parallel computers. It provides a simple paradigm consisting of data storage modifiers and parallel execution control statements. The model is lightweight and scalable in nature. The control constructs impose no implicit synchronizations, nor do they require off-processor memory references. The model is portable. It is implemented as a source-to-source translator which requires very little support from the back-end compiler. The implementation has an option to produce serial code which can then be compiled for serial execution.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125010878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using atomic data structures for parallel simulation","authors":"P. Barth","doi":"10.1109/SHPCC.1992.232691","DOIUrl":"https://doi.org/10.1109/SHPCC.1992.232691","url":null,"abstract":"Synchronizing access to shared data structures is a difficult problem for simulation programs. Frequently, synchronizing operations within and between simulation steps substantially curtails parallelism. The paper presents a general technique for performing this synchronization while sustaining parallelism. The technique combines fine-grained, exclusive locks with futures, a write-once data structure supporting producer-consumer parallelism. The combination allows multiple operations within a simulation step to run in parallel; further, successive simulation steps can overlap without compromising serializability or requiring roll-backs. The cumulative effect of these two sources of parallelism is dramatic: the example presented shows an almost 20-fold increase in parallelism over traditional synchronization mechanisms.","PeriodicalId":254515,"journal":{"name":"Proceedings Scalable High Performance Computing Conference SHPCC-92.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114738187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}