{"title":"Parallel Implementations of the Power System Transient Stability Problem on Clusters of Workstations","authors":"M. T. Bruggencate, S. Chalasani","doi":"10.1145/224170.224279","DOIUrl":"https://doi.org/10.1145/224170.224279","url":null,"abstract":"Power system transient stability analysis computes the response of the rapidly changing electrical components of a power system to a sequence of large disturbances followed by operations to protect the system against the disturbances. Transient stability analysis involves repeatedly solving large, very sparse, time varying non-linear systems over thousands of time steps. In this paper, we present parallel implementations of the transient stability problem in which we use direct methods to solve the linearized systems. One method uses factorization and forward and backward substitution to solve the linear systems. Another method, known as the W-Matrix method, uses factorization and partitioning to increase the amount of parallelism during the solution phase. The third method, the Repeated Substitution method, uses factorization and computations which can be done ahead of time to further increase the amount of parallelism during the solution phase. We discuss the performance of the different methods implemented on a loosely coupled, heterogeneous network of workstations (NOW) and the SP2 cluster of workstations.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114732747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Surveying Molecular Interactions with DOT","authors":"L. T. Eyck, J. Mandell, V. Roberts, M. Pique","doi":"10.1145/224170.224218","DOIUrl":"https://doi.org/10.1145/224170.224218","url":null,"abstract":"The purpose of the molecular interaction program DOT (Daughter of Turnip) is rapid computation of the electrostatic potential energy between two proteins or other charged molecules. DOT exhaustively tests all six degrees of freedom, rotational and translational, and produces a grid of approximate interaction energies and orientations. It is able to do this because the problem is cast as the convolution of the potential field of the first molecule and any rotated charge distribution of the second. The algorithm lends itself to both parallelization and vectorization, permitting huge increases in computational speed over other methods for obtaining the same information. For example, a complete mapping of interactions between plastocyanin and cytochrome c was done in eight minutes using 256 nodes of an Intel Paragon. DOT is expected to be particularly useful as a rapid screen to find configurations for more detailed study using exact energy models.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121304749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HPC Undergraduate Curriculum Development at SDSU Using SDSC Resources","authors":"Kris Stewart","doi":"10.1145/224170.224209","DOIUrl":"https://doi.org/10.1145/224170.224209","url":null,"abstract":"Results from the development and teaching of a senior-level undergraduate multidisciplinary course in high performance computing are presented. Having been taught four times, there are several \"Lesson Learned\" presented in this paper. Help from the technical staff at the San Diego Supercomputer Center and support from the National Science Foundation has been instrumental in the evolution of this course. The work of faculty at other universities has influenced the author's courses and is gratefully acknowledged. A subsequent sophomore level course was developed at SDSU and has become part of a voluntary, cooperative program, Undergraduate Computational Science and Engineering.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121967495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Approach Towards Automatic Data Distribution","authors":"Jordi Garcia, E. Ayguadé, Jesús Labarta","doi":"10.1145/224170.224500","DOIUrl":"https://doi.org/10.1145/224170.224500","url":null,"abstract":"Data distribution is one of the key aspects that a parallelizing compiler for a distributed memory architecture should consider, in order to get efficiency from the system. The cost of accessing local and remote data can be one or several orders of magnitude different, and this can dramatically affect performance. In this paper, we present a novel approach to automatically perform static data distribution. All the constraints related to parallelism and data movement are contained in a single data structure, the Communication-Parallelism Graph (CPG). The problem is solved using a linear 0-1 integer programming model and solver. In this paper we present the solution for one-dimensional array distributions, although its extension to multi-dimensional array distributions is also outlined. The solution is static in the sense that the layout of the arrays does not change during the execution of the program. We also show the feasibility of using this approach to solve the problem in terms of compilation time and quality of the solutions generated.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123203336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelizing the Phylogeny Problem","authors":"Je Jones, K. Yelick","doi":"10.1145/224170.224224","DOIUrl":"https://doi.org/10.1145/224170.224224","url":null,"abstract":"The problem of determining the evolutionary history of species in the form of phylogenetic trees is known as the phylogeny problem. We present a parallelization of the character compatibility method for solving the phylogeny problem. Abstractly, the algorithm searches through all subsets of characters, which may be traits like opposable thumbs or DNA sequence values, looking for a maximal consistent subset. The notion of consistency in this case is the existence of a particular kind of phylogenetic tree called a perfect phylogeny tree. The two challenges to achieving an efficient implementation are load balancing and efficient sharing of information to enable pruning. In both cases, there is a trade-off between communication overhead and the quality of the solution. For load balancing we use a distributed task queue, which has imperfect load information but avoids centralization bottlenecks. For sharing pruning information, we use a distributed trie, which also avoids centralization but maintains incomplete information. We evaluate several implementations of the trie, the best of which achieves speedups of 50 on a 64-processor CM-5.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123623797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microparallelism and High-Performance Protein Matching","authors":"B. Alpern, L. Carter, K. Gatlin","doi":"10.1145/224170.224222","DOIUrl":"https://doi.org/10.1145/224170.224222","url":null,"abstract":"The Smith-Waterman algorithm is a computationally-intensive string-matching operation that is fundamental to the analysis of proteins and genes. In this paper, we explore the use of some standard and novel techniques for improving its performance. We begin by tuning the algorithm using conventional techniques. These make modest performance improvements by providing efficient cache usage and inner-loop code. One novel technique uses the z-buffer operations of the Intel i860 architecture to perform 4 independent computations in parallel. This achieves a five-fold speedup over the optimized code (six-fold over the original). We also describe a related technique that could be used by processors that have 64-bit integer operations, but no z-buffer. Another new technique uses floating-point multiplies and adds in place of the standard algorithm's integer additions and maximum operations. This gains more than a three-fold speedup on the IBM POWER2 processor. This method doesn't give the identical answers as the original program, but experimental evidence shows that the inaccuracies are small and do not affect which strings are chosen as good matches by the algorithm.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"14 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120967891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet","authors":"S. Pakin, Mario Lauria, A. Chien","doi":"10.1109/SUPERC.1995.32","DOIUrl":"https://doi.org/10.1109/SUPERC.1995.32","url":null,"abstract":"In most computer systems, software overhead dominates the cost of messaging, reducing delivered performance, especially for short messages. Efficient software messaging layers are needed to deliver the hardware performance to the application level and to support tightly-coupled workstation clusters. Illinois Fast Messages (FM) 1.0 is a high speed messaging layer that delivers low latency and high bandwidth for short messages. For 128-byte packets, FM achieves bandwidths of 16.2MB/s and one-way latencies 32 µs on Myrinet-connected SPARCstations (user-level to user-level). For shorter packets, we have measured one-way latencies of 25 µs, and for larger packets, bandwidth as high as to 19.6MB/s — delivered bandwidth greater than OC-3. FM is also superior to the Myrinet API messaging layer, not just in terms of latency and usable bandwidth, but also in terms of the message half-power point (n_{frac{1}{2}}), which is two orders of magnitude smaller (54 vs. 4,409 bytes). We describe the FM messaging primitives and the critical design issues in building a low-latency messaging layers for workstation clusters. Several issues are critical: the division of labor between host and network coprocessor, management of the input/output (I/O) bus, and buffer management. To achieve high performance, messaging layers should assign as much functionality as possible to the host. If the network interface has DMA capability, the I/Obus should be used asymmetrically, with the host processor moving data to the network and exploiting DMA to move data to the host. Finally, buffer management should be extremely simple in the network coprocessor and match queue structures between the network coprocessor and host memory. Detailed measurements show how each of these features contribute to high performance.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126608000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gigabit I/O for Distributed-Memory Machines: Architecture and Applications","authors":"Michael Hemy, P. Steenkiste","doi":"10.1145/224170.224375","DOIUrl":"https://doi.org/10.1145/224170.224375","url":null,"abstract":"Distributed-memory systems have traditionally had great difficulty performing network I/O at rates proportional to their computational power. The problem is that the network interface has to support network I/O for a supercomputer, using computational and memory bandwidth resources similar to those of a workstation. As a result, the network interface becomes a bottleneck. We implemented an architecture for network I/O for the iWarp system with the following two key characteristics: first, application-specific tasks are off-loaded from the network interface to the distributed-memory system, and second, these tasks are performed in close cooperation with the application. The network interface has been used by several applications for over a year. In this paper we describe the network interface software that manages the communication between the iWarp distributed-memory system and the network interface, we validate the main features of our network interface architecture based on application experience, and we discuss how this architecture can be used by other distributed-memory systems.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121524154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation","authors":"Andrew Erlichson, B. A. Nayfeh, J. Singh, K. Olukotun","doi":"10.1145/224170.224397","DOIUrl":"https://doi.org/10.1145/224170.224397","url":null,"abstract":"Clustering processors together at a level of the memory hierarchy in shared address space multiprocessors appears to be an attractive technique from several standpoints: Resources are shared, packaging technologies are exploited, and processors within a cluster can share data more effectively. We investigate the performance benefits that can be obtained by clustering on a range of important scientific and engineering applications in moderate to large scale cache coherent machines with small degrees of clustering (up to one eighth of the total number of processors in a cluster). We find that except for applications with near neighbor communication topologies this degree of clustering is not very effective in reducing the inherent communication to computation ratios. Clustering is more useful in reducing the the number of remote capacity misses in unstructured applications, and can improve performance substantially when small first-level caches are clustered in these cases. This suggests that clustering at the first level cache might be useful in highly-integrated, relatively fine-grained environments. For less integrated machines such as current distributed shared memory multiprocessors, our results suggest that clustering at the first-level caches is not very useful in improving application performance; however our results also suggest that in an machine with long interprocessor communication latencies, clustering further away from the processor can provide performance benefits.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116765219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCIRun: A Scientific Programming Environment for Computational Steering","authors":"S. Parker, C. R. Johnson","doi":"10.1145/224170.224354","DOIUrl":"https://doi.org/10.1145/224170.224354","url":null,"abstract":"We present the design, implementation and application of SCIRun, a scientific programming environment that allows the interactive construction, debugging and steering of large scale scientific computations. Using this \"computational workbench,\" a scientist can design and modify simulations interactively via a dataflow programming model. SCIRun enables scientists to design and modify models and automatically change parameters and boundary conditions as well as the mesh discretization level needed for an accurate numerical solution. As opposed to the typical \"off-line\" simulation mode - in which the scientist manually sets input parameters, computes results, visualizes the results via a separate visualization package, then starts again at the beginning - SCIRun \"closes the loop\" and allows interactive steering of the design and computation phases of the simulation. To make the dataflow programming paradigm applicable to large scientific problems, we have identified ways to avoid the excessive memory use inherent in standard dataflow implementations, and have implemented fine-grained dataflow in order to further promote computational efficiency. In this paper, we describe applications of the SCIRun system to several problems in computational medicine. In addition, an we have included an interactive demo program in the form of an application of SCIRun system to a small electrostatic field problem.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"2018 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114906181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}