{"title":"Packing/unpacking information generation for efficient generalized kr/spl rarr/r and r/spl rarr/kr array redistribution","authors":"Ching-Hsien Hsu, Yeh-Ching Chung, C. Dow","doi":"10.1109/FMPC.1999.750588","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750588","url":null,"abstract":"Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance tradeoff between the efficiency of new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient methods to generate the packing/unpacking information for BOLCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BOLCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution with arbitrary source/destination processor sets. The most significant improvement of this paper is that a processor does not need to construct the send/receive data sets for a redistribution. Based on the packing/unpacking information derived from kr/spl rarr/r and r/spl rarr/kr redistributions, a processor can pack/unpack array elements into (from) messages directly. To evaluate the performance of our methods, we have implemented our methods along with the PITFALLS method and the Prylli's method on an IBM SP2 parallel machine. The experimental results show that our algorithms outperform the PITFALLS method and the Prylli's method for all test samples.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114968924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient VLSI layouts of hypercubic networks","authors":"C. Yeh, Emmanouel Varvarigos, B. Parhami","doi":"10.1109/FMPC.1999.750589","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750589","url":null,"abstract":"In this paper we present efficient VLSI layouts of several hypercubic networks. We show that an N-node hypercube and an N-node cube-connected cycles (CCC) graph can be laid out in 4N/sup 2//9+o(N/sup 2/) and 4N/sup 2//(9 log/sub 2//sup 2/N)+o(N/sup 2//log/sup 2/ N) areas, respectively, both of which are optimal within a factor of 1.7~+o(1). We introduce the multilayer grid model, and present efficient layouts of hypercubes that use more than 2 layers of wires. We derive efficient layouts for butterfly networks, generalized hypercubes, hierarchical swapped networks, and indirect swapped networks, that are optimal within a factor of 1+o(1). We also present efficient layouts for folded hypercubes, reduced hypercubes, recursive hierarchical swapped networks, and enhanced-cubes, which are the best results reported for these networks thus far.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121159253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Java for numerically intensive computing: from flops to gigaflops","authors":"S. Midkiff, J. Moreira, M. Snir","doi":"10.1109/FMPC.1999.750607","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750607","url":null,"abstract":"Java is not thought of as being competitive with Fortran for numerical programming. In this paper, we discuss technologies that can and will deliver Fortran-like performance in Java. These techniques include new and existing compiler technologies, the exploitation of parallelism, and a collection of Java libraries for numerical computing. We also present experimental data to show the effectiveness of our approaches. In particular we achieve 1 Gflops with a linear algebra kernel on an RS/6000 SMP machine. Most of these techniques require no language changes; a few depend on extensions to Java currently under consideration.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128941574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for generating task parallel programs","authors":"U. Fissgus, T. Rauber, G. Runger","doi":"10.1109/FMPC.1999.750586","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750586","url":null,"abstract":"We consider the generation of mixed task and data parallel programs and discuss how a clear separation into a task and data parallel level can support the development of efficient programs. The program development starts with a specification of the maximum degree of task and data parallelism and proceeds by performing several derivation steps in which the degree of parallelism is adapted to a specific parallel machine. We show how the final message-passing programs are generated and how the interaction between the task and data parallel levels can be established. We demonstrate the usefulness of the approach by examples from numerical analysis which offer the potential of a mixed task and data parallel execution but for which it is not a priori clear, how this potential should be used for an implementation on a specific parallel machine.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117147946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A recursive PVM implementation of an image segmentation algorithm with performance results comparing the HIVE and the Cray T3E","authors":"J. Tilton","doi":"10.1109/FMPC.1999.750594","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750594","url":null,"abstract":"A recursive PVM (Parallel Virtual Machine) implementation of a high quality but computationally intensive image segmentation approach is described and the performance of the algorithm on the HIVE and on the Cray T3E is contrasted. The image segmentation algorithm, which is designed for the analysis of multispectral or hyperspectral remotely sensed imagery data, is a hybrid of region growing and spectral clustering that produces a hierarchical set of image segmentations based on detected natural convergence points. The HIVE is a Beowulf-class parallel computer consisting of 66 Pentium Pro PCs (64 slaves and 2 controllers) with 2 processors per PC (for 128 total slave processors) which was developed and assembled by the Applied Information Sciences Branch at NASA's Goddard Space Flight Center. The Cray T3E is a supercomputer with 512 available processors, which is installed at the NASA Center for Computational Science at NASA's Goddard Space Flight Center. Timing results on Landsat Multispectral Scanner data show that the algorithm runs approximately 1.5 times faster on the HIVE, even though the HIVE is some 86 times less costly than the Cray T3E.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124472412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A data-parallel algorithm for iterative tomographic image reconstruction","authors":"C. Johnson, A. Sofer","doi":"10.1109/FMPC.1999.750592","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750592","url":null,"abstract":"In the tomographic imaging problem images are reconstructed from a set of measured projections. Iterative reconstruction methods are computationally intensive alternatives to the more traditional Fourier-based methods. Despite their high cost, the popularity of these methods is increasing because of the advantages they pose. Although numerous iterative methods have been proposed over the years, all of these methods can be shown to have a similar computational structure. This paper presents a parallel algorithm that we originally developed for performing the expectation maximization algorithm in emission tomography. This algorithm is capable of exploiting the sparsity and symmetries of the model in a computationally efficient manner. Our parallelization scheme is based upon decomposition of the measurement-space vectors. We demonstrate that such a parallelization scheme is applicable to the vast majority of iterative reconstruction algorithms proposed to date.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125930368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing MM5 on NASA Goddard Space Flight Center computing systems: a performance study","authors":"J. Dorband, J. Kouatchou, J. Michalakes, U. Ranawake","doi":"10.1109/FMPC.1999.750601","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750601","url":null,"abstract":"We analyze and test the performance of the fifth-generation PSU/NCAR mesoscale model MM5 on parallel computers at NASA Goddard Space Flight Center. We show how MM5 code scales on the Cray J90, the Cray T3E and a cluster of PCs. More precisely, we are interested in finding the elapsed time, load balancing, speedup, number of floating point operations per second, and performance versus cost. Results obtained with two test problems show the efficiency of MM5 on the above computers especially with large size problems.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127667478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization of a parallel pseudospectral MHD code","authors":"A. Dubey, T. Clune","doi":"10.1109/FMPC.1999.750602","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750602","url":null,"abstract":"In this article we outline some techniques for optimizing spectral codes using multidimensional real-to-complex FFT's. We have successfully applied these techniques on a pseudospectral MHD code running on the CRAY T3E. The code uses half precision, and runs up to 2.5 times faster than the version that uses full precision CRAY SCILIB parallel FFT routines. The half precision version without these optimizations is slower does not scale very well, and cannot support more than 128 processors. The optimized code achieved a performance of 100 Gflops on 1024 nodes of a CRAY T3E-600 at NASA Goddard Space Flight Center.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"45 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132449469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Token space minimization by simulated annealing","authors":"Rafi Lohev, I. Gottlieb","doi":"10.1109/FMPC.1999.750604","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750604","url":null,"abstract":"We describe a heuristic solution for the minimum token space scheduling (MTSS) problem, based on simulated annealing. In MTSS, one schedules a set of tasks with precedence constraints, represented by a directed graph. The arcs in the graph represent data, or tokens, which the tasks must receive before they can be processed. MTSS seeks to minimize the maximum number of tokens extant at any time during execution, while minimizing completion time. We motivate MTSS with an application from computer architecture: maximizing the locality of data required for execution of a program by multiprocessors. Simulation results demonstrating the effectiveness of our method are presented.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114081511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Superconducting processors for HTMT: issues and challenges","authors":"K. B. Theobald, G. Gao, T. Sterling","doi":"10.1109/FMPC.1999.750608","DOIUrl":"https://doi.org/10.1109/FMPC.1999.750608","url":null,"abstract":"The Hybrid Technology Multi-Threading project is a long-term study of the feasibility of combining several emerging technologies to reach 1 petaFLOPS within ten years. HTMT will combine high-speed superconductor processors, semiconductor memories with built-in processors, high-speed optical interconnects, and high-density holographic storage. While there are major challenges in all aspects of this project, those in processor architecture are the focus of this paper. Fundamental differences between RSFQ circuits and conventional semiconductor circuits, including a radical jump in clock speed, make today's processor design approaches inappropriate for HTMT. Sequential instruction dispatching, even within the lowest programming unit (a strand), will lead to unacceptably high latencies, hence poor performance. We propose alternative processor designs which use fine-grain synchronizations between individual instructions in order to avoid these bottlenecks.","PeriodicalId":405655,"journal":{"name":"Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126545258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}