{"title":"Optimal Total Exchange on an SIMD Distributed-Memory Hypercube","authors":"D. Delesalle, D. Trystram, D. Wenzek","doi":"10.1109/DMCC.1991.633143","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633143","url":null,"abstract":"This paper deals with optimality results on the implementation of fundamental communication schemes on a distributed-memory SIMD hypercubemultiprocessor (namely, global exchange and personalized global exchange with accumulation). Some experiments are given on a Connection Machine.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114093378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Implementation of the Radix Sorting Algorithm on the Touchstone Delta Prototype","authors":"Marc Baber","doi":"10.1109/DMCC.1991.633213","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633213","url":null,"abstract":"This implementation of the radix sorting algorithm considers the nodes of the multicomputer to be buckets for receiving keys that correspond with their node identijiers. Sorting a list of 30-bit keys requires six passes on a 32-node hypercube, because five bits are considered in each pass. When the number of buckets is equal to the number of processors, superlinear speedups are obtained because, in addition to assigning smaller subsets of the data to each node, the number of passes required decreases when more bits are considered in each pass. True speed ups close to linear are observed when the number of buckets is made independent of the number of processors by permitting multiple buckets per processor so that a small hypercube can emulate a larger hypercube’s ability to consider more bits during each pass through the daa. Experiments on an iPSCl860 and the Touchstone Delta Prototype system show that the algorithm is well suited to multicomputer architectures and that i t scales well for random distributions of keys. Introduction The radix sorting algorithm has a time complexity mO(n) for n keys, each m bits in length. This time complexity compares favorably with most of the popular O(n log n) algorithms and so, radix is often the method of choice. In the context of a parallel machine, this continues to be true, as long as the distribution of keys is nearly flat. On a multicomputer, the overhead associated with the straight radix sort [6] is that it requires more than one allto-all message exchange. The number of exchanges can be up to the number of bits in a single key on a two-node * Supported in part by: Defense Advanced Research Projects Agency Information Science and Technology Office Research in Concurrent Computing Systems ARPA Order No. 6402.6402-1; Program Code No. 8E20 & 9E20 Issued by DARPNCMO under Contract #MDA-972-89-C-0034 system with a single bucket per node. On the Touchstone Delta prototype system, using 5 12 or 29 processing nodes, this implementation of the straight radix sort processes 9 bits in each pass through the data, so a 32-bit integer is fully sorted in four passes and only four all-to-all message exchanges are required. The radix algorithm is sensitive to uneven distributions of keys. If the bit patterns of the keys deviate too far from a random, even distribution, then some node(s) will require disproportionate amounts of memory. Most distributions, in practice, are more random in the low order bits than the high order bits. Therefore, this implementation uses the straight radix sort [6] , or least signiticant digit [4] variation of the radix algorithm in order to postpone any load imbalances until the last pass through the data. A radix exchange sort, or most significant digit implementation of the radix algorithm would require only one all-to-all message exchange, followed by a local sort on each node, but the method could be more prone to performance degradation due to load imbalance. Related Work The problem o","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. 
Proceedings","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121223041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Z-Buffer on a Transputer-Based Machine","authors":"Jian-jin Li, S. Miguet","doi":"10.1109/DMCC.1991.633155","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633155","url":null,"abstract":"This paper describes the parallel implementation of the Z-Buffer algorithm on a distributed memory machine. The Z-Buffer is one of the most popular techniques used to generate a representation of a scene consisting of objects in a 3-dimensional world. We propose and compare two different parallel implementations on a network of Transputers. In the first approach, the description of the scene is distributed among the processors configured as a tree. The picture is processed in a pipelined fashion, in order to output parts of the image during the computation of the remainder. In a second approach, both the picture and the scene description are distributed to the processors. interconnected in a ring. We have therefore to redistribute dynamically the tiles among the processors at the beginning of the computation. We show thlat the two approaches are complementary : for small pictures or large scenes, a tree-based algorithm performs better than a ringbased algorithm, but for large pictures or small scenes, it is the other way round. We obtain substantial speedups over the sequential implementation, with up to 32 processors.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121436931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Many/370: A Parallel Computer Prototype For I/0 Intensive Applications","authors":"B. Aball, B.D. Gavril, R. Hadsell, L. Lam, B. Shimamoto","doi":"10.1109/DMCC.1991.633364","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633364","url":null,"abstract":"This article is an overview of Many/370, an IBM System/370 parallel processor prototype built JOT I/Ointensive app1ication.s. The prototype consists of 8 processor nodes, 128 small disk drives, and a host c omputer. The nodes h ave a high performance disk I/O capability which distinguishes Many/37O from other multiprocessors. The eight nodes and the host are interconnected By a non-blocking switch, and they corn.tion set. Each node has a disk adaptcr attach,ed to it. The disk adapter has 4 separate SCSI buses and it controls 16 disk d rives. The disk adapter performs the functions a Systetn/370 c hannel and a control unit. municate using e xtensions ol the System/37O I ’ r1 st TU c","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121848803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dataparallel C: A SIMD Programming Language for Multicomputers","authors":"P. Hatcher, M. J. Quinn, A. Lapadula, R. Anderson, R. R. Jones","doi":"10.1109/DMCC.1991.633095","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633095","url":null,"abstract":"Dataparallel C is a SIMD extensiotii to the standard C programming language, It is derived from the original C* language developed by Thinking Machine,r Corporation, We have completed a third-generation Dataparalle1 C compiler, which produces SPMD-style C code suitable for execution on Intel and nCUBE multicomputers. In this paper we discuss the characteristics and strengths of data-parallel programming languages, summarize the syntax and semantics of Dataparallel C', and document the perjbrmance of six benchmark programs executing on the nCUBE 3200 multicomputer. Our work demonstrates that SIMD programs can achieve reasonable speedup when compiled and executed on multicomputers.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127686174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Access based data decomposition fam distributed memory machines","authors":"J. Ramanujam, P. Sadayappan","doi":"10.1109/DMCC.1991.633122","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633122","url":null,"abstract":"This paper addresses the problem of partitioning data for distributed memory machines or multicomputers. If in-suucient attention is paid to the data allocation problem, then the amount of time spent in interprocessor communication might be so high as to seriously undermine the beneets of parallelism. It is therefore worthwhile for a compiler to analyze patterns of data usage to determine allocation, in order to minimize interprocessor communication. We present a matrix notation to describe array accesses in fully parallel loops which lets us derive suu-cient conditions for communication-free decomposition of arrays.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129314884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Matrix Multiplication on Hypercubes Using Full Bandwith and Constant Storage","authors":"Ching-Tien Ho, Lennart Johnsson, Alan Edelman","doi":"10.1109/DMCC.1991.633211","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633211","url":null,"abstract":"For matrix multiplicatioln on hypercube multiprocessors with the product matrix accumulated in place a processor must receive albout P2/n elements of each input operand, with opeicands of size P x P distributed evenly over N processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to P2/fllog1J for each input operand. We present a two-level partitioning of the matrices and an algolrithm for the matrix: multiplication with optimal data. motion and constant storage. The algorithm has sequential arithmetic complexity 2P3, and parallel arithmetic complexity 2P3/N. The algorithm has been implemented oin the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured iibout 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130131595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"When \"Grain Size\" Doesn't Matter","authors":"M. Carter, N. Nayar, J. Gustafson, D. Hoffman, D. Kouri, O. Sharafeddin","doi":"10.1109/DMCC.1991.633317","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633317","url":null,"abstract":"We describe insights gained from putting a quantum scattering problem on two very different parallel architectures: MasPar MP-I (massively parallel) and nCUBE 2 (moderately parallel). Our nearly trivial port from the SIMD MasPar to the MIMD nCUBE demonstrates that it is not categorically difficult to move software from one parallel architecture class to another. These machines show widely different processor and problem grain sizes. Their performance is strikingly similar on mal l problems, a fact not predicted by machine grain size, problem grain size, or peak speed comparisons. We introduce a new metric, fixed-time efficiency, that correlates very well with our experiments and has predictive value. Data and control decomposition and communication considerations are analyzed for each machine.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131155928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain Decomposition and Incomplete Factorisation Methods for Partial Differential Equations","authors":"C. Christara","doi":"10.1109/DMCC.1991.633166","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633166","url":null,"abstract":"In this paper we develop and study a method which tries to combine the merits of Domain Decompoxition (DD) and Incomplete Cholesky preconditioned Con,iugate Gradient method (ICCG) for the parallel solution of linear elliptic Partial Differential Equations (PDEs) on rectangular domains. We frst discretise the PDE problem, using Spline Collocation, a method of Finite Element type based on smooth splines. This gives rise to a sparse linear system of equations. The ICCG method provides us with a very effient, but not straightfarward parallelisable linear solver for such systems. On the (other hand, DD methods are very effective for elliptic PD.Es. A combination of DD and ICCG methods, in which the subdomain solves are carried out with ICCG, leads to eflcient and highly parallelisable solvers. We implement this hybrid DD-ICCG method on a hypercube, discuss its parallel eflciency, and show results from expieriments on configurations with up to 32 processors. We apply a totally local communication scheme and discuss its performance on the iPSCI2 hypercube. A similsrr approach can be used with other PDE discretisation methods.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125040106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Domain Decomposition to Solve Positive-Definite Systems on the Hypercube Computer","authors":"G.L. Hennigan, S. Castillo, E. Hensel","doi":"10.1109/DMCC.1991.633214","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633214","url":null,"abstract":"A distributed method of solving sparse, positive-definite systems of equations on a hypercube computer, like those arising fiom many finite-element problems, is studied. A domain decomposition method is introduced wherein the domain of the problem to be solved is physically split into several sub-domains. This physical split is based on an ordering known as one-way dissection [ I ] . The one-way dissection ordering generates a block-diagonal system of equations which is well suited to a parallel implementation. Once the ordering has been accomplished each of the subdomains is then distributed to a processor in the hypercube computer as necessary. The method is applied to two-dimensional electrostatic problems which are governed by Laplace’s equation. Since the finite-element method is used to discretize the problem the method is developed to take full advantage of the inherent sparsity. The algorithm is applied to several geometries.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115594556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}