{"title":"An Approximate Agreement Algorithm for Wraparound Meshes","authors":"R. Cheng, C. Chung","doi":"10.1142/S0129053395000221","DOIUrl":"https://doi.org/10.1142/S0129053395000221","url":null,"abstract":"An appropriate algorithm, the neighboring exchange, for reaching an approximate agreement in a wraparound mesh is proposed. The algorithm is characterized by its isotropic nature, which is of particular usefulness when applied in any symmetric system. The behavior of this algorithm can be depicted by recurrence relations which can be used to derive the convergence rate. The convergence rate is meaningful when the algorithm is used to synchnize clocks. The rate of synchronizing clocks is derived, and it can be applied to all wraparound meshes with practical scale. With the recurrence relations, we also prove the correctness of this algorithm.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117023150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Matrix Multiplication Algorithms on Hypercube Multiprocessors","authors":"Peizong Lee","doi":"10.1142/S012905339500021X","DOIUrl":"https://doi.org/10.1142/S012905339500021X","url":null,"abstract":"In this paper, we present three parallel algorithms for matrix multiplication. The first one, which employs pipelining techniques on a mesh grid, uses only one copy of data matrices. The second one uses multiple copies of data matrices also on a mesh grid. Although data communication operations of the second algorithm are reduced, the requirement of local data memory for each processing element increases. The third one, which uses a cubic grid, shows the trade-offs between reducing the computation time and reducing the communication overhead. Performance models and feasibilities of these three algorithms are studied. We analyze the interplay among the numbers of processing elements, the communication overhead, and the requirements of local memory in each processing element. We also present experimental results of these three algorithms on a 32-node nCUBE-2 computer.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130131670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multithreaded Decoupled Architecture","authors":"M. Dorojevets, V. Oklobdzija","doi":"10.1142/S0129053395000257","DOIUrl":"https://doi.org/10.1142/S0129053395000257","url":null,"abstract":"A new computer architecture called the Multithreaded Decoupled Architecture has been proposed for exploiting fine-grain parallelism. It develops further some of the ideas of parallel processing implemented in the Russian MARS-M computer in the 1980s. The MTD architecture aims at enhancing both total machine throughput and a single thread performance. To achieve this goal, we propose a two-level parallel computation model. Its low level defines the decoupled parallel execution of instructions within program fragments not containing branches. We will be referring to these fragments as basic blocks. The model’s high level defines the parallel execution of multiple basic blocks representing a function or procedure. This scheduling hierarchy reflects the MTD storage hierarchy. Together the scheduling and storage models allow a processor with multiple execution units to exploit several forms of parallelism within a procedure. The compiler provides the hardware with thread register usage masks to allow run-time enforcing of control and data dependencies between the high level threads. We present a possible implementation of the MTD-processor with multiple execution units and two-level distributed register memory.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116828787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Optimal Weighted Binary Trees","authors":"J. Pradhan, C. V. Sastry","doi":"10.1142/S0129053395000245","DOIUrl":"https://doi.org/10.1142/S0129053395000245","url":null,"abstract":"A new recursive top-down algorithm for the construction of a unique Huffman tree is introduced. We show that the prefix codes generated from the Huffman tree are unique and the weighted path length is optimal. Initially we have not imposed any restriction on the maximum length (the number of bits) a prefix code can take. But if buffering of the source is required, we have to put a restriction on the length of the prefix code. In this context we extend the top-down recursive algorithm for generating length-limited prefix codes.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123319503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking Fortran Intrinsic Functions","authors":"Toru Nagai","doi":"10.1142/S0129053395000129","DOIUrl":"https://doi.org/10.1142/S0129053395000129","url":null,"abstract":"High performance of mathematical functions is essential to speed up scientific calculations because they are very frequently used in scientific computing. This paper presents performance of important Fortran intrinsic functions on the fastest vector supercomputers. It is assumed that a relationship between CPU-time and the number of function arguments given to calculate function values is linear, and speeds of a function were measured using the parameters and . The author also examines how the speed of the function varies with respect to the selection of arguments. The computers tested in the present paper are Cray C9016E/16256– 4, Fujitsu VP2600/10, Hitachi S-3800/480 and NEC SX-3/14R.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123460574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Block Preconditioned Conjugate Gradient Methods on a Distributed Virtual Shared Memory Multiprocessor","authors":"L. Giraud","doi":"10.1142/S0129053395000105","DOIUrl":"https://doi.org/10.1142/S0129053395000105","url":null,"abstract":"We study both shared and distributed approaches for the parallel implementation of the SSOR and Jacobi block preconditioned Krylov methods on a distributed virtual shared memory computer: a BBN TC2000. We consider the solution of block tridiagonal systems arising from the discretization of 3D partial differential equations, which diagonal blocks correspond to the discretization of 2D partial differential equations. The solution of the diagonal subproblems required for the preconditionings are performed using a domain decomposition method with overlapped subdomains: a variant of the Schwarz alternating method.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124538696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A General-Purpose Parallel Sorting Algorithm","authors":"A. Tridgell, R. Brent","doi":"10.1142/S0129053395000166","DOIUrl":"https://doi.org/10.1142/S0129053395000166","url":null,"abstract":"A parallel sorting algorithm is presented for general purpose internal sorting on MIMD machines. The algorithm initially sorts the elements within each node using a serial sorting algorithm, then proceeds with a two-phase parallel merge. The algorithm is comparison-based and requires additional storage of order the square root of the number of elements in each node. Performance of the algorithm on the Fujitsu AP1000 MIMD supercomputer is discussed.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"430 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123573903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Factorized Sparse Approximate Inverse Preconditioning II: Solution of 3D FE Systems on Massively Parallel Computers","authors":"L. Kolotilina, A. Yeremin","doi":"10.1142/S0129053395000117","DOIUrl":"https://doi.org/10.1142/S0129053395000117","url":null,"abstract":"An iterative method for solving large linear systems with sparse symmetric positive definite matrices on massively parallel computers is suggested. The method is based on the Factorized Sparse Approximate Inverse (FSAI) preconditioning of ‘parallel’ CG iterations. Efficiency of a concurrent implementation of the FSAI-CG iterations is analyzed for a model hypercube, and an estimate of the optimal hypercube dimension is derived. For finite element applications, two strategies for selecting the preconditioner sparsity pattern are suggested. A high convergence rate of the resulting iterations is demonstrated numerically for the 3D equilibrium equations for linear elastic orthotropic materials approximated using both h- and p-versions of the FEM.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125318516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extensions to Cycle Shrinking","authors":"A. Sethi, S. Biswas, A. Sanyal","doi":"10.1142/S0129053395000154","DOIUrl":"https://doi.org/10.1142/S0129053395000154","url":null,"abstract":"An important part of a parallelizing compiler is the restructuring phase, which extracts parallelism from a sequential program. We consider an important restructuring transformation called cycle shrinking [5], which partitions the iteration space of a loop so that the iterations within each group of the partition can be executed in parallel. The method in [5] mainly deals with dependences with constant distances. In this paper, we propose certain extensions to the cycle shrinking transformation. For dependences with constant distances, we present an algorithm which, under certain fairly general conditions, partitions the iteration space in a minimal number of groups. Under such conditions, our method is optimal while the previous methods are not. We have also proposed an algorithm to handle a large class of loops which have dependences with variable distances. This problem is considerably harder and has not been considered before in full generality.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128422085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task Distribution on a Butterfly Multiprocessor","authors":"I. Gottlieb, A. Herold","doi":"10.1142/S0129053395000026","DOIUrl":"https://doi.org/10.1142/S0129053395000026","url":null,"abstract":"We consider the practical performance of dynamic task distribution on a multiprocessor, where overloaded processors dispense tasks to be performed on idle ones which are free to execute them. We propose a topology and an algorithm for routing packets in a network from an arbitrary subset of processors S to an arbitrary subset T, where the exact target node within T for a particular task is unimportant and therefore not specified. The method presented achieves work distribution in O(10* log N) time, where N is the nodes (processors) number. It operates on a Duplex Butterfly, and requires O(log N) size buffers. The solution is dynamic, taking into consideration real time availability of processors, and deterministic. The mechanism includes throttling of the task generation rate. “Software synchronization” in asynchronous mode ensures the insensitivity of the algorithm to hardware propagation delays of signals in large networks.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"278 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125849815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}