{"title":"On Improving the Performance of Tree Machines","authors":"Ajay K. Gupta, Hong Wang","doi":"10.1142/S0129053395000142","DOIUrl":"https://doi.org/10.1142/S0129053395000142","url":null,"abstract":"In this paper we introduce a class of trees, called generalized compressed trees. Generalized compressed trees can be derived from complete binary trees by performing certain ‘contraction’ operations. A generalized compressed tree CT of height h has approximately 25% fewer nodes than a complete binary tree T of height h. We show that these trees have smaller (up to a 74% reduction) 2-dimensional and 3-dimensional VLSI layouts than the complete binary trees. We also show that algorithms initially designed for T can be simulated by CT with at most a constant slow-down. In particular, algorithms having non-pipelined computation structure and originally designed for T can be simulated by CT with no slow-down.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123894904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Load Balancing: a Programmer's Approach or the Impact of Task-Length Parameters on the Load Balancing Performance of Parallel Programs","authors":"Y. Ben-Asher, A. Schuster, J. F. Sibeyn","doi":"10.1142/S0129053395000178","DOIUrl":"https://doi.org/10.1142/S0129053395000178","url":null,"abstract":"We consider the problem of dynamic load balancing in an n processor parallel system. The scheduling process of a parallel program is modeled by randomly throwing weighted balls into n holes. For a given program A, the ball weights (task lengths) are chosen according to a probability distribution , for which we know only some of the following parameters: the expectation μ, variance σ^2, maximum M and minimum m. From these parameters, we derive an upper bound for the number of tasks to be generated by A in order to achieve a load balancing ratio for which the run-time is optimal up to a factor (1+e)^2 for any 0<e≤0.5, with very high probability. Using the derived relations, the programmer may control the load-balancing of his program by tuning the global parameters of the generated tasks. This can be done regardless of the underlying scheduler used by the parallel machine. We also give experimental results of marine-life simulation in support of our claims.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131166560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extensions to Cycle Shrinking","authors":"A. Sethi, S. Biswas, A. Sanyal","doi":"10.1142/S0129053395000154","DOIUrl":"https://doi.org/10.1142/S0129053395000154","url":null,"abstract":"An important part of a parallelizing compiler is the restructuring phase, which extracts parallelism from a sequential program. We consider an important restructuring transformation called cycle shrinking [5], which partitions the iteration space of a loop so that the iterations within each group of the partition can be executed in parallel. The method in [5] mainly deals with dependences with constant distances. In this paper, we propose certain extensions to the cycle shrinking transformation. For dependences with constant distances, we present an algorithm which, under certain fairly general conditions, partitions the iteration space in a minimal number of groups. Under such conditions, our method is optimal while the previous methods are not. We have also proposed an algorithm to handle a large class of loops which have dependences with variable distances. This problem is considerably harder and has not been considered before in full generality.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128422085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task Distribution on a Butterfly Multiprocessor","authors":"I. Gottlieb, A. Herold","doi":"10.1142/S0129053395000026","DOIUrl":"https://doi.org/10.1142/S0129053395000026","url":null,"abstract":"We consider the practical performance of dynamic task distribution on a multiprocessor, where overloaded processors dispense tasks to be performed on idle ones which are free to execute them. We propose a topology and an algorithm for routing packets in a network from an arbitrary subset of processors S to an arbitrary subset T, where the exact target node within T for a particular task is unimportant and therefore not specified. The method presented achieves work distribution in O(10* log N) time, where N is the nodes (processors) number. It operates on a Duplex Butterfly, and requires O(log N) size buffers. The solution is dynamic, taking into consideration real time availability of processors, and deterministic. The mechanism includes throttling of the task generation rate. “Software synchronization” in asynchronous mode ensures the insensitivity of the algorithm to hardware propagation delays of signals in large networks.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"278 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125849815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear Algebra calculations on a Virtual Shared Memory Computer","authors":"P. Amestoy, I. Duff, M. Daydé, Pierre Morère","doi":"10.1142/S0129053395000038","DOIUrl":"https://doi.org/10.1142/S0129053395000038","url":null,"abstract":"We evaluate the impact of the memory hierarchy of virtual shared memory computers on the design of algorithms for linear algebra. On classical shared memory multiprocessor computers, block algorithms are used for efficiency. We study here the potential and the limitations of such approaches on globally addressable distributed memory computers. The BBN TC2000 belongs to this class of computers and will be used to illustrate our discussion. We describe the implementation of Level 3 BLAS and examine the performance of some of the LAPACK routines. The impact of the number of processors with respect to the choice of the variants of classical matrix factorizations (for example, KJI, JKI, JIK for the LU factorization) is discussed. We also study the factorization of sparse matrices based on a multifrontal approach. The ideas introduced for the parallelization of full linear algebra codes are applied to the sparse case. We discuss and illustrate the limitations of this approach in sparse multifrontal factorization. We show that the speed-ups obtained on the BBN TC2000 for the class of methods presented here are comparable to those obtained on more classical shared memory computers, such as the Alliant FX/80, the CRAY-2 and the IBM 3090/VF.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127738381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Restriction-Free Adaptive Wormhole Routing in Multicomputer Networks","authors":"Jai-Hoon Chung, H. Yoon, S. Maeng","doi":"10.1142/S0129053395000063","DOIUrl":"https://doi.org/10.1142/S0129053395000063","url":null,"abstract":"The adaptive routing approach has been expected as a promising way to improve network performance by utilizing available network bandwidth. Previous adaptive routing strategies in wormhole-routed multicomputer networks restrict the routing of messages by the routing algorithm to prevent deadlock. This results in low degree of adaptivity and low utilization of physical or virtual channels. In this paper, we examine the possibility of performing restriction-free adaptive routing in wormhole-routed networks as an approach to further improving the performance of these networks. A new flow control policy, called message cutting-in, is proposed, and two adaptive routing strategies are presented. Freedom of communication deadlock is achieved by the proposed flow control policy. The proposed adaptive routing strategies do not restrict routing and maximally utilize the physical and virtual channels. Simulation results show that the restriction-free adaptive routing approach is promising from the fact that it has the lowest latency and highest throughput depending on the number of virtual channels per physical channel and patterns of message traffic.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121657368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Parallel Algorithm for Matrix-Vector Multiplication","authors":"B. Hendrickson, R. Leland, S. Plimpton","doi":"10.1142/S0129053395000051","DOIUrl":"https://doi.org/10.1142/S0129053395000051","url":null,"abstract":"The multiplication of a vector by a matrix is the kernel operation in many algorithms used in scientific computation. A fast and efficient parallel algorithm for this calculation is therefore desirable. This paper describes a parallel matrix-vector multiplication algorithm which is particularly well suited to dense matrices or matrices with an irregular sparsity pattern. Such matrices can arise from discretizing partial differential equations on irregular grids or from problems exhibiting nearly random connectivity between data structures. The communication cost of the algorithm is independent of the matrix sparsity pattern and is shown to scale as for an n×n matrix on p processors. The algorithm’s performance is demonstrated by using it within the well known NAS conjugate gradient benchmark. This resulted in the fastest run times achieved to date on both the 1024 node nCUBE 2 and the 128 node Intel iPSC/860. Additional improvements to the algorithm which are possible when integrating it with the conjugate gradient algorithm are also discussed.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122637666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel and Pipelined Parallel Consecutive Sums on a Hypercube with Application to Ray Casting","authors":"Jianjian Song, R. Shu","doi":"10.1142/S0129053395000099","DOIUrl":"https://doi.org/10.1142/S0129053395000099","url":null,"abstract":"Communication penalty for parallel computation is related to message startup time and speed of data transmission between the host and processing elements (PEs). We propose two algorithms in this paper to show that the first factor can be alleviated by reducing the number of messages and the second by making the host-PE communication concurrent with computation on the PE array. The algorithms perform 2^n consecutive sums of 2^n numbers each on a hypercube of degree n. The first algorithm leaves one sum on each processor. It takes n steps to complete the sums and reduces the number of messages generated by a PE from 2^n to n. The second algorithm sends all the sums back to the host as the sums are generated one by one. It takes 2^n+n−1 steps to complete the sums in a pipeline so that one sum is completed every step after the initial (n−1) steps. We apply our second algorithm to the front-to-back composition for ray casting. For large number of rays, the efficiency and speedup of our algorithm are close to theoretically optimal values.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134014311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Tabu Search Approach to Task Scheduling on Heterogeneous Processors under Precedence Constraints","authors":"S. Porto, C. Ribeiro","doi":"10.1142/S012905339500004X","DOIUrl":"https://doi.org/10.1142/S012905339500004X","url":null,"abstract":"Parallel programs may be represented as a set of interrelated sequential tasks. When multiprocessors are used to execute such programs, the parallel portion of the application can be speeded up by an appropriate allocation of processors to the tasks of the application. Given a parallel application defined by a task precedence graph, the goal of task scheduling (or processor assignment) is thus the minimization of the makespan of the application. In a heterogeneous multiprocessor system, task scheduling consists of determining which tasks will be assigned to each processor, as well as the execution order of the tasks assigned to each processor. In this work, we apply the tabu search metaheuristic to the solution of the task scheduling problem on a heterogeneous multiprocessor environment under precedence constraints. The topology of the Mean Value Analysis solution package for product form queueing networks is used as the framework for performance evaluation. We show that tabu search obtains much better results, i.e., shorter completion times, improving from 20 to 30% the makespan obtained by the most appropriate algorithm previously published in the literature.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124732395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing Linda Tuplespace on a Distributed System","authors":"M. Feng, Yaoqing Gao, C. Yuen","doi":"10.1142/S0129053395000087","DOIUrl":"https://doi.org/10.1142/S0129053395000087","url":null,"abstract":"Linda, a general purpose coordination language, has been used to make a language parallel. Based on a logically shared tuplespace, Linda poses difficulties to be efficiently implemented on a distributed multiprocessor system. This paper reports our approach to solve the problem: processors are divided into groups, and each group has a group manager to provide a local view of the global tuplespace, and handles the tuplespace operations incurred by processors within the group. To maintain the consistency and correctness of the Linda tuplespace operations, we propose the algorithms of a group manager. We also implement the algorithms on a transputer-based multicomputer and show the experiment results.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130684099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}