{"title":"The Communication Machine","authors":"P. Swarztrauber","doi":"10.1142/S0129053304000207","DOIUrl":"https://doi.org/10.1142/S0129053304000207","url":null,"abstract":"The Communication Machine brings to the multicomputer what vectorization brought to the uniprocessor. It provides the same tools to speed communication that have traditionally been used to speed computation; namely, the capability to program optimal communication algorithms on an architecture that can, to the extent possible, replicate their performance in terms of wall-clock time. In addition to the usual complement of logic and arithmetic units, each module contains a programmable communication unit that orchestrates traffic between the network and registers that communicate directly with comparable registers in neighboring modules. Communication tasks are performed out of these registers like computational tasks on a vector uniprocessor. The architecture is balanced in the sense that, on average, the speed of local and global memory is comparable. Theoretical performance is tabulated for both hypercube and mesh interconnection networks. The Communication Machine returns to the somewhat beleaguered, yet intuitive concept that the performance we ultimately seek must come from a truly massive number of processors.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131060426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Time-Parallel Computation of Pseudo-Adjoints for a Leapfrog Scheme","authors":"C. Bischof","doi":"10.1142/S0129053304000219","DOIUrl":"https://doi.org/10.1142/S0129053304000219","url":null,"abstract":"The leapfrog scheme is a commonly used second-order difference scheme for solving differential equations. If Z(t) denotes the state of a system at a particular time step t, the leapfrog scheme computes the state at the next time step as Z(t+1)=H(Z(t),Z(t-1),W), where H is the nonlinear time-stepping operator and W represents parameters that are not time-dependent. In this note, we show how the associativity of the chain rule of differential calculus can be used to compute a so-called adjoint, the derivative of a scalar-valued function applied to the final state Z(T) with respect to some chosen parameters, efficiently in a parallel fashion. To this end, we (1) employ the reverse mode of automatic differentiation at the outermost level, (2) use a sparsity-exploiting version of the forward mode of automatic differentiation to compute derivatives of H at every time step, and (3) exploit chain rule associativity to compute derivatives at individual time steps in parallel. We report on experimental results with a 2-D shallow water equations model problem on an IBM SP parallel computer and a network of Sun SPARCstations.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121267559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparisons of the Parallel Preconditioners for Large Nonsymmetric Sparse Linear Systems on a Parallel Computer","authors":"Sangback Ma","doi":"10.1142/S0129053304000232","DOIUrl":"https://doi.org/10.1142/S0129053304000232","url":null,"abstract":"In this paper we compare various parallel preconditioners for solving large sparse nonsymmetric linear systems. They are Block Jacobi, Point-SSOR, ILU(0) in the wavefront order, ILU(0) in the multi-color order, SPAI(SParse Approximate Inverse), and Multi-Color Block SOR. The Block Jacobi and Point-SSOR are well-known, and ILU(0) is one of the most popular preconditioners, but it is inherently serial. ILU(0) in the wavefront order maximizes the parallelism, and ILU(0) in the multi-color order achieves the parallelism of order (N), where N is the order of the matrix. The SPAI tries to capture the approximate inverse in sparse form, which, then, is expected to be a scalable preconditioner. Finally, we implemented the Multi-Color Block SOR preconditioner combined with direct sparse matrix solver. For the Laplacian matrix the SOR method is known to have a non-deteriorating rate of convergence when used with Multi-Color ordering. Since most of the time is spent on the diagonal inversion, which is done on each processor, we expect it to be a good scalable preconditioner. Finally, due to the blocking effect, it will be effective for ill-conditioned problems. Experiments were conducted for the Finite Difference discretizations of two problems with various meshsizes varying up to 1024×1024, and for an ill-conditioned matrix from the shell problem from the Harwell–Boeing collection. CRAY-T3E with 128 nodes was used. MPI library was used for interprocess communications. The results show that Multi-Color Block SOR and ILU(0) with Multi-Color ordering give the best performances for the finite difference matrices and for the shell problem only the Multi-Color Block SOR and Block Jacobi converges. Based on this we recommend that the Multi-Color Block SOR is the most robust preconditioner out of the preconditioners considered.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124822979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A GA Based Multiple Task Allocation Considering Load","authors":"A. Tripathi, B. K. Sarker, Naveen Kumar, D. P. Vidyarthi","doi":"10.1142/S0129053300000187","DOIUrl":"https://doi.org/10.1142/S0129053300000187","url":null,"abstract":"A Distributed Computing System (DCS) comprising networked heterogeneous processors requires ecient tasks to processor allocation to achieve minimum turnaround time and highest possible throughput. Task allocation in DCS remains an important and relevant problem attracting the attention of researchers in the discipline. A good number of task allocation algorithms have been proposed in the literature [3{9]. This algorithm considered allocation of the modules of a single task to various processing nodes and aim to minimize the turnaround time of the given task. But they did not consider execution of modules belonging to various dierent tasks (i.e. multiple tasks). In this work we have considered the number of modules that can be accepted by individual processing nodes along with their memory capacities and arrival of multiple disjoint tasks to the DCS from time to time. In this paper, a method based on genetic algorithm is developed which is memory ecient and give an optimal solution of the problem. The given simulation results also show signicant achievement in this regard.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128034512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enchanced Linked-Based Cache Coherence Protocols with a Hardware Mechanism to Reduce the Migratory Sharing Overhead","authors":"Der-Lin Pean, Cheng Chen","doi":"10.1142/S0129053300000163","DOIUrl":"https://doi.org/10.1142/S0129053300000163","url":null,"abstract":"The linked-based cache coherence protocols, such as the IEEE Scalable Coherence Interface (SCI), have been widely implemented in current highly scalable multiprocessor systems. Thus, we propose several enhanced linked-based cache coherence protocols in multiprocessor systems to evaluate their performance. However, migratory sharing data references in the linked-based systems still incur many cache misses that can be reduced by merging the invalidation/update requests and the cache misses. Research has been devoted to optimizing the migratory sharing references for the centralized directory coherence protocols, but their mechanisms cannot support the linked-based cache coherence protocols. This paper presents enhanced SCI protocols with an effective hardware technique to reduce the overhead of migratory sharing references for the linked-based cache coherence protocols. It reduces cost by eliminating some of the unnecessary supporting mechanisms in centralized directory protocols. The simulation results in SPLASH benchmarks show that our hardware methods enhanced the system performance by up to an average of 10%, by reducing the overhead of the migratory sharing references. The extra benefit of our mechanism is the elimination of the false sharing overhead by degrading a block to shared mode again.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"32 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125709723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Adaptive Fault-Tolerant Wormhole Routing Algorithm for Hypercubes","authors":"Jau-Der Shih","doi":"10.1142/S012905330000014X","DOIUrl":"https://doi.org/10.1142/S012905330000014X","url":null,"abstract":"In this paper, we present an adaptive fault-tolerant wormhole routing algorithm for hypercubes by using 4 virtual networks. Each node is identified to be in one of the four states: safe, ordinarily unsafe, strongly unsafe, and faulty. Based on the concept of unsafe nodes, we design a routing algorithm for hypercubes that can tolerate at least n-1 faulty nodes and can route a message via a path of length no more than the Hamming distance between the source and destination plus four. Previous algorithms which achieve the same fault tolerant ability need at least 5 virtual channels per physical channel. Simulation results show that our algorithm outperforms previous known results.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134482811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computation Time and Idle Time of Tiling Transformation on a Network of Workstations","authors":"S. Sathe, P. Nawghare","doi":"10.1142/S0129053300000126","DOIUrl":"https://doi.org/10.1142/S0129053300000126","url":null,"abstract":"Tiling is a technique for extraction of parallelism which groups iterations of a nest of \"for\" loops into blocks called tiles which can be scheduled for execution on the workstations connected by a network. Extraction of parallelism will be maximum when the workstations are busy in computation most of the time. Hence idle time of tiling is a very important parameter. In this paper we have presented results on the study of tiling transformation with respect to computation time and idle time. In our study we have considered tiles of rectangular shape and of size n1×n2. The iteration space can, however, be rectangular or parallelogram shaped and of size N1×N2. The results presented in this paper can be used for tiling of iteration spaces such that idle time is minimum and can be easily integrated in a parallelising compiler. Modelling communication between workstations is important for tiling transformation. We have developed a new improved model for modelling communication between workstations.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124516940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Embedding Hamiltonian Cycles, Linear Arrays and Rings in a Faulty Supercube","authors":"Jen-Chih Lin","doi":"10.1142/S0129053300000151","DOIUrl":"https://doi.org/10.1142/S0129053300000151","url":null,"abstract":"We consider the problem of finding Hamiltonian cycles, linear arrays and rings of a faulty supercube, if any. The proof of the existence of Hamiltonian cycles in hypercubes is easy due to the fact they are symmetric graphs. Since the supercube is asymmetric, the proof of the existence of Hamiltonian cycles is not trivial. We show that for any supercube SN, where N is the number of nodes in the supercube, there exists a Hamiltonian cycle. This implies that for any r such that 3≤r≤N, there exists a cycle of r nodes in a supercube. There are embedding algorithms proposed in this paper. The embedding algorithms show a ring with any number of nodes which can be embedded in a faulty supercube with load 1, congestion 1 and dilation 4 such that O(n2-(⌊log2 m⌋)2) faults can be tolerated, where n is the dimension of the supercube and m is the number of nodes of the ring.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128703258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the Execution Efficiency of Barrier Synchronization in Software DSM through Static Analysis","authors":"Jae Bum Lee, C. Jhon","doi":"10.1142/S0129053300000138","DOIUrl":"https://doi.org/10.1142/S0129053300000138","url":null,"abstract":"In software Distributed Shared Memory (SDSM) systems, the large coherence granularity imposed by virtual memory page size tends to induce false sharing, which may lead to heavy network traffic or useless page misses on barrier operations. In this paper, we propose a method to alleviate the coherence overhead of barrier synchronization in the SDSM systems. It performs static analysis on a shared-memory program to examine data dependency between processors across global barriers, and then special primitives are inserted into the program in order to exploit the dependency information at run time. If the data modified before a barrier will be accessed by some of the other processors after the barrier, coherence messages are transferred only to the processors through the inserted primitives. Furthermore, if the modified data will not be used by any other processors, the primitives enforce the coherence messages to be delivered only to master process after the parallel execution of the program completes. We implemented the static analysis with SUIF parallelizing compiler and then evaluated the execution performance of modified programs in a 16-node SDSM system supporting AURC protocol. The experimental results show that our method is very effective at reducing the useless coherence messages, and also can improve the execution time substantially by reducing false sharing misses.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130920620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"K-Means-Type Algorithms on Distributed Memory Computer","authors":"M. Ng","doi":"10.1142/S0129053300000096","DOIUrl":"https://doi.org/10.1142/S0129053300000096","url":null,"abstract":"Partitioning a set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means-type algorithm is best suited for implementing this operation because of its efficiency in clustering large numerical and categorical data sets. An efficient parallel k-means-type algorithm for clustering data sets on a distributed share-nothing parallel system is considered. It has a simple communication scheme which performs only one round of information exchange in every iteration. We show that the speedup of our algorithm is asymptotically linear when the number of objects is sufficiently large. We implement the parallel k-means-type algorithm on an IBM SP2 parallel machine. The performance studies show that the algorithm has nice parallelism in experiments.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130045975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}