{"title":"Three-Dimensional Monte Carlo Device Simulation with Parallel Multigrid Solver","authors":"Can K. Sandalci, Ç. Coç, S. Goodnick","doi":"10.1142/S0129053397000143","DOIUrl":"https://doi.org/10.1142/S0129053397000143","url":null,"abstract":"We present the results in embedding a multigrid solver for Poisson's equation into the parallel 3D Monte Carlo device simulator, PMC-3D. First we have implemented the sequential multigrid solver, and embedded it into the Monte Carlo code which previously was using the sequential successive overrelaxation (SOR) solver. Depending on the convergence threshold, we have obtained significant speedups ranging from 5 to 15 on a single HP 712/80 workstation. We have also implemented the parallel multigrid solver by extending the partitioning algorithm and the interprocessor communication routines of the SOR solver in order to service multiple grids. The Monte Carlo code with the parallel multigrid Poisson solver is 3 to 9 times faster than the Monte Carlo code with the parallel SOR code, based on timing results on a 32-node nCUBE multiprocessor.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132175164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rationale and Strategy for a 21st Century Scientific Computing Architecture: the Case for Using Commercial Symmetric Multiprocessors as Supercomputers","authors":"W. Johnston","doi":"10.1142/S0129053397000131","DOIUrl":"https://doi.org/10.1142/S0129053397000131","url":null,"abstract":"In this paper we argue that the next generation of supercomputers will be based on tight-knit clusters of symmetric multiprocessor systems in order to: (i) provide higher capacity at lower cost; (ii) enable easy future expansion, and (iii) ease the development of computational science applications. This strategy involves recognizing that the current vector supercomputer user community divides (roughly) into two groups, each of which will benefit from this approach: One, the \"capacity\" users (who tend to run production codes aimed at solving the science problems of today) will get better throughput than they do today by moving to large symmetric multiprocessor systems (SMPs), and a second group, the \"capability\" users (who tend to be developing new computational science techniques) will invest the time needed to get high performance from cluster-based parallel systems. In addition to the technology-based arguments for the strategy, we believe that it also supports a vision for a revitalization of scientific computing. This vision is that an architecture based on commodity components and computer science innovation will: (i) enable very scalable high performance computing to address the high-end computational science requirements; (ii) provide better throughput and a more productive code development environment for production supercomputing; (iii) provide a path to integration with the laboratory and experimental sciences, and (iv) be the basis of an on-going collaboration between the scientific community, the computing industry, and the research computer science community in order to provide a computing environment compatible with production codes and dynamically increasing in both hardware and software capability and capacity. We put forward the thesis that the current level of hardware performance and sophistication of the software environment found in commercial symmetric multiprocessor (SMP) systems, together with advances in distributed systems architectures, make clusters of SMPs one of the highest-performance, most cost-effective approaches to computing available today. The current capacity users of the C90-like system will be served in such an environment by having more of several critical resources than the current environment provides: much more CPU time per unit of real time, larger memory per node and much larger memory per cluster; and the capability users are served by an MPP-like performance and an architecture that enables continuous growth into the future. In addition to these primary arguments, secondary advantages of SMP clusters include: the ability to replicate this sort of system in smaller units to provide identical computing environments at the home sites and laboratories of scientific users; the future potential for using the global Internet for interconnecting large clusters at a central facility with smaller clusters at other sites to form a very high capability system; and a rapidly growing base of supporting commercial","PeriodicalId":270006,"journal":{"name":"Int. J. 
High Speed Comput.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115536523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Solution of Dense Linear Systems on the k-Ary n-Cube Networks","authors":"A. Al-Ayyoub, K. Day","doi":"10.1142/S0129053397000088","DOIUrl":"https://doi.org/10.1142/S0129053397000088","url":null,"abstract":"In this paper a parallel algorithm for solving systems of linear equation on the k-ary n-cube is presented and evaluated for the first time. The proposed algorithm is of O(N3/kn) computation complexity and uses O(Nn) communication time to factorize a matrix of order N on the k-ary n-cube. This is better than the best known results for the hypercube, O(N log kn), and the mesh, , each with approximately kn nodes. The proposed parallel algorithm takes advantage of the extra connectivity in the k-ary n-cube in order to reduce the communication time involved in tasks such as pivoting, row/column interchanges, and pivot row and multipliers column broadcasts.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131055220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping Tridiagonal System Algorithms onto Mesh Connected Computers","authors":"M. Amor, Juan López, Francisco Argüello, E. Zapata","doi":"10.1142/S012905339700009X","DOIUrl":"https://doi.org/10.1142/S012905339700009X","url":null,"abstract":"In this work we apply a methodology for the parallelization of algorithms for tridiagonal solvers. We classify tridiagonal solvers as a function of their data flows and present a unified version of the projection of these algorithms onto computers with mesh topology and distributed memory. Finally, we evaluate the algorithms and compare them through specific tests on the Fujitsu AP1000.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115181781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locality Optimizations for Parallel Computing Using Data Access Information","authors":"M. Rinard","doi":"10.1142/S0129053397000118","DOIUrl":"https://doi.org/10.1142/S0129053397000118","url":null,"abstract":"Given the large communication overheads characteristic of modern parallel machines, optimizations that improve locality by executing tasks close to data that they will access may improve the performance of parallel computations. This paper describes our experience automatically applying locality optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency. Jade programmers start with a program written in a standard serial, imperative language, then use Jade constructs to declare how parts of the program access data. The Jade implementation uses this data access information to automatically extract the concurrency and apply locality optimizations. We present performance results for several Jade applications running on the Stanford DASH machine. We use these results to characterize the overall performance impact of the locality optimizations. In our application set the locality optimization level has little effect on the performance of two of the applications and a large effect on the performance of the rest of the applications. We also found that, if the locality optimization level had a significant effect on the performance, the maximum performance was obtained when the programmer explicitly placed tasks on processors rather than relying on the scheduling algorithm inside the Jade implementation.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130555943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coordination of Distributed and Parallel Activities in the IWIM Model","authors":"G. A. Papadopoulos, F. Arbab","doi":"10.1142/S0129053397000106","DOIUrl":"https://doi.org/10.1142/S0129053397000106","url":null,"abstract":"We present an alternative way of designing new as well as using existing coordination models for parallel and distributed environments. This approach is based on a complete symmetry between and decoupling of producers and consumers, as well as a clear distinction between the computation and the coordination/ communication work performed by each process. The novel ideas are: (i) to allow both producer and consumer processes to communicate with each other in a fashion that does not dictate any one of them to have specific knowledge about the rest of the processes involved in a coordinated activity, and (ii) to introduce control or state driven changes (as opposed to the data-driven changes usually employed) to the current state of a computation. Although a direct realisation of this model in terms of a concrete coordination language does exist, we argue that the underlying principles can be applied to other similar models. We demonstrate our point by showing how the functionality of the proposed model can be realised in a general coordination framework, namely the Shared Dataspace one, using as driving force the Linda-based formalism. Our demonstration achieves the following objectives: (i) yields an alternative (control- rather than data-driven) Linda-based coordination framework, and (ii) does it in such a way that the proposed apparatus can be used for other Shared-Dataspace-like coordination formalisms with little modification.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"56 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117216825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Bidirectional Cholesky Factorization Algorithm for Parallel Solution of Sparse Symmetric Positive Definite Systems","authors":"K. Murthy, C. Murthy","doi":"10.1142/S0129053397000064","DOIUrl":"https://doi.org/10.1142/S0129053397000064","url":null,"abstract":"In this paper, we consider the problem of solving sparse linear systems occurring in finite difference applications (or N × N grid problems, N being the size of the linear system). We propose a new algorithm for the problem which is based on the Cholesky factorization, a symmetric variant of Gaussian elimination tailored to symmetric positive definite systems. The algorithm employs a new technique called bidirectional factorization to produce the complete solution vector by solving only one triangular system against two triangular systems in the existing Cholesky factorization after the factorization phase. The effectiveness of the new algorithm is demonstrated by comparing its performance with that of the existing Cholesky factorization for solving regular finite difference grid problems on hypercube multiprocessors.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130730423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementation of ART1 and ART2 Artificial Neural Networks on Ring and Mesh Architectures","authors":"G. D. Ghare, L. Patnaik","doi":"10.1142/S0129053397000052","DOIUrl":"https://doi.org/10.1142/S0129053397000052","url":null,"abstract":"The Artificial Neural Networks (ANNs) are being used to solve a variety of problems in pattern recognition, robotic control, VLSI CAD and other areas. In most of these applications, a speedy response from the ANNs is imperative. However, ANNs comprise a large number of artificial neurons, and a massive interconnection network among them. Hence, implementation of these ANNs involves execution of computer-intensive operations. The usage of multiprocessor systems therefore becomes necessary. In this article, we have presented the implementation of ART1 and ART2 ANNs on ring and mesh architectures. The overall system design and implementation aspects are presented. The performance of the algorithm on ring, 2-dimensional mesh and n-dimensional mesh topologies is presented. The parallel algorithm presented for implementation of ART1 is not specific to any particular architecture. The parallel algorithm for ARTE is more suitable for a ring architecture.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132538921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Cost Optimal Search Technique for the Knapsack Problem","authors":"D. Lou, Chinchen Chang","doi":"10.1142/S0129053397000027","DOIUrl":"https://doi.org/10.1142/S0129053397000027","url":null,"abstract":"The knapsack problem is known to be a typical NP-complete problem, which has 2n possible solutions to search over. Thus a task for solving the knapsack problem can be accomplished in 2n trials if an exhaustive search is applied. In the past decade, much effort has been devoted in order to reduce the computation time of this problem instead of exhaustive search. In 1984, Karnin proposed a brilliant parallel algorithm, which needs O(2n/6) processors to solve the knapsack problem in O(2n/2) time; that is, the cost of Karnin's parallel algorithm is O(22n/3). In this paper, we propose a fast search technique to improve Karnin's parallel algorithm by reducing the search time complexity of Karnin's parallel algorithm to be O(2n/3) under the same O(2n/6) processors available. Thus, the cost of the proposed parallel algorithm is O(2n/2). Furthermore, we extend this search technique to the case that the number of available processors is P = O(2x), where x ≥ 1. From the analytical results, we see that our search technique is indeed superior to the previously proposed methods. We do believe our proposed parallel algorithm is pragmatically feasible at the moment when multiprocessor systems become more and more popular.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134048959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Parallel Radix Sort Using a Reconfigurable Mesh","authors":"Ju-wook Jang, Kyung-Geun Lee","doi":"10.1142/S0129053397000040","DOIUrl":"https://doi.org/10.1142/S0129053397000040","url":null,"abstract":"In this paper, we present a parallel SIMD algorithm for radix sorting of N numbers of w bits each, taking O(w + N1/4) time with the VLSI area of O(N3/2 w2), 0 < w < N1/4. For w = log N, our algorithm improves a previous known solution on a similar architecture in time complexity by a factor of log N. Since our algorithm uses only radix sort for sorting of subsets and merging of them, no comparator is needed. Our algorithm satisfies the lower bound of AT2 complexity which mainly restricts the VLSI implementation of most sorting algorithms. The same result is obtained in another previously known solution, but it requires a comparator of size w.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121822636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}