{"title":"Defects in parallel Monte Carlo and quasi-Monte Carlo integration using the leap-frog technique","authors":"K. Entacher, Thomas Schell, W. C. Schmid, A. Uhl","doi":"10.1080/1063719031000088021","DOIUrl":"https://doi.org/10.1080/1063719031000088021","url":null,"abstract":"Currently, the most efficient numerical techniques for evaluating high-dimensional integrals are based on Monte Carlo and quasi-Monte Carlo techniques. These tasks require a significant amount of computation and are therefore often executed on parallel computer systems. In order to keep the communication amount within a parallel system to a minimum, each processing element (PE) requires its own source of integration nodes. Therefore, techniques for using separately initialized and disjoint portions of a given point set on a single PE are classically employed. Using the so-called substreams may lead to dramatic errors in the results under certain circumstances. In this work, we compare the possible defects employing leaped quasi-Monte Carlo and Monte Carlo substreams. Apart from comparing the magnitude of the observed integration errors we give an overview under which circumstances (i.e. parallel programming models) such errors can occur.","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133254130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Special Issue: A systolic block-Jacobi SVD solver for processor meshes","authors":"G. Okša, M. Vajtersic","doi":"10.1080/1063719031000088003","DOIUrl":"https://doi.org/10.1080/1063719031000088003","url":null,"abstract":"We design the systolic version of the two-sided block-Jacobi algorithm for the singular value decomposition (SVD) of matrix A∈R m×n , and m, n even. The algorithm involves the class CO of parallel orderings on the two-dimensional toroidal mesh with p processors. The mathematical background is based on the QR decomposition (QRD) of local data matrices and on the triangular Kogbetliantz algorithm (TKA) for local SVDs in the diagonal mesh processors. Subsequent updates of local matrices in the diagonal as well as nondiagonal mesh processors are required. We show that all updates can be realized by orthogonal modified Givens rotations. These rotations can be efficiently pipelined in parallel in the horizontal and vertical rings of processor through the toroidal mesh. Our solution requires, per one mesh processor, systolic processing elements (PEs) and additional delay elements. The time complexity can be estimated as where w is the number of global sweeps in the two-sided block-Jacobi algorithm and Δ is the length of the global synchronization time step. The VLSI area per mesh processor, measured by the number of vertical and horizontal wires required for its construction, can be estimated as and the combined VLSI area–time complexity per mesh processor is The theoretical speedup can be estimated as Using the mesh processors of fixed inner size , even, it is possible to construct the square two-dimensional toroidal mesh and to compute the SVD of matrix A, the size of the which matches the shape of mesh processors, i.e. In this sense, the systolic algorithm is scalable.","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125612576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LEFTMOST EIGENVALUE OF REAL AND COMPLEX SPARSE MATRICES ON PARALLEL COMPUTER USING APPROXIMATE INVERSE PRECONDITIONING","authors":"G. Pini","doi":"10.1080/10637190208941433","DOIUrl":"https://doi.org/10.1080/10637190208941433","url":null,"abstract":"An efficient parallel approach for the computation of the eigenvalue of smallest absolute magnitude of sparse real and complex matrices is provided. The proposed strategy tries to improve the efficiency of the reverse power method. At each inverse power iteration the linear system is solved either by the conjugate gradient scheme (symmetric case) or by the Bi-CGSTAB method (symmetric case). Both solvers are preconditioned employing the approximate inverse factorization and thus are easily parallelized. The satisfactory speed-ups obtained on the CRAY T3E supercomputer show the high degree of parallelization reached by the proposed algorithm.","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127201542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AN O∥LOG P) PARALLEL IMPLEMENTATION OF FEEDBACK GUIDED DYNAMIC LOOP SCHEDULING","authors":"T. Tabirca, Len Freeman, S. Tabirca","doi":"10.1080/10637190208941438","DOIUrl":"https://doi.org/10.1080/10637190208941438","url":null,"abstract":"Feedback Guided Dynamic Loop Scheduling (FGDLS) is a recently proposed dynamic algorithm for loop scheduling. The original algorithm required an O(p) serial computation at each stage to compute the updated loop schedule. In this paper, it is shown that this computation can be implemented in O(log p) operations on p processors","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130489061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NUMERICAL SOLUTION OF DISCRETE STABLE LINEAR MATRIX EQUATIONS ON MULTICOMPUTERS","authors":"P. Benner, E. S. Quintana‐Ortí, G. Quintana-Ortí","doi":"10.1080/10637190208941436","DOIUrl":"https://doi.org/10.1080/10637190208941436","url":null,"abstract":"We investigate the parallel performance of numerical algorithms for solving discrete Sylvester and Stein equations as they appear for instance in discrete-time control problems, filtering, and image restoration. The methods used here are the squared Smith iteration and the sign function method on a Cayley transformation of the original equation. For Stein equations with semidefinite right-hand side these methods are modified such that the Cholesky factor of the solution can be computed directly without forming the solution matrix explicitly. We report experimental results of these algorithms on distributed-memory multicomputers","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"464 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116185894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PORTING REGULAR APPLICATIONS ON HETEROGENEOUS WORKSTATION NETWORKS: PERFORMANCE ANALYSIS AND MODELING","authors":"A. Clematis, A. Corana","doi":"10.1080/01495730108941441","DOIUrl":"https://doi.org/10.1080/01495730108941441","url":null,"abstract":"Abstract Heterogeneous networks of workstations and/or personal computers (NOW) are increasingly used as a powerful platform for the execution of parallel applications. When applications previously developed for traditional parallel machines (homogeneous and dedicated) are ported to NOWs, performance worsens owing in part to less efficient communications but more often to unbalancing. In this paper, we address the problem of the efficient porting to heterogeneous NOWs of data-parallel applications originally developed using the SPMD paradigm for homogeneous parallel systems with regular topology like ring. To achieve good performance, the computation time on the various machines composing the NOW must be as balanced as possible. This can be obtained in two ways: by using an heterogeneous data partition strategy with a single process per node, or by splitting homogeneously data among processes and assigning to each node a number of processes proportional to its computing power. The first method is however more difficult, since some modifications in the code are always needed, whereas the second approach requires very few changes. We carry out a simplified but reliable analysis, and propose a simple model able to simulate performance in the various situations. Two test cases, matrix multiplication and computation of long-range interactions, are considered, obtaining a good agreement between simulated and experimental results. Our analysis shows that an efficient porting of regular homogeneous data-parallel applications on heterogeneous NOWs is possible. Particularly, the approach based on multiple processes per node turns out to be a straightforward and effective way for achieving very satisfying performance in almost all situations, even dealing with highly heterogeneous systems.","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127652873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DERIVING A FAST SYSTOLIC ALGORITHM FOR THE LONGEST COMMON SUBSEQUENCE PROBLEM","authors":"Yen-Chun Lin, J. Yeh","doi":"10.1080/10637190208941431","DOIUrl":"https://doi.org/10.1080/10637190208941431","url":null,"abstract":"The longest common subsequence (LCS) problem is to find an LCS of two given sequences and the length of the LCS. In this paper, an efficient systolic algorithm for the LCS problem is derived. For two sequences of length m and n, where m ≥ n, the problem can be solved with only [n/2] processors in m + 2[n/2] − 1 time steps. Compared with other systolic algorithms that solve the LCS problem, our algorithm not only takes fewer time steps but also uses fewer processors. Our algorithm is better suited to implementation on multicomputers than other systolic algorithms.","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127810884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A PARALLEL DIVIDE AND CONQUER ALGORITHM FOR NON SYMMETRIC TRIDIAGONAL TOEPLITZ SYSTEMS USING CONJUGATE GRADIENT","authors":"L. Garey, R. E. Shaw, J. Zhang","doi":"10.1080/01495730208941443","DOIUrl":"https://doi.org/10.1080/01495730208941443","url":null,"abstract":"Abstract In this paper, we consider the application of the conjugate gradient method specifically to solve non symmetric systems which are large, tridiagonal and Toeplitz. Under the condition that the system is diagonally dominant, one can pre-multiply the system by the transpose of the coefficient matrix and take advantage of the structure of the new coefficient matrix to perturb and factor it. This allows us to divide the task of solution containing pairs of tridiagonal, symmetric and Toeplitz systems and to solve the pairs of systems using a parallel implementaton of congujate gradient. Final corrections, to account for the perturbations, provide a numerical approximation to the solution.","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116687927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"THE LOAD DISTRIBUTION PROBLEM IN A PROCESSOR RING","authors":"F. Lau","doi":"10.1080/01495730108941440","DOIUrl":"https://doi.org/10.1080/01495730108941440","url":null,"abstract":"Abstract Given a global picture of the system load and the average load, the load distribution problem is to find a suitable schedule, consisting of the amount of excess load to transfer along every edge, so that the system load can be balanced in minimal time by executing the schedule. We study this problem for the ring topology We discuss some existing algorithms, show how they fall short of being able to generate optimal schedules, and present a simple algorithm that would generate an optimal schedule for any given system load instance. This simple algorithm relies on an existing algorithm to create a search window in which the optimal solution is to be found.","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126049067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ON MAX CUT IN CUBIC GRAPHS","authors":"T. Calamoneri, Irene Finocchi, Y. Manoussakis, R. Petreschi","doi":"10.1080/01495730108941439","DOIUrl":"https://doi.org/10.1080/01495730108941439","url":null,"abstract":"Abstract This paper is concerned with the maximum cut problem in parallel on cubic graphs. New theoretical results characterizing the cardinality of the cut are presented. These results make it possible to design a simple combinatorial O(log n) time parallel algorithm, running on a CRCW P-RAM with O(n) processors. The approximation ratio achieved by the algorithm is 1·3 and improves the best known parallel approximation ratio, i.e. 2, in the special class of cubic graphs. The algorithm also guarantees that the size of the returned cut is at least ((9g −3)/8 g)n, where g is the odd girth of the input graph. Experimental results round off the paper, showing that the solutions obtained in practice are likely to be much better than the theoretical lower bound.","PeriodicalId":406098,"journal":{"name":"Parallel Algorithms and Applications","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116805085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}