{"title":"Memory System Design in Superscalar Processing","authors":"N. Lu, C. Chung","doi":"10.1142/S0129053395000233","DOIUrl":"https://doi.org/10.1142/S0129053395000233","url":null,"abstract":"In this paper, we study the memory system design for superscalar processing. Benchmarking is used to examine the execution behavior of load/store instructions, such as load/store parallelism and memory load/store port utilization. It is found that the use of only a single load/store port forms a system bottle-neck. A superscalar processor benefits from multiple load/store ports and system performance saturates with two load/store ports. The memory system must be carefully designed if multiple load/store ports are supported in a superscalar processor. Thus, we consider the design of the data cache subsystem. The data cache configurations we investigate include multiported cache, multibank cache, and duplicated cache. Through benchmarking, we find that the duplicated cache performs well in most benchmarks. Yet the cost of a duplicated cache is higher. In a superscalar multiprocessing environment, in order to properly maintain memory consistency, we must consider the load/store ordering of the processors. In superscalar processors, the load/store ordering may be in one of three forms: total ordering, load bypassing, and load forwarding. In this research, we conclude that to support the sequential consistency model, the load/store instructions must be totally ordered. Load bypassing and load forwarding are sufficient to support the processor consistency model.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116318642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Approximate Agreement Algorithm for Wraparound Meshes","authors":"R. Cheng, C. Chung","doi":"10.1142/S0129053395000221","DOIUrl":"https://doi.org/10.1142/S0129053395000221","url":null,"abstract":"An appropriate algorithm, the neighboring exchange, for reaching an approximate agreement in a wraparound mesh is proposed. The algorithm is characterized by its isotropic nature, which is of particular usefulness when applied in any symmetric system. The behavior of this algorithm can be depicted by recurrence relations which can be used to derive the convergence rate. The convergence rate is meaningful when the algorithm is used to synchnize clocks. The rate of synchronizing clocks is derived, and it can be applied to all wraparound meshes with practical scale. With the recurrence relations, we also prove the correctness of this algorithm.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117023150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Matrix Multiplication Algorithms on Hypercube Multiprocessors","authors":"Peizong Lee","doi":"10.1142/S012905339500021X","DOIUrl":"https://doi.org/10.1142/S012905339500021X","url":null,"abstract":"In this paper, we present three parallel algorithms for matrix multiplication. The first one, which employs pipelining techniques on a mesh grid, uses only one copy of data matrices. The second one uses multiple copies of data matrices also on a mesh grid. Although data communication operations of the second algorithm are reduced, the requirement of local data memory for each processing element increases. The third one, which uses a cubic grid, shows the trade-offs between reducing the computation time and reducing the communication overhead. Performance models and feasibilities of these three algorithms are studied. We analyze the interplay among the numbers of processing elements, the communication overhead, and the requirements of local memory in each processing element. We also present experimental results of these three algorithms on a 32-node nCUBE-2 computer.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130131670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multithreaded Decoupled Architecture","authors":"M. Dorojevets, V. Oklobdzija","doi":"10.1142/S0129053395000257","DOIUrl":"https://doi.org/10.1142/S0129053395000257","url":null,"abstract":"A new computer architecture called the Multithreaded Decoupled Architecture has been proposed for exploiting fine-grain parallelism. It develops further some of the ideas of parallel processing implemented in the Russian MARS-M computer in the 1980s. The MTD architecture aims at enhancing both total machine throughput and a single thread performance. To achieve this goal, we propose a two-level parallel computation model. Its low level defines the decoupled parallel execution of instructions within program fragments not containing branches. We will be referring to these fragments as basic blocks. The model’s high level defines the parallel execution of multiple basic blocks representing a function or procedure. This scheduling hierarchy reflects the MTD storage hierarchy. Together the scheduling and storage models allow a processor with multiple execution units to exploit several forms of parallelism within a procedure. The compiler provides the hardware with thread register usage masks to allow run-time enforcing of control and data dependencies between the high level threads. We present a possible implementation of the MTD-processor with multiple execution units and two-level distributed register memory.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116828787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Optimal Weighted Binary Trees","authors":"J. Pradhan, C. V. Sastry","doi":"10.1142/S0129053395000245","DOIUrl":"https://doi.org/10.1142/S0129053395000245","url":null,"abstract":"A new recursive top-down algorithm for the construction of a unique Huffman tree is introduced. We show that the prefix codes generated from the Huffman tree are unique and the weighted path length is optimal. Initially we have not imposed any restriction on the maximum length (the number of bits) a prefix code can take. But if buffering of the source is required, we have to put a restriction on the length of the prefix code. In this context we extend the top-down recursive algorithm for generating length-limited prefix codes.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123319503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking Fortran Intrinsic Functions","authors":"Toru Nagai","doi":"10.1142/S0129053395000129","DOIUrl":"https://doi.org/10.1142/S0129053395000129","url":null,"abstract":"High performance of mathematical functions is essential to speed up scientific calculations because they are very frequently used in scientific computing. This paper presents performance of important Fortran intrinsic functions on the fastest vector supercomputers. It is assumed that a relationship between CPU-time and the number of function arguments given to calculate function values is linear, and speeds of a function were measured using the parameters and . The author also examines how the speed of the function varies with respect to the selection of arguments. The computers tested in the present paper are Cray C9016E/16256– 4, Fujitsu VP2600/10, Hitachi S-3800/480 and NEC SX-3/14R.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123460574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Block Preconditioned Conjugate Gradient Methods on a Distributed Virtual Shared Memory Multiprocessor","authors":"L. Giraud","doi":"10.1142/S0129053395000105","DOIUrl":"https://doi.org/10.1142/S0129053395000105","url":null,"abstract":"We study both shared and distributed approaches for the parallel implementation of the SSOR and Jacobi block preconditioned Krylov methods on a distributed virtual shared memory computer: a BBN TC2000. We consider the solution of block tridiagonal systems arising from the discretization of 3D partial differential equations, which diagonal blocks correspond to the discretization of 2D partial differential equations. The solution of the diagonal subproblems required for the preconditionings are performed using a domain decomposition method with overlapped subdomains: a variant of the Schwarz alternating method.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124538696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Minimal Synchronization Overhead Affinity Scheduling Algorithm for Shared-Memory Multiprocessors","authors":"Yi-Min Wang, R. Chang","doi":"10.1142/S0129053395000130","DOIUrl":"https://doi.org/10.1142/S0129053395000130","url":null,"abstract":"In addition to load balancing and synchronization overhead, affinity is an important consideration for loop scheduling algorithms in modern multiprocessors. Algorithms based on affinity, like affinity scheduling (AFS), do perform better than dynamic algorithms, such as guided self-scheduling (GSS) and trapezoid self-scheduling (TSS). However, there is still room for improvement in affinity scheduling. This paper suggests a modification to AFS which combines the advantages of both GSS and AFS. Experimental results confirm the effectiveness of the proposed modification.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126933874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A General-Purpose Parallel Sorting Algorithm","authors":"A. Tridgell, R. Brent","doi":"10.1142/S0129053395000166","DOIUrl":"https://doi.org/10.1142/S0129053395000166","url":null,"abstract":"A parallel sorting algorithm is presented for general purpose internal sorting on MIMD machines. The algorithm initially sorts the elements within each node using a serial sorting algorithm, then proceeds with a two-phase parallel merge. The algorithm is comparison-based and requires additional storage of order the square root of the number of elements in each node. Performance of the algorithm on the Fujitsu AP1000 MIMD supercomputer is discussed.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"430 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123573903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Factorized Sparse Approximate Inverse Preconditioning II: Solution of 3D FE Systems on Massively Parallel Computers","authors":"L. Kolotilina, A. Yeremin","doi":"10.1142/S0129053395000117","DOIUrl":"https://doi.org/10.1142/S0129053395000117","url":null,"abstract":"An iterative method for solving large linear systems with sparse symmetric positive definite matrices on massively parallel computers is suggested. The method is based on the Factorized Sparse Approximate Inverse (FSAI) preconditioning of ‘parallel’ CG iterations. Efficiency of a concurrent implementation of the FSAI-CG iterations is analyzed for a model hypercube, and an estimate of the optimal hypercube dimension is derived. For finite element applications, two strategies for selecting the preconditioner sparsity pattern are suggested. A high convergence rate of the resulting iterations is demonstrated numerically for the 3D equilibrium equations for linear elastic orthotropic materials approximated using both h- and p-versions of the FEM.","PeriodicalId":270006,"journal":{"name":"Int. J. High Speed Comput.","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125318516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}