{"title":"Experiments on high-priority cold requests in the presence of tree saturation","authors":"Jin-Ho Lee, Myong-Soon Park","doi":"10.1109/ICPADS.1994.589898","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.589898","url":null,"abstract":"In large-scale shared memory multiprocessors, when a multistage interconnection network (MIN) is used for communication between processors and memory modules, hot spot and tree saturation severely delay memory requests and degrade memory bandwidth. We propose the Cold-First scheme, which is based on priority control and virtual channel flow control concepts, to reduce the delay of cold requests in the presence of hot spots. By simulations and results, we show that Cold-First scheme reduces the delay of memory requests, especially the delay of cold requests, and improves the memory bandwidth. In addition, we study the effect caused by the long delay of hot requests on lock and unlock mechanisms generally used for synchronization.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"28 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114002689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing procedure call overhead: optimizing register usage at procedure calls","authors":"F. Lai, Chia-Jung Hsieh","doi":"10.1109/ICPADS.1994.590416","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.590416","url":null,"abstract":"Proposes a common global variable reassignment and an integrated approach which takes advantage of the complementary relationship of (1) in-lining and (2) interprocedural register allocation to reduce the procedure call overhead without causing any additional negative effect. Our approach is based on the observation of analyzed program characteristics to identify the heavily called procedure regions, and on register usage information to optimize the placement of register save/restore code. This method also takes full advantage of free-use registers at each procedure call site. The average performance improvement is 1.233 compared with previous schemes that performed either (1) or (2) independently.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124881166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thread migration on heterogeneous systems via compile-time transformations","authors":"J. Sang, G. Peters","doi":"10.1109/ICPADS.1994.590411","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.590411","url":null,"abstract":"Describes a technique to provide multi-threading in an enhanced C language. In contrast to the traditional design of a thread library, which usually utilizes a few lines of assembly code to effect context-switching between threads, the technique we use is based on compile-time program transformations and a run-time library. Since this approach transforms a thread's physical states into logical forms, thread migration in a heterogeneous distributed environment becomes practically feasible. Performance measurements of the current implementation are reported.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124886439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On evaluating parallel sparse Cholesky factorizations","authors":"W.-Y. Lin, C.-L. Chen","doi":"10.1109/ICPADS.1994.590074","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.590074","url":null,"abstract":"Though many parallel implementations of sparse Cholesky factorization, accompanied by experimental results, have been proposed, it seems hard to evaluate the performance of these factorization methods theoretically because of the irregular structure of sparse matrices. This paper attempts such an evaluation. On the basis of the criteria of parallel computation and communication time, we successfully evaluate four widely adopted Cholesky factorization methods, including column-Cholesky, row-Cholesky, submatrix-Cholesky and multifrontal. The results show that the multifrontal method is superior to the others.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124890073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel design of Q-coders for bilevel image compression","authors":"Jianmin Jiang","doi":"10.1109/ICPADS.1994.590299","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.590299","url":null,"abstract":"A parallel algorithm is presented in this paper to implement the adaptive binary arithmetic coding for lossless bilevel image compression. Based on the sequential Q-coder, software analysis in C is carried out to establish a tree array to process 4 bits in parallel. This development of parallel Q-coder substantially improves the encoding speed of bilevel images. As a matter of fact, the parallel algorithm can also be extended theoretically to any number of bits to be processed in parallel. The implication involved will be the design of internal structure for each PE, especially the buffer size where each bit renormalized locally is to be updated by the PE at the next level before it is sent out at the top of the tree array.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126021345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Broadcast in all-port wormhole-routed 3D mesh networks using extended dominating sets","authors":"Y. Tsai, P. McKinley","doi":"10.1109/ICPADS.1994.590061","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.590061","url":null,"abstract":"A new approach to broadcast in wormhole-routed three-dimensional (3D) mesh networks is proposed. The approach extends the concept of dominating sets from graph theory by accounting for the relative distance-insensitivity of the wormhole routing switching strategy and by taking advantage of an all-port communication architecture, which allows each node to simultaneously transmit messages on different outgoing channels. The resulting broadcast operation is based on a tree structure that is composed of multiple levels of extended dominating nodes (EDN). Performance evaluation results, in the form of analysis and simulation, are presented that confirm the advantage of this technique over the recursive doubling approaches to broadcast.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125761047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault tolerance in hyperbus and hypercube multiprocessors using partitioning scheme","authors":"Szu-Chi Wang, S. Kuo","doi":"10.1109/ICPADS.1994.590319","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.590319","url":null,"abstract":"In this paper, the partitioning scheme is used to achieve fault tolerance in hyperbus and hypercube multiprocessors. Unlike other schemes, processor faults are assumed to be randomly distributed. We propose a novel and practical load redistribution method to tolerate processor faults in a hyperbus structure with insignificant overhead (a slowdown of 2 for computation and a slowdown of 3 for communication in the worst case). Standard routing and broadcasting algorithms were implemented on hypercube computers. To achieve fault tolerance, we present routing and broadcasting algorithms for a faulty hypercube with at most n-1 faults. Compared with other existing algorithms, our methods have better performance in most measures.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114931421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semigroup computation and its applications on mesh-connected computers with hyperbus broadcasting","authors":"S. Horng","doi":"10.1109/ICPADS.1994.589892","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.589892","url":null,"abstract":"Let ⊕ be an associative operation on a domain D. The semigroup problem is to compute a_0 ⊕ a_1 ⊕ ... ⊕ a_{N-1}, where a_i ∈ D for 0 ≤ i < N. The algorithm described here runs on SIMD mesh-connected computers with hyperbus broadcasting using p processors in time O(N/p + log p), where p ≤ N. It is shown to be optimal when p = N and to achieve optimal speedup when p log p = N. Based on the proposed semigroup algorithm, other applications such as matrix multiplication, all-pair shortest path, shortest path spanning tree, topological sorting and connected component problems can also be solved in the order of logarithmic time using N^3 processors.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127082880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGIN: a modified Gamma interconnection network with multiple disjoint paths","authors":"Po-Jen Chuang","doi":"10.1109/ICPADS.1994.590332","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.590332","url":null,"abstract":"To ensure high terminal reliability for the Gamma interconnection network (GIN), we propose a new modified GIN, referred to as CGIN (cyclic Gamma interconnection network) as its connecting patterns between stages exhibit a cyclic feature. The fact that there exist multiple disjoint paths between any communication pair for all types of CGINs makes it possible to tolerate any arbitrary single fault and to accomplish enhanced terminal reliability accordingly. The performance of the CGIN is also evaluated through simulation.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128991819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance modelling and evaluation for the XMP shared-bus multiprocessor architecture","authors":"Chiung-San Lee, T. Parng, Jew-Chin Lee, Cheng-Nan Tsai, K. Farn, Lin-Ching Chang, T. Chung, L.-P. Chen","doi":"10.1109/ICPADS.1994.590354","DOIUrl":"https://doi.org/10.1109/ICPADS.1994.590354","url":null,"abstract":"This paper presents the performance modelling and evaluation of a shared bus multiprocessor, XMP. A key characteristic of XMP is that it employs a special shared bus scheme featuring separate address bus and data bus with split transaction, pipelined cycle (called SSTP scheme). To assist evaluating the architectural alternatives of XMP, the features of the SSTP bus scheme as well as two important performance impacting factors: (1) cache, bus, and memory interferences and (2) DMA transfer, are modelled. We employ a Subsystem Access Time (SAT) modelling methodology. It is based on a Subsystem Access Time Per Instruction (SATPI) concept, in which we treat major components other than processors (e.g. off-chip cache, bus, memory, I/O) as subsystems and model for each of them the mean access time per instruction from each processor. Validated by statistical simulations, the performance model is fed with a given set of representative workload parameters, and then used to conduct performance evaluation for some initial system design issues. Furthermore, the SATPIs of the subsystems are directly utilized to identify the bottleneck subsystems and to help analyze the cause of the bottleneck.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"424 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132333751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}