{"title":"A register allocation technique using register existence graph","authors":"A. Koseki, Y. Fukazawa, H. Komatsu","doi":"10.1109/ICPP.1997.622673","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622673","url":null,"abstract":"Optimizing compilation is very important for generating code sequences in order to utilize the characteristics of processor architectures. One of the most essential optimization techniques is register allocation. In register allocation that takes account of instruction-level parallelism, anti-dependences generated when the same register is allocated to different variables, and spill code generated when the number of registers is insufficient should be handled in such a way that the parallelism in a program is not lost. In our method, we realized register allocation using a new data structure called the register existence graph, in which the parallelism in a program is well expressed.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125029525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient parallel algorithms for optimally locating a k-leaf tree in a tree network","authors":"S. Ku, W. Shih, Biing-Feng Wang","doi":"10.1109/ICPP.1997.622537","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622537","url":null,"abstract":"In this paper, an efficient parallel algorithm is proposed for finding a k-tree core of a tree network. The proposed algorithm performs on the EREW PRAM in O(log n log* n) time using O(n) work.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123551525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient parallel algorithms on distance-hereditary graphs","authors":"S. Hsieh, Chin-Wen Ho, T. Hsu, M. Ko, Gen-Huey Chen","doi":"10.1109/ICPP.1997.622541","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622541","url":null,"abstract":"We present efficient parallel algorithms for finding a minimum weighted connected dominating set and a minimum weighted Steiner tree for a distance-hereditary graph, which take O(log n) time using O(n+m) processors on a CRCW PRAM, where n and m are the number of vertices and edges of a given graph, respectively. We also find a maximum weighted clique of a distance-hereditary graph in O(log^2 n) time using O(n+m) processors on a CREW PRAM.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114601377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Turn grouping for efficient barrier synchronization in wormhole mesh networks","authors":"Kuo-Pao Fan, C. King","doi":"10.1109/ICPP.1997.622588","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622588","url":null,"abstract":"Barrier is an important synchronization operation. On scalable parallel computers, it is often implemented as a collective communication with a reduction operation followed by a distribution operation. In this paper, we introduce a systematic way of generating efficient algorithms to perform barrier synchronization in mesh networks. The scheme works with any base routing algorithm derivable from the turn model. Our scheme extends the turn grouping method with two new algorithms, Tail to Central and Central to Tail, for scheduling the message transmission in the reduction and distribution phase respectively. Simulation results show that our approach can take advantage of the adaptivity of the turn-model based routing algorithms and outperform methods proposed previously.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128412912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How much does network contention affect distributed shared memory performance?","authors":"Donglai Dai, D. Panda","doi":"10.1109/ICPP.1997.622680","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622680","url":null,"abstract":"Most recent research on distributed shared memory (DSM) systems has focused on either careful design of node controllers or cache coherence protocols. While evaluating these designs, simplified models of networks (constant latency or average latency based on the network size) are typically used. Such models completely ignore network contention. To help network designers to design better networks for DSM systems, in this paper, we focus on two goals: 1) to isolate and quantify the impact of network link contention and network interface contention on the overall performance of DSM applications and 2) to study the impact of critical architectural parameters on these two categories of network contention. We achieve these goals by evaluating a set of SPLASH2 benchmarks on a DSM simulator using three network models. For an 8×8 wormhole system, our results show that network contention can degrade performance up to 59.8%. Out of this, up to 7.2% is caused by network interface contention alone. The study indicates that network contention becomes dominant for DSM systems using small caches, wide cache line sizes, low degrees of associativity, high processing node speeds, high memory speeds, low network speeds, or small network link widths.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128252528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Broadcast-efficient sorting in the presence of few channels","authors":"K. Nakano, S. Olariu, J. Schwing","doi":"10.1109/ICPP.1997.622534","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622534","url":null,"abstract":"We present simple and broadcast-efficient ranking and sorting algorithms on the broadcast communication model (BCM, for short) with few communication channels. At the heart of our algorithms is a new and elegant sampling and bucketing scheme whose main feature is that the resulting buckets are well balanced, making costly rebalancing unnecessary. The resulting ranking algorithm uses only 2n/k+o(n/k) broadcast rounds, while 3n/k+o(n/k) broadcast rounds are needed for sorting on a k-channel, n-processor BCM whenever k ≤ √(n/log n). These bounds are fairly tight, when compared with the trivial lower bound of n/k broadcast rounds necessary to permute n items using k communication channels.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129281105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Euler path based technique for deadlock-free multicasting","authors":"N. Agrawal, C. Ravikumar","doi":"10.1109/ICPP.1997.622669","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622669","url":null,"abstract":"The existing algorithms for deadlock-free multicasting in interconnection networks assume the Hamiltonian property in the network topology. However, these networks fail to be Hamiltonian in the presence of faults. This paper investigates the use of Euler circuits in deadlock-free multicasting. Not only are Euler circuits known to exist in all connected networks, a fast polynomial-time algorithm exists to find an Euler circuit in a network. We present a multicasting algorithm which works for both regular and irregular topologies. Our algorithm is applicable to store-and-forward as well as wormhole-routed networks. We show that at most two virtual channels are required per physical channel for any connected network. We also prove that no virtual channels are required to achieve deadlock-free multicasting on a large class of networks. Unlike other existing algorithms for deadlock-free multicasting in faulty networks, our algorithm requires a small amount of information to be stored at each node. The potential of our technique is further illustrated with the help of various examples. A performance analysis on wormhole-routed networks shows that our routing algorithm out-performs existing multicasting procedures.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"41 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114012852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the performance of out-of-core computations","authors":"M. Kandemir, J. Ramanujam, A. Choudhary","doi":"10.1109/ICPP.1997.622574","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622574","url":null,"abstract":"The difficulty of handling out-of-core data limits the potential of parallel machines and high-end supercomputers. Since writing an efficient out-of-core version of a program is a difficult task and since virtual memory systems do not perform well on scientific computations, we believe that there is a clear need for compiler-directed explicit I/O approach for out-of-core computations. In this paper, we present a compiler algorithm to optimize locality of disk accesses in out-of-core codes by choosing a good combination of file layouts on disks and loop transformations. The transformations change the access order of array data. Experimental results obtained on IBM SP-2 and Intel Paragon provide encouraging evidence that our approach is successful at optimizing programs which depend on disk-resident data in distributed-memory machines.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122033391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Embedding of binomial trees in hypercubes with link faults","authors":"Jie Wu, E. Fernández, Ying-Chen Lo","doi":"10.1109/ICPP.1997.622564","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622564","url":null,"abstract":"We study the embedding of binomial trees with variable roots in n-dimensional hypercubes (n-cubes) with faulty links. A simple embedding algorithm is first proposed that can embed an n-level binomial tree in an n-cube with up to n-1 faulty links in log(n-1) steps. We then extend the result to show that spanning binomial trees exist in a connected n-cube with up to [3(n-1)/2]-1 faulty links. Our results reveal the fault tolerance property of hypercubes and they can be used to predict the performance of broadcasting and reduction operations, where the binomial tree structure is commonly used.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127236969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance and configuration of hierarchical ring networks for multiprocessors","authors":"V. Hamacher, Hong Jiang","doi":"10.1109/ICPP.1997.622653","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622653","url":null,"abstract":"Analytical queueing network models for expected message delay in 2-level and 3-level hierarchical-ring interconnection networks (INs) are developed. Such networks have recently been used in commercial and research prototype multiprocessors. A major class of traffic carried by these INs consists of cache line transfers, and associated coherency control messages, between processor caches and remote memory modules in shared-memory multiprocessors. Memory modules are assumed to be evenly distributed over the processor nodes. Such traffic consists of short, fixed-length messages. They can be conveniently transported using the slotted ring transmission technique, which is studied here. The message delay results derived from the models are shown to be quite accurate when checked against a simulation study. The comparisons to simulations include heavy traffic situations where queueing delays in ring crossover switches are significant for ring utilization levels of 80 to 90%. As well as facilitating analysis, the analytical models can be used to determine optimal sizes for the rings at different levels in the hierarchy under specified traffic distributions in a system with a given total number of processor nodes. Optimality is in terms of minimizing average message delay. A specific example of such a design exercise is provided for the uniform traffic case.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128895450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}