{"title":"Optimal sorting algorithms on incomplete meshes with arbitrary fault patterns","authors":"C. Yeh, B. Parhami","doi":"10.1109/ICPP.1997.622530","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622530","url":null,"abstract":"In this paper we propose simple and efficient algorithms for sorting on incomplete meshes. No hardware redundancy is required and no assumption is made about the availability of a complete submesh. The proposed robust sorting algorithms are very efficient when only a few processors are faulty and degrade gracefully as the number of faults increases. In particular we show that 1-1 sorting (1 key per healthy processor) in row-major or snakelike row-major order can be performed in 3n+o(n) communication and comparison steps on an n×n incomplete mesh that has an arbitrary pattern of o(√n) faulty processors. This is the fastest algorithm reported thus far for sorting in row-major and snakelike row-major orders on faulty meshes and the time complexity is quite close to its lower bound.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115642520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiler techniques for effective communication on distributed-memory multiprocessors","authors":"A. Navarro, E. Zapata, Y. Paek, D. Padua","doi":"10.1109/ICPP.1997.622559","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622559","url":null,"abstract":"The Polaris restructurer transforms conventional Fortran programs into parallel form for various types of multiprocessor systems. This paper presents the results of a study on strategies to improve the effectiveness of Polaris' techniques for distributed-memory multiprocessors. Our study, which is based on the hand analysis of MDG and TRFD from the Perfect Benchmarks and TOYCATV and SWIM from SPEC benchmarks, identified three techniques that are important for improving communication optimization. Their application produces almost perfect speedups for the four programs on the Cray T3D.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"9 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125859146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance analysis and simulation of multicast networks","authors":"Yuanyuan Yang, Jianchao Wang","doi":"10.1109/ICPP.1997.622670","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622670","url":null,"abstract":"In this paper, we look into the issue of supporting multicast in the well-known three-stage Clos network or ν(m, n, r) network. We first develop an analytical model for the blocking probability of the ν(m, n, r) multicast network, and then study the blocking behavior of the network under various routing control strategies through simulations. Our analytical and simulation results show that a ν(m, n, r) network with a small number of middle switches m, such as m=n+c or dn, where c and d are small constants, is almost nonblocking for multicast connections, although theoretically it requires m ≥ Θ(n log r / log log r) to achieve nonblocking for multicast connections. We also demonstrate that routing control strategies are effective for reducing the blocking probability of the multicast network. The best routing control strategy can provide a factor of 2 to 3 performance improvement over random routing. The results indicate that a ν(m, n, r) network with a comparable cost to a permutation network can provide cost-effective support for multicast communication.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129834560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stride-directed prefetching for secondary caches","authors":"Sunil Kim, A. Veidenbaum","doi":"10.1109/ICPP.1997.622661","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622661","url":null,"abstract":"This paper studies hardware prefetching for second-level (L2) caches. Previous work on prefetching has been extensive but largely directed at primary caches. In some cases only L2 prefetching is possible or is more appropriate. By studying L2 prefetching characteristics we show that existing stride-directed methods for L1 caches do not work as well in L2 caches. We propose a new stride-detection mechanism for L2 prefetching and combine it with the stream buffers used in Palacharla and Kessler (1994). Our evaluation shows that this new prefetching scheme is more effective than stream buffer prefetching, particularly for applications with long-stride accesses. Finally, we evaluate an L2 cache prefetching organization which combines a small L2 cache with our stride-directed prefetching scheme. Our results show that this system performs significantly better than stream buffer prefetching or a larger non-prefetching L2 cache without suffering from a significant increase in the memory traffic.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126847721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient multicast algorithms in all-port wormhole-routed hypercubes","authors":"Vivek Halwan, F. Özgüner","doi":"10.1109/ICPP.1997.622562","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622562","url":null,"abstract":"This paper presents several recursive heuristic methods for multicasting in all-port dimension-ordered wormhole-routed hypercubes. The methods described are stepwise contention-free and are primarily designed to reduce the number of communication steps. Experiments show that the number of steps can be significantly reduced compared to depth contention-free solutions previously described. These methods are also shown to be source-controlled depth contention-free and can be considered a generalization of the broadcast method described by C.T. Ho and M. Kao (1995) which is the most efficient method known.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124527996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An improved analytical model for wormhole routed networks with application to butterfly fat-trees","authors":"R. I. Greenberg, L. Guan","doi":"10.1109/ICPP.1997.622554","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622554","url":null,"abstract":"A performance model for wormhole routed interconnection networks is presented and applied to the butterfly fat-tree network. Experimental results agree very closely over a wide range of load rate. Novel aspects of the model, leading to accurate and simple performance predictions, include: (1) use of multiple-server queues, and (2) a general method of correcting queuing results based on Poisson arrivals to apply to wormhole routing. These ideas can also be applied to other networks.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127788825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Background compensation and an active-camera motion tracking algorithm","authors":"R. Gupta, M. Theys, H. Siegel","doi":"10.1109/ICPP.1997.622677","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622677","url":null,"abstract":"Motion tracking using an active camera is a very computationally complex problem. Existing serial algorithms have provided frame rates that are much lower than those desired, mainly because of the lack of computational resources. Parallel computers are well suited to image processing tasks and can provide the computational power that is required for real-time motion tracking algorithms. This paper develops a parallel implementation of a known serial motion tracking algorithm, with the goal of achieving greater than real-time frame rates, and to study the effects of data layout, choice of parallel mode of execution, and machine size on the execution time of this algorithm. A distinguishing feature of this application study is that the portion of each image frame that is relevant changes from one frame to the next based on the camera motion. This impacts the effect of the chosen data layout on the needed inter-processor data transfers and the way in which work is distributed among the processors. Experiments were performed to determine for which image sizes and number of processors which data layout would perform better. The parallel computers used in this study are the MasPar MP-1, Intel Paragon, and PASM. Different modes are examined and it is determined that mixed mode is faster than SIMD or MIMD implementations.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"366 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132984445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hindsight helps: deterministic task scheduling with backtracking","authors":"Yueh-O Wang, N. Amato, D. Friesen","doi":"10.1109/ICPP.1997.622582","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622582","url":null,"abstract":"This paper considers the problem of scheduling a set of precedence-related tasks on a nonpreemptive homogeneous message-passing multiprocessor system in order to minimize the makespan, that is, the completion time of the last task relative to the start time of the first task. We propose a family of scheduling algorithms, called IPR for immediate predecessor rescheduling, which utilize one level of backtracking. We also develop a unifying framework to facilitate the comparison between our results and the various models and algorithms that have been previously studied. We show, both theoretically and experimentally, that the IPR algorithms outperform previous algorithms in terms of both time complexity and the makespans of the resulting schedules. Moreover, our simulation results indicate that the relative advantage of the IPR algorithms increases as the communication constraint is relaxed.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130856430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive load-balancing algorithms using symmetric broadcast networks: performance study on an IBM SP2","authors":"Sajal K. Das, Daniel J. Harvey, R. Biswas","doi":"10.1109/ICPP.1997.622667","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622667","url":null,"abstract":"In a distributed-computing environment, it is important to ensure that the processor work loads are adequately balanced. Among numerous load-balancing algorithms, a unique approach due to Das and Prasad defines a symmetric broadcast network (SBN) that provides a robust communication pattern among the processors in a topology-independent manner. In this paper, we propose and analyze three SBN-based load-balancing algorithms, and implement them on an SP2. A thorough experimental study with Poisson-distributed synthetic loads demonstrates that these algorithms are very effective in balancing system load while minimizing processor idle time. They also compare favorably with several existing techniques.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123257805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Network performance under physical constraints","authors":"F. Petrini, M. Vanneschi","doi":"10.1109/ICPP.1997.622550","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622550","url":null,"abstract":"The performance of an interconnection network in a massively parallel architecture is subject to physical constraints whose impact needs to be re-evaluated from time to time. Fat-trees and low-dimensional cubes have raised great interest in the scientific community in the last few years and are emerging standards in the design of interconnection networks for massively parallel computers. In this paper we compare the communication performance of these two classes of interconnection networks using a detailed simulation model. The comparison is made using a set of synthetic benchmarks, taking into account physical constraints, such as pin and bandwidth limitations, and the router complexity. In our experiments we consider two networks with 256 nodes, a 16-ary 2-cube and a 4-ary 4-tree.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"25 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116255756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}