{"title":"Trace-driven analysis of migration-based gang scheduling policies for parallel computers","authors":"Sanjeev Setia","doi":"10.1109/ICPP.1997.622685","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622685","url":null,"abstract":"Gang scheduling is a job scheduling policy for parallel computers that combines elements of space-sharing and time-sharing. In this paper we analyze the performance of gang scheduling policies that allow the remapping of an executing job to a new set of processors. Most previously proposed gang-scheduling policies do not allow such job remapping under the assumption that it is prohibitively expensive. Through a detailed trace-driven simulation, we analyze the tradeoff between the benefits and overheads of such job relocation. Our results show that gang-scheduling policies that support such job relocation offer significant performance gains over policies that do not use remapping.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123089597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and analysis of fault-tolerant star networks","authors":"C. Liang, S. Bhattacharya, Jack Tan","doi":"10.1109/ICPP.1997.622665","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622665","url":null,"abstract":"The star graph has been proposed as an attractive alternative to the hypercube offering a lower degree, a smaller diameter, and a smaller distance for a similar number of nodes. In this paper, we investigate the fault-tolerant design for the Star interconnection network using modules called fault tolerant building blocks (FTBBs). Each FTBB module contains several primary nodes and a few spare nodes. The spare nodes within each FTBB can replace the primary nodes when a failure occurs. If each spare node within an FTBB can replace any primary node, we call the situation full spare utilization. We propose fault tolerant Star networks constructed from smaller FTBBs with full spare utilization and a fault-tolerant routing scheme to reconfigure the system when a failure occurs.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121390474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-time multicast in wireless communication","authors":"Sandeepan Sanyal, L. Nahar, S. Bhattacharya","doi":"10.1109/ICPP.1997.622687","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622687","url":null,"abstract":"This paper presents the reconfiguration of a multicast tree at the instant of node migration in cellular wireless networks. We consider a novel, and highly practical, formulation of the real-time multicast problem. Unlike the traditional notion of source to leaf node message transmission time being the measure for real-time, we consider the multicast tree re-construction time (in the event of a node migration) as the measure for real-time constraints. The overall goal is to minimize both delays: i) the delay of re-constructing the multicast tree and ii) the source to destination transmission delay. In this paper we introduce and provide solutions for the first objective, while the current research in real-time multicast trees considers the latter objective only. We propose three heuristics to solve the real-time multicast tree re-construction problem and present analysis and simulation results for different factors that contribute differently to the reconstruction delay.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114295272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of dynamic task distributions on the performance of a class of irregular computations","authors":"Hemal V. Shah, J. Fortes","doi":"10.1109/ICPP.1997.622651","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622651","url":null,"abstract":"In this paper, a modified version of a previously proposed quasi-barrier technique is developed. On distributed memory machines, relaxation with modified quasi-barriers can be used to perform basis computations that arise in symbolic polynomial manipulation. In this type of synchronous computation, the set of tasks is distributed across the processors. Each nonzero result of a task reduction dynamically generates a set of new tasks. The distribution of these newly generated tasks can have a significant impact on the overall execution time of the parallel computation. In this paper, four task distribution strategies, named modified block, modified sorted block, modified cyclic, and modified sorted cyclic, are developed and their performances are comparatively evaluated. For the experiments performed on an 18-node IBM SP2, the modified cyclic distribution provides the best performance overall.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127597431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for parallel tree-based scientific simulations","authors":"Pangfeng Liu, Jan-Jan Wu","doi":"10.1109/ICPP.1997.622577","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622577","url":null,"abstract":"This paper describes an implementation of a platform-independent parallel C++ N-body framework that can support various scientific simulations that involve tree structures, such as astrophysics, semiconductor device simulation, molecular dynamics, plasma physics, and fluid mechanics. Within the framework the users will be able to concentrate on the computation kernels that differentiate different N-body problems, and let the framework take care of the tedious and error-prone details that are common among N-body applications. This framework was developed based on the techniques we learned from previous CM-5 C implementations, which have been rigorously justified both experimentally and mathematically. This gives us confidence that our framework will allow fast prototyping of different N-body applications, to run on different parallel platforms, and to deliver good performance as well.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134556439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance and implementation aspects of higher order head-of-line blocking switch boxes","authors":"M. Jurczyk","doi":"10.1109/ICPP.1997.622555","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622555","url":null,"abstract":"Nonuniform traffic can degrade the overall performance of multistage interconnection networks substantially. This performance degradation was traced back to higher order head-of-line blocking (higher order HOL-blocking) effects within the network in the literature. This paper further elaborates on higher order HOL-blocking networks, on their performance under nonuniform traffic patterns, and on methods on how to efficiently implement switch boxes to construct higher order HOL-blocking networks. An analytical upper bound of the achievable network bandwidth under nonuniform traffic patterns is derived and compared to simulation results. Furthermore, it is discussed how central memory buffered switch boxes can be efficiently changed into higher order HOL-blocking switch boxes through only minor changes in the switch box control path. With those switch boxes, high network performance under nonuniform traffic patterns can be achieved with regular hardware effort.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115101062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining loop fusion with prefetching on shared-memory multiprocessors","authors":"N. Manjikian","doi":"10.1109/ICPP.1997.622560","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622560","url":null,"abstract":"The performance of programs consisting of parallel loops on shared-memory multiprocessors is limited by long memory latencies as processor speeds increase more rapidly than memory speeds. Two complementary techniques for addressing memory latency and improving performance are: (a) cache locality enhancement for latency reduction and (b) data prefetching for latency tolerance. This paper studies the benefit of combining loop fusion for locality enhancement with prefetching. Experimental results are reported for multiprocessors with support for prefetching. For a complete application on an SGI Power Challenge R10000, combining loop fusion with prefetching improves parallel speedup by 46%.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114467667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The implementation of low latency communication primitives in the SNOW prototype","authors":"K. Ghose, Seth Melnick, T. Gaska, Seth Goldberg, A. Jayendran, Brian T. Stein","doi":"10.1109/ICPP.1997.622681","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622681","url":null,"abstract":"This paper describes the implementation of a low latency protected message passing facility and a low latency barrier synchronization mechanism for an experimental, tightly-coupled network of workstations called SNOW. SNOW uses multiprocessing SPARC 20s, running Solaris 2.4, as computing nodes, and uses semi-custom network interface cards (NICs) that connect these nodes in a 212 Mbits/sec. unidirectional ring. The NICs include field-programmable gate array logic devices that allow for experimentation with the nature and level of hardware support for tight coupling. The one way protected message passing latency on the SNOW prototype for a 64-byte message is about 9 μs, comparable to latencies of low-end to medium range multiprocessors.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117079302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the multiple fault diagnosis of multistage interconnection networks: the lower bound and the CMOS fault model","authors":"Yinan N. Shen, Xiao-Tao Chen, S. Horiguchi, F. Lombardi","doi":"10.1109/ICPP.1997.622666","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622666","url":null,"abstract":"This paper presents new results for diagnosing (detection and location) multistage interconnection networks (MINs) in the presence of multiple faults. Initially, it is proved that the lower bound in the number of tests for multiple fault diagnosis (independent of the assumed fault model for the MIN) is 2 × log₂ N, where N is the number of inputs/outputs of the network. A new fault model is introduced; this fault model is applicable to interconnection networks implemented using CMOS technology. The characterization for diagnosing stuck-open faults is presented.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125798969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiscalar execution along a single flow of control","authors":"K. Sundararaman, M. Franklin","doi":"10.1109/ICPP.1997.622568","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622568","url":null,"abstract":"The multiscalar processing model extracts instruction level parallelism from ordinary programs by splitting the program into smaller, possibly dependent, tasks, and parallelly executing multiple tasks using multiple execution units. Past work had advocated pursuing multiple flows of control in the multiscalar processor. We first illustrate the problems involved in pursuing multiple flows of control. We then discuss a methodology to obtain good performance from multiple tasks extracted from a single line of control. We also present the results of simulation studies that verify the potential of this method. These results, obtained with a set of SPECS92 benchmarks, show better issue rates when a single line of control is pursued in the multiscalar processor. The primary reason for this improvement is the ability to have better load balancing among the execution units.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"168 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116002218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}