{"title":"Automatic partitioning of data and computations on scalable shared memory multiprocessors","authors":"S. Tandri, T. Abdelrahman","doi":"10.1109/ICPP.1997.622557","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622557","url":null,"abstract":"This paper describes an algorithm for deriving data and computation partitions on scalable shared memory multiprocessors. The algorithm establishes affinity relationships between where computations are performed and where data is located based on array accesses in the program. The algorithm then uses these affinity relationships to determine both static and dynamic partitions for arrays and parallel loops. Experimental results from a prototype implementation of the algorithm demonstrate that it is computationally efficient and that it improves the parallel performance of standard benchmarks. The results also show the necessity of taking shared memory effects (memory contention, cache locality, false-sharing and synchronization) into account: partitions derived to minimize only interprocessor communications do not necessarily result in the best performance.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127735967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Good processor management=fast allocation+efficient scheduling","authors":"B. S. Yoo, C. Das","doi":"10.1109/ICPP.1997.622656","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622656","url":null,"abstract":"Fast and efficient processor allocation and job scheduling algorithms are essential components of a multi-user multicomputer operating system. In this paper we propose two novel processor management schemes which meet such demands for mesh-connected multicomputers. A stack-based allocation algorithm that can locate a free sub-mesh for a job very quickly using simple coordinate calculation and spatial subtraction is proposed. Simulation results show that the stack-based allocation algorithm outperforms all the existing allocation policies in terms of allocation overhead while delivering competitive performance. Another technique, called group scheduling, schedules jobs in such a way that the jobs belonging to the same group do not block each other. The groups are scheduled in an FCFS order to prevent starvation. This simple but efficient scheduling policy reduces the response time significantly by minimizing the queueing delay for the jobs in the same group. These two schemes, when used together, can provide faster service to users with very little overhead.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134195668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Software-based deadlock recovery technique for true fully adaptive routing in wormhole networks","authors":"Juan-Miguel Martínez, P. López, J. Duato, T. Pinkston","doi":"10.1109/ICPP.1997.622586","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622586","url":null,"abstract":"In this paper, we take a different approach to handle deadlocks and performance degradation. We propose the use of an injection limitation mechanism that prevents performance degradation near the saturation point and reduces the probability of deadlock to negligible values even when fully adaptive routing is used. We also propose an improved deadlock detection mechanism that only uses local information, detects all the deadlocks, and considerably reduces the probability of false deadlock detection over previous proposals. In the rare case when impending deadlock is detected, our proposed recovery technique absorbs the deadlocked message at the current node and later re-injects it for continued routing towards its destination. Performance evaluation results show that our new approach to deadlock handling is more efficient than previously proposed techniques.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128619700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient processor allocation scheme for multi dimensional interconnection networks","authors":"Hyunseung Choo, H. Youn, G. Park, B. Shirazi","doi":"10.1109/ICPP.1997.622570","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622570","url":null,"abstract":"The task scheduling policy and the processor allocation scheme affect the system performance significantly. In this paper, we propose an efficient processor allocation scheme for a 3D mesh interconnection network with a simple FIFO scheduling policy. Complexity analysis shows that the allocation and deallocation of the scheme are O(LWH^2) and O(LH), respectively, which are better than earlier schemes. Comprehensive computer simulation shows that the average allocation time of the proposed scheme is improved by up to about 85% compared to the best earlier 3D approach.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127008104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An adaptive sequential prefetching scheme in shared-memory multiprocessors","authors":"Myoung Kwon Tcheun, H. Yoon, S. Maeng","doi":"10.1109/ICPP.1997.622660","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622660","url":null,"abstract":"The sequential prefetching scheme is a simple hardware controlled scheme, which exploits the sequentiality of memory accesses to predict which blocks will be read in the near future. We analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching on shared-memory multiprocessors. Also, we propose a simple hardware scheme which selects the prefetching degree on each miss by adding a small table (PDS: Prefetching Degree Selector) to the sequential prefetching scheme. This scheme could prefetch consecutive blocks aggressively for applications with high sequentiality and conservatively for applications with low sequentiality.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127831603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A global computing environment for networked resources","authors":"H. Topcuoglu, S. Hariri","doi":"10.1109/ICPP.1997.622686","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622686","url":null,"abstract":"Current advances in high-speed networks and WWW technologies have made network computing a cost-effective, high-performance computing alternative. New software tools are being developed to utilize efficiently the network computing environment. Our project, called Virtual Distributed Computing Environment (VDCE), is a high-performance computing environment that allows users to write and evaluate networked applications for different hardware and software configurations using a web interface. In this paper we present the software architecture of VDCE by emphasizing application development and specification, scheduling, and execution/runtime aspects.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125086914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Load balancing and work load minimization of overlapping parallel tasks","authors":"V. Krishnaswamy, Gagan Hasteer, P. Banerjee","doi":"10.1109/ICPP.1997.622655","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622655","url":null,"abstract":"In this paper, we propose a unique problem in the assignment of overlapping tasks to processors on a parallel machine, with the twin objectives of minimizing workloads while maintaining good load balance. This problem arises in some applications in VLSI CAD, e.g. parallel compiled VHDL simulation. We assume that the parallel application can be decomposed into a set of tasks, each in turn comprising a finite number of subtasks. Overlapped computations arise as a result of replication of subtasks across tasks in order to reduce the amount of communication performed in fine grained parallel applications. The uniqueness of the problem stems from the fact that overlapping computation on tasks assigned to the same processor is only performed once. Theoretical results on NP-hardness and bounds on the utilization of overlap are provided. A heuristic solution is also proposed. An important application area in VLSI-CAD, parallel compiled event driven VHDL simulation, is introduced. Results of the application of our heuristics to this problem are reported on a SUN Sparcserver 1000 multiprocessor.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"39 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116538230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel synchronization of continuous time discrete event simulators","authors":"Peter Frey, H. Carter, P. Wilsey","doi":"10.1109/ICPP.1997.622649","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622649","url":null,"abstract":"Mixed-Mode simulation has been generating considerable interest in the simulation community and has continued to grow as an active research area. Traditional mixed-mode simulation involves the merging of digital and analog simulators in various ways. However, efficient methods for the synchronization between the two time domains remain elusive. This is due to the fact that the analog simulator uses dynamic time step control whereas the digital simulator uses the event driven paradigm. This paper proposes two new synchronization methods and presents their capabilities using a component-based continuous time simulator integrated with an optimistic parallel discrete event simulator. The results of the performance evaluation lead us to believe that while both synchronization methods are functionally viable, one has superior performance.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131235691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of fault tolerance for parallel applications in networked environments","authors":"Pierre Sens","doi":"10.1109/ICPP.1997.622663","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622663","url":null,"abstract":"This paper presents the performance evaluation of a software fault manager for distributed applications. Dubbed STAR, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. STAR is application independent, highly configurable and easily portable to UNIX-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can efficiently be implemented in a standard networked environment.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"1823 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129752660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic generation of injective modular mappings","authors":"Hyuk-Jae Lee, J. Fortes","doi":"10.1109/ICPP.1997.622675","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622675","url":null,"abstract":"Many optimizations (of programs with loops) used in parallelizing compilers and systolic array design are based on linear transformations of loop iteration spaces. Additional important optimizations and designs are possible by using recently proposed modular mappings, which are described by linear transformations modulo a constant vector. Previous work on modular mappings focused on conditions that guarantee injectivity of a modular mapping for algorithms with rectangular index sets. This paper generalizes previous work by providing new injectivity conditions that cover the cases when the program index set has arbitrary shape and size, and the target processor array and the mapping moduli are of arbitrary size. A systematic technique to efficiently generate modular mappings is also proposed. The complexity of the proposed generation technique is O(n^2 n!) for a nested loop of depth n with a rectangular index set and a target processor array with as many processors as required. A bounded search scheme is also provided for general cases. Each trial is formulated as an integer linear programming problem with at most 3n variables.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133396901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}