{"title":"How to improve local load balancing policies by distorting load information","authors":"F. Zambonelli","doi":"10.1109/HIPC.1998.738004","DOIUrl":"https://doi.org/10.1109/HIPC.1998.738004","url":null,"abstract":"The paper focuses on local load balancing policies for massively parallel architectures and introduces a new scheme for load information exchange between neighbor nodes. The idea is to distort the exchanged load information to let the policy take into account a more global view of the system and overcome the limits of the local scope. The presented scheme has been integrated into two variants of a direct-neighbor policy and evaluated as a function of the characteristics of the system load. Experimental results show that the transmission of distorted load information provides high efficiency unless the dynamicity of the load becomes too high, in which case it is preferable to exploit non-distorted load information.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125040873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design alternatives for shared memory multiprocessors","authors":"J. Carter, Chen-Chi Kuo, R. Kuramkote, M. Swanson","doi":"10.1109/HIPC.1998.737969","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737969","url":null,"abstract":"We consider the design alternatives available for building the next generation DSM machine (e.g., the choice of memory architecture, network technology, and amount and location of per-node remote data cache). To investigate this design space, we have simulated five applications on a wide variety of possible DSM architectures that employ significantly different caching techniques. We also examine the impact of using a special purpose system interconnect designed specifically to support low latency DSM operation versus using a powerful off the shelf system interconnect. We found that two architectures have the best combination of good average performance and reasonable worst case performance: CC-NUMA employing a moderate sized DRAM remote access cache (RAC) and a hybrid CC-NUMA/S-COMA architecture called AS-COMA or adaptive S-COMA. Both pure CC-NUMA and pure S-COMA have serious performance problems for some applications, while CC-NUMA employing an SRAM RAC does not perform as well as the two architectures that employ larger DRAM caches. The paper concludes with several recommendations to designers of next generation DSM machines, complete with a discussion of the issues that led to each recommendation so that designers can decide which ones are relevant to them given changes in technology and corporate priorities.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129639552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Testing concurrency and communication in distributed objects","authors":"Adnan Bader, A. Sajeev, S. Ramakrishnan","doi":"10.1109/HIPC.1998.738017","DOIUrl":"https://doi.org/10.1109/HIPC.1998.738017","url":null,"abstract":"Concurrency and communication are two of the key features of distributed systems. These features can make systematic testing of distributed systems a complex task. A major problem is the explosion of the test space because of the potential for arbitrary interference of concurrent threads. This paper describes an approach for systematic testing of such systems in an object-oriented context. We use statecharts for system specification, and model the system behaviour as event-sequences. A test case, therefore, is primarily an event-sequence with concurrent threads represented as interleaving events. Communication-states with associated events represent communication between objects. The test-space explosion is controlled by an extension to Chow's (1978) algorithm for generating test sequences for finite state machines. The number of test sequences we require is O(n^2), where n is the sum of all events in all concurrent statecharts.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129081041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Execution characteristics of object oriented programs on the UltraSPARC-II","authors":"R. Radhakrishnan, L. John","doi":"10.1109/HIPC.1998.737990","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737990","url":null,"abstract":"It is widely accepted that object-oriented design improves code reusability, facilitates code maintainability and enables higher levels of abstraction. Although software developers and the software engineering community have embraced object-oriented programming for these benefits, there have been wide concerns about the performance overhead associated with this programming paradigm on modern processors. We characterize the performance of several C and C++ benchmarks on an UltraSPARC-II processor. Various architectural data related to execution behavior of the benchmarks are collected using on-chip performance monitoring counters. Factors including CPI, instruction and data cache misses, processor stalls due to instruction cache misses and branch misprediction, from real execution of several programs are measured and presented. While previous research evaluates the behavioral differences between C and C++ programs based on profiling and simulation, we measure execution behavior. Results show that the programs in the C++ suite incur a higher CPI, higher i-cache misses, and higher branch mispredictions than the programs in the C suite. A strong correlation was observed between CPI and branch mispredictions for the C++ application programs.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131268396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A simple mechanism to deal with sequential code in dataflow architectures","authors":"M. A. Cavenaghi, G. Travieso, Á. G. Neto","doi":"10.1109/HIPC.1998.737988","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737988","url":null,"abstract":"The aim of this work is to propose a simple and efficient mechanism to deal with the problem of executing sequential code in a pure dataflow machine. Our results are obtained with a simulator of the Wolf architecture. The implemented mechanism improved the architecture's performance when executing sequential code, and we expect that this improvement could be greater if we use some heuristics to deal with some special groups of instructions such as branch operations. Further research will show us if this is true.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129497520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Java data parallel extensions with runtime system support","authors":"Yuhong Wen, Bryan Carpenter, Geoffrey C. Fox, Guansong Zhang","doi":"10.1109/HIPC.1998.737978","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737978","url":null,"abstract":"In order to provide Java with the ability to support scientific parallel computing, we introduce a data parallel extension to the Java language with runtime system support. We provide the distributed array extension to Java, and discuss the related operations and control over the new distributed array. Communication involving distributed arrays is handled through a standard collective communication library. We consider programming in a Single Program Multiple Data (SPMD) model.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116097236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global reactive congestion control in multicomputer networks","authors":"Abdel-Halim Smai, L. Thorelli","doi":"10.1109/HIPC.1998.737987","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737987","url":null,"abstract":"In this paper we develop a general approach to global reactive congestion control in multicomputer networks. The approach uses a timeout mechanism to detect congestion, and exploits control lines such as those used for handshaking in the flit-level flow control of wormhole routers to distribute information about congestion. It is also based on a mechanism that limits the demands placed by the network interface and the processing element. The approach is described in detail and evaluated through simulation experiments. We show that the proposed congestion control can provide network stability and predictable network performance. By choosing the right timeout, we can provide bounds on average delay and worst-case delay. Furthermore, with appropriate timeouts the network can be kept out of saturation. Other attributes of the approach include fairness and applicability to a wide range of network architectures.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123329935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On-line diagnosibility of baseline interconnection network","authors":"Sipra Das, A. Chaudhuri","doi":"10.1109/HIPC.1998.737995","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737995","url":null,"abstract":"This paper presents an on-line approach for the diagnosis of baseline interconnection networks. An exhaustive fault model with a multiple fault assumption is used in the analysis. The dual function switching element is considered to have two valid states corresponding to the straight connection mode and exchanged connection mode. Because of the inherent buddy property of the baseline network, for some particular distribution of faults, the algorithm identifies a group obviously including the faulty ones. Some of the results which are already proved in earlier works are mentioned as the proposed algorithm is based on these.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124845252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modulo-variable expansion sensitive scheduling","authors":"M. Valluri, R. Govindarajan","doi":"10.1109/HIPC.1998.738006","DOIUrl":"https://doi.org/10.1109/HIPC.1998.738006","url":null,"abstract":"Modulo scheduling is an aggressive scheduling technique for loops that exploits instruction-level parallelism by overlapping successive iterations of the loop. Due to the nature of modulo scheduling, the lifetime of a variable can overlap with a subsequent definition of itself. To handle such overlapping lifetimes, modulo-variable expansion (MVE) is used, wherein the constructed schedule is unrolled a number of times. We propose a technique to improve the constructed schedule while performing MVE. In our approach, we unroll the data dependence graph of the original loop and re-schedule it with an MVE-sensitive scheduler. Such an approach is expected to result in better initiation rates as compared to the traditional approach. We have implemented our approach and evaluated its performance on a large number of scientific benchmark kernels.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124879516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PERL-a registerless architecture","authors":"P. Suresh, R. Moona","doi":"10.1109/HIPC.1998.737968","DOIUrl":"https://doi.org/10.1109/HIPC.1998.737968","url":null,"abstract":"Reducing the processor-memory speed gap is one of the major challenges computer architects face today. Efficient use of CPU registers reduces the number of memory accesses. However, registers do incur the extra overhead of load/store, register allocation and saving of register context across procedure calls. Caches, however, do not have any such overheads, and cache technology has matured to the extent that today the access time of on-chip cache is almost equal to that of registers. This motivates one to explore alternate ways to do away with the overheads of registers. We propose a registerless, memory-to-memory architecture of a processor. We call this architecture the Performance Enhanced Registerless (PERL) processor. All instructions in this processor operate directly on memory operands, thus eliminating the load/store and other overheads of registers. The performance of this machine is studied by simulations and results are reported in the paper.","PeriodicalId":175528,"journal":{"name":"Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128330339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}