{"title":"Template based structured collections","authors":"J. Nolte, M. Sato, Y. Ishikawa","doi":"10.1109/IPDPS.2000.846025","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846025","url":null,"abstract":"Collective operations on distributed data sets foster a high-level data-parallel programming style that eases many aspects of parallel programming significantly. In this paper we describe how higher-order collective operations on distributed object sets can be introduced in a structured way by means of reusable topology classes and C++ templates.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115257214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Buffered coscheduling: a new methodology for multitasking parallel jobs on distributed systems","authors":"F. Petrini, Wu-chun Feng","doi":"10.1109/IPDPS.2000.846019","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846019","url":null,"abstract":"Buffered coscheduling is a scheduling methodology for time-sharing communicating processes in parallel and distributed systems. The methodology has two primary features: communication buffering and strobing. With communication buffering, communication generated by each processor is buffered and performed at the end of regular intervals to amortize communication and scheduling overhead. This infrastructure is then leveraged by a strobing mechanism to perform a total exchange of information at the end of each interval, thus providing global information to more efficiently schedule communicating processes. This paper describes how buffered coscheduling can optimize resource utilization by analyzing workloads with varying computational granularities, load imbalances, and communication patterns. The experimental results, performed using a detailed simulation model, show that buffered coscheduling is very effective on fast SANs such as Myrinet as well as slower switch-based LANs.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116418706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multi-tier RAID storage system with RAID1 and RAID5","authors":"Nitin Muppalaneni, K. Gopinath","doi":"10.1109/IPDPS.2000.846051","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846051","url":null,"abstract":"Redundant Arrays of Inexpensive Disks (RAID) is a popular technique used to improve the reliability and performance of secondary storage. Of various levels of RAID discussed, RAID1 and RAID5 have become more popular. Mirroring or RAID1 maintains multiple copies of the data, generally provides best performance and is easier to configure. Rotating parity scheme or RAID5 is the least expensive RAID scheme with good large update performance. It suffers from poor small update performance and performance drops sharply when a disk fails and the array enters degraded mode. Configuring RAID5 is more involved. This paper presents the design and implementation of a host-based driver for a multi-tier RAID storage system, currently with 2 tiers: a small RAID1 tier and a larger RAID5 tier. Based on access patterns, the driver automatically migrates frequently accessed data to RAID1 while demoting not so frequently accessed data to RAID5. The prototype provides reliable persistence semantics for data migration between the tiers using ordered updates. Mechanisms are separated from policies through an API so that any desired policy can be implemented in trusted user processes. Finally, we present a comparison of the performance of our system with comparable systems using striping and RAID5.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128677981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consensus based on failure detectors with a perpetual accuracy property","authors":"A. Mostéfaoui, M. Raynal","doi":"10.1109/IPDPS.2000.846029","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846029","url":null,"abstract":"This paper is on the Consensus problem, in the context of asynchronous distributed systems made of n processes, at most f of which may crash. A family of failure detector classes satisfying a Perpetual Accuracy property is first defined. This family includes the failure detector class S (the class of Strong failure detectors defined by Chandra and Toueg) central to the definition of a class (S_x) where x is the minimum number (x ≥ 1) of correct processes that can never be suspected to have crashed. Then, a protocol that solves the Consensus problem is given. This protocol works with any failure detector class (S_x) of this family. It is particularly simple and uses a Reliable Broadcast protocol as a skeleton. It requires n-x+1 communication steps, and its communication bit complexity is (n-x+1)(n-1)|ν| (where |ν| is the maximal size of an initial value a process can propose).","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126125791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using switch directories to speed up cache-to-cache transfers in CC-NUMA multiprocessors","authors":"R. Iyer, L. Bhuyan, Ashwini K. Nanda","doi":"10.1109/IPDPS.2000.846057","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846057","url":null,"abstract":"In this paper we propose a novel hardware caching technique, called switch directory, to reduce the communication latency in CC-NUMA multiprocessors. The main idea is to implement small fast directory caches in crossbar switches of the interconnect medium to capture and store ownership information as the data flows from the memory module to the requesting processor. Using the stored information, the switch directory re-routes subsequent requests to dirty blocks directly to the owner cache, thus reducing the latency for home node processing such as slow DRAM directory access and coherence controller occupancies. The design and implementation details of a DiRectory Embedded Switch ARchitecture, DRESAR, are presented. We explore the performance benefits of switch directories by modeling DRESAR in a detailed execution driven simulator. Our results show that the switch directories can improve performance by up to 60% reduction in home node cache-to-cache transfers for several scientific applications and commercial workloads.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116784101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ACDS: Adapting computational data streams for high performance","authors":"Carsten Isert, K. Schwan","doi":"10.1109/IPDPS.2000.846046","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846046","url":null,"abstract":"Data-intensive, interactive applications are an important class of metacomputing (Grid) applications. They are characterized by large, time-varying data flows between data providers and consumers. The topic of this paper is the runtime adaptation of data streams, in response to changes in resource availability and/or in end user requirements, with the goal of continually providing to consumers data at the levels of quality they require. Our approach is one that associates computational objects with data streams. Runtime adaptation is achieved by adjusting objects' actions on streams, by splitting and merging objects, and by migrating them (and the streams on which they operate) across machines and network links. Adaptive streams also react to changes in resource availability detected by online monitoring.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131902138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gang scheduling with memory considerations","authors":"Anat Batat, D. Feitelson","doi":"10.1109/IPDPS.2000.845971","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.845971","url":null,"abstract":"A major problem with time slicing on parallel machines is memory pressure, as the resulting paging activity damages the synchronism among a job's processes. An alternative is to impose admission controls, and only admit jobs that fit into the available memory. Despite suffering from delayed execution, this leads to better overall performance by preventing the harmful effects of paging and thrashing.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132028015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A provably optimal, distribution-independent parallel fast multipole method","authors":"F. E. Sevilgen, N. Futamura, S. Aluru","doi":"10.1109/IPDPS.2000.845967","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.845967","url":null,"abstract":"The Fast Multipole Method (FMM) is a robust technique for the rapid evaluation of the combined effect of pairwise interactions of n data sources. Parallel computation of the FMM is considered a challenging problem due to the dependence of the computation on the distribution of the data sources, usually resulting in dynamic data decomposition and load balancing problems. In this paper, we present the first provably efficient and distribution-independent parallel algorithm for the FMM on distributed memory parallel computers. Our algorithm does not require any dynamic data decomposition or load balancing step. We present our algorithm in terms of a few basic and well understood primitive operations such as sorting and parallel prefix.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115830527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying interposition techniques for performance analysis of OPENMP parallel applications","authors":"Marc González, Albert Serra, X. Martorell, J. Oliver, E. Ayguadé, Jesús Labarta, N. Navarro","doi":"10.1109/IPDPS.2000.845990","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.845990","url":null,"abstract":"Tuning parallel applications requires the use of effective tools for detecting performance bottlenecks. During a parallel program's execution, many individual situations of performance degradation may arise. We believe that an exhaustive and time-aware tracing at a fine-grain level is essential to capture these kinds of situations. This paper presents a tracing mechanism based on dynamic code interposition, and compares it with the usual compiler-directed code injection. Dynamic code interposition adds monitoring code at run-time to unmodified binaries and shared libraries, making it suitable for environments in which the compiler or the available tools do not offer instrumentation facilities. Static injection and dynamic interposition techniques are used to collect detailed traces that feed an analysis tool. Both environments meet the accuracy and performance goals required to profile and analyze parallel applications and runtime libraries.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130867326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-tolerant wormhole routing algorithms in meshes in the presence of concave faults","authors":"Seungjin Park, Jong-Hoon Youn, B. Bose","doi":"10.1109/IPDPS.2000.846045","DOIUrl":"https://doi.org/10.1109/IPDPS.2000.846045","url":null,"abstract":"A fault ring is a connection of only nonfaulty adjacent nodes and links such that the interior of the ring contains only faulty components. This paper proposes two wormhole routing algorithms that deal with more relaxed shapes of fault rings than previously known algorithms in mesh networks. As a result, the number of components to be disabled would be reduced considerably in some cases. The first algorithm, called F4, uses four virtual channels and allows all four sides of fault rings to contain concave shapes. The second algorithm, F3, permits up to three sides to contain concave shapes using only three virtual channels. Both F3 and F4 are free of deadlock and livelock and guarantee the delivery of messages between any pair of nonfaulty and connected nodes in the network.","PeriodicalId":206541,"journal":{"name":"Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123842243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}