{"title":"Protocol-dependent message-passing performance on Linux clusters","authors":"D. Turner, Xuehua Chen","doi":"10.1109/CLUSTR.2002.1137746","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137746","url":null,"abstract":"In a Linux cluster, as in any multiprocessor system, the inter-processor communication rate is the major limiting factor to its general usefulness. This research is geared toward improving the communication performance by identifying where the inefficiencies lie and trying to understand their cause. The NetPIPE utility is being used to compare the latency and throughput of all current message-passing libraries and the native software layers they run upon for a variety of hardware configurations.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"14 1","pages":"187-194"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85481978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrated admission and congestion control for QoS support in clusters","authors":"K. H. Yum, Eun Jung Kim, C. Das, Mazin S. Yousif, J. Duato","doi":"10.1109/CLUSTR.2002.1137761","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137761","url":null,"abstract":"Admission and congestion control mechanisms are integral parts of any Quality of Service (QoS) design for networks that support integrated traffic. In this paper we propose an admission control algorithm and a congestion control algorithm for clusters, which are increasingly being used in a diverse set of applications that require QoS guarantees. The uniqueness of our approach is that we develop these algorithms for wormhole-switched networks. We use QoS-capable wormhole routers and QoS-capable network interface cards (NICs), referred to as Host Channel Adapters (HCAs) in InfiniBand™ Architecture (IBA), to evaluate the effectiveness of these algorithms. The admission control is applied at the HCAs and the routers, while the congestion control is deployed only at the HCAs. Simulation results indicate that the admission and congestion control algorithms are quite effective in delivering the assured performance. The proposed credit-based congestion control algorithm is simple and practical in that it relies on hardware already available in the HCA to regulate traffic injection.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"40 1","pages":"325-332"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85309481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Research directions in parallel I/O for clusters","authors":"W. Ligon","doi":"10.1109/CLUSTR.2002.1137777","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137777","url":null,"abstract":"Parallel I/O remains a critical problem for cluster computing. A significant number of important applications need high performance parallel I/O, and most cluster systems provide enough hardware to deliver the required performance. System software for achieving the desired goals remains in the research and development stage. A number of parallel file systems have achieved remarkable goals in one or more of several key areas related to parallel I/O, but there is still great reluctance to commit to any file system currently available. This is mostly due to the fact that these file systems do not address enough issues at once in a package that is robust enough for widespread use. Critical goals in the development of an operational parallel file system for clusters include: high performance with scalability; reliability/fault tolerance; flexible and efficient integration with parallel codes; portability. These issues give rise to problems with interfaces and semantics, in addition to specific technical problems such as distributed locking, caching, and redundancy. The next generation of parallel file systems must look beyond traditional interfaces, semantics, and implementation methods in order to achieve the desired goals. Of equal importance is the issue of knowing to what extent a given file system achieves these goals. Given that no file system is likely to address all of these goals equally well, it is important to be able to measure a given file system's utility in these areas through benchmarking or other evaluation methods. We explore a few of these issues and include specific examples and a case study of the PVFS V2 team's approach to these issues.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"16 1","pages":"436-"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81702709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MPI in 2002: has it been ten years already?","authors":"E. Lusk","doi":"10.1109/CLUSTR.2002.1137776","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137776","url":null,"abstract":"Summary form only given. In April of 1992, a group of parallel computing vendors, computer science researchers, and application scientists met at a one-day workshop and agreed to cooperate on the development of a community standard for the message-passing model of parallel computing. The MPI Forum that eventually emerged from that workshop became a model of how a broad community could work together to improve an important component of the high performance computing environment. The Message Passing Interface (MPI) definition that resulted from this effort has been widely adopted and implemented, and is now virtually synonymous with the message-passing model itself. MPI not only standardized existing practice in the service of making applications portable in the rapidly changing world of parallel computing, but also consolidated research advances into novel features that extended existing practice and have proven useful in developing a new generation of applications. This talk will discuss some of the procedures and approaches of the MPI Forum that led to MPI's early adoption, and then describe some of the features that have led to its persistence as a reference model for parallel computing. Although clusters were only just emerging as a significant parallel computing production platform as MPI was being defined, MPI has proven to be a useful way of programming them for high performance, and we will discuss the current situation in MPI implementations for clusters. MPI was deliberately designed to grant considerable flexibility to implementors, and thus provides a useful framework for implementation research. Successful implementation techniques within the MPI standard can be utilized immediately by applications already using MPI, thus providing an unusually fast path from research results to their application. At Argonne National Laboratory we have been developing and distributing MPICH, a portable, high performance implementation of MPI, from the very beginning of the MPI effort. We will describe MPICH-2, a completely new version of MPICH just being released. We will present some of its novel design features that we hope will stimulate both further research and a new generation of complete MPI-2 implementations, along with some early performance results. We will conclude with a speculative look at the future of MPI, including its role in other programming approaches, fault tolerance, and its applicability to advanced architectures.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"129 1","pages":"435-"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89640924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Bladed Beowulf: a cost-effective alternative to traditional Beowulfs","authors":"Wu-chun Feng, Michael S. Warren, E. Weigle","doi":"10.1109/CLUSTR.2002.1137753","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137753","url":null,"abstract":"We present a new twist to the Beowulf cluster - the Bladed Beowulf. In contrast to traditional Beowulfs, which typically use Intel or AMD processors, our Bladed Beowulf uses Transmeta processors in order to keep thermal power dissipation low and reliability and density high while still achieving comparable performance to Intel- and AMD-based clusters. Given the ever-increasing complexity of traditional supercomputers and Beowulf clusters, the issues of size, reliability, power consumption, and ease of administration and use will be \"the\" issues of this decade for high-performance computing. Bigger and faster machines are simply not good enough anymore. To illustrate, we present the results of performance benchmarks on our Bladed Beowulf and introduce two performance metrics that contribute to the total cost of ownership (TCO) of a computing system - performance/power and performance/space.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"60 1","pages":"245-254"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75702160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A data parallel programming model based on distributed objects","authors":"R. Diaconescu, R. Conradi","doi":"10.1109/CLUSTR.2002.1137782","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137782","url":null,"abstract":"This paper proposes a data parallel programming model suitable for loosely synchronous, irregular applications. At the core of the model are distributed objects that express non-trivial data parallelism. Sequential objects express independent computations. The goal is to use objects to fold synchronization into data accesses and thus, free the user from concurrency aspects. Distributed objects encapsulate large data partitioned across multiple address spaces. The system classifies accesses to distributed objects as read and write. Furthermore, it uses the access patterns to maintain information about dependences across partitions. The system guarantees inter-object consistency using a relaxed update scheme. Typical access patterns uncover dependences for data on the border between partitions. Experimental results show that this approach is highly usable and efficient.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"142 1","pages":"455-460"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77375216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZENTURIO: an experiment management system for cluster and Grid computing","authors":"R. Prodan, T. Fahringer","doi":"10.1109/CLUSTR.2002.1137723","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137723","url":null,"abstract":"The need to conduct and manage large sets of experiments for scientific applications dramatically increased over the last decade. However, there is still very little tool support for this complex and tedious process. We introduce the ZENTURIO experiment management system for parameter studies, performance analysis, and software testing for cluster and Grid architectures. ZENTURIO uses the ZEN directive-based language to specify arbitrarily complex program executions. ZENTURIO is designed as a collection of Grid services that comprise: (1) a registry service which supports registering and locating Grid services; (2) an experiment generator that parses files with ZEN directives and instruments applications for performance analysis and parameter studies; (3) an experiment executor that compiles and controls the execution of experiments on the target machine. A graphical user portal allows the user to control and monitor the experiments and to automatically visualise performance and output data across multiple experiments. ZENTURIO has been implemented based on Java/Jini distributed technology. It supports experiment management on cluster architectures via PBS and on Grid infrastructures through GRAM. We report results of using ZENTURIO for performance analysis of an ocean simulation application and a parameter study of a computational finance code.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"39 1","pages":"9-18"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81161915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"I/O analysis and optimization for an AMR cosmology application","authors":"Jianwei Li, W. Liao, A. Choudhary, V. Taylor","doi":"10.1109/CLUSTR.2002.1137736","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137736","url":null,"abstract":"In this paper we investigate the data access patterns and file I/O behaviors of a production cosmology application that uses the adaptive mesh refinement (AMR) technique for its domain decomposition. This application was originally developed using the Hierarchical Data Format (HDF version 4) I/O library, and since HDF4 does not provide parallel I/O facilities, the global file I/O operations were carried out by one of the allocated processors. When the number of processors becomes large, the I/O performance of this design degrades significantly due to the high communication cost and sequential file access. In this work, we present two additional I/O implementations, using MPI-IO and parallel HDF version 5, and analyze their impact on the I/O performance for this typical AMR application. Based on the I/O patterns discovered in this application, we also discuss the interaction between user level parallel I/O operations and different parallel file systems and point out the advantages and disadvantages. The performance results presented in this work are obtained from an SGI Origin2000 using XFS, an IBM SP using GPFS, and a Linux cluster using PVFS.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"325 1","pages":"119-126"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82922178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COMB: a portable benchmark suite for assessing MPI overlap","authors":"W. Lawry, Christopher Wilson, A. Maccabe, R. Brightwell","doi":"10.1109/CLUSTR.2002.1137785","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137785","url":null,"abstract":"This paper describes a portable benchmark suite that assesses the ability of cluster networking hardware and software to overlap MPI communication and computation. The Communication Offload MPI-based Benchmark, or COMB, uses two methods to characterize the ability of messages to make progress concurrently with computational processing on the host processor(s). COMB measures the relationship between MPI communication bandwidth and host CPU availability.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"60 1","pages":"472-475"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82343729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SilkRoad II: a multi-paradigm runtime system for cluster computing","authors":"Liang Peng, W. Wong, C. Yuen","doi":"10.1109/CLUSTR.2002.1137779","DOIUrl":"https://doi.org/10.1109/CLUSTR.2002.1137779","url":null,"abstract":"A parallel programming paradigm dictates the way in which an application is to be expressed. It also restricts the algorithms that may be used in the application. Unfortunately, runtime systems for parallel computing often impose a particular programming paradigm. For a wider choice of algorithms, it is desirable to support more than one paradigm. In this paper we consider SilkRoad II, a variant of the Cilk runtime system for cluster computing. What is unique about SilkRoad II is its memory model, which supports multiple paradigms with the underlying software distributed shared memory. The RC-dag memory consistency model of SilkRoad II is introduced. Our experimental results show that the stronger RC-dag model can achieve performance comparable to the LC model of Cilk while supporting a larger set of paradigms with good performance.","PeriodicalId":92128,"journal":{"name":"Proceedings. IEEE International Conference on Cluster Computing","volume":"19 1","pages":"443-444"},"PeriodicalIF":0.0,"publicationDate":"2002-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81483437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}