{"title":"RFS: efficient and flexible remote file access for MPI-IO","authors":"Jonghyun Lee, R. Ross, R. Thakur, Xiaosong Ma, M. Winslett","doi":"10.1109/CLUSTR.2004.1392604","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392604","url":null,"abstract":"Scientific applications often need to access remote file systems. Because of slow networks and large data sizes, however, remote I/O can become an even more serious performance bottleneck than local I/O. In this work, we present RFS, a high-performance remote I/O facility for ROMIO, a well-known MPI-IO implementation. Our simple, portable, and flexible design eliminates the shortcomings of previous remote I/O efforts. In particular, RFS improves remote I/O performance by adopting active buffering with threads (ABT), which hides I/O cost by aggressively buffering the output data in available memory and performing background I/O with threads while computation is taking place. Our experimental results show that RFS with ABT can significantly reduce the visible cost of remote I/O, achieving up to 92% of the theoretical peak throughput. The computation slowdown caused by concurrent I/O activities was 0.2-6.2%, which is dwarfed by the overall improvement in application turnaround time.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124827989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Component-based cluster systems software architecture a case study","authors":"N. Desai, Rick Bradshaw, E. Lusk, R. Butler","doi":"10.1109/CLUSTR.2004.1392629","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392629","url":null,"abstract":"We describe the use of component architecture in an area to which this approach has not classically been applied: cluster systems software. By \"cluster system software,\" we mean the collection of programs used in configuring and maintaining individual nodes, together with the software involved in the submission, scheduling, monitoring, and termination of jobs. We describe how the component approach maps onto the cluster systems software problem, and report our experiences with the approach in implementing an all-new suite of systems software for a medium-sized cluster with unusually complex systems software requirements.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122766757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The design and implementation of an asynchronous communication mechanism for the MPI communication model","authors":"Motohiko Matsuda, T. Kudoh, Hiroshi Tazuka, Y. Ishikawa","doi":"10.1109/CLUSTR.2004.1392597","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392597","url":null,"abstract":"Many implementations of the MPI communication library are realized on top of the socket interface, which is based on connection-oriented stream communication. This work addresses a mismatch between the MPI communication model and the socket interface. To overcome this mismatch and implement an efficient MPI library for large-scale commodity-based clusters, a new communication mechanism, called O2G, is designed and implemented. O2G integrates the receive queue management of MPI into the TCP/IP protocol handler, without modifying the protocol stacks. Received data is extracted from the TCP receive buffer and copied into user space within the TCP/IP protocol handler invoked by interrupts. This completely avoids polling of sockets and reduces system call overhead, which becomes dominant in large-scale clusters. In addition, its immediate and asynchronous receive operation avoids message flow disruption due to a shortage of capacity in the receive buffer, and keeps bandwidth high. An evaluation using the NAS Parallel Benchmarks shows that O2G made an MPI implementation up to 30 percent faster than the original one. An evaluation of bandwidth also shows that O2G made an MPI implementation independent of the number of connections, whereas an implementation with sockets was greatly affected by the number of connections.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129810409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Los Alamos Crestone Project: cluster computing applications","authors":"R. Weaver, M. Gittings, L. Pritchett, C. Scovel","doi":"10.1109/CLUSTR.2004.1392661","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392661","url":null,"abstract":"Summary form only given. The Los Alamos Crestone Project is part of the Department of Energy's (DOE) Accelerated Strategic Computing Initiative, or ASCI Program. The main goal of this software development project is to investigate the use of continuous adaptive mesh refinement (CAMR) techniques for application to problems of interest to the Laboratory. There are many code development efforts in the Crestone Project, both unclassified and classified codes. An overview of the Crestone Project, and the SAGE and RAGE codes, has been published recently in Weaver and Gittings (2003). In this work, I give the status of the use of these CAMR codes on commodity cluster machines. One of the most economical methods for achieving supercomputing capability is to use commodity processors connected by commodity interconnects. This was highlighted recently at Virginia Tech when Dr. Varadarajan built the third fastest supercomputer in the world by connecting 1100 dual-processor Macintosh G5 machines together (see http://www.top500.org). Most commodity clusters use a form of Linux as the operating system. We give an overview of the current status of using the Crestone Project codes SAGE and RAGE on commodity cluster machines. These codes are intended for general applications without tuning of algorithms or parameters. We have run a wide variety of physical applications, from millimeter-scale laboratory laser experiments, to multikilometer-scale asteroid impacts into the Pacific Ocean, to parsec-scale galaxy formation. Examples of these simulations will be shown. The goal of our effort is to avoid ad hoc models and attempt to rely on first-principles physics. In addition to the large effort on developing parallel code physics packages, a substantial effort in the project is devoted to improving the computer science and software quality engineering (SQE) of the Project codes, as well as a sizable effort on the verification and validation (V&V) of the resulting codes. Examples of these efforts for our project will be discussed. Recent results of the scaling of these codes on commodity clusters will be shown.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130543624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliability algorithms for network swapping systems with page migration","authors":"Ben Mitchell, J. Rosse, T. Newhall","doi":"10.1109/CLUSTR.2004.1392655","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392655","url":null,"abstract":"Summary form only given. Network swapping systems allow individual cluster nodes with over-committed memory to use the idle memory of remote nodes as their backing store, and to swap pages over the network. Without reliability support, a single node crash can affect programs running on other nodes by losing their remotely swapped page data. RAID-based (Patterson et al., 1988; Markatos and Dramitinos, 1996) reliability solutions promise the best alternative in terms of flexibility and performance. However, two important features of our network swapping system, Nswap (Newhall et al., 2003), make direct application of RAID-based schemes impossible. First, Nswap adapts to each node's local memory load, adjusting the amount of RAM space it makes available for remote swapping, which results in a variable-capacity \"backing store\". Second, Nswap supports migration of remotely swapped pages between cluster nodes, which occurs when a node needs to reclaim some of its RAM from Nswap to use for local processing. Page migration complicates reliability if, for example, two pages in the same parity group end up on the same node. We present novel reliability algorithms that solve these problems. Our Parity algorithm uses dynamic parity group membership to match Nswap's dynamic nature. We show that our algorithms add minimal overhead to remote swapping.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132197816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A shared virtual memory network with fast remote direct memory access and message passing","authors":"Gang Shi, Mingchang Hu, Hongda Yin, Weiwu Hu, Zhimin Tang","doi":"10.1109/CLUSTR.2004.1392660","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392660","url":null,"abstract":"Communication overhead has become one of the bottlenecks of SVM (shared virtual memory). Many methods have been applied to improve the performance of SVM, but they have not achieved the expected improvement. To make better use of the communication hardware and reduce unnecessary overhead, a prototype with RDMA capability, named FRAMP (virtual memory based Fast Remote direct memory Access and Message Passing network), is designed and implemented in this work. FRAMP includes the crossbar-based switch, the custom host network interface, and the user-level communication protocol. All of these are tightly coupled and deliberately balanced. FRAMP achieves 3.7 μs one-way latency and 6.0 μs RDMA read latency at the system driver level. FRAMP gets 5.6 μs one-way latency, 2.0 μs ping-ping latency, and 125 MB/s asymptotic bandwidth at the user API level with a multithreaded programming method. Remote memory reads of 8 bytes and of a 4096-byte page take only 8.0 μs and 39 μs, respectively, at user level. The obtained bandwidth is close to the hardware limit of our experimental environment, which is based on a 33 MHz 32-bit PCI bus, and the PCI bus utilization is 94%. The SVM performance on the FRAMP network with pure message passing is very good, but that using RDMA reads to fetch faulted pages is not as good.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134390366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting memory-access cost based on data-access patterns","authors":"S. Byna, Xian-He Sun, W. Gropp, R. Thakur","doi":"10.1109/CLUSTR.2004.1392630","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392630","url":null,"abstract":"Improving memory performance at the software level is an effective way of reducing the rapidly expanding gap between processor and memory performance. Loop transformations (e.g. loop unrolling, loop tiling) and array restructuring optimizations improve memory performance by increasing the locality of memory accesses. To find the best optimization parameters at runtime, we need a fast and simple analytical model to predict the memory-access cost. Most existing models are complex and impractical to integrate into runtime tuning systems. In this paper, we propose a simple, fast, and reasonably accurate model that predicts the memory-access cost for a wide range of data-access patterns that appear in many scientific applications.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125612239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An evaluation of the close-to-files processor and data co-allocation policy in multiclusters","authors":"H. Mohamed, D. Epema","doi":"10.1109/CLUSTR.2004.1392626","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392626","url":null,"abstract":"In multicluster systems, and more generally in grids, jobs may require coallocation, i.e., the simultaneous allocation of resources such as processors and input files in multiple clusters. While such jobs may have reduced runtimes because they have access to more resources, waiting for processors in multiple clusters and for the input files to become available in the right locations may introduce inefficiencies. In previous work, we studied processor coallocation only through simulations. Here, we extend this work with an analysis of the performance, in a real testbed, of our prototype processor and data coallocator with the close-to-files (CF) job-placement algorithm. CF tries to place job components on clusters that have enough idle processors and are close to the sites where the input files reside. We present a comparison of the performance of CF and the worst-fit job-placement algorithm, with and without file replication, achieved with our prototype. Our most important findings are that CF with replication works best, and that the utilization of our testbed can be driven to about 80%.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129266203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TeraVision: a distributed, scalable, high resolution graphics streaming system","authors":"Rajvikram Singh, Byungil Jeong, L. Renambot, Andrew E. Johnson, J. Leigh","doi":"10.1109/CLUSTR.2004.1392638","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392638","url":null,"abstract":"In electronically mediated distance collaborations involving scientific data, there is often the need to stream the graphical output of individual computers or entire visualization clusters to remote displays. This work presents TeraVision, a scalable, platform-independent solution that is capable of transmitting multiple synchronized high-resolution video streams between single workstations and/or clusters without requiring any modifications to the source or destination machines. Issues addressed include: how to synchronize individual video streams to form a single larger stream; how to scale and route streams generated by an array of M×N nodes to fit an X×Y display; and how TeraVision exploits a variety of transport protocols. Results from experiments conducted over gigabit local-area networks and wide-area networks (between Chicago and Amsterdam) are presented. Finally, we propose the scalable adaptive graphics environment (SAGE) - an architecture to support future collaborative visualization environments with potentially billions of pixels.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117115854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A client-centric grid knowledgebase","authors":"George Kola, T. Kosar, M. Livny","doi":"10.1109/CLUSTR.2004.1392642","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392642","url":null,"abstract":"Grid computing brings with it additional complexities and unexpected failures. Even keeping track of jobs as they traverse different grid resources before completion can at times become tricky. We introduce a client-centric grid knowledgebase that keeps track of the job performance and failure characteristics on different grid resources as observed by the client. We present the design and implementation of our prototype grid knowledgebase and evaluate its effectiveness on two real-life grid data processing pipelines: the NCSA image processing pipeline and the WCER video processing pipeline. It enabled us to easily extract useful job and resource information and interpret it to make better scheduling decisions. Using it, we were able to understand failures better, devise innovative methods to automatically avoid and recover from failures, and dynamically adapt to the grid environment, improving fault tolerance and performance.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126593815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}