2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935): Latest Articles

FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392606

G. Zheng, L. Shi, L. Kalé

Abstract: As high-performance clusters continue to grow in size, the mean time between failures shrinks. Thus, fault tolerance and reliability are becoming challenging factors for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the most recent checkpoint. Recovering the application from a fault involves (often manually) restarting the application on all processors and having it read the data from disk on all processors. The restart can therefore take minutes after it has been initiated. Such a strategy requires that the failed processor can be replaced so that the number of processors at checkpoint time and recovery time is the same. We present FTC-Charm++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when there is no extra processor, the program can continue to run on the remaining processors while minimizing the performance penalty due to losing processors. The method is useful for applications whose memory footprint is small at the checkpoint state, while a variation of this scheme, in-disk checkpoint/restart, can be applied to applications with a large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charm++ and AMPI (an adaptive version of MPI). This work describes the scheme and shows performance data on a cluster using 128 processors.

Citations: 213

Simplifying administration through dynamic reconfiguration in a cooperative cluster storage system

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392615

Renaud Lachaize, J. Hansen

Citations: 5

State of InfiniBand in designing HPC clusters, storage/file systems, and datacenters [datacenters read as data centers]

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392594

D. Panda

Citations: 0

Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392609

Pierre Lemarinier, Aurélien Bouteiller, T. Hérault, Géraud Krawezik, F. Cappello

Abstract: Fault tolerance is a very important concern for critical high-performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message-passing systems, with different impacts on application performance and on the capacity to tolerate a high fault rate. In a recent paper, we demonstrated that the main differences between pessimistic sender-based message logging and coordinated checkpointing are: 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases the latency because of additional blocking control messages. When faults occur at a high rate, coordinated checkpointing implies a higher performance penalty than message logging because of higher stress on the checkpoint server. We extend this study to improved versions of the message logging and coordinated checkpoint protocols, which respectively reduce the latency overhead of pessimistic message logging and the server stress of coordinated checkpointing. We detail the protocols and their implementation in the new MPICH-V fault-tolerant framework. We compare their performance against the previous versions, and we compare the novel message logging protocols against the improved coordinated checkpointing protocol using the NAS benchmark on a typical high-performance cluster equipped with a high-speed network. The contribution of this work is twofold: a) an original message logging protocol and an improved coordinated checkpointing protocol and b) the comparison between them.

Citations: 82

GRID-enabled bioinformatics applications for comparative genomic analysis at the CBBC

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392652

A. Hunter, D. Schibeci, H. L. Hiew, M. Bellgard

Abstract: Summary form only given. Bioinformatics is an important application area for grid computing. The grid computing issues that must be addressed to tackle current bioinformatics challenges include processing power, large-scale data access and management, security, application integration, data integrity and curation, control/automation/tracking of workflows, data format consistency, and resource discovery. In this poster, we describe preliminary steps taken to develop a grid environment to advance bioinformatics research. We developed a system called Grendel, with the aim of providing bioinformatics researchers transparent access to the basic computational resources used in their research. Grendel is a platform- and language-independent, Web-services-based system for distributed resource management utilising Sun Grid Engine; it provides a single entry point for computational tasks while keeping the actual resources transparent to the user. Grendel is developed in Java and deployed using Tomcat. Client libraries have been developed in Perl and Java to provide access to computation resources exported via Grendel.

Citations: 3

Management of grid jobs and data within SAMGrid

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392634

A. Baranovski, G. Garzoglio, I. Terekhov, A. Roy, T. Tannenbaum

Citations: 5

Scalable, high-performance NIC-based all-to-all broadcast over Myrinet/GM

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392610

Weikuan Yu, D. Panda, Darius Buntinas

Citations: 13

Computation-at-risk: employing the grid for computational risk management

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392633

S. Kleban, S. Clearwater

Citations: 7

A comparison of local and gang scheduling on a Beowulf cluster

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392601

P. Strazdins, Johannes Uhlmann

Abstract: Gang scheduling and related techniques are widely believed to be necessary for efficient job scheduling on distributed-memory parallel computers. This is because they minimize context-switching overheads and permit the parallel job currently running to progress at the fastest possible rate. However, in the case of cluster computers, and particularly those with COTS networks, these benefits can be outweighed, when multiple jobs are time-sharing, by the loss of the ability to utilize the CPU for other jobs while the current job is waiting for messages. Experiments on a Linux Beowulf cluster with 100 Mb Fast Ethernet switches compare SCore buddy-based gang scheduling with local scheduling (provided by the Linux 2.4 kernel with MPI implemented over TCP/IP). Results for communication-intensive numerical applications on 16 nodes reveal that gang scheduling results in 'slowdowns' up to a factor of two greater for 8 simultaneous jobs. This phenomenon is not due to any deficiencies in SCore but to the relative costs of context switching versus message overhead, and we expect similar results to hold for any gang scheduling implementation. A performance analysis of local scheduling indicates that cache pollution due to context switching is more significant than the direct context-switching overhead on the applications studied. When this is taken into account, local scheduling behaviour comes close to achieving ideal slowdowns for finer-grained computations such as Linpack. The performance models also indicate that similar trends are to be expected for clusters with faster networks.

Citations: 26

Predicting memory-access cost based on data-access patterns

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392630

S. Byna, Xian-He Sun, W. Gropp, R. Thakur

Citations: 22