2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935): Latest Articles

FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392606
G. Zheng, L. Shi, L. Kalé
{"title":"FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI","authors":"G. Zheng, L. Shi, L. Kalé","doi":"10.1109/CLUSTR.2004.1392606","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392606","url":null,"abstract":"As high performance clusters continue to grow in size, the mean time between failures shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challenging factors for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the most recent checkpoint. The recovery of the application from faults involves (often manually) restarting applications on all processors and having it read the data from disks on all processors. The restart can therefore take minutes after it has been initiated. Such a strategy requires that the failed processor can be replaced so that the number of processors at checkpoint-time and recovery-time are the same. We present FTC-Charm++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when there is no extra processor, the program can continue to run on the remaining processors while minimizing the performance penalty due to losing processors. The method is useful for applications whose memory footprint is small at the checkpoint state, while a variation of this scheme, in-disk checkpoint/restart, can be applied to applications with large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charm++ and AMPI (an adaptive version of MPI). This work describes the scheme and shows performance data on a cluster using 128 processors.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115135368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 213
Simplifying administration through dynamic reconfiguration in a cooperative cluster storage system
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392615
Renaud Lachaize, J. Hansen
{"title":"Simplifying administration through dynamic reconfiguration in a cooperative cluster storage system","authors":"Renaud Lachaize, J. Hansen","doi":"10.1109/CLUSTR.2004.1392615","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392615","url":null,"abstract":"Cluster storage systems where storage devices are distributed across a large number of nodes are able to reduce the I/O bottleneck problems present in most centralized storage systems. However, such distributed storage devices are hard to manage efficiently. In this paper, we examine the use of explicit, component-based (command and data) paths between hosts and disks as a vehicle for performing nondisruptive storage system reconfiguration. We describe the mechanisms necessary to perform reconfigurations and show how they can be used to handle two management tasks: migration between network technologies and rebuilding a disk in a mirror. Our approach is validated through initial performance measurements of these two tasks using a prototype implementation. The results show that online reconfiguration is possible at a modest cost.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123112392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
State of InfiniBand in designing HPC clusters, storage/file systems, and datacenters [datacenters read as data centers]
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392594
D. Panda
{"title":"State of InfiniBand in designing HPC clusters, storage/file systems, and datacenters [datacenters read as data centers]","authors":"D. Panda","doi":"10.1109/CLUSTR.2004.1392594","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392594","url":null,"abstract":"Summary form only given. The tutorial aims to familiarize attendees with IBA, its benefits, available IBA hardware/software solutions, and the latest trends in designing high-end computing, networking, and storage systems with IBA, and to provide a critical assessment of whether IBA is ready for prime time.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124272728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392609
Pierre Lemarinier, Aurélien Bouteiller, T. Hérault, Géraud Krawezik, F. Cappello
{"title":"Improved message logging versus improved coordinated checkpointing for fault tolerant MPI","authors":"Pierre Lemarinier, Aurélien Bouteiller, T. Hérault, Géraud Krawezik, F. Cappello","doi":"10.1109/CLUSTR.2004.1392609","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392609","url":null,"abstract":"Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message passing systems with different impact on application performance and the capacity to tolerate a high fault rate. In a recent paper, we have demonstrated that the main differences between pessimistic sender based message logging and coordinated checkpointing are: 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases the latency, due to additional blocking control messages. When faults occur at a high rate, coordinated checkpointing implies a higher performance penalty than message logging due to a higher stress on the checkpoint server. We extend this study to improved versions of message logging and coordinated checkpoint protocols which respectively reduce the latency overhead of pessimistic message logging and the server stress of coordinated checkpoint. We detail the protocols and their implementation into the new MPICH-V fault tolerant framework. We compare their performance against the previous versions and we compare the novel message logging protocols against the improved coordinated checkpointing one using the NAS benchmark on a typical high performance cluster equipped with a high speed network. The contribution of this work is twofold: a) an original message logging protocol and an improved coordinated checkpointing protocol and b) the comparison between them.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125306726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 82
GRID-enabled bioinformatics applications for comparative genomic analysis at the CBBC
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392652
A. Hunter, D. Schibeci, H. L. Hiew, M. Bellgard
{"title":"GRID-enabled bioinformatics applications for comparative genomic analysis at the CBBC","authors":"A. Hunter, D. Schibeci, H. L. Hiew, M. Bellgard","doi":"10.1109/CLUSTR.2004.1392652","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392652","url":null,"abstract":"Summary form only given. Bioinformatics is an important application area for grid computing. The grid computing issues required to tackle current bioinformatics challenges include processing power, large-scale data access and management, security, application integration, data integrity and curation, control/automation/tracking of workflows, data format consistency and resource discovery. In this poster, we describe preliminary steps taken to develop a grid environment to advance bioinformatics research. We developed a system called Grendel, with the aim of providing bioinformatics researchers transparent access to basic computational resources used in their research. Grendel is a platform and language independent Web-services based system for distributed resource management utilising Sun Grid Engine that provides a single entry point for computational tasks while keeping the actual resources transparent to the user. Grendel is developed in Java and deployed using Tomcat. Client libraries have been developed in Perl and Java to provide access to computational resources exported via Grendel.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124747240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Management of grid jobs and data within SAMGrid
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392634
A. Baranovski, G. Garzoglio, I. Terekhov, A. Roy, T. Tannenbaum
{"title":"Management of grid jobs and data within SAMGrid","authors":"A. Baranovski, G. Garzoglio, I. Terekhov, A. Roy, T. Tannenbaum","doi":"10.1109/CLUSTR.2004.1392634","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392634","url":null,"abstract":"When designing SAMGrid, a project for distributing high-energy physics computations on a grid, we discovered that it was challenging to decide where to place users' jobs. Jobs typically need to access hundreds of files, and each site has a different subset of the files. Our data system SAM knows what portion of a user's data may be at each site, but does not know how to submit grid jobs. Our job submission system Condor-G knows how to submit grid jobs, but originally it required users to choose grid sites and gave them no assistance in choosing. This work describes how we enhanced Condor-G to interact with SAM to make good decisions about where jobs should be executed, and thereby improve the performance of grid jobs that access large amounts of data. All these enhancements are general enough to be applicable to grid computing beyond the data-intensive computing with SAMGrid.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114707099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Scalable, high-performance NIC-based all-to-all broadcast over Myrinet/GM
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392610
Weikuan Yu, D. Panda, Darius Buntinas
{"title":"Scalable, high-performance NIC-based all-to-all broadcast over Myrinet/GM","authors":"Weikuan Yu, D. Panda, Darius Buntinas","doi":"10.1109/CLUSTR.2004.1392610","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392610","url":null,"abstract":"All-to-all broadcast is one of the common collective operations that involve dense communication between all processes in a parallel program. Previously, programmable network interface cards (NICs) have been leveraged to efficiently support collective operations, including barrier, broadcast, and reduce. This work explores the characteristics of all-to-all broadcast and proposes new algorithms to exploit the potential advantages of NIC programmability. Along with these algorithms, salient strategies have been used to provide scalable topology management, global buffer management, efficient communication processing, and message reliability. The algorithms have been incorporated into a NIC-based collective protocol over Myrinet/GM. The NIC-based all-to-all broadcast operations improve all-to-all broadcast bandwidth over 16 nodes by a factor of 3, compared to host-based all-to-all broadcast operation. Furthermore, the NIC-based operations have been demonstrated to achieve better scalability to large systems and very low host CPU utilization.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134536845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Computation-at-risk: employing the grid for computational risk management
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392633
S. Kleban, S. Clearwater
{"title":"Computation-at-risk: employing the grid for computational risk management","authors":"S. Kleban, S. Clearwater","doi":"10.1109/CLUSTR.2004.1392633","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392633","url":null,"abstract":"This work expands upon our earlier work involving the concept of computation-at-risk (CaR). In particular, CaR refers to the risk that certain computations may not get done within a timely manner. We examine a number of CaR distributions on several large clusters. The important contribution of this work is that it shows that there exist CaR-reducing strategies and by employing such strategies, a facility can significantly reduce the risk of inefficient resource utilization. Grids are shown to be one means for employing a CaR-reducing strategy. For example, we show that a CaR-reducing strategy applied to a common queue can have a dramatic effect on the wait times for jobs on a grid of clusters. In particular, we defined a CaR Sharpe rule that provides a decision rule for determining the best machine in a grid to place a new job.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130829454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
A comparison of local and gang scheduling on a Beowulf cluster
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392601
P. Strazdins, Johannes Uhlmann
{"title":"A comparison of local and gang scheduling on a Beowulf cluster","authors":"P. Strazdins, Johannes Uhlmann","doi":"10.1109/CLUSTR.2004.1392601","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392601","url":null,"abstract":"Gang scheduling and related techniques are widely believed to be necessary for efficient job scheduling on distributed memory parallel computers. This is because they minimize context switching overheads and permit the parallel job currently running to progress at the fastest possible rate. However, in the case of cluster computers, and particularly those with COTS networks, these benefits can be outweighed in the multiple jobs time-sharing context by the loss of the ability to utilize the CPU for other jobs when the current job is waiting for messages. Experiments on a Linux Beowulf cluster with 100 Mb fast Ethernet switches are made comparing the SCore buddy-based gang scheduling with local scheduling (provided by the Linux 2.4 kernel with MPI implemented over TCP/IP). Results for communication-intensive numerical applications on 16 nodes reveal that gang scheduling results in 'slowdowns' up to a factor of two greater for 8 simultaneous jobs. This phenomenon is not due to any deficiencies in SCore but due to the relative costs of context switching versus message overhead, and we expect similar results hold for any gang scheduling implementation. A performance analysis of local scheduling indicates that cache pollution due to context switching is more significant than the direct context switching overhead on the applications studied. When this is taken into account, local scheduling behaviour comes close to achieving ideal slowdowns for finer-grained computations such as Linpack. The performance models also indicate that similar trends are to be expected for clusters with faster networks.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134528140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
MPIIMGEN - a code transformer that parallelizes image processing codes to run on a cluster of workstations
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) Pub Date : 2004-09-20 DOI: 10.1109/CLUSTR.2004.1392596
U. V. Vinod, P. K. Baruah
{"title":"MPIIMGEN - a code transformer that parallelizes image processing codes to run on a cluster of workstations","authors":"U. V. Vinod, P. K. Baruah","doi":"10.1109/CLUSTR.2004.1392596","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392596","url":null,"abstract":"An enormous body of image and video processing software has been written for conventional (sequential) desktop computers. These implement a wide range of operations, such as convolution, histogram equalization and template matching. These applications usually have a tremendous potential for parallelism. However, a significant barrier in exploiting such parallelism is the difficulty of writing parallel software. In this work, the design and implementation of MPIIMGEN, a code transformer that automatically transforms these sequential image processing codes into parallel codes that are capable of running on a cluster of workstations, is presented. This tool uses a pattern driven approach to parallelize the sequential codes.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114721323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3