Performance under failures of high-end computing

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07). Pub Date: 2007-11-16. DOI: 10.1145/1362622.1362687

Ming Wu, Xian-He Sun, Hui Jin


Citations: 39

Abstract


Performance under failures of high-end computing

Modern high-end computers are unprecedentedly complex, and faults are inevitable when solving large-scale applications on future petaflop machines. Many methods have been proposed in recent years to mask faults, but they impose various performance and production costs. Using existing fault-tolerant methods wisely requires a better understanding of how faults influence application performance. In this study, we first introduce practical and effective performance models that predict application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop fault-aware task scheduling algorithms that optimize application performance under system failures. Finally, we conduct extensive simulations and experiments with actual failure traces to evaluate our prediction models and scheduling strategies.
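To make the quantities named in the abstract concrete, the following is a minimal sketch of how failure rate, repair time, checkpointing period, and checkpointing cost can combine into an expected completion time for a sequential job. It is not the model from the paper, which the abstract does not spell out. It uses a textbook first-order result and assumes exponentially distributed failures, no failures during repair, and a checkpoint after every segment including the last. The function names `expected_completion_time` and `young_interval` are hypothetical.

```python
import math


def expected_completion_time(work, failure_rate, repair_time,
                             checkpoint_period, checkpoint_cost):
    """Expected wall-clock time to finish `work` units of computation.

    Assumptions (illustrative, not taken from the paper):
    - failures arrive as a Poisson process with rate `failure_rate`;
    - a failure loses all progress since the last checkpoint;
    - each failure costs `repair_time`, during which no failure occurs;
    - a checkpoint of cost `checkpoint_cost` follows every segment.
    """
    total = 0.0
    remaining = work
    while remaining > 1e-12:
        segment = min(checkpoint_period, remaining)
        span = segment + checkpoint_cost  # time that must run failure-free
        if failure_rate == 0.0:
            total += span
        else:
            # Closed form for one segment: (1/lambda + repair) * (e^(lambda*span) - 1)
            total += (1.0 / failure_rate + repair_time) * math.expm1(failure_rate * span)
        remaining -= segment
    return total


def young_interval(checkpoint_cost, failure_rate):
    """Young's first-order approximation of a good checkpoint period."""
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


if __name__ == "__main__":
    rate = 0.001  # failures per time unit (hypothetical value)
    period = young_interval(1.0, rate)
    print(expected_completion_time(1000.0, rate, 5.0, period, 1.0))
```

In this sketch a shorter checkpointing period reduces the work lost per failure but pays the checkpointing cost more often, which is the trade-off a completion-time model has to capture.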
