Insights into Application-level Solutions towards Resilient MPI Applications

2018 International Conference on High Performance Computing & Simulation (HPCS) | Pub Date: 2018-07-01 | DOI: 10.1109/HPCS.2018.00101

P. González, Nuria Losada, María J. Martín


Citations: 0

Abstract



Current petascale systems, built from hundreds of thousands of cores, are highly dynamic, and their hardware failure rates are relatively high. Failure data collected from two large high-performance computing sites were analysed in [1], showing rates ranging from 20 to more than 1,000 failures per year, depending mostly on system size. At the upper end of that range, this translates to roughly one failure every 8.7 hours. Future exascale systems, built from several million cores, will be hit by errors and faults even more frequently because of their scale and complexity [2]. Long-running applications on these systems will therefore need fault tolerance techniques to ensure that they run to completion.
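The 8.7-hour figure can be checked with simple arithmetic: a year has about 8,760 hours, so 1,000 failures per year gives roughly 8.76 hours between failures, and slightly more than 1,000 failures per year gives about 8.7. The sketch below is only an illustration of that conversion. The function name `mean_hours_between_failures` and the assumption that failures are spread evenly over a non-leap year are ours, not the authors'.

```python
# Illustrative conversion from a yearly failure count to the average
# interval between failures. Assumes failures are evenly spread over a
# 365-day year; real failure distributions are generally not uniform.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def mean_hours_between_failures(failures_per_year: float) -> float:
    """Return the average number of hours between two failures."""
    if failures_per_year <= 0:
        raise ValueError("failures_per_year must be positive")
    return HOURS_PER_YEAR / failures_per_year


# The two ends of the range quoted in the abstract.
for rate in (20, 1000):
    interval = mean_hours_between_failures(rate)
    print(f"{rate} failures/year -> one failure every {interval:.2f} hours")
```

Under these assumptions, the low end of the range (20 failures per year) corresponds to one failure about every 18 days, while the high end leaves less than nine hours between failures, which is shorter than many long-running MPI jobs.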
