A study of application-level recovery methods for transient network faults
Authors: I. Laguna, E. León, M. Schulz, M. Stephenson
Venue: ACM SIGPLAN Symposium on Scala
Published: 2013-11-17
DOI: 10.1145/2530268.2530271 (https://doi.org/10.1145/2530268.2530271)
Citation count: 0
Abstract
With the increasing number of components in HPC systems, transient faults will become commonplace. Today, transient network faults, such as lost or corrupted network packets, are addressed by middleware libraries at the cost of high memory usage and packet retransmissions. These costs, however, can be eliminated using application-level fault tolerance. In this paper, we propose recovery methods for transient network faults at the application level. These methods reconstruct missing or corrupted data via interpolation. We derive a realistic fault model using network fault rates from a production HPC cluster and use it to demonstrate the effectiveness of our reconstruction methods in an FFT kernel. We found that the normalized root-mean-square error for FFT computations can be as low as 0.1%, which demonstrates that network faults can be handled at the application level with low perturbation in applications whose computed data is smooth.
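The idea of reconstructing lost data by interpolation and measuring the effect on an FFT can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice of linear interpolation, the test signal, the size and position of the lost block, and the NRMSE normalization (RMSE divided by the range of the reference spectrum magnitudes) are all assumptions made for illustration. A naive DFT stands in for the FFT kernel so the example needs only the Python standard library.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT kernel)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def reconstruct_linear(data, lost):
    """Fill the indices in `lost` by linear interpolation between the nearest
    surviving neighbours; gaps at either edge copy the nearest surviving value."""
    out = list(data)
    n = len(out)
    lost = set(lost)
    for i in sorted(lost):
        left = i - 1
        while left >= 0 and left in lost:
            left -= 1
        right = i + 1
        while right < n and right in lost:
            right += 1
        if left < 0 and right >= n:
            raise ValueError("no surviving samples to interpolate from")
        if left < 0:
            out[i] = data[right]
        elif right >= n:
            out[i] = data[left]
        else:
            w = (i - left) / (right - left)
            out[i] = (1 - w) * data[left] + w * data[right]
    return out


def nrmse(reference, approx):
    """RMSE between two spectra, normalized by the range of the reference
    magnitudes (one common convention, assumed here)."""
    n = len(reference)
    rmse = math.sqrt(sum(abs(r - a) ** 2 for r, a in zip(reference, approx)) / n)
    mags = [abs(r) for r in reference]
    return rmse / (max(mags) - min(mags))


# Hypothetical scenario: a smooth signal in which one "packet" of 8 samples is lost.
n = 256
signal = [math.sin(2 * math.pi * t / n) for t in range(n)]
lost = range(100, 108)

# Baseline with no recovery: the lost samples are simply read as zero.
zeroed = [0.0 if t in lost else v for t, v in enumerate(signal)]
# Application-level recovery: rebuild the lost samples from their neighbours.
recovered = reconstruct_linear(zeroed, lost)

reference = dft(signal)
err_zero = nrmse(reference, dft(zeroed))
err_interp = nrmse(reference, dft(recovered))
print(f"NRMSE without recovery: {err_zero:.2e}")
print(f"NRMSE with interpolation: {err_interp:.2e}")
```

On a smooth signal such as this one, the interpolated input should give a much smaller spectral error than the zero-filled baseline, which is the property the abstract relies on. For data that is not smooth, linear interpolation may reconstruct poorly and the error would be correspondingly larger.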