An investigation of the effects of error correcting code on GPU-accelerated molecular dynamics simulations
Authors: R. Walker, Robin M. Betz
Published in: Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
Publication date: 2013-07-22
DOI: 10.1145/2484762.2484774 (https://doi.org/10.1145/2484762.2484774)
Citation count: 5
Abstract
Molecular dynamics (MD) simulations rely on the accurate evaluation and integration of Newton's equations of motion to propagate the positions of atoms in proteins. As such, one can expect them to be sensitive to any form of numerical error that occurs during a simulation. Graphics processing units (GPUs) are increasingly being used to accelerate MD simulations. Current GPU architectures designed for HPC applications support error correcting codes (ECC) that detect and correct single bit-flip errors in GPU memory; however, this error checking carries a penalty in simulation speed. ECC is also a major distinguishing feature between the HPC-oriented NVIDIA Tesla cards and the considerably more cost-effective NVIDIA GeForce gaming cards. An argument often put forward against using GeForce cards is that their results are unreliable because they lack ECC. In an initial attempt to quantify these concerns, the effects of ECC on GPU-accelerated MD simulations using the AMBER software were investigated on 720 GPUs of the XSEDE supercomputer Keeneland, with and without ECC enabled. While the data collected are insufficient to draw firm conclusions, and more extensive testing is needed to provide quantitative statistics, the absence of ECC events and of any silent errors in all the simulations conducted to date suggests that these errors are exceedingly rare. As such, the time and memory penalty of ECC may outweigh the utility of error checking. This is particularly true for large-scale HPC runs, where a simulation is more likely to be interrupted by a node or storage failure; reducing the simulation wall-clock time by turning ECC off may therefore actually reduce the overall simulation failure rate.
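The closing argument, that a shorter wall-clock time lowers exposure to node or storage failures, can be illustrated with a rough back-of-the-envelope sketch. The model below is an assumption made for illustration and is not taken from the study: failures are treated as independent events arriving at a constant rate per node-hour, and every number (failure rate, node count, run length, ECC slowdown) is a hypothetical placeholder rather than a measured value.

```python
import math


def interruption_probability(failure_rate_per_node_hour, nodes, wall_clock_hours):
    """Probability that at least one node or storage failure hits the run.

    Assumes failures arrive independently at a constant rate (a Poisson
    model), which is a simplification introduced for this illustration.
    """
    expected_failures = failure_rate_per_node_hour * nodes * wall_clock_hours
    return 1.0 - math.exp(-expected_failures)


# All values below are hypothetical placeholders, not results from the paper.
FAILURE_RATE = 1e-5    # assumed failures per node-hour
NODES = 240            # assumed number of nodes used by the run
HOURS_ECC_OFF = 100.0  # assumed wall-clock time with ECC disabled
ECC_SLOWDOWN = 0.10    # assumed fractional slowdown with ECC enabled

# With ECC on, the same simulation takes proportionally longer.
hours_ecc_on = HOURS_ECC_OFF * (1.0 + ECC_SLOWDOWN)

p_off = interruption_probability(FAILURE_RATE, NODES, HOURS_ECC_OFF)
p_on = interruption_probability(FAILURE_RATE, NODES, hours_ecc_on)

print(f"Interruption probability, ECC off: {p_off:.4f}")
print(f"Interruption probability, ECC on:  {p_on:.4f}")
```

Under these assumptions the longer ECC-enabled run always has the higher interruption probability, because the expected number of failures grows linearly with wall-clock time. Whether that outweighs the risk of an uncorrected memory error depends on the real failure and bit-flip rates, which this sketch does not model.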