A Fault Tolerant Implementation for a Massively Parallel Seismic Framework
Authors: Suha N. Kayum, H. Alsalim, T. Tonellot, A. Momin
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC)
Publication date: 2020-09-22
DOI: https://doi.org/10.1109/HPEC43674.2020.9286143
Growing seismic data volumes mean that seismic processing applications now run for weeks or months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, so the application must be resilient. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that builds on Boost.asio for network communication is presented and tested quantitatively and qualitatively using fault injection. Resource provisioning is also illustrated by adding resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism at large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC application with embarrassingly parallel properties.
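
The abstract does not describe the mechanism's internals, so the following is only a minimal sketch of the general idea it relies on: when tasks are independent (embarrassingly parallel), work lost on a failed node can simply be re-queued, and nodes that join mid-run can pick up pending work. It is written in Python for illustration and is not GeoDRIVE's Boost.asio implementation. The function `run_with_faults`, its parameters `kill_at` and `add_at`, and the round-based scheduling are all hypothetical.

```python
from collections import deque


def run_with_faults(num_tasks, workers, kill_at=None, add_at=None):
    """Simulate a master dispatching independent tasks round by round.

    kill_at: {round: [worker names killed during that round]}  (fault injection)
    add_at:  {round: [worker names joining before that round]} (provisioning)
    Returns (set of completed task ids, number of re-queued tasks).
    All names here are hypothetical; this is not the paper's API.
    """
    kill_at = kill_at or {}
    add_at = add_at or {}
    pending = deque(range(num_tasks))
    alive = list(workers)
    completed = set()
    requeued = 0
    round_no = 0

    while pending:
        # Resource provisioning: new workers join before tasks are assigned.
        alive.extend(add_at.get(round_no, []))
        if not alive:
            raise RuntimeError("no live workers left to run pending tasks")

        # Hand at most one pending task to each live worker.
        assignment = {}
        for worker in alive:
            if not pending:
                break
            assignment[worker] = pending.popleft()

        # Fault injection: a task on a killed worker is lost and re-queued.
        killed = set(kill_at.get(round_no, []))
        for worker, task in assignment.items():
            if worker in killed:
                pending.append(task)
                requeued += 1
            else:
                completed.add(task)

        alive = [w for w in alive if w not in killed]
        round_no += 1

    return completed, requeued


if __name__ == "__main__":
    done, retried = run_with_faults(
        num_tasks=10,
        workers=["w0", "w1", "w2", "w3"],
        kill_at={1: ["w1", "w2"]},  # two workers fail in round 1
        add_at={2: ["w4"]},         # one worker is added in round 2
    )
    print(sorted(done), retried)
```

In this sketch every task still completes after two workers are killed, at the cost of re-running the two tasks that were in flight. A real system would also need failure detection over the network (for example, heartbeats or connection errors), which is omitted here.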