Highly Scalable Parallel Checksums
Christian Siebert
2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS), published 2021-12-01
DOI: https://doi.org/10.1109/ICPADS53394.2021.00107
Citation count: 0
Abstract
Checksums are used to detect errors that might occur while storing or communicating data. Checking the integrity of data is well established, but only for smaller data sets. In contrast, supercomputers have to deal with huge amounts of data, which introduces failures that may remain undetected. Additional protection therefore becomes a necessity at large scale. However, checking the integrity of larger data sets, especially distributed ones, clearly requires parallel approaches. We show how popular checksums, such as CRC-32 and Adler-32, can be parallelized efficiently. This also disproves a widespread belief that parallelizing these checksums, especially in a scalable way, is not possible. The mathematical properties behind these checksums enable a method to combine partial checksums such that the result equals the checksum of the concatenated partial data. Our parallel checksum algorithm uses this combination idea in a scalable hierarchical reduction scheme to combine the partial checksums from an arbitrary number of processing elements. Although this reduction scheme can be implemented manually using most parallel programming interfaces, we use the Message Passing Interface, which supports such functionality directly via non-commutative user-defined reduction operations. In conjunction with the efficient checksum capabilities of the zlib library, our algorithm can be implemented not only conveniently and portably, but also very efficiently. Additional shared-memory parallelization within compute nodes completes our hybrid parallel checksum solutions, which scale up to 524,288 threads. At this scale, computing the checksums of 240 TiB of data took only 3.4 seconds for CRC-32 and 2.6 seconds for Adler-32. Finally, we discuss the APES application as a representative of dynamic supercomputer applications. Thanks to our scalable checksum algorithm, even such applications are now able to detect many errors within their distributed data sets.
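The combination idea can be illustrated with a minimal single-process sketch for Adler-32. It is not the paper's implementation: the function names (`adler32_combine`, `tree_reduce`, `parallel_adler32`) and the list-based tree reduction are assumptions made for illustration, and the list of chunks stands in for data distributed across processing elements. The sketch relies on the standard Adler-32 definition, where the low 16 bits hold A = 1 + (sum of bytes) mod 65521 and the high 16 bits hold B = (sum of all intermediate A values) mod 65521. For a left block X followed by a right block Y of length n, this gives A(XY) = A(X) + A(Y) - 1 and B(XY) = B(X) + B(Y) + n * (A(X) - 1), both mod 65521.

```python
import zlib
from typing import List, Sequence, Tuple

MOD = 65521  # Adler-32 modulus: the largest prime below 2**16

# A partial result is (Adler-32 checksum, block length in bytes).
Partial = Tuple[int, int]


def adler32_combine(left: Partial, right: Partial) -> Partial:
    """Combine the checksums of two adjacent blocks (left precedes right).

    The operation is associative but not commutative, so operand order
    must follow data order.
    """
    adler1, len1 = left
    adler2, len2 = right
    a1, b1 = adler1 & 0xFFFF, adler1 >> 16
    a2, b2 = adler2 & 0xFFFF, adler2 >> 16
    # Both halves start from an implicit 1, so subtract one of them.
    a = (a1 + a2 - 1) % MOD
    # Every byte of the right block also accumulates the left block's byte sum.
    b = (b1 + b2 + (len2 % MOD) * (a1 - 1)) % MOD
    return (b << 16) | a, len1 + len2


def tree_reduce(partials: Sequence[Partial]) -> Partial:
    """Order-preserving pairwise reduction; expects at least one partial."""
    level: List[Partial] = list(partials)
    while len(level) > 1:
        nxt = [adler32_combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


def parallel_adler32(chunks: Sequence[bytes]) -> int:
    """Checksum of the concatenated chunks, computed from per-chunk checksums."""
    # Hypothetical stand-in: each chunk would live on a separate processing element.
    partials = [(zlib.adler32(chunk), len(chunk)) for chunk in chunks]
    return tree_reduce(partials)[0]
```

For any split of a byte string into ordered chunks, `parallel_adler32(chunks)` should equal `zlib.adler32` of the whole string. In an MPI setting, `adler32_combine` would play the role of the non-commutative user-defined reduction operation mentioned above. The zlib C API also offers `adler32_combine` and `crc32_combine`, which Python's `zlib` module does not expose; whether the paper builds on those exact functions is not stated in the abstract.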