Highly Scalable Parallel Checksums

2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS). Pub Date: 2021-12-01. DOI: 10.1109/ICPADS53394.2021.00107

Christian Siebert



Abstract

Checksums are used to detect errors that might occur while storing or communicating data. Checking the integrity of data is well established, but only for smaller data sets. In contrast, supercomputers have to deal with huge amounts of data, which introduces failures that may remain undetected. Additional protection therefore becomes a necessity at large scale. However, checking the integrity of larger data sets, especially in the case of distributed data, clearly requires parallel approaches. We show how popular checksums, such as CRC-32 or Adler-32, can be parallelized efficiently. This also disproves a widespread belief that parallelizing these checksums, especially in a scalable way, is not possible. The mathematical properties behind these checksums enable a method to combine partial checksums such that the result corresponds to the checksum of the concatenated partial data. Our parallel checksum algorithm uses this combination idea in a scalable hierarchical reduction scheme to combine the partial checksums from an arbitrary number of processing elements. Although this reduction scheme can be implemented manually using most parallel programming interfaces, we use the Message Passing Interface, which supports this functionality directly via non-commutative user-defined reduction operations. In conjunction with the efficient checksum capabilities of the zlib library, our algorithm can be implemented not only conveniently and portably, but also very efficiently. Additional shared-memory parallelization within compute nodes completes our hybrid parallel checksum solutions, which scale to up to 524,288 threads. At this scale, computing the checksums of 240 TiB of data took only 3.4 seconds for CRC-32 and 2.6 seconds for Adler-32. Finally, we discuss the APES application as a representative of dynamic supercomputer applications. Thanks to our scalable checksum algorithm, even such applications are now able to detect many errors within their distributed data sets.
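The combination property mentioned in the abstract can be illustrated with a minimal sketch for Adler-32. This is not the paper's implementation: it is a single-process Python illustration in which the helper names (`adler32_combine`, `tree_reduce`, `chunked_adler32`) are hypothetical, the combine rule is written out by hand from the definition of Adler-32 (the zlib C library offers a comparable `adler32_combine` function), and a simple pairwise loop stands in for the hierarchical reduction that the paper performs with MPI. Each partial result is carried as a `(checksum, length)` pair, because combining two partial checksums requires the length of the right-hand chunk.

```python
import zlib

MOD = 65521  # Adler-32 modulus: the largest prime below 2**16


def adler32_combine(left, right):
    """Combine (checksum, length) pairs of two adjacent chunks.

    `left` must describe the chunk that comes first in the data;
    the operation is associative but not commutative.
    """
    adler1, len1 = left
    adler2, len2 = right
    a1, b1 = adler1 & 0xFFFF, adler1 >> 16
    a2, b2 = adler2 & 0xFFFF, adler2 >> 16
    # A is 1 plus the byte sum, so the second chunk's initial 1 is removed.
    a = (a1 + a2 - 1) % MOD
    # Every running sum in the second chunk is shifted by (a1 - 1),
    # and there are len2 such running sums.
    b = (b1 + b2 + (len2 % MOD) * (a1 - 1)) % MOD
    return (b << 16) | a, len1 + len2


def tree_reduce(parts):
    """Order-preserving pairwise reduction over a non-empty list of pairs."""
    while len(parts) > 1:
        merged = [adler32_combine(parts[i], parts[i + 1])
                  for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:  # an unpaired last element keeps its position
            merged.append(parts[-1])
        parts = merged
    return parts[0]


def chunked_adler32(data, num_chunks):
    """Checksum each chunk independently, then combine the partial results."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)] or [b""]
    parts = [(zlib.adler32(chunk), len(chunk)) for chunk in chunks]
    return tree_reduce(parts)[0]
```

In this sketch the per-chunk `zlib.adler32` calls are independent of one another, which is the part that could run on separate processing elements; only the small `(checksum, length)` pairs need to be exchanged. The order of operands matters, which is why the abstract refers to non-commutative reduction operations.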


