GPU-Accelerated Error-Bounded Compression Framework for Quantum Circuit Simulations
Milan Shah, Xiaodong Yu, S. Di, Danylo Lykov, Y. Alexeev, M. Becchi, F. Cappello
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: 10.1109/IPDPS54959.2023.00081
Abstract: Quantum circuit simulations enable researchers to develop quantum algorithms without the need for a physical quantum computer. Quantum computing simulators, however, all suffer from significant memory footprint requirements, which prevents large circuits from being simulated on classical supercomputers. In this paper, we explore different lossy compression strategies to substantially shrink quantum circuit tensors in the QTensor package (a state-of-the-art tensor network quantum circuit simulator) while ensuring that the reconstructed data satisfy the user-specified fidelity. Our contribution is fourfold. (1) We propose a series of optimized pre- and post-processing steps to boost the compression ratio of tensors with very limited performance overhead. (2) We characterize the impact of lossy decompressed data on quantum circuit simulation results, and leverage this analysis to ensure the fidelity of the reconstructed data. (3) We propose a configurable GPU compression framework based on cuSZ and cuSZx, two state-of-the-art GPU-accelerated lossy compressors, to address different use cases: either prioritizing compression ratio or prioritizing compression speed. (4) We perform a comprehensive evaluation by running 9 state-of-the-art compressors on an NVIDIA A100 GPU on QTensor-generated tensors of varying sizes. When prioritizing compression ratio, our results show that our strategies can increase the compression ratio nearly 10 times compared to using only cuSZ. When prioritizing throughput, we can perform compression at a speed comparable to cuSZx while achieving 3-4× higher compression ratios. Decompressed tensors can be used in QTensor circuit simulation to yield a final energy result within 1-5% of the true energy value.
On the Arithmetic Intensity of Distributed-Memory Dense Matrix Multiplication Involving a Symmetric Input Matrix (SYMM)
E. Agullo, A. Buttari, O. Coulaud, Lionel Eyraud-Dubois, Mathieu Faverge, Alain Franc, A. Guermouche, Antoine Jego, Romain Peressoni, Florent Pruvost
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: 10.1109/IPDPS54959.2023.00044
Abstract: Dense matrix multiplication involving a symmetric input matrix (SYMM) is implemented in reference distributed-memory codes with the same data distribution as its general analogue (GEMM). We show that, when the symmetric matrix is dominant, such a 2D block-cyclic (2D BC) scheme leads to an arithmetic intensity (AI) for SYMM that is lower than that of GEMM by a factor of 2. We propose alternative data distributions that preserve the memory benefit of SYMM of storing only half of the matrix while achieving up to the same AI as GEMM. We also show that, when we can afford the same memory footprint as GEMM, SYMM can achieve a higher AI. We propose a task-based design of SYMM that is independent of the data distribution. This design allows for a scalable A-stationary SYMM with which all the discussed data distributions, even very irregular ones, can easily be assessed. We have integrated the resulting code in a dimension reduction algorithm involving a randomized singular value decomposition dominated by SYMM. An experimental study shows a compelling impact on performance.
{"title":"An Efficient 2D Method for Training Super-Large Deep Learning Models","authors":"Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You","doi":"10.1109/IPDPS54959.2023.00031","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00031","url":null,"abstract":"Since the rise of Transformer [22] and BERT [6], large language models [7], [12] have been proposed and shown unprecedented performance in tasks like translation, classification, and text generation. However, due to the memory constraint, model parallelism must be used to split the model across multiple processors. Inter-layer partition, intra-layer partition, and sparse activation are the major approaches to achieve model parallelism. Among them, inter-layer partition [10], [11] often requires the model to be explicitly expressed as a stack of sub-modules, the number of which equals to the number of processors, and would introduce either gradient staleness or bubble overhead; while the sparse activation [12] is primarily designed for Google TPU cluster and hard to deploy on GPU servers, intra-layer partition [17], especially Megatron-LM [18], can be easily deployed on GPU servers and has been adopted in subsequent works like Turing-NLG and M6. Though as pioneers of intra-layer parallelism, they still show memory redundancy and sub-optimal communication efficiency, which reveals the space for further improvements. In this work, we leverage SUMMA [21] and propose Optimus, a highly efficient and scalable paradigm for training super-large language models. In Optimus, activations and gradients are partitioned and distributed along processors all the way through forward and backward propagations, with hardly any memory redundancy. The isoefficiency of communication in pure model parallelism improves from W ~ p3 for Megatron-LM, to $Wsim {(sqrt p log p)^3}$ for our Optimus. This framework is implemented with open-source deep learning framework, PyTorch, and consolidates existing techniques such as mixed precision training [13], activation checkpointing [5], and data parallelism. In experiments on TACC Frontera supercomputers, Optimus shows 1.48× the speed for training, 1.78× speed for inference, and 8× the maximum batch size over Megatron-LM on 64 GPUs in pure model parallelism; and 1.73× speed for training, 2.32× speed for inference with data parallelism size equaling 2 on 128 GPUs. In pure model parallelism, Optimus surpasses Megatron-LM in weak scaling efficiency by a great margin, and shows an extraordinary increasing strong scaling efficiency. Optimus would facilitate the scaling of language models and serve as a strong thrust in the space exploration of artificial intelligence.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121742856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exact Fault-Tolerant Consensus with Voting Validity","authors":"Zhangchen Xu, Yuetai Li, Chengli Feng, Lei Zhang","doi":"10.1109/IPDPS54959.2023.00089","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00089","url":null,"abstract":"This paper investigates the multi-valued fault-tolerant distributed consensus problem that pursues exact output. To this end, the voting validity, which requires the consensus output of non-faulty nodes to be the exact plurality of the input of non-faulty nodes, is investigated. Considering a specific distribution of non-faulty votes, we first give the impossibility results and a tight lower bound of system tolerance achieving agreement, termination and voting validity. A practical consensus algorithm that satisfies voting validity in the Byzantine fault model is proposed subsequently. To ensure the exactness of outputs in any non-faulty vote distribution, we further propose safety-critical tolerance and a corresponding protocol that prioritizes voting validity over termination property. To refine the proposed protocols, we propose an incremental threshold algorithm that accelerates protocol operation speed. We also optimize consensus algorithms with the local broadcast model to enhance the protocol’s fault tolerance ability.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123752080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Packet Processing in Container Overlay Networks via Packet-level Parallelism","authors":"Jiaxin Lei, Manish Munikar, Hui Lu, J. Rao","doi":"10.1109/IPDPS54959.2023.00018","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00018","url":null,"abstract":"Overlay networks serve as the de facto network virtualization technique for providing connectivity among distributed containers. Despite the flexibility in building customized private container networks, overlay networks incur significant performance loss compared to physical networks (i.e., the native). The culprit lies in the inclusion of multiple network processing stages in overlay networks, which prolongs the network processing path and overloads CPU cores. In this paper, we propose mFlow, a novel packet steering approach to parallelize the in-kernel data path of network flows. mFlow exploits packet-level parallelism in the kernel network stack by splitting the packets of the same flow into multiple micro-flows, which can be processed in parallel on multiple cores. mFlow devises new, generic mechanisms for flow splitting while preserving in-order packet delivery with little overhead. Our evaluation with both micro-benchmarks and real-world applications demonstrates the effectiveness of mFlow, with significantly improved performance – e.g., by 81% in TCP throughput and 139% in UDP compared to vanilla overlay networks. mFlow even achieved higher TCP throughput than the native (e.g., 29.8 vs. 26.6 Gbps).","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"08 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128278937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Distribution Schemes for Dense Linear Algebra Factorizations on Any Number of Nodes","authors":"Olivier Beaumont, Jean-Alexandre Collin, Lionel Eyraud-Dubois, Mathieu Vérité","doi":"10.1109/IPDPS54959.2023.00047","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00047","url":null,"abstract":"In this paper, we consider the problem of distributing the tiles of a dense matrix onto a set of homogeneous nodes. We consider both the case of non-symmetric (LU) and symmetric (Cholesky) factorizations. The efficiency of the well-known 2D Block-Cyclic (2DBC) distribution degrades significantly if the number of nodes P cannot be written as the product of two close numbers. Similarly, the recently introduced Symmetric Block Cyclic (SBC) distribution is only valid for specific values of P. In both contexts, we propose generalizations of these distributions to adapt them to any number of nodes. We show that this provides improvements to existing schemes (2DBC and SBC) both in theory and in practice, using the flexibility and ease of programming induced by task-based runtime systems like Chameleon and StarPU.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"321 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124546214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Deterministic Gathering with Detection on Arbitrary Graphs: The Power of Many Robots","authors":"A. R. Molla, Kaushik Mondal, W. Moses","doi":"10.1109/IPDPS54959.2023.00015","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00015","url":null,"abstract":"Over the years, much research involving mobile computational entities has been performed. From modeling actual microscopic (and smaller) robots, to modeling software processes on a network, many important problems have been studied in this context. Gathering is one such fundamental problem in this area. The problem of gathering k robots, initially arbitrarily placed on the nodes of an n-node graph, asks that these robots coordinate and communicate in a local manner, as opposed to global, to move around the graph, find each other, and settle down on a single node as fast as possible. A more difficult problem to solve is gathering with detection, where once the robots gather, they must subsequently realize that gathering has occurred and then terminate.In this paper, we propose a deterministic approach to solve gathering with detection for any arbitrary connected graph that is faster than existing deterministic solutions for even just gathering (without the requirement of detection) for arbitrary graphs. In contrast to earlier work on gathering, it leverages the fact that there are more robots present in the system to achieve gathering with detection faster than those previous papers that focused on just gathering. The state of the art solution for deterministic gathering [Ta-Shma and Zwick, TALG, 2014] takes $tilde Oleft({{n^5}log ell }right)$ rounds, where is the smallest label among robots and $tilde O$ hides a polylog factor. We design a deterministic algorithm for gathering with detection with the following trade-offs depending on how many robots are present: (i) when k ≥ ⌊n/2⌋ + 1, the algorithm takes O(n3) rounds, (ii) when k ≥ ⌊n/3⌋ + 1, the algorithm takes O(n4 log n) rounds, and (iii) otherwise, the algorithm takes $tilde Oleft({{n^5}}right)$ rounds. The algorithm is not required to know k, but only n.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121089074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SW-LCM: A Scalable and Weakly-supervised Land Cover Mapping Method on a New Sunway Supercomputer
Yi Zhao, Juepeng Zheng, H. Fu, Wenzhao Wu, Jie Gao, Mengxuan Chen, Jinxiao Zhang, Lixian Zhang, Runmin Dong, Z. Du, Sha Liu, Xin Liu, Shaoqing Zhang, Le Yu
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: 10.1109/IPDPS54959.2023.00071
Abstract: High-resolution land cover mapping (LCM) is an important application for studying and understanding changes of the earth's surface. While deep learning (DL) methods demonstrate great potential in analyzing satellite images, they largely depend on massive high-quality labels. This paper proposes SW-LCM, a scalable and weakly-supervised two-stage land cover mapping method on a new Sunway supercomputer. Our method consists of a k-means clustering module as the first stage and an iterative deep learning module as the second stage. With the k-means module providing a good enough starting point (taking its inaccurate results as noisy labels), the deep learning module improves the classification results iteratively, without any labelling effort required for processing large scenarios. To achieve efficiency for country-level land cover mapping, we design a customized data partitioning scheme and an on-the-fly assembly for k-means. Through careful parallelization and optimization, our k-means module scales to 98,304 computing nodes (over 38 million cores) and provides a sustained performance of 437.56 PFLOPS in a real LCM task covering the entire region of China; the iterative updating part scales to 24,576 nodes, with a performance of 11 PFLOPS. We produce a 10-m resolution land cover map of China with an accuracy of 83.5% (10-class) or 73.2% (25-class), 7% to 8% higher than the best existing products, paving the way for finer land surveys to support sustainability-related applications.
A Novel Triangular Space-Filling Curve for Cache-Oblivious In-Place Transposition of Square Matrices
J. N. F. Alves, L. Russo, Alexandre P. Francisco, S. Benkner
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: 10.1109/IPDPS54959.2023.00045
Abstract: This paper proposes a novel cache-oblivious blocking scheme based on a new triangular space-filling curve that preserves data locality. The proposed blocking scheme reduces the movement of data within the host memory hierarchy for triangular matrix traversals, which inherently exhibit poor data locality, such as the in-place transposition of square matrices. We show that our cache-oblivious blocking scheme can be generated iteratively in linear time and constant memory with regard to the number of entries present in the lower, or upper, triangle of the input matrix. In contrast to classical recursive cache-oblivious solutions, the iterative nature of our blocking scheme does not inhibit other essential optimizations such as software prefetching. To assess the viability of our blocking scheme as a cache-oblivious strategy, we applied it to the in-place transposition of square matrices. Extensive experiments show that our cache-oblivious transposition algorithm generally outperforms the cache-aware state-of-the-art algorithm in terms of throughput and energy efficiency in both sequential and parallel environments.
LowFive: In Situ Data Transport for High-Performance Workflows
T. Peterka, D. Morozov, Arnur Nigmetov, Orcun Yildiz, Bogdan Nicolae, Philip E. Davis
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: 10.1109/IPDPS54959.2023.00102
Abstract: We describe LowFive, a new data transport layer based on the HDF5 data model for in situ workflows. Executables using LowFive can communicate in situ (using in-memory data and MPI message passing), read and write traditional HDF5 files to physical storage, or combine the two modes. Minimal and often no source-code modification is needed for programs that already use HDF5. LowFive maintains deep copies or shallow references of datasets, configurable by the user. More than one task can produce (write) data, and more than one task can consume (read) data, accommodating fan-in and fan-out in the workflow task graph. LowFive supports data redistribution from n producer processes to m consumer processes. We demonstrate the above features in a series of experiments featuring both synthetic benchmarks and a representative use case from a scientific workflow, and we also compare with other data transport solutions in the literature.