{"title":"Alias-Chain: Improving Blockchain Scalability via Exploring Content Locality among Transactions","authors":"Jintong Liu, Shenggang Wan, Xubin He","doi":"10.1109/ipdps53621.2022.00122","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00122","url":null,"abstract":"A Blockchain is a promising infrastructure but it has serious scalability problems, i.e., long block synchronization time and high storage cost. Conventional coarse-grained data deduplication schemes (block or file level) are proved to be ineffective on this problem. Based on comprehensive analysis on typical blockchain workloads, we are the first to propose two new locality concepts: economic and argument locality. To further explore these new localities, we propose a novel fine-grained data deduplication scheme (transaction level) named Alias-Chain to improve the scalability of blockchains. Specifically, Alias-Chain replaces frequently used data, e.g., smart contract arguments, with much shorter aliases to reduce the block size. During propagation and preservation of blocks, smaller blocks result in both shorter synchronization time and lower storage cost. Simulation results show the average transfer and SC-call transaction sizes can be reduced by up to 11.23% and 43.23% in native Ethereum, and up to 61.95% and 77.54% in Ethereum optimized by state-of-the-art techniques, respectively. Prototyping-based experiments are further conducted on a testbed consisting of up to 3200 miners. 
The results demonstrate the effectiveness and efficiency of Alias-Chain on reducing block synchronization time and storage cost under typical real-world workloads.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128670417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Low-Latency Inter-FPGA Networks","authors":"K. Pham, Truong Thao Nguyen, Hiroshi Yamaguchi, Y. Urino, M. Koibuchi","doi":"10.1109/ipdps53621.2022.00031","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00031","url":null,"abstract":"A cutting-edge FPGA card can be equipped with many high-bandwidth I/Os by means of high-density optical integration, e.g., onboard Si-photonics transceivers, to provide high network bandwidth for memory-to-memory inter-FPGA communication. This study presents its scalable switchless network architecture by exploiting an indirect path, consisting of two one-hop paths, for enabling a diameter-2 network topology. It then takes a Kautz network topology with a diameter of two for connecting d(d + 1) FPGAs with a degree of d, which is close to the theoretical upper bound. The Kautz network topologies have bi-directional links and uni-directional links which form triangles. Uni-directional links introduce difficulty in avoiding channel buffer overflow because the existing link-level flow control assumes a bi-directional link. This study presents an indirect flow control along a uni-directional triangle embedded in the Kautz network topology. It then develops a combination of unicasts that forms multi-port collective communications to mitigate the influence of the startup latency on the execution time. Since a high-degree FPGA card introduces difficulty in storing many I/O ports at the panel of a 1-U compute server, we propose using WDM (Wavelength Division Multiplexing) as an alternative and present its efficient mapping onto arrayed waveguide grating (AWG). The required number of wavelengths becomes d on d + 1 AWG equipments. 
Based on our experimental results with OPTWEB of custom Stratix10 FPGA cards, SimGrid simulation results show that our collective communication is 7× faster than that of Dragonfly with 272 FPGAs.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"92 12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132447142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mnemonic: A Parallel Subgraph Matching System for Streaming Graphs","authors":"Bibek Bhattarai, Huimin Huang","doi":"10.48550/arXiv.2206.09983","DOIUrl":"https://doi.org/10.48550/arXiv.2206.09983","url":null,"abstract":"Finding patterns in large highly connected datasets is critical for value discovery in business development and scientific research. This work focuses on the problem of subgraph matching on streaming graphs, which provides utility in a myriad of real-world applications ranging from social network analysis to cybersecurity. Each application poses a different set of control parameters, including the restrictions for a match, type of data stream, and search granularity. The problem-driven design of existing subgraph matching systems makes them challenging to apply for different problem domains. This paper presents Mnemonic, a programmable system that provides a high-level API and democratizes the development of a wide variety of subgraph matching solutions. Importantly, Mnemonic also delivers key data management capabilities and optimizations to support real-time processing on long-running, high-velocity multi-relational graph streams. The experiments demonstrate the versatility of Mnemonic, as it outperforms several state-of-the-art systems by up to two orders of magnitude.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116105840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Colza: Enabling Elastic In Situ Visualization for High-performance Computing Simulations","authors":"Matthieu Dorier, Zhe Wang, Utkarsh Ayachit, S. Snyder, R. Ross, M. Parashar","doi":"10.1109/ipdps53621.2022.00059","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00059","url":null,"abstract":"In situ analysis and visualization have grown increasingly popular for enabling direct access to data from high-performance computing (HPC) simulations. As a simulation progresses and interesting physical phenomena emerge, however, the data produced may become increasingly complex, and users may need to dynamically change the type and scale of in situ analysis tasks being carried out and consequently adapt the amount of resources allocated to such tasks. To date, none of the production in situ analysis frameworks offer such an elasticity feature, and for good reason: the assumption that the number of processes could vary during run time would force developers to rethink software and algorithms at every level of the in situ analysis stack. In this paper we present Colza, a data staging service with elastic in situ visualization capabilities. Colza relies on the widely used ParaView Catalyst in situ visualization framework and enables elasticity by replacing MPI with a custom collective communication library based on the Mochi suite of libraries. 
To the best of our knowledge, this work is the first to enable elastic in situ visualization capabilities for HPC applications on top of existing production analysis tools.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122249117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Task Shaping for High Throughput Data Analysis Applications in High Energy Physics","authors":"Benjamín Tovar, Ben Lyons, K. Mohrman, Barry Sly-Delgado, K. Lannon, D. Thain","doi":"10.1109/ipdps53621.2022.00041","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00041","url":null,"abstract":"Distributed data analysis frameworks are widely used for processing large datasets generated by instruments in scientific fields such as astronomy, genomics, and particle physics. Such frameworks partition petabyte-size datasets into chunks and execute many parallel tasks to search for common patterns, locate unusual signals, or compute aggregate properties. When well-configured, such frameworks make it easy to churn through large quantities of data on large clusters. However, configuring frameworks presents a challenge for end users, who must select a variety of parameters such as the blocking of the input data, the number of tasks, the resources allocated to each task, and the size of nodes on which they run. If poorly configured, the result may perform many orders of magnitude worse than optimal, or the application may even fail to make progress at all. Even if a good configuration is found through painstaking observations, the performance may change drastically when the input data or analysis kernel changes. This paper considers the problem of automatically configuring a data analysis application for high energy physics (TopEFT) built upon standard frameworks for physics analysis (Coffea) and distributed tasking (Work Queue). 
We observe the inherent variability within the application, demonstrate the problems of poor configuration, and then develop several techniques for automatically sizing tasks to meet goals of resource consumption, and overall application completion.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125582811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"pFedGF: Enabling Personalized Federated Learning via Gradient Fusion","authors":"Xinghao Wu, Jianwei Niu, Xuefeng Liu, Tao Ren, Zhangmin Huang, Zhetao Li","doi":"10.1109/ipdps53621.2022.00068","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00068","url":null,"abstract":"Data heterogeneity is one of the main challenges faced by federated learning (FL). Unlike traditional FL methods (e.g. FedAvg) which train a global model for all clients, personalized federated learning (PFL) can address the above problem by training a personalized model for each client. Current mainstream PFL researches first obtain a global model through collaborative training among all clients and then fine-tune the global model on each client's local data to obtain personalized models. However, this two-staged approach has a drawback: when the heterogeneity of different clients is large, the obtained final global model can deviate from the distributions of all clients, and therefore is not a good starting point for updating personalized models. In this paper, we propose pFedGF, a new PFL method based on gradient fusion. Different from traditional two-staged PFL, in each round of pFedGF, each client maintains two gradients simultaneously, a global gradient to capture information from all clients, and a local gradient that reflects the specific distribution of each client. The two gradients are fused to obtain the updated direction of the personalized model for each client. We carried out experiments on MNIST, FMNIST, and CIFAR-10 datasets. 
The results demonstrate that in the presence of data heterogeneity, pFedGF outperforms other PFL methods.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130211628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization","authors":"Qinglei Cao, Rabab Alomairy, Yu Pei, G. Bosilca, H. Ltaief, D. Keyes, J. Dongarra","doi":"10.1109/ipdps53621.2022.00047","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00047","url":null,"abstract":"We present a general framework that couples the PaRSEC runtime system and the HiCMA numerical library to solve challenging 3D data-sparse problems. Though formally dense, many matrix operators possess a rank structured property that can be exploited during the most time-consuming computational phase, i.e., the matrix factorization. In particular, this work highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by compressing the dense operator. Using Tile Low-Rank (TLR) approximation, our approach consists in capturing the most significant information in each tile of the matrix using a threshold which satisfies the application's accuracy requirements. Matrix operations are performed on the compressed data layout, reducing memory footprint and algorithmic complexity. Our proposed software solution accommodates a range of traditional data structures of linear algebra, i.e., from dense and data-sparse to sparse, within a single matrix operation. Separation of concerns is at the heart: hardware-agnostic implementation, asynchronous execution with a dynamic runtime system, and high performance numerical kernels, to prepare scientific applications to embrace exascale opportunities. This ambition necessitates extensions to PaRSEC that incorporate information related to data structure and rank distribution into the runtime decision-making. 
We introduce two runtime optimizations to address the challenges encountered when confronted with a large rank disparity: (1) a trimming procedure performed at runtime to cut away data dependencies from the directed acyclic graph discovered to be no longer required after compression and (2) a rank-aware diamond-shaped data distribution to mitigate the load imbalance overheads, reduce data movement, and conserve memory footprint. We assess our implementation using 3D unstructured mesh deformation based on Radial Basis Function (RBF) interpolation. We report performance results on two different high-performance supercomputers and compare against existing state-of-the-art implementation. Our implementation shows up to 7-fold on Shaheen II and 9-fold on Fugaku performance superiority in situations where the 3D unstructured mesh deformation application renders a matrix operator with low density. Our software framework solves a formally dense 3D problem with 52M mesh points on 65K cores in about half an hour. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system, by synergistically bridging matrix algebra libraries with scientific applications.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115628150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelizing and Balancing Coupled DSMC/PIC for Large-scale Particle Simulations","authors":"Haozhong Qiu, Chuanfu Xu, Dali Li, Haoyu Wang, Jie Li, Z. Wang","doi":"10.1109/ipdps53621.2022.00045","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00045","url":null,"abstract":"In high-performance and parallel computing, an important application class is particle simulation. Due to massive particle migration among distributed simulation workers across simulation iterations, achieving balanced runtime work distribution is vital for accelerating large-scale realistic particle simulations. This paper proposes a novel approach to enable dynamic load balance for distributed numerical particle simulations, specifically targeting the latest coupled DSMC/PIC method. Unlike prior work, our approach adopts a dual, nested unstructured grid organization to facilitate coupled DSMC/PIC computation and runtime grid distribution. Our implementation leverages both centralized and distributed communication strategies to dynamically migrate particles among arbitrary parallel processes. It then employs a load balancer - driven by a carefully designed analytical model and a grid remapping mechanism - to dynamically redistribute the simulation workloads among parallel simulation workers. By constantly monitoring and redistributing the simulation work across workers, our approach can adapt to the change of particle distribution across simulation iterations, avoiding a few workers becoming the performance bottleneck of the entire simulation process. We integrate our techniques into a coupled DSMC/PIC solver and apply them to simulate the plasma plume with hydrogen atoms and ions. 
Experimental results show that our approach can scale well up to 1500+ processes with billions of particles, exhibiting the state-of-the-art parallel simulation scalability and efficiency for plasma plume simulation.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122753508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A self-stabilizing 2-minimal dominating set algorithm based on loop composition in networks of girth at least 7","authors":"Syohei Maruyama, Y. Sudo, S. Kamei, H. Kakugawa","doi":"10.1109/ipdps53621.2022.00114","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00114","url":null,"abstract":"We propose a silent self-stabilizing asynchronous distributed algorithm to find a 2-minimal dominating set (2-MDS) in networks of girth at least 7. Given a graph <tex>$G=(V, E)$</tex>, a 2-MDS of <tex>$G$</tex> is a minimal dominating set <tex>$D\\subseteq V$</tex> such that <tex>$D\\backslash \\{p_{i},p_{j}\\}\\cup\\{p_{z}\\}$</tex> is not a dominating set for any nodes <tex>$p_{i},p_{j}\\in D (p_{i}\\neq p_{j})$</tex> and <tex>$p_{z} \\notin D$</tex>. The girth is the length of the shortest cycles in the graph. We assume that the processes have unique identifiers. The proposed algorithm constructs a 2-MDS in the networks of girth at least 7 under the weakly fair distributed daemon. The time complexity is <tex>$O(nH)$</tex> rounds, and the space complexity is <tex>$O(\\log n)$</tex> bits per process, where <tex>$n$</tex> is the number of processes and <tex>$H$</tex> is the diameter of the network.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132372482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}