{"title":"Generalizable Reinforcement Learning-Based Coarsening Model for Resource Allocation over Large and Diverse Stream Processing Graphs","authors":"Lanshun Nie, Yuqi Qiu, Fei Meng, Mo Yu, Jing Li","doi":"10.1109/IPDPS54959.2023.00051","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00051","url":null,"abstract":"Resource allocation for stream processing graphs on computing devices is critical to the performance of stream processing. Efficient allocations need to balance workload distribution and minimize communication simultaneously and globally. Since this problem is known to be NP-complete, recent machine learning solutions were proposed based on an encoder-decoder framework, which predicts the device assignment of computing nodes sequentially as an approximation. However, for large graphs, these solutions suffer from the deficiency in handling long-distance dependency and global information, resulting in suboptimal predictions. This work proposes a new paradigm to deal with this challenge, which first coarsens the graph and conducts assignments on the smaller graph with existing graph partitioning methods. Unlike existing graph coarsening works, we leverage the theoretical insights in this resource allocation problem, formulate the coarsening of stream graphs as edge-collapsing predictions, and propose an edge-aware coarsening model. 
Extensive experiments on various datasets show that our framework significantly improves over existing learning-based and heuristic-based baselines with up to 56% relative improvement on large graphs.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114133003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Harnessing the Crowd for Autotuning High-Performance Computing Applications","authors":"Younghyun Cho, J. Demmel, Jacob King, X. Li, Yang Liu, Hengrui Luo","doi":"10.1109/IPDPS54959.2023.00069","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00069","url":null,"abstract":"This paper presents GPTuneCrowd, a crowd-based autotuning framework for tuning high-performance computing applications. GPTuneCrowd collects performance data from various users using a user-friendly tuner interface. GPTuneCrowd then presents novel autotuning techniques, based on transfer learning and parameter sensitivity analysis, to maximize tuning quality using collected data from the crowd. This paper shows several real-world case studies of GPTuneCrowd. Our evaluation shows that GPTuneCrowd’s transfer learning improves the tuned performance of ScaLAPACK’s PDGEQRF by 1.57x and a plasma fusion code NIMROD by 2.97x, over a non-transfer learning autotuner. We use GPTuneCrowd’s sensitivity analysis to reduce the search space of SuperLU_DIST and Hypre. Tuning on the reduced search space achieves 1.17x and 1.35x better tuned performance of SuperLU_DIST and Hypre, respectively, compared to the original search space.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127218390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Smart Redbelly Blockchain: Reducing Congestion for Web3","authors":"Deepal Tennakoon, Yiding Hua, V. Gramoli","doi":"10.1109/IPDPS54959.2023.00098","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00098","url":null,"abstract":"Decentralization promises to remedy the drawbacks of the web by executing decentralized applications (DApps) on blockchains. Unfortunately, modern blockchains cannot support realistic web application workloads mainly due to congestion.We introduce the Smart Redbelly Blockchain (SRBB), a provably correct permissionless blockchain that reduces congestion by (1) avoiding redundant propagation and validations of transactions with Transaction Validation and Propagation Reduction (TVPR) and (2) mitigating the propagation of invalid transactions within blocks by Byzantine nodes with a dedicated Reward-Penalty Mechanism (RPM). Our comparison of SRBB against Algorand, Avalanche, Diem, Ethereum, Quorum, and Solana, using the DIABLO benchmark suite, indicates that SRBB outperforms all these blockchains under real application workloads. Moreover, SRBB is the only blockchain to successfully execute real workloads of NASDAQ and Uber on a DApp without losing transactions. To demonstrate that TVPR and RPM are the causes of the improved performance, we compare SRBB with its naive baseline, which does not contain TVPR and RPM. Our results show that TVPR increases the throughput by 55× and divides the latency by 3.5, while RPM increases the throughput by 7% under flooding attacks. 
Finally, TVPR helps reduce transaction losses in the normal scenario while RPM goes further and mitigates transaction losses under flooding attacks.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114188769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Cloud Computing Resource Usage for Hemodynamic Simulation","authors":"William Ladd, Christopher W Jensen, M. Vardhan, Jeff Ames, J. Hammond, E. Draeger, A. Randles","doi":"10.1109/IPDPS54959.2023.00063","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00063","url":null,"abstract":"Cloud computing resources are becoming an increasingly attractive option for simulation workflows but require users to assess a wider variety of hardware options and associated costs than required by traditional in-house hardware or fixed allocations at leadership computing facilities. The pay-as-you-go model used by cloud providers gives users the opportunity to make more nuanced cost-benefit decisions at runtime by choosing hardware that best matches a given workload, but creates the risk of suboptimal allocation strategies or inadvertent cost overruns. In this work, we propose the use of an iteratively-refined performance model to optimize cloud simulation campaigns against overall cost, throughput, or maximum time to solution. Hemodynamic simulations represent an excellent use case for these assessments, as the relative costs and dominant terms in the performance model can vary widely with hardware, numerical parameters and physics models. 
Performance and scaling behavior of hemodynamic simulations on multiple cloud services as well as a traditional compute cluster are collected and evaluated, and an initial performance model is proposed along with a strategy for dynamically refining it with additional experimental data.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130813785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FIRST: Exploiting the Multi-Dimensional Attributes of Functions for Power-Aware Serverless Computing","authors":"Lu Zhang, C. Li, Xinkai Wang, Weiqi Feng, Zheng Yu, Quan Chen, Jingwen Leng, Minyi Guo, Pu Yang, Shang Yue","doi":"10.1109/IPDPS54959.2023.00091","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00091","url":null,"abstract":"Emerging cloud-native development models raise new challenges for managing server performance and power at microsecond scale. Compared with traditional cloud workloads, serverless functions exhibit unprecedented heterogeneity, variability, and dynamicity. Designing cloud-native power management schemes for serverless functions requires significant engineering effort. Current solutions remain sub-optimal since their orchestration process is often one-sided, lacking a systematic view. A key obstacle to truly efficient function deployment is the fundamental wide abstraction gap between the upper-layer request scheduling and the low-level hardware execution.In this work, we show that the optimal operating point (OOP) for energy efficiency cannot be attained without synthesizing the multi-dimensional attributes of functions. We present FIRST, a novel mechanism that enables servers to better orchestrate serverless functions. The key feature of FIRST is that it leverages a lightweight Internal Representation and meta-Scheduling (IRS) layer for collecting the maximum potential revenue from the servers. Specifically, FIRST follows a pipeline-style workflow. Its frontend components aim to analyze functions from different angles and expose their key features to the system. Meanwhile, its backend components are able to make informed function assignment decisions to avoid OOP divergence. We further demonstrate the way to create extensions based on FIRST to enable versatile cloud-native power management. In total, our design constitutes a flexible management layer that supports power-aware function deployment. 
We show that FIRST could allow 94% functions to be processed under the OOP, which brings up to 24% energy efficiency improvements.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125484185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Message from the IPDPS 2023 Program Chairs","authors":"","doi":"10.1109/ipdps54959.2023.00006","DOIUrl":"https://doi.org/10.1109/ipdps54959.2023.00006","url":null,"abstract":"","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127699535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Satellite Collision Detection using Spatial Data Structures","authors":"C. Hellwig, Fabian Czappa, Martin Michel, R. Bertrand, F. Wolf","doi":"10.1109/IPDPS54959.2023.00078","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00078","url":null,"abstract":"In recent years, the number of artificial objects in Earth orbit has increased rapidly due to lower launch costs and new applications for satellites. More and more governments and private companies are discovering space for their own purposes. Private companies are using space as a new business field, launching thousands of satellites into orbit to offer services like worldwide Internet access. Consequently, the probability of collisions and, thus, the degradation of the orbital environment is rapidly increasing. To avoid devastating collisions at an early stage, efficient algorithms are required to identify satellites approaching each other. Traditional deterministic filter-based conjunction detection algorithms compare each satellite to every other satellite and pass them through a chain of orbital filters. Unfortunately, this leads to a runtime complexity of O(n2). In this paper, we propose two alternative approaches that rely on spatial data structures and thus allow us to exploit modern hardware’s parallelism efficiently. Firstly, we introduce a purely grid-based variant that relies on non-blocking atomic hash maps to identify conjunctions. Secondly, we present a hybrid method that combines this approach with traditional filter chains. Both implementations make it possible to identify conjunctions in a large population with millions of satellites with high precision in a comparatively short time. 
While the grid-based variant is characterized by lower memory consumption, the hybrid variant is faster if enough memory is available.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129819452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedBIAD: Communication-Efficient and Accuracy-Guaranteed Federated Learning with Bayesian Inference-Based Adaptive Dropout","authors":"Jingjing Xue, Min Liu, Sheng Sun, Yuwei Wang, Hui Jiang, Xue Jiang","doi":"10.1109/IPDPS54959.2023.00056","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00056","url":null,"abstract":"Federated Learning (FL) emerges as a distributed machine learning paradigm without end-user data transmission, effectively avoiding privacy leakage. Participating devices in FL are usually bandwidth-constrained, and the uplink is much slower than the downlink in wireless networks, which causes a severe uplink communication bottleneck. A prominent direction to alleviate this problem is federated dropout, which drops fractional weights of local models. However, existing federated dropout studies focus on random or ordered dropout and lack theoretical support, resulting in unguaranteed performance. In this paper, we propose Federated learning with Bayesian Inference-based Adaptive Dropout (FedBIAD), which regards weight rows of local models as probability distributions and adaptively drops partial weight rows based on importance indicators correlated with the trend of local training loss. By applying FedBIAD, each client adaptively selects a high-quality dropping pattern with accurate approximations and only transmits parameters of non-dropped weight rows to mitigate uplink costs while improving accuracy. Theoretical analysis demonstrates that the convergence rate of the average generalization error of FedBIAD is minimax optimal up to a squared logarithmic factor. 
Extensive experiments on image classification and next-word prediction show that compared with status quo approaches, FedBIAD provides 2× uplink reduction with an accuracy increase of up to 2.41% even on non-Independent and Identically Distributed (non-IID) data, which brings up to 72% decrease in training time.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129357731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Porting a Computational Fluid Dynamics Code with AMR to Large-scale GPU Platforms","authors":"J. H. Davis, Justin Shafner, Daniel Nichols, N. Grube, P. Martin, A. Bhatele","doi":"10.1109/IPDPS54959.2023.00066","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00066","url":null,"abstract":"Accurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, and applies to atmospheric flight, supersonic combustion, materials discovery and climate prediction. In this paper, we describe our experiences in extending the capabilities of and modernizing CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using a highly-scalable AMR library, AMReX, and add support for a fully curvilinear solver. We also port the computational kernels in CRoCCo to GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo v2.0, on the Summit system, demonstrating a 6× to 44× speedup over the CPU-only version.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130949397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"H-Cache: Traffic-Aware Hybrid Rule-Caching in Software-Defined Networks","authors":"Zeyu Luan, Qing Li, Yi Wang, Yong Jiang","doi":"10.1109/IPDPS54959.2023.00017","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00017","url":null,"abstract":"Ternary Content Addressable Memory (TCAM) is an essential hardware component in SDN-enabled switches, which supports fast lookup speed and flexible matching patterns. However, TCAM’s limited storage capacity has long been a scalability challenge to enforce fine-grained forwarding policies in SDN. Based on the observation of traffic locality, the rule-caching mechanism employs a combination of TCAM and Random Access Memory (RAM) to maintain the forwarding rules of large and small flows, respectively. However, previous works cannot identify large flows timely and accurately, and suffer from high computational complexity when addressing rule dependencies in TCAM. Worse still, TCAM only caches the forwarding rules of large flows but ignores the latency requirements of small flows. Small flows encounter cache-miss in TCAM and then will be diverted to RAM, where they have to experience slow lookup processes. To jointly optimize the performance of both high-throughput large flows and latency-sensitive small flows, we propose a hybrid rule-caching framework, H-Cache, to scale traffic-aware forwarding policies in SDN. H-Cache identifies large flows through a collaboration of learning-based and threshold-based methods to achieve early detection and high accuracy, and proposes a time-efficient greedy heuristic to address rule dependencies. For small flows, H-Cache establishes default paths in TCAM to speed up their lookup processes, and also reduces their TCAM occupancy through label switching and region partitioning. 
Experiments with both real-world and synthetic datasets demonstrate that H-Cache increases TCAM utilization by an average of 11% and reduces the average completion time of small flows by almost 70%.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114692561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}