{"title":"An Efficient GPU Algorithm for Lattice Boltzmann Method on Sparse Complex Geometries","authors":"Zhangrong Qin;Xusheng Lu;Long Lv;Zhongxiang Tang;Binghai Wen","doi":"10.1109/TPDS.2024.3510810","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3510810","url":null,"abstract":"Many fluid flow problems, such as porous media, arterial blood flow and tissue fluid, contain sparse complex geometries. Although the lattice Boltzmann method handles complex boundaries well, these sparse complex geometries cause low computational performance and high memory consumption when the graphics processing unit (GPU) is used to accelerate the numerical computation. These problems can be addressed by a compact memory layout, sophisticated memory access and enhanced thread utilization. This paper proposes a GPU-based algorithm to improve lattice Boltzmann simulations with sparse complex geometries. An access pattern for a single set of distribution functions together with semi-direct addressing is adopted to reduce memory consumption, while a collected structure of arrays is employed to enhance memory access efficiency. Furthermore, an address index array and a node classification coding scheme are employed to improve the GPU thread utilization ratio and reduce the GPU global memory access, respectively. The accuracy and mesh-independence have been verified by numerical simulations of Poiseuille flow and porous media flow with face-centered filled spheres. The present algorithm has a significantly lower memory consumption than those based on direct or indirect addressing schemes. 
It improves the computational performance by several times compared to the other algorithms on the common GPU hardware.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"239-252"},"PeriodicalIF":5.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Object Proxy Patterns for Accelerating Distributed Applications","authors":"J. Gregory Pauloski;Valerie Hayot-Sasson;Logan Ward;Alexander Brace;André Bauer;Kyle Chard;Ian Foster","doi":"10.1109/TPDS.2024.3511347","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3511347","url":null,"abstract":"Workflow and serverless frameworks have empowered new approaches to distributed application design by abstracting compute resources. However, their typically limited or one-size-fits-all support for advanced data flow patterns leaves optimization to the application programmer—optimization that becomes more difficult as data become larger. The transparent object proxy, which provides wide-area references that can resolve to data regardless of location, has been demonstrated as an effective low-level building block in such situations. Here we propose three high-level proxy-based programming patterns—distributed futures, streaming, and ownership—that make the power of the proxy pattern usable for more complex and dynamic distributed program structures. We motivate these patterns via careful review of application requirements and describe implementations of each pattern. 
We evaluate our implementations through a suite of benchmarks and by applying them in three meaningful scientific applications, in which we demonstrate substantial improvements in runtime, throughput, and memory usage.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"253-265"},"PeriodicalIF":5.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms","authors":"Zhongyi Lin;Ning Sun;Pallab Bhattacharya;Xizhou Feng;Louis Feng;John D. Owens","doi":"10.1109/TPDS.2024.3507814","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3507814","url":null,"abstract":"Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter- and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of deep learning recommendation models (DLRMs) with random configurations with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based natural language processing (NLP) models with a geomean error of 3.00%. 
Moreover, even without actually running ML workloads like DLRMs on the hardware, it is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"226-238"},"PeriodicalIF":5.6,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Slark: A Performance Robust Decentralized Inter-Datacenter Deadline-Aware Coflows Scheduling Framework With Local Information","authors":"Xiaodong Dong;Lihai Nie;Zheli Liu;Yang Xiang","doi":"10.1109/TPDS.2024.3508275","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3508275","url":null,"abstract":"Inter-datacenter network applications generate massive coflows for purposes, e.g., backup, synchronization, and analytics, with deadline requirements. Decentralized coflow scheduling frameworks are desirable for their scalability in cross-domain deployment but grappling with the challenge of information agnosticism for lack of cross-domain privileges. Current information-agnostic coflow scheduling methods are incompatible with decentralized frameworks for relying on centralized controllers to continuously monitor and learn from coflow global transmission states to infer global coflow information. Alternative methods propose mechanisms for decentralized global coflow information gathering and synchronization. However, they require dedicated physical hardware or control logic, which could be impractical for incremental deployment. This article proposes Slark, a decentralized deadline-aware coflow scheduling framework, which meets coflows’ soft and hard deadline requirements using only local traffic information. It eschews requiring global coflow transmission states and dedicated hardware or control logic by leveraging multiple software-implemented scheduling agents working independently on each node and integrating such information agnosticism into node-specific bandwidth allocation by modeling it as a robust optimization problem with flow information on the other nodes represented as uncertain parameters. Subsequently, we validate the performance robustness of Slark by investigating how perturbations in the optimal objective function value and the associated optimal solution are affected by uncertain parameters. 
Finally, we propose a firebug-swarm-optimization-based heuristic algorithm to tackle the non-convexity in our problem. Experimental results demonstrate that Slark can significantly enhance transmission revenue and increase soft and hard deadline guarantee ratios by 10.52% and 7.99% on average.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"197-211"},"PeriodicalIF":5.6,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs Over Heterogeneous Infrastructure","authors":"Zhi Ling;Xiaofeng Jiang;Xiaobin Tan;Huasen He;Shiyin Zhu;Jian Yang","doi":"10.1109/TPDS.2024.3506588","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3506588","url":null,"abstract":"Distributed training of deep neural networks (DNNs) suffers from efficiency declines in dynamic heterogeneous environments, due to the resource wastage brought by the straggler problem in data parallelism (DP) and pipeline bubbles in model parallelism (MP). Additionally, the limited resource availability requires a trade-off between training performance and long-term costs, particularly in online settings. To address these challenges, this article presents a novel online approach to maximize long-term training efficiency in heterogeneous environments through uneven data assignment and communication-aware model partitioning. A group-based hierarchical architecture combining DP and MP is developed to balance discrepant computation and communication capabilities, and offer a flexible parallel mechanism. In order to jointly optimize the performance and long-term cost of the online DL training process, we formulate this problem as a stochastic optimization with time-averaged constraints. By utilizing Lyapunov’s stochastic network optimization theory, we decompose it into several instantaneous sub-optimizations, and devise an effective online solution to address them based on tentative searching and linear solving. 
We have implemented a prototype system and evaluated the effectiveness of our solution based on realistic experiments, reducing batch training time by up to 68.59% over state-of-the-art methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"150-167"},"PeriodicalIF":5.6,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Effective and Low-Latency Data Placement in Edge Environment Based on PageRank-Inspired Regional Value","authors":"Pengwei Wang;Junye Qiao;Yuying Zhao;Zhijun Ding","doi":"10.1109/TPDS.2024.3506625","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3506625","url":null,"abstract":"Edge storage offers low-latency services to users. However, due to strained edge resources and high costs, enterprises must choose the data that most warrant placement at the edge and place it in the right location. In practice, data exhibit temporal and spatial properties, and variability, which have a significant impact on their placement, but have been largely ignored in research. To address this, we introduce the concept of data temperature, which considers data characteristics over time and space. To consider the influence of spatial relevance among different regions for placing data, inspired by PageRank, we present a model using data temperature to assess the regional value of data, which effectively leverages collaboration within the edge storage system. We also propose a regional value-based algorithm (RVA) that minimizes cost while meeting user response time requirements. By taking into account the correlation between regions, the RVA can achieve lower latency than current methods when creating an equal or even smaller number of replicas. 
Experimental results validate the efficacy of the proposed method in terms of latency, success rate, and cost efficiency.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"185-196"},"PeriodicalIF":5.6,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DegaFL: Decentralized Gradient Aggregation for Cross-Silo Federated Learning","authors":"Jialiang Han;Yudong Han;Xiang Jing;Gang Huang;Yun Ma","doi":"10.1109/TPDS.2024.3501581","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3501581","url":null,"abstract":"Federated learning (FL) is an emerging promising paradigm of privacy-preserving machine learning (ML). An important type of FL is cross-silo FL, which enables a moderate number of organizations to cooperatively train a shared model by keeping confidential data locally and aggregating gradients on a central parameter server. However, the central server may be vulnerable to malicious attacks or software failures in practice. To address this issue, in this paper, we propose DegaFL, a novel decentralized gradient aggregation approach for cross-silo FL. DegaFL eliminates the central server by aggregating gradients on each participant, and maintains and synchronizes gradients of only the current training round. Besides, we propose AdaAgg to adaptively aggregate correct gradients from honest nodes and use HotStuff to ensure the consistency of the training round number and gradients among all nodes. 
Experimental results show that DegaFL defends against common threat models with minimal accuracy loss, and achieves up to 50× reduction in storage overhead and up to 13× reduction in network overhead, compared to state-of-the-art decentralized FL approaches.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"212-225"},"PeriodicalIF":5.6,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Dimensional Balanced Partitioning and Efficient Caching for Distributed Graph Analysis","authors":"Shuai Lin;Rui Wang;Yongkun Li;Yinlong Xu;John C. S. Lui","doi":"10.1109/TPDS.2024.3501292","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3501292","url":null,"abstract":"Distributed graph analysis usually partitions a large graph into multiple small-sized subgraphs and distributes them into a cluster of machines for computing. Therefore, graph partitioning plays a crucial role in distributed graph analysis. However, the widely used existing graph partitioning schemes balance only in one dimension (number of edges or vertices) or incur a large number of edge cuts, so they degrade the performance of distributed graph analysis. In this article, we propose a novel graph partition scheme BPart and two enhanced algorithms BPart-C and BPart-S to achieve a balanced partition for both vertices and edges, and also reduce the number of edge cuts. Besides, we also propose a neighbor-aware caching scheme to further reduce the number of edge cuts so as to improve the efficiency of distributed graph analysis. Our experimental results show that BPart-C and BPart-S can achieve a better balance in both dimensions (the number of vertices and edges), and meanwhile reducing the number of edge cuts, compared to multiple existing graph partitioning algorithms, i.e., Chunk-V, Chunk-E, Fennel, and Hash. We also integrate these partitioning algorithms into two popular distributed graph systems, KnightKing and Gemini, to validate their impact on graph analysis efficiency. Results show that both BPart-C and BPart-S can significantly reduce the total running time of various graph applications by up to 60% and 70%, respectively. 
In addition, the neighbor-aware caching scheme can further improve the performance by up to 24%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"133-149"},"PeriodicalIF":5.6,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spreeze: High-Throughput Parallel Reinforcement Learning Framework","authors":"Jing Hou;Guang Chen;Ruiqi Zhang;Zhijun Li;Shangding Gu;Changjun Jiang","doi":"10.1109/TPDS.2024.3497986","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3497986","url":null,"abstract":"The promotion of large-scale applications of reinforcement learning (RL) requires efficient training computation. While existing parallel RL frameworks encompass a variety of RL algorithms and parallelization techniques, the excessively burdensome communication frameworks hinder the attainment of the hardware's limit for final throughput and training effects on a single desktop. In this article, we propose Spreeze, a lightweight parallel framework for RL that efficiently utilizes a single desktop hardware resource to approach the throughput limit. We asynchronously parallelize the experience sampling, network update, performance evaluation, and visualization operations, and employ multiple efficient data transmission techniques to transfer various types of data between processes. The framework can automatically adjust the parallelization hyperparameters based on the computing ability of the hardware device in order to perform efficient large-batch updates. Based on the characteristics of the “Actor-Critic” RL algorithm, our framework uses dual GPUs to independently update the network of actors and critics in order to further improve throughput. Simulation results show that our framework can achieve up to 15,000 Hz experience sampling and 370,000 Hz network update frame rate using only a personal desktop computer, which is an order of magnitude higher than other mainstream parallel RL frameworks, resulting in a 73% reduction of training time. 
Our work on fully utilizing the hardware resources of a single desktop computer is fundamental to enabling efficient large-scale distributed RL training.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"282-292"},"PeriodicalIF":5.6,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Service Demand Variability on Data Center Performance","authors":"Diletta Olliaro;Adityo Anggraito;Marco Ajmone Marsan;Simonetta Balsamo;Andrea Marin","doi":"10.1109/TPDS.2024.3497792","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3497792","url":null,"abstract":"Modern data centers feature an extensive array of cores that handle quite a diverse range of jobs. Recent traces, shared by leading cloud data center enterprises like Google and Alibaba, reveal that the constant increase in data center services and computational power is accompanied by a growing variability in service demand requirements. The number of cores needed for a job can vary widely, ranging from one to several thousands, and the number of seconds a core is held by a job can span more than five orders of magnitude. In this context of extreme variability, the policies governing the allocation of cores to jobs play a crucial role in the performance of data centers. It is widely acknowledged that the First-In First-Out (FIFO) policy tends to underutilize available computing capacity due to the varying magnitudes of core requests. However, the impact of the extreme variability in service demands on job waiting and response times, that has been deeply investigated in traditional queuing models, is not as well understood in the case of data centers, as we will show. To address this issue, we investigate the dynamics of a data center cluster through analytical models in simple cases, and discrete event simulations based on real data. Our findings emphasize the significant impact of service demand variability, both in terms of requested cores and service times, and allow us to provide insight for enhancing data center performance. 
In particular, we show how data center performance can be improved thanks to the control of the interplay between service and waiting times through the assignment of cores to jobs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"120-132"},"PeriodicalIF":5.6,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10753043","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}