{"title":"Design and Performance Evaluation of Linearly Extensible Cube-Triangle Network for Multicore Systems","authors":"Savita Gautam;Abdus Samad;Mohammad S. Umar","doi":"10.1109/TPDS.2024.3486219","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3486219","url":null,"abstract":"High-performance interconnection networks are currently used to design massively parallel computers. Selecting the set of nodes on which parallel tasks execute plays a vital role in the performance of such systems. When deployed to run large parallel applications, these networks suffer from communication latencies that ultimately reduce system throughput. Mesh and Torus are primary examples of topologies used in such systems; however, they are being replaced with more efficient, albeit more complex, hybrid topologies such as ZMesh and x-Folded TM networks. This paper presents a new topology named Linearly Extensible Cube-Triangle (LECΔ), which focuses on low latency, shorter average distance, and improved throughput. It is symmetrical in nature and exhibits the desirable properties of similar networks with lower complexity and cost. For an N × N network, the LECΔ topology has lower network latency than Mesh, ZMesh, Torus, and x-Folded networks. The proposed LECΔ network achieves reduced average distance, diameter, and cost, along with a high bisection width and good scalability. The simulation results show that the performance of the LECΔ network is comparable to that of Mesh, ZMesh, Torus, and x-Folded networks. The results verify the efficiency of the LECΔ network as evaluated and compared with similar networks.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2596-2607"},"PeriodicalIF":5.6,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications","authors":"Yuyang Jin;Haojie Wang;Xiongchao Tang;Zhenhua Guo;Yaqian Zhao;Torsten Hoefler;Tao Liu;Xu Liu;Jidong Zhai","doi":"10.1109/TPDS.2024.3485789","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3485789","url":null,"abstract":"It is challenging to scale parallel applications to modern supercomputers because of load imbalance, resource contention, and communication between processes. Profiling and tracing are the two main performance analysis approaches for detecting these scalability bottlenecks. Profiling is low-cost but lacks the detailed dependence information needed to identify root causes. Tracing records plentiful information but incurs significant overhead. To address these issues, we present <sc>ScalAna</sc>, which employs static analysis techniques to combine the benefits of profiling and tracing: it enables the analyzability of tracing with overhead similar to that of profiling. <sc>ScalAna</sc> uses static analysis to capture the program structures and data dependence of parallel applications, and leverages lightweight profiling to record performance data at runtime. A parallel performance graph is then generated from both static and dynamic data. Based on this graph, we design a backtracking detection approach to automatically pinpoint the root causes of scaling issues. We evaluate the efficacy and efficiency of <sc>ScalAna</sc> using several real applications with up to 704K lines of code and demonstrate that our approach can effectively pinpoint the root causes of scaling loss with an average overhead of 5.65% for up to 16,384 processes. Fixing the root causes detected by our tool yields up to 33.01% performance improvement.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"308-325"},"PeriodicalIF":5.6,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Breaking the Memory Wall for Heterogeneous Federated Learning via Model Splitting","authors":"Chunlin Tian;Li Li;Kahou Tam;Yebo Wu;Cheng-Zhong Xu","doi":"10.1109/TPDS.2024.3480115","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3480115","url":null,"abstract":"Federated Learning (FL) enables multiple devices to collaboratively train a shared model while preserving data privacy. Ever-increasing model complexity, coupled with limited memory resources on the participating devices, severely bottlenecks the deployment of FL in real-world scenarios. Thus, a framework that can effectively break the memory wall while jointly accounting for the hardware and statistical heterogeneity in FL is urgently required. In this article, we propose <italic>SmartSplit</italic>, a framework that effectively reduces the memory footprint on the device side while guaranteeing training progress and model accuracy for heterogeneous FL through model splitting. Towards this end, <italic>SmartSplit</italic> employs a hierarchical structure to adaptively guide the overall training process. In each training round, the central manager, hosted on the server, dynamically selects the participating devices and sets the cutting layer by jointly considering the memory budget, training capacity, and data distribution of each device. The MEC manager, deployed within the edge server, proceeds to split the local model and perform training of the server-side portion. Meanwhile, it fine-tunes the splitting points based on their time-evolving statistical importance. The on-device manager, embedded inside each mobile device, continuously monitors the local training status while employing cost-aware checkpointing to match the runtime dynamic memory budget. Extensive experiments on representative datasets are conducted on commercial off-the-shelf mobile device testbeds. The experimental results show that <italic>SmartSplit</italic> excels in FL training on highly memory-constrained mobile SoCs, offering up to a 94% peak latency reduction and 100-fold memory savings. It enhances accuracy by 1.49%-57.18% and adaptively adjusts to dynamic memory budgets through cost-aware recomputation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2513-2526"},"PeriodicalIF":5.6,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitosis: A Scalable Sharding System Featuring Multiple Dynamic Relay Chains","authors":"Keyuan Wang;Linpeng Jia;Zhaoxiong Song;Yi Sun","doi":"10.1109/TPDS.2024.3480223","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3480223","url":null,"abstract":"Sharding is a prevalent approach for addressing performance issues in blockchain. To reduce governance complexity and ensure system security, a common practice involves a relay chain to coordinate cross-shard transactions. However, as the number of shards and cross-shard transactions grows, the single relay chain typically becomes a performance bottleneck and scales poorly, making the relay chain's scalability vital for sharding systems. To solve this, we propose <italic>Mitosis</italic>, the first multi-relay architecture to improve the relay chain's scalability by sharding the relay chain itself. Our proposed relay sharding algorithm dynamically adjusts the number of relays or optimizes the topology between relays and shards to adaptively scale up the relay chain's performance. Furthermore, to guarantee the security of the multi-relay architecture, a new validator reconfiguration scheme is designed, accompanied by a comprehensive security analysis of <italic>Mitosis</italic>. Through simulation experiments on two mainstream relay chain paradigms, we demonstrate that <italic>Mitosis</italic> can achieve high scalability and outperform state-of-the-art baselines in terms of relay workload, relay chain throughput, and transaction latency.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2497-2512"},"PeriodicalIF":5.6,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10716349","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142518166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads","authors":"Diandian Gu;Yihao Zhao;Peng Sun;Xin Jin;Xuanzhe Liu","doi":"10.1109/TPDS.2024.3470074","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3470074","url":null,"abstract":"Deep learning (DL) has become a key component of modern software. Training DL models leads to huge carbon emissions. In data centers, it is important to reduce carbon emissions while completing DL training jobs early. In this article, we propose GreenFlow, a GPU cluster scheduler that reduces the average Job Completion Time (JCT) under a carbon emission budget. We first present performance models for DL training jobs to predict throughput and energy consumption under different configurations. Based on the performance models and the carbon intensity of the grid, GreenFlow dynamically allocates GPUs and adjusts the GPU-level and job-level configurations of DL training jobs. GreenFlow applies network packing and buddy allocation to job placement, thus avoiding extra carbon incurred by resource fragmentation. Evaluations on a real testbed show that when emitting the same amount of carbon, GreenFlow can improve the average JCT by up to 2.15×, compared to competitive baselines.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"168-184"},"PeriodicalIF":5.6,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BCB-SpTC: An Efficient Sparse High-Dimensional Tensor Contraction Employing Tensor Core Acceleration","authors":"Rong Hu;Haotian Wang;Wangdong Yang;Renqiu Ouyang;Keqin Li;Kenli Li","doi":"10.1109/TPDS.2024.3477746","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3477746","url":null,"abstract":"Sparse tensor contraction (SpTC) is an important operator in tensor networks that tends to generate a large amount of sparse high-dimensional data, placing high demands on the computational performance and storage bandwidth of the processor. Using GPUs with powerful arithmetic capabilities is a reliable choice for accelerating SpTC; however, the high dimensionality and sparsity of tensors make GPU-accelerated SpTC operators suffer from low computational intensity and high memory consumption. The recent introduction of Tensor Core Units (TCUs) on GPUs brings even more powerful arithmetic, which exacerbates the memory wall problem. To cope with these challenges, this paper proposes a new BCB format that linearizes the indices of multidimensional blocks to reduce block index accesses and uses a bitmap to store the distribution of non-zero elements in a block to reduce storage overhead. A parallel blocking algorithm for BCB-SpTC is designed to divide the binary linear indices into free and contracted indices to reduce the pairing overhead of computational tasks. Then, based on the characteristic computation method of TCUs, a dedicated filling method for TCUs is designed to overcome the inefficiency of parallel computation on sparse data. Finally, experimental results on an NVIDIA A100 GPU show that BCB-SpTC improves the acceleration ratio by <inline-formula><tex-math>$1.1\times$</tex-math></inline-formula> to <inline-formula><tex-math>$21.3\times$</tex-math></inline-formula> over existing SpTC GPU methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2435-2448"},"PeriodicalIF":5.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142517899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MoltDB: Accelerating Blockchain via Ancient State Segregation","authors":"Junyuan Liang;Wuhui Chen;Zicong Hong;Haogang Zhu;Wangjie Qiu;Zibin Zheng","doi":"10.1109/TPDS.2024.3467927","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3467927","url":null,"abstract":"Blockchains store states in Log-Structured Merge (LSM) tree-based databases. Due to blockchain traceability, the growing ancient states are inevitably stored in these databases. Unfortunately, by default, this process mixes <italic>current</italic> and <italic>ancient</italic> states in the data layout, increasing unnecessary disk I/O accesses and slowing transaction execution. This paper proposes MoltDB, a scalable LSM-based database for efficient transaction execution through a novel idea of <italic>ancient state segregation</italic>, i.e., segregating current and ancient states in the data layout. However, ancient states are frequently generated and uncertainly accessed, which makes this segregation challenging. Thus, we develop an “extract-compact” mechanism that batches the extraction of frequently generated ancient states with the LSM compaction process to relieve additional disk I/O overhead. Moreover, we design an adaptive LSM-based storage for the extracted, uncertainly accessed ancient states to support on-demand access. We implement MoltDB as a database engine compatible with many mainstream blockchains and integrate it into Ethereum for evaluation. Experimental results show that MoltDB achieves 1.3× transaction throughput and 30% disk I/O latency savings over state-of-the-art works.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2545-2558"},"PeriodicalIF":5.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FastLoad: Speeding Up Data Loading of Both Sparse Matrix and Vector for SpMV on GPUs","authors":"Jinyu Hu;Huizhang Luo;Hong Jiang;Guoqing Xiao;Kenli Li","doi":"10.1109/TPDS.2024.3477431","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3477431","url":null,"abstract":"Sparse Matrix-Vector Multiplication (SpMV) on GPUs has gained significant attention because of SpMV's importance in modern applications and the increasing computing power of GPUs over the last decade. Previous studies have emphasized the importance of data loading for the overall performance of SpMV and demonstrated the efficacy of coalesced memory access in enhancing data loading efficiency. However, existing approaches fall far short of reaching the full potential of data loading on modern GPUs. In this paper, we propose an efficient algorithm called FastLoad that speeds up the loading of both sparse matrices and input vectors of SpMV on modern GPUs. Leveraging coalesced memory access, FastLoad achieves high loading efficiency and load balance by sorting both the columns of the sparse matrix and the elements of the input vector based on the number of non-zero elements, while organizing non-zero elements into blocks to avoid thread divergence. FastLoad takes the Compressed Sparse Column (CSC) format as an implementation case to prove the concept and gain insights. We conduct a comprehensive comparison of FastLoad with CSC-based SpMV, cuSPARSE, CSR5, and TileSpMV, using the full SuiteSparse Matrix Collection as the workload. The experimental results on an RTX 3090 Ti demonstrate that our method outperforms the others on most matrices, with geometric mean speedups over CSC-based SpMV, cuSPARSE, CSR5, and TileSpMV of 2.12×, 2.98×, 2.88×, and 1.22×, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2423-2434"},"PeriodicalIF":5.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142518165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Competitive Analysis of Online Elastic Caching of Transient Data in Multi-Tiered Content Delivery Network","authors":"Binghan Wu;Wei Bao;Bing Bing Zhou","doi":"10.1109/TPDS.2024.3475412","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3475412","url":null,"abstract":"As the demand for faster and more reliable content delivery escalates, Content Delivery Networks (CDNs) face significant challenges in managing content placement across their increasingly complex, multi-tiered structures: they must balance performance, complexity, and scalability while addressing the transient nature of data and the unpredictability of internet traffic. Addressing these challenges, this study introduces a novel multi-tier CDN caching strategy that navigates spatial and temporal trade-offs in cache placement, considering that the cache placement cost diminishes with the content lifetime and that future data demands are uncertain. We design a distributed online algorithm that evaluates each incoming request and places new caches when the total content delivery cost exceeds a threshold. Our competitive analysis shows a tight and optimal <inline-formula><tex-math>$\mathtt{Tiers}+1$</tex-math></inline-formula> competitive ratio. Additionally, our algorithm has low complexity, passing <inline-formula><tex-math>$O(\mathtt{Tiers})$</tex-math></inline-formula> reference messages per request, which enhances its practical applicability. Empirical validation through numerical simulations and trace-driven experiments confirms the superiority of our approach over existing benchmarks in real-world CDN settings.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2449-2462"},"PeriodicalIF":5.6,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142518163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey on Performance Modeling and Prediction for Distributed DNN Training","authors":"Zhenhua Guo;Yinan Tang;Jidong Zhai;Tongtong Yuan;Jian Jin;Li Wang;Yaqian Zhao;Rengang Li","doi":"10.1109/TPDS.2024.3476390","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3476390","url":null,"abstract":"The recent breakthroughs in large-scale DNNs have attracted significant attention from both academia and industry toward distributed DNN training techniques. Because the execution of large-scale distributed DNN training is time-consuming and expensive, it is crucial to model and predict its performance before actual deployment, in order to optimize the design of distributed DNN training at low cost. This paper analyzes and emphasizes the importance of modeling and predicting the performance of distributed DNN training, categorizes and analyzes the related state-of-the-art works, and discusses future challenges and opportunities in this research field. The objectives of this paper are twofold: first, to assist researchers in understanding and choosing suitable modeling and prediction tools for large-scale distributed DNN training, and second, to encourage researchers to propose more valuable research on performance modeling and prediction for distributed DNN training in the future.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2463-2478"},"PeriodicalIF":5.6,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10707191","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142518164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}