Diandian Gu;Yihao Zhao;Peng Sun;Xin Jin;Xuanzhe Liu
{"title":"GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads","authors":"Diandian Gu;Yihao Zhao;Peng Sun;Xin Jin;Xuanzhe Liu","doi":"10.1109/TPDS.2024.3470074","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3470074","url":null,"abstract":"Deep learning (DL) has become a key component of modern software. Training DL models leads to huge carbon emissions. In data centers, it is important to reduce carbon emissions while completing DL training jobs early. In this article, we propose GreenFlow, a GPU cluster scheduler that reduces the average Job Completion Time (JCT) under a carbon emission budget. We first present performance models for DL training jobs to predict the throughput and energy consumption performance under different configurations. Based on the performance models and the carbon intensity of the grid, GreenFlow dynamically allocates GPUs, and adjusts the GPU-level and job-level configurations of DL training jobs. GreenFlow applies network packing and buddy allocation to job placement, thus avoiding extra carbon incurred by resource fragmentations. Evaluations on a real testbed show that when emitting the same amount of carbon, GreenFlow can improve the average JCT by up to 2.15×, compared to competitive baselines.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"168-184"},"PeriodicalIF":5.6,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rong Hu;Haotian Wang;Wangdong Yang;Renqiu Ouyang;Keqin Li;Kenli Li
{"title":"BCB-SpTC: An Efficient Sparse High-Dimensional Tensor Contraction Employing Tensor Core Acceleration","authors":"Rong Hu;Haotian Wang;Wangdong Yang;Renqiu Ouyang;Keqin Li;Kenli Li","doi":"10.1109/TPDS.2024.3477746","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3477746","url":null,"abstract":"Sparse tensor contraction (SpTC) is an important operator in tensor networks, which tends to generate a large amount of sparse high-dimensional data, placing higher demands on the computational performance and storage bandwidth of the processor. Using GPUs with powerful arithmetic characteristics is a reliable choice for accelerating SpTC, however, the high dimensionality and sparsity of tensor makes GPU-accelerated SpTC operators suffer from the difficulties of low computational intensity and high memory consumption. The recent introduction of Tensor Core Units (TCUs) on GPUs brings even more powerful arithmetic, which exacerbates the memory wall problem. To cope with the challenges, this paper proposes a new BCB format that linearizes the indices of multidimensional blocks to reduce block index accesses and uses a bitmap to store the distribution of non-zero elements in a block to reduce the storage overhead. A parallel blocking algorithm of BCB-SpTC is designed to divide the binary linear indices into free and contracted indexes to improve the pairing overhead of computational tasks. Then based on the characteristic computation method of TCUs, the proprietary filling method of TCUs is designed to overcome the inefficiency of parallel computation of sparse data on TCUs. 
Finally, experimental results on the A100 dataset show that BCB-SpTC improves the acceleration ratio by 1.1× to 21.3× over the existing SpTC GPU method.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2435-2448"},"PeriodicalIF":5.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142517899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MoltDB: Accelerating Blockchain via Ancient State Segregation","authors":"Junyuan Liang;Wuhui Chen;Zicong Hong;Haogang Zhu;Wangjie Qiu;Zibin Zheng","doi":"10.1109/TPDS.2024.3467927","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3467927","url":null,"abstract":"Blockchains store states in Log-Structured Merge (LSM) tree-based databases. Due to blockchain traceability, the growing ancient states are inevitably stored in the databases. Unfortunately, by default, this process mixes current and ancient states in the data layout, increasing unnecessary disk I/O access and slowing transaction execution. This paper proposes MoltDB, a scalable LSM-based database for efficient transaction execution through a novel idea of ancient state segregation, i.e., to segregate current and ancient states in the data layout. However, the frequently generated and uncertainly accessed characteristics of ancient states make the segregation challenging. Thus, we develop an “extract-compact” mechanism that batches the extraction of frequently generated ancient states with the LSM compaction process to relieve additional disk I/O overhead. Moreover, we design an adaptive LSM-based storage for the uncertainly accessed ancient states extracted for on-demand access. We implement MoltDB as a database engine compatible with many mainstream blockchains and integrate it into Ethereum for evaluation. 
Experimental results show that MoltDB achieves 1.3× transaction throughput and 30% disk I/O latency savings over the state-of-the-art works.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2545-2558"},"PeriodicalIF":5.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinyu Hu;Huizhang Luo;Hong Jiang;Guoqing Xiao;Kenli Li
{"title":"FastLoad: Speeding Up Data Loading of Both Sparse Matrix and Vector for SpMV on GPUs","authors":"Jinyu Hu;Huizhang Luo;Hong Jiang;Guoqing Xiao;Kenli Li","doi":"10.1109/TPDS.2024.3477431","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3477431","url":null,"abstract":"Sparse Matrix-Vector Multiplication (SpMV) on GPUs has gained significant attention because of SpMV's importance in modern applications and the increasing computing power of GPUs in the last decade. Previous studies have emphasized the importance of data loading for the overall performance of SpMV and demonstrated the efficacy of coalesced memory access in enhancing data loading efficiency. However, existing approaches fall far short of reaching the full potential of data loading on modern GPUs. In this paper, we propose an efficient algorithm called FastLoad, that speeds up the loading of both sparse matrices and input vectors of SpMV on modern GPUs. Leveraging coalesced memory access, FastLoad achieves high loading efficiency and load balance by sorting both the columns of the sparse matrix and elements of the input vector based on the number of non-zero elements while organizing non-zero elements in blocks to avoid thread divergence. FastLoad takes the Compressed Sparse Column (CSC) format as an implementation case to prove the concept and gain insights. We conduct a comprehensive comparison of FastLoad with the CSC-based SpMV, cuSPARSE, CSR5, and TileSpMV, using the full SuiteSparse Matrix Collection as workload. 
The experimental results on RTX 3090 Ti demonstrate that our method outperforms the others in most matrices, with geometric speedup means over CSC-based, cuSPARSE, CSR5, and TileSpMV being 2.12×, 2.98×, 2.88×, and 1.22×, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2423-2434"},"PeriodicalIF":5.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142518165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Competitive Analysis of Online Elastic Caching of Transient Data in Multi-Tiered Content Delivery Network","authors":"Binghan Wu;Wei Bao;Bing Bing Zhou","doi":"10.1109/TPDS.2024.3475412","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3475412","url":null,"abstract":"As the demand for faster and more reliable content delivery escalates, Content Delivery Networks (CDNs) face significant challenges in managing content placement across their increasingly complex, multi-tiered structures to balance performance, complexity, and scalability, while addressing the transient nature of data and the unpredictability of internet traffic. Addressing these challenges, this study introduces a novel multi-tier CDN caching strategy that navigates spatial and temporal trade-offs in cache placement, considering the cache placement cost diminishes with the content lifetime, and the uncertainty of future data demands. We design a distributed online algorithm that evaluates each incoming request and places new caches when the total content delivery cost exceeds a threshold. Our competitive analysis shows a tight and optimal $\\mathtt{Tiers}+1$ competitive ratio. Additionally, our algorithm has low complexity by passing $O(\\mathtt{Tiers})$ number of reference messages for each request, which enhances its practical applicability. 
Empirical validation through numerical simulations and trace-driven experiments confirms the superiority of our approach to existing benchmarks in real-world CDN settings.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2449-2462"},"PeriodicalIF":5.6,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142518163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhenhua Guo;Yinan Tang;Jidong Zhai;Tongtong Yuan;Jian Jin;Li Wang;Yaqian Zhao;Rengang Li
{"title":"A Survey on Performance Modeling and Prediction for Distributed DNN Training","authors":"Zhenhua Guo;Yinan Tang;Jidong Zhai;Tongtong Yuan;Jian Jin;Li Wang;Yaqian Zhao;Rengang Li","doi":"10.1109/TPDS.2024.3476390","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3476390","url":null,"abstract":"The recent breakthroughs in large-scale DNN attract significant attention from both academia and industry toward distributed DNN training techniques. Due to the time-consuming and expensive execution process of large-scale distributed DNN training, it is crucial to model and predict the performance of distributed DNN training before its actual deployment, in order to optimize the design of distributed DNN training at low cost. This paper analyzes and emphasizes the importance of modeling and predicting the performance of distributed DNN training, categorizes and analyses the related state-of-the-art works, and discusses future challenges and opportunities for this research field. The objectives of this paper are twofold: first, to assist researchers in understanding and choosing suitable modeling and prediction tools for large-scale distributed DNN training, and second, to encourage researchers to propose more valuable research about performance modeling and prediction for distributed DNN training in the future.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2463-2478"},"PeriodicalIF":5.6,"publicationDate":"2024-10-07","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10707191","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142518164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TrieKV: A High-Performance Key-Value Store Design With Memory as Its First-Class Citizen","authors":"Hui Sun;Deyan Kong;Song Jiang;Yinliang Yue;Xiao Qin","doi":"10.1109/TPDS.2024.3473013","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3473013","url":null,"abstract":"Key-value (KV) stores based on log-structured merge tree (LSM-tree) have been extensively studied and deployed in major information technology infrastructures. Because this type of systems is catered for KV store accessing disks, a limited disk bandwidth increases the difficulty of serving online data requests. One solution involves using a large DRAM such that frequent KV pairs are buffered and accessed from the main memory – and this solution exposes a major design drawback of the KV store: its lack of support for integrated data management in memory and on disks. For example, data in the most popular LSM-tree implementation – RocksDB – may reside in a small write buffer (MemTable) that organizes KV pairs for disk writes, a buffer cache for disk blocks, a write-ahead log on the disk for data persistence, and in various LSM levels on the disk. Without the integrated management of indexes, data, and their persistence in a hierarchical memory/disk architecture, memory is under-utilized along with missed performance optimization opportunities. We propose a KV store, TrieKV, which holistically incorporates DRAM, persistent memory (PMem), and disk with certain desired features: (1) fast in-memory access, (2) accurate identification of hot/cold data at an adaptable granularity, (3) customized memory space allocation for minimized fragmentation, (4) hotness-aware data placement across the storage hierarchy, (5) in-place data persistence in the PMem, and (6) hotness-aware LSM-tree compaction. TrieKV employs a single, integrated trie-structured index for all KV pairs in memory, where access hotness can be consistently discovered. 
Accordingly, the KV placement is dynamically determined according to the hotness and persistence needs of the storage hierarchy spanning the DRAM, PMem, and solid-state drive. In the experiment, we demonstrate that the 99th-percentile latencies of RocksDB and NoveLSM are 38× and 6× higher than that of TrieKV, respectively. In addition, TrieKV outperforms RocksDB and NoveLSM by factors of 5.6 and 1.7 in terms of throughput, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2479-2496"},"PeriodicalIF":5.6,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142518180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TARIS: Scalable Incremental Processing of Time-Respecting Algorithms on Streaming Graphs","authors":"Ruchi Bhoot;Suved Sanjay Ghanmode;Yogesh Simmhan","doi":"10.1109/TPDS.2024.3471574","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3471574","url":null,"abstract":"Temporal graphs change with time and have a lifespan associated with each vertex and edge. These graphs are suitable to process time-respecting algorithms where the traversed edges must have monotonic timestamps. Interval-centric Computing Model (ICM) is a distributed programming abstraction to design such temporal algorithms. There has been little work on supporting time-respecting algorithms at large scales for streaming graphs, which are updated continuously at high rates (Millions/s), such as in financial and social networks. In this article, we extend the windowed-variant of ICM for incremental computing over streaming graph updates. We formalize the properties of temporal graph algorithms and prove that our model of incremental computing over streaming updates is equivalent to batch execution of ICM. We design TARIS, a novel distributed graph platform that implements these incremental computing features. We use efficient data structures to reduce memory access and enhance locality during graph updates. We also propose scheduling strategies to interleave updates with computing, and streaming strategies to adapt the execution window for incremental computing to the variable input rates. 
Our detailed and rigorous evaluation of temporal algorithms on large-scale graphs with up to 2 B edges shows that TARIS outperforms contemporary baselines, Tink and Gradoop, by 3–4 orders of magnitude, and handles a high input rate of 83k–587M Mutations/s with latencies in the order of seconds–minutes.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2527-2544"},"PeriodicalIF":5.6,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Task Processing Platform for Infrastructure-Less IoT Networks: A Multi-Dimensional Optimization Approach","authors":"Qiushi Zheng;Jiong Jin;Zhishu Shen;Libing Wu;Iftekhar Ahmad;Yong Xiang","doi":"10.1109/TPDS.2024.3469545","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3469545","url":null,"abstract":"With the rapid development of artificial intelligence (AI) and the Internet of Things (IoT), intelligent information services have showcased unprecedented capabilities in acquiring and analysing information. The conventional task processing platforms rely on centralised Cloud processing, which encounters challenges in infrastructure-less environments with unstable or disrupted electrical grids and cellular networks. These challenges hinder the deployment of intelligent information services in such environments. To address these challenges, we propose a distributed task processing platform (DTPP) designed to provide satisfactory performance for executing computationally intensive applications in infrastructure-less environments. This platform leverages numerous distributed homogeneous nodes to process the arriving task locally or collaboratively. Based on this platform, a distributed task allocation algorithm is developed to achieve high task processing performance with limited energy and bandwidth resources. To validate our approach, DTPP has been tested in an experimental environment utilising real-world experimental data to simulate IoT network services in infrastructure-less environments. 
Extensive experiments demonstrate that our proposed solution surpasses comparative algorithms in key performance metrics, including task processing ratio, task processing accuracy, algorithm processing time, and energy consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2392-2404"},"PeriodicalIF":5.6,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GeoDeploy: Geo-Distributed Application Deployment Using Benchmarking","authors":"Devki Nandan Jha;Yinhao Li;Zhenyu Wen;Graham Morgan;Prem Prakash Jayaraman;Maciej Koutny;Omer F. Rana;Rajiv Ranjan","doi":"10.1109/TPDS.2024.3470532","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3470532","url":null,"abstract":"Geo-distributed web-applications (GWA) can be deployed across multiple geographically separated datacenters to reduce the latency of access for users. Finding a suitable deployment for a GWA is challenging due to the requirement to consider a number of different parameters, such as host configurations across a federated infrastructure. The ability to evaluate multiple deployment configurations enables an efficient outcome to be determined, balancing resource usage while satisfying user requirements. We propose GeoDeploy, a framework designed for finding a deployment solution for GWA. We evaluate GeoDeploy using both a formal algorithmic model and a practical cloud-based deployment. We also compare our approach with other existing techniques.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2361-2374"},"PeriodicalIF":5.6,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}