{"title":"DeepTM: Efficient Tensor Management in Heterogeneous Memory for DNN Training","authors":"Haoran Zhou;Wei Rang;Hongyang Chen;Xiaobo Zhou;Dazhao Cheng","doi":"10.1109/TPDS.2024.3431910","DOIUrl":"10.1109/TPDS.2024.3431910","url":null,"abstract":"Deep Neural Networks (DNNs) have gained widespread adoption in diverse fields, including image classification, object detection, and natural language processing. However, training large-scale DNN models often encounters significant memory bottlenecks, which ask for efficient management of extensive tensors. Heterogeneous memory system, which combines persistent memory (PM) modules with traditional DRAM, offers an economically viable solution to address tensor management challenges during DNN training. However, existing memory management methods on heterogeneous memory systems often lead to low PM access efficiency, low bandwidth utilization, and incomplete analysis of model characteristics. To overcome these hurdles, we introduce an efficient tensor management approach, DeepTM, tailored for heterogeneous memory to alleviate memory bottlenecks during DNN training. DeepTM employs page-level tensor aggregation to enhance PM read and write performance and executes contiguous page migration to increase memory bandwidth. Through an analysis of tensor access patterns and model characteristics, we quantify the overall performance and transform the performance optimization problem into the framework of Integer Linear Programming. Additionally, we achieve tensor heat recognition by dynamically adjusting the weights of four key tensor characteristics and develop a global optimization strategy using Deep Reinforcement Learning. To validate the efficacy of our approach, we implement and evaluate DeepTM, utilizing the TensorFlow framework running on a PM-based heterogeneous memory system. 
The experimental results demonstrate that DeepTM achieves performance improvements of up to 36% and 49% compared to the current state-of-the-art memory management strategies AutoTM and Sentinel, respectively. Furthermore, our solution reduces the overhead by 18 times and achieves up to 29% cost reduction compared to AutoTM.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"1920-1935"},"PeriodicalIF":5.6,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Springald: GPU-Accelerated Window-Based Aggregates Over Out-of-Order Data Streams","authors":"Gabriele Mencagli;Patrizio Dazzi;Massimo Coppola","doi":"10.1109/TPDS.2024.3431611","DOIUrl":"10.1109/TPDS.2024.3431611","url":null,"abstract":"An increasing number of application domains require high-throughput processing to extract insights from massive data streams. The Data Stream Processing (DSP) paradigm provides formal approaches to analyze structured data streams considered as special, unbounded relations. The most used class of stateful operators in DSP are the ones running sliding-window aggregation, which continuously extracts insights from the most recent portion of the stream. This article presents \u0000<sc>Springald</small>\u0000, an efficient sliding-window operator leveraging GPU devices. \u0000<sc>Springald</small>\u0000, incorporated in the \u0000<sc>WindFlow</small>\u0000 parallel library, processes out-of-order data streams with watermarks propagation. These two features—GPU processing and out-of-orderliness—make \u0000<sc>Springald</small>\u0000 a novel contribution to this research area. This article describes the methodology behind \u0000<sc>Springald</small>\u0000, its design and implementation. 
We also provide an extensive experimental evaluation to understand the behavior of \u0000<sc>Springald</small>\u0000 deeply, and we showcase its superior performance against state-of-the-art competitors.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1657-1671"},"PeriodicalIF":5.6,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10606093","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IRIS: A Performance-Portable Framework for Cross-Platform Heterogeneous Computing","authors":"Jungwon Kim;Seyong Lee;Beau Johnston;Jeffrey S. Vetter","doi":"10.1109/TPDS.2024.3429010","DOIUrl":"10.1109/TPDS.2024.3429010","url":null,"abstract":"From edge to exascale, computer architectures are becoming more heterogeneous and complex. The systems typically have fat nodes, with multicore CPUs and multiple hardware accelerators such as GPUs, FPGAs, and DSPs. This complexity is causing a crisis in programming systems and performance portability. Several programming systems are working to address these challenges, but the increasing architectural diversity is forcing software stacks and applications to be specialized for each architecture. As we show, all of these approaches critically depend on their software framework for discovery, execution, scheduling, and data orchestration. To address this challenge, we believe that a more agile and proactive software framework is essential to increase performance portability and improve user productivity. To this end, we have designed and implemented IRIS: a performance-portable framework for cross-platform heterogeneous computing. IRIS can discover available resources, manage multiple diverse programming platforms (e.g., CUDA, Hexagon, HIP, Level Zero, OpenCL, OpenMP) simultaneously in the same execution, respect data dependencies, orchestrate data movement proactively, and provide for user-configurable scheduling. To simplify data movement, IRIS introduces a shared virtual device memory with relaxed consistency among different heterogeneous devices. IRIS also adds an automatic kernel workload partitioning technique using the polyhedral model so that it can resize kernels for a wide range of devices. 
Our evaluation on three architectures, ranging from Qualcomm Snapdragon to a Summit supercomputer node, shows that IRIS improves portability across a wide range of diverse heterogeneous architectures with negligible overhead.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1796-1809"},"PeriodicalIF":5.6,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG","authors":"Jiaxing Qi;Wencong Xiao;Mingzhen Li;Chaojie Yang;Yong Li;Wei Lin;Hailong Yang;Zhongzhi Luan;Depei Qian","doi":"10.1109/TPDS.2024.3431189","DOIUrl":"10.1109/TPDS.2024.3431189","url":null,"abstract":"As deep learning (DL) technologies become ubiquitous, GPU clusters are deployed for inference tasks with consistent service level objectives (SLOs). Efficiently utilizing multiple GPUs is crucial for throughput and cost-effectiveness. This article addresses the challenges posed by dynamic input and NVIDIA MIG in scheduling DL workloads. We present ElasticBatch, a scheduling system that simplifies configuration through bucketization and employs a machine learning-based pipeline to optimize settings. Our experiments demonstrate that ElasticBatch achieves a 50% reduction in GPU instances compared to MIG disablement, increases GPU utilization by 1.4% to 6.5% over an ideal scheduler and significantly reduces profiling time. This research contributes to the discourse on efficient utilization of GPU clusters. 
ElasticBatch's effectiveness in mitigating challenges posed by dynamic inputs and NVIDIA MIG underscores its potential to optimize GPU cluster performance, providing tangible benefits in terms of reduced instances, increased utilization, and significant time savings in real-world deployment scenarios.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1708-1720"},"PeriodicalIF":5.6,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HybRAID: A High-Performance Hybrid RAID Storage Architecture for Write-Intensive Applications in All-Flash Storage Systems","authors":"Maryam Karimi;Reza Salkhordeh;André Brinkmann;Hossein Asadi","doi":"10.1109/TPDS.2024.3429336","DOIUrl":"10.1109/TPDS.2024.3429336","url":null,"abstract":"With the ever-increasing demand for higher I/O performance and reliability in data-intensive applications, \u0000<italic>solid-state drives</i>\u0000 (SSDs) typically configured as \u0000<italic>redundant array of independent disks</i>\u0000 (RAID) are broadly used in enterprise \u0000<italic>all-flash storage systems</i>\u0000. While a mirrored RAID offers higher performance in random access workloads, parity-based RAIDs (e.g., RAID5) provide higher performance in sequential accesses with less cost overhead. Previous studies try to address the poor performance of parity-based RAIDs in small writes (i.e., writes into a single disk) by offering various schemes, including caching or logging small writes. However, such techniques impose a significant performance and/or reliability overheads and are seldom used in the industry. In addition, our empirical analysis shows that partial stripe writes, i.e., writing into a fraction of a full array in parity-based RAIDs, can significantly degrade the I/O performance, which has \u0000<italic>not</i>\u0000 been addressed in the previous work. In this paper, we first offer an empirical study which reveals partial stripe writes reduce the performance of parity-based RAIDs by up to 6.85× compared to full stripe writes (i.e., writes into entire disks). Then, we propose a high-performance \u0000<underline>hyb</u>\u0000rid \u0000<underline>RAID</u>\u0000 storage architecture, called \u0000<italic>HybRAID</i>\u0000, which is optimized for write-intensive applications. HybRAID exploits the advantages of mirror- and parity-based RAIDs to improve the write performance. 
HybRAID directs a) \u0000<underline>aligned</u>\u0000 full stripe writes to parity-based RAID tier and b) small/partial stripe writes to the RAID1 tier. We propose an online migration scheme, which aims to move small/partial writes from parity-based RAID to RAID1, based on access frequency of updates. As a complement, we further offer offline migration, whose aim is to make room in the fast tier for future references. Experimental results over enterprise SSDs show that HybRAID improves the performance of write-intensive applications by 3.3× and 2.6×, as well as enhancing performance per cost by 3.1× and 3.0× compared to parity-based RAID and RAID10, respectively, at equivalent costs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2608-2623"},"PeriodicalIF":5.6,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"InSS: An Intelligent Scheduling Orchestrator for Multi-GPU Inference With Spatio-Temporal Sharing","authors":"Ziyi Han;Ruiting Zhou;Chengzhong Xu;Yifan Zeng;Renli Zhang","doi":"10.1109/TPDS.2024.3430063","DOIUrl":"10.1109/TPDS.2024.3430063","url":null,"abstract":"As the applications of AI proliferate, it is critical to increase the throughput of online DNN inference services. Multi-process service (MPS) improves the utilization rate of GPU resources by spatial-sharing, but it also brings unique challenges. First, interference between co-located DNN models deployed on the same GPU must be accurately modeled. Second, inference tasks arrive dynamically online, and each task needs to be served within a bounded time to meet the service-level objective (SLO). Third, the problem of fragments has become more serious. To address the above three challenges, we propose an \u0000<underline>In</u>\u0000telligent \u0000<underline>S</u>\u0000cheduling orchestrator for multi-GPU inference servers with spatio-temporal \u0000<underline>S</u>\u0000haring (\u0000<italic>InSS</i>\u0000), aiming to maximize the system throughput. \u0000<italic>InSS</i>\u0000 exploits two key innovations: i) An interference-aware latency analytical model which estimates the task latency. ii) A two-stage intelligent scheduler is tailored to jointly optimize the model placement, GPU resource allocation and adaptively decides batch size by coupling the latency analytical model. Our prototype implementation on four NVIDIA A100 GPUs shows that \u0000<italic>InSS</i>\u0000 can improve the throughput by up to 86% compared to the state-of-the-art GPU schedulers, while satisfying SLOs. 
We further show the scalability of \u0000<italic>InSS</i>\u0000 on 64 GPUs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1735-1748"},"PeriodicalIF":5.6,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Swift: Expedited Failure Recovery for Large-Scale DNN Training","authors":"Yuchen Zhong;Guangming Sheng;Juncheng Liu;Jinhui Yuan;Chuan Wu","doi":"10.1109/TPDS.2024.3429625","DOIUrl":"10.1109/TPDS.2024.3429625","url":null,"abstract":"As the size of deep learning models gets larger and larger, training takes longer time and more resources, making fault tolerance more and more critical. Existing state-of-the-art methods like CheckFreq and Elastic Horovod need to back up a copy of the model state (i.e., parameters and optimizer states) in memory, which is costly for large models and leads to non-trivial overhead. This article presents \u0000<sc>Swift</small>\u0000, a novel recovery design for distributed deep neural network training that significantly reduces the failure recovery overhead without affecting training throughput and model accuracy. Instead of making an additional copy of the model state, \u0000<sc>Swift</small>\u0000 resolves the inconsistencies of the model state caused by the failure and exploits the replicas of the model state in data parallelism for failure recovery. We propose a logging-based approach when replicas are unavailable, which records intermediate data and replays the computation to recover the lost state upon a failure. The re-computation is distributed across multiple machines to accelerate failure recovery further. We also log intermediate data selectively, exploring the trade-off between recovery time and intermediate data storage overhead. Evaluations show that \u0000<sc>Swift</small>\u0000 significantly reduces the failure recovery time and achieves similar or better training throughput during failure-free execution compared to state-of-the-art methods without degrading final model accuracy. 
\u0000<sc>Swift</small>\u0000 can also achieve up to 1.16x speedup in total training time compared to state-of-the-art methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1644-1656"},"PeriodicalIF":5.6,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STT-RAM-Based Hierarchical in-Memory Computing","authors":"Dhruv Gajaria;Kevin Antony Gomez;Tosiron Adegbija","doi":"10.1109/TPDS.2024.3430853","DOIUrl":"10.1109/TPDS.2024.3430853","url":null,"abstract":"In-memory computing promises to overcome the von Neumann bottleneck in computer systems by performing computations directly within the memory. Previous research has suggested using \u0000<italic>Spin-Transfer Torque RAM (STT-RAM)</i>\u0000 for in-memory computing due to its non-volatility, low leakage power, high density, endurance, and commercial viability. This paper explores \u0000<italic>hierarchical in-memory computing</i>\u0000, where different levels of the memory hierarchy are augmented with processing elements to optimize workload execution. The paper investigates processing in memory (PiM) using non-volatile STT-RAM and processing in cache (PiC) using volatile STT-RAM with relaxed retention, which helps mitigate STT-RAM's write latency and energy overheads. We analyze tradeoffs and overheads associated with data movement for PiC versus write overheads for PiM using STT-RAMs for various workloads. We examine workload characteristics, such as computational intensity and CPU-dependent workloads with limited instruction-level parallelism, and their impact on PiC/PiM tradeoffs. Using these workloads, we evaluate computing in STT-RAM versus SRAM at different cache hierarchy levels and explore the potential of heterogeneous STT-RAM cache architectures with various retention times for PiC and CPU-based computing. Our experiments reveal significant advantages of STT-RAM-based PiC over PiM for specific workloads. 
Finally, we describe open research problems in hierarchical in-memory computing architectures to further enhance this paradigm.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1615-1629"},"PeriodicalIF":5.6,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reproducibility of the DaCe Framework on NPBench Benchmarks","authors":"Anish Govind, Yuchen Jing, Stefanie Dao, Michael Granado, Rachel Handran, Davit Margarian, Matthew Mikhailov, Danny Vo, Matei-Alexandru Gardus, Khai Vu, Derek Bouius, Bryan Chin, Mahidhar Tatineni, Mary Thomas","doi":"10.1109/tpds.2024.3427130","DOIUrl":"https://doi.org/10.1109/tpds.2024.3427130","url":null,"abstract":"","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"14 1","pages":""},"PeriodicalIF":5.3,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141614552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Effective Server Deployment for Multi-Access Edge Networks: A Cooperative Scheme","authors":"Rong Cong;Zhiwei Zhao;Linyuanqi Zhang;Geyong Min","doi":"10.1109/TPDS.2024.3426523","DOIUrl":"10.1109/TPDS.2024.3426523","url":null,"abstract":"The combination of 5G/6G and edge computing has been envisioned as a promising paradigm to empower pervasive and intensive computing for the Internet-of-Things (IoT). High deployment cost is one of the major obstacles for realizing 5G/6G edge computing. Most existing works tried to deploy the minimum number of edge servers to cover a target area by avoiding coverage overlaps. However, following this framework, the resource requirement per server will be drastically increased by the peak requirement during workload variations. Even worse, most resources will be left under-utilized for most of the time. To address this problem, we propose CoopEdge, a cost-effective server deployment scheme for cooperative multi-access edge computing. The key idea of CoopEdge is to allow deploying overlapped servers to handle variable requested workloads in a cooperative manner. In this way, the peak demands can be dispersed into multiple servers, and the resource requirement for each server can be greatly reduced. We propose a Two-step Incremental Deployment (TID) algorithm to jointly decide the server deployment and cooperation policies. For the scenarios involving multiple network operators that are unwilling to cooperate with each other, we further extend the TID algorithm to a distributed TID algorithm based on the game theory. Extensive evaluation experiments are conducted based on the measurement results of seven real-world edge applications. 
The results show that compared with the state-of-the-art work, CoopEdge significantly reduces the deployment cost by 38.7% and improves resource utilization by 36.2%, and the proposed distributed algorithm can achieve a comparable deployment cost with CoopEdge, especially for small-coverage servers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1583-1597"},"PeriodicalIF":5.6,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141609719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}