{"title":"SSA: A Uniformly Recursive Bidirection-Sequence Systolic Sorter Array","authors":"Teng Gao;Lan Huang;Shang Gao;Kangping Wang","doi":"10.1109/TPDS.2024.3434332","DOIUrl":"10.1109/TPDS.2024.3434332","url":null,"abstract":"The use of reconfigurable circuits with parallel computing capabilities has been explored to enhance sorting performance and reduce power consumption. Nonetheless, most sorting algorithms utilizing dedicated processors are designed solely based on the parallelization of the algorithm, lacking considerations of specialized hardware structures. This leads to problems, including but not limited to the consumption of excessive I/O interface resources, on-chip storage resources, and complex layout wiring. In this paper, we propose a Systolic Sorter Array, implemented by a Uniform Recurrence Equation (URE) with highly parameterised in terms of data size, bit width and type. Leveraging this uniformly recursive structure, the sorter can simultaneously sort two independent sequences. In addition, we implemented global and local control modes on the FPGA to achieve higher computational frequencies. In our experiments, we have demonstrated the speed-up ratio of SSA relative to other state of the art (SOTA) sorting algorithms using C++ \u0000<inline-formula><tex-math>$std$</tex-math></inline-formula>\u0000::\u0000<inline-formula><tex-math>$sort()$</tex-math></inline-formula>\u0000 as benchmark. Inheriting the benefits from the Systolic Array architecture, the SSA reaches up to 810 Mhz computing frequency on the U200. The results of our study show that SSA outperforms other sorting algorithms in terms of throughput, speed-up ratio, and computation frequency.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1721-1734"},"PeriodicalIF":5.6,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Long-Range MD Electrostatics Force Computation on FPGAs","authors":"Sahan Bandara;Anthony Ducimo;Chunshu Wu;Martin Herbordt","doi":"10.1109/TPDS.2024.3434347","DOIUrl":"10.1109/TPDS.2024.3434347","url":null,"abstract":"Strong scaling of long-range electrostatic force computation, which is a central concern of long timescale molecular dynamics simulations, is challenging for CPUs and GPUs due to its complex communication structure and global communication requirements. The scalability challenge is seen especially in small simulations of tens to hundreds of thousands of atoms that are of interest to many important applications such as physics-driven drug discovery. FPGA clusters, with their direct, tightly coupled, low-latency interconnects, are able to address these requirements. For FPGA MD clusters to be effective, however, single device performance must also be competitive. In this work, we leverage the inherent benefits of FPGAs to implement a long-range electrostatic force computation architecture. We present an overall framework with numerous algorithmic, mapping, and architecture innovations, including a unified interleaved memory, a spatial scheduling algorithm, and a design for seamless integration with the larger MD system. We examine a number of alternative configurations based on different resource allocation strategies and user parameters. We show that the best configuration of this architecture, implemented on an Intel Agilex FPGA, can achieve \u0000<inline-formula><tex-math>$2124 ns$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$287 ns$</tex-math></inline-formula>\u0000 of simulated time per day of wall-clock time for the two molecular dynamics benchmarks DHFR and ApoA1; simulating 23K and 92K particles, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1690-1707"},"PeriodicalIF":5.6,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Redundancy-Free and Load-Balanced TGNN Training With Hierarchical Pipeline Parallelism","authors":"Yaqi Xia;Zheng Zhang;Donglin Yang;Chuang Hu;Xiaobo Zhou;Hongyang Chen;Qianlong Sang;Dazhao Cheng","doi":"10.1109/TPDS.2024.3432855","DOIUrl":"10.1109/TPDS.2024.3432855","url":null,"abstract":"Recently, Temporal Graph Neural Networks (TGNNs), as an extension of Graph Neural Networks, have demonstrated remarkable effectiveness in handling dynamic graph data. Distributed TGNN training requires efficiently tackling temporal dependency, which often leads to excessive cross-device communication that generates significant redundant data. However, existing systems are unable to remove the redundancy in data reuse and transfer, and suffer from severe communication overhead in a distributed setting. This work introduces Sven, a co-designed algorithm-system library aimed at accelerating TGNN training on a multi-GPU platform. Exploiting dependency patterns of TGNN models, we develop a redundancy-free graph organization to mitigate redundant data transfer. Additionally, we investigate communication imbalance issues among devices and formulate the graph partitioning problem as minimizing the maximum communication balance cost, which is proved to be an NP-hard problem. We propose an approximation algorithm called Re-FlexBiCut to tackle this problem. Furthermore, we incorporate prefetching, adaptive micro-batch pipelining, and asynchronous pipelining to present a hierarchical pipelining mechanism that mitigates the communication overhead. Sven represents the first comprehensive optimization solution for scaling memory-based TGNN training. Through extensive experiments conducted on a 64-GPU cluster, Sven demonstrates impressive speedup, ranging from 1.9x to 3.5x, compared to State-of-the-Art approaches. Additionally, Sven achieves up to 5.26x higher communication efficiency and reduces communication imbalance by up to 59.2%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"1904-1919"},"PeriodicalIF":5.6,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs","authors":"Cunyang Wei;Haipeng Jia;Yunquan Zhang;Jianyu Yao;Chendi Li;Wenxuan Cao","doi":"10.1109/TPDS.2024.3432579","DOIUrl":"10.1109/TPDS.2024.3432579","url":null,"abstract":"The matrix multiplication algorithm is a fundamental numerical technique in linear algebra and plays a crucial role in many scientific computing applications. Despite the high performance of mainstream basic linear algebra libraries for large-scale dense matrix multiplications, they exhibit poor performance when applied to matrix multiplication with irregular input. This paper proposes an input-aware tuning framework that accounts for application scenarios and computer architectures to provide high-performance irregular matrix multiplication on ARMv8 and X86 CPUs. The framework comprises two stages: the install-time stage and the run-time stage. The install-time stage utilizes our proposed computational template to generate high-performance kernels for general data layout and SIMD-friendly data layout. The run-time stage utilizes a tiling algorithm suitable for irregular GEMM to select the optimal kernel and link as an execution plan. Additionally, load-balanced multi-threaded optimization algorithms are defined to exploit the multi-threading capability of modern processors. Experiments demonstrate that the proposed IrGEMM framework can achieve significant performance improvements for irregular GEMM on both ARMv8 and X86 CPUs compared to other mainstream BLAS libraries.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1672-1689"},"PeriodicalIF":5.6,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sophisticated Orchestrating Concurrent DLRM Training on CPU/GPU Platform","authors":"Rui Tian;Jiazhi Jiang;Jiangsu Du;Dan Huang;Yutong Lu","doi":"10.1109/TPDS.2024.3432620","DOIUrl":"10.1109/TPDS.2024.3432620","url":null,"abstract":"Recommendation systems are essential to the operation of the majority of internet services, with Deep Learning Recommendation Models (DLRMs) serving as a crucial component. However, due to distinct computation, data access, and memory usage characteristics of recommendation models, the trainning of DLRMs may suffer from low resource utilization on prevalent heterogeneous CPU-GPU hardware platforms. Furthermore, as the majority of high-performance computing systems presently depend on multi-GPU computing nodes, the challenge of addressing low resource utilization becomes even more pronounced. Existing concurrent training solutions cannot be straightforwardly applied to DLRM due to various factors, such as insufficient fine-grained memory management and the lack of collaborative CPU-GPU scheduling. In this paper, we introduce RMixer, a scheduling framework that addresses these challenges by providing an efficient job management and scheduling mechanism for DLRM training jobs on heterogeneous CPU-GPU platforms. To facilitate training co-location, we first estimate the peak memory consumption of each job. Additionally, we track and collect resource utilization for DLRM training jobs. Based on the information of computational patterns, a batched job dispatcher with dynamic resource-complementary scheduling policy is proposed to co-locate DLRM training jobs on CPU-GPU platform. Scheduling strategies for both intra-GPU and inter-GPU scenarios were meticulously devised, with a focus on thoroughly examining individual GPU resource utilization and achieving a balanced state across multiple GPUs. Experimental results demonstrate that our implementation achieved up to 5.3× and 7.5× higher throughput on single GPU and 4 GPU respectively for training jobs involving various recommendation models.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2177-2192"},"PeriodicalIF":5.6,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepTM: Efficient Tensor Management in Heterogeneous Memory for DNN Training","authors":"Haoran Zhou;Wei Rang;Hongyang Chen;Xiaobo Zhou;Dazhao Cheng","doi":"10.1109/TPDS.2024.3431910","DOIUrl":"10.1109/TPDS.2024.3431910","url":null,"abstract":"Deep Neural Networks (DNNs) have gained widespread adoption in diverse fields, including image classification, object detection, and natural language processing. However, training large-scale DNN models often encounters significant memory bottlenecks, which ask for efficient management of extensive tensors. Heterogeneous memory system, which combines persistent memory (PM) modules with traditional DRAM, offers an economically viable solution to address tensor management challenges during DNN training. However, existing memory management methods on heterogeneous memory systems often lead to low PM access efficiency, low bandwidth utilization, and incomplete analysis of model characteristics. To overcome these hurdles, we introduce an efficient tensor management approach, DeepTM, tailored for heterogeneous memory to alleviate memory bottlenecks during DNN training. DeepTM employs page-level tensor aggregation to enhance PM read and write performance and executes contiguous page migration to increase memory bandwidth. Through an analysis of tensor access patterns and model characteristics, we quantify the overall performance and transform the performance optimization problem into the framework of Integer Linear Programming. Additionally, we achieve tensor heat recognition by dynamically adjusting the weights of four key tensor characteristics and develop a global optimization strategy using Deep Reinforcement Learning. To validate the efficacy of our approach, we implement and evaluate DeepTM, utilizing the TensorFlow framework running on a PM-based heterogeneous memory system. The experimental results demonstrate that DeepTM achieves performance improvements of up to 36% and 49% compared to the current state-of-the-art memory management strategies AutoTM and Sentinel, respectively. Furthermore, our solution reduces the overhead by 18 times and achieves up to 29% cost reduction compared to AutoTM.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"1920-1935"},"PeriodicalIF":5.6,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Springald: GPU-Accelerated Window-Based Aggregates Over Out-of-Order Data Streams","authors":"Gabriele Mencagli;Patrizio Dazzi;Massimo Coppola","doi":"10.1109/TPDS.2024.3431611","DOIUrl":"10.1109/TPDS.2024.3431611","url":null,"abstract":"An increasing number of application domains require high-throughput processing to extract insights from massive data streams. The Data Stream Processing (DSP) paradigm provides formal approaches to analyze structured data streams considered as special, unbounded relations. The most used class of stateful operators in DSP are the ones running sliding-window aggregation, which continuously extracts insights from the most recent portion of the stream. This article presents \u0000<sc>Springald</small>\u0000, an efficient sliding-window operator leveraging GPU devices. \u0000<sc>Springald</small>\u0000, incorporated in the \u0000<sc>WindFlow</small>\u0000 parallel library, processes out-of-order data streams with watermarks propagation. These two features—GPU processing and out-of-orderliness—make \u0000<sc>Springald</small>\u0000 a novel contribution to this research area. This article describes the methodology behind \u0000<sc>Springald</small>\u0000, its design and implementation. We also provide an extensive experimental evaluation to understand the behavior of \u0000<sc>Springald</small>\u0000 deeply, and we showcase its superior performance against state-of-the-art competitors.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1657-1671"},"PeriodicalIF":5.6,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10606093","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IRIS: A Performance-Portable Framework for Cross-Platform Heterogeneous Computing","authors":"Jungwon Kim;Seyong Lee;Beau Johnston;Jeffrey S. Vetter","doi":"10.1109/TPDS.2024.3429010","DOIUrl":"10.1109/TPDS.2024.3429010","url":null,"abstract":"From edge to exascale, computer architectures are becoming more heterogeneous and complex. The systems typically have fat nodes, with multicore CPUs and multiple hardware accelerators such as GPUs, FPGAs, and DSPs. This complexity is causing a crisis in programming systems and performance portability. Several programming systems are working to address these challenges, but the increasing architectural diversity is forcing software stacks and applications to be specialized for each architecture. As we show, all of these approaches critically depend on their software framework for discovery, execution, scheduling, and data orchestration. To address this challenge, we believe that a more agile and proactive software framework is essential to increase performance portability and improve user productivity. To this end, we have designed and implemented IRIS: a performance-portable framework for cross-platform heterogeneous computing. IRIS can discover available resources, manage multiple diverse programming platforms (e.g., CUDA, Hexagon, HIP, Level Zero, OpenCL, OpenMP) simultaneously in the same execution, respect data dependencies, orchestrate data movement proactively, and provide for user-configurable scheduling. To simplify data movement, IRIS introduces a shared virtual device memory with relaxed consistency among different heterogeneous devices. IRIS also adds an automatic kernel workload partitioning technique using the polyhedral model so that it can resize kernels for a wide range of devices. Our evaluation on three architectures, ranging from Qualcomm Snapdragon to a Summit supercomputer node, shows that IRIS improves portability across a wide range of diverse heterogeneous architectures with negligible overhead.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1796-1809"},"PeriodicalIF":5.6,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG","authors":"Jiaxing Qi;Wencong Xiao;Mingzhen Li;Chaojie Yang;Yong Li;Wei Lin;Hailong Yang;Zhongzhi Luan;Depei Qian","doi":"10.1109/TPDS.2024.3431189","DOIUrl":"10.1109/TPDS.2024.3431189","url":null,"abstract":"As deep learning (DL) technologies become ubiquitous, GPU clusters are deployed for inference tasks with consistent service level objectives (SLOs). Efficiently utilizing multiple GPUs is crucial for throughput and cost-effectiveness. This article addresses the challenges posed by dynamic input and NVIDIA MIG in scheduling DL workloads. We present ElasticBatch, a scheduling system that simplifies configuration through bucketization and employs a machine learning-based pipeline to optimize settings. Our experiments demonstrate that ElasticBatch achieves a 50% reduction in GPU instances compared to MIG disablement, increases GPU utilization by 1.4% to 6.5% over an ideal scheduler and significantly reduces profiling time. This research contributes to the discourse on efficient utilization of GPU clusters. ElasticBatch's effectiveness in mitigating challenges posed by dynamic inputs and NVIDIA MIG underscores its potential to optimize GPU cluster performance, providing tangible benefits in terms of reduced instances, increased utilization, and significant time savings in real-world deployment scenarios.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1708-1720"},"PeriodicalIF":5.6,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}