{"title":"ZeroTracer: In-Band eBPF-Based Trace Generator With Zero Instrumentation for Microservice Systems","authors":"Wanqi Yang;Pengfei Chen;Kai Liu;Huxing Zhang","doi":"10.1109/TPDS.2025.3571934","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3571934","url":null,"abstract":"Microservice enables agility in modern cloud-native applications but introduces challenges in fault troubleshooting due to its complex service coordination and cooperation. To tackle these challenges, distributed tracing has emerged for end-to-end request tracing and system understanding. However, existing tracing solutions often suffer from code instrumentation, trace loss and inaccuracy. To overcome these limitations, we introduce ZeroTracer, an in-kernel online distributed tracing system equipped with an eBPF-based (extended Berkeley Packet Filter) trace generator. ZeroTracer tailors for tracking HTTP requests due to its popularity in microservice systems. In our evaluations, ZeroTracer achieves remarkable trace accuracy (i.e., over 91% ) and maintains stable performance under different workload concurrency. Moreover, ZeroTracer outperforms other non-invasive approaches which fail to reconcile accurate request causality. Notably, ZeroTracer effectively tracks end-to-end requests in multi-threaded microservice applications, which is absent in existing invasive distributed tracing systems with third-party library instrumentation. Moreover, ZeroTracer introduces a negligible overhead, with latency increasing by only 0.5% –1.2% and a modest 3% –5.8% increase in CPU and memory consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1478-1494"},"PeriodicalIF":5.6,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144243671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GEREM: Fast and Precise Error Resilience Assessment for GPU Microarchitectures","authors":"Jingweijia Tan;Xurui Li;An Zhong;Kaige Yan;Xiaohui Wei;Guanpeng Li","doi":"10.1109/TPDS.2025.3552679","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3552679","url":null,"abstract":"GPUs are widely used hardware acceleration platforms in many areas due to their great computational throughput. In the meanwhile, GPUs are vulnerable to transient hardware faults in the post-Moore era. Analyzing the error resilience of GPUs are critical for both hardware and software. Statistical fault injection approaches are commonly used for error resilience analysis, which are highly accurate but very time consuming. In this work, we propose GEREM, a first framework to speed up fault injection process so as to estimate the error resilience of GPU microarchitectures swiftly and precisely. We find early fault behaviors can be used to accurately predict the final outcomes of program execution. Based on this observation, we categorize the early behaviors of hardware faults into GPU Early Fault Manifestation models (EFMs). For data structures, EFMs are early propagation characteristics of faults, while for pipeline instructions, EFMs are heuristic properties of several instruction contexts. We further observe that EFMs are determined by static microarchitecture states, so we can capture them without actually simulating the program execution process under fault injections. Leveraging these observations, our GEREM framework first profiles the microarchitectural states related for EFMs at one time. It then injects faults into the profiled traces to immediately generate EFMs. For data storage structures, EFMs are directly used to predict final fault outcomes, while for pipeline instructions, machine learning is used for prediction. Evaluation results show GEREM precisely assesses the error resilience of GPU microarchitecture structures with <inline-formula><tex-math>$237times$</tex-math></inline-formula> speedup on average comparing with traditional fault injections.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"1011-1024"},"PeriodicalIF":5.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OmniLearn: A Framework for Distributed Deep Learning Over Heterogeneous Clusters","authors":"Sahil Tyagi;Prateek Sharma","doi":"10.1109/TPDS.2025.3553066","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553066","url":null,"abstract":"Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent techniques on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called <monospace>OmniLearn</monospace> to mitigate the effects of heterogeneity in distributed training. Our approach is inspired by proportional controllers to balance computation across heterogeneous servers, and works under varying resource availability. By dynamically adjusting worker mini-batches at runtime, <monospace>OmniLearn</monospace> reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1253-1267"},"PeriodicalIF":5.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Highly-Parallel and Scalable Hardware Accelerator for the NTest Othello Game Engine","authors":"Stefan Popa;Vlad Petric;Mihai Ivanovici","doi":"10.1109/TPDS.2025.3570596","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3570596","url":null,"abstract":"Othello is a two-player combinatorial game with 1E+28 legal positions and 1E+58 game tree complexity. We propose a HIghly PArallel, Scalable and configurable hardware accelerator for evaluating the middle and endgame Othello positions. We base HIPAS on NTest - a leading software Othello engine that uses the minimax algorithm with a quality pattern-based evaluation function, alpha-beta pruning, and heuristic mobility sorting. We describe its architecture and Field Programmable Gate Array implementation, measure its performance, and compare it with prior solutions. HIPAS achieves the highest quality evaluation, the highest performance with speed-ups up to several hundreds, and the best energy efficiency. The main novelty is the algorithm implementation as a circular pipeline and a Finite State Machine with pseudo-parallel processing. Although Othello was recently claimed to be weakly solved, the game remains unsolved in a stronger sense. A weak solution only shows how to force a draw. It does not guarantee a win if the opponent makes a mistake. HIPAS can validate the weak solution faster and more efficiently. A multi-threaded NTest software component evaluating the beginning and part of the middle game, combined with one or more instances of HIPAS for handling the remainder can provide a stronger solution.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1620-1633"},"PeriodicalIF":5.6,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144331566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Design of a High-Performance Fine-Grained Deduplication Framework for Backup Storage","authors":"Xiangyu Zou;Wen Xia;Philip Shilane;Haijun Zhang;Xuan Wang","doi":"10.1109/TPDS.2025.3551306","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3551306","url":null,"abstract":"Fine-grained deduplication (also known as delta compression) can achieve a better deduplication ratio compared to chunk-level deduplication. This technique removes not only identical chunks but also reduces redundancies between similar but non-identical chunks. Nevertheless, it introduces considerable I/O overhead in deduplication and restore processes, hindering the performance of these two processes and rendering fine-grained deduplication less popular than chunk-level deduplication to date. In this paper, we explore various issues that lead to additional I/O overhead and tackle them using several techniques. Moreover, we introduce MeGA, which attains fine-grained deduplication/restore speed nearly equivalent to chunk-level deduplication while maintaining the significant deduplication ratio benefit of fine-grained deduplication. Specifically, MeGA employs (1) a backup-workflow-oriented delta selector and cache-centric resemblance detection to mitigate poor spatial/temporal locality in the deduplication process, and (2) a delta-friendly data layout and “Always-Forward-Reference” traversal to address poor spatial/temporal locality in the restore workflow. Evaluations on four datasets show that MeGA achieves a better performance than other fine-grained deduplication approaches. Specifically, MeGA significantly outperforms the traditional greedy approach, providing 10–46 times better backup speed and 30–105 times more efficient restore speed, all while preserving a high deduplication ratio.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"945-960"},"PeriodicalIF":5.6,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and Scalable Neural Network Quantum States Method for Molecular Potential Energy Surfaces","authors":"Yangjun Wu;Wanlu Cao;Jiacheng Zhao;Honghui Shang","doi":"10.1109/TPDS.2025.3568360","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3568360","url":null,"abstract":"The Neural Network Quantum States (NNQS) method is highly promising for accurately solving the Schrödinger equation, yet it encounters challenges such as computational demands and slow rates of convergence. To address the high computational requirements, we introduce optimizations including a cross-sample KV cache sharing technique to enhance sampling efficiency, Quantum Bitwise and BloomHash methods for more efficient local energy computation, and mixed-precision training strategies to boost computational efficiency. To overcome the issue of slow convergence, we propose a parallel training algorithm for NNQS under second quantization to accelerate the training of base models for molecular potential surfaces. Our approach achieves up to 27-fold acceleration specifically in local energy calculations in systems with 154 spin orbitals and demonstrates strong and weak scaling efficiencies of 98% and 97%, respectively, on the H<inline-formula><tex-math>$_{2}$</tex-math></inline-formula>O<inline-formula><tex-math>$_{2}$</tex-math></inline-formula> potential surface training set. The parallelized implementation of transformer-based NNQS is highly portable on various high-performance computing architectures, offering new perspectives on quantum chemistry simulations.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1431-1443"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reinforcement Learning-Driven Adaptive Prefetch Aggressiveness Control for Enhanced Performance in Parallel System Architectures","authors":"Huijing Yang;Juan Fang;Yumin Hou;Xing Su;Neal N. Xiong","doi":"10.1109/TPDS.2025.3550531","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550531","url":null,"abstract":"In modern parallel system architectures, prefetchers are essential to mitigating the performance challenges posed by long memory access latencies. These architectures rely heavily on efficient memory access patterns to maximize system throughput and resource utilization. Prefetch aggressiveness is a central parameter in managing these access patterns; although increased prefetch aggressiveness can enhance performance for certain applications, it often risks causing cache pollution and bandwidth contention, leading to significant performance degradation in other workloads. While many existing prefetchers rely on static or simple built-in aggressiveness controllers, a more flexible, adaptive approach based on system-level feedback is essential to achieving optimal performance across parallel computing environments. In this paper, we introduce an Adaptive Prefetch Aggressiveness Control (APAC) framework that leverages Reinforcement Learning (RL) to dynamically manage prefetch aggressiveness in parallel system architectures. The APAC controller operates as an RL agent, which optimizes prefetch aggressiveness by dynamically responding to system feedback on prefetch accuracy, timeliness, and cache pollution. The agent receives a reward signal that reflects the impact of each adjustment on both performance and memory bandwidth, learning to adapt its control strategy based on workload characteristics. This data-driven adaptability makes APAC particularly well-suited for parallel architectures, where efficient resource management across cores is essential to scaling system performance. Our evaluation with the ChampSim simulator demonstrates that APAC effectively adapts to diverse workloads and system configurations, achieving performance gains of 6.73<inline-formula><tex-math>$%$</tex-math></inline-formula> in multi-core systems compared to traditional Feedback Directed Prefetching (FDP). By improving memory bandwidth utilization, reducing cache pollution, and minimizing inter-core interference, APAC significantly enhances prefetching performance in multi-core processors. These results underscore APAC’s potential as a robust solution for performance optimization in parallel system architectures, where efficient resource management is paramount for scaling modern processing environments.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"977-993"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10923695","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Acceleration Framework for Deep Reinforcement Learning Using Heterogeneous Systems","authors":"Yuan Meng;Mahesh A. Iyer;Viktor K. Prasanna","doi":"10.1109/TPDS.2025.3566766","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3566766","url":null,"abstract":"Deep Reinforcement Learning (DRL) is vital in various AI applications. DRL algorithms comprise diverse compute primitives, which may not be simultaneously optimized using a homogeneous architecture. However, even with available heterogeneous architectures, optimizing DRL performance remains a challenge due to the complexity of design space in parallelizing DRL primitives and the variety of hardware employed in modern data centers. To address this, we introduce a framework for composing parallel DRL systems on heterogeneous platforms consisting of general-purpose processors (CPUs) and accelerators (GPUs, FPGAs). Our innovations include: 1. A general training protocol agnostic of the underlying hardware, enabling portable implementations across various processors and accelerators. 2. Efficient design exploration and automatic task placement enabling parallelization of tasks within each DRL primitive over one or multiple heterogeneous devices. 3. Incorporation of DRL-specific optimizations on runtime scheduling and resource allocation, facilitating parallelized training and enhancing the overall system performance. 4. High-level API for productive development using the framework. We showcase our framework through experimentation with three widely used DRL algorithms, DQN, DDPG, and SAC, on three heterogeneous platforms with diverse hardware characteristics and interconnections. The generated implementations outperform state-of-the-art libraries for CPU-GPU platforms by throughput improvements of up to 2×, and <inline-formula><tex-math>$1.7times$</tex-math></inline-formula> higher performance portability across platforms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1401-1415"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144125636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CiMBA: Accelerating Genome Sequencing Through On-Device Basecalling via Compute-in-Memory","authors":"William Andrew Simon;Irem Boybat;Riselda Kodra;Elena Ferro;Gagandeep Singh;Mohammed Alser;Shubham Jain;Hsinyu Tsai;Geoffrey W. Burr;Onur Mutlu;Abu Sebastian","doi":"10.1109/TPDS.2025.3550811","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550811","url":null,"abstract":"As genome sequencing is finding utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, consuming >40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded (<inline-formula><tex-math>$sim 25$</tex-math></inline-formula> mm<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead, is capable of a throughput of 4.77 million bases per second, 24× that required for real-time operation, and achieves 17 × /27× power/area efficiency over the best prior basecalling embedded accelerator while maintaining a high accuracy comparable to state-of-the-art software basecallers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1130-1145"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AlignMalloc: Warp-Aware Memory Rearrangement Aligned With UVM Prefetching for Large-Scale GPU Dynamic Allocations","authors":"Jiajian Zhang;Fangyu Wu;Hai Jiang;Qiufeng Wang;Genlang Chen;Guangliang Cheng;Eng Gee Lim;Keqin Li","doi":"10.1109/TPDS.2025.3568688","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3568688","url":null,"abstract":"As parallel computing tasks rapidly expand in both complexity and scale, the need for efficient GPU dynamic memory allocation becomes increasingly important. While progress has been made in developing dynamic allocators for substantial applications, their real-world applicability is still limited due to inefficient memory access behaviors. This paper introduces AlignMalloc, a novel memory management system that aligns with the Unified Virtual Memory (UVM) prefetching strategy, significantly enhancing both memory allocation and access performance in large-scale dynamic allocation scenarios. We analyze the fundamental inefficiencies in UVM access and first reveal the mismatch between memory access and UVM prefetching methods. To resolve this issue, AlignMalloc implements a warp-aware memory rearrangement strategy that exploits the regularity of warps to align with the UVM’s static prefetching setup. Additionally, AlignMalloc introduces an OR tree-based structure within a host-co-managed framework to further optimize dynamic allocation. Comprehensive experiments demonstrate that AlignMalloc substantially outperforms current state-of-the-art systems, achieving up to <inline-formula><tex-math>$2.7 times$</tex-math></inline-formula> improvement in dynamic allocation and <inline-formula><tex-math>$2.3 times$</tex-math></inline-formula> in memory access. Additionally, eight real-world applications with diverse memory access patterns exhibit consistent performance enhancements, with average speedups <inline-formula><tex-math>$1.5 times$</tex-math></inline-formula>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1444-1459"},"PeriodicalIF":5.6,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}