Parallel Computing | Pub Date: 2026-03-01 | Epub Date: 2025-12-16 | DOI: 10.1016/j.parco.2025.103168
Kazutomo Yoshii, John R. Tramm, Bryce Allen, Tomohiro Ueno, Kentaro Sano, Andrew Siegel, Pete Beckman
{"title":"A case study in hardware specialization for Monte Carlo cross-section lookup","authors":"Kazutomo Yoshii , John R. Tramm , Bryce Allen , Tomohiro Ueno , Kentaro Sano , Andrew Siegel , Pete Beckman","doi":"10.1016/j.parco.2025.103168","DOIUrl":"10.1016/j.parco.2025.103168","url":null,"abstract":"<div><div>Hardware specialization is a promising direction in the post-Moore era, particularly for high-performance computing (HPC). In this work, we present a lightweight prototyping example of hardware specialization using open-source tools. Focusing on the Monte Carlo cross-section lookup kernel, a computation with low resource utilization on general-purpose architectures, we implement a custom hardware pipeline in Chisel and generate Verilog for resource usage estimation. We explore hardware optimization techniques that trade off throughput and resource usage, and show that, as SRAM scaling stalls and memory dominates chip area, using additional logic, even in brute-force forms, can lead to better overall efficiency. Our estimation demonstrates a significant performance gain over general-purpose CPUs. While this is a case study, the methodology provides a practical path for quick feasibility studies in hardware specialization.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103168"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146022651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Computing | Pub Date: 2025-11-01 | Epub Date: 2025-11-15 | DOI: 10.1016/j.parco.2025.103164
Motahhare Mirzaei, Mehrdad Ashtiani, Mohammad Javad Pirhadi, Sauleh Eetemadi
{"title":"LSHDP: Locally sharded heterogeneous data parallel for distributed deep learning","authors":"Motahhare Mirzaei, Mehrdad Ashtiani, Mohammad Javad Pirhadi, Sauleh Eetemadi","doi":"10.1016/j.parco.2025.103164","DOIUrl":"10.1016/j.parco.2025.103164","url":null,"abstract":"<div><div>In today’s world, pre-trained models such as GPT-3 and Llama 3.1, along with the use of transformers, recognized as large AI models, have gained significant importance. To accelerate the training of these models, distributed training has become a fundamental approach. This method enables the execution of model training across multiple GPUs, which is particularly essential for models that require more data and training time. Despite past advancements, achieving optimal utilization of GPU capacity remains a major challenge, especially in academic environments that often feature heterogeneous infrastructures and limited bandwidth between nodes, which do not align with the assumptions of existing methods. In previous methods, the node with the lowest computational power is considered the bottleneck, leading to computational slowdowns and increased waiting times for other nodes. This study addresses the issue by adjusting batch sizes to minimize node waiting times. This approach improves the efficiency of node utilization without reducing the convergence speed. Moreover, to address GPU memory limitations, existing methods often rely on high-speed inter-node communication. This reliance increases training time in scenarios with low network bandwidth (e.g., 1 Gb/s). This research mitigates this challenge using the LSDP (Locally Sharded Data Parallel) method, which leverages CPU memory instead of inter-node communication. Finally, by combining these two strategies, the LSHDP (Locally Sharded Heterogeneous Data Parallel) solution is introduced which is suitable for heterogeneous infrastructures with low inter-node communication speeds. 
Experiments demonstrate that this method outperforms previous approaches in such environments, achieving improvements of 35.39 % and 52.57 % in terms of speed compared to data-parallel and Fully Sharded Data Parallel (FSDP) methods respectively.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"126 ","pages":"Article 103164"},"PeriodicalIF":2.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Computing | Pub Date: 2025-11-01 | Epub Date: 2025-10-10 | DOI: 10.1016/j.parco.2025.103160
Qingke Zhang, Wenliang Chen, Shuzhao Pang, Sichen Tao, Conglin Li, Xin Yin
{"title":"GPU/CUDA-Accelerated gradient growth optimizer for efficient complex numerical global optimization","authors":"Qingke Zhang , Wenliang Chen , Shuzhao Pang , Sichen Tao , Conglin Li , Xin Yin","doi":"10.1016/j.parco.2025.103160","DOIUrl":"10.1016/j.parco.2025.103160","url":null,"abstract":"<div><div>Efficiently solving high-dimensional and complex numerical optimization problems remains a critical challenge in high-performance computing. This paper presents the GPU/CUDA-Accelerated Gradient Growth Optimizer (GGO)—a novel parallel metaheuristic algorithm that combines gradient-guided local search with GPU-enabled large-scale parallelism. Building upon the Growth Optimizer (GO), GGO incorporates a dimension-wise gradient-guiding strategy based on central difference approximations, which improves solution precision without requiring differentiable objective functions. To address the computational bottlenecks of high-dimensional problems, a hybrid CUDA-based framework is developed, integrating both fine-grained and coarse-grained parallel strategies to fully exploit GPU resources and minimize memory access latency. Extensive experiments on the CEC2017 and CEC2022 benchmark suites demonstrate the superior performance of GGO in terms of both convergence accuracy and computational speed. Compared to 49 state-of-the-art optimization algorithms, GGO achieves top-ranked results in 67% of test cases and delivers up to 7.8× speedup over its CPU-based counterpart. Statistical analyses using the Wilcoxon signed-rank test further confirm its robustness across 28 out of 29 functions in high-dimensional scenarios. 
Additionally, in-depth analysis reveals that GGO maintains high scalability and performance even as the problem dimension and population size increase, providing a generalizable solution for high-dimensional global optimization that is well-suited for parallel computing applications in scientific and engineering domains.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"126 ","pages":"Article 103160"},"PeriodicalIF":2.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145271446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Computing | Pub Date: 2025-11-01 | Epub Date: 2025-10-29 | DOI: 10.1016/j.parco.2025.103162
Pedro Moreno, Miguel Areias, Ricardo Rocha
{"title":"A sleek lock-free hash map in an ERA of safe memory reclamation methods","authors":"Pedro Moreno , Miguel Areias , Ricardo Rocha","doi":"10.1016/j.parco.2025.103162","DOIUrl":"10.1016/j.parco.2025.103162","url":null,"abstract":"<div><div>Lock-free data structures have become increasingly significant due to their algorithmic advantages in multi-core cache-based architectures. Safe Memory Reclamation (SMR) is a technique used in concurrent programming to ensure that memory can be safely reclaimed without causing data corruption, dangling pointers, or access to freed memory. The ERA theorem states that any SMR method for concurrent data structures can only provide at most two of the three main desirable properties: Ease of use, Robustness, and Applicability. This fundamental trade-off influences the design of efficient lock-free data structures at an early stage. This work redesigns a previous lock-free hash map to fully exploit the properties of the ERA theorem and to leverage the characteristics of multi-core cache-based architectures by minimizing the number of cache misses, which are a significant bottleneck in multi-core environments. Experimental results show that our design outperforms the previous design, which was already quite competitive when compared against the Concurrent Hash Map design of the Intel’s TBB library.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"126 ","pages":"Article 103162"},"PeriodicalIF":2.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145474308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Computing | Pub Date: 2025-11-01 | Epub Date: 2025-11-06 | DOI: 10.1016/j.parco.2025.103163
Athanasios Margaris, Stavros Souravlas
{"title":"Detecting chaotic regions of recurrent equations in parallel environments","authors":"Athanasios Margaris , Stavros Souravlas","doi":"10.1016/j.parco.2025.103163","DOIUrl":"10.1016/j.parco.2025.103163","url":null,"abstract":"<div><div>This paper investigates how parallel computing techniques, such as OpenMP and CUDA, can be optimized to enhance the computational efficiency of detecting chaotic regions in the parameter space of recurrent equations, a critical task in chaos theory. Leveraging the embarrassingly parallel nature of maximum Lyapunov exponent calculations, our method targets systems with known recurrence relations, where governing equations are analytically defined. Applied to a discretized recurrent neural model, the proposed approach achieves significant speedups, addressing the computational intensity of chaos detection. While building on established parallel techniques, this work fills a gap in their systematic application to chaos detection in high-dimensional systems, offering a scalable solution with potential for real-time analysis. We provide detailed performance metrics, parallel I/O guidelines, and visualization strategies, demonstrating adaptability to other analytically defined chaotic systems.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"126 ","pages":"Article 103163"},"PeriodicalIF":2.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145528562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A dependency-aware task offloading in IoT-based edge computing system using an optimized deep learning approach","authors":"Shiva Shankar Reddy , Silpa Nrusimhadri , Gadiraju Mahesh , Veeranki Venkata Rama Maheswara Rao","doi":"10.1016/j.parco.2025.103161","DOIUrl":"10.1016/j.parco.2025.103161","url":null,"abstract":"<div><div>Internet of Things (IoT) devices produce a lot of data, which can be difficult to process on limited computing systems. Edge computing aims to solve this issue by providing localized processing power at the edge of IoT networks to reduce communication delays and network bandwidth. Because of their limited resources and task dependencies, edge computing systems are facing computational issues as a result of the growing usage of IoT devices. An efficient task-offloading system that combines the Fire Hawk Optimizer (FHO) and Deep Reinforcement Learning (DRL) is proposed in this research to address these issues. This paper proposes leveraging deep learning techniques to prioritize and offload computational tasks from IoT applications to edge computing systems, addressing task interdependencies and resource constraints to enhance efficiency. The proposed method consists of two components. The first component uses Petri-Net modelling to analyze interdependencies among tasks, identify subtasks, and map their relationships. The second component uses a residual neural network-based actor-critic deep reinforcement learning (ResNet-ACDRL) decision-making model to offload tasks. Task dependencies and resource availability are assessed by the DRL component, namely a ResNet-ACDRL model, which is utilized to dynamically learn and enhance task-offloading strategies. In order to ensure optimal task allocation across local, edge, and cloud computing resources, the FHO is then used to refine these learned policies. Here, the term \"policy\" refers to the strategy used by the system to decide the most suitable resource for task execution. 
This dual approach strategy drastically reduces energy usage and execution delays. The suggested framework outperforms existing methods, according to experimental data, especially when managing task interdependencies and a variety of computational loads. The proposed method has been shown to significantly improve time delay and energy consumption compared to existing methods.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"126 ","pages":"Article 103161"},"PeriodicalIF":2.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145418403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Java to create and analyze models of parallel computing systems","authors":"Harish Padmanaban , Nurkasym Arkabaev , Maher Ali Rusho , Vladyslav Kozub , Yurii Kozub","doi":"10.1016/j.parco.2025.103146","DOIUrl":"10.1016/j.parco.2025.103146","url":null,"abstract":"<div><div>The purpose of the study is to develop optimal solutions for models of parallel computing systems using the Java language. During the study, programs were written for the examined models of parallel computing systems. The result of the parallel sorting code is the output of a sorted array of random numbers. When processing data in parallel, the time spent on processing and the first elements of the list of squared numbers are displayed. When processing requests asynchronously, processing completion messages are displayed for each task with a slight delay. The main results include the development of optimization methods for algorithms and processes, such as the division of tasks into subtasks, the use of non-blocking algorithms, effective memory management, and load balancing, as well as the construction of diagrams and comparison of these methods by characteristics, including descriptions, implementation examples, and advantages. In addition, various specialized libraries were analyzed to improve the performance and scalability of the models. The results of the work performed showed a substantial improvement in response time, bandwidth, and resource efficiency in parallel computing systems. Scalability and load analysis assessments were conducted, demonstrating how the system responds to an increase in data volume or the number of threads. Profiling tools were used to analyze performance in detail and identify bottlenecks in models, which improved the architecture and implementation of parallel computing systems. 
The obtained results emphasize the importance of choosing the right methods and tools for optimizing parallel computing systems, which can substantially improve their performance and efficiency.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103146"},"PeriodicalIF":2.0,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Computing | Pub Date: 2025-09-01 | Epub Date: 2025-07-11 | DOI: 10.1016/j.parco.2025.103147
Yuyao Niu, Marc Cacas
{"title":"ALBBA: An efficient ALgebraic Bypass BFS Algorithm on long vector architectures","authors":"Yuyao Niu, Marc Cacas","doi":"10.1016/j.parco.2025.103147","DOIUrl":"10.1016/j.parco.2025.103147","url":null,"abstract":"<div><div>Breadth First Search (BFS) is a fundamental algorithm in scientific computing, databases, and network analysis applications. In the algebraic BFS paradigm, each BFS iteration is expressed as a sparse matrix–vector multiplication, allowing BFS to be accelerated and analyzed through well-established linear algebra primitives. Although much effort has been made to optimize algebraic BFS on parallel platforms such as CPUs, GPUs, and distributed memory systems, vector architectures that exploit Single Instruction Multiple Data (SIMD) parallelism, particularly with their high performance on sparse workloads, remain relatively underexplored for BFS.</div><div>In this paper, we propose the ALgebraic Bypass BFS Algorithm (ALBBA), a novel and efficient algebraic BFS implementation optimized for long vector architectures. ALBBA utilizes a customized variant of the SELL-C-σ data structure to fully exploit the SIMD capabilities. By integrating a vectorization-friendly search method alongside a two-level bypass strategy, we enhance both sparse matrix-sparse vector multiplication (SpMSpV) and sparse matrix-dense vector multiplication (SpMV) algorithms, which are crucial for algebraic BFS operations. We further incorporate merge primitives and adopt an efficient selection method for each BFS iteration. 
Our experiments on an NEC VE20B processor demonstrate that ALBBA achieves average speedups of 3.91×, 2.88×, and 1.46× over Enterprise, GraphBLAST, and Gunrock running on an NVIDIA H100 GPU, respectively.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103147"},"PeriodicalIF":2.0,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144634453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Computing | Pub Date: 2025-09-01 | Epub Date: 2025-08-11 | DOI: 10.1016/j.parco.2025.103149
Xiang Zhao, Haitao Du, Yi Kang
{"title":"Enable cross-iteration parallelism for PIM-based graph processing with vertex-level synchronization","authors":"Xiang Zhao, Haitao Du, Yi Kang","doi":"10.1016/j.parco.2025.103149","DOIUrl":"10.1016/j.parco.2025.103149","url":null,"abstract":"<div><div>Processing-in-memory (PIM) architectures have emerged as a promising solution for accelerating graph processing by enabling computation in memory and minimizing data movement. However, most existing PIM-based graph processing systems rely on the Bulk Synchronous Parallel (BSP) model, which frequently enforces global barriers that limit cross-iteration computational parallelism and introduce significant synchronization and communication overheads.</div><div>To address these limitations, we propose the Cross Iteration Parallel (CIP) model, a novel vertex-level synchronization approach that eliminates global barriers by independently tracking the synchronization states of vertices. The CIP model enables concurrent execution across iterations, enhancing computational parallelism, overlapping communication and computation, improving core utilization, and increasing resilience to workload imbalance. We implement the CIP model in a PIM-based graph processing system, GraphDF, which features a few specially designed function units to support vertex-level synchronization. 
Evaluated on a PyMTL3-based cycle-accurate simulator using four real-world graphs and four graph algorithms, CIP running on GraphDF achieves an average speedup of 1.8× and a maximum of 2.3× compared to Dalorex, the state-of-the-art PIM-based graph processing system.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103149"},"PeriodicalIF":2.1,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144860808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Computing | Pub Date: 2025-09-01 | Epub Date: 2025-08-25 | DOI: 10.1016/j.parco.2025.103150
Ali Nada, Hazem Ismail Ali, Liang Liu, Yousra Alkabani
{"title":"Software acceleration of multi-user MIMO uplink detection on GPU","authors":"Ali Nada , Hazem Ismail Ali , Liang Liu , Yousra Alkabani","doi":"10.1016/j.parco.2025.103150","DOIUrl":"10.1016/j.parco.2025.103150","url":null,"abstract":"<div><div>This paper presents the exploration of GPU-accelerated block-wise decompositions for zero-forcing (ZF) based QR and Cholesky methods applied to massive multiple-input multiple-output (MIMO) uplink detection algorithms. Three algorithms are evaluated: ZF with block Cholesky decomposition, ZF with block QR decomposition (QRD), and minimum mean square error (MMSE) with block Cholesky decomposition. The latter was the only one previously explored, but it used standard Cholesky decomposition. Our approach achieves an 11% improvement over the previous GPU-accelerated MMSE study.</div><div>Through performance analysis, we observe a trade-off between precision and execution time. Reducing precision from FP64 to FP32 improves execution time but increases bit error rate (BER), with ZF-based QRD reducing execution time from 2.04 μs to 1.24 μs for a 128 × 8 MIMO size. The study also highlights that larger MIMO sizes, particularly 2048 × 32, require GPUs to fully utilize their computational and memory capabilities, especially under FP64 precision. In contrast, smaller matrices are compute-bound.</div><div>Our results recommend GPUs for larger MIMO sizes, as they offer the parallelism and memory resources necessary to efficiently handle the computational demands of next-generation networks. 
This work paves the way for scalable, GPU-based massive MIMO uplink detection systems.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103150"},"PeriodicalIF":2.1,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144922663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}