GPU/CUDA-Accelerated gradient growth optimizer for efficient complex numerical global optimization
Qingke Zhang, Wenliang Chen, Shuzhao Pang, Sichen Tao, Conglin Li, Xin Yin
Parallel Computing, vol. 126, article 103160, published 2025-10-10. DOI: 10.1016/j.parco.2025.103160

Efficiently solving high-dimensional and complex numerical optimization problems remains a critical challenge in high-performance computing. This paper presents the GPU/CUDA-Accelerated Gradient Growth Optimizer (GGO), a novel parallel metaheuristic algorithm that combines gradient-guided local search with GPU-enabled large-scale parallelism. Building upon the Growth Optimizer (GO), GGO incorporates a dimension-wise gradient-guiding strategy based on central difference approximations, which improves solution precision without requiring differentiable objective functions. To address the computational bottlenecks of high-dimensional problems, a hybrid CUDA-based framework is developed, integrating both fine-grained and coarse-grained parallel strategies to fully exploit GPU resources and minimize memory access latency. Extensive experiments on the CEC2017 and CEC2022 benchmark suites demonstrate the superior performance of GGO in terms of both convergence accuracy and computational speed. Compared to 49 state-of-the-art optimization algorithms, GGO achieves top-ranked results in 67% of test cases and delivers up to 7.8× speedup over its CPU-based counterpart. Statistical analyses using the Wilcoxon signed-rank test further confirm its robustness across 28 out of 29 functions in high-dimensional scenarios. Additionally, in-depth analysis reveals that GGO maintains high scalability and performance even as the problem dimension and population size increase, providing a generalizable solution for high-dimensional global optimization that is well-suited for parallel computing applications in scientific and engineering domains.

Software acceleration of multi-user MIMO uplink detection on GPU
Ali Nada, Hazem Ismail Ali, Liang Liu, Yousra Alkabani
Parallel Computing, vol. 125, article 103150, published 2025-09-01. DOI: 10.1016/j.parco.2025.103150

This paper presents the exploration of GPU-accelerated block-wise decompositions for zero-forcing (ZF) based QR and Cholesky methods applied to massive multiple-input multiple-output (MIMO) uplink detection algorithms. Three algorithms are evaluated: ZF with block Cholesky decomposition, ZF with block QR decomposition (QRD), and minimum mean square error (MMSE) with block Cholesky decomposition. The latter was the only one previously explored, but it used standard Cholesky decomposition. Our approach achieves an 11% improvement over the previous GPU-accelerated MMSE study.

Through performance analysis, we observe a trade-off between precision and execution time. Reducing precision from FP64 to FP32 improves execution time but increases bit error rate (BER), with ZF-based QRD reducing execution time from 2.04 μs to 1.24 μs for a 128 × 8 MIMO size. The study also highlights that larger MIMO sizes, particularly 2048 × 32, require GPUs to fully utilize their computational and memory capabilities, especially under FP64 precision. In contrast, smaller matrices are compute-bound.

Our results recommend GPUs for larger MIMO sizes, as they offer the parallelism and memory resources necessary to efficiently handle the computational demands of next-generation networks. This work paves the way for scalable, GPU-based massive MIMO uplink detection systems.

{"title":"Enable cross-iteration parallelism for PIM-based graph processing with vertex-level synchronization","authors":"Xiang Zhao, Haitao Du, Yi Kang","doi":"10.1016/j.parco.2025.103149","DOIUrl":"10.1016/j.parco.2025.103149","url":null,"abstract":"<div><div>Processing-in-memory (PIM) architectures have emerged as a promising solution for accelerating graph processing by enabling computation in memory and minimizing data movement. However, most existing PIM-based graph processing systems rely on the Bulk Synchronous Parallel (BSP) model, which frequently enforces global barriers that limit cross-iteration computational parallelism and introduce significant synchronization and communication overheads.</div><div>To address these limitations, we propose the Cross Iteration Parallel (CIP) model, a novel vertex-level synchronization approach that eliminates global barriers by independently tracking the synchronization states of vertices. The CIP model enables concurrent execution across iterations, enhancing computational parallelism, overlapping communication and computation, improving core utilization, and increasing resilience to workload imbalance. We implement the CIP model in a PIM-based graph processing system, GraphDF, which features a few specially designed function units to support vertex-level synchronization. Evaluated on a PyMTL3-based cycle-accurate simulator using four real-world graphs and four graph algorithms, CIP running on GraphDF achieves an average speedup of 1.8<span><math><mo>×</mo></math></span> and a maximum of 2.3<span><math><mo>×</mo></math></span> compared to Dalorex, the state-of-the-art PIM-based graph processing system.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103149"},"PeriodicalIF":2.1,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144860808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ALBBA: An efficient ALgebraic Bypass BFS Algorithm on long vector architectures","authors":"Yuyao Niu, Marc Cacas","doi":"10.1016/j.parco.2025.103147","DOIUrl":"10.1016/j.parco.2025.103147","url":null,"abstract":"<div><div>Breadth First Search (BFS) is a fundamental algorithm in scientific computing, databases, and network analysis applications. In the algebraic BFS paradigm, each BFS iteration is expressed as a sparse matrix–vector multiplication, allowing BFS to be accelerated and analyzed through well-established linear algebra primitives. Although much effort has been made to optimize algebraic BFS on parallel platforms such as CPUs, GPUs, and distributed memory systems, vector architectures that exploit Single Instruction Multiple Data (SIMD) parallelism, particularly with their high performance on sparse workloads, remain relatively underexplored for BFS.</div><div>In this paper, we propose the ALgebraic Bypass BFS Algorithm (ALBBA), a novel and efficient algebraic BFS implementation optimized for long vector architectures. ALBBA utilizes a customized variant of the SELL-<span><math><mi>C</mi></math></span>-<span><math><mi>σ</mi></math></span> data structure to fully exploit the SIMD capabilities. By integrating a vectorization-friendly search method alongside a two-level bypass strategy, we enhance both sparse matrix-sparse vector multiplication (SpMSpV) and sparse matrix-dense vector multiplication (SpMV) algorithms, which are crucial for algebraic BFS operations. We further incorporate merge primitives and adopt an efficient selection method for each BFS iteration. Our experiments on an NEC VE20B processor demonstrate that ALBBA achieves average speedups of 3.91<span><math><mo>×</mo></math></span> , 2.88<span><math><mo>×</mo></math></span> , and 1.46<span><math><mo>×</mo></math></span> over Enterprise, GraphBLAST, and Gunrock running on an NVIDIA H100 GPU, respectively.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103147"},"PeriodicalIF":2.0,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144634453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Java to create and analyze models of parallel computing systems","authors":"Harish Padmanaban , Nurkasym Arkabaev , Maher Ali Rusho , Vladyslav Kozub , Yurii Kozub","doi":"10.1016/j.parco.2025.103146","DOIUrl":"10.1016/j.parco.2025.103146","url":null,"abstract":"<div><div>The purpose of the study is to develop optimal solutions for models of parallel computing systems using the Java language. During the study, programs were written for the examined models of parallel computing systems. The result of the parallel sorting code is the output of a sorted array of random numbers. When processing data in parallel, the time spent on processing and the first elements of the list of squared numbers are displayed. When processing requests asynchronously, processing completion messages are displayed for each task with a slight delay. The main results include the development of optimization methods for algorithms and processes, such as the division of tasks into subtasks, the use of non-blocking algorithms, effective memory management, and load balancing, as well as the construction of diagrams and comparison of these methods by characteristics, including descriptions, implementation examples, and advantages. In addition, various specialized libraries were analyzed to improve the performance and scalability of the models. The results of the work performed showed a substantial improvement in response time, bandwidth, and resource efficiency in parallel computing systems. Scalability and load analysis assessments were conducted, demonstrating how the system responds to an increase in data volume or the number of threads. Profiling tools were used to analyze performance in detail and identify bottlenecks in models, which improved the architecture and implementation of parallel computing systems. The obtained results emphasize the importance of choosing the right methods and tools for optimizing parallel computing systems, which can substantially improve their performance and efficiency.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103146"},"PeriodicalIF":2.0,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-based accelerator for YOLOv5 object detection with optimized computation and data access for edge deployment","authors":"Wei Qian , Zhengwei Zhu , Chenyang Zhu , Yanping Zhu","doi":"10.1016/j.parco.2025.103138","DOIUrl":"10.1016/j.parco.2025.103138","url":null,"abstract":"<div><div>In the realm of object detection, advancements in convolutional neural networks have been substantial. However, their high computational and data access demands complicate the deployment of these algorithms on edge devices. To mitigate these challenges, field-programmable gate arrays have emerged as an ideal hardware platform for executing the parallel computations inherent in convolutional neural networks, owing to their low power consumption and rapid response capabilities. We have developed a field-programmable gate array-based accelerator for the You Only Look Once version 5 (YOLOv5) object detection network, implemented using Verilog Hardware Description Language on the Xilinx XCZU15EG chip. This accelerator efficiently processes the convolutional layers, batch normalization fusion layers, and tensor addition operations of the Yolov5 network. Our architecture segregates the convolution computations into two computing units: multiplication and addition. The addition operations are significantly accelerated by the introduction of compressor adders and ternary adder trees. Additionally, off-chip bandwidth pressure is alleviated through the use of dual-input single-output buffers and dedicated data access units. Experimental results demonstrate that the power consumption of the accelerator is 13.021 watts at a central frequency of 200 megahertz. Experiment results indicate that our accelerator outperforms Amazon Web Services Graviton2 central processing units and Jetson Nano graphics processing units. Ablation experiments validate the enhancements provided by our innovative designs. Ultimately, our approach significantly boosts the inference speed of the Yolov5 network, with improvements of 61.88%, 69.1%, 59.36%, 64.07%, and 65.92%, thereby dramatically enhancing the performance of the accelerator and surpassing existing methods.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"124 ","pages":"Article 103138"},"PeriodicalIF":2.0,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143912054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EESF: Energy-efficient scheduling framework for deadline-constrained workflows with computation speed estimation method in cloud","authors":"Rupinder Kaur, Gurjinder Kaur, Major Singh Goraya","doi":"10.1016/j.parco.2025.103139","DOIUrl":"10.1016/j.parco.2025.103139","url":null,"abstract":"<div><div>Substantial amount of energy consumed by rapidly growing cloud data centers is a major hindrance to sustainable cloud computing. Therefore, this paper proposes a scheduling framework named EESF aiming at minimizing the energy consumption and makespan of workflow execution under deadline and dependency constraints. The novel aspects of the proposed EESF are outlined as follows: 1) it first estimates the computation speed requirements of the entire workflow application before beginning the execution. Then, it estimates the computation speed requirements of individual tasks dynamically during execution. 2) Different from existing approaches that mainly assign tasks to virtual machines (VMs) with lower energy consumption or use DVFS to lower the frequency or voltage of hosts/VMs leading to longer makespan, EESF considers the degree of dependency of the tasks along with estimated speed for task-VM assignment. 3) Based on the fact that scheduling dependent tasks on same VM is not always energy-efficient, a new concept of virtual task clustering is introduced to schedule the tasks with dependencies in an energy-efficient manner. 4) EESF deploys VMs dynamically as per the necessary computation speed requirements of the tasks to prevent over-provisioning/under-provisioning of computational power. 5) In general, task reassignment causes huge data transfer which also consumes energy, but EESF reassigns tasks to more-energy efficient VMs running on the same host, thereby zeroing the data transfer time. Experiments performed using four real-world scientific workflows and 10 random workflows illustrate that EESF reduces energy consumption by 6%-44% than related algorithms while significantly reducing the makespan.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"124 ","pages":"Article 103139"},"PeriodicalIF":2.0,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143935399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-level parallelism optimization for two-dimensional convolution vectorization method on multi-core vector accelerator
Siyang Xing, Youmeng Li, Zikun Deng, Qijun Zheng, Zeyu Lu, Qinglin Wang
Parallel Computing, vol. 124, article 103137, published 2025-04-29. DOI: 10.1016/j.parco.2025.103137

The widespread application of convolutional neural networks across diverse domains has made accelerating convolutional computations increasingly important. In this work, we design a multi-level parallelism optimization method for a direct convolution vectorization algorithm based on a channel-first data layout on a multi-core vector accelerator. Within a single core, the method organizes computation by input row and weight column so that more elements are computed simultaneously, effectively hiding instruction latency and improving instruction-level parallelism. It also largely eliminates the data overlap caused by sliding convolution windows. Across cores, the data flow is optimized with data reuse schemes tailored to different situations. Experimental results show that multi-core computational efficiency improves greatly, reaching up to 80.2%. For the typical network ResNet18, the method achieves a 4.42-5.63× speedup over the existing method on the accelerator.

{"title":"Byzantine-tolerant detection of causality: There is no holy grail","authors":"Anshuman Misra , Ajay D. Kshemkalyani","doi":"10.1016/j.parco.2025.103136","DOIUrl":"10.1016/j.parco.2025.103136","url":null,"abstract":"<div><div>Detecting causality or the “happened before” relation between events in an asynchronous distributed system is a widely used building block in distributed applications. To the best of our knowledge, this problem has not been examined in a system with Byzantine processes. We prove the following results for an asynchronous system with Byzantine processes. (1) We prove that it is impossible to determine causality between events in the presence of even a single Byzantine process when processes communicate by unicasting. (2) We also prove a similar impossibility result when processes communicate by broadcasting. (3) We also prove a similar impossibility result when processes communicate by multicasting. (4–5) In an execution where there exists a causal path between two events passing through only correct processes, we prove that it is possible to detect causality between such a pair of events when processes communicate by unicasting or broadcasting. (6) However, when processes communicate by multicasting and there exists a causal path between two events passing through only correct processes, we prove that it is impossible to detect causality between such a pair of events. (7–9) Even with the use of cryptography, we prove that the impossibility results of (1–3) for unicasts, broadcasts, and multicasts, respectively, hold. (10–12) With the use of cryptography, when there exists a causal path between two events passing through only correct processes, we prove it is possible to detect causality between such a pair of events, irrespective of whether the communication is by unicasts, broadcasts, or multicasts. Our results are significant because Byzantine systems mirror the real world.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"124 ","pages":"Article 103136"},"PeriodicalIF":2.0,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143838095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimating resource budgets to ensure autotuning efficiency
Jaroslav Olha, Jana Hozzová, Matej Antol, Jiří Filipovič
Parallel Computing, vol. 123, article 103126, published 2025-02-10. DOI: 10.1016/j.parco.2025.103126

Many state-of-the-art HPC applications rely on autotuning to maintain peak performance. Autotuning allows a program to be re-optimized for new hardware, settings, or input, even during execution. However, the approach has an inherent problem that has yet to be properly addressed: since the autotuning process itself requires computational resources, it is also subject to optimization. In other words, while autotuning aims to decrease a program's run time by improving its efficiency, it also introduces additional overhead that can extend the overall run time. To achieve optimal performance, both the application and the autotuning process should be optimized together, treating them as a single optimization criterion. This framing allows us to determine a reasonable tuning budget to avoid both undertuning, where insufficient autotuning leads to suboptimal performance, and overtuning, where excessive autotuning imposes overhead that outweighs the benefits of program optimization.

In this paper, we explore the tuning budget optimization problem in detail, highlighting its interesting properties and implications, which have largely been overlooked in the literature. Additionally, we present several viable solutions for tuning budget optimization and evaluate their efficiency across a range of commonly used HPC kernels.
