Siyang Xing , Youmeng Li , Zikun Deng , Qijun Zheng , Zeyu Lu , Qinglin Wang
{"title":"Multi-level parallelism optimization for two-dimensional convolution vectorization method on multi-core vector accelerator","authors":"Siyang Xing , Youmeng Li , Zikun Deng , Qijun Zheng , Zeyu Lu , Qinglin Wang","doi":"10.1016/j.parco.2025.103137","DOIUrl":"10.1016/j.parco.2025.103137","url":null,"abstract":"<div><div>The widespread application of convolutional neural network across diverse domains has highlighted the growing significance of accelerating convolutional computations. In this work, we design a multi-level parallelism optimization method for direct convolution vectorization algorithm based on a channel-first data layout on a multi-core vector accelerator. This method calculates based on the input row and weight column in a single core, and achieves the simultaneous computation of more elements, thereby effectively hiding the latency of instructions and improving the degree of parallelism at instruction-level. This method can also substantially eliminates data overlap caused by convolutional windows sliding. Among multiple cores, the data flow is optimized with various data reuse methods for different situations. Experimental results show that the computational efficiency on multi-core can be improved greatly, up to 80.2%. For the typical network ResNet18, compared with existing method on the accelerator, a performance acceleration of 4.42-5.63 times can be achieved.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"124 ","pages":"Article 103137"},"PeriodicalIF":2.0,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143894768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Byzantine-tolerant detection of causality: There is no holy grail","authors":"Anshuman Misra , Ajay D. Kshemkalyani","doi":"10.1016/j.parco.2025.103136","DOIUrl":"10.1016/j.parco.2025.103136","url":null,"abstract":"<div><div>Detecting causality or the “happened before” relation between events in an asynchronous distributed system is a widely used building block in distributed applications. To the best of our knowledge, this problem has not been examined in a system with Byzantine processes. We prove the following results for an asynchronous system with Byzantine processes. (1) We prove that it is impossible to determine causality between events in the presence of even a single Byzantine process when processes communicate by unicasting. (2) We also prove a similar impossibility result when processes communicate by broadcasting. (3) We also prove a similar impossibility result when processes communicate by multicasting. (4–5) In an execution where there exists a causal path between two events passing through only correct processes, we prove that it is possible to detect causality between such a pair of events when processes communicate by unicasting or broadcasting. (6) However, when processes communicate by multicasting and there exists a causal path between two events passing through only correct processes, we prove that it is impossible to detect causality between such a pair of events. (7–9) Even with the use of cryptography, we prove that the impossibility results of (1–3) for unicasts, broadcasts, and multicasts, respectively, hold. (10–12) With the use of cryptography, when there exists a causal path between two events passing through only correct processes, we prove it is possible to detect causality between such a pair of events, irrespective of whether the communication is by unicasts, broadcasts, or multicasts. Our results are significant because Byzantine systems mirror the real world.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"124 ","pages":"Article 103136"},"PeriodicalIF":2.0,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143838095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jaroslav Olha, Jana Hozzová, Matej Antol, Jiří Filipovič
{"title":"Estimating resource budgets to ensure autotuning efficiency","authors":"Jaroslav Olha, Jana Hozzová, Matej Antol, Jiří Filipovič","doi":"10.1016/j.parco.2025.103126","DOIUrl":"10.1016/j.parco.2025.103126","url":null,"abstract":"<div><div>Many state-of-the-art HPC applications rely on autotuning to maintain peak performance. Autotuning allows a program to be re-optimized for new hardware, settings, or input — even during execution. However, the approach has an inherent problem that has yet to be properly addressed: since the autotuning process itself requires computational resources, it is also subject to optimization. In other words, while autotuning aims to decrease a program’s run time by improving its efficiency, it also introduces additional overhead that can extend the overall run time. To achieve optimal performance, both the application and the autotuning process should be optimized together, treating them as a single optimization criterion. This framing allows us to determine a reasonable tuning budget to avoid both undertuning, where insufficient autotuning leads to suboptimal performance, and overtuning, where excessive autotuning imposes overhead that outweighs the benefits of program optimization.</div><div>In this paper, we explore the tuning budget optimization problem in detail, highlighting its interesting properties and implications, which have largely been overlooked in the literature. Additionally, we present several viable solutions for tuning budget optimization and evaluate their efficiency across a range of commonly used HPC kernels.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103126"},"PeriodicalIF":2.0,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Henri Casanova , Arnaud Giersch , Arnaud Legrand , Martin Quinson , Frédéric Suter
{"title":"Lowering entry barriers to developing custom simulators of distributed applications and platforms with SimGrid","authors":"Henri Casanova , Arnaud Giersch , Arnaud Legrand , Martin Quinson , Frédéric Suter","doi":"10.1016/j.parco.2025.103125","DOIUrl":"10.1016/j.parco.2025.103125","url":null,"abstract":"<div><div>Researchers in parallel and distributed computing (PDC) often resort to simulation because experiments conducted using a simulator can be for arbitrary experimental scenarios, are less resource-, labor-, and time-consuming than their real-world counterparts, and are perfectly repeatable and observable. Many frameworks have been developed to ease the development of PDC simulators, and these frameworks provide different levels of accuracy, scalability, versatility, extensibility, and usability. The SimGrid framework has been used by many PDC researchers to produce a wide range of simulators for over two decades. Its popularity is due to a large emphasis placed on accuracy, scalability, and versatility, and is in spite of shortcomings in terms of extensibility and usability. Although SimGrid provides sensible simulation models for the common case, it was difficult for users to extend these models to meet domain-specific needs. Furthermore, SimGrid only provided relatively low-level simulation abstractions, making the implementation of a simulator of a complex system a labor-intensive undertaking. In this work we describe developments in the last decade that have contributed to vastly improving extensibility and usability, thus lowering or removing entry barriers for users to develop custom SimGrid simulators.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103125"},"PeriodicalIF":2.0,"publicationDate":"2025-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143176246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiran Gao , Li Chen , Haoyu Wang , Huimin Cui , Xiaobing Feng
{"title":"Scalable tasking runtime with parallelized builders for explicit message passing architectures","authors":"Xiran Gao , Li Chen , Haoyu Wang , Huimin Cui , Xiaobing Feng","doi":"10.1016/j.parco.2024.103124","DOIUrl":"10.1016/j.parco.2024.103124","url":null,"abstract":"<div><div>The sequential task flow (STF) model introduces implicit data dependences to exploit task-based parallelism, simplifying programming but also introducing non-negligible runtime overhead. On emerging cache-less, explicit inter-core message passing (EMP) architectures, the long latency of memory access further amplifies the runtime overhead of the traditional STF model, resulting in unsatisfactory performance.</div><div>This paper addresses two main components in the STF tasking runtime. We uncover abundant concurrency in the task dependence graph (TDG) building process through three sufficient conditions, put forward PBH, a parallelized TDG building algorithm with helpers which mixes pipeline parallelism and data parallelism to overcome the TDG building bottleneck for fine-grained tasks. We also introduce a centralized, lock-less task scheduler, EMP-C, based on the EMP interface, and propose three optimizations. These two techniques are implemented and evaluated on a product processor with EMP support, i.e. SW26010. Experimental results show that compared to traditional techniques, PBH achieves an average speedup of 1.55 for fine-grained task workloads, and the EMP-C scheduler brings speedups as high as 1.52 and 2.38 for fine-grained and coarse-grained task workloads, respectively. And the combination of these two techniques significantly improves the granularity scalability of the runtime, reducing the minimum effective task granularity (METG) to 0.1 ms and achieving an order of magnitude decrease in some cases.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103124"},"PeriodicalIF":2.0,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143176245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kasia Świrydowicz , Nicholson Koukpaizan , Maksudul Alam , Shaked Regev , Michael Saunders , Slaven Peleš
{"title":"Iterative methods in GPU-resident linear solvers for nonlinear constrained optimization","authors":"Kasia Świrydowicz , Nicholson Koukpaizan , Maksudul Alam , Shaked Regev , Michael Saunders , Slaven Peleš","doi":"10.1016/j.parco.2024.103123","DOIUrl":"10.1016/j.parco.2024.103123","url":null,"abstract":"<div><div>Linear solvers are major computational bottlenecks in a wide range of decision support and optimization computations. The challenges become even more pronounced on heterogeneous hardware, where traditional sparse numerical linear algebra methods are often inefficient. For example, methods for solving ill-conditioned linear systems have relied on conditional branching, which degrades performance on hardware accelerators such as graphical processing units (GPUs). To improve the efficiency of solving ill-conditioned systems, our computational strategy separates computations that are efficient on GPUs from those that need to run on traditional central processing units (CPUs). Our strategy maximizes the reuse of expensive CPU computations. Iterative methods, which thus far have not been broadly used for ill-conditioned linear systems, play an important role in our approach. In particular, we extend ideas from Arioli et al., (2007) to implement iterative refinement using inexact LU factors and flexible generalized minimal residual (FGMRES), with the aim of efficient performance on GPUs. We focus on solutions that are effective within broader application contexts, and discuss how early performance tests could be improved to be more predictive of the performance in a realistic environment.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103123"},"PeriodicalIF":2.0,"publicationDate":"2024-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143175823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards resilient and energy efficient scalable Krylov solvers","authors":"Zheng Miao , Jon C. Calhoun , Rong Ge","doi":"10.1016/j.parco.2024.103122","DOIUrl":"10.1016/j.parco.2024.103122","url":null,"abstract":"<div><div>Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency and resilience have been traditionally studied in isolation and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of common resilience techniques including checkpoint-restart and forward recovery. We focus on sparse linear solvers as they are the fundamental kernels in many scientific applications. In particular, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes on computer clusters, and develop and prototype performance optimization and power management strategies to improve energy efficiency. Moreover, we take a deep dive into the forward recovery that recently started to draw attention from researchers, and propose a practical matrix-aware optimization technique to reduce its recovery time. This work shows that while the time and energy costs of various resilience techniques are different, they share the common components and can be quantitatively evaluated with a generalized framework. This analysis framework can be used to guide the design of performance and energy optimization technologies. While each resilience technique has its advantages depending on the fault rate, system size, and power budget, the forward recovery can further benefit from matrix-aware optimizations for large-scale computing.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103122"},"PeriodicalIF":2.0,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142703732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaofeng Zou , Yuanxi Peng , Tuo Li , Lingjun Kong , Lu Zhang
{"title":"Seesaw: A 4096-bit vector processor for accelerating Kyber based on RISC-V ISA extensions","authors":"Xiaofeng Zou , Yuanxi Peng , Tuo Li , Lingjun Kong , Lu Zhang","doi":"10.1016/j.parco.2024.103121","DOIUrl":"10.1016/j.parco.2024.103121","url":null,"abstract":"<div><div>The ML-KEM standard based on Kyber algorithm is one of the post-quantum cryptography (PQC) standards released by the National Institute of Standards and Technology (NIST) to withstand quantum attacks. To increase throughput and reduce the execution time that is limited by the high computational complexity of the Kyber algorithm, an RISC-V-based processor Seesaw is designed to accelerate the Kyber algorithm. The 32 specialized extension instructions are mainly designed to enhance the parallel computing ability of the processor and accelerate all the processes of the Kyber algorithm by thoroughly analyzing its characteristics. Subsequently, by carefully designing hardware such as poly vector registers and algorithm execution units on the RISC-V processor, the support of microarchitecture for extension instructions was achieved. Seesaw supports 4096-bit vector calculations through its poly vector registers and execution unit to meet high-throughput requirements and is implemented on the field-programmable gate array (FPGA). In addition, we modify the compiler simultaneously to adapt to the instruction extension and execution of Seesaw. Experimental results indicate that the processor achieves a speed-up of 432<span><math><mo>×</mo></math></span> and 18864<span><math><mo>×</mo></math></span> for hash and NTT, respectively, compared with that without extension instructions and a speed-up of 5.6<span><math><mo>×</mo></math></span> for the execution of the Kyber algorithm compared with the advanced hardware design.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103121"},"PeriodicalIF":2.0,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fenglong Cai , Dong Yuan , Zhe Yang , Yonghui Xu , Wei He , Wei Guo , Lizhen Cui
{"title":"FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning","authors":"Fenglong Cai , Dong Yuan , Zhe Yang , Yonghui Xu , Wei He , Wei Guo , Lizhen Cui","doi":"10.1016/j.parco.2024.103114","DOIUrl":"10.1016/j.parco.2024.103114","url":null,"abstract":"<div><div>Pre-trained models (PTMs) have demonstrated great success in a variety of NLP and CV tasks and have become a significant development in the field of deep learning. However, the large memory and high computational requirements associated with PTMs can increase the cost and time of inference, limiting their service provisioning in practical applications. To improve the Quality of Service (QoS) of PTM applications by reducing waiting and response times, we propose the FastPTM framework. This general framework aims to accelerate PTM inference services in a multi-tenant environment by reducing model loading time and switching overhead on GPUs. The framework utilizes a fast weights loading method based on weights and model separation of PTMs to efficiently accelerate parallel inference services in resource-constrained environments. Furthermore, an online scheduling algorithm is designed to reduce the inference service time. The results of the experiments indicate that FastPTM can improve the throughput of inference services by an average of 4x and up to 8.2x, while reducing the number of switches by 4.7x and the number of overtimes by 15.3x.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"122 ","pages":"Article 103114"},"PeriodicalIF":2.0,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142532380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rahim Alizadeh , Shahriar Bijani , Fatemeh Shakeri
{"title":"Distributed consensus-based estimation of the leading eigenvalue of a non-negative irreducible matrix","authors":"Rahim Alizadeh , Shahriar Bijani , Fatemeh Shakeri","doi":"10.1016/j.parco.2024.103113","DOIUrl":"10.1016/j.parco.2024.103113","url":null,"abstract":"<div><div>This paper presents an algorithm to solve the problem of estimating the largest eigenvalue and its corresponding eigenvector for irreducible matrices in a distributed manner. The proposed algorithm utilizes a network of computational nodes that interact with each other, forming a strongly connected digraph where each node handles one row of the matrix, without the need for centralized storage or knowledge of the entire matrix. Each node possesses a solution space, and the intersection of all these solution spaces contains the leading eigenvector of the matrix. Initially, each node selects a random vector from its solution space, and then, while interacting with its neighbors, updates the vector at each step by solving a quadratically constrained linear program (QCLP). The updates are done so that the nodes reach a consensus on the leading eigenvector of the matrix. The numerical outcomes demonstrate the effectiveness of our proposed method.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"122 ","pages":"Article 103113"},"PeriodicalIF":2.0,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142424535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}