{"title":"Optimizing parallel heterogeneous system efficiency: Dynamic task graph adaptation with recursive tasks","authors":"Nathalie Furmento, Abdou Guermouche, Gwenolé Lucas, Thomas Morin, Samuel Thibault, Pierre-André Wacrenier","doi":"10.1016/j.jpdc.2025.105157","DOIUrl":"10.1016/j.jpdc.2025.105157","url":null,"abstract":"<div><div>Task-based programming models are currently a prominent approach to leveraging heterogeneous parallel systems productively (OpenACC, Kokkos, Legion, OmpSs, <span>PaRSEC</span>, <span>StarPU</span>, XKaapi, ...). Among these models, the Sequential Task Flow (STF) model is widely embraced (<span>PaRSEC</span>'s DTD, OmpSs, <span>StarPU</span>), since it makes it possible to express task graphs naturally through a sequential-looking submission of tasks, with task dependencies inferred automatically. However, STF is limited to task graphs whose task sizes are fixed at submission, posing a challenge in determining the optimal task granularity. Notably, in heterogeneous systems, the optimal task size varies across processing units, so a single task size does not fit all units. <span>StarPU</span>'s recursive tasks allow graphs with several task granularities by turning some tasks into sub-graphs dynamically at runtime. The decision to transform these tasks into sub-graphs is made by a <span>StarPU</span> component called the Splitter. After some tasks have been transformed, classical scheduling approaches are used, making this component generic and orthogonal to the scheduler. In this paper, we propose a new policy for the Splitter, designed for heterogeneous platforms, that relies on linear programming to minimize execution time and maximize resource utilization. This results in a dynamic, well-balanced set comprising both small tasks to fill multiple CPU cores and large tasks for efficient execution on accelerators such as GPU devices. We then present an experimental evaluation showing that just-in-time adaptations of the task graph lead to improved performance across various dense linear algebra algorithms.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105157"},"PeriodicalIF":4.0,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144749390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"To repair or not to repair: Assessing fault resilience in MPI stencil applications","authors":"Roberto Rocco , Elisabetta Boella , Daniele Gregori , Gianluca Palermo","doi":"10.1016/j.jpdc.2025.105156","DOIUrl":"10.1016/j.jpdc.2025.105156","url":null,"abstract":"<div><div>With the increasing size of HPC computations, faults are becoming more and more relevant in the HPC field. The MPI standard does not define application behaviour after a fault, leaving the burden of fault management to the user, who usually resorts to checkpoint and restart mechanisms. This trend is especially pronounced in stencil applications, as their regular pattern simplifies the selection of checkpoint locations. However, checkpoint and restart mechanisms introduce non-negligible overhead, disk load, and scalability concerns. In this paper, we present an alternative based on fault resilience, enabled by the features of the User Level Fault Mitigation extension and shipped within the Legio fault resilience framework. Through fault resilience, execution continues with only the non-failed processes, thus sacrificing result accuracy for faster fault recovery. Our experiments on representative stencil applications show that, despite the fault's visible impact on the result, the execution produces meaningful values usable for scientific research, demonstrating the viability of a fault resilience approach in a stencil scenario.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105156"},"PeriodicalIF":4.0,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144720967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated multi-task learning with cross-device heterogeneous task subsets","authors":"Zewei Xin, Qinya Li, Chaoyue Niu, Fan Wu, Guihai Chen","doi":"10.1016/j.jpdc.2025.105155","DOIUrl":"10.1016/j.jpdc.2025.105155","url":null,"abstract":"<div><div>Traditional Federated Learning (FL) predominantly focuses on task-consistent scenarios, assuming clients possess identical tasks or task sets. However, in multi-task scenarios, client task sets can vary greatly due to differing operating environments, available resources, and hardware configurations. Conventional task-consistent FL cannot address such heterogeneity effectively. We define this statistical heterogeneity of task sets, where each client performs a unique subset of server tasks, as cross-device task heterogeneity. In this work, we propose a novel Federated Partial Multi-task (FedPMT) method, allowing clients with diverse task sets to collaborate and train comprehensive models suitable for any task subset. Specifically, clients deploy partial multi-task models tailored to their localized task sets, while the server utilizes single-task models as an intermediate stage to address the model heterogeneity arising from differing task sets. Collaborative training is facilitated through bidirectional transformations between the two. To alleviate the negative transfer caused by task set disparities, we introduce task attenuation factors to modulate the influence of different tasks. This adjustment enhances the performance and task generalization ability of the cloud models, encouraging them to converge towards a shared optimum across all task subsets. Extensive experiments conducted on the NYUD-v2, PASCAL Context, and Cityscapes datasets validate the effectiveness and superiority of FedPMT.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105155"},"PeriodicalIF":4.0,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144720966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)","authors":"","doi":"10.1016/S0743-7315(25)00116-9","DOIUrl":"10.1016/S0743-7315(25)00116-9","url":null,"abstract":"","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105149"},"PeriodicalIF":3.4,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144604858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DH_Aligner: A fast short-read aligner on multicore platforms with AVX vectorization","authors":"Qiao Sun , Feng Chen , Leisheng Li , Huiyuan Li","doi":"10.1016/j.jpdc.2025.105142","DOIUrl":"10.1016/j.jpdc.2025.105142","url":null,"abstract":"<div><div>The rapid development of NGS (Next-Generation Sequencing) technology has led to massive genome data being produced at a much higher throughput than before, creating great demand for fast and accurate downstream genetic analysis. As one of the first steps of the bioinformatics workflow, read alignment makes an educated guess as to where and how a read maps to a given reference sequence. In this paper, we propose DH_Aligner, a fast and accurate short-read aligner designed and optimized for x86 multi-core platforms with <span>avx2/avx512</span> SIMD instruction sets. It is based on a three-phase alignment workflow (seeding, filtering, extension) and provides an end-to-end solution for read alignment from <span>Fastq</span> to <span>SAM</span> files. Thanks to a fast seeding scheme and a seed filtering procedure, DH_Aligner avoids both a time-consuming seeding phase and the redundant workload of aligning reads at seemingly wrong locations. With the introduction of a batched-processing methodology, parallelism is easily exploited at the data, instruction, and thread levels. The performance-critical kernels in DH_Aligner are implemented with both <span>avx2</span> and <span>avx512</span> intrinsics for better performance and portability. On two typical x86-based platforms, Intel Xeon-6154 and Hygon C86-7285, DH_Aligner produces near-best accuracy/sensitivity while outperforming state-of-the-art parallel implementations, with average speedups of 7.8x, 3.4x, 2.8x-6.7x, and 1.5x over bwa-mem, bwa-mem2, bowtie2, and minimap2, respectively.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105142"},"PeriodicalIF":3.4,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144571436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integration framework for online thread throttling with thread and page mapping on NUMA systems","authors":"Janaina Schwarzrock , Hiago Mayk G. de A. Rocha , Arthur F. Lorenzon , Samuel Xavier de Souza , Antonio Carlos S. Beck","doi":"10.1016/j.jpdc.2025.105145","DOIUrl":"10.1016/j.jpdc.2025.105145","url":null,"abstract":"<div><div>Non-Uniform Memory Access (NUMA) systems are prevalent in HPC, where optimal thread-to-core allocation and page placement are crucial for enhancing performance and minimizing energy usage. Moreover, considering that NUMA systems provide hardware support for a large number of hardware threads and many parallel applications have limited scalability, artificially decreasing the number of threads through Dynamic Concurrency Throttling (DCT) may bring further improvements. However, the optimal configuration (thread mapping, page mapping, number of threads) for energy and performance, quantified by the Energy-Delay Product (EDP), varies with the system hardware, the application and its input set, and even during execution. Because of this dynamic nature, adaptability is essential, making offline strategies much less effective. Despite their effectiveness, online strategies introduce additional execution overhead, comprising run-time learning and the cost of transitions between configurations, with cache warm-ups and thread and data reallocation. Thus, balancing learning time against solution quality becomes increasingly significant. In this scenario, this work proposes a framework that integrates the search for such optimal configurations into a single, online, and efficient approach. Our experimental evaluation shows that our framework improves EDP and performance compared to state-of-the-art online techniques for thread/page mapping (by up to 69.3% and 43.4%) and DCT (by up to 93.2% and 74.9%), while being fully adaptive and requiring minimal user intervention.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105145"},"PeriodicalIF":3.4,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144571438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Complexity analysis and scalability of a matrix-free extrapolated geometric multigrid solver for curvilinear coordinates representations from fusion plasma applications","authors":"Philippe Leleux , Christina Schwarz , Martin J. Kühn , Carola Kruse , Ulrich Rüde","doi":"10.1016/j.jpdc.2025.105143","DOIUrl":"10.1016/j.jpdc.2025.105143","url":null,"abstract":"<div><div>Tokamak fusion reactors are promising alternatives for future energy production. Gyrokinetic simulations are important tools to understand the physical processes inside tokamaks and to improve the design of future plants. In gyrokinetic codes such as Gysela, these simulations involve at each time step the solution of a gyrokinetic Poisson equation defined on disk-like cross sections. The authors of <span><span>[14]</span></span>, <span><span>[15]</span></span> proposed to discretize a simplified differential equation using symmetric finite differences derived from the resulting energy functional, and to use an implicitly extrapolated geometric multigrid scheme tailored to problems in curvilinear coordinates. In this article, we extend the discretization to a more realistic partial differential equation and demonstrate the optimal linear complexity of the proposed solver in terms of computation and memory. We provide a general framework to analyze the floating point operations and memory usage of matrix-free approaches for stencil-based operators. Finally, we give an efficient matrix-free implementation of the considered solver exploiting task-based multithreaded parallelism that takes advantage of the disk-shaped geometry of the problem. We demonstrate the parallel efficiency of the solver on problems of up to 50 million unknowns.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105143"},"PeriodicalIF":3.4,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144571437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards efficient program execution on edge-cloud computing platforms","authors":"Jean-François Dollinger, Vincent Vauchey","doi":"10.1016/j.jpdc.2025.105135","DOIUrl":"10.1016/j.jpdc.2025.105135","url":null,"abstract":"<div><div>This paper investigates techniques dedicated to the performance of edge-cloud infrastructures and identifies the challenges to address in order to maximize their efficiency. Unlike traditional cloud-only processing, edge-cloud platforms meet the stringent requirements of real-time applications via additional computing resources close to the data source. Yet, due to numerous performance factors, performing efficient computations on such platforms is a complex task. Thus, we identify the main performance bottlenecks induced by traditional approaches and extensively discuss the performance characteristics of edge computing platforms. Based on these insights, we design an automated framework capable of achieving end-to-end efficiency for edge-cloud applications. We argue that achieving performance on edge-cloud infrastructures requires adaptive offloading of programs based on their computational requirements. Thus, we comprehensively study three performance-critical aspects forming the performance workflow of applications: i) performance modelling, ii) program optimization, and iii) task scheduling. First, we explore performance modelling techniques, which form the foundation of most cost models, for accurate prediction and robust code optimization and scheduling. We then cover the whole program optimization chain, from hotspot detection to code optimization, focusing on memory locality, code parallelization, and acceleration. Finally, we discuss task scheduling techniques for selecting the best computing resource and ensuring a balanced workload distribution. Overall, our study provides insights by covering the above performance workflow with reference to prominent state-of-the-art works, particularly those not yet applied in the context of edge-cloud computing. Additionally, we conducted experiments to further validate our findings. Finally, for each topic of interest, we identify the scientific obstacles already addressed and outline the open research challenges yet to be overcome.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105135"},"PeriodicalIF":3.4,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144581112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MM-AutoSolver: A multimodal machine learning method for the auto-selection of iterative solvers and preconditioners","authors":"Hantao Xiong , Wangdong Yang , Weiqing He , Shengle Lin , Keqin Li , Kenli Li","doi":"10.1016/j.jpdc.2025.105144","DOIUrl":"10.1016/j.jpdc.2025.105144","url":null,"abstract":"<div><div>The solution of large-scale sparse linear systems of the form <span><math><mi>A</mi><mi>x</mi><mo>=</mo><mi>b</mi></math></span> is an important research problem in the field of High-Performance Computing (HPC). With the increasing scale of these systems and the development of both HPC software and hardware, iterative solvers combined with appropriate preconditioners have become mainstream methods for efficiently solving the sparse linear systems that arise from real-world HPC applications. Among the abundant combinations of iterative solvers and preconditioners, automatically selecting the optimal one has become a vital problem for accelerating the solution of these systems. Previous work has applied machine learning or deep learning algorithms to this problem, but fails to extract and exploit sufficient features from sparse linear systems and is thus unable to obtain satisfactory results. In this work, we propose to address the automatic selection of the optimal combination of iterative solvers and preconditioners through the powerful multimodal machine learning framework, in which features of different modalities can be fully extracted and utilized to improve the results. Based on this framework, we put forward a multimodal machine learning model called MM-AutoSolver for the auto-selection of the optimal combination for a given sparse linear system. The experimental results, based on a new large-scale matrix collection, show that the proposed MM-AutoSolver outperforms state-of-the-art methods in predictive performance and can significantly accelerate the solution of large-scale sparse linear systems in HPC applications.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105144"},"PeriodicalIF":3.4,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144536100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel watershed partitioning: GPU-based hierarchical image segmentation","authors":"Varduhi Yeghiazaryan , Yeva Gabrielyan , Irina Voiculescu","doi":"10.1016/j.jpdc.2025.105140","DOIUrl":"10.1016/j.jpdc.2025.105140","url":null,"abstract":"<div><div>Many image processing applications rely on partitioning an image into disjoint regions whose pixels are ‘similar.’ The watershed and waterfall transforms are established mathematical morphology techniques for pixel clustering. Both are relevant to modern applications where groups of pixels are to be decided upon in one go, or where adjacency information matters. We introduce three new parallel partitioning algorithms for GPUs. By repeatedly applying watershed algorithms, we produce waterfall results that form a hierarchy of partition regions over an input image. Our watershed algorithms attain competitive execution times in both 2D and 3D, processing an 800-megavoxel image in less than 1.4 seconds. We also show how to use this fully deterministic image partitioning as a pre-processing step for machine-learning-based semantic segmentation. This replaces the role of superpixel algorithms and results in comparable accuracy and faster training times. The code is publicly available at <span><span>https://github.com/hamemm/PRUF-watershed.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105140"},"PeriodicalIF":3.4,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144656655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}