{"title":"AFS-GNN: Adaptive and fast scheduling system for distributed GNN training","authors":"Yuting Gao, Yongqiang Gao, Yongmei Liu","doi":"10.1016/j.jpdc.2026.105225","DOIUrl":"10.1016/j.jpdc.2026.105225","url":null,"abstract":"<div><div>Graph Neural Networks (GNNs) have become core models for learning from relational data in domains such as transportation, social networks, and recommender systems. However, distributed GNN training on large graphs suffers from severe GPU workload imbalance and high communication cost caused by dynamic mini-batch sampling and large structural differences among nodes. To address these challenges, we propose AFS-GNN, a scheduling-aware adaptive framework that achieves fine-grained workload balancing in distributed GNN training. AFS-GNN continuously monitors per-GPU mini-batch execution time through lightweight runtime agents and employs Kalman filtering to suppress transient fluctuations and detect persistent imbalance trends. Upon imbalance detection, it constructs a Hierarchical Dependency Graph (HDG) that explicitly captures multi-hop aggregation dependencies and node-level computational costs. Guided by a heuristic load estimator, AFS-GNN applies cost-aware spectral bipartitioning via the Fiedler vector to select structurally coherent migration blocks that minimize inter-GPU communication while maintaining computational consistency. Selected blocks are migrated asynchronously across devices using intra-node or inter-process communication, ensuring non-blocking execution. 
Extensive experiments on large-scale benchmarks, <em>ogbn-products</em> and <em>ogbn-papers100M</em>, demonstrate that AFS-GNN achieves up to 21.7% acceleration over Euler, 15% over DistDGL, and 13.7% over FlexGraph, while maintaining stable convergence and scalability across diverse batch sizes and partition configurations.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"211 ","pages":"Article 105225"},"PeriodicalIF":4.0,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145957647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OptimES: Optimizing federated learning using remote embeddings for graph neural networks","authors":"Pranjal Naman, Yogesh Simmhan","doi":"10.1016/j.jpdc.2026.105227","DOIUrl":"10.1016/j.jpdc.2026.105227","url":null,"abstract":"<div><div>Graph Neural Networks (GNNs) have experienced rapid advancements in recent years due to their ability to learn meaningful representations from graph data structures. However, in most real-world settings, such as financial transaction networks and healthcare networks, this data is localized to different data owners and cannot be aggregated due to privacy concerns. Federated Learning (FL) has emerged as a viable machine learning approach for training a shared model that iteratively aggregates local models trained on decentralized data. This addresses privacy concerns while leveraging parallelism. State-of-the-art methods enhance the privacy-respecting convergence accuracy of federated GNN training by sharing remote embeddings of boundary vertices through a server (EmbC). However, their performance is limited by large communication costs. In this article, we propose OptimES, an optimized federated GNN training framework that employs remote neighbourhood pruning, overlapping the push of embeddings to the server with local training, and dynamic pulling of embeddings to reduce network costs and training time. We perform a rigorous evaluation of these strategies for four common graph datasets with up to 111<em>M</em> vertices and 1.6<em>B</em> edges. We see that a modest drop in per-round accuracy due to the preemptive push of embeddings is outstripped by the reduction in per-round training time for large and dense graphs like Reddit and Products, converging up to ≈ 3.5 × faster than EmbC and giving up to ≈ 16% better accuracy than the default federated GNN learning. 
While accuracy improvements over default federated GNNs are modest for sparser graphs like Arxiv and Papers, they achieve the target accuracy ≈ 11 × faster than EmbC.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"211 ","pages":"Article 105227"},"PeriodicalIF":4.0,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On complexity of substructure connectivity and restricted connectivity of graphs","authors":"Huazhong Lü , Tingzeng Wu","doi":"10.1016/j.jpdc.2026.105237","DOIUrl":"10.1016/j.jpdc.2026.105237","url":null,"abstract":"<div><div>The connectivity of a graph is an important parameter to evaluate its reliability. <em>k</em>-restricted connectivity (resp. <em>R<sup>h</sup></em>-restricted connectivity) of a graph <em>G</em> is the minimum cardinality of a set <em>S</em> of vertices in <em>G</em>, if one exists, whose deletion disconnects <em>G</em> and leaves each component of <span><math><mrow><mi>G</mi><mo>−</mo><mi>S</mi></mrow></math></span> with more than <em>k</em> vertices (resp. <span><math><mrow><mi>δ</mi><mo>(</mo><mi>G</mi><mo>−</mo><mi>S</mi><mo>)</mo><mo>≥</mo><mi>h</mi></mrow></math></span>). In contrast, structure (substructure) connectivity of <em>G</em> is defined as the minimum number of vertex-disjoint subgraphs whose deletion disconnects <em>G</em>. As generalizations of the concept of connectivity, structure (substructure) connectivity, restricted connectivity and <em>R<sup>h</sup></em>-restricted connectivity have been extensively studied from the combinatorial point of view. Very little is known about the computational complexity of these variants, except for the recently established NP-completeness of <em>k</em>-restricted edge-connectivity. 
In this paper, we prove that the problems of determining structure, substructure, restricted, and <em>R<sup>h</sup></em>-restricted connectivity are all NP-complete.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"211 ","pages":"Article 105237"},"PeriodicalIF":4.0,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146190036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HyBMSearch: A fast multi-level search algorithm delivering order-of-magnitude speedups on multi-billion datasets","authors":"Shashank Raj , Kalyanmoy Deb","doi":"10.1016/j.jpdc.2026.105226","DOIUrl":"10.1016/j.jpdc.2026.105226","url":null,"abstract":"<div><div>We present HyBMSearch (Hybrid Bayesian Multi-Level Search), a Python-based algorithm that redefines how we handle extremely large, sorted datasets. By combining classic methods, binary and interpolation search, with a multi-level chunking approach, this technique achieves significant speedups on arrays ranging from 100 million to 10 billion elements (the largest size tested). At the core of our approach is the integration of a hybrid and custom genetic algorithm with Bayesian optimization, enabling automatic parameter tuning. This eliminates the guesswork of manual tuning while maintaining solid performance across a variety of scenarios. Despite the fact that NumPy’s <span>searchsorted</span> is highly optimized C code, HyBMSearch (written in Python) still delivers dramatic speed gains in multi-threaded tests. It processes 10 million lookups on a 100-million-element dataset in just 0.0244 seconds (compared to 23.67 seconds needed for <span>searchsorted</span>), handles 100 million lookups on a 1-billion-element array in 0.393 seconds (versus 184.89 seconds by <span>NumPy’s searchsorted</span>), performs 500 million lookups on 5 billion elements in 59.00 seconds (rather than 979.73 seconds by <span>NumPy’s searchsorted</span>), and resolves 1 billion lookups on 10 billion elements in 119.68 seconds (instead of 2071.84 seconds by <span>NumPy’s searchsorted</span>). 
These results set a new milestone for high-performance search methods in parallel and distributed settings, demonstrating the ability of the proposed approach to accelerate large-scale search.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"211 ","pages":"Article 105226"},"PeriodicalIF":4.0,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145957646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VLSI design and its hardware implementation for optimal image dehazing with adaptive bilateral filtering","authors":"A. Arul Edwin Raj , Nabihah Binti Ahmad , Jeffin Gracewell , Renugadevi R , C.T. Kalaivani","doi":"10.1016/j.jpdc.2025.105186","DOIUrl":"10.1016/j.jpdc.2025.105186","url":null,"abstract":"<div><div>Fog and smog significantly hinder image processing by reducing visual output quality and disrupting the functionality of systems reliant on visual data. Existing dehazing methods face several challenges, including computational complexity, sensitivity to parameter settings and limited optimization for diverse conditions. To overcome these limitations, this paper introduces Selective Bilateral Filtering and Color Attenuation Analysis (SBBFC), a new methodology for real-time image dehazing. SBBFC avoids the problems of prior methods by dynamically controlling window sizes and using color attenuation analysis, sustaining reliable performance as haze levels change and guaranteeing accurate color rendition in the dehazed image. The hardware-optimized design targets FPGA or ASIC technologies, delivering high throughput, real-time response, better image quality and considerably better detail reproduction. In ASIC implementation, the proposed architecture provides 350 MPixels/s at the cost of 15k gates and 5 mW of power consumption, with an area efficiency of 0.8 mm²/k. Targeting an FPGA, it offers 100 MPixels/s at a clock frequency of 100 MHz. 
These specifications show that the proposed architecture is well suited to delivering real-time dehazing with high throughput and low power.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"210 ","pages":"Article 105186"},"PeriodicalIF":4.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mobility-aware server placement and power allocation for randomly walking mobile users","authors":"Keqin Li","doi":"10.1016/j.jpdc.2025.105216","DOIUrl":"10.1016/j.jpdc.2025.105216","url":null,"abstract":"<div><div>We systematically, quantitatively, and mathematically address the problems of optimal mobility-aware server placement and optimal mobility-aware power allocation in mobile edge computing environments with randomly walking mobile users. The new contributions of the paper are highlighted below. We establish a single-server M/G/1 queueing system for mobile user equipment and a multiserver M/G/k queueing system for mobile edge clouds. We consider both the synchronous mobility model and the asynchronous mobility model, which are described by discrete-time Markov chains and continuous-time Markov chains, respectively. We discuss two task offloading strategies for user equipment in the same service area, i.e., the equal-response-time method and the equal-load-fraction method. We formally and rigorously define the optimal mobility-aware server placement problem and the optimal mobility-aware power allocation problem. We develop optimization algorithms to solve the optimal mobility-aware server placement problem and the optimal mobility-aware power allocation problem. We demonstrate numerical data for optimal mobility-aware server placement and optimal mobility-aware power allocation with two mobility models, two task offloading strategies, and two power consumption models. 
The significance of the paper lies in the fact that such an analytical and algorithmic treatment of optimal mobility-aware server placement and power allocation for mobile edge computing environments with randomly walking mobile users is rare in the existing literature.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"210 ","pages":"Article 105216"},"PeriodicalIF":4.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145940057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed quadratic interpolation estimation for large-scale quantile regression","authors":"Ziqian Qin , Yue Chao , Xuejun Ma","doi":"10.1016/j.jpdc.2025.105214","DOIUrl":"10.1016/j.jpdc.2025.105214","url":null,"abstract":"<div><div>A number of statistical learning approaches for large-scale quantile regression (QR) have been rapidly developed to address the optimization issues arising from massive data computations. However, the principal idea behind most distributed QR estimation procedures for solving the nondifferentiable quantile loss problem is to approximate the check function using kernel-based smoothing approaches with bandwidth. In this article, we develop a new communication-efficient distributed QR estimation procedure called Distributed Quadratic Interpolation estimation strategy for QR (DQIQR) to tackle the issue posed by the limited memory constraint on a single computer machine. Specifically, we implement a quadratic function in a small neighborhood around the origin, which transforms the nondifferentiable check function into a convex and smooth quadratic loss function without using kernel-based methods. The minimizer, named the DQIQR estimator, is obtained through an approximate multi-round reweighted least squares aggregation procedure under the divide-and-conquer (DC) framework. Theoretically, we establish the asymptotic normality for the DQIQR estimator and show that our estimator achieves the same efficiency as the QR estimator computed on the entire data. Furthermore, a regularized version of DQIQR (DRQIQR) for distributed variable selection is also investigated. 
Finally, synthetic and real datasets are used to evaluate the effectiveness of the proposed approaches.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"210 ","pages":"Article 105214"},"PeriodicalIF":4.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145842521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)","authors":"","doi":"10.1016/S0743-7315(26)00008-0","DOIUrl":"10.1016/S0743-7315(26)00008-0","url":null,"abstract":"","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"210 ","pages":"Article 105230"},"PeriodicalIF":4.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146189152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal schedule for periodic jobs with discretely controllable processing times on two machines","authors":"Zizhao Wang , Wei Bao , Ruoyu Wu , Dong Yuan , Albert Y. Zomaya","doi":"10.1016/j.jpdc.2025.105204","DOIUrl":"10.1016/j.jpdc.2025.105204","url":null,"abstract":"<div><div>In many real-world situations, the processing time of computational jobs can be shortened by lowering the processing quality. This is referred to as discretely controllable processing time, where the original processing time can be shortened to a number of levels with lower processing qualities. In this paper, we study the scheduling problem of periodic jobs with discretely controllable processing times on two machines. The problem is NP-hard, as directly solving it through dynamic programming leads to exponential computational complexity. This is because we need to memorise a set of processed jobs to avoid reprocessing. In order to address this issue, we prove the Ordered Scheduling Structure (OSS) Property and the Consecutive Decision Making (CDM) Property. The OSS Property allows us to search for an optimal solution in which jobs on the same machine are started in order. The CDM Property allows us to memorise only two jobs to completely avoid the job reprocessing. 
These two properties greatly reduce the search space, and the resulting dynamic programming solution finds an optimal schedule with pseudo-polynomial computational complexity.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"210 ","pages":"Article 105204"},"PeriodicalIF":4.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145842467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimistic execution in byzantine broadcast protocols that tolerate malicious majority","authors":"Ruomu Hou, Haifeng Yu","doi":"10.1016/j.jpdc.2025.105203","DOIUrl":"10.1016/j.jpdc.2025.105203","url":null,"abstract":"<div><div>We consider the classic byzantine broadcast problem in distributed computing, in the context of a system with <em>n</em> nodes and at most <span><math><msub><mi>f</mi><mrow><mi>m</mi><mi>a</mi><mi>x</mi></mrow></msub></math></span> byzantine failures, under the standard synchronous timing model. Let <em>f</em> be the actual number of byzantine failures in a given execution, where <span><math><mrow><mi>f</mi><mo>≤</mo><msub><mi>f</mi><mrow><mi>m</mi><mi>a</mi><mi>x</mi></mrow></msub></mrow></math></span>. Our goal in this work is to optimize the performance of byzantine broadcast protocols in the common case where <em>f</em> is relatively small. To this end, we propose a novel framework, called <span>FlintBB</span>, for adding an <em>optimistic track</em> into existing byzantine broadcast protocols. Using this framework, we show that we can achieve an <em>exponential improvement</em> in several existing byzantine broadcast protocols when <em>f</em> is relatively small. At the same time, our approach does not sacrifice performance when <em>f</em> is not small.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"209 ","pages":"Article 105203"},"PeriodicalIF":4.0,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}