IMI-GPU: Inverted multi-index for billion-scale approximate nearest neighbor search with GPUs
Alan Araujo, Willian Barreiros Jr., Jun Kong, Renato Ferreira, George Teodoro
Journal of Parallel and Distributed Computing 200 (2025), Article 105066. doi:10.1016/j.jpdc.2025.105066. Published online 2025-03-04.

Abstract: Similarity search is used in specialized database systems designed to handle multimedia data, which are often represented by high-dimensional features. In this paper, we focus on speeding up the search process with GPUs. This problem has previously been approached by accelerating the Inverted File with Asymmetric Distance Computation algorithm on GPUs (IVFADC-GPU). However, the more recent CPU algorithm, the Inverted Multi-Index (IMI), had not been parallelized, as it was considered too challenging for efficient GPU deployment. We therefore propose IMI-GPU, a novel and efficient GPU version of IMI, based on a redesign of IMI's multi-sequence algorithm that enables efficient GPU execution. Comparing IMI-GPU with IVFADC-GPU on a billion-scale dataset, IMI-GPU achieved speedups of about 3.2× at Recall@1 and 1.9× at Recall@16. The algorithms were compared in a variety of scenarios, and IMI-GPU significantly outperforms IVFADC on GPUs in the majority of tested cases.

{"title":"Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)","authors":"","doi":"10.1016/S0743-7315(25)00027-9","DOIUrl":"10.1016/S0743-7315(25)00027-9","url":null,"abstract":"","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"199 ","pages":"Article 105060"},"PeriodicalIF":3.4,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143527269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Massive parallel simulation of gas turbine combustion using a fully implicit unstructured solver on the heterogeneous Sunway TaihuLight supercomputer
Fei Gao, Hu Ren, Zhuyin Ren, Ming Liu, Chengpeng Zhao, Guangwen Yang
Journal of Parallel and Distributed Computing 199 (2025), Article 105055. doi:10.1016/j.jpdc.2025.105055. Published online 2025-02-13.

Abstract: Massive parallel simulations of a full annular aeroengine combustor have been achieved on the on-chip heterogeneous Sunway TaihuLight supercomputer. A billion-cell unstructured mesh is generated through grid replication and rotation, together with an efficient geometric matching algorithm that addresses the conformal-interface issue. We developed graph-based and tree-based loop-fusion approaches for the implicit solution procedure of the momentum equation, and we find that strategic data reuse and the separation of vector computation significantly enhance performance on the many-core processor. For the linear system, a finer-grained parallelization based on sparse matrix-vector multiplication and vector computation is validated. Massive parallel tests using 16K processes with 1M cores successfully simulate the turbulent non-premixed combustion in an aeroengine combustor with nearly one billion cells. Compared to the pre-optimization version, the fully accelerated code achieves a 5.48× speedup in overall performance, with a parallel efficiency of up to 59%.

{"title":"Distributed landmark labeling for social networks","authors":"Arda Şener, Hüsnü Yenigün, Kamer Kaya","doi":"10.1016/j.jpdc.2025.105057","DOIUrl":"10.1016/j.jpdc.2025.105057","url":null,"abstract":"<div><div>Distance queries are a fundamental part of many network analysis applications. They can be used to infer the closeness of two users in social networks, the relation between two sites in a web graph, or the importance of the interaction between two proteins or molecules. Being able to answer these queries rapidly has many benefits in the area of network analysis. Pruned Landmark Labeling (<span>Pll</span>) is a technique used to generate an index for a given graph that allows the shortest path queries to be completed in a fraction of the time when compared to a standard breadth-first or a depth-first search-based algorithm. Parallel Shortest-distance Labeling (<span>Psl</span>) reorganizes the steps of <span>Pll</span> for the multithreaded setting and is designed particularly for social networks for which the index sizes can be much larger than what a single server can store. Even for a medium-size, 5 million vertex graph, the index size can be more than 40 GB. This paper proposes a hybrid, shared- and distributed-memory algorithm, DPSL, by partitioning the input graph via a vertex separator. The proposed method improves both the parallel execution time and the maximum memory consumption by distributing both the data and the work across multiple nodes of a cluster. For instance, on a graph with 5M vertices and 150M edges, using 4 nodes, DPSL reduces the execution time and maximum memory consumption by 2.13× and 1.87×, respectively, compared to our improved implementation of <span>Psl</span>.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"200 ","pages":"Article 105057"},"PeriodicalIF":3.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143427648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring data science workflows: A practice-oriented approach to teaching processing of massive datasets","authors":"Johannes Schoder , H. Martin Bücker","doi":"10.1016/j.jpdc.2025.105043","DOIUrl":"10.1016/j.jpdc.2025.105043","url":null,"abstract":"<div><div>Massive datasets are typically processed by a sequence of different stages, comprising data acquisition and preparation, data processing, data analysis, result validation, and visualization. In conjunction, these stages form a data science workflow, a key element enabling the solution of data-intensive problems. The complexity and heterogeneity of these stages require a diverse set of techniques and skills. This article discusses a hands-on practice-oriented approach aiming to enable and motivate graduate students to engage with realistic data science workflows. A major goal of the approach is to bridge the gap between academia and industry by integrating programming assignments that implement different data workflows with real-world data. In consecutive assignments, students are exposed to the methodology of solving problems using big data frameworks and are required to implement different data workflows of varying complexity. This practice-oriented approach is well received by students, as confirmed by different surveys.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"200 ","pages":"Article 105043"},"PeriodicalIF":3.4,"publicationDate":"2025-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143534360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DePoL: Assuring training integrity in collaborative learning via decentralized verification","authors":"Zhicheng Xu , Xiaoli Zhang , Xuanyu Yin , Hongbing Cheng","doi":"10.1016/j.jpdc.2025.105056","DOIUrl":"10.1016/j.jpdc.2025.105056","url":null,"abstract":"<div><div>Collaborative learning enables multiple participants to jointly train complex models but is vulnerable to attacks like model poisoning or backdoor attacks. Ensuring training integrity can prevent these threats by blocking any tampered contributions from affecting the model. However, traditional approaches often suffer from single points of bottleneck or failure in decentralized environments. To address these issues, we propose <span>DePoL</span>, a secure, scalable, and efficient decentralized verification framework based on duplicated execution. <span>DePoL</span> leverages blockchain to distribute the verification tasks across multiple participant-formed groups, eliminating single-point bottlenecks. Within each group, redundant verification and a majority-based arbitration prevent single points of failure. To further enhance security, <span>DePoL</span> introduces a <em>two-stage plagiarism-free commitment scheme</em> to prevent untrusted verifiers from exploiting public on-chain data. Additionally, a <em>hybrid verification method</em> employs fuzzy matching to handle unpredictable reproduction errors, while a “slow path” ensures zero false positives for honest trainers. Our theoretical analysis demonstrates <span>DePoL</span>'s security and termination properties. Extensive evaluations show that <span>DePoL</span> has overhead similar to common distributed machine learning algorithms, while outperforming centralized verification schemes in scalability, reducing training latency by up to 46%. Additionally, <span>DePoL</span> effectively handles reproduction errors with 0 false positives.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"199 ","pages":"Article 105056"},"PeriodicalIF":3.4,"publicationDate":"2025-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143420194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient GPU-accelerated parallel cross-correlation","authors":"Karel Maděra, Adam Šmelko, Martin Kruliš","doi":"10.1016/j.jpdc.2025.105054","DOIUrl":"10.1016/j.jpdc.2025.105054","url":null,"abstract":"<div><div>Cross-correlation is a data analysis method widely employed in various signal processing and similarity-search applications. Our objective is to design a highly optimized GPU-accelerated implementation that will speed up the applications and also improve energy efficiency since GPUs are more efficient than CPUs in data-parallel tasks. There are two rudimentary ways to compute cross-correlation — a definition-based algorithm that tries all possible overlaps and an algorithm based on the Fourier transform, which is much more complex but has better asymptotical time complexity. We have focused mainly on the definition-based approach which is better suited for smaller input data and we have implemented multiple CUDA-enabled algorithms with multiple optimization options. The algorithms were evaluated on various scenarios, including the most typical types of multi-signal correlations, and we provide empirically verified optimal solutions for each of the studied scenarios.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"199 ","pages":"Article 105054"},"PeriodicalIF":3.4,"publicationDate":"2025-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143420195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPU memory usage optimization for backward propagation in deep network training
Ding-Yong Hong, Tzu-Hsien Tsai, Ning Wang, Pangfeng Liu, Jan-Jan Wu
Journal of Parallel and Distributed Computing 199 (2025), Article 105053. doi:10.1016/j.jpdc.2025.105053. Published online 2025-02-11.

Abstract: In modern deep learning, the trend has been to design ever-larger deep neural networks (DNNs) for more complex tasks and better accuracy, and convolutional neural networks (CNNs) have become the standard method for most computer vision tasks. However, the memory allocated for intermediate data in convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to address this problem. Beyond hardware-dependent solutions, the general methodology of rematerialization reduces GPU memory usage by trading computation for memory: a subset of intermediate results from the forward phase is selected as checkpoints and kept in memory, and the backward phase recomputes the remaining intermediate data from the closest checkpoints as needed. This recomputation increases execution time but saves memory by not storing all intermediate results during the forward phase. In this paper, we focus on efficiently finding the checkpoint subset that minimizes peak memory usage during model training. We first describe the theoretical background of neural-network training in mathematical terms and use these equations to identify all essential data required during both the forward and backward phases to compute the gradients of the model weights. We then formulate the checkpoint-selection problem and propose a dynamic-programming algorithm with O(n³) time complexity to find the optimal checkpoint subset. Guided by extensive experiments and tracing, we refine the problem description and objective function from our theoretical analysis, and propose an O(n)-time algorithm for finding the optimal checkpoint subset.

An introductory-level undergraduate CS course that introduces parallel computing
Tia Newhall, Kevin C. Webb, Vasanta Chaganti, Andrew Danner
Journal of Parallel and Distributed Computing 199 (2025), Article 105044. doi:10.1016/j.jpdc.2025.105044. Published online 2025-02-04.

Abstract: We present the curricular design, pedagogy, and goals of an introductory-level course on computer systems that introduces parallel and distributed computing (PDC) to students who have only a CS1 background. With the ubiquity of multicore processors, cloud computing, and hardware accelerators, PDC topics have become fundamental knowledge areas in the undergraduate CS curriculum. As a result, it is increasingly important for students to learn a common core of introductory parallel and distributed computing topics and to develop parallel thinking skills early in their CS studies. Our introductory-level course focuses on three main curricular goals: 1) understanding how a computer runs a program, 2) evaluating the system costs associated with running a program, and 3) taking advantage of the power of parallel computing. We elaborate on the goals and details of our course's key modules, and we discuss our pedagogical approach, which includes active-learning techniques. We also include an evaluation of our course and a discussion of our experiences teaching it since Fall 2012. We find that the PDC foundation gained through early exposure in our course helps students gain confidence in their ability to expand and apply their understanding of PDC concepts throughout their CS education.
{"title":"Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)","authors":"","doi":"10.1016/S0743-7315(25)00014-0","DOIUrl":"10.1016/S0743-7315(25)00014-0","url":null,"abstract":"","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"198 ","pages":"Article 105047"},"PeriodicalIF":3.4,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143129227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}