Parallel Computing: latest articles, page 8

Uphill resampling for particle filter and its implementation on graphics processing unit

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2022.102994

Özcan Dülger, Halit Oğuztüzün, Mübeccel Demirekler

Abstract: We introduce a new resampling method, named Uphill, that is free from numerical instability and suitable for parallel implementation on a graphics processing unit (GPU). Common resampling algorithms such as Systematic suffer from numerical instability when single precision floating point numbers are used. This is due to cumulative summation over the weights of particles when the weights differ widely or the number of particles is large. The Metropolis and Rejection resampling algorithms do not suffer from numerical instability, as they only calculate the ratios of weights pairwise rather than perform collective operations over the weights. They are more suitable for the GPU implementation of the particle filter. However, they undergo non-coalesced global memory access patterns, which cause their speed to deteriorate rapidly as the number of particles gets large. Uphill also does not suffer from numerical instability but experiences the same non-coalesced global memory access problem as Metropolis and Rejection. We introduce its faster version, named Uphill-Fast, which eliminates this problem. We compare Uphill and Uphill-Fast with the Systematic, Metropolis, Metropolis-C2 and Rejection resampling methods with respect to quality and speed. We also compare them on a highly non-linear system. Uphill-Fast runs faster and attains similar quality, in terms of RMSE, in comparison with Metropolis and Rejection when the number of particles is very large. Uphill-Fast runs at roughly the same speed as Metropolis-C2, with better variance and MSE, when the number of particles is very large.

Volume 115, Article 102994. Publication type: Journal Article.
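The abstract's point that Metropolis resampling needs only pairwise weight ratios, and no cumulative sum, can be illustrated with a small sketch. This is a plain sequential Python version of the textbook Metropolis resampler, not the Uphill or Uphill-Fast method from the paper; the function name `metropolis_resample` and the parameters `num_steps` and `seed` are assumptions made for the example.

```python
import random

def metropolis_resample(weights, num_steps=32, seed=0):
    """Return one ancestor index per particle using only pairwise weight comparisons."""
    rng = random.Random(seed)
    n = len(weights)
    ancestors = []
    for i in range(n):                  # on a GPU, each i would map to one thread
        k = i                           # the chain starts at the particle's own index
        for _ in range(num_steps):
            j = rng.randrange(n)        # propose a uniformly random particle
            u = rng.random()
            # Accept with probability min(1, w[j] / w[k]).
            # Written as a product so that zero weights never cause a division.
            if u * weights[k] < weights[j]:
                k = j
        ancestors.append(k)
    return ancestors

# Unnormalized weights are fine because only ratios matter.
print(metropolis_resample([0.1, 0.2, 0.3, 0.4], seed=7))
```

Because no sum over all weights is formed, the rounding problem described for Systematic resampling does not arise here. The random index `j` is also what produces the scattered (non-coalesced) memory reads mentioned in the abstract. The number of steps is a tuning choice: fewer steps run faster but leave more bias in the result.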

Citations: 0

ParVoro++: A scalable parallel algorithm for constructing 3D Voronoi tessellations based on kd-tree decomposition

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2023.102995

Guoqing Wu, Hongyun Tian, Guo Lu, Wei Wang

Abstract: The Voronoi tessellation is a fundamental geometric data structure which has numerous applications in various scientific and technological fields. For large particle datasets, computing Voronoi tessellations must be conducted in parallel on a distributed-memory supercomputer in order to satisfy time and memory-size constraints. However, load balance and communication make the parallelization of the Voronoi tessellation a challenge. In this paper, we present a scalable parallel algorithm for constructing 3D Voronoi tessellations, which evenly distributes the input particles between blocks through kd-tree decomposition. In order to construct the correct global Voronoi topology, we investigate both parametric and non-parametric methods for particle communication among the blocks of a spatial decomposition. The algorithm is implemented exploiting process-level and thread-level parallelization and can be used in a diverse architectural landscape. Using datasets containing up to 330 million particles, we show that our algorithm achieves parallel efficiency up to 57% using 4096 cores on a distributed-memory computer. Moreover, we compare our algorithm with previous attempts to parallelize Voronoi tessellations, showing encouraging improvements in terms of computation time.

Volume 115, Article 102995. Publication type: Journal Article.
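The load-balancing idea behind a kd-tree decomposition can be shown with a short sketch: cut the point set at the median along one axis, then recurse on each half with the next axis. This is a serial illustration only, not the ParVoro++ implementation; the function name `kd_decompose` and the `depth` parameter are assumptions made for the example.

```python
import random

def kd_decompose(points, depth):
    """Split 3D points into 2**depth blocks using recursive median cuts."""
    def split(pts, level):
        if level == depth:
            return [pts]
        axis = level % 3                          # cycle through x, y, z
        pts = sorted(pts, key=lambda p: p[axis])
        mid = len(pts) // 2                       # median cut keeps the halves balanced
        return split(pts[:mid], level + 1) + split(pts[mid:], level + 1)
    return split(list(points), 0)

rng = random.Random(1)
# A deliberately clustered cloud: a uniform grid of blocks would be unbalanced here.
cloud = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
blocks = kd_decompose(cloud, depth=3)
print([len(b) for b in blocks])
```

Each cut puts half of the remaining points on either side, so block sizes differ by at most one point regardless of how the particles are clustered in space. A real distributed implementation would additionally have to exchange particles near block boundaries, which is the communication problem the abstract refers to.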

Citations: 0

Accelerating the scheduling of the network resources of the next-generation optical data centers

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2022.102993

G. Patronas, N. Vlassopoulos, Ph. Bellos, D. Reisis

Abstract: Data centers (DCs) play a key role in the evolving IT applications and they rely heavily on optical interconnects to improve their performance and scalability. Optically switched DCs most often exploit the slotted Time Division Multiplexing Access (TDMA) operation and the Wavelength Division Multiplexing (WDM) technology, and rely on the effective scheduling of the TDMA frames to decide in real time the end-to-end connections that include the network links, switches and ports. This task becomes computationally intensive as the communication requests increase.

The current paper builds on a greedy scheduling algorithm to introduce a parallel technique that accelerates the scheduling process and improves the optical DC's performance. The proposed technique handles the scheduler's data structures efficiently, minimizes the communication among the scheduler's processors, and is scalable. Moreover, this work presents the technique's performance results for a variety of scheduling scenarios and DC sizes executed on an algorithm-specific Single Instruction Multiple Data (SIMD) accelerator architecture and on a Graphics Processing Unit (GPU). The performance of the GPU and of the SIMD accelerator implemented on FPGA validates the parallel scheduler technique.

Volume 115, Article 102993. Publication type: Journal Article.

Citations: 1

Multi-level parallel multi-layer block reproducible summation algorithm

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2023.102996

Kuan Li, Kang He, Stef Graillat, Hao Jiang, Tongxiang Gu, Jie Liu

Abstract: Reproducibility means getting bitwise identical floating point results from multiple runs of the same program, which plays an essential role in debugging and correctness checking in many codes (Villa et al., 2009). However, in parallel computing environments, the combination of dynamic scheduling of parallel computing resources and floating point nonassociativity leads to non-reproducible results. Demmel and Nguyen proposed a floating-point summation algorithm that is reproducible independent of the order of summation (Demmel and Nguyen, 2013; 2015) and accurate by using the 1-Reduction technique. Our work combines their work with the multi-layer block technology proposed by Castaldo et al. (2009), designs the multi-level parallel multi-layer block reproducible summation algorithm (MLP_rsum), including SIMD, OpenMP, and MPI based on each layer of blocks, and then attains reproducible and expected accurate results with high performance. Numerical experiments show that our algorithm is more efficient than the reproducible summation function in ReproBLAS (2018). With SIMD optimization, our algorithm is 2.41, 2.85, and 3.44 times faster than ReproBLAS on the three ARM platforms. With OpenMP optimization, our algorithm obtains linear speedup, showing that our method applies to multi-core processors. Finally, with reproducible MPI reduction, our algorithm's parallel efficiency is 76% at 32 nodes with 4 threads and 32 processes.

Volume 115, Article 102996. Publication type: Journal Article.
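The non-reproducibility described in the abstract comes from the fact that floating point addition is not associative, so a different summation order can give a different result. The sketch below shows this with four double precision values. It then uses Python's `math.fsum`, which returns the correctly rounded sum and is therefore order independent, as a stand-in for a reproducible sum. Note that `math.fsum` is a different technique from the 1-Reduction and MLP_rsum algorithms named in the abstract; it is used here only to show what an order-independent result looks like.

```python
import math

def naive_sum(values):
    """Left-to-right floating point accumulation (order dependent)."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [1e16, 1.0, -1e16, 1.0]
reordered = [1.0, 1.0, 1e16, -1e16]       # same numbers, different order

print(naive_sum(values))       # 1.0: the first 1.0 is absorbed by 1e16
print(naive_sum(reordered))    # 2.0
print(math.fsum(values))       # 2.0
print(math.fsum(reordered))    # 2.0
```

In a parallel reduction the order of partial sums depends on scheduling, so the first two lines correspond to two runs of the same program disagreeing bit for bit. An explicit loop is used for the naive sum because the built-in `sum` in recent Python versions applies compensated summation to floats and would hide the effect.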

Citations: 1

Spatial-aware data partition for distributed memory parallelization of ANN search in multimedia retrieval

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2022.102992

Guilherme Andrade, Renato Ferreira, George Teodoro

Abstract: Content-based multimedia retrieval (CBMR) applications are becoming very popular in several online services that handle large volumes of data and are subjected to high query rates. While these applications may be complex, finding the nearest neighboring objects (multimedia descriptors) is typically their most time consuming operation. In order to address this problem, several recent works have proposed distributed memory parallelization of approximate nearest neighbors (ANN) search. These solutions employ a variety of ANN algorithms and different parallelization strategies. In this paper, we have identified the currently used parallelization strategies (Data Equal Split (DES) and Bucket Equal Split (BES)) and systematically evaluated their performance. We have also developed a framework to simplify the deployment of ANN algorithms in distributed memory machines with customized parallelization or data partition strategies. We further proposed a novel class of data partition/parallelization strategies that takes into account the data spatial proximity. Our approaches (SABES and SABES++) improve data locality and system efficiency as compared to DES and BES. For instance, SABES++ achieved speedups of 4.2× and 1.8× on top of DES and BES, respectively, in the baseline case (40 nodes). Further, SABES and SABES++ also attained higher multi-node scalability, and the gains vs DES and BES increase with a larger number of nodes. SABES++ is 14.5× faster than DES when 160 nodes are used.

Volume 115, Article 102992. Publication type: Journal Article.

Citations: 3

Efficient parallel reduction of bandwidth for symmetric matrices

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2023.102998

Valeriy Manin, Bruno Lang

Citations: 0

Reviewer acknowledgment

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2023-02-01 DOI: 10.1016/S0167-8191(23)00010-8

Citations: 0

Heterogeneous sparse matrix–vector multiplication via compressed sparse row format

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2023.102997

Phillip Allen Lane, Joshua Dennis Booth

Abstract: Sparse matrix–vector multiplication (SpMV) is one of the most important kernels in high-performance computing (HPC), yet SpMV normally suffers from poor performance on many devices. Because of this, SpMV normally requires special care in storage and tuning for a given device. Moreover, HPC is facing heterogeneous hardware containing multiple different compute units, e.g., many-core CPUs and GPUs. Therefore, an emerging goal has been to produce heterogeneous formats and methods that allow critical kernels, e.g., SpMV, to be executed on different devices with portable performance and minimal changes to format and method. This paper presents a heterogeneous format based on CSR, named CSR-k, that can be tuned quickly and outperforms the average performance of Intel MKL on Intel Xeon Platinum 838 and AMD Epyc 7742 CPUs while still outperforming NVIDIA's cuSPARSE and Sandia National Laboratories' KokkosKernels on NVIDIA A100 and V100 for regular sparse matrices, i.e., sparse matrices where the number of nonzeros per row has a variance ≤10, such as those commonly generated from two and three-dimensional finite difference and element problems. In particular, CSR-k achieves this with reordering and by grouping rows into a hierarchical structure of super-rows and super–super-rows that are represented by just a few extra arrays of pointers. Due to its simplicity, a model can be tuned for a device, and this model can be used to select super-row and super–super-row sizes in constant time.

Volume 115, Article 102997. Publication type: Journal Article.
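To make the grouping idea concrete, the sketch below shows a standard CSR matrix–vector product and a variant with one extra pointer array that groups consecutive rows into super-rows. It illustrates only a single level of grouping on top of plain CSR. The array name `sr_ptr` and both function names are assumptions made for the example, and the actual CSR-k layout, its reordering step, and its device tuning are not reproduced here.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Standard CSR product y = A x."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[p] * x[col_idx[p]]
    return y

def spmv_super_rows(sr_ptr, row_ptr, col_idx, vals, x):
    """Same product, iterating over groups of rows (hypothetical super-rows)."""
    y = [0.0] * (len(row_ptr) - 1)
    for s in range(len(sr_ptr) - 1):              # one work unit per super-row
        for i in range(sr_ptr[s], sr_ptr[s + 1]):
            acc = 0.0
            for p in range(row_ptr[i], row_ptr[i + 1]):
                acc += vals[p] * x[col_idx[p]]
            y[i] = acc
    return y

# 4x4 tridiagonal matrix with 4 on the diagonal and 1 off the diagonal.
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0]
sr_ptr = [0, 2, 4]                                # rows 0-1 and rows 2-3

print(spmv_csr(row_ptr, col_idx, vals, x))                  # [6.0, 12.0, 18.0, 19.0]
print(spmv_super_rows(sr_ptr, row_ptr, col_idx, vals, x))   # [6.0, 12.0, 18.0, 19.0]
```

The underlying CSR arrays are unchanged; only the extra pointer array is added, which is what lets the outer loop be assigned to coarser units such as a thread block on a GPU or a core on a CPU. Choosing the group size then becomes the tuning knob the abstract describes.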

Citations: 0

Efficient parallel reduction of bandwidth for symmetric matrices

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2023-01-01 DOI: 10.2139/ssrn.4050432

Valeriy Manin, B. Lang

Citations: 0

Efficient parallel branch-and-bound approaches for exact graph edit distance problem

IF 1.4, Zone 4 (Computer Science)

Parallel Computing Pub Date : 2022-12-01 DOI: 10.1016/j.parco.2022.102984

Adel Dabah, Ibrahim Chegrane, Saïd Yahiaoui, Ahcene Bendjoudi, Nadia Nouali-Taboudjemat

Abstract: Graph Edit Distance (GED) is a well-known measure used in graph matching to quantify the similarity/dissimilarity between two graphs by computing the minimum cost of edit operations needed to transform one graph into another. This process, which appears to be simple, is known to be NP-hard and time consuming, since the search space grows exponentially. One way to optimally solve this problem is by using Branch and Bound (B&B) algorithms, which reduce the computation time required to explore the whole search space by performing an implicit enumeration of the search space, based on a pruning technique, instead of an exhaustive one. Nevertheless, they remain inefficient when dealing with large problem instances due to the impractical running time needed to explore the whole search space. To overcome this issue, we propose in this paper three parallel B&B approaches based on shared memory to exploit multi-core CPU processors. First, a work-stealing approach, where several instances of the B&B algorithm explore a single search tree concurrently, achieving speedups of up to 24× over the sequential version. Second, a tree-based approach, where multiple parts of the search tree are explored simultaneously by independent B&B instances, achieving speedups of up to 28×. Finally, due to the irregular nature of the GED problem, two load-balancing strategies are proposed to ensure a fair workload between parallel processes, achieving speedups of up to 300×. All experiments have been carried out on well-known datasets.

Volume 114, Article 102984. Publication type: Journal Article.

Citations: 2