ACM Transactions on Parallel Computing: Latest Articles

fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-04-11 DOI: 10.1145/3512770
Yuedan Chen, Guoqing Xiao, Kenli Li, F. Piccialli, Albert Y. Zomaya
{"title":"fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms","authors":"Yuedan Chen, Guoqing Xiao, Kenli Li, F. Piccialli, Albert Y. Zomaya","doi":"10.1145/3512770","DOIUrl":"https://doi.org/10.1145/3512770","url":null,"abstract":"Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental and important operations in many high-performance scientific and engineering applications. The inherent irregularity and poor data locality lead to two main challenges to scaling SpMSpV over high-performance computing (HPC) systems: (i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the irregular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained parallel SpMSpV (fgSpMSpV) framework on the Sunway TaihuLight supercomputer to alleviate these challenges for large-scale real-world applications. First, fgSpMSpV adopts an MPI+OpenMP+X parallelization model to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and accelerate both pre-/post-processing and the main SpMSpV computation. Second, fgSpMSpV utilizes an adaptive parallel execution to reduce pre-processing and adapt to the parallelism and memory hierarchy of the Sunway system while still taming redundant and random memory accesses in SpMSpV, using a set of techniques that includes the fine-grained partitioner, the re-collection method, and the Compressed Sparse Column Vector (CSCV) matrix format. Third, fgSpMSpV uses several optimization techniques to further utilize the computing resources. fgSpMSpV on the Sunway TaihuLight gains a noticeable performance improvement from the key optimization techniques across various input sparsities. Additionally, fgSpMSpV is implemented on an NVIDIA Tesla P100 GPU and applied to the breadth-first search (BFS) application. fgSpMSpV on a P100 GPU obtains a speedup of up to 134.38× over state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV achieves a speedup of up to 21.68× over the state of the art.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 29"},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48217478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
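To make the SpMSpV operation named in the entry above concrete, here is a minimal sequential sketch of a column-wise sparse matrix-sparse vector product over a standard CSC (compressed sparse column) layout. It is an illustration only, not the fgSpMSpV implementation: the paper's CSCV format, partitioner, and parallelization are not reproduced, and the function name `spmspv_csc` and the example data are assumptions made for this sketch.

```python
from collections import defaultdict


def spmspv_csc(col_ptr, row_idx, values, x_idx, x_val):
    """Multiply a CSC sparse matrix by a sparse vector.

    Only the columns selected by the nonzeros of x are visited.
    Returns the nonzeros of y = A @ x as sorted (row, value) pairs.
    """
    acc = defaultdict(float)
    for j, xj in zip(x_idx, x_val):
        # Scale column j by x[j] and scatter it into the accumulator.
        for p in range(col_ptr[j], col_ptr[j + 1]):
            acc[row_idx[p]] += values[p] * xj
    return sorted(acc.items())


# Hypothetical 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] stored in CSC form.
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
values = [1.0, 4.0, 3.0, 2.0, 5.0]

# Sparse input vector x = [1, 0, 1], given as index/value lists.
result = spmspv_csc(col_ptr, row_idx, values, x_idx=[0, 2], x_val=[1.0, 1.0])
print(result)
```

The scatter into `acc` is where the irregular memory accesses mentioned in the abstract come from: the rows that get touched depend entirely on which columns the input vector selects.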
Chronic giant cranial diploe hematoma in hemophiliac.
IF 1
ACM Transactions on Parallel Computing Pub Date : 2022-03-30 DOI: 10.1055/a-1813-0090
Weizhao Gong, Hanshi Wang, Taipeng Jiang, Dahui Zuo
{"title":"Chronic giant cranial diploe hematoma in hemophiliac.","authors":"Weizhao Gong, Hanshi Wang, Taipeng Jiang, Dahui Zuo","doi":"10.1055/a-1813-0090","DOIUrl":"10.1055/a-1813-0090","url":null,"abstract":"Cranial diploe hematoma is a hematoma that occurs between the inner and outer layers of the skull and is most often seen in infants and young children. Hemophilia A is an inherited X-linked bleeding disorder caused by a deficiency of coagulation factor VIII (FVIII). Epidemiological survey results show that the prevalence of hemophilia in 24 provinces and cities in China is 2.73/100,000, while only about 5% of patients are registered. Hemophilia is mainly characterized by bleeding, which can occur anywhere in the patient's body and manifest as intracranial, gastrointestinal, or pharyngeal bleeding, which can be life-threatening in severe cases. This article shares a case of a patient with hemophilia A complicated by a chronic giant diploe hematoma.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":""},"PeriodicalIF":1.0,"publicationDate":"2022-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87263626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
BQ: A Lock-Free Queue with Batching
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-24 DOI: 10.1145/3512757
Gal Milman-Sela, Alex Kogan, Yossi Lev, Victor Luchangco, E. Petrank
{"title":"BQ: A Lock-Free Queue with Batching","authors":"Gal Milman-Sela, Alex Kogan, Yossi Lev, Victor Luchangco, E. Petrank","doi":"10.1145/3512757","DOIUrl":"https://doi.org/10.1145/3512757","url":null,"abstract":"Concurrent data structures provide fundamental building blocks for concurrent programming. Standard concurrent data structures may be extended by allowing a sequence of operations to be submitted as a batch for later execution. A sequence of such operations can then be executed more efficiently than the standard execution of one operation at a time. In this article, we develop a novel algorithmic extension to the prevalent FIFO queue data structure that exploits such batching scenarios. An implementation in C++ on a multicore demonstrates significant performance improvement of more than an order of magnitude (depending on the batch lengths and the number of threads) compared to previous queue implementations.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 49"},"PeriodicalIF":1.6,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49253770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
High-performance 3D Unstructured Mesh Deformation Using Rank Structured Matrix Computations
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-24 DOI: 10.1145/3512756
Rabab Alomairy, W. Bader, H. Ltaief, Y. Mesri, D. Keyes
{"title":"High-performance 3D Unstructured Mesh Deformation Using Rank Structured Matrix Computations","authors":"Rabab Alomairy, W. Bader, H. Ltaief, Y. Mesri, D. Keyes","doi":"10.1145/3512756","DOIUrl":"https://doi.org/10.1145/3512756","url":null,"abstract":"The Radial Basis Function (RBF) technique is an interpolation method that produces high-quality unstructured adaptive meshes. However, the RBF-based boundary problem necessitates solving a large dense linear system with cubic arithmetic complexity that is computationally expensive and prohibitive in terms of memory footprint. In this article, we accelerate the computations of 3D unstructured mesh deformation based on RBF interpolations by exploiting the rank structured property of the matrix operator. The main idea consists in approximating the matrix off-diagonal tiles up to an application-dependent accuracy threshold. We highlight the robustness of our multiscale solver by assessing its numerical accuracy using realistic 3D geometries. In particular, we model the 3D mesh deformation on a population of the novel coronaviruses. We report and compare performance results on various parallel systems against existing state-of-the-art matrix solvers.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 23"},"PeriodicalIF":1.6,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45543029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Performance Analysis and Optimal Node-aware Communication for Enlarged Conjugate Gradient Methods
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-11 DOI: 10.1145/3580003
S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson
{"title":"Performance Analysis and Optimal Node-aware Communication for Enlarged Conjugate Gradient Methods","authors":"S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson","doi":"10.1145/3580003","DOIUrl":"https://doi.org/10.1145/3580003","url":null,"abstract":"Krylov methods are a key way of solving large sparse linear systems of equations but suffer from poor strong scalability on distributed memory machines. This is due to high synchronization costs from large numbers of collective communication calls alongside a low computational workload. Enlarged Krylov methods address this issue by decreasing the total iterations to convergence, an artifact of splitting the initial residual and resulting in operations on block vectors. In this article, we present a performance study of an enlarged Krylov method, Enlarged Conjugate Gradients (ECG), noting the impact of block vectors on parallel performance at scale. Most notably, we observe the increased overhead of point-to-point communication as a result of denser messages in the sparse matrix-block vector multiplication kernel. Additionally, we present models to analyze expected performance of ECG, as well as motivate design decisions. Most importantly, we introduce a new point-to-point communication approach based on node-aware communication techniques that increases efficiency of the method at scale.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 25"},"PeriodicalIF":1.6,"publicationDate":"2022-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48394069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Deterministic Constant-Amortized-RMR Abortable Mutex for CC and DSM
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2021-12-09 DOI: 10.1145/3490559
P. Jayanti, S. Jayanti
{"title":"Deterministic Constant-Amortized-RMR Abortable Mutex for CC and DSM","authors":"P. Jayanti, S. Jayanti","doi":"10.1145/3490559","DOIUrl":"https://doi.org/10.1145/3490559","url":null,"abstract":"The abortable mutual exclusion problem, proposed by Scott and Scherer in response to the needs in real-time systems and databases, is a variant of mutual exclusion that allows processes to abort from their attempt to acquire the lock. Worst-case constant remote memory reference algorithms for mutual exclusion using hardware instructions such as Fetch&Add or Fetch&Store have long existed for both cache coherent (CC) and distributed shared memory multiprocessors, but no such algorithms are known for abortable mutual exclusion. Even relaxing the worst-case requirement to amortized, algorithms are only known for the CC model. In this article, we improve this state of the art by designing a deterministic algorithm that uses Fetch&Store to achieve amortized O(1) remote memory reference in both the CC and distributed shared memory models. Our algorithm supports Fast Abort (a process aborts within six steps of receiving the abort signal) and has the following additional desirable properties: it supports an arbitrary number of processes of arbitrary names, requires only O(1) space per process, and satisfies a novel fairness condition that we call Airline FCFS. 
Our algorithm is short with fewer than a dozen lines of code.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"8 1","pages":"1 - 26"},"PeriodicalIF":1.6,"publicationDate":"2021-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46508215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Adaptive Erasure Coded Fault Tolerant Linear System Solver
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2021-12-08 DOI: 10.1145/3490557
X. Kang, D. Gleich, A. Sameh, A. Grama
{"title":"Adaptive Erasure Coded Fault Tolerant Linear System Solver","authors":"X. Kang, D. Gleich, A. Sameh, A. Grama","doi":"10.1145/3490557","DOIUrl":"https://doi.org/10.1145/3490557","url":null,"abstract":"As parallel and distributed systems scale, fault tolerance is an increasingly important problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault tolerance problem. In this article, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in execution, we only solve a system whose size is identical to the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate observed faults. We present, in detail, the augmentation process, the parallel formulation, and evaluation of performance of our technique. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. 
We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that only checkpoints needed data structures.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"8 1","pages":"1 - 19"},"PeriodicalIF":1.6,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45097716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Parallel Peeling of Bipartite Networks for Hierarchical Dense Subgraph Discovery
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2021-10-24 DOI: 10.1145/3583084
Kartik Lakhotia, R. Kannan, V. Prasanna
{"title":"Parallel Peeling of Bipartite Networks for Hierarchical Dense Subgraph Discovery","authors":"Kartik Lakhotia, R. Kannan, V. Prasanna","doi":"10.1145/3583084","DOIUrl":"https://doi.org/10.1145/3583084","url":null,"abstract":"Wing and Tip decomposition are motif-based analytics for bipartite graphs that construct a hierarchy of butterfly (2,2-biclique) dense edge and vertex induced subgraphs, respectively. They have applications in several domains, including e-commerce, recommendation systems, document analysis, and others. Existing decomposition algorithms use a bottom-up approach that constructs the hierarchy in an increasing order of the subgraph density. They iteratively select the edges or vertices with minimum butterfly count and peel them, i.e., remove them along with their butterflies. The number of butterflies in real-world bipartite graphs makes bottom-up peeling computationally demanding. Furthermore, the strict order of peeling entities results in a large number of sequentially dependent iterations. Consequently, parallel algorithms based on bottom-up peeling incur heavy synchronization and poor scalability. In this article, we propose a novel Parallel Bipartite Network peelinG (PBNG) framework that adopts a two-phased peeling approach to relax the order of peeling, and in turn, dramatically reduce synchronization. The first phase divides the decomposition hierarchy into a few partitions and requires little synchronization. The second phase concurrently processes all partitions to generate individual levels of the hierarchy and requires no global synchronization. The two-phased peeling further enables batching optimizations that dramatically improve the computational efficiency of PBNG. We empirically evaluate PBNG using several real-world bipartite graphs and demonstrate radical improvements over the existing approaches. On a shared-memory 36-core server, PBNG achieves up to 19.7× self-relative parallel speedup. 
Compared to the state-of-the-art parallel framework ParButterfly, PBNG reduces synchronization by up to 15,260× and execution time by up to 295×. Furthermore, it achieves up to 38.5× speedup over state-of-the-art algorithms specifically tuned for wing decomposition. Our source code is made available at https://github.com/kartiklakhotia/RECEIPT.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 35"},"PeriodicalIF":1.6,"publicationDate":"2021-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44283113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
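The butterfly (2,2-biclique) motif that the entry above builds on can be counted with a short brute-force sketch: every pair of left-side vertices that shares c right-side neighbours contributes c-choose-2 butterflies. This is a quadratic-time illustration under assumed names (`count_butterflies`, the toy graph `adj`); it is not the PBNG peeling algorithm and makes no attempt at parallelism.

```python
from itertools import combinations
from math import comb


def count_butterflies(adj):
    """Count butterflies in a bipartite graph.

    adj maps each left-side vertex to the set of its right-side neighbours.
    A butterfly is two left vertices joined to the same two right vertices.
    """
    total = 0
    for u, v in combinations(sorted(adj), 2):
        shared = len(adj[u] & adj[v])
        # Any 2 of the shared neighbours close a butterfly with u and v.
        total += comb(shared, 2)
    return total


# Hypothetical toy graph: left vertices u1..u3, right vertices a..c.
adj = {"u1": {"a", "b", "c"}, "u2": {"a", "b"}, "u3": {"b", "c"}}
total = count_butterflies(adj)
print(total)
```

Bottom-up peeling, as described in the abstract, would repeatedly remove the vertex or edge with the smallest such count and update its neighbours, which is what makes the iterations sequentially dependent.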
Metrics and Design of an Instruction Roofline Model for AMD GPUs
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2021-10-15 DOI: 10.1145/3505285
M. Leinhauser, R. Widera, S. Bastrakov, A. Debus, M. Bussmann, S. Chandrasekaran
{"title":"Metrics and Design of an Instruction Roofline Model for AMD GPUs","authors":"M. Leinhauser, R. Widera, S. Bastrakov, A. Debus, M. Bussmann, S. Chandrasekaran","doi":"10.1145/3505285","DOIUrl":"https://doi.org/10.1145/3505285","url":null,"abstract":"Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD’s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. 
When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 14"},"PeriodicalIF":1.6,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41415302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
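The entry above measures performance in instructions and memory transactions. As a rough sketch of how a roofline bound is usually evaluated, the attainable instruction throughput is the smaller of the compute ceiling and the product of instruction intensity and peak memory transaction rate. The function name, units, and all numbers below are assumptions chosen for illustration; they are not measurements or metric definitions taken from the paper.

```python
def attainable_gips(intensity, peak_gips, peak_gtxn_per_s):
    """Roofline bound on instruction throughput.

    intensity:        instructions per memory transaction
    peak_gips:        compute ceiling, in billions of instructions per second
    peak_gtxn_per_s:  memory ceiling, in billions of transactions per second
    """
    return min(peak_gips, intensity * peak_gtxn_per_s)


# Hypothetical ceilings, not taken from any real GPU.
PEAK_GIPS = 100.0
PEAK_GTXN = 25.0

memory_bound = attainable_gips(2.0, PEAK_GIPS, PEAK_GTXN)
compute_bound = attainable_gips(10.0, PEAK_GIPS, PEAK_GTXN)
ridge_point = PEAK_GIPS / PEAK_GTXN  # intensity at which the two ceilings meet
print(memory_bound, compute_bound, ridge_point)
```

A kernel whose intensity falls below the ridge point is limited by memory transactions; above it, by instruction issue.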
Introduction to the Special Issue for SPAA 2019
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2021-09-20 DOI: 10.1145/3477610
P. Berenbrink
{"title":"Introduction to the Special Issue for SPAA 2019","authors":"P. Berenbrink","doi":"10.1145/3477610","DOIUrl":"https://doi.org/10.1145/3477610","url":null,"abstract":"1. Soheil Behnezhad, Laxman Dhulipala, Hossein Esfandiari, Jakub Łącki, Vahab Mirrokni, Warren Schudy: Massively Parallel Computation via Remote Memory Access 2. Faith Ellen, Barun Gorain, Avery Miller, Andrzej Pelc: Constant-Length Labeling Schemes for Deterministic Radio Broadcast 3. Michael A. Bender, Alex Conway, Martín Farach-Colton, William Jannen, Yizheng Jiao, Rob Johnson, Eric Knorr, Sara McAllister, Nirjhar Mukherjee, Prashant Pandey, Donald E. Porter, Jun Yuan, and Yang Zhan: External-Memory Dictionaries in the Affine and PDAM Models.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"8 1","pages":"1 - 1"},"PeriodicalIF":1.6,"publicationDate":"2021-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44969161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0