2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

An Integral-equation-oriented Vectorized SpMV Algorithm and its Application on CT Imaging Reconstruction
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00080
Weicai Ye, Chenghuan Huang, Jiasheng Huang, Jiajun Li, Yao Lu, Ying Jiang
Abstract: Sparse matrix-vector multiplication (SpMV) is a core routine in many applications. Its performance is limited by the memory bandwidth consumed transferring the matrix between processors and memory, and by instruction latency in computation. Vectorized (SIMD) operations can dramatically improve execution efficiency, but the sparsity patterns of irregular matrices are not compatible with the SIMD execution style. We present a new matrix format, Compressed Sparse Column Vector (CSCV), and a corresponding vectorized SpMV algorithm for matrices arising from integral equations. The algorithm is inherently suited to wide SIMD instructions and reduces the memory bandwidth used. We implement it for Computed Tomography (CT) imaging reconstruction on both Intel and AMD x86 platforms and compare it with seven state-of-the-art SpMV implementations on different CT imaging matrices. Experimental results show that CSCV achieves up to 96.9 GFLOP/s in single-precision tests, a speedup of 3.70× over MKL and 3.48× over the second-place implementation. Furthermore, the CSCV SpMV implementation is performance portable: it contains almost no SIMD assembly code and achieves promising performance with compiler-assisted vectorization.
Code availability: https://github.com/sysu-compsci/cscv
Citations: 0
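For context, the baseline that vectorized formats like CSCV compete against is the standard CSR (Compressed Sparse Row) SpMV kernel. The sketch below shows that baseline in NumPy; CSCV's actual column-vector layout is described in the paper and is not reproduced here.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Baseline CSR sparse matrix-vector product y = A @ x."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=values.dtype)
    for i in range(n_rows):
        # Nonzeros of row i live in values[start:end]
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form
values = np.array([1.0, 2.0, 3.0])
col_idx = np.array([0, 2, 1])
row_ptr = np.array([0, 2, 3])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(values, col_idx, row_ptr, x))  # [3. 3.]
```

The irregular, per-row gather `x[col_idx[start:end]]` is exactly what frustrates SIMD execution and motivates alternative formats.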
Archpipe: Fast and Flexible Pipelined Erasure-coded Archival Scheme for Heterogeneous Networks
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00132
Bin Xu, Jianzhong Huang, X. Qin, Q. Cao, Yuanyuan Dong, Weikang Kong
Abstract: Erasure-coded archival converts the redundancy mechanism for low-access-frequency data from replication to erasure coding, balancing access performance against storage efficiency. A variety of pipelined schemes have been designed to speed up the archival operation; however, they neglect three factors that restrict or even negate the performance gains: heterogeneous networks, under-utilization of replica resources, and tight coupling with underlying platforms. In this paper, we propose Archpipe, a fast and flexible pipelined erasure-coded archival scheme. It exhibits three distinct features: 1) heterogeneous-network awareness: for a single-pipelined construction, links with sufficient bandwidth are given high scheduling priority to avoid network congestion, while locality is considered to reduce network transmissions; 2) parallel encoding: unused replica resources are exploited to adaptively construct multiple pipelines for each stripe based on the single-pipelined algorithm, enabling parity blocks to be encoded in parallel; 3) loose coupling: it does not rely on specific block placement policies or stripe construction algorithms. Experimental results indicate that Archpipe can be seamlessly integrated with common distributed storage systems, and that it improves erasure-coded archival performance by 3.6∼4.7× and 1.3∼2.6× in on-disk and in-memory scenarios, respectively.
Citations: 0
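The encode step that Archpipe's pipelines parallelize can be illustrated with the simplest possible erasure code: single XOR parity over k data blocks. Production archival systems such as the ones Archpipe targets use Reed-Solomon codes with multiple parity blocks; this is only a minimal sketch of the encode/recover idea.

```python
def encode_parity(blocks):
    """XOR single-parity encode: k data blocks yield one parity block.
    (Real systems use Reed-Solomon with m > 1 parity blocks.)"""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from survivors plus parity."""
    missing = bytearray(parity)
    for block in surviving_blocks:
        for i, b in enumerate(block):
            missing[i] ^= b
    return bytes(missing)

data = [b"abcd", b"efgh", b"ijkl"]
p = encode_parity(data)
# Lose block 1, rebuild it from blocks 0 and 2 plus parity
assert recover([data[0], data[2]], p) == data[1]
```

In a pipelined archival scheme, each node XORs its local block into a partial parity and forwards it downstream, so the parity accumulates along the pipeline instead of being computed at one node.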
CSMV: A Highly Scalable Multi-Versioned Software Transactional Memory for GPUs
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00057
D. Nunes, Daniel Castro, P. Romano
Abstract: GPUs have traditionally focused on streaming applications with regular parallelism. Over recent years, though, GPUs have also been used successfully to accelerate irregular applications in a number of domains by means of fine-grained synchronization schemes. Unfortunately, fine-grained synchronization strategies are notoriously complex and error-prone. This has motivated the search for alternative paradigms aimed at simplifying concurrent programming, among which Transactional Memory (TM) is probably one of the most prominent proposals. This paper introduces CSMV (Client-Server Multi-Versioned), a multi-versioned Software TM (STM) for GPUs that adopts an innovative client-server design. By decoupling the execution of transactions from their commit process, CSMV provides two main benefits: (i) it enables the use of fast on-chip memory to access the global metadata used to synchronize transactions, and (ii) it allows for highly efficient collaborative commit procedures, tailored to take full advantage of the architectural characteristics of GPUs. Via an extensive experimental study, we show that CSMV achieves up to three orders of magnitude speedup with respect to state-of-the-art STMs for GPUs, and that it can accelerate irregular applications running on state-of-the-art STMs for CPUs by up to 20×.
Citations: 1
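The multi-versioning that gives CSMV its name can be sketched as a minimal MVCC store: each key holds a timestamped version list, and a reader pins the snapshot current at its start, so reads never block on later writers. This is only the generic multi-version idea; CSMV's actual contribution (the client-server commit protocol tuned to GPU hardware) is not modeled here.

```python
class MVStore:
    """Tiny multi-versioned key-value store sketch."""
    def __init__(self):
        self.clock = 0
        self.versions = {}   # key -> [(commit_ts, value)] in commit order

    def begin(self):
        return self.clock    # snapshot timestamp for a new transaction

    def read(self, key, snapshot_ts):
        # Newest version committed at or before the snapshot
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def commit(self, writes):
        self.clock += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

s = MVStore()
s.commit({"x": 1})
snap = s.begin()          # this snapshot sees x == 1
s.commit({"x": 2})        # a later writer does not disturb the snapshot
assert s.read("x", snap) == 1
assert s.read("x", s.begin()) == 2
```

Keeping old versions is what lets read-only transactions run without synchronization, which is particularly valuable on GPUs where fine-grained locking is expensive.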
Mixed Precision s-step Conjugate Gradient with Residual Replacement on GPUs
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00091
I. Yamazaki, E. Carson, Brian Kelley
Abstract: The s-step Conjugate Gradient (CG) algorithm has the potential to reduce the communication cost of standard CG by a factor of s. However, though mathematically equivalent, s-step CG may be numerically less stable than standard CG in finite precision, exhibiting slower convergence and decreased attainable accuracy, which limits its use in practice. To improve the numerical behavior of s-step CG and overcome this limitation, we incorporate two techniques: first, we improve convergence through the use of higher precision at critical parts of the s-step iteration; second, we integrate a residual replacement strategy into the resulting mixed precision s-step CG to improve attainable accuracy. Our experimental results on the Summit supercomputer demonstrate that when the higher precision is implemented in hardware, these techniques have virtually no overhead on iteration time while improving both the convergence rate and the attainable accuracy of s-step CG. Even when the higher precision is implemented in software, these techniques may still reduce time-to-solution (speedups of up to 1.8× in our experiments), especially when s-step CG suffers from numerical instability with a small step size and the latency cost becomes a significant part of its iteration time.
Citations: 1
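To see where the communication cost comes from, here is standard CG in NumPy: each iteration performs two dot products, which on a distributed machine become global reductions. s-step CG restructures the recurrences so that s iterations share one communication phase, which is exactly the reorganization that trades stability for latency as the abstract describes. This sketch is the textbook algorithm, not the paper's mixed-precision variant.

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    """Standard conjugate gradient for SPD systems A x = b.
    The two dot products per iteration (p @ Ap and r @ r) are the
    global reductions that s-step CG amortizes across s iterations."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Small SPD test system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
print(np.allclose(A @ x, b))  # True
```

The residual replacement strategy mentioned in the abstract periodically recomputes r = b - A x directly (instead of updating it recursively), resetting the accumulated rounding drift between the true and recursively updated residuals.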
Top-Down Performance Profiling on NVIDIA's GPUs
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00026
Álvaro Sáiz, P. Prieto, Pablo Abad Fidalgo, J. Gregorio, Valentin Puente
Abstract: The rise of data-intensive algorithms, such as machine learning, has driven a strong diversification of Graphics Processing Units (GPUs) into fields with intensive data-level parallelism. This trend, known as general-purpose computing on GPUs (GP-GPU), makes the execution process on a GPU (seemingly simple in its architecture) far from trivial when targeting performance across many dissimilar applications. A proof of this is the existence of many profiling tools that help programmers understand how to maximize hardware utilization. In contrast, this paper proposes a profiling tool focused on microarchitecture analysis under large sets of dissimilar applications. The tool therefore has a double objective: on the one hand, to check the suitability of a GPU for diverse sets of application kernels; on the other, to identify possible bottlenecks in a given GPU microarchitecture, facilitating the improvement of subsequent designs. To this end, taking the Top-Down methodology proposed by Intel for its CPUs as inspiration, we define a hierarchical organization of the GPU's execution pipeline. The proposal uses the available hardware performance counters to identify how each component contributes to performance losses. We demonstrate the feasibility of the proposed methodology by analyzing how different modern NVIDIA architectures behave when running relevant benchmarks, assessing in which microarchitecture components performance losses are most significant.
Citations: 4
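The core of any Top-Down-style analysis is attributing issue slots to "useful work" versus stall categories, each as a fraction of total slots, then descending into the dominant category. The sketch below shows only that attribution step; the category names are illustrative placeholders, not NVIDIA's counters, and the paper's actual hierarchy over GPU hardware counters is considerably richer.

```python
def top_down(counters):
    """Top-Down-style slot attribution: express each category as a share
    of total issue slots, sorted so the dominant bottleneck comes first.
    Category names here are illustrative, not real NVIDIA counter names."""
    total = counters["total_slots"]
    shares = {k: v / total for k, v in counters.items() if k != "total_slots"}
    return dict(sorted(shares.items(), key=lambda kv: -kv[1]))

profile = top_down({
    "total_slots": 1000,
    "retiring": 450,        # slots doing useful work
    "memory_stall": 350,
    "execution_stall": 150,
    "frontend_stall": 50,
})
print(next(iter(profile)))  # retiring
```

In a real Top-Down run, the analyst would drill into whichever stall category dominates (here memory stalls, at 35% of slots) rather than tuning blindly.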
Compiler-Directed Incremental Checkpointing for Low Latency GPU Preemption
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00078
Zhuoran Ji, Cho-Li Wang
Abstract: GPUs are widely used in data centers to accelerate data-parallel applications. The multi-user, multitasking environment provides a strong incentive for preemptive GPU multitasking, especially for latency-sensitive jobs. Due to the large contexts of GPU kernels, preemptive GPU context switching is costly, and many novel GPU preemption techniques have been proposed. Among them, checkpoint-based GPU preemption enables low-latency preemption but incurs a high runtime overhead. Prior studies propose excluding dead registers from the checkpoint file to reduce this overhead. This works well for CPUs, but for GPU kernels it is not rare that a live register goes un-updated between two checkpoints. This paper presents TripleC, a compiler-directed incremental checkpointing technique specially designed for GPU preemption. Using data-flow analysis, it further excludes from the checkpoint file those registers that have not been overwritten since the last time they were spilled. TripleC's checkpoint placement algorithm properly estimates a checkpoint's cost under incremental checkpointing, and it considers the interaction among checkpoints so that the overall cost is minimized. Moreover, TripleC relaxes the conventional constraint that the whole register context must be spilled before passing a checkpoint. Because of diverse control flow, placing a register-spilling instruction at different points incurs different costs; TripleC minimizes this cost with a two-phase algorithm that schedules spilling instructions at compilation time. Evaluations show that TripleC reduces runtime overhead by 12.9% on average compared with the state-of-the-art non-incremental checkpointing approach.
Citations: 1
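The incremental idea at the heart of TripleC (spill only what changed since the last spill) can be sketched with a runtime dirty set. Note the key difference: TripleC determines the "dirty" registers statically, at compile time, via data-flow analysis; the runtime bookkeeping below is purely for illustration.

```python
class IncrementalCheckpointer:
    """Sketch: only registers overwritten since the last checkpoint are
    spilled. TripleC computes this set with compiler data-flow analysis
    rather than tracking it at runtime as done here."""
    def __init__(self):
        self.registers = {}        # live register file
        self.dirty = set()         # overwritten since last checkpoint
        self.checkpoint_file = {}  # persisted state

    def write_reg(self, name, value):
        self.registers[name] = value
        self.dirty.add(name)

    def checkpoint(self):
        # Only dirty registers cost spill bandwidth
        spilled = {r: self.registers[r] for r in self.dirty}
        self.checkpoint_file.update(spilled)
        self.dirty.clear()
        return spilled

cp = IncrementalCheckpointer()
cp.write_reg("r0", 1)
cp.write_reg("r1", 2)
assert len(cp.checkpoint()) == 2   # first checkpoint spills both
cp.write_reg("r0", 3)
assert len(cp.checkpoint()) == 1   # second spills only the rewritten r0
```

Preemption latency stays low because the checkpoint file always holds a complete, recent context, while steady-state overhead shrinks to the registers that actually changed.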
Preprocessing Pipeline Optimization for Scientific Deep Learning Workloads
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00112
K. Ibrahim, L. Oliker
Abstract: Newly developed machine learning technology promises to profoundly impact high-performance computing, with the potential to significantly accelerate scientific discovery. However, scientific machine learning performance is often constrained by data movement overheads, particularly on existing and emerging hardware-accelerated systems. In this work, we focus on optimizing data movement across storage and memory systems by developing domain-specific data encoders/decoders. These plugins have the dual benefit of significantly reducing communication while enabling efficient decoding on the accelerated hardware. We present detailed performance analyses of two important scientific learning workloads from cosmology and climate analytics, CosmoFlow and DeepCAM, on the GPU-enabled Summit and Cori supercomputers. Results demonstrate that our optimizations can improve overall performance by up to 10× compared with the default baseline, while preserving convergence behavior. Overall, this methodology can be applied to various machine learning domains and emerging AI technologies.
Citations: 1
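The encoder/decoder pattern the abstract describes can be sketched generically: compress a training sample before it crosses the storage/memory boundary and decode it near the accelerator. The paper's encoders are tailored per domain (CosmoFlow, DeepCAM); the generic zlib/pickle pair below is only a stand-in for that idea.

```python
import pickle
import zlib

import numpy as np

def encode_sample(array):
    """Generic encoder stand-in: serialize and compress a sample so that
    fewer bytes move through storage and interconnect. Domain-specific
    encoders (as in the paper) exploit structure zlib cannot."""
    return zlib.compress(pickle.dumps(array))

def decode_sample(blob):
    """Decode step, intended to run cheaply near the accelerator."""
    return pickle.loads(zlib.decompress(blob))

# A smooth scientific field is highly compressible
sample = np.zeros((64, 64), dtype=np.float32)
blob = encode_sample(sample)
assert decode_sample(blob).shape == (64, 64)
assert len(blob) < sample.nbytes   # far fewer bytes moved
```

The win comes from paying a small decode cost on fast hardware in exchange for a large reduction in bytes read from storage, which is the bottleneck the paper targets.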
DENOVA: Deduplication Extended NOVA File System
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00134
Hyungjoon Kwon, Yonghyeon Cho, Awais Khan, Yeohyeon Park, Youngjae Kim
Abstract: This paper shows, mathematically and experimentally, that inline deduplication is unsuitable in performance terms for file systems on ultra-low-latency Intel Optane DC PM devices, and proposes DeNova, an offline deduplication scheme specially designed for log-structured NVM file systems such as NOVA. DeNova offers high-performance, low-latency I/O processing and executes deduplication in the background without interfering with foreground I/Os. It employs DRAM-free persistent deduplication metadata that favors the CPU cache line, and it ensures consistency across any system failure. We implement DeNova in the NOVA file system. Evaluation confirms a negligible performance drop relative to baseline NOVA of less than 1%, while achieving high storage space savings. Extensive experiments show DeNova is failure-consistent in all failure scenarios.
Citations: 4
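Content-addressed block deduplication, the mechanism DeNova runs offline, can be sketched in a few lines: blocks are fingerprinted by hash and stored once, while files keep lists of fingerprints. DeNova's persistent, NVM-resident, cache-line-aware metadata layout is far more involved than the in-memory dictionaries used here.

```python
import hashlib

class DedupStore:
    """Minimal content-addressed dedup sketch (in-memory, not NVM)."""
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}   # fingerprint -> block data (stored once)
        self.files = {}    # file name -> list of fingerprints

    def write(self, name, data):
        fps = []
        for off in range(0, len(data), self.block_size):
            chunk = data[off:off + self.block_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(fp, chunk)  # store only new content
            fps.append(fp)
        self.files[name] = fps

    def read(self, name):
        return b"".join(self.blocks[fp] for fp in self.files[name])

store = DedupStore(block_size=4)
store.write("a", b"AAAABBBB")
store.write("b", b"BBBBAAAA")          # same two blocks, reordered
assert store.read("b") == b"BBBBAAAA"
assert len(store.blocks) == 2          # only two unique blocks kept
```

Running this scan offline, in the background, is what lets DeNova avoid adding hashing latency to the foreground write path, which is the paper's central argument against inline dedup on Optane-class devices.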
Exploring Efficient Microservice Level Parallelism
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00030
Xinkai Wang, Chao Li, Lu Zhang, Xiaofeng Hou, Quan Chen, Minyi Guo
Abstract: The microservice architecture has recently become a driving trend in the cloud by disaggregating a monolithic application into many scenario-oriented service blocks (microservices). This decomposition results in a highly dynamic execution scenario in which various chained microservices contend for computing resources in different ways. While parallelism has been exploited at both the instruction/thread level and the task/request level, very limited work has been done at the granularity of a microservice. Current parallel processing solutions are sub-optimal, as they neither capture the unique characteristics of microservices nor consider the uncertainty arising in the microservice environment. In this work we introduce microservice-level parallelism (MLP), a technique that aims to precisely coalesce and align parallel microservice chains for better system performance and resource utilization. We identify the major issues that prevent servers from effectively exploiting MLP, and we define metrics that can guide MLP optimization. We propose v-MLP, a volatility-aware MLP that adapts to a highly heterogeneous and dynamic microservice environment. We show that v-MLP can reduce tail latency by up to 50% and improve resource utilization by up to 15% under various scenarios.
Citations: 4
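The baseline opportunity MLP exploits is that independent microservice chains within one request need not run back-to-back. A minimal sketch, with hypothetical stage names: each chain is a sequence of stages, and independent chains are dispatched concurrently. v-MLP's actual contribution (deciding how to coalesce and align chains under load volatility) goes well beyond this fixed thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def run_chain(chain, payload):
    """Run one microservice chain: each stage consumes the previous output."""
    for stage in chain:
        payload = stage(payload)
    return payload

def run_parallel(chains, payload, workers=4):
    """Dispatch independent chains of one request concurrently instead of
    sequentially. Stage functions stand in for RPC calls to microservices."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_chain, chain, payload) for chain in chains]
        return [f.result() for f in futures]

# Hypothetical stages for illustration
auth = lambda x: x + ["auth"]
cart = lambda x: x + ["cart"]
ads = lambda x: x + ["ads"]

results = run_parallel([[auth, cart], [ads]], [])
assert results == [["auth", "cart"], ["ads"]]
```

Tail latency improves because the request completes when the longest chain finishes, rather than when the sum of all chains finishes.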
PARSEC: PARallel Subgraph Enumeration in CUDA
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00025
Vibhor Dodeja, M. Almasri, R. Nagi, Jinjun Xiong, Wen-mei W. Hwu
Abstract: Subgraph enumeration is an important problem in graph analytics with numerous applications. The problem is provably NP-complete and requires sophisticated heuristics and highly efficient implementations to be feasible at realistic problem sizes. Parallel solutions have shown considerable promise on CPUs and in distributed environments. Recently, GPU-based parallel solutions have also been proposed to take advantage of the massive execution resources of modern GPUs. Subgraph enumeration involves traversing a search tree for each vertex of the data graph to find matches of a query graph. Most GPU-based solutions traverse the tree breadth-first, which exploits parallelism at the cost of high memory requirements and presents a formidable challenge for processing large graphs with high-degree vertices, since GPU memory capacity is significantly lower than that of CPUs. In this work, we propose a novel GPU solution based on a hybrid BFS/DFS approach, in which the top level(s) of the search trees are traversed in a fully parallel, breadth-first manner while each subtree is traversed in a more space-efficient, depth-first manner. Depth-first traversal of subtrees requires less memory but is less amenable to parallel execution; to overcome this, we exploit fine-grained parallelism in each step of the depth-first traversal. We further identify and implement various optimizations to efficiently utilize the memory and compute resources of GPUs. Compared with state-of-the-art GPU and CPU implementations, we achieve geometric-mean speedups of 9.47× (up to 92.01×) and 2.37× (up to 12.70×), respectively. We also show that the proposed approach can efficiently process graphs that previously could not be handled by state-of-the-art GPU solutions due to their excessive memory requirements.
Citations: 4
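The BFS-versus-DFS trade-off the abstract describes is easy to see on the simplest query, a triangle: a depth-first matcher extends one partial match (an edge) at a time through common neighbors, never materializing the full frontier of partial matches that a breadth-first matcher would hold in memory. This sequential sketch shows only the traversal shape, not PARSEC's CUDA parallelization.

```python
def count_triangles(adj):
    """Depth-first triangle enumeration: for each edge (u, v) with u < v,
    extend by common neighbors w > v. Only one partial match is alive at
    a time, which is the memory frugality DFS trades for parallelism."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue                  # canonical order: count each once
            for w in adj[u] & adj[v]:     # common-neighbor extension
                if w > v:
                    count += 1
    return count

# Complete graph K4 contains C(4, 3) = 4 triangles
adj = {i: {j for j in range(4) if j != i} for i in range(4)}
print(count_triangles(adj))  # 4
```

PARSEC's hybrid keeps the top of this search tree breadth-first (plenty of independent roots to fill the GPU) and runs each subtree depth-first, extracting fine-grained parallelism from steps such as the set intersection above.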