Parallel Computing: Latest Articles

Characterizing the performance of node-aware strategies for irregular point-to-point communication on heterogeneous architectures
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 116, Article 103021. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.103021
Shelby Lockhart, Amanda Bienz, William D. Gropp, Luke N. Olson
Abstract: Supercomputer architectures are trending toward higher computational throughput due to the inclusion of heterogeneous compute nodes. These multi-GPU nodes increase on-node computational efficiency, while also increasing the amount of data to be communicated and the number of potential data flow paths. In this work, we characterize the performance of irregular point-to-point communication with MPI on heterogeneous compute environments through performance modeling, demonstrating the limitations of standard communication strategies for both device-aware and staging-through-host communication techniques. The presented models suggest staging communicated data through host processes and then using node-aware communication strategies when inter-node message counts are high. Notably, the models also predict that node-aware communication that uses all available CPU cores to communicate inter-node data is the most performant strategy when communicating with a high number of nodes. Model validation is provided via a case study of irregular point-to-point communication patterns in distributed sparse matrix–vector products. We also discuss the implications of the model predictions for communication strategy design on emerging supercomputer architectures.
Citations: 0
Segment based power-efficient scheduling for real-time DAG tasks on edge devices
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 116, Article 103022. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.103022
Lei Yu, Tianqi Zhong, Peng Bi, Lan Wang, Fei Teng
Abstract: Smart Mobile Devices (SMDs) are crucial for the edge computing paradigm’s real-world sensing. Real-time applications, which are computationally intensive and periodic with strict time constraints, can typically be used to replicate real-world sensing. Such applications call for increased processing speed, memory capacity, and battery life on SMDs, which are typically resource-constrained due to physical size restrictions. As a result, power-efficient scheduling of real-time applications on SMDs is crucial for the regular operation of edge computing platforms, and downstream decision-making tasks such as computation offloading require predicting power consumption under power-saving approaches such as DVFS. The main question is how to quickly find a better solution to the NP-hard power-efficient scheduling problem with DVFS. To this end, we present a segment-based analysis approach that segments the aligned tasks on an SMD. Building on this analysis, we offer a segment-based scheduling algorithm (SEDF) that achieves power-efficient scheduling for these real-time workloads. The segment-based approach yields a power consumption bound (PB), and a computation offloading use case is developed to demonstrate the application of PB in subsequent decision-making processes. Both simulations and tests on actual devices are used to confirm the PB, SEDF, and the effectiveness of offloading decision-making. We demonstrate empirically that PB can be used to make approximately optimal decisions in decision-making problems involving computation offloading. SEDF is a straightforward and effective scheduling approach that can cut the power consumption of a multi-core SMD by roughly 30%.
Citations: 0
Efficient checkpoint/Restart of CUDA applications
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 116, Article 103018. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.103018
Akira Nukada, Taichiro Suzuki, Satoshi Matsuoka
Abstract: We present NVCR, which enables transparent checkpoint and restart of CUDA applications. NVCR works as an extension of major system-level checkpoint software such as BLCR and DMTCP. It employs a proxy process, and the application accesses GPU devices via this proxy process to improve compatibility with the latest CUDA runtime software. To reduce the overhead of inter-process communication, NVCR efficiently uses SYSV IPC shared memory as CUDA pinned memory. Performance evaluations using micro benchmarks and Amber as a real application show that NVCR’s overhead is acceptably low.
Citations: 0
GPU acceleration of Levenshtein distance computation between long strings
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 116, Article 103019. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.103019
David Castells-Rufas
Abstract: Computing edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the elements to be computed, providing additional performance gains. The implementation makes it possible to compute the edit distance between strings of hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than a million characters, the performance is the best ever reported, which is above TCUPS for strings with similarities greater than 70% and above one hundred TCUPS for 99.9% similarity.
Citations: 0
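For orientation, the quadratic baseline that the abstract above contrasts WFA with can be sketched in a few lines of Python. This is only the textbook row-by-row dynamic program, not the WFA algorithm and not the paper's GPU implementation; the function name `levenshtein` is an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance in O(len(a) * len(b)) time, one DP row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]
```

Every cell of the table is visited here regardless of how similar the strings are, which is the cost the abstract says becomes prohibitive for very long inputs.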
NPDP benchmark suite for the evaluation of the effectiveness of automatic optimizing compilers
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 116, Article 103016. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.103016
Marek Palkowski, Wlodzimierz Bielecki
Abstract: The paper presents a benchmark suite of ten non-serial polyadic dynamic programming (NPDP) kernels, which are designed to test the efficiency of tiled code generated by polyhedral optimization compilers. These kernels are mainly derived from bioinformatics algorithms, which pose a significant challenge for automatic loop nest tiling transformations. The paper describes the algorithms implemented by the examined kernels and unifies them in the form of loop nests written in the C language. The purpose is to reconsider the execution and monitoring of codes typically used in past and current publications. To carry out experiments with the introduced benchmarks, we applied two source-to-source compilers, PLuTo and TRACO, to generate cache-efficient codes and analyzed their performance on four multi-core machines. We discuss the limitations of well-known tiling approaches and outline future tiling strategies for generating effective tiled code with optimizing compilers for the introduced benchmarks.
Citations: 1
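To illustrate the kind of loop nest that "non-serial polyadic" refers to, here is a minimal Python sketch of matrix-chain ordering, a classic NPDP recurrence. It is chosen only as a familiar example and is not claimed to be one of the paper's ten kernels (which the abstract says are given in C); the name `matrix_chain_cost` is hypothetical.

```python
def matrix_chain_cost(dims: list[int]) -> int:
    """Minimum scalar multiplications to multiply a chain of matrices.

    Matrix i has shape dims[i] x dims[i + 1]. Each cell cost[i][j] combines
    two earlier cells (polyadic) drawn from many earlier diagonals (non-serial).
    """
    n = len(dims) - 1  # number of matrices in the chain
    if n <= 0:
        return 0
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # sub-chain length
        for i in range(n - length + 1):     # sub-chain start
            j = i + length - 1              # sub-chain end
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)        # split point
            )
    return cost[0][n - 1]
```

The innermost reduction over `k` reads cells from both the same row and the same column, which is the dependence pattern that makes such loop nests hard to tile automatically.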
A parallel non-convex approximation framework for risk parity portfolio design
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 116, Article 102999. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.102999
Yidong Chen, Chen Li, Yonghong Hu, Zhonghua Lu
Abstract: In this paper, we propose a parallel non-convex approximation framework (NCAQ) for optimization problems whose objective is to minimize a convex function plus the sum of non-convex functions. Based on the structure of the objective function, our framework transforms the non-convex constraints to the logarithmic barrier function and approximates the non-convex problem by a parallel quadratic approximation scheme, which allows the original problem to be solved by accelerated inexact gradient descent in a parallel environment. Moreover, we give a detailed convergence analysis for the proposed framework. The numerical experiments show that our framework outperforms state-of-the-art approaches in terms of accuracy and computation time on high-dimensional non-convex Rosenbrock test functions and risk parity problems. In particular, we implement the proposed framework on CUDA, showing a speed-up of more than 25 times and removing the computational bottleneck for non-convex risk-parity portfolio design. Finally, we construct a high-dimensional risk parity portfolio that consistently outperforms the equal-weight portfolio when applied to Chinese stock markets.
Citations: 0
An optimal scheduling algorithm considering the transactions worst-case delay for multi-channel hyperledger fabric network
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 117, pages 103041. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.103041
Ou Wu, Shanshan Li, He Zhang, Liwen Liu, Haoming Li, Yanze Wang, Ziyi Zhang
Citations: 0
A survey of software techniques to emulate heterogeneous memory systems in high-performance computing
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 116, Article 103023. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.103023
Clément Foyer, Brice Goglin, Andrès Rubio Proaño
Abstract: Heterogeneous memory will be involved in several upcoming platforms on the way to exascale. Combining technologies such as HBM, DRAM and/or NVDIMM makes it possible to address the needs of different applications in terms of bandwidth, latency or capacity. New memory interconnects such as CXL also bring easy ways to attach these technologies to the processors. High-performance computing developers must prepare their runtimes and applications for these architectures, even before they are actually available. Hence, we survey software solutions for emulating them. First, we list many ways to modify the performance of platforms so that developers may test their code under different memory performance profiles. This is required to identify kernels and data buffers that are sensitive to memory performance. Then, we present several techniques for exposing fake heterogeneous memory information to the software stack. This is useful for adapting runtimes and applications to heterogeneous memory so that different kinds of memory are detected at runtime and buffers are allocated in the appropriate one.
Citations: 1
A lightweight semi-centralized strategy for the massive parallelization of branching algorithms
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 116, Article 103024. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.103024
Andres Pastrana-Cruz, Manuel Lafond
Abstract: Several NP-hard problems are solved exactly using exponential-time branching strategies, whether it be branch-and-bound algorithms, or bounded search trees in fixed-parameter algorithms. The number of tractable instances that can be handled by sequential algorithms is usually small, whereas massive parallelization has been shown to significantly increase the space of instances that can be solved exactly. However, previous centralized approaches require too much communication to be efficient, whereas decentralized approaches are more efficient but have difficulty keeping track of the global state of the exploration. In this work, we propose to revisit the centralized paradigm while avoiding previous bottlenecks. In our strategy, the center has lightweight responsibilities, requires only a few bits for every communication, but is still able to keep track of the progress of every worker. In particular, the center never holds any task but is able to guarantee that a process with no work always receives the highest priority task globally. Our strategy was implemented in a generic C++ library called GemPBA, which allows a programmer to convert a sequential branching algorithm into a parallel version by changing only a few lines of code. An experimental case study on the vertex cover problem demonstrates that some of the toughest instances from the DIMACS challenge graphs that would take months to solve sequentially can be handled within two hours with our approach.
Citations: 0
Lifeline-based load balancing schemes for Asynchronous Many-Task runtimes in clusters
IF 1.4, Zone 4, Computer Science
Parallel Computing, vol. 116, Article 103020. Pub Date: 2023-07-01. DOI: 10.1016/j.parco.2023.103020
Lukas Reitz, Kai Hardenbicker, Tobias Werner, Claudia Fohry
Abstract: A popular approach to programming scalable irregular applications is Asynchronous Many-Task (AMT) programming. Here, programs define tasks according to task models such as dynamic independent tasks (DIT) or nested fork-join (NFJ). We consider cluster AMTs, in which a runtime system maps the tasks to worker threads in multiple processes. In these systems, dynamic load balancing can be achieved via cooperative work stealing, coordinated work stealing, or work sharing. A well-performing cooperative work stealing variant is the lifeline scheme. While previous implementations of this scheme are restricted to single-worker processes, a recent hybrid extension combines it with intra-process work sharing between multiple workers. The hybrid scheme, which was proposed for both DIT and NFJ, comes at the price of higher complexity. This paper investigates whether this complexity is indispensable for multi-worker processes by contrasting the hybrid scheme with a novel pure work stealing extension of the lifeline scheme to multiple workers. We independently implemented the extension for DIT and NFJ. In experiments based on four benchmarks, we observed the pure scheme to be on a par with or even outperform the hybrid one, by up to 18% for DIT and up to 5% for NFJ. Building on this main result, we studied a modification of the pure scheme that prefers local over global victims, and more heavily loaded over less loaded ones. The modification improves the performance of the pure scheme by up to 15%. Finally, we explored whether the lifeline scheme can profit from a change to coordinated work stealing. We developed a coordinated multi-worker implementation for DIT and observed a performance improvement over the cooperative scheme of up to 17%.
Citations: 0