Characterizing the performance of node-aware strategies for irregular point-to-point communication on heterogeneous architectures
Shelby Lockhart, Amanda Bienz, William D. Gropp, Luke N. Olson
Parallel Computing, Volume 116 (July 2023), Article 103021. DOI: 10.1016/j.parco.2023.103021

Abstract: Supercomputer architectures are trending toward higher computational throughput due to the inclusion of heterogeneous compute nodes. These multi-GPU nodes increase on-node computational efficiency while also increasing the amount of data to be communicated and the number of potential data flow paths. In this work, we characterize the performance of irregular point-to-point communication with MPI in heterogeneous compute environments through performance modeling, demonstrating the limitations of standard communication strategies for both device-aware and staging-through-host communication techniques. The presented models suggest staging communicated data through host processes and then using node-aware communication strategies when inter-node message counts are high. Notably, the models also predict that node-aware communication utilizing all available CPU cores to communicate inter-node data is the most performant strategy when communicating with a high number of nodes. Model validation is provided via a case study of irregular point-to-point communication patterns in distributed sparse matrix–vector products. Importantly, we include a discussion of the implications the model predictions have for communication strategy design on emerging supercomputer architectures.
Segment based power-efficient scheduling for real-time DAG tasks on edge devices
Lei Yu, Tianqi Zhong, Peng Bi, Lan Wang, Fei Teng
Parallel Computing, Volume 116 (July 2023), Article 103022. DOI: 10.1016/j.parco.2023.103022

Abstract: Smart Mobile Devices (SMDs) are crucial for real-world sensing in the edge computing paradigm. Real-world sensing is typically represented by real-time applications, which are computationally intensive and periodic, with strict time constraints. Such applications call for increased processing speed, memory capacity, and battery life on SMDs, which are typically resource-constrained due to physical size restrictions. As a result, power-efficient scheduling of real-time applications on SMDs is crucial for the regular operation of edge computing platforms, and downstream decision-making tasks such as computation offloading require predicting the power consumption of power-saving approaches such as DVFS. The main question is how to quickly obtain a good solution to the NP-hard power-efficient scheduling problem with DVFS. By segmenting the aligned tasks on an SMD, we present a segment-based analysis approach, and we offer a segment-based scheduling algorithm (SEDF), inspired by this analysis, that achieves power-efficient scheduling for these real-time workloads. The segment-based approach yields a power consumption bound (PB), and a computation offloading use case is developed to demonstrate the application of PB in subsequent decision-making processes. Both simulations and tests on actual devices confirm PB, SEDF, and the effectiveness of the offloading decision-making. We demonstrate empirically that PB can be used to make approximately optimal decisions in computation offloading problems, and that SEDF is a straightforward and effective scheduling approach that can cut the power consumption of a multi-core SMD by roughly 30%.
{"title":"Efficient checkpoint/Restart of CUDA applications","authors":"Akira Nukada , Taichiro Suzuki , Satoshi Matsuoka","doi":"10.1016/j.parco.2023.103018","DOIUrl":"https://doi.org/10.1016/j.parco.2023.103018","url":null,"abstract":"<div><p>We present NVCR<span> which enables transparent checkpoint and restart of CUDA applications. NVCR, works as an extension of major system-level checkpoint software such as BLCR and DMTCP, employs proxy-process and application accesses GPU devices via the proxy-process to improve the compatibility with latest CUDA runtime software. To reduce the overhead of inter-process communications, NVCR efficiently uses SYSV IPC shared memory as CUDA pinned memory. Performance evaluations using micro benchmarks and Amber as a real application show that NVCR’ overhead is acceptably low.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"116 ","pages":"Article 103018"},"PeriodicalIF":1.4,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49728437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU acceleration of Levenshtein distance computation between long strings","authors":"David Castells-Rufas","doi":"10.1016/j.parco.2023.103019","DOIUrl":"https://doi.org/10.1016/j.parco.2023.103019","url":null,"abstract":"<div><p>Computing edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the elements to be computed, providing additional performance gains. The implementation allows to address the computation of the edit distance between strings having hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than million characters, the performance is the best ever reported, which is above TCUPS for strings with similarities greater than 70% and above one hundred TCUPS for 99.9% similarity.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"116 ","pages":"Article 103019"},"PeriodicalIF":1.4,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49728657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NPDP benchmark suite for the evaluation of the effectiveness of automatic optimizing compilers","authors":"Marek Palkowski, Wlodzimierz Bielecki","doi":"10.1016/j.parco.2023.103016","DOIUrl":"https://doi.org/10.1016/j.parco.2023.103016","url":null,"abstract":"<div><p><span>The paper presents a benchmark suite of ten non-serial polyadic dynamic programming<span> (NPDP) kernels, which are designed to test the efficiency of tiled code generated by polyhedral optimization compilers. These kernels are mainly derived from bioinformatics algorithms, which pose a significant challenge for automatic loop nest tiling transformations. The paper describes algorithms implemented with examined kernels and unifies them in the form of loop nests presented in the C language. The purpose is to reconsider the execution and monitoring of codes, typically used in past and current publications. For carrying out experiments with introduced benchmarks, we applied the two source-to-source compilers, PLuTo and TRACO, to generate cache-efficient codes and analyzed their performance on four multi-core machines. We discuss the limitations of well-known tiling approaches and outline future tiling strategies to generate effective tiled code by means of </span></span>optimizing compilers for introduced benchmarks.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"116 ","pages":"Article 103016"},"PeriodicalIF":1.4,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49728698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A parallel non-convex approximation framework for risk parity portfolio design","authors":"Yidong Chen , Chen Li , Yonghong Hu , Zhonghua Lu","doi":"10.1016/j.parco.2023.102999","DOIUrl":"https://doi.org/10.1016/j.parco.2023.102999","url":null,"abstract":"<div><p>In this paper, we propose a parallel non-convex approximation framework (NCAQ) for optimization problems whose objective is to minimize a convex function plus the sum of non-convex functions. Based on the structure of the objective function, our framework transforms the non-convex constraints to the logarithmic barrier function and approximates the non-convex problem by a parallel quadratic approximation scheme, which will allow the original problem to be solved by accelerated inexact gradient descent in the parallel environment. Moreover, we give a detailed convergence analysis for the proposed framework. The numerical experiments show that our framework outperforms the state-of-art approaches in terms of accuracy and computation time on the high dimension non-convex Rosenbrock test functions and the risk parity problems. In particular, we implement the proposed framework on CUDA, showing a more than 25 times speed-up ratio and removing the computational bottleneck for non-convex risk-parity portfolio design. Finally, we construct the high dimension risk parity portfolio which can consistently outperform the equal weight portfolio in the application of Chinese stock markets.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"116 ","pages":"Article 102999"},"PeriodicalIF":1.4,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49756831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A survey of software techniques to emulate heterogeneous memory systems in high-performance computing","authors":"Clément Foyer, Brice Goglin, Andrès Rubio Proaño","doi":"10.1016/j.parco.2023.103023","DOIUrl":"https://doi.org/10.1016/j.parco.2023.103023","url":null,"abstract":"<div><p><span>Heterogeneous memory will be involved in several upcoming platforms on the way to exascale. Combining technologies such as HBM, DRAM and/or </span>NVDIMM<span> allows to tackle the needs of different applications in terms of bandwidth, latency or capacity. And new memory interconnects such as CXL bring easy ways to attach these technologies to the processors.</span></p><p>High-performance computing developers must prepare their runtimes and applications for these architectures, even before they are actually available. Hence, we survey software solutions for emulating them. First, we list many ways to modify the performance of platforms so that developers may test their code under different memory performance profiles. This is required to identify kernels and data buffers that are sensitive to memory performance.</p><p>Then, we present several techniques for exposing fake heterogeneous memory information to the software stack. This is useful for adapting runtimes and applications to heterogeneous memory so that different kinds of memory are detected at runtime and so that buffers are allocated in the appropriate one.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"116 ","pages":"Article 103023"},"PeriodicalIF":1.4,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49756349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A lightweight semi-centralized strategy for the massive parallelization of branching algorithms","authors":"Andres Pastrana-Cruz, Manuel Lafond","doi":"10.1016/j.parco.2023.103024","DOIUrl":"https://doi.org/10.1016/j.parco.2023.103024","url":null,"abstract":"<div><p>Several NP-hard problems are solved exactly using exponential-time branching strategies, whether it be branch-and-bound algorithms, or bounded search trees in fixed-parameter algorithms. The number of tractable instances that can be handled by sequential algorithms is usually small, whereas massive parallelization has been shown to significantly increase the space of instances that can be solved exactly. However, previous centralized approaches require too much communication to be efficient, whereas decentralized approaches are more efficient but have difficulty keeping track of the global state of the exploration.</p><p>In this work, we propose to revisit the centralized paradigm while avoiding previous bottlenecks. In our strategy, the center has lightweight responsibilities, requires only a few bits for every communication, but is still able to keep track of the progress of every worker. In particular, the center never holds any task but is able to guarantee that a process with no work always receives the highest priority task globally.</p><p>Our strategy was implemented in a generic C++ library called GemPBA, which allows a programmer to convert a sequential branching algorithm into a parallel version by changing only a few lines of code. An experimental case study on the vertex cover problem demonstrates that some of the toughest instances from the DIMACS challenge graphs that would take months to solve sequentially can be handled within two hours with our approach.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"116 ","pages":"Article 103024"},"PeriodicalIF":1.4,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49756350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lifeline-based load balancing schemes for Asynchronous Many-Task runtimes in clusters
Lukas Reitz, Kai Hardenbicker, Tobias Werner, Claudia Fohry
Parallel Computing, Volume 116 (July 2023), Article 103020. DOI: 10.1016/j.parco.2023.103020

Abstract: A popular approach to programming scalable irregular applications is Asynchronous Many-Task (AMT) programming. Here, programs define tasks according to task models such as dynamic independent tasks (DIT) or nested fork-join (NFJ). We consider cluster AMTs, in which a runtime system maps the tasks to worker threads in multiple processes.

Dynamic load balancing can be achieved via cooperative work stealing, coordinated work stealing, or work sharing. A well-performing cooperative work-stealing variant is the lifeline scheme. While previous implementations of this scheme are restricted to single-worker processes, a recent hybrid extension combines it with intra-process work sharing between multiple workers. The hybrid scheme, which was proposed for both DIT and NFJ, comes at the price of higher complexity.

This paper investigates whether this complexity is indispensable for multi-worker processes by contrasting the hybrid scheme with a novel pure work-stealing extension of the lifeline scheme to multiple workers. We independently implemented the extension for DIT and NFJ. In experiments based on four benchmarks, we observed the pure scheme to be on a par with or even outperform the hybrid one by up to 18% for DIT and up to 5% for NFJ.

Building on this main result, we studied a modification of the pure scheme that prefers local over global victims, and more heavily loaded over less loaded ones. The modification improves the performance of the pure scheme by up to 15%. Finally, we explored whether the lifeline scheme can profit from a change to coordinated work stealing. We developed a coordinated multi-worker implementation for DIT and observed a performance improvement over the cooperative scheme of up to 17%.