{"title":"Parallel multi-view HEVC for heterogeneously embedded cluster system","authors":"Seo Jin Jang, Wei Liu, Wei Li, Yong Beom Cho","doi":"10.1016/j.parco.2022.102948","DOIUrl":"10.1016/j.parco.2022.102948","url":null,"abstract":"<div><p>In this paper, we present a computer cluster with heterogeneous computing components intended to provide concurrency and parallelism with embedded processors to achieve a real-time Multi-View High-Efficiency Video Coding (MV-HEVC) encoder/decoder with a maximum resolution of 1088p. The latest MV-HEVC standard represents a significant improvement over the previous multi-view video coding standard (MVC); however, it also has higher computational complexity. To this point, research using MV-HEVC has had to rely on the Central Processing Unit (CPU) of a Personal Computer (PC) or workstation for decompression, because MV-HEVC is much more complex than High-Efficiency Video Coding (HEVC) and because decompressors need higher parallelism to decompress in real time. It is particularly difficult to encode/decode on an embedded device. Therefore, we propose a novel framework for an MV-HEVC encoder/decoder based on a heterogeneously distributed embedded system. To this end, we use a parallel computing method to divide the video into multiple blocks and then code the blocks independently in each sub-work node, at both the group-of-pictures (GOP) and coding-tree-unit (CTU) levels. To assign tasks appropriately to each work node, we propose a new allocation method that makes the operation of the entire heterogeneously distributed system more efficient. Our experimental results show that, compared to a single device (3D-HTM, single-threaded), the proposed distributed MV-HEVC decoder and encoder performance increased approximately 20.39 and 68.7 times, respectively, on 20 devices (multithreaded) at the CTU level for 1088p video. Further, at the proposed GOP level, decoder and encoder performance on 20 devices (multithreaded) increased approximately 20.78 and 77 times, respectively, for 1088p video with heterogeneously distributed computing compared to a single device (3D-HTM, single-threaded).</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112","pages":"Article 102948"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74144309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers","authors":"Daniel Bielich, Julien Langou, Stephen Thomas, Kasia Świrydowicz, Ichitaro Yamazaki, Erik G. Boman","doi":"10.1016/j.parco.2022.102940","DOIUrl":"https://doi.org/10.1016/j.parco.2022.102940","url":null,"abstract":"<div><p>The parallel strong-scaling of iterative methods is often determined by the number of global reductions at each iteration. Low-synch Gram–Schmidt algorithms are applied here to the Arnoldi algorithm to reduce the number of global reductions and therefore improve the parallel strong-scaling of iterative solvers for nonsymmetric matrices, such as the GMRES and Krylov–Schur iterative methods. In the Arnoldi context, the <math><mrow><mi>Q</mi><mi>R</mi></mrow></math> factorization is “left-looking” and processes one column at a time. Among the methods for generating an orthogonal basis for the Arnoldi algorithm, the classical Gram–Schmidt algorithm with reorthogonalization (CGS2) requires three global reductions per iteration. A new variant of CGS2 that requires only one reduction per iteration is presented and applied to the Arnoldi algorithm. Delayed CGS2 (DCGS2) employs the minimum number of global reductions per iteration (one) for a one-column-at-a-time algorithm. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov–Schur eigenvalue computations. Performance experiments on the ORNL Summit supercomputer then establish the superiority of DCGS2 over CGS2.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112","pages":"Article 102940"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91714605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms","authors":"Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, Ole Schütt, Alfio Lazzaro, Hans Pabst, Stephan Mohr, Jürg Hutter, Thomas D. Kühne, Christian Plessl","doi":"10.1016/j.parco.2022.102920","DOIUrl":"10.1016/j.parco.2022.102920","url":null,"abstract":"<div><p>We push the boundaries of electronic structure-based <em>ab-initio</em> molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural-network and machine-learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low- and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably to massively parallel computing systems and translates large sparse matrix operations into highly parallel, dense matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center of each AIMD step, is able to achieve a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision, corresponding to an efficiency of 67.7% when running on 1536 NVIDIA A100 GPUs.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102920"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000242/pdfft?md5=cb708fe8c83694714bb33b45ee473a37&pid=1-s2.0-S0167819122000242-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77029045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial- and time-division multiplexing in CNN accelerator","authors":"Tetsuro Nakamura, Shogo Saito, Kei Fujimoto, Masashi Kaneko, Akinori Shiraga","doi":"10.1016/j.parco.2022.102922","DOIUrl":"10.1016/j.parco.2022.102922","url":null,"abstract":"<div><p>With the widespread use of real-time data analysis by artificial intelligence (AI), the integration of accelerators is attracting attention from the perspectives of low power consumption and low latency. The objective of this research is to increase accelerator resource efficiency and further reduce power consumption by sharing accelerators among multiple users while maintaining real-time performance. To achieve such an accelerator-sharing system, we define three requirements: high device utilization, fair device utilization among users, and real-time performance. Targeting the AI inference use case, this paper proposes a system that shares a field-programmable gate array (FPGA) among multiple users by switching the convolutional neural network (CNN) models stored in the device memory on the FPGA, while satisfying the three requirements. The proposed system uses different behavioral models for workloads with predictable and unpredictable data arrival timing. For workloads with predictable data arrival timing, the system uses spatial-division multiplexing of the FPGA device memory to achieve real-time performance and high device utilization. Specifically, the FPGA device memory controller transparently preloads and caches the CNN models in the FPGA device memory before the data arrive. For workloads with unpredictable data arrival timing, the system transfers CNN models to the FPGA device memory upon data arrival, using time-division multiplexing of the FPGA device memory. In the latter case, the cost of switching between CNN models is non-negligible, so to achieve real-time performance and high device utilization the system integrates a new scheduling algorithm that accounts for the switching time of the CNN models. For both predictable and unpredictable workloads, user fairness is achieved by an ageing technique in the scheduling algorithm that increases the priority of jobs in accordance with their waiting time. The evaluation results show that the scheduling overhead of the proposed system is negligible for both predictable and unpredictable workloads, providing practical real-time performance. For unpredictable workloads, the new scheduling algorithm improves fairness by 24%–94% and resource efficiency by 31%–33% compared to traditional first-come first-served or round-robin algorithms. For predictable workloads, the system improves fairness by 50.5% compared to first-come first-served and achieves 99.5% resource efficiency.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102922"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000254/pdfft?md5=ffbf4f3879d04bf0d1e7a6b33de6606f&pid=1-s2.0-S0167819122000254-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90824046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight","authors":"Jianguo Liang, Rong Hua, Wenqiang Zhu, Yuxi Ye, You Fu, Hao Zhang","doi":"10.1016/j.parco.2022.102893","DOIUrl":"10.1016/j.parco.2022.102893","url":null,"abstract":"<div><p>The Silicon-Crystal application, based on molecular dynamics (MD), is used to simulate the thermal conductivity of a crystal; it adopts the Tersoff potential to simulate the trajectories of the silicon atoms. Building on the OpenACC version, task-pipeline optimization and an interval graph coloring scheduling method are proposed to better address discrete memory accesses and write dependencies. In addition, the code running on the CPEs is vectorized with SIMD instructions to further improve computational performance. After the collaborative OpenACC+Athread development, performance improves by a factor of 16.68 and achieves a 2.34× speedup over the OpenACC version. Moreover, the application scales to 66,560 cores and can simulate 268,435,456 silicon atoms.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102893"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75647516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance","authors":"Yassine Ramdane, Omar Boussaid, Doulkifli Boukraà, Nadia Kabachi, Fadila Bentayeb","doi":"10.1016/j.parco.2022.102918","DOIUrl":"10.1016/j.parco.2022.102918","url":null,"abstract":"<div><p>Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of Hadoop is a challenging task. An OLAP cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well known that the star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load-balancing techniques have been proposed in the literature. However, some issues remain open, such as decreasing the number of Spark stages and the network I/O of an OLAP query executed on a distributed system. In preceding work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system’s query optimizer can perform the star-join process locally, in a single Spark stage without a shuffle phase. The system can also skip loading unnecessary data blocks when evaluating the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve group-by aggregation. To evaluate our approach, we conduct experiments on a 15-node cluster. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102918"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90453784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards scaling community detection on distributed-memory heterogeneous systems","authors":"Nitin Gawande, Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman","doi":"10.1016/j.parco.2022.102898","DOIUrl":"10.1016/j.parco.2022.102898","url":null,"abstract":"<div><p>In most real-world networks, nodes/vertices tend to be organized into tightly-knit modules known as <em>communities</em> or <em>clusters</em>, such that nodes within a community are more likely to be connected or related to one another than they are to the rest of the network. Community detection in a network (graph) is aimed at finding a partitioning of the vertices into communities. The goodness of the partitioning is commonly measured using <em>modularity</em>. Maximizing modularity is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, multi-iteration heuristic for modularity maximization called the <em>Louvain</em> method. Owing to its speed and ability to yield high-quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection.</p><p>Distributed multi-GPU systems pose significant challenges and opportunities for the efficient execution of parallel applications. Graph algorithms, in particular, are known to be harder to parallelize on such platforms, owing to irregular memory accesses, low computation-to-communication ratios, and load-balancing problems that are especially hard to address on multi-GPU systems.</p><p>In this paper, we present our ongoing work on a distributed-memory implementation of the Louvain method for heterogeneous systems. We build on our prior work parallelizing the Louvain method for community detection on traditional CPU-only distributed systems without GPUs. Corroborated by an extensive set of experiments on multi-GPU systems, we demonstrate performance competitive with an existing distributed-memory CPU-based implementation, up to 3.2<math><mo>×</mo></math> speedup using 16 nodes of OLCF Summit relative to two nodes, and up to 19<math><mo>×</mo></math> speedup relative to the NVIDIA RAPIDS® cuGraph® implementation on a single NVIDIA V100 GPU of a DGX-2 platform, while achieving solution quality comparable to the original Louvain method. To the best of our knowledge, this work represents the first effort for community detection on distributed multi-GPU systems. Our approach and related findings can be extended to numerous other iterative graph algorithms on multi-GPU systems.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102898"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000060/pdfft?md5=af2c328e8814f291f58460d2c8138c36&pid=1-s2.0-S0167819122000060-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88658806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task-parallel tiled direct solver for dense symmetric indefinite systems","authors":"Zhongyu Shen, Jilin Zhang, Tomohiro Suzuki","doi":"10.1016/j.parco.2022.102900","DOIUrl":"10.1016/j.parco.2022.102900","url":null,"abstract":"<div><p>This paper proposes a direct solver for dense symmetric indefinite linear systems. The program is parallelized via the OpenMP task construct and outperforms existing programs. The proposed solver avoids pivoting, which requires substantial data movement, by preconditioning the factorization with the symmetric random butterfly transformation. The matrix data layout is tiled after the preconditioning to use cache memory more efficiently during factorization. When the input matrices have a low-rank property, an adaptive cross approximation is used to form a low-rank approximation before the update step to reduce the computational load. Iterative refinement is then used to improve the accuracy of the final result. Finally, the performance of the proposed solver is compared with that of various symmetric indefinite linear system solvers to show its superiority.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102900"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85415549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A coarse-grained multicomputer parallel algorithm for the sequential substring constrained longest common subsequence problem","authors":"Vianney Kengne Tchendji, Hermann Bogning Tepiele, Mathias Akong Onabid, Jean Frédéric Myoupo, Jerry Lacmou Zeutouo","doi":"10.1016/j.parco.2022.102927","DOIUrl":"10.1016/j.parco.2022.102927","url":null,"abstract":"<div><p>In this paper, we study the sequential substring constrained longest common subsequence (SSCLCS) problem, which is widely used in the bioinformatics field. Given two strings <math><mi>X</mi></math> and <math><mi>Y</mi></math> with respective lengths <math><mi>m</mi></math> and <math><mi>n</mi></math>, formed on an alphabet <math><mi>Σ</mi></math>, and a constraint sequence <math><mi>C</mi></math> formed by ordered strings <math><mrow><mo>(</mo><msup><mrow><mi>c</mi></mrow><mrow><mn>1</mn></mrow></msup><mo>,</mo><msup><mrow><mi>c</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>,</mo><mo>…</mo><mo>,</mo><msup><mrow><mi>c</mi></mrow><mrow><mi>l</mi></mrow></msup><mo>)</mo></mrow></math> with total length <math><mi>r</mi></math>, the SSCLCS problem is to find the longest common subsequence <math><mi>D</mi></math> of <math><mi>X</mi></math> and <math><mi>Y</mi></math> such that <math><mi>D</mi></math> contains <math><mrow><msup><mrow><mi>c</mi></mrow><mrow><mn>1</mn></mrow></msup><mo>,</mo><msup><mrow><mi>c</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>,</mo><mo>…</mo><mo>,</mo><msup><mrow><mi>c</mi></mrow><mrow><mi>l</mi></mrow></msup></mrow></math> in order. To solve this problem, Tseng et al. proposed a dynamic-programming algorithm that runs in <math><mrow><mi>O</mi><mfenced><mrow><mi>m</mi><mi>n</mi><mi>r</mi><mo>+</mo><mrow><mo>(</mo><mi>m</mi><mo>+</mo><mi>n</mi><mo>)</mo></mrow><mo>|</mo><mi>Σ</mi><mo>|</mo></mrow></mfenced></mrow></math> time. We rely on this work to propose a parallel algorithm for the SSCLCS problem on the Coarse-Grained Multicomputer (CGM) model. We design a three-dimensional partitioning technique for the corresponding dependency graph that reduces the latency time of processors by ensuring that, at each step, the subproblems performed by the processors are small. It also minimizes the number of communications between processors. Our solution requires <math><mrow><mi>O</mi><mfenced><mrow><mfrac><mrow><mi>n</mi><mi>m</mi><mi>r</mi><mo>+</mo><mrow><mo>(</mo><mi>m</mi><mo>+</mo><mi>n</mi><mo>)</mo></mrow><mo>|</mo><mi>Σ</mi><mo>|</mo></mrow><mrow><mi>p</mi></mrow></mfrac></mrow></mfenced></mrow></math> execution time with <math><mrow><mi>O</mi><mrow><mo>(</mo><mi>p</mi><mo>)</mo></mrow></mrow></math> communication rounds on <math><mi>p</mi></math> processors. Experimental results show that our solution achieves speedups of up to 59.7 on 64 processors, outperforming the CGM-based parallel techniques that have been used to solve similar problems.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102927"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87832997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear solvers for power grid optimization problems: A review of GPU-accelerated linear solvers","authors":"Kasia Świrydowicz, Eric Darve, Wesley Jones, Jonathan Maack, Shaked Regev, Michael A. Saunders, Stephen J. Thomas, Slaven Peleš","doi":"10.1016/j.parco.2021.102870","DOIUrl":"10.1016/j.parco.2021.102870","url":null,"abstract":"<div><p>The linear equations that arise in interior methods for constrained optimization are sparse, symmetric, and indefinite, and they become extremely ill-conditioned as the interior method converges. These linear systems present a challenge for existing solver frameworks based on sparse LU or <math><msup><mrow><mtext>LDL</mtext></mrow><mrow><mtext>T</mtext></mrow></msup></math> decompositions. We benchmark five well-known direct linear solver packages on CPU- and GPU-based hardware, using matrices extracted from power grid optimization problems. The achieved solution accuracy varies greatly among the packages. None of the tested packages delivers significant GPU acceleration for our test cases. For completeness of the comparison, we include results for MA57, which is one of the most efficient and reliable CPU solvers for this class of problem.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102870"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80695625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}