ACM Transactions on Mathematical Software: Latest Articles (Page 9)

Accurate Calculation of Euclidean Norms Using Double-word Arithmetic

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-10-25 DOI: 10.1145/3568672

V. Lefèvre, N. Louvet, J. Muller, Joris Picot, L. Rideau

Citations: 2

IFISS3D: A Computational Laboratory for Investigating Finite Element Approximation in Three Dimensions

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-09-27 DOI: 10.1145/3604934

Georgios Papanikos, C. Powell, D. Silvester

Citations: 1

Automatic Differentiation of C++ Codes on Emerging Manycore Architectures with Sacado

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-09-27 DOI: 10.1145/3560262

E. Phipps, R. Pawlowski, C. Trott

Abstract: Automatic differentiation (AD) is a well-known technique for evaluating analytic derivatives of calculations implemented on a computer, with numerous software tools available for incorporating AD technology into complex applications. However, a growing challenge for AD is the efficient differentiation of parallel computations implemented on emerging manycore computing architectures such as multicore CPUs, GPUs, and accelerators as these devices become more pervasive. In this work, we explore forward mode, operator overloading-based differentiation of C++ codes on these architectures using the widely available Sacado AD software package. In particular, we leverage Kokkos, a C++ tool providing APIs for implementing parallel computations that is portable to a wide variety of emerging architectures. We describe the challenges that arise when differentiating code for these architectures using Kokkos, and two approaches for overcoming them that ensure optimal memory access patterns as well as expose additional dimensions of fine-grained parallelism in the derivative calculation. We describe the results of several computational experiments that demonstrate the performance of the approach on a few contemporary CPU and GPU architectures. We then conclude with applications of these techniques to the simulation of discretized systems of partial differential equations.

Volume: 48 1, pages: 1-29. Publication type: Journal Article.

Citations: 1
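
The abstract above refers to forward-mode differentiation by operator overloading. As a rough illustration of that general idea only, here is a minimal Python sketch. It is not Sacado's API and does not involve Kokkos; the names `Dual`, `_lift`, `sin`, and `f` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Dual:
    """A value and its derivative, propagated together (forward mode)."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def sin(x):
    # Chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)


def f(x):
    # Ordinary-looking code; overloading carries the derivative along.
    return 3 * x * x + sin(x)


x = Dual(2.0, 1.0)  # seed: dx/dx = 1
y = f(x)            # y.val is f(2), y.dot is f'(2) = 6*2 + cos(2)
```

The function `f` is written once against ordinary arithmetic, and swapping the scalar type for `Dual` yields the derivative alongside the value. That is the sense in which operator overloading lets existing code be differentiated with few changes.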

Remark on Algorithm 1010: Boosting Efficiency in Solving Quartic Equations with No Compromise in Accuracy

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-09-19 DOI: 10.1145/3564270

C. De Michele

Citations: 1

emgr – EMpirical GRamian Framework Version 5.99

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-09-08 DOI: 10.1145/3609860

Christian Himpe

Citations: 19

Cache-oblivious Hilbert Curve-based Blocking Scheme for Matrix Transposition

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-08-09 DOI: 10.1145/3555353

J. N. F. Alves, L. Russo, Alexandre P. Francisco

Abstract: This article presents a fast SIMD Hilbert space-filling curve generator, which supports a new cache-oblivious blocking-scheme technique applied to the out-of-place transposition of general matrices. Matrix operations found in high performance computing libraries are usually parameterized based on host microprocessor specifications to minimize data movement within the different levels of memory hierarchy. The performance of cache-oblivious algorithms does not rely on such parameterizations. This type of algorithm provides an elegant and portable solution to address the lack of standardization in modern-day processors. Our solution consists of an iterative blocking scheme that takes advantage of the locality-preserving properties of Hilbert space-filling curves to minimize data movement in any memory hierarchy. This scheme traverses the input matrix in O(nm) time and space, improving the behavior of matrix algorithms that inherently present poor memory locality. The application of this technique to the problem of out-of-place matrix transposition achieved competitive results when compared to state-of-the-art approaches. The performance of our solution surpassed the Intel MKL version after employing standard software prefetching techniques.

Volume: 48 1, pages: 1-28. Publication type: Journal Article.

Citations: 2
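
To make the idea of traversing a matrix along a Hilbert curve more concrete, the following Python sketch visits tiles of a matrix in Hilbert order and copies each tile into its transposed position. It is an illustration under simplifying assumptions, not the authors' SIMD generator: the function names are hypothetical, the index-to-coordinate mapping is the textbook iterative one, and the fixed `tile` parameter is a convenience of this sketch rather than part of a cache-oblivious scheme.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect the current sub-square
                x, y = s - 1 - x, s - 1 - y
            # Rotate by swapping coordinates
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y


def transpose_hilbert(a, tile=2):
    """Out-of-place transpose of a rows x cols list-of-lists matrix,
    visiting tiles in Hilbert-curve order."""
    rows, cols = len(a), len(a[0])
    out = [[None] * rows for _ in range(cols)]
    tile_rows = -(-rows // tile)  # ceiling division
    tile_cols = -(-cols // tile)
    # Smallest power-of-two square grid that covers all tiles
    order = (max(tile_rows, tile_cols) - 1).bit_length()
    for d in range((1 << order) ** 2):
        bx, by = hilbert_d2xy(order, d)
        if bx >= tile_rows or by >= tile_cols:
            continue  # padding tile outside the matrix
        for i in range(bx * tile, min((bx + 1) * tile, rows)):
            for j in range(by * tile, min((by + 1) * tile, cols)):
                out[j][i] = a[i][j]
    return out


a = [[r * 5 + c for c in range(5)] for r in range(3)]
t = transpose_hilbert(a)
```

Consecutive indices along the curve map to neighbouring tiles, so successive reads and writes tend to stay close together in both the source and the destination matrix.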

Algorithm xxx: SC-SR1: MATLAB Software for Limited-Memory SR1 Trust-Region Methods

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-07-22 DOI: 10.1145/3550269

J. Brust, O. Burdakov, Jennifer B. Erway, Roummel F. Marcia

Citations: 4

Algorithms for Parallel Generic hp-Adaptive Finite Element Software

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-06-13 DOI: 10.1145/3603372

M. Fehling, W. Bangerth

Abstract: The hp-adaptive finite element method, where one independently chooses the mesh size (h) and polynomial degree (p) to be used on each cell, has long been known to have better theoretical convergence properties than either h- or p-adaptive methods alone. However, it is not widely used, owing at least in part to the difficulty of the underlying algorithms and the lack of widely usable implementations. This is particularly true when used with continuous finite elements. Herein, we discuss algorithms that are necessary for a comprehensive and generic implementation of hp-adaptive finite element methods on distributed-memory, parallel machines. In particular, we will present a multistage algorithm for the unique enumeration of degrees of freedom suitable for continuous finite element spaces, describe considerations for weighted load balancing, and discuss the transfer of variable size data between processes. We illustrate the performance of our algorithms with numerical examples and demonstrate that they scale reasonably up to at least 16,384 Message Passing Interface (MPI) processes. We provide a reference implementation of our algorithms as part of the open source library deal.II.

Volume: 49 1, pages: 1-26. Publication type: Journal Article.

Citations: 4
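
The abstract above mentions weighted load balancing without giving details. The Python sketch below shows one simple way such balancing could work in principle: cells in a fixed linear order are assigned to processes so that each process receives a roughly equal share of total weight rather than an equal number of cells. The function name, the prefix-sum rule, and the weight model (number of local degrees of freedom, taken here as (p+1)^2 for a 2D cell of degree p) are all assumptions of this illustration, not the algorithm described in the paper or implemented in deal.II.

```python
def partition_weighted(weights, n_parts):
    """Assign each cell (in its given linear order) to one of n_parts
    contiguous parts so that parts carry similar total weight."""
    total = sum(weights)
    owners = []
    acc = 0.0
    for w in weights:
        # Place the cell according to the midpoint of its weight interval
        # within the running prefix sum.
        mid = acc + w / 2
        owners.append(min(int(mid * n_parts / total), n_parts - 1))
        acc += w
    return owners


# Hypothetical mesh: degree-1 cells cost 4, degree-3 cells cost 16.
degrees = [1, 1, 1, 1, 3, 3, 1, 1, 1, 1]
weights = [(p + 1) ** 2 for p in degrees]
owners = partition_weighted(weights, 2)
loads = [sum(w for w, o in zip(weights, owners) if o == part)
         for part in range(2)]
```

In this example both parts hold five cells and carry the same total weight of 32. With a different placement of the high-degree cells, the cell counts per part would differ in order to keep the weights balanced.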

ARKODE: A Flexible IVP Solver Infrastructure for One-step Methods

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-05-27 DOI: 10.1145/3594632

D. Reynolds, D. J. Gardner, C. Woodward, Rujeko Chinomona

Citations: 4

Algorithm XXX: Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2022-04-29 DOI: 10.1145/3519383

C. Psarras, L. Karlsson, R. Bro, P. Bientinesi

Abstract: Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to MATLAB, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.

Publication type: Journal Article.

Citations: 3