{"title":"Tree Partitioning Reduction","authors":"A. P. Diéguez, M. Amor, R. Doallo","doi":"10.1145/3328731","DOIUrl":"https://doi.org/10.1145/3328731","url":null,"abstract":"Solving tridiagonal linear-equation systems is a fundamental computing kernel in a wide range of scientific and engineering applications, and its computation can be modeled with parallel algorithms. These parallel solvers are typically designed to compute problems whose data fit in a common shared-memory space where all the cores taking part in the computation have access. However, when the problem size is large, data cannot be entirely stored in the common shared-memory space, and a high number of high-latency communications are performed. One alternative is to partition the problem among different memory spaces. At this point, conventional parallel algorithms do not facilitate the partition of computation in independent tiles, since each reduction depends on equations that may be in different tiles. This article proposes an algorithm based on a tree reduction, called the Tree Partitioning Reduction (TPR) method, which partitions the problem into independent slices that can be partially computed in parallel within different common shared-memory spaces. The TPR method can be implemented for any parallel and distributed programming paradigm. Furthermore, in this work, TPR is efficiently implemented for CUDA GPUs to solve large size problems, providing highly competitive performance results with respect to existing packages, being, on average, 22.03× faster than CUSPARSE.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"408 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2019-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76467447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Agulhari, Alexandre Felipe, R. Oliveira, P. Peres
{"title":"Algorithm 998","authors":"C. Agulhari, Alexandre Felipe, R. Oliveira, P. Peres","doi":"10.1145/3323925","DOIUrl":"https://doi.org/10.1145/3323925","url":null,"abstract":"The ROLMIP (Robust LMI Parser) is a toolbox specialized in control theory for uncertain linear systems, built to work under MATLAB jointly with YALMIP, to ease the programming of sufficient Linear Matrix Inequality (LMI) conditions that, if feasible, assure the validity of parameter-dependent LMIs in the entire set of uncertainty considered. This article presents the new version of the ROLMIP toolbox, which was completely remodeled to provide a high-level user-friendly interface to cope with distinct uncertain domains (hypercube and multi-simplex) and to treat time-varying parameters in discrete- and continuous-time. By means of simple commands, the user is able to define polynomial matrices as well as to describe the desired parameter-dependent LMIs in an easy way, considerably reducing the programming time to end up with implementable LMI conditions. Therefore, ROLMIP helps the popularization of the state-of-the-art robust control methods for uncertain systems based on LMIs among graduate students, researchers, and engineers in control systems.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"44 1","pages":"1 - 25"},"PeriodicalIF":0.0,"publicationDate":"2019-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74625694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithm 997","authors":"R. Speck","doi":"10.1145/3310410","DOIUrl":"https://doi.org/10.1145/3310410","url":null,"abstract":"In this article, we present the Python framework pySDC for solving collocation problems with spectral deferred correction (SDC) methods and their time-parallel variant PFASST, the parallel full approximation scheme in space and time. pySDC features many implementations of SDC and PFASST, from simple implicit timestepping to high-order implicit-explicit or multi-implicit splitting and multilevel SDCs. The software package comes with many different, preimplemented examples and has seven tutorials to help new users with their first steps. Time parallelism is implemented either in an emulated way for debugging and prototyping or using MPI for benchmarking. The code is fully documented and tested using continuous integration, including most results of previous publications. Here, we describe the structure of the code by taking two different perspectives: those of the user and those of the developer. The first sheds light on the front-end, the examples, and the tutorials, and the second is used to describe the underlying implementation and the data structures. We show three different examples to highlight various aspects of the implementation, the capabilities, and the usage of pySDC. In addition, couplings to the FEniCS framework and PETSc, the latter including spatial parallelism with MPI, are described.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"72 1","pages":"1 - 23"},"PeriodicalIF":0.0,"publicationDate":"2019-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88231176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU","authors":"Carl Yang, A. Buluç, John Douglas Owens","doi":"10.1145/3466795","DOIUrl":"https://doi.org/10.1145/3466795","url":null,"abstract":"High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in “GraphBLAST”, the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"12 1","pages":"1 - 51"},"PeriodicalIF":0.0,"publicationDate":"2019-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75148402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adjoint Code Design Patterns","authors":"U. Naumann","doi":"10.1145/3326162","DOIUrl":"https://doi.org/10.1145/3326162","url":null,"abstract":"Adjoint methods have become fundamental ingredients of the scientific computing toolbox over the past decades. Large-scale parameter sensitivity analysis, uncertainty quantification, and nonlinear optimization would otherwise turn out computationally infeasible. The symbolic derivation of adjoint mathematical models for relevant problems in science and engineering and their implementation in consistency with the implementation of the underlying primal model frequently proves highly challenging. Hence, an increased interest in algorithmic adjoints can be observed. The algorithmic derivation of adjoint numerical simulation programs shifts some of the problems faced from functional and numerical analysis to computer science. It becomes a highly complex software engineering task requiring expertise in software analysis, transformation, and optimization. Despite rather mature software tool support for algorithmic differentiation, substantial user intervention is typically required when targeting nontrivial numerical programs. A large number of patterns shared by numerous application codes results in repeated duplication of development effort. The adjoint code design patterns introduced in this article aim to reduce this problem through improved formalization from the software engineering perspective. Fully functional reference implementations are provided through github.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"122 1","pages":"1 - 32"},"PeriodicalIF":0.0,"publicationDate":"2019-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87665183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enclosing Chebyshev Expansions in Linear Time","authors":"B. Hashemi","doi":"10.1145/3319395","DOIUrl":"https://doi.org/10.1145/3319395","url":null,"abstract":"We consider the problem of computing rigorous enclosures for polynomials represented in the Chebyshev basis. Our aim is to compare and develop algorithms with a linear complexity in terms of the polynomial degree. A first category of methods relies on a direct interval evaluation of the given Chebyshev expansion in which Chebyshev polynomials are bounded, e.g., with a divide-and-conquer strategy. Our main category of methods that are based on the Clenshaw recurrence includes interval Clenshaw with defect correction (ICDC), and the spectral transformation of Clenshaw recurrence rewritten as a discrete dynamical system. An extension of the barycentric representation to interval arithmetic is also considered that has a log-linear complexity as it takes advantage of a verified discrete cosine transform. We compare different methods and provide illustrative numerical experiments. In particular, our eigenvalue-based methods are interesting for bounding the range of high-degree interval polynomials. Some of the methods rigorously compute narrow enclosures for high-degree Chebyshev expansions at thousands of points in a few seconds on an average computer. We also illustrate how to employ our methods as an automatic a posteriori forward error analysis tool to monitor the accuracy of the Chebfun feval command.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"160 1","pages":"1 - 33"},"PeriodicalIF":0.0,"publicationDate":"2019-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76973574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-performance Implementation of Elliptic Curve Cryptography Using Vector Instructions","authors":"Armando Faz-Hernández, Julio López, R. Dahab","doi":"10.1145/3309759","DOIUrl":"https://doi.org/10.1145/3309759","url":null,"abstract":"Elliptic curve cryptosystems are considered an efficient alternative to conventional systems such as DSA and RSA. Recently, Montgomery and Edwards elliptic curves have been used to implement cryptosystems. In particular, the elliptic curves Curve25519 and Curve448 were used for instantiating Diffie-Hellman protocols named X25519 and X448. Mapping these curves to twisted Edwards curves allowed deriving two new signature instances, called Ed25519 and Ed448, of the Edwards Digital Signature Algorithm. In this work, we focus on the secure and efficient software implementation of these algorithms using SIMD parallel processing. We present software techniques that target the Intel AVX2 vector instruction set for accelerating prime field arithmetic and elliptic curve operations. Our contributions result in a high-performance software library for AVX2-ready processors. For example, our library computes digital signatures 19% (for Ed25519) and 29% (for Ed448) faster than previous optimized implementations. Also, our library improves by 10% and 20% the execution time of X25519 and X448, respectively.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"23 1","pages":"1 - 35"},"PeriodicalIF":0.0,"publicationDate":"2019-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88350200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithm 995","authors":"Juliette Pardue, Andrey N. Chernikov","doi":"10.1145/3301321","DOIUrl":"https://doi.org/10.1145/3301321","url":null,"abstract":"A bottom-up approach to parallel anisotropic mesh generation is presented by building a mesh generator starting from the basic operations of vertex insertion and Delaunay triangles. Applications focusing on high-lift design or dynamic stall, or numerical methods and modeling test cases, still focus on two-dimensional domains. This automated parallel mesh generation approach can generate high-fidelity unstructured meshes with anisotropic boundary layers for use in the computational fluid dynamics field. The anisotropy requirement adds a level of complexity to a parallel meshing algorithm by making computation depend on the local alignment of elements, which in turn is dictated by geometric boundaries and the density functions— one-dimensional spacing functions generated from an exponential distribution. This approach yields computational savings in mesh generation and flow solution through well-shaped anisotropic triangles instead of isotropic triangles. The validity of the meshes is shown through solution characteristic comparisons to verified reference solutions. A 79% parallel weak scaling efficiency on 1,024 distributed memory nodes, and a 72% parallel efficiency over the fastest sequential isotropic mesh generator on 512 distributed memory nodes, is shown through numerical experiments.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"33 1","pages":"1 - 30"},"PeriodicalIF":0.0,"publicationDate":"2019-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84554039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithm 994","authors":"F. Hernando, Francisco D. Igual, G. Quintana-Ortí","doi":"10.1145/3302389","DOIUrl":"https://doi.org/10.1145/3302389","url":null,"abstract":"The minimum distance of an error-correcting code is an important concept in information theory. Hence, computing the minimum distance of a code with a minimum computational cost is crucial to many problems in this area. In this article, we present and assess a family of implementations of both the brute-force algorithm and the Brouwer-Zimmermann algorithm for computing the minimum distance of a random linear code over F2 that are faster than current implementations, both in the commercial and public domain. In addition to the basic sequential implementations, we present parallel and vectorized implementations that produce high performances on modern architectures. The attained performance results show the benefits of the developed optimized algorithms, which obtain remarkable improvements compared with state-of-the-art implementations widely used nowadays.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"21 1","pages":"1 - 28"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77977526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGPOPS","authors":"Yunus M. Agamawi, Anil V. Rao","doi":"10.1145/3390463","DOIUrl":"https://doi.org/10.1145/3390463","url":null,"abstract":"A general-purpose C++ software program called CGPOPS is described for solving multiple-phase optimal control problems using adaptive direct orthogonal collocation methods. The software employs a Legendre-Gauss-Radau direct orthogonal collocation method to transcribe the continuous optimal control problem into a large sparse nonlinear programming problem (NLP). A class of hp mesh refinement methods are implemented that determine the number of mesh intervals and the degree of the approximating polynomial within each mesh interval to achieve a specified accuracy tolerance. The software is interfaced with the open source Newton NLP solver IPOPT. All derivatives required by the NLP solver are computed via central finite differencing, bicomplex-step derivative approximations, hyper-dual derivative approximations, or automatic differentiation. The key components of the software are described in detail, and the utility of the software is demonstrated on five optimal control problems of varying complexity. The software described in this article provides researchers a transitional platform to solve a wide variety of complex constrained optimal control problems.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"57 1","pages":"1 - 38"},"PeriodicalIF":0.0,"publicationDate":"2019-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85956700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}