{"title":"Meataxe64: High performance linear algebra over finite fields","authors":"R. Parker","doi":"10.1145/3115936.3115947","DOIUrl":"https://doi.org/10.1145/3115936.3115947","url":null,"abstract":"","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128888216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Algorithm For Spliting Polynomial Systems Based On F4","authors":"M. Monagan, Roman Pearce","doi":"10.1145/3115936.3115948","DOIUrl":"https://doi.org/10.1145/3115936.3115948","url":null,"abstract":"We present algorithms for splitting polynomial systems using Gröbner bases. For zero dimensional systems, we use FGLM to compute univariate polynomials and factor them, placing the ideal into general position if necessary. For positive dimensional systems, we successively eliminate variables using F4 and use the leading co-efficients of the last variable to split the system. We also present a known optimization to reduce the cost of zero-reductions in F4, an improvement for FGLM over the rationals, and an algorithm for quickly detecting redundant ideals in a decomposition.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115182205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Parallel Multi-point Evaluation of Sparse Polynomials","authors":"M. Monagan, Alan Wong","doi":"10.1145/3115936.3115940","DOIUrl":"https://doi.org/10.1145/3115936.3115940","url":null,"abstract":"We present a parallel algorithm to evaluate a sparse polynomial in Zp[x0, ..., xn] into many bivariate images, based on the fast multi-point evaluation technique described by van der Hoeven and Lecerf [11]. We have implemented the fast parallel algorithm in Cilk C. We present benchmarks demonstrating good parallel speedup for multi-core computers. Our algorithm was developed with a specific application in mind, namely, the sparse polynomial GCD algorithm of Hu and Monagan [6] which requires evaluations of this form. We present benchmarks showing a large speedup for the polynomial GCD problem.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126852562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multithreaded programming on the GPU: pointers and hints for the computer algebraist","authors":"M. M. Maza","doi":"10.1145/3115936.3115939","DOIUrl":"https://doi.org/10.1145/3115936.3115939","url":null,"abstract":"It is well-known that the advent of hardware acceleration technologies (multicore processors, graphics processing units, field programmable gate arrays) provide vast opportunities for innovation in computing. In particular, GPUs combined with low-level heterogeneous programming models, such as CUDA (the Compute Unified Device Architecture, see [6, 7]), brought super-computing to the level of the desktop computer. However, these low-level programming models carry notable challenges, even to expert programmers. Indeed, fully exploiting the power of hardware accelerators by writing CUDA code often requires significant code optimization effort. This two-hour tutorial attempts to cover the key principles that computer algebraists interested in GPU programming should have in mind. The first half introduces the basics of GPU architecture and the CUDA programming model: no preliminary experience with GPU programming will be assumed; see [10] for a reference. In the second hour, we shall discuss the recent developments in terms of GPU architecture (e.g. dynamic parallelism [12]) and programming models (e.g. OpenMP [1, 9] and OpenACC [8, 11] as well as techniques for improving code performance (e.g MWP-CWP mode [4], TMM model [5], MCM model [3]). 
Illustrative examples are taken from the CUMODP library [2] for dense polynomial arithmetic over finite fields.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124463602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Generic Scalable Parallel Combinatorial Search","authors":"B. Archibald, Patrick Maier, Robert J. Stewart, P. Trinder, J. Beule","doi":"10.1145/3115936.3115942","DOIUrl":"https://doi.org/10.1145/3115936.3115942","url":null,"abstract":"Combinatorial search problems in mathematics, e.g. in finite geometry, are notoriously hard; a state-of-the-art backtracking search algorithm can easily take months to solve a single problem. There is clearly demand for parallel combinatorial search algorithms scaling to hundreds of cores and beyond. However, backtracking combinatorial searches are challenging to parallelise due to their sensitivity to search order and due to the their irregularly shaped search trees. Moreover, scaling parallel search to hundreds of cores generally requires highly specialist parallel programming expertise. This paper proposes a generic scalable framework for solving hard combinatorial problems. Key elements are distributed memory task parallelism (to achieve scale), work stealing (to cope with irregularity), and generic algorithmic skeletons for combinatorial search (to reduce the parallelism expertise required). We outline two implementations: a mature Haskell Tree Search Library (HTSL) based around algorithmic skeletons and a prototype C++ Tree Search Library (CTSL) that uses hand coded applications. Experiments on maximum clique problems and on a problem in finite geometry, the search for spreads in H(4, 22), show that (1) CTSL consistently outperforms HTSL on sequential runs, and (2) both libraries scale to 200 cores, e.g. speeding up spreads search by a factor of 81 (HTSL) and 60 (CTSL), respectively. 
This demonstrates the potential of our generic framework for scaling parallel combinatorial search to large distributed memory platforms.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114938553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Fast Möbius (Reed-Muller) Transform and its Implementation with CUDA on GPUs","authors":"D. Bikov, I. Bouyukliev","doi":"10.1145/3115936.3115941","DOIUrl":"https://doi.org/10.1145/3115936.3115941","url":null,"abstract":"One of the most important cryptographic characteristics of the Boolean and vector Boolean functions is the algebraic degree which is connected with the Algebraic Normal Form. In this paper, we present an algorithm for computing the Algebraic Normal Form of a Boolean function using binary Fast Möbius (Reed-Muller) Transform implemented in CUDA for parallel execution on GPU. In the end, we give some experimental results.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131761075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Plain, and Somehow Sparse, Univariate Polynomial Division on Graphics Processing Units","authors":"S. A. Haque, A. Hashemi, Davood Mohajerani, M. M. Maza","doi":"10.1145/3115936.3115946","DOIUrl":"https://doi.org/10.1145/3115936.3115946","url":null,"abstract":"We present multithreaded adaptations of the Euclidean plain division and the Euclidean GCD algorithms to the many-core GPU architectures We report on implementation with NVIDIA CUDA and complexity analysis with an enhanced version of the PRAM model.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115762634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Computing Experiments in Enumerative and Algebraic Combinatorics","authors":"F. Hivert","doi":"10.1145/3115936.3115938","DOIUrl":"https://doi.org/10.1145/3115936.3115938","url":null,"abstract":"The goal of this abstract is to report on some parallel and high performance computations in combinatorics, each involving large datasets generated recursively: we start by presenting a small framework implemented in Sagemath [12] allowing performance of map/reduce like computations on such recursively defined sets. In the second part, we describe a methodology used to achieve large speedups in several enumeration problems involving similar map/reduced computations. We illustrate this methodology on the challenging problem of counting the number of numerical semigroups [5], and present briefly another problem about enumerating integer vectors upto the action of a permutation group [2]. We believe that these techniques are fairly general for those kinds of algorithms.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116686300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiler auto-vectorization of matrix multiplication modulo small primes","authors":"Matthew A. Lambert, B. D. Saunders","doi":"10.1145/3115936.3115943","DOIUrl":"https://doi.org/10.1145/3115936.3115943","url":null,"abstract":"Modern CPUs have vector instruction sets such as SSE2 and AVX2 which support the bit level operations (and, or, xor, etc. ) as well as floating point and integer arithmetic. Furthermore compilers, such as g++ and Clang, have auto-vectorization features to exploit the vector instructions. In this study we take advantage of these tools to improve performance of matrix multiplication over GF2, GF3, and other small fields. The purpose is to enhance performance of the Four Russians matrix multiplication algorithm, providing an efficient base case for multiplication of larger matrices using block decomposition as in Strassen's method. The essence of this environment is that already word level parallelism exists, since multiple field elements are stuffed into a word. The hardware vector operations further enhance the needed vector operations of addition and scaling by small powers of 2. Arithmetic modulo 2 or 3 is achieved via bit level operations. For other small fields the packing scheme is such that the vector addition and scaling operations must be followed by periodic normalization. 
We obtain approximately 2 to 3 fold speedup over these arithmetics on 64 bit words by coaxing compiler exploitation of the 256-bit SIMD instructions.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116737048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Sparse PLUQ Factorization modulo p","authors":"Charles Bouillaguet, Claire Delaplace, Marie-Emilie Voge","doi":"10.1145/3115936.3115944","DOIUrl":"https://doi.org/10.1145/3115936.3115944","url":null,"abstract":"In this paper, we present the results of our experiments to compute the rank of several large sparse matrices from Dumas's Sparse Integer Matrix Collection, by computing sparse PLUQ factorizations. Our approach consists in identifying as many pivots as possible before performing any arithmetic operation, based solely on the location of non-zero entries in the input matrix. These \"structural\" pivots are then all eliminated in parallel, in a single pass. We describe several heuristic structural pivot selection algorithms (the problem is NP-hard). These algorithms allows us to compute the ranks of several large sparse matrices in a few minutes, versus many days using Wiedemann's algorithm. Lastly, we describe a multi-thread implementation using OpenMP achieving 70% parallel efficiency on 24 cores on the largest benchmark.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114721221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}