Proceedings of the International Workshop on Parallel Symbolic Computation最新文献

Meataxe64: High performance linear algebra over finite fields 有限域上的高性能线性代数

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115947

R. Parker

引用次数: 1

An Algorithm For Spliting Polynomial Systems Based On F4 基于F4的多项式系统分裂算法

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115948

M. Monagan, Roman Pearce

引用次数: 5

Fast Parallel Multi-point Evaluation of Sparse Polynomials 稀疏多项式的快速并行多点求值

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115940

M. Monagan, Alan Wong

引用次数: 3

Multithreaded programming on the GPU: pointers and hints for the computer algebraist GPU上的多线程编程:计算机代数的指针和提示

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115939

M. M. Maza

{"title":"Multithreaded programming on the GPU: pointers and hints for the computer algebraist","authors":"M. M. Maza","doi":"10.1145/3115936.3115939","DOIUrl":"https://doi.org/10.1145/3115936.3115939","url":null,"abstract":"It is well-known that the advent of hardware acceleration technologies (multicore processors, graphics processing units, field programmable gate arrays) provide vast opportunities for innovation in computing. In particular, GPUs combined with low-level heterogeneous programming models, such as CUDA (the Compute Unified Device Architecture, see [6, 7]), brought super-computing to the level of the desktop computer. However, these low-level programming models carry notable challenges, even to expert programmers. Indeed, fully exploiting the power of hardware accelerators by writing CUDA code often requires significant code optimization effort. This two-hour tutorial attempts to cover the key principles that computer algebraists interested in GPU programming should have in mind. The first half introduces the basics of GPU architecture and the CUDA programming model: no preliminary experience with GPU programming will be assumed; see [10] for a reference. In the second hour, we shall discuss the recent developments in terms of GPU architecture (e.g. dynamic parallelism [12]) and programming models (e.g. OpenMP [1, 9] and OpenACC [8, 11] as well as techniques for improving code performance (e.g MWP-CWP mode [4], TMM model [5], MCM model [3]). Illustrative examples are taken from the CUMODP library [2] for dense polynomial arithmetic over finite fields.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124463602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards Generic Scalable Parallel Combinatorial Search 通用可扩展并行组合搜索

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115942

B. Archibald, Patrick Maier, Robert J. Stewart, P. Trinder, J. Beule

{"title":"Towards Generic Scalable Parallel Combinatorial Search","authors":"B. Archibald, Patrick Maier, Robert J. Stewart, P. Trinder, J. Beule","doi":"10.1145/3115936.3115942","DOIUrl":"https://doi.org/10.1145/3115936.3115942","url":null,"abstract":"Combinatorial search problems in mathematics, e.g. in finite geometry, are notoriously hard; a state-of-the-art backtracking search algorithm can easily take months to solve a single problem. There is clearly demand for parallel combinatorial search algorithms scaling to hundreds of cores and beyond. However, backtracking combinatorial searches are challenging to parallelise due to their sensitivity to search order and due to the their irregularly shaped search trees. Moreover, scaling parallel search to hundreds of cores generally requires highly specialist parallel programming expertise. This paper proposes a generic scalable framework for solving hard combinatorial problems. Key elements are distributed memory task parallelism (to achieve scale), work stealing (to cope with irregularity), and generic algorithmic skeletons for combinatorial search (to reduce the parallelism expertise required). We outline two implementations: a mature Haskell Tree Search Library (HTSL) based around algorithmic skeletons and a prototype C++ Tree Search Library (CTSL) that uses hand coded applications. Experiments on maximum clique problems and on a problem in finite geometry, the search for spreads in H(4, 22), show that (1) CTSL consistently outperforms HTSL on sequential runs, and (2) both libraries scale to 200 cores, e.g. speeding up spreads search by a factor of 81 (HTSL) and 60 (CTSL), respectively. This demonstrates the potential of our generic framework for scaling parallel combinatorial search to large distributed memory platforms.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114938553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Parallel Fast Möbius (Reed-Muller) Transform and its Implementation with CUDA on GPUs 并行快速Möbius (Reed-Muller)变换及其在gpu上的CUDA实现

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115941

D. Bikov, I. Bouyukliev

引用次数: 5

Plain, and Somehow Sparse, Univariate Polynomial Division on Graphics Processing Units 图形处理单元上的单变量多项式除法

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115946

S. A. Haque, A. Hashemi, Davood Mohajerani, M. M. Maza

引用次数: 0

High Performance Computing Experiments in Enumerative and Algebraic Combinatorics 枚举与代数组合中的高性能计算实验

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115938

F. Hivert

引用次数: 1

Compiler auto-vectorization of matrix multiplication modulo small primes 矩阵乘模小素数的编译器自动向量化

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115943

Matthew A. Lambert, B. D. Saunders

{"title":"Compiler auto-vectorization of matrix multiplication modulo small primes","authors":"Matthew A. Lambert, B. D. Saunders","doi":"10.1145/3115936.3115943","DOIUrl":"https://doi.org/10.1145/3115936.3115943","url":null,"abstract":"Modern CPUs have vector instruction sets such as SSE2 and AVX2 which support the bit level operations (and, or, xor, etc. ) as well as floating point and integer arithmetic. Furthermore compilers, such as g++ and Clang, have auto-vectorization features to exploit the vector instructions. In this study we take advantage of these tools to improve performance of matrix multiplication over GF2, GF3, and other small fields. The purpose is to enhance performance of the Four Russians matrix multiplication algorithm, providing an efficient base case for multiplication of larger matrices using block decomposition as in Strassen's method. The essence of this environment is that already word level parallelism exists, since multiple field elements are stuffed into a word. The hardware vector operations further enhance the needed vector operations of addition and scaling by small powers of 2. Arithmetic modulo 2 or 3 is achieved via bit level operations. For other small fields the packing scheme is such that the vector addition and scaling operations must be followed by periodic normalization. We obtain approximately 2 to 3 fold speedup over these arithmetics on 64 bit words by coaxing compiler exploitation of the 256-bit SIMD instructions.","PeriodicalId":102463,"journal":{"name":"Proceedings of the International Workshop on Parallel Symbolic Computation","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116737048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Parallel Sparse PLUQ Factorization modulo p 并行稀疏PLUQ分解模

Proceedings of the International Workshop on Parallel Symbolic Computation Pub Date : 2017-07-23 DOI: 10.1145/3115936.3115944

Charles Bouillaguet, Claire Delaplace, Marie-Emilie Voge

引用次数: 4