{"title":"Cache oblivious sparse polynomial factoring using the funnel heap","authors":"Fatima K. Abu Salem, Khalil El-Harake, Karl Gemayel","doi":"10.1145/2790282.2790283","DOIUrl":"https://doi.org/10.1145/2790282.2790283","url":null,"abstract":"In [2] we demonstrated that overlapping sums of products arising in the Hensel lifting phase of the polytope factoring method using a Max priority queue reduces expression swell and achieves asymptotic reductions in the Hensel lifting phase. In this paper, we propose to implement the priority queue as a Funnel Heap, when polynomials are in sparse distributed representation. Funnel Heap is a cache oblivious priority queue with optimal cache complexity, and we additionally tailor several of its features to the polynomial arithmetic required. Funnel Heap is able to identify equal order monomials \"for free\" whilst it re-organises itself over sufficiently many updates. We adopt a batched mode for chaining equal order monomials that gets overlapped with Funnel Heap's mechanism for emptying its in-core components. We also develop a customised analysis of performance that captures the overhead due to chaining in terms of the fraction of reduction and replication observed in the queue, and get that batched chaining is sensitive to the number of distinct monomials residing in the queue, as opposed to the number of replicas chained. For sufficiently large input size with respect to the cache-line length, batched chaining that is \"search free\" leads to an implementation of Hensel lifting that exhibits optimal cache complexity in the number of replicas found in the queue. Additionally, we obtain an order of magnitude reduction in space, as well as a reduction in the logarithmic factor in work and cache complexity, when comparing our adaptation against [2]. Also, the resulting Hensel lifting process is cache-oblivious. Our benchmarks of the polytope method using Funnel Heap with chaining demonstrate dramatic improvements over the regular binary heap as well as MAGMA, where the latter fails to process sufficiently high degree but sparse polynomial factorisations.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131599280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A compact parallel implementation of F4","authors":"M. Monagan, Roman Pearce","doi":"10.1145/2790282.2790293","DOIUrl":"https://doi.org/10.1145/2790282.2790293","url":null,"abstract":"We present a compact and parallel C implementation of the F4 algorithm for computing Gröbner bases which uses Cilk. We give an easy way to parallelize the sparse linear algebra which is the main cost in practice. To obtain more speedup we attempted to parallelize the generation of sparse matrices as well. We present timings to assess the effectiveness of our approach and to compare our implementation to others.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125976994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Direct solution of the (11,9,8)-MinRank problem by the block Wiedemann algorithm in magma with a tesla GPU","authors":"A. Steel","doi":"10.1145/2790282.2791392","DOIUrl":"https://doi.org/10.1145/2790282.2791392","url":null,"abstract":"We show how some very large multivariate polynomial systems over finite fields can be solved by Gröbner basis techniques coupled with the Block Wiedemann algorithm, thus extending the Wiedemann-based 'Sparse FGLM' approach of Faugère and Mou. The main components of our approach are a dense variant of the Faugère F4 Gröbner basis algorithm and the Block Wiedemann algorithm, which have been implemented within the Magma Computer Algebra System (released in version V2.20 in late 2014). A major feature of the algorithms is that they map much of the computation to dense matrix multiplication, and this allows dramatic speedups to be achieved for large examples when an Nvidia Tesla GPU is available. As a result, the Magma implementation can directly solve a 16-bit random instance of the Courtois (11,9,8)-MinRank Challenge C in about 15.1 hours with a single Intel Sandybridge CPU core coupled with an Nvidia Tesla K40 GPU.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122614958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel algebraic linear algebra dedicated interface","authors":"T. Gautier, Jean-Louis Roch, Ziad Sultan, Bastien Vialla","doi":"10.1145/2790282.2790286","DOIUrl":"https://doi.org/10.1145/2790282.2790286","url":null,"abstract":"This work deals with parallelism in linear algebra routines. We propose a domain specific language based on C/C++ macros, PALADIn (Parallel Algebraic Linear Algebra Dedicated Interface). This domain specific language allows the user to write C++ code and benefit from sequential and parallel executions on shared memory architectures. With a unique syntax, the user can switch between different parallel runtime systems such as OpenMP, TBB and xKaapi. This interface provides data and task parallelism. Depending on the runtime system, task parallelism can use explicit synchronizations or data-dependency based synchronizations. Also, this language provides different matrix cutting strategies according to one or two dimensions. Moreover, block algorithms, such as block iterative and recursive matrix multiplication, can involve splitting according to three dimensions. The latter is also a feature that is provided to the user. The PALADIn interface can be used in any C++ library for linear algebra computation and gets the best performance from the three supported parallel runtime systems.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131929414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hybrid symbolic-numeric approach to exceptional sets of generically zero-dimensional systems","authors":"J. Hauenstein, Alan C. Liddell","doi":"10.1145/2790282.2790288","DOIUrl":"https://doi.org/10.1145/2790282.2790288","url":null,"abstract":"Exceptional sets are the sets where the dimension of the fiber of a map is larger than the generic fiber dimension, which we assume is zero. Such situations naturally arise in kinematics, for example, when designing a mechanism that moves when the generic case is rigid. In 2008, Sommese and Wampler showed that one can use fiber products to promote such sets to become irreducible components. We propose an alternative approach using rank constraints on Macaulay matrices. Symbolic computations are used to construct the proper Macaulay matrices, while numerical computations are used to solve the rank-constraint problem. Various exceptional sets are computed, including exceptional RR dyads, lines on surfaces in C^3, and exceptional planar pentads.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115497915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A parallel implementation for polynomial multiplication modulo a prime","authors":"M. Law, M. Monagan","doi":"10.1145/2790282.2790291","DOIUrl":"https://doi.org/10.1145/2790282.2790291","url":null,"abstract":"We present a parallel implementation in Cilk C of a modular algorithm for multiplying two polynomials in Zq[x] for integer q > 1, for multi-core computers. Our algorithm uses Chinese remaindering. It multiplies modulo primes p1, p2, ... in parallel and uses a parallel FFT for each prime. Our software multiplies two polynomials of degree 10^9 modulo a 32 bit integer q in 83 seconds on a 20 core computer.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126501816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel sparse multivariate polynomial division","authors":"M. Gastineau, J. Laskar","doi":"10.1145/2790282.2790285","DOIUrl":"https://doi.org/10.1145/2790282.2790285","url":null,"abstract":"We present a scalable algorithm for dividing two sparse multivariate polynomials represented in a distributed format on shared memory multicore computers. The scalability on the large number of cores is ensured by the lack of synchronizations during the main parallel step. The merge and sorting operations are based on binary heap or tree data structures.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"50 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122423310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High performance implementation of the inverse TFT","authors":"Lingchuan Meng, Jeremy R. Johnson","doi":"10.1145/2790282.2790292","DOIUrl":"https://doi.org/10.1145/2790282.2790292","url":null,"abstract":"The inverse truncated Fourier transform (ITFT) is a key component in the fast polynomial and large integer algorithms introduced by van der Hoeven. This paper reports a high performance implementation of the ITFT which poses additional challenges compared to that of the forward transform. A general-radix variant of the ITFT algorithm is developed to allow the implementation to automatically adapt to the memory hierarchy. Then a parallel ITFT algorithm is developed that trades off small arithmetic cost for full vectorization and improved multi-threaded parallelism. The algorithms are automatically generated and tuned to produce an arbitrary-size ITFT library. The new algorithms and the implementation smooths out the staircase performance associated with power-of-two modular FFT implementations, and provide significant performance improvement over zero-padding approaches even when high-performance FFT libraries are used.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134226862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing and parallelizing the modular GCD algorithm","authors":"Matthew Gibson, M. Monagan","doi":"10.1145/2790282.2790287","DOIUrl":"https://doi.org/10.1145/2790282.2790287","url":null,"abstract":"Our goal is to design and implement a high performance modular GCD algorithm for polynomial GCD computation in Zp[x1, x2, ..., xn] for multi-core computers which will be used to compute the GCD of polynomials over Z. For n = 2 we have designed and implemented in C a highly optimized serial code for primes p < 2^63. For n > 2 we parallelized in Cilk C Brown's dense modular GCD algorithm using our serial bivariate code at the base. For n = 3, we obtain good parallel speedup on multi-core computers with 16 and 20 cores. We also compare our code with the GCD codes in Maple and Magma.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132155739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-acceleration of optimal permutation-puzzle solving","authors":"Hayakawa Hiroki, Ishida Naoaki, M. Hirokazu","doi":"10.1145/2790282.2790289","DOIUrl":"https://doi.org/10.1145/2790282.2790289","url":null,"abstract":"We first investigate parallelization of Rubik's cube optimal solver, especially for acceleration by GPU. To examine its efficacy, we implement a simple solver based on Korf's algorithm, with which CPU and GPU collaborate in IDA* algorithm and a large number of GPU cores are utilized for speedup instead of a huge distance table used for pruning. Empirical studies succeeded to attain sufficient speedup by GPU-acceleration. There are many other similar puzzles of so-called permutation puzzles. The puzzle solving, i.e., restoring the original ordered state from a scrambled one is equivalent to the path-finding in the Cayley graph of the permutation group. We generalize the method used for Rubik's cube to much smaller problems, and examine its efficacy. The focus of our research interest is how efficient the parallel path-finding can be and whether the use of a large number of cores substitutes for a large distance table used for pruning, in general.","PeriodicalId":384227,"journal":{"name":"Proceedings of the 2015 International Workshop on Parallel Symbolic Computation","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128532329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}