{"title":"High performance linear algebra using interval arithmetic","authors":"Hong Diep Nguyen, N. Revol","doi":"10.1145/1837210.1837236","DOIUrl":"https://doi.org/10.1145/1837210.1837236","url":null,"abstract":"In this paper, we describe implementations of interval matrix multiplication and verified solution to a linear system, using entirely BLAS routines, which are fully optimized and parallelized.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123568711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting multicore systems with Cilk","authors":"Stephen Lewin-Berlin","doi":"10.1145/1837210.1837214","DOIUrl":"https://doi.org/10.1145/1837210.1837214","url":null,"abstract":"The increasing prevalence of multicore processors has led to a renewed interest in parallel programming. Cilk is a language extension to C and C++ designed to simplify programming shared-memory multiprocessor systems. The workstealing scheduler in Cilk is provably efficient and maintains well-defined space bounds. [1, 2] A deterministic program (that is, a race-free Cilk program that uses no lock constructs) maintains serial semantics, and such a Cilk program running on P processors will use no more than P times the stack space required by the corresponding serial program.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121471278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel computations in modular group algebras","authors":"A. Konovalov, S. Linton","doi":"10.1145/1837210.1837231","DOIUrl":"https://doi.org/10.1145/1837210.1837231","url":null,"abstract":"We report about the parallelisation of the algorithm to compute the normalised unit group <i>V</i> (F<sub><i>p</i></sub><i>G</i>) of a modular group algebra F<sub><i>p</i></sub><i>G</i> of a finite <i>p</i>-group <i>G</i> over the field of <i>p</i> elements F<sub><i>p</i></sub> in the computational algebra system GAP. We present its distributed memory implementation using the new remote procedure call framework based on the the Symbolic Computation Software Composability Protocol (SCSCP). Using it, we were able for for the first time to perform practical computations of <i>V</i> (F<sub><i>p</i></sub><i>G</i>) for groups of orders 2<sup>9</sup> and 3<sup>6</sup>.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134276014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fifteen years after DSC and WLSS2 what parallel computations I do today: invited lecture at PASCO 2010","authors":"E. Kaltofen","doi":"10.1145/1837210.1837213","DOIUrl":"https://doi.org/10.1145/1837210.1837213","url":null,"abstract":"A second wave of parallel and distributed computing research is rolling in. Today's multicore/multiprocessor computers facilitate everyone's parallel execution. In the mid 1990s, manufactures of expensive main-frame parallel computers faltered and computer science focused on the Internet and the computing grid. After a ten year hiatus, the Parallel Symbolic Computation Conference (PASCO) is awakening with new vigor.\u0000 I shall look back on the highlights of my own research on theoretical and practical aspects of parallel and distributed symbolic computation, and forward to what is to come by example of several current projects. An important technique in symbolic computation is the evaluation/interpolation paradigm, and multivariate sparse polynomial parallel interpolation constitutes a keystone operation, for which we present a new algorithm. Several embarrassingly parallel searches for special polynomials and exact sum-of-squares certificates have exposed issues in even today's multiprocessor architectures. Solutions are in both software and hardware. 
Finally, we propose the paradigm of interactive symbolic supercomputing, a symbolic computation environment analog of the STAR-P Matlab platform.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122256284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spiral-generated modular FFT algorithms","authors":"Lingchuan Meng, Y. Voronenko, Jeremy R. Johnson, M. M. Maza, F. Franchetti, Yuzhen Xie","doi":"10.1145/1837210.1837235","DOIUrl":"https://doi.org/10.1145/1837210.1837235","url":null,"abstract":"This paper presents an extension of the Spiral system to automatically generate and optimize FFT algorithms for the discrete Fourier transform over finite fields. The generated code is intended to support modular algorithms for multivariate polynomial computations in the modpn library used by Maple. The resulting code provides an order of magnitude speedup over the original implementations in the modpn library, and the Spiral system provides the ability to automatically tune the FFT code to different computing platforms.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128917253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel sparse polynomial division using heaps","authors":"M. Monagan, Roman Pearce","doi":"10.1145/1837210.1837227","DOIUrl":"https://doi.org/10.1145/1837210.1837227","url":null,"abstract":"We present a parallel algorithm for exact division of sparse distributed polynomials on a multicore processor. This is a problem with significant data dependencies, so our solution requires fine-grained parallelism. Our algorithm manages to avoid waiting for each term of the quotient to be computed, and it achieves superlinear speedup over the fastest known sequential method. We present benchmarks comparing the performance of our C implementation of sparse polynomial division to the routines of other computer algebra systems.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121436502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel disk-based computation for large, monolithic binary decision diagrams","authors":"D. Kunkle, Vlad Slavici, G. Cooperman","doi":"10.1145/1837210.1837222","DOIUrl":"https://doi.org/10.1145/1837210.1837222","url":null,"abstract":"Binary Decision Diagrams (BDDs) are widely used in formal verification. They are also widely known for consuming large amounts of memory. For larger problems, a BDD computation will often start thrashing due to lack of memory within minutes. This work uses the parallel disks of a cluster or a SAN (storage area network) as an extension of RAM, in order to efficiently compute with BDDs that are orders of magnitude larger than what is available on a typical computer. The use of parallel disks overcomes the bandwidth problem of single disk methods, since the bandwidth of 50 disks is similar to the bandwidth of a single RAM sub-system. In order to overcome the latency issues of disk, the Roomy library is used for the sake of its latency-tolerant data structures. A breadth-first algorithm is implemented. A further advantage of the algorithm is that RAM usage can be very modest, since its largest use is as buffers for open files. The success of the method is demonstrated by solving the 16-queens problem, and by solving a more unusual problem --- counting the number of tie games in a three-dimensional 4x4x4 tic-tac-toe board.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121302393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Techniques and tools for implementing IEEE 754 floating-point arithmetic on VLIW integer processors","authors":"C. Jeannerod, C. Mouilleron, J. Muller, G. Revy, C. Bertin, Jingyan Jourdan-Lu, Herve Knochel, Christophe Monat","doi":"10.1145/1837210.1837212","DOIUrl":"https://doi.org/10.1145/1837210.1837212","url":null,"abstract":"Recently, some high-performance IEEE 754 single precision floating-point software has been designed, which aims at best exploiting some features (integer arithmetic, parallelism) of the STMicroelectronics ST200 Very Long Instruction Word (VLIW) processor. We review here the techniques and software tools used or developed for this design and its implementation, and how they allowed very high instruction-level parallelism (ILP) exposure. Those key points include a hierarchical description of function evaluation algorithms, the exploitation of the standard encoding of floating-point data, the automatic generation of fast and accurate polynomial evaluation schemes, and some compiler optimizations.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114468812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated performance tuning","authors":"Jeremy R. Johnson","doi":"10.1145/1837210.1837215","DOIUrl":"https://doi.org/10.1145/1837210.1837215","url":null,"abstract":"This tutorial presents automated techniques for implementing and optimizing numeric and symbolic libraries on modern computing platforms including SSE, multicore, and GPU. Obtaining high performance requires effective use of the memory hierarchy, short vector instructions, and multiple cores. Highly tuned implementations are difficult to obtain and are platform dependent. For example, Intel Core i7 980 XE has a peak floating point performance of over 100 GFLOPS and the NVIDIA Tesla C870 has a peak floating point performance of over 500 GFLOPS, however, achieving close to peak performance on such platforms is extremely difficult. Consequently, automated techniques are now being used to tune and adapt high performance libraries such as ATLAS (math-atlas.sourceforge.net), PLASMA (icl.cs.utk.edu/plasma) and MAGMA (icl.cs.utk.edu/magma) for dense linear algebra, OSKI (bebop.cs.berkeley.edu/oski) for sparse linear algebra, FFTW (www.fftw.org) for the fast Fourier transform (FFT), and SPIRAL (www.spiral.net) for wide class of digital signal processing (DSP) algorithms. Intel currently uses SPIRAL to generate parts of their MKL and IPP libraries.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127128011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A quantitative study of reductions in algebraic libraries","authors":"Yue Li, G. D. Reis","doi":"10.1145/1837210.1837226","DOIUrl":"https://doi.org/10.1145/1837210.1837226","url":null,"abstract":"How much of existing computer algebra libraries is amenable to automatic parallelization? This is a difficult topic, yet of practical importance in the era of commodity multicore machines. This paper reports on a quantitative study of reductions in the AXIOM-family computer algebra systems. The experiment builds on the introduction of assumptions in OpenAxiom. It identifies a variety of reductions that are candidate for implicit concurrent execution. An assumption is an axiomatic statement of an algebraic property. We hope that this study will encourage wider adoption of axioms, not just for the purpose of expression simplification and provably correct libraries, but also to enable derivation of implicit concurrency in a scalable fashion.","PeriodicalId":123389,"journal":{"name":"International Workshop on Parallel Symbolic Computation","volume":"38 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131433505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}