2013 IEEE 21st Symposium on Computer Arithmetic最新文献

筛选
英文 中文
Parallel Modular Multiplication on Multi-core Processors 多核处理器上的并行模块化乘法
2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.20
Pascal Giorgi, L. Imbert, T. Izard
{"title":"Parallel Modular Multiplication on Multi-core Processors","authors":"Pascal Giorgi, L. Imbert, T. Izard","doi":"10.1109/ARITH.2013.20","DOIUrl":"https://doi.org/10.1109/ARITH.2013.20","url":null,"abstract":"Current processors typically embed many cores running at high speed. The main goal of this paper is to assess the efficiency of software parallelism for low level arithmetic operations by providing a thorough comparison of several parallel modular multiplications. Famous methods such as Barrett, Montgomery as well as more recent algorithms are compared together with a novel k-ary multipartite multiplication which allows to split the computations into independent processes. Our experiments show that this new algorithm is well suited to software parallelism.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128067369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 20
Floating Point Architecture Extensions for Optimized Matrix Factorization 优化矩阵分解的浮点架构扩展
2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.21
A. Pedram, A. Gerstlauer, R. V. D. Geijn
{"title":"Floating Point Architecture Extensions for Optimized Matrix Factorization","authors":"A. Pedram, A. Gerstlauer, R. V. D. Geijn","doi":"10.1109/ARITH.2013.21","DOIUrl":"https://doi.org/10.1109/ARITH.2013.21","url":null,"abstract":"This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems to a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations. As part of the study, we expose the benefits of redesigning floating point units and their surrounding data-paths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design trade-offs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study shows that our extensions to the MAC units can double the speed of required vector-norm operations while reducing energy by 60%. Similarly, up to 20% speedup with 15% savings in energy can be achieved for LU factorization. We show how such efficiency is maintained even in the complex inner kernels of these operations.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124436405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Split-Path Fused Floating Point Multiply Accumulate (FPMAC) 分路融合浮点乘法累加(FPMAC)
2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.32
S. Srinivasan, Ketan Bhudiya, R. Ramanarayanan, P. Babu, Tiju Jacob, S. Mathew, R. Krishnamurthy, V. Erraguntla
{"title":"Split-Path Fused Floating Point Multiply Accumulate (FPMAC)","authors":"S. Srinivasan, Ketan Bhudiya, R. Ramanarayanan, P. Babu, Tiju Jacob, S. Mathew, R. Krishnamurthy, V. Erraguntla","doi":"10.1109/ARITH.2013.32","DOIUrl":"https://doi.org/10.1109/ARITH.2013.32","url":null,"abstract":"Floating point multiply-accumulate (FPMAC) unitis the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. FPMAC unit is used extensively in contemporary client microprocessors, further proliferated with ISA support for instructions like AVX and SSE and also extensively used in server processors employed for engineering and scientific applications. Consequently design of FPMAC is of vital consideration since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design which focuses on optimal computations in the critical path and therefore making it the fastest FPMAC design as of today in literature. The design is based on the premise of isolating and optimizing the critical path computation in FPMAC operation. In this work we have three key innovations to create a novel double precision FPMAC with least ever gate stages in the timing critical path: a) Splitting near and far paths based on the exponent difference (d=Exy-Ez = {-2, -1, 0, 1} is near path and the rest is far path), b) Early injection of the accumulate add for near path into the Wallace tree for eliminating a 3:2compressor from near path critical logic, exploiting the small alignment shifts in near path and sparse Wallace tree for 53 bit mantissa multiplication, c) Combined round and accumulate add for eliminating the completion adder from multiplier giving both timing and power benefits. Our design by premise of splitting consumes lesser power for each operation where only the required logic for each case is switching. Splitting the paths also provides tremendous opportunities for clock or power gating the unused portion (nearly 15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate the support for all rounding modes to adhere to IEEE standard for double precision FPMAC which is critical for employment of this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation of IBM Power6 [6] by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared to best known timing design from Lang et al. [5] and outperforms it by 7% while being 30% smaller in area than it.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121589984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
Multiple-Precision Evaluation of the Airy Ai Function with Reduced Cancellation 减少消去的Airy函数的多精度评价
2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2012-12-19 DOI: 10.1109/ARITH.2013.33
S. Chevillard, M. Mezzarobba
{"title":"Multiple-Precision Evaluation of the Airy Ai Function with Reduced Cancellation","authors":"S. Chevillard, M. Mezzarobba","doi":"10.1109/ARITH.2013.33","DOIUrl":"https://doi.org/10.1109/ARITH.2013.33","url":null,"abstract":"The series expansion at the origin of the Airy function Ai(x) is alternating and hence problematic to evaluate for x > 0 due to cancellation. Based on a method recently proposed by Gawronski, Müller, and Rein hard, we exhibit two functions F and G, both with nonnegative Taylor expansions at the origin, such that Ai(x) = G(x)/F(x). The sums are now well-conditioned, but the Taylor coefficients of G turn out to obey an ill-conditioned three-term recurrence. We use the classical Miller algorithm to overcome this issue. We bound all errors and our implementation allows an arbitrary and certified accuracy, that can be used, e.g., for providing correct rounding in arbitrary precision.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123618775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信