Speculative hardware/software co-designed floating-point multiply-add fusion

Marc Lupon, E. Gibert, G. Magklis, S. Samudrala, Raúl Martínez, Kyriakos Stavrou, D. Ditzel
{"title":"Speculative hardware/software co-designed floating-point multiply-add fusion","authors":"Marc Lupon, E. Gibert, G. Magklis, S. Samudrala, Raúl Martínez, Kyriakos Stavrou, D. Ditzel","doi":"10.1145/2541940.2541978","DOIUrl":null,"url":null,"abstract":"A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing latency of dependent operations and increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation is different than independent FP multiply followed by FP add instructions, some algorithms require significant revalidation and rewriting efforts to work as expected when they are compiled to operate with FMA--a cost that developers may not be willing to pay. Because of that, abundant legacy applications are not able to utilize FMA instructions. In this paper we propose a novel HW/SW collaborative technique that is able to efficiently execute workloads with increased utilization of FMA, by adding the option to get the same numerical result as separate FP multiply and FP add pairs. In particular, we extended the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that uses a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit has been slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of mispeculation. Evaluation on a cycle-accurate timing simulator showed that CMA improved SPECfp performance by 6.3% and reduced executed instructions by 4.7%.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2541940.2541978","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing latency of dependent operations and increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation is different than independent FP multiply followed by FP add instructions, some algorithms require significant revalidation and rewriting efforts to work as expected when they are compiled to operate with FMA--a cost that developers may not be willing to pay. Because of that, abundant legacy applications are not able to utilize FMA instructions. In this paper we propose a novel HW/SW collaborative technique that is able to efficiently execute workloads with increased utilization of FMA, by adding the option to get the same numerical result as separate FP multiply and FP add pairs. In particular, we extended the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that uses a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit has been slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of mispeculation. Evaluation on a cycle-accurate timing simulator showed that CMA improved SPECfp performance by 6.3% and reduced executed instructions by 4.7%.
推测硬件/软件协同设计的浮点乘加融合
融合乘加(FMA)指令目前在许多通用处理器中可用。它通过减少相关操作的延迟来提高性能,并通过将结果计算为不可分割操作而不进行中间舍入来提高精度。然而,由于单个四舍五入FMA操作的算术行为不同于独立的FP乘法和FP加法指令,当它们被编译为使用FMA操作时,一些算法需要大量的重新验证和重写工作才能像预期的那样工作——开发人员可能不愿意支付这样的成本。因此,大量遗留应用程序无法利用FMA指令。在本文中,我们提出了一种新的硬件/软件协作技术,该技术能够通过增加FMA的利用率来有效地执行工作负载,通过添加选项来获得与单独的FP乘法和FP加法对相同的数值结果。特别是,我们扩展了硬件/软件协同设计处理器的主机ISA,使用新的组合乘法-加法(CMA)指令,该指令执行带有中间舍入的FMA操作。该指令由透明的动态翻译软件层使用,该软件层使用推测指令融合优化将FP乘法和FP加法序列转换为CMA指令。FMA单元进行了轻微修改,以支持单舍入和双舍入融合指令,而不会增加其延迟,并在错误估计的情况下提供保守的回退路径。在周期精确定时模拟器上的评估表明,CMA将SPECfp性能提高了6.3%,并减少了4.7%的执行指令。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信