Speculative hardware/software co-designed floating-point multiply-add fusion

Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems. Pub Date: 2014-02-24. DOI: 10.1145/2541940.2541978

Marc Lupon, E. Gibert, G. Magklis, S. Samudrala, Raúl Martínez, Kyriakos Stavrou, D. Ditzel


Citations: 10

Abstract

A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing the latency of dependent operations and increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation differs from that of an independent FP multiply followed by an FP add, some algorithms require significant revalidation and rewriting effort to work as expected when they are compiled to use FMA, a cost that developers may not be willing to pay. As a result, many legacy applications cannot use FMA instructions. In this paper we propose a novel HW/SW collaborative technique that efficiently executes workloads with increased utilization of FMA by adding the option to obtain the same numerical result as separate FP multiply and FP add pairs. In particular, we extended the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that applies a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit has been slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator showed that CMA improved SPECfp performance by 6.3% and reduced executed instructions by 4.7%.
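The numerical gap between a single-rounding FMA and a multiply followed by an add can be seen in a small sketch. The function names below are hypothetical and are not taken from the paper. `fused_single_rounding` emulates FMA semantics in software using exact rational arithmetic, and `combined_double_rounding` models the result a CMA-style instruction is meant to reproduce, namely the same value as a separate FP multiply and FP add. The sketch assumes finite double-precision inputs and the default round-to-nearest-even mode. It says nothing about how the hardware unit is implemented.

```python
from fractions import Fraction


def fused_single_rounding(a: float, b: float, c: float) -> float:
    """Emulate FMA semantics: compute a*b + c exactly, then round once.

    Assumes finite inputs. Fraction arithmetic is exact, and converting
    a Fraction to float rounds to the nearest double.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def combined_double_rounding(a: float, b: float, c: float) -> float:
    """Model the multiply-then-add result: round the product, then round the sum."""
    product = a * b      # first rounding (intermediate rounding)
    return product + c   # second rounding


# Inputs chosen so that the exact product 1 - 2**-54 is not representable
# as a double and rounds to 1.0 in the separate multiply.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print(fused_single_rounding(a, b, c))      # -2**-54, the exact result
print(combined_double_rounding(a, b, c))   # 0.0, the low-order bits were lost
```

With these inputs the fused form keeps the residual `-2**-54` while the two-step form returns zero. Code that was validated against the two-step result, for example an exact-zero test, could behave differently if it were recompiled to use FMA, which is the compatibility concern the abstract describes.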
