延迟敏感FMA设计

2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI:10.1109/ARITH.2011.26

Sameh Galal, M. Horowitz

{"title":"延迟敏感FMA设计","authors":"Sameh Galal, M. Horowitz","doi":"10.1109/ARITH.2011.26","DOIUrl":null,"url":null,"abstract":"The implementation of merged floating-point multiply-add operations can be optimized in many ways. For latency sensitive applications, our cascade design reduces the accumulation dependent latency by 2x over a fused design, at a cost of a 13% increase in non-accumulation dependent latency. A simple in-order execution model shows this design is superior in most applications, providing 12% average reduction in FP stalls, and improves performance by up to 6%. Simulations of superscalar out-of-order machines show 4% average improvement in CPI in 2-way machines and 4.6% in 4-way machines. The cascade design has the same area and energy budget as a traditional fused multiple-add FMA.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Latency Sensitive FMA Design\",\"authors\":\"Sameh Galal, M. Horowitz\",\"doi\":\"10.1109/ARITH.2011.26\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The implementation of merged floating-point multiply-add operations can be optimized in many ways. For latency sensitive applications, our cascade design reduces the accumulation dependent latency by 2x over a fused design, at a cost of a 13% increase in non-accumulation dependent latency. A simple in-order execution model shows this design is superior in most applications, providing 12% average reduction in FP stalls, and improves performance by up to 6%. Simulations of superscalar out-of-order machines show 4% average improvement in CPI in 2-way machines and 4.6% in 4-way machines. The cascade design has the same area and energy budget as a traditional fused multiple-add FMA.\",\"PeriodicalId\":272151,\"journal\":{\"name\":\"2011 IEEE 20th Symposium on Computer Arithmetic\",\"volume\":\"19 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE 20th Symposium on Computer Arithmetic\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ARITH.2011.26\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE 20th Symposium on Computer Arithmetic","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ARITH.2011.26","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

摘要

合并浮点乘加运算的实现可以通过多种方式进行优化。对于延迟敏感型应用，我们的级联设计比融合设计减少了2倍的累积相关延迟，但代价是非累积相关延迟增加了13%。一个简单的顺序执行模型表明，这种设计在大多数应用程序中都是优越的，平均减少了12%的FP延迟，并将性能提高了6%。对超标量无序机器的模拟显示，在2路机器中CPI平均提高了4%，在4路机器中提高了4.6%。该级联设计与传统的融合多加FMA具有相同的面积和能量预算。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Latency Sensitive FMA Design

The implementation of merged floating-point multiply-add operations can be optimized in many ways. For latency sensitive applications, our cascade design reduces the accumulation dependent latency by 2x over a fused design, at a cost of a 13% increase in non-accumulation dependent latency. A simple in-order execution model shows this design is superior in most applications, providing 12% average reduction in FP stalls, and improves performance by up to 6%. Simulations of superscalar out-of-order machines show 4% average improvement in CPI in 2-way machines and 4.6% in 4-way machines. The cascade design has the same area and energy budget as a traditional fused multiple-add FMA.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2011 IEEE 20th Symposium on Computer Arithmetic

自引率

0.00%

发文量