Improving Predication Efficiency through Compaction/Restoration of SIMD Instructions

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). Pub Date: 2020-02-01. DOI: 10.1109/HPCA47549.2020.00064

Adrián Barredo, J. M. Cebrian, Miquel Moretó, Marc Casas, M. Valero


Citations: 5

Abstract

Vector processors offer a wide range of unexplored opportunities to improve performance and energy efficiency. Despite this potential, vector code generation and execution face significant challenges, the most relevant being control-flow divergence. Most modern processors with SIMD extensions (such as AVX) rely on predication to handle divergent control flow. In predicated code, performance and energy consumption are usually insensitive to the number of true values in the predicate mask. As a result, system efficiency becomes increasingly sub-optimal as vector length grows. In this paper we focus on SIMD extensions and propose a novel approach to improve the execution efficiency of predicated SIMD instructions: the Compaction/Restoration (CR) technique. CR delays predicated SIMD instructions that have inactive elements and compacts them with instances of the same instruction from different loop iterations to form an equivalent dense vector instruction in which, in the best case, all elements are active. After such dense instructions execute, their results are restored to the original instructions. Our evaluation shows that CR improves performance by up to 25% and reduces dynamic energy consumption by up to 43% on real, unmodified applications with predicated execution. Moreover, CR allows unmodified legacy code with short vector instructions (AVX-2) to execute on newer architectures with wider vectors (AVX-512), achieving performance gains of up to 56%.
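The compaction and restoration steps described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism proposed in the paper: the function names (compact, restore, run_cr), the vector length VLEN, and the choice that inactive lanes keep their previous values are all hypothetical and chosen for illustration.

```python
from typing import Callable, List, Tuple

VLEN = 8  # assumed number of lanes per vector (hypothetical)

# One predicated instruction instance: (lane values, predicate mask).
Instance = Tuple[List[int], List[bool]]
Origin = Tuple[int, int]  # (instance index, lane index)


def compact(instances: List[Instance], vlen: int = VLEN):
    """Pack the active lanes of several masked instances into dense vectors.

    Returns (dense, origins), where origins[i][j] records which instance
    and lane the element dense[i][j] was taken from.
    """
    dense: List[List[int]] = []
    origins: List[List[Origin]] = []
    cur_vals: List[int] = []
    cur_orig: List[Origin] = []
    for inst_idx, (values, mask) in enumerate(instances):
        for lane, (value, active) in enumerate(zip(values, mask)):
            if not active:
                continue  # inactive lanes are never sent to execution
            cur_vals.append(value)
            cur_orig.append((inst_idx, lane))
            if len(cur_vals) == vlen:  # a fully dense vector is ready
                dense.append(cur_vals)
                origins.append(cur_orig)
                cur_vals, cur_orig = [], []
    if cur_vals:  # leftover elements form one partially filled vector
        dense.append(cur_vals)
        origins.append(cur_orig)
    return dense, origins


def restore(instances: List[Instance],
            dense_results: List[List[int]],
            origins: List[List[Origin]]) -> List[List[int]]:
    """Scatter dense results back to their original instances and lanes.

    Assumption: inactive lanes keep their previous values.
    """
    out = [list(values) for values, _ in instances]
    for result_vec, origin_vec in zip(dense_results, origins):
        for result, (inst_idx, lane) in zip(result_vec, origin_vec):
            out[inst_idx][lane] = result
    return out


def run_cr(instances: List[Instance],
           op: Callable[[int], int],
           vlen: int = VLEN):
    """Compact, execute op on the dense vectors, then restore.

    Returns the per-instance results and the number of dense vector
    operations that were executed.
    """
    dense, origins = compact(instances, vlen)
    dense_results = [[op(v) for v in vec] for vec in dense]
    return restore(instances, dense_results, origins), len(dense)
```

In this model, three 4-lane instances with four active lanes between them are served by a single dense operation when vlen is 4, instead of three sparse ones. The results match what per-instance masked execution would produce.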


