Loner: utilizing the CPU vector datapath to process scalar integer data

Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction. Pub Date: 2022-03-18. DOI: 10.1145/3497776.3517767

Armand Behroozi, Sunghyun Park, S. Mahlke



Abstract

Modern CPUs use SIMD vector instructions and hardware extensions to accelerate code with data-level parallelism. This yields large performance gains in select application domains such as image and signal processing. However, general-purpose code often lacks data-level parallelism or has complex control and data dependencies, which prevent vectorization. Thus, CPU vector registers and functional units frequently sit idle while the scalar datapath executes the code alone. In this paper, we present Loner, a profile-guided compiler methodology for optimizing scalar integer loops using the otherwise idle vector datapath. Loner expands the traditional definition of vectorization by identifying two situations where it is beneficial to perform vector operations on a single data element ("Loner" data). In the first, the scalar register file and functional units are overburdened, resulting in unnecessary spill/reload operations and stalls due to structural hazards. In the second, the code matches one of a set of "vector-amenable" computation patterns that the vector pipeline naturally executes more efficiently than its scalar counterpart. Loner identifies hot code regions that exhibit either characteristic and offloads a subset of the program's computation graph to the vector datapath for maximum performance. We evaluate Loner on an x86 Whiskey Lake processor using select benchmarks from the SPEC, GAP, and MiBench benchmark suites, where it improves performance by 2.64% (geomean) and by up to 40.28%.
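To make the idea of operating on a single data element in a vector register concrete, the following is a minimal hand-written sketch, not the transformation Loner actually performs. It assumes an x86 target with SSE2; the function names `mix_scalar` and `mix_loner` and the particular add/shift/xor/subtract chain are hypothetical and chosen only for illustration. The second function keeps the one live value in lane 0 of an XMM register, so the arithmetic is issued to the vector functional units and occupies no general-purpose registers in between.

```c
#include <emmintrin.h> /* SSE2 intrinsics (assumed available on the target) */
#include <stdint.h>

/* Scalar baseline: a short dependent chain of 32-bit integer operations.
   Unsigned arithmetic is used so that wrap-around is well defined. */
uint32_t mix_scalar(uint32_t x, uint32_t k) {
    uint32_t t = x + k;
    t ^= t << 3;
    return t - k;
}

/* Hypothetical "Loner"-style variant: the same chain, but the single value
   lives in lane 0 of a 128-bit register. The other three lanes hold zeros
   and their results are simply ignored. */
uint32_t mix_loner(uint32_t x, uint32_t k) {
    __m128i vx = _mm_cvtsi32_si128((int)x); /* move x into lane 0 */
    __m128i vk = _mm_cvtsi32_si128((int)k); /* move k into lane 0 */
    __m128i t = _mm_add_epi32(vx, vk);          /* t = x + k      */
    t = _mm_xor_si128(t, _mm_slli_epi32(t, 3)); /* t ^= t << 3    */
    t = _mm_sub_epi32(t, vk);                   /* t = t - k      */
    return (uint32_t)_mm_cvtsi128_si32(t);  /* read lane 0 back */
}
```

Both functions return the same value for every input. Whether the vector version is faster depends on register pressure, port contention, and the cost of moving data between the two register files, which is why the paper relies on profiling to pick the regions to offload.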



