Loner: utilizing the CPU vector datapath to process scalar integer data

Armand Behroozi, Sunghyun Park, S. Mahlke
{"title":"Loner: utilizing the CPU vector datapath to process scalar integer data","authors":"Armand Behroozi, Sunghyun Park, S. Mahlke","doi":"10.1145/3497776.3517767","DOIUrl":null,"url":null,"abstract":"Modern CPUs utilize SIMD vector instructions and hardware extensions to accelerate code with data-level parallelism. This allows for high performance gains in select application domains such as image and signal processing. However, general purpose code often lacks data-level parallelism or has complex control and data dependencies, which prevents vectorization. Thus, CPU vector registers and functional units frequently sit idle while the scalar datapath unilaterally executes code. In this paper, we present Loner, a profile-guided compiler methodology for optimizing scalar integer loops using the otherwise idle vector datapath. Loner expands the traditional definition of vectorization by identifying two situations where it is beneficial to perform vector operations with a single data element (\"Loner\" data). In the first, the scalar register file and functional units are overburdened, resulting in unnecessary spill/reload operations and stalls due to structural hazards. In the second, we describe a set of \"vector-amenable\" computation patterns that the vector pipeline naturally executes more efficiently than its scalar counterpart. Loner identifies hot code regions that exhibit either characteristic and offloads a subset of a program's computation graph to the vector datapath for maximum performance. We evaluate Loner on an x86 Whiskey Lake processor using select benchmarks from the SPEC, GAP, and MiBench benchmark suites where it improves performance by 2.64% (geomean) up to 40.28%.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3497776.3517767","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Modern CPUs utilize SIMD vector instructions and hardware extensions to accelerate code with data-level parallelism. This allows for high performance gains in select application domains such as image and signal processing. However, general purpose code often lacks data-level parallelism or has complex control and data dependencies, which prevents vectorization. Thus, CPU vector registers and functional units frequently sit idle while the scalar datapath unilaterally executes code. In this paper, we present Loner, a profile-guided compiler methodology for optimizing scalar integer loops using the otherwise idle vector datapath. Loner expands the traditional definition of vectorization by identifying two situations where it is beneficial to perform vector operations with a single data element ("Loner" data). In the first, the scalar register file and functional units are overburdened, resulting in unnecessary spill/reload operations and stalls due to structural hazards. In the second, we describe a set of "vector-amenable" computation patterns that the vector pipeline naturally executes more efficiently than its scalar counterpart. Loner identifies hot code regions that exhibit either characteristic and offloads a subset of a program's computation graph to the vector datapath for maximum performance. We evaluate Loner on an x86 Whiskey Lake processor using select benchmarks from the SPEC, GAP, and MiBench benchmark suites where it improves performance by 2.64% (geomean) up to 40.28%.
Loner:利用CPU矢量数据路径处理标量整数数据
现代cpu利用SIMD矢量指令和硬件扩展来加速具有数据级并行性的代码。这允许在选定的应用领域(如图像和信号处理)获得高性能。然而,通用代码通常缺乏数据级并行性,或者具有复杂的控制和数据依赖关系,这阻碍了向量化。因此,当标量数据路径单方面执行代码时,CPU矢量寄存器和功能单元经常处于空闲状态。在本文中,我们提出了Loner,这是一种配置文件引导的编译器方法,用于使用空闲的矢量数据路径优化标量整数循环。Loner扩展了向量化的传统定义,指出了两种情况,在这两种情况下,对单个数据元素(“Loner”数据)执行向量操作是有益的。首先,标量寄存器文件和功能单元负担过重,导致不必要的溢出/重新加载操作和由于结构危险而导致的停机。在第二部分中,我们描述了一组“适合向量”的计算模式,向量管道比标量管道自然地执行得更有效。Loner识别表现出任一特征的热点代码区域,并将程序计算图的子集卸载到矢量数据路径以获得最大性能。我们在x86 Whiskey Lake处理器上使用SPEC、GAP和MiBench基准测试套件中选择的基准测试来评估Loner,它将性能提高了2.64%(几何),最高可达40.28%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信