ARM NEON的控制流矢量化

Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems Pub Date : 2018-05-28 DOI:10.1145/3207719.3207721

Angela Pohl, Biagio Cosenza, B. Juurlink

{"title":"ARM NEON的控制流矢量化","authors":"Angela Pohl, Biagio Cosenza, B. Juurlink","doi":"10.1145/3207719.3207721","DOIUrl":null,"url":null,"abstract":"Single Instruction Multiple Data (SIMD) extensions in processors enable in-core parallelism for operations on vectors of data. From the compiler perspective, SIMD instructions require automatic techniques to determine how and when it is possible to express computations in terms of vector operations. When this is not possible automatically, a user may still write code in a manner that allows the compiler to deduce that vectorization is possible, or by explicitly define how to vectorize by using intrinsics. This work analyzes the challenge of generating efficient vector instructions by benchmarking 151 loop patterns with three compilers on two SIMD instruction sets. Comparing the vectorization rates for the AVX2 and NEON instruction sets, we observed that the presence of control flow poses a major problem for the vectorization on NEON. We consequently propose a set of solutions to generate efficient vector instructions in the presence of control flow. In particular, we show how to overcome the lack of masked load and store instruction with different code generation strategies. Results show that we enable vectorization of conditional read operations with a minimal overhead, while our technique of atomic select stores achieves a speedup of more than 2x over state of the art for large vectorization factors.","PeriodicalId":284835,"journal":{"name":"Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Control Flow Vectorization for ARM NEON\",\"authors\":\"Angela Pohl, Biagio Cosenza, B. Juurlink\",\"doi\":\"10.1145/3207719.3207721\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Single Instruction Multiple Data (SIMD) extensions in processors enable in-core parallelism for operations on vectors of data. From the compiler perspective, SIMD instructions require automatic techniques to determine how and when it is possible to express computations in terms of vector operations. When this is not possible automatically, a user may still write code in a manner that allows the compiler to deduce that vectorization is possible, or by explicitly define how to vectorize by using intrinsics. This work analyzes the challenge of generating efficient vector instructions by benchmarking 151 loop patterns with three compilers on two SIMD instruction sets. Comparing the vectorization rates for the AVX2 and NEON instruction sets, we observed that the presence of control flow poses a major problem for the vectorization on NEON. We consequently propose a set of solutions to generate efficient vector instructions in the presence of control flow. In particular, we show how to overcome the lack of masked load and store instruction with different code generation strategies. Results show that we enable vectorization of conditional read operations with a minimal overhead, while our technique of atomic select stores achieves a speedup of more than 2x over state of the art for large vectorization factors.\",\"PeriodicalId\":284835,\"journal\":{\"name\":\"Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems\",\"volume\":\"32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-05-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3207719.3207721\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3207719.3207721","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 8

摘要

处理器中的单指令多数据(SIMD)扩展实现了对数据向量的操作的核内并行性。从编译器的角度来看，SIMD指令需要自动技术来确定如何以及何时可以用向量操作来表示计算。当不能自动实现这一点时，用户仍然可以以一种允许编译器推断向量化是可能的方式编写代码，或者通过使用intrinsic显式定义如何向量化。本文通过在两个SIMD指令集上使用三个编译器对151个循环模式进行基准测试，分析了生成高效矢量指令的挑战。通过比较AVX2和NEON指令集的矢量化率，我们发现控制流的存在是NEON指令集矢量化的主要问题。因此，我们提出了一组在控制流存在的情况下生成有效向量指令的解决方案。特别是，我们展示了如何使用不同的代码生成策略来克服屏蔽加载和存储指令的缺乏。结果表明，我们以最小的开销实现了条件读取操作的矢量化，而我们的原子选择存储技术在大矢量化因素下实现了比现有技术状态的2倍以上的加速。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Control Flow Vectorization for ARM NEON

Single Instruction Multiple Data (SIMD) extensions in processors enable in-core parallelism for operations on vectors of data. From the compiler perspective, SIMD instructions require automatic techniques to determine how and when it is possible to express computations in terms of vector operations. When this is not possible automatically, a user may still write code in a manner that allows the compiler to deduce that vectorization is possible, or by explicitly define how to vectorize by using intrinsics. This work analyzes the challenge of generating efficient vector instructions by benchmarking 151 loop patterns with three compilers on two SIMD instruction sets. Comparing the vectorization rates for the AVX2 and NEON instruction sets, we observed that the presence of control flow poses a major problem for the vectorization on NEON. We consequently propose a set of solutions to generate efficient vector instructions in the presence of control flow. In particular, we show how to overcome the lack of masked load and store instruction with different code generation strategies. Results show that we enable vectorization of conditional read operations with a minimal overhead, while our technique of atomic select stores achieves a speedup of more than 2x over state of the art for large vectorization factors.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems

自引率

0.00%

发文量