Compiler auto-vectorization of matrix multiplication modulo small primes

Matthew A. Lambert, B. D. Saunders
Proceedings of the International Workshop on Parallel Symbolic Computation, published 2017-07-23
DOI: 10.1145/3115936.3115943
Modern CPUs provide vector instruction sets such as SSE2 and AVX2, which support bit-level operations (and, or, xor, etc.) as well as floating-point and integer arithmetic. Furthermore, compilers such as g++ and Clang have auto-vectorization features that exploit these vector instructions. In this study we take advantage of these tools to improve the performance of matrix multiplication over GF(2), GF(3), and other small fields. The purpose is to speed up the Four Russians matrix multiplication algorithm, providing an efficient base case for the multiplication of larger matrices using block decomposition, as in Strassen's method. The essence of this setting is that word-level parallelism already exists, since multiple field elements are packed into a single machine word. The hardware vector operations further accelerate the required vector operations of addition and scaling by small powers of 2. Arithmetic modulo 2 or 3 is achieved via bit-level operations. For other small fields, the packing scheme is such that the vector addition and scaling operations must be followed by periodic normalization. We obtain an approximately 2- to 3-fold speedup over the corresponding arithmetic on 64-bit words by coaxing the compiler into exploiting the 256-bit SIMD instructions.
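The two ingredients the abstract describes — packed word-level arithmetic over GF(2)/GF(3) and loops plain enough for the compiler to auto-vectorize — can be sketched as below. This is an illustrative sketch, not the paper's actual code: the GF(3) part uses the well-known two-bit-plane ("bitsliced") encoding with the standard 7-operation addition formula, and the paper's exact packing scheme may differ.

```cpp
#include <cstdint>
#include <cstddef>

// GF(2): 64 field elements per 64-bit word; addition mod 2 is a single XOR.
// A plain loop like this is exactly what g++/Clang auto-vectorize at -O3
// (with -mavx2 it becomes 256-bit VPXOR operations on 4 words at a time).
void gf2_row_add(uint64_t* acc, const uint64_t* row, size_t nwords) {
    for (size_t i = 0; i < nwords; ++i)
        acc[i] ^= row[i];
}

// GF(3), bitsliced: each element occupies one bit position across a
// (hi, lo) pair of words, encoded 0 -> (0,0), 1 -> (0,1), 2 -> (1,0).
// Addition mod 3 on 64 elements at once takes 7 bitwise operations.
struct GF3Word { uint64_t hi, lo; };

GF3Word gf3_add(GF3Word a, GF3Word b) {
    uint64_t t = (a.lo | b.hi) ^ (a.hi | b.lo);
    return { (a.lo | b.lo) ^ t,    // hi plane of the sum
             (a.hi | b.hi) ^ t };  // lo plane of the sum
}

// Negation in GF(3) just swaps the two planes (since -1 = 2 and -2 = 1),
// so subtraction costs only one swap on top of gf3_add.
GF3Word gf3_neg(GF3Word a) { return { a.lo, a.hi }; }
```

Because both routines are straight-line bitwise code over independent word lanes, the compiler can lift them to whatever SIMD width the target supports without any intrinsics in the source — which is the "coaxing" the abstract refers to.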