Compiler auto-vectorization of matrix multiplication modulo small primes

Matthew A. Lambert, B. D. Saunders
Proceedings of the International Workshop on Parallel Symbolic Computation, published 2017-07-23
DOI: 10.1145/3115936.3115943
Modern CPUs provide vector instruction sets such as SSE2 and AVX2, which support bit-level operations (and, or, xor, etc.) as well as floating-point and integer arithmetic. Furthermore, compilers such as g++ and Clang have auto-vectorization features that exploit these vector instructions. In this study we take advantage of these tools to improve the performance of matrix multiplication over GF(2), GF(3), and other small fields. The purpose is to speed up the Four Russians matrix multiplication algorithm, providing an efficient base case for the multiplication of larger matrices using block decomposition, as in Strassen's method. The essence of this setting is that word-level parallelism already exists, since multiple field elements are packed into a single machine word. The hardware vector operations further accelerate the required vector operations of addition and scaling by small powers of 2. Arithmetic modulo 2 or 3 is achieved via bit-level operations. For other small fields, the packing scheme is such that the vector addition and scaling operations must be followed by periodic normalization. We obtain an approximately 2- to 3-fold speedup over the corresponding arithmetic on 64-bit words by coaxing the compiler into exploiting the 256-bit SIMD instructions.
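The two ingredients the abstract describes — packed word-level arithmetic over GF(2)/GF(3) and loops plain enough for the compiler to auto-vectorize — can be sketched as below. This is an illustrative sketch, not the paper's actual code: the GF(3) part uses the well-known two-bit-plane ("bitsliced") encoding with the standard 7-operation addition formula, and the paper's exact packing scheme may differ.

```cpp
#include <cstdint>
#include <cstddef>

// GF(2): 64 field elements per 64-bit word; addition mod 2 is a single XOR.
// A plain loop like this is exactly what g++/Clang auto-vectorize at -O3
// (with -mavx2 it becomes 256-bit VPXOR operations on 4 words at a time).
void gf2_row_add(uint64_t* acc, const uint64_t* row, size_t nwords) {
    for (size_t i = 0; i < nwords; ++i)
        acc[i] ^= row[i];
}

// GF(3), bitsliced: each element occupies one bit position across a
// (hi, lo) pair of words, encoded 0 -> (0,0), 1 -> (0,1), 2 -> (1,0).
// Addition mod 3 on 64 elements at once takes 7 bitwise operations.
struct GF3Word { uint64_t hi, lo; };

GF3Word gf3_add(GF3Word a, GF3Word b) {
    uint64_t t = (a.lo | b.hi) ^ (a.hi | b.lo);
    return { (a.lo | b.lo) ^ t,    // hi plane of the sum
             (a.hi | b.hi) ^ t };  // lo plane of the sum
}

// Negation in GF(3) just swaps the two planes (since -1 = 2 and -2 = 1),
// so subtraction costs only one swap on top of gf3_add.
GF3Word gf3_neg(GF3Word a) { return { a.lo, a.hi }; }
```

Because both routines are straight-line bitwise code over independent word lanes, the compiler can lift them to whatever SIMD width the target supports without any intrinsics in the source — which is the "coaxing" the abstract refers to.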