Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512

Raehyun Kim, Jaeyoung Choi, Myungho Lee
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2019-01-14
DOI: 10.1145/3293320.3293334
This paper presents optimized implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi processor code-named Knights Landing (KNL) and the Intel Xeon Scalable processors, based on an auto-tuning approach using Intel AVX-512 intrinsic functions. The auto-tuning approach precisely determines parameters that reflect the target architecture's features. It significantly reduces the search space and derives optimal parameter sets, including submatrix sizes, prefetch distances, loop-unrolling depths, and the parallelization scheme. Without a single line of assembly code, the resulting GEMM kernels perform comparably to Intel MKL and outperform other open-source BLAS libraries.
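To make the auto-tuning idea concrete, the sketch below shows a simplified version of the general scheme: a blocked GEMM whose tile sizes are tunable parameters, plus a tiny tuner that times each candidate tiling and keeps the fastest. This is a hedged, portable Python illustration, not the paper's implementation: the actual kernels use C with AVX-512 intrinsics and also tune prefetch distances, unrolling depths, and the parallelization scheme, and the function names (`blocked_gemm`, `auto_tune`) and candidate sets here are invented for the example.

```python
import itertools
import time

def blocked_gemm(A, B, C, n, mc, kc):
    """Compute C += A @ B for n x n lists-of-lists, blocking the
    i-loop by mc and the k-loop by kc (the tunable parameters)."""
    for i0 in range(0, n, mc):
        for k0 in range(0, n, kc):
            for i in range(i0, min(i0 + mc, n)):
                row_c = C[i]
                for k in range(k0, min(k0 + kc, n)):
                    a = A[i][k]
                    row_b = B[k]
                    # Innermost loop: the part a real kernel would
                    # vectorize with AVX-512 FMA intrinsics.
                    for j in range(n):
                        row_c[j] += a * row_b[j]
    return C

def auto_tune(n=64, candidates=(8, 16, 32)):
    """Time every (mc, kc) pair from a small candidate set and return
    (elapsed, mc, kc, C) for the fastest one. A real tuner restricts
    candidates using cache and register-file sizes to shrink the
    search space, as the paper describes."""
    A = [[(i + j) % 7 for j in range(n)] for i in range(n)]
    B = [[(i * j) % 5 for j in range(n)] for i in range(n)]
    best = None
    for mc, kc in itertools.product(candidates, repeat=2):
        C = [[0.0] * n for _ in range(n)]
        t0 = time.perf_counter()
        blocked_gemm(A, B, C, n, mc, kc)
        dt = time.perf_counter() - t0
        if best is None or dt < best[0]:
            best = (dt, mc, kc, C)
    return best
```

The point of the sketch is the structure of the search, not the speed: each architectural feature (cache level, register file, core count) bounds one parameter, so the tuner only has to explore a small cross-product rather than an open-ended space.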