{"title":"深度学习稀疏矩阵核在Intel Max系列GPU上的性能优化","authors":"Mohammad Zubair, Christoph Bauinger","doi":"arxiv-2311.00368","DOIUrl":null,"url":null,"abstract":"In this paper, we focus on three sparse matrix operations that are relevant\nfor machine learning applications, namely, the sparse-dense matrix\nmultiplication (SPMM), the sampled dense-dense matrix multiplication (SDDMM),\nand the composition of the SDDMM with SPMM, also termed as FusedMM. We develop\noptimized implementations for SPMM, SDDMM, and FusedMM operations utilizing\nIntel oneAPI's Explicit SIMD (ESIMD) SYCL extension API. In contrast to CUDA or\nSYCL, the ESIMD API enables the writing of explicitly vectorized kernel code.\nSparse matrix algorithms implemented with the ESIMD API achieved performance\nclose to the peak of the targeted Intel Data Center GPU. We compare our\nperformance results to Intel's oneMKL library on Intel GPUs and to a recent\nCUDA implementation for the sparse matrix operations on NVIDIA's V100 GPU and\ndemonstrate that our implementations for sparse matrix operations outperform\neither.","PeriodicalId":501256,"journal":{"name":"arXiv - CS - Mathematical Software","volume":"12 4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU\",\"authors\":\"Mohammad Zubair, Christoph Bauinger\",\"doi\":\"arxiv-2311.00368\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we focus on three sparse matrix operations that are relevant\\nfor machine learning applications, namely, the sparse-dense matrix\\nmultiplication (SPMM), the sampled dense-dense matrix multiplication (SDDMM),\\nand the composition of the SDDMM with SPMM, also termed as FusedMM. We develop\\noptimized implementations for SPMM, SDDMM, and FusedMM operations utilizing\\nIntel oneAPI's Explicit SIMD (ESIMD) SYCL extension API. In contrast to CUDA or\\nSYCL, the ESIMD API enables the writing of explicitly vectorized kernel code.\\nSparse matrix algorithms implemented with the ESIMD API achieved performance\\nclose to the peak of the targeted Intel Data Center GPU. 
We compare our\\nperformance results to Intel's oneMKL library on Intel GPUs and to a recent\\nCUDA implementation for the sparse matrix operations on NVIDIA's V100 GPU and\\ndemonstrate that our implementations for sparse matrix operations outperform\\neither.\",\"PeriodicalId\":501256,\"journal\":{\"name\":\"arXiv - CS - Mathematical Software\",\"volume\":\"12 4\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Mathematical Software\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2311.00368\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Mathematical Software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2311.00368","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU
In this paper, we focus on three sparse matrix operations that are relevant to machine learning applications: the sparse-dense matrix multiplication (SPMM), the sampled dense-dense matrix multiplication (SDDMM), and the composition of SDDMM with SPMM, also termed FusedMM.
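To make the three operations concrete, the following is a minimal, unoptimized CPU reference sketch in plain C++ over a CSR representation. The container and function names (CsrMatrix, spmm_ref, sddmm_ref, fusedmm_ref) are illustrative, and the exact FusedMM formulation (the SDDMM output multiplied by the second dense operand) is one common convention assumed here, not code taken from the paper.

```cpp
#include <vector>
#include <cstddef>

// Illustrative CSR container; field names are ours, not the paper's.
struct CsrMatrix {
    int rows = 0, cols = 0;
    std::vector<int>   row_ptr;   // size rows + 1
    std::vector<int>   col_idx;   // size nnz
    std::vector<float> val;       // size nnz
};

// SPMM: C = A * B, with A sparse (CSR), B dense (A.cols x n, row-major),
// and C dense (A.rows x n, row-major).
void spmm_ref(const CsrMatrix& A, const std::vector<float>& B,
              std::vector<float>& C, int n) {
    C.assign(static_cast<std::size_t>(A.rows) * n, 0.0f);
    for (int i = 0; i < A.rows; ++i)
        for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            const int j = A.col_idx[p];
            const float a = A.val[p];
            for (int c = 0; c < n; ++c)
                C[i * n + c] += a * B[j * n + c];
        }
}

// SDDMM: for every stored entry (i, j) of S, compute
// out(i, j) = S(i, j) * dot(row i of X, row j of Y),
// where X is S.rows x k and Y is S.cols x k, both dense row-major.
void sddmm_ref(const CsrMatrix& S, const std::vector<float>& X,
               const std::vector<float>& Y, int k, CsrMatrix& out) {
    out = S;  // same sparsity pattern; values overwritten below
    for (int i = 0; i < S.rows; ++i)
        for (int p = S.row_ptr[i]; p < S.row_ptr[i + 1]; ++p) {
            const int j = S.col_idx[p];
            float dot = 0.0f;
            for (int c = 0; c < k; ++c)
                dot += X[i * k + c] * Y[j * k + c];
            out.val[p] = S.val[p] * dot;
        }
}

// FusedMM: SDDMM followed by SPMM on the SDDMM result (one common formulation).
void fusedmm_ref(const CsrMatrix& S, const std::vector<float>& X,
                 const std::vector<float>& Y, int k, std::vector<float>& Z) {
    CsrMatrix T;
    sddmm_ref(S, X, Y, k, T);  // T has the sparsity pattern of S
    spmm_ref(T, Y, Z, k);      // Z = T * Y, dense (S.rows x k)
}
```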
We develop optimized implementations of the SPMM, SDDMM, and FusedMM operations using Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API. In contrast to CUDA or standard SYCL, the ESIMD API enables writing explicitly vectorized kernel code.
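To illustrate what "explicitly vectorized" means in ESIMD, the sketch below follows the structure of Intel's publicly documented ESIMD vector-add sample rather than one of the paper's sparse kernels: each work-item loads, adds, and stores a fixed-width simd<float, VL> block instead of a single scalar lane. Variable names and sizes are ours, and exact namespaces and availability depend on the oneAPI DPC++ compiler version.

```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/esimd.hpp>

int main() {
    namespace esimd = sycl::ext::intel::esimd;
    constexpr int VL = 16;          // vector length owned by each work-item
    constexpr std::size_t N = 1 << 20;  // total elements (multiple of VL)

    sycl::queue q;
    float *a = sycl::malloc_shared<float>(N, q);
    float *b = sycl::malloc_shared<float>(N, q);
    float *c = sycl::malloc_shared<float>(N, q);
    for (std::size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

    // One work-item per VL-wide chunk; the vector width is explicit in the code.
    q.parallel_for(sycl::range<1>(N / VL), [=](sycl::id<1> i) SYCL_ESIMD_KERNEL {
        const std::size_t off = i[0] * VL;
        esimd::simd<float, VL> va(a + off);   // explicit VL-wide vector load
        esimd::simd<float, VL> vb(b + off);
        esimd::simd<float, VL> vc = va + vb;  // whole-vector arithmetic
        vc.copy_to(c + off);                  // explicit vector store
    }).wait();

    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}
```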
Sparse matrix algorithms implemented with the ESIMD API achieved performance close to the peak of the targeted Intel Data Center GPU. We compare our performance results to Intel's oneMKL library on Intel GPUs and to a recent CUDA implementation of these sparse matrix operations on NVIDIA's V100 GPU, and demonstrate that our implementations outperform both.