Yusuke Nagasaka, Akira Nukada, Ryosuke Kojima, S. Matsuoka
{"title":"加速图卷积网络的批处理稀疏矩阵乘法","authors":"Yusuke Nagasaka, Akira Nukada, Ryosuke Kojima, S. Matsuoka","doi":"10.1109/CCGRID.2019.00037","DOIUrl":null,"url":null,"abstract":"Graph Convolutional Networks (GCNs) are recently getting much attention in bioinformatics and chemoinformatics as a state-of-the-art machine learning approach with high accuracy. GCNs process convolutional operations along with graph structures, and GPUs are used to process enormous operations including sparse-dense matrix multiplication (SpMM) when the graph structure is expressed as an adjacency matrix with sparse matrix format. However, the SpMM operation on small graph, where the number of nodes is tens or hundreds, hardly exploits high parallelism or compute power of GPU. Therefore, SpMM becomes a bottleneck of training and inference in GCNs applications. In order to improve the performance of GCNs applications, we propose new SpMM algorithm especially for small sparse matrix and Batched SpMM, which exploits high parallelism of GPU by processing multiple SpMM operations with single CUDA kernel. To the best of our knowledge, this is the first work of batched approach for SpMM. We evaluated the performance of the GCNs application on TSUBAME3.0 implementing NVIDIA Tesla P100 GPU, and our batched approach shows significant speedups of up to 1.59x and 1.37x in training and inference, respectively.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks\",\"authors\":\"Yusuke Nagasaka, Akira Nukada, Ryosuke Kojima, S. Matsuoka\",\"doi\":\"10.1109/CCGRID.2019.00037\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graph Convolutional Networks (GCNs) are recently getting much attention in bioinformatics and chemoinformatics as a state-of-the-art machine learning approach with high accuracy. GCNs process convolutional operations along with graph structures, and GPUs are used to process enormous operations including sparse-dense matrix multiplication (SpMM) when the graph structure is expressed as an adjacency matrix with sparse matrix format. However, the SpMM operation on small graph, where the number of nodes is tens or hundreds, hardly exploits high parallelism or compute power of GPU. Therefore, SpMM becomes a bottleneck of training and inference in GCNs applications. In order to improve the performance of GCNs applications, we propose new SpMM algorithm especially for small sparse matrix and Batched SpMM, which exploits high parallelism of GPU by processing multiple SpMM operations with single CUDA kernel. To the best of our knowledge, this is the first work of batched approach for SpMM. 
We evaluated the performance of the GCNs application on TSUBAME3.0 implementing NVIDIA Tesla P100 GPU, and our batched approach shows significant speedups of up to 1.59x and 1.37x in training and inference, respectively.\",\"PeriodicalId\":234571,\"journal\":{\"name\":\"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-03-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CCGRID.2019.00037\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2019.00037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 7
Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks

Abstract
Graph Convolutional Networks (GCNs) have recently attracted much attention in bioinformatics and chemoinformatics as a state-of-the-art machine learning approach with high accuracy. GCNs perform convolutional operations along graph structures, and GPUs are used to process the enormous number of operations involved, including sparse-dense matrix multiplication (SpMM) when the graph structure is expressed as an adjacency matrix in a sparse matrix format. However, an SpMM operation on a small graph, where the number of nodes is in the tens or hundreds, can hardly exploit the high parallelism and compute power of a GPU. SpMM therefore becomes a bottleneck of training and inference in GCN applications. To improve the performance of GCN applications, we propose a new SpMM algorithm tailored to small sparse matrices, as well as Batched SpMM, which exploits the high parallelism of the GPU by processing multiple SpMM operations with a single CUDA kernel. To the best of our knowledge, this is the first work on a batched approach to SpMM. We evaluated the performance of a GCN application on TSUBAME3.0, equipped with NVIDIA Tesla P100 GPUs, and our batched approach shows significant speedups of up to 1.59x in training and 1.37x in inference.
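In a typical GCN layer, features propagate as H' = σ(Â H W), where Â is the normalized adjacency matrix held in a sparse format; the product Â H is the SpMM that the abstract identifies as the bottleneck when each graph has only tens or hundreds of nodes. The CUDA sketch below illustrates the batched idea: a single kernel launch computes C_i = A_i B_i for every graph in the batch, with blockIdx.y selecting the batch item, so many small SpMMs together expose enough parallelism to fill the GPU. This is a minimal sketch under our own assumptions, not the paper's implementation: we assume every graph is padded to n nodes, each A_i is stored in CSR with per-item (locally zero-based) row pointers, the dense matrices are row-major n × k, and all names (batched_spmm, nnz_off, and so on) are illustrative.

```cuda
// Minimal sketch of a batched SpMM kernel (illustrative, not the paper's code).
// One launch computes C_i = A_i * B_i for all batch items i, where each A_i is
// a small n x n sparse matrix in CSR and each B_i is a dense n x k matrix.
#include <cuda_runtime.h>

__global__ void batched_spmm(
    const int   *row_ptr,  // CSR row pointers, (n + 1) entries per item, concatenated
    const int   *col_idx,  // CSR column indices of all items, concatenated
    const float *val,      // CSR nonzero values of all items, concatenated
    const int   *nnz_off,  // start of item i inside col_idx / val
    const float *B,        // dense inputs, n x k per item, row-major, concatenated
    float       *C,        // dense outputs, n x k per item, row-major, concatenated
    int n, int k)
{
    int item = blockIdx.y;   // which SpMM in the batch this block works on
    int row  = blockIdx.x;   // which row of A_i
    int col  = threadIdx.x;  // which column of B_i
    if (row >= n || col >= k) return;

    // Locate item's CSR arrays and dense operand within the concatenated buffers.
    const int   *rp = row_ptr + item * (n + 1);
    const int   *ci = col_idx + nnz_off[item];
    const float *v  = val     + nnz_off[item];
    const float *b  = B + (size_t)item * n * k;

    // C_i[row, col] = sum over the nonzeros of row 'row' of A_i.
    float acc = 0.0f;
    for (int p = rp[row]; p < rp[row + 1]; ++p)
        acc += v[p] * b[ci[p] * k + col];

    C[(size_t)item * n * k + (size_t)row * k + col] = acc;
}

// Launch once for the whole batch, e.g.:
//   dim3 grid(n, batch_size);                  // grid.y indexes batch items
//   batched_spmm<<<grid, k>>>(d_rp, d_ci, d_val, d_off, d_B, d_C, n, k);
// (This sketch assumes k <= 1024, the per-block thread limit.)
```

The design point this illustrates is the one the abstract makes: instead of serializing one library SpMM call per graph, where each call underutilizes the device, the batch dimension is turned into grid parallelism within a single kernel launch.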