{"title":"RTop-K:用于神经网络的超快行向 Top-K 算法和 GPU 实现","authors":"Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding","doi":"arxiv-2409.00822","DOIUrl":null,"url":null,"abstract":"Top-k algorithms are essential in various applications, from high-performance\ncomputing and information retrieval to big data and neural network model\ntraining. This paper introduces RTop-K, a highly efficient parallel row-wise\ntop-k selection algorithm designed for GPUs. RTop-K employs a Binary\nSearch-based approach to optimize resource allocation and provides a scalable\nsolution that significantly accelerates top-k operations. We perform a\ntheoretical analysis of the effects of early stopping in our algorithm,\ndemonstrating that it maintains the accuracy of neural network models while\nenhancing performance. Comprehensive tests show that our GPU implementation of\nRTop-K outperforms other row-wise top-k GPU implementations, with minimal\nimpact on testing accuracy when early stopping is applied. Notably, RTop-K\nachieves speed increases ranging from 4.245$\\times$ to 9.506$\\times$ with early\nstopping, and 3.936$\\times$ without early stopping, compared to\nstate-of-the-art implementations. The proposed methods offer significant\nimprovements in the training and inference of Graph Neural Networks (GNNs),\naddressing critical challenges in latency and throughput on GPU platforms.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"268 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks\",\"authors\":\"Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding\",\"doi\":\"arxiv-2409.00822\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Top-k algorithms are essential in various applications, from high-performance\\ncomputing and information retrieval to big data and neural network model\\ntraining. This paper introduces RTop-K, a highly efficient parallel row-wise\\ntop-k selection algorithm designed for GPUs. RTop-K employs a Binary\\nSearch-based approach to optimize resource allocation and provides a scalable\\nsolution that significantly accelerates top-k operations. We perform a\\ntheoretical analysis of the effects of early stopping in our algorithm,\\ndemonstrating that it maintains the accuracy of neural network models while\\nenhancing performance. Comprehensive tests show that our GPU implementation of\\nRTop-K outperforms other row-wise top-k GPU implementations, with minimal\\nimpact on testing accuracy when early stopping is applied. Notably, RTop-K\\nachieves speed increases ranging from 4.245$\\\\times$ to 9.506$\\\\times$ with early\\nstopping, and 3.936$\\\\times$ without early stopping, compared to\\nstate-of-the-art implementations. 
The proposed methods offer significant\\nimprovements in the training and inference of Graph Neural Networks (GNNs),\\naddressing critical challenges in latency and throughput on GPU platforms.\",\"PeriodicalId\":501422,\"journal\":{\"name\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"volume\":\"268 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.00822\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.00822","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks
Top-k algorithms are essential in various applications, from high-performance
computing and information retrieval to big data and neural network model
training. This paper introduces RTop-K, a highly efficient parallel row-wise
top-k selection algorithm designed for GPUs. RTop-K employs a binary-search-based
approach to optimize resource allocation and provides a scalable solution that
significantly accelerates top-k operations. We perform a
theoretical analysis of the effects of early stopping in our algorithm,
demonstrating that it maintains the accuracy of neural network models while
enhancing performance. Comprehensive tests show that our GPU implementation of
RTop-K outperforms other row-wise top-k GPU implementations, with minimal
impact on test accuracy when early stopping is applied. Notably, RTop-K
achieves speedups ranging from 4.245$\times$ to 9.506$\times$ with early
stopping, and 3.936$\times$ without early stopping, compared to
state-of-the-art implementations. The proposed methods offer significant
improvements in the training and inference of Graph Neural Networks (GNNs),
addressing critical challenges in latency and throughput on GPU platforms.
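The abstract itself contains no code, so the following is only a minimal NumPy sketch of the general idea it describes: binary-searching a per-row threshold so that roughly k entries pass, with a cap on the number of search iterations standing in for early stopping. It is not the paper's GPU kernel; the function and parameter names (row_topk_mask, max_iters) are illustrative assumptions.

```python
import numpy as np

def row_topk_mask(x: np.ndarray, k: int, max_iters: int = 32) -> np.ndarray:
    """Approximate row-wise top-k selection via binary search on a per-row
    threshold. Each iteration halves the interval [lo, hi] so that the number
    of entries >= threshold approaches k; capping max_iters mimics early
    stopping by accepting an approximate threshold instead of an exact one.
    Returns a boolean mask with roughly k True entries per row."""
    lo = x.min(axis=1)  # at this threshold every entry in the row passes
    hi = x.max(axis=1)  # at this threshold only the row maximum passes
    for _ in range(max_iters):
        mid = (lo + hi) / 2.0
        count = (x >= mid[:, None]).sum(axis=1)  # entries above the candidate
        lo = np.where(count > k, mid, lo)  # too many pass -> raise threshold
        hi = np.where(count < k, mid, hi)  # too few pass  -> lower threshold
    return x >= mid[:, None]

# Example: keep roughly 16 of 256 features per row of a random activation matrix.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.standard_normal((4, 256)).astype(np.float32)
    mask = row_topk_mask(acts, k=16, max_iters=8)  # few iterations ~ early stop
    print(mask.sum(axis=1))  # approximately 16 selected entries per row
```

In a GPU implementation one would typically map each row to a thread block and compute the per-candidate count with a parallel reduction; the paper's actual kernel design and resource-allocation scheme may differ from this sketch.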