FULL-W2V: fully exploiting data reuse for W2V on GPU-accelerated systems

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing
Pub Date: 2021-06-03
DOI: 10.1145/3447818.3460373

Thomas Randall, Tyler N. Allen, Rong Ge



Abstract

Word2Vec remains one of the most impactful innovations in Natural Language Processing (NLP); it represents latent grammatical and syntactic information in human text as dense, low-dimensional vectors. Word2Vec has a high computational cost because of the algorithm’s inherent sequentiality, its intensive memory accesses, and the large vocabularies it represents. Prior studies have investigated techniques to exploit parallelism and improve memory system performance, but they struggle to gain throughput effectively on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior work on GPUs; this bottleneck prevents even highly optimized kernels from attaining the architecture’s peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce accesses to the lower levels of the memory hierarchy and improve temporal locality. FULL-W2V reduces accesses to GPU global memory by more than 89% compared to prior state-of-the-art GPU implementations, resulting in a significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that reducing memory accesses through register and shared memory caching, together with a high-throughput shared memory reduction, leads to significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
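To make the data-reuse argument concrete, the following is a minimal sketch of why caching a sliding context window cuts loads from slow memory. It is a toy counting model written for illustration, not the authors' algorithm or their measurement method: the function name `count_vector_loads`, the one-load-per-context-word cost model, and the example sentence length and window size are all assumptions, and the resulting ratio is not meant to reproduce the figures reported in the abstract.

```python
def count_vector_loads(num_words: int, window: int) -> tuple[int, int]:
    """Count context-vector loads from slow memory for one sentence.

    Hypothetical cost model (an assumption, not taken from the paper):
    - naive:  every target position reloads the vectors of all of its
              context words from slow memory.
    - cached: a sliding window keeps vectors resident in fast memory,
              so each word's vector is loaded from slow memory once.
    """
    naive = 0
    for target in range(num_words):
        lo = max(0, target - window)
        hi = min(num_words - 1, target + window)
        naive += hi - lo  # context words in range, excluding the target

    # With more than one word, every word is a context of some target,
    # so each vector enters the cache exactly once.
    cached = num_words if num_words > 1 else 0
    return naive, cached


# Toy sentence: 10 words, window radius 2.
naive, cached = count_vector_loads(num_words=10, window=2)
reduction = 1 - cached / naive
print(f"naive={naive} cached={cached} reduction={reduction:.1%}")
```

In this model the naive load count grows with the window size while the cached count does not, which is the kind of reuse opportunity the abstract refers to; real kernels must additionally account for negative samples, output vectors, and write traffic, none of which are modeled here.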

