Best-effort semantic document search on GPUs

GPGPU-3 · Pub Date: 2010-03-14 · DOI: 10.1145/1735688.1735705
S. Byna, Jiayuan Meng, A. Raghunathan, S. Chakradhar, S. Cadambi
{"title":"Best-effort semantic document search on GPUs","authors":"S. Byna, Jiayuan Meng, A. Raghunathan, S. Chakradhar, S. Cadambi","doi":"10.1145/1735688.1735705","DOIUrl":null,"url":null,"abstract":"Semantic indexing is a popular technique used to access and organize large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel implementation of semantic indexing on a 128-core Tesla C870 GPU is only 2.4X faster than a sequential implementation on an Intel Xeon 2.4GHz processor. We ascribe the less than spectacular speedup to a mismatch in the workload characteristics of semantic indexing and the unique architectural features of GPUs. Compared to the regular numerical computations that have been ported to GPUs with great success, our semantic indexing algorithm (the recently proposed Supervised Semantic Indexing algorithm called SSI) has interesting characteristics -- the amount of parallelism in each training instance is data-dependent, and each iteration involves the product of a dense matrix with a sparse vector, resulting in random memory access patterns. As a result, we observed that the baseline GPU implementation significantly under-utilizes the hardware resources (processing elements and memory bandwidth) of the GPU platform. However, the SSI algorithm also demonstrates unique characteristics, which we collectively refer to as the \"forgiving nature\" of the algorithm. These unique characteristics allow for novel optimizations that do not strive to preserve numerical equivalence of each training iteration with the sequential implementation. In particular, we consider best-effort computing techniques, such as dependency relaxation and computation dropping, to suitably alter the workload characteristics of SSI to leverage the unique architectural features of the GPU. We also show that the realization of dependency relaxation and computation dropping concepts on a GPU is quite different from how one would implement these concepts on a multicore CPU, largely due to the distinct architectural features supported by a GPU. Our new techniques dramatically enhance the amount of parallel workload, leading to much higher performance on the GPU. By optimizing data transfers between CPU and GPU, and by reducing GPU kernel invocation overheads, we achieve further performance gains. We evaluated our new GPU-accelerated implementation of semantic document search on a database of over 1.8 million documents from Wikipedia. By applying our novel performance-enhancing strategies, our GPU implementation on a 128-core Tesla C870 achieved a 5.5X acceleration as compared to a baseline parallel implementation on the same GPU. Compared to a baseline parallel TBB implementation on a dual-socket quad-core Intel Xeon multicore CPU (8-cores), the enhanced GPU implementation is 11X faster. 
Compared to a parallel implementation on the same multi-core CPU that also uses data dependency relaxation and dropping computation techniques, our enhanced GPU implementation is 5X faster.","PeriodicalId":381071,"journal":{"name":"GPGPU-3","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GPGPU-3","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1735688.1735705","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 29

Abstract

Semantic indexing is a popular technique used to access and organize large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel implementation of semantic indexing on a 128-core Tesla C870 GPU is only 2.4X faster than a sequential implementation on a 2.4GHz Intel Xeon processor. We ascribe this modest speedup to a mismatch between the workload characteristics of semantic indexing and the architectural features of GPUs. Compared to the regular numerical computations that have been ported to GPUs with great success, our semantic indexing algorithm (the recently proposed Supervised Semantic Indexing algorithm, SSI) has two challenging characteristics: the amount of parallelism in each training instance is data-dependent, and each iteration involves the product of a dense matrix with a sparse vector, resulting in random memory access patterns. As a result, the baseline GPU implementation significantly under-utilizes the hardware resources (processing elements and memory bandwidth) of the GPU platform. However, the SSI algorithm also exhibits unique properties, which we collectively refer to as the "forgiving nature" of the algorithm. These properties permit novel optimizations that do not strive to preserve the numerical equivalence of each training iteration with the sequential implementation. In particular, we apply best-effort computing techniques, such as dependency relaxation and computation dropping, to alter the workload characteristics of SSI so that it can exploit the architectural features of the GPU. We also show that realizing dependency relaxation and computation dropping on a GPU is quite different from implementing these concepts on a multicore CPU, largely due to the distinct architectural features of the GPU. Our new techniques dramatically increase the amount of parallel work, leading to much higher performance on the GPU. By optimizing data transfers between the CPU and GPU, and by reducing GPU kernel invocation overheads, we achieve further performance gains. We evaluated our GPU-accelerated implementation of semantic document search on a database of over 1.8 million documents from Wikipedia. With our performance-enhancing strategies, the GPU implementation on a 128-core Tesla C870 achieved a 5.5X speedup over a baseline parallel implementation on the same GPU. Compared to a baseline parallel TBB implementation on a dual-socket quad-core Intel Xeon multicore CPU (8 cores total), the enhanced GPU implementation is 11X faster. Compared to a parallel implementation on the same multicore CPU that also uses dependency relaxation and computation dropping, our enhanced GPU implementation is 5X faster.
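The abstract attributes the weak baseline speedup to the dense-matrix-times-sparse-vector product inside each SSI iteration, and attributes the eventual gains to best-effort techniques such as computation dropping. The following minimal CUDA sketch illustrates both points; it is not the paper's code. The kernel name, the coordinate-format sparse layout (idx/val arrays), and the drop_threshold parameter are assumptions for illustration, and the thresholding shown is only one plausible reading of "computation dropping".

```cuda
// Minimal CUDA sketch (assumptions, not the paper's implementation):
// each SSI-style step multiplies a dense weight matrix W (rows x dim,
// row-major) by a sparse document vector x given in coordinate form
// (idx[k], val[k]) for k < nnz. One thread computes one output y[row].
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dense_times_sparse(const float *W, const int *idx,
                                   const float *val, float *y,
                                   int rows, int dim, int nnz,
                                   float drop_threshold) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float acc = 0.0f;
    for (int k = 0; k < nnz; ++k) {
        // Hypothetical "computation dropping": skip terms whose
        // contribution is likely negligible instead of computing them.
        if (fabsf(val[k]) < drop_threshold) continue;
        // The column idx[k] is data-dependent, so this load gathers from
        // effectively random columns of W: the irregular access pattern
        // the abstract blames for GPU under-utilization.
        acc += W[row * dim + idx[k]] * val[k];
    }
    y[row] = acc;
}

int main() {
    const int rows = 1024, dim = 4096, nnz = 64;
    float *W, *val, *y; int *idx;
    cudaMallocManaged(&W, rows * dim * sizeof(float));
    cudaMallocManaged(&val, nnz * sizeof(float));
    cudaMallocManaged(&idx, nnz * sizeof(int));
    cudaMallocManaged(&y, rows * sizeof(float));
    for (int i = 0; i < rows * dim; ++i) W[i] = 0.01f;
    for (int k = 0; k < nnz; ++k) { idx[k] = (k * 131) % dim; val[k] = 1.0f; }
    dense_times_sparse<<<(rows + 255) / 256, 256>>>(W, idx, val, y,
                                                    rows, dim, nnz, 0.05f);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);  // expect 64 * 0.01 = 0.64
    return 0;
}
```

Note the trade-off the sketch makes visible: because the per-document work depends on nnz and on how many terms survive the threshold, the parallelism is data-dependent, which is exactly the property the paper's best-effort techniques exploit rather than fight.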