Engineering Worst-Case Inputs for Pairwise Merge Sort on GPUs

Kyle Berney, Nodari Sitchinava
Published in: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1133-1142, 2020-05-01
DOI: 10.1109/IPDPS47924.2020.00119 (https://doi.org/10.1109/IPDPS47924.2020.00119)
Citations: 2

Abstract

Currently, the fastest comparison-based sorting implementation on GPUs uses a parallel pairwise merge sort algorithm (Thrust library). To achieve fast runtimes, the number of threads t used to sort an input of N elements is fine-tuned experimentally for each generation of Nvidia GPUs so that the number of elements E = N/t that each thread accesses in each merging round results in a small (empirically measured) number of shared memory contentions, known as bank conflicts, while balancing the number of global memory accesses and latency hiding through thread oversubscription/occupancy. In this paper, we show that for every choice of E < w such that E and w are co-prime, there exists an input permutation on which every warp of w threads of the Thrust merge sort is effectively reduced to using at most ⌈w/E⌉ threads, because bank conflicts sequentialize its shared memory accesses. Note that this matches the trivial worst-case bound on the loss of parallelism due to memory contentions for any warp accessing wE contiguous shared memory locations. Our proof is constructive, i.e., we are able to automatically construct such a permutation for every value of E. We also show in practice that such constructed inputs result in up to ~50% slowdown, compared to the performance on random inputs, on modern GPU hardware.
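
The trivial worst-case bound mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the paper's construction and does not model Thrust itself. It assumes a simplified shared memory model in which word address a lives in bank a mod w, the number of banks equals the warp size w = 32, and threads reading the same word are served by one broadcast. The function name conflict_degree and the value E = 15 are hypothetical choices made for illustration only. In any window of wE contiguous words each bank holds exactly E words, so a single warp-wide access can be serialized into at most E steps, leaving about w/E threads' worth of useful parallelism.

```python
from collections import defaultdict
from math import ceil, gcd


def conflict_degree(addresses, num_banks=32):
    """Number of serialized steps for one warp-wide shared memory access.

    Simplified model (an assumption, not taken from the paper):
    bank = word address mod num_banks, and threads that read the
    same word are served together by a single broadcast.
    """
    words_per_bank = defaultdict(set)
    for addr in addresses:
        words_per_bank[addr % num_banks].add(addr)
    # The slowest bank determines how many steps the access takes.
    return max((len(words) for words in words_per_bank.values()), default=0)


w, E = 32, 15            # illustrative values with E < w and gcd(E, w) == 1
assert gcd(E, w) == 1

# Thread i reads word i: every bank is hit exactly once.
stride_1 = [i for i in range(w)]

# Thread i reads word i * E: still conflict-free because E and w are co-prime.
stride_e = [i * E for i in range(w)]

# Adversarial pattern inside a window of w * E words: every thread reads a
# multiple of w, so all accesses land in bank 0 on E distinct words.
worst = [(i % E) * w for i in range(w)]
assert max(worst) < w * E

for name, pattern in [("stride 1", stride_1), ("stride E", stride_e), ("worst", worst)]:
    steps = conflict_degree(pattern, num_banks=w)
    print(f"{name}: {steps} step(s), about {ceil(w / steps)} effective thread(s)")
```

Under these assumptions the two strided patterns complete in one step, while the adversarial pattern needs E steps, which corresponds to the ⌈w/E⌉ effective threads quoted in the abstract.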