{"title":"An Implementation of the Radix Sorting Algorithm on the Touchstone Delta Prototype","authors":"Marc Baber","doi":"10.1109/DMCC.1991.633213","DOIUrl":null,"url":null,"abstract":"This implementation of the radix sorting algorithm considers the nodes of the multicomputer to be buckets for receiving keys that correspond with their node identijiers. Sorting a list of 30-bit keys requires six passes on a 32-node hypercube, because five bits are considered in each pass. When the number of buckets is equal to the number of processors, superlinear speedups are obtained because, in addition to assigning smaller subsets of the data to each node, the number of passes required decreases when more bits are considered in each pass. True speed ups close to linear are observed when the number of buckets is made independent of the number of processors by permitting multiple buckets per processor so that a small hypercube can emulate a larger hypercube’s ability to consider more bits during each pass through the daa. Experiments on an iPSCl860 and the Touchstone Delta Prototype system show that the algorithm is well suited to multicomputer architectures and that i t scales well for random distributions of keys. Introduction The radix sorting algorithm has a time complexity mO(n) for n keys, each m bits in length. This time complexity compares favorably with most of the popular O(n log n) algorithms and so, radix is often the method of choice. In the context of a parallel machine, this continues to be true, as long as the distribution of keys is nearly flat. On a multicomputer, the overhead associated with the straight radix sort [6] is that it requires more than one allto-all message exchange. The number of exchanges can be up to the number of bits in a single key on a two-node * Supported in part by: Defense Advanced Research Projects Agency Information Science and Technology Office Research in Concurrent Computing Systems ARPA Order No. 6402.6402-1; Program Code No. 8E20 & 9E20 Issued by DARPNCMO under Contract #MDA-972-89-C-0034 system with a single bucket per node. On the Touchstone Delta prototype system, using 5 12 or 29 processing nodes, this implementation of the straight radix sort processes 9 bits in each pass through the data, so a 32-bit integer is fully sorted in four passes and only four all-to-all message exchanges are required. The radix algorithm is sensitive to uneven distributions of keys. If the bit patterns of the keys deviate too far from a random, even distribution, then some node(s) will require disproportionate amounts of memory. Most distributions, in practice, are more random in the low order bits than the high order bits. Therefore, this implementation uses the straight radix sort [6] , or least signiticant digit [4] variation of the radix algorithm in order to postpone any load imbalances until the last pass through the data. A radix exchange sort, or most significant digit implementation of the radix algorithm would require only one all-to-all message exchange, followed by a local sort on each node, but the method could be more prone to performance degradation due to load imbalance. Related Work The problem of sorting on hypercube architectures has been the subject of several papers in the last few years. Felten et. al.[2,31 devised a distributed version of the Quicksort algorithm, sometimes called “hyperquicksort” [9 1 that utilizes global splitting points to partition keys into successively smaller subcubes until each range of keys is stored on a single node. 
Since each node stores a distinct range of keys, no global merge is necessary. After each node applies a local quicksort (or other sequential sort) to its data, the sort is complete. Seidel and George 171 explored three variations of the binsort algorithm. Each method begins by assigning a subrange of keys to each node (based on assumed even distribution or on the distribution observed in a sample of the data), and then breaking up each node’s data into subsets destined for every other node. All messages are then sent simultaneously from initial sources to final destinations, using all the dimensions of the hypercube in one step. Each node then applies the quicksort algorithm to its local sub-range of keys. 458 0-8186-2290-3/91/0oO0/0458/$01 .oO Q 1991 IEEE Li and Tung [5] comlpared the performance of three different sorting algorithm on a Symult 2010 and found that the parallel Quicksort outperformed both the Bitonic and Shell sort algorithms for larger problem si~zes (more than 64p where p is the nuimber of processors 01 nodes). Abali, et al, [ 11 developed a load balanced variation on the distributed quicksort algorithm similar to :Seide1 and George’s [7], except that each node performs a quicksort of its own data before the sub-ranges are assigned to each node. This allows the nocles to determine the exact keys that most equally divide the data. An n-way merge is performed on each node after it receives sorted packages of keys in its subrange from the other nodes. Tang [8] implemented a sorting algorithm based on a local Quicksort of each node followed by a global Shell merge. All of these papers concentrate on genecal-purpose sorting algorithms for sorting lists of unknown data distributions. The radix algorithm tends to sacrifice memory efficiency and time for uneven data distributions, but, as a specialized sort algorithm, it can sort data of known distributions very quickly. The Radix Algoritlhm The sequential radix sorting algorithm [4], can be implemented in parallel on a hypercube or any multicomputer with a number of processors that is a power of two, as follows: 1. Allocate a proximately equal numbers of unsorted keys to each of 2 nodes. 2. Each node allocates a section of memory to buffer outgoing keys to be sent tcb every other node (and to itself). 3. Each key from the original data is placed in the buffer for the node whose node identifier is equal to the least significant d bits of tlhe key where d is the dimension of the hypercube. 4. If a buffer is full, it is marked iincomplete and sent ahead to its destination before another key is stored in the buffer. 5. After all the keys are thus partitioned, all the buffers are marked complete and sent to their destinations. An empty buffer is still sent because the receiving node requires a “complete” indication before proceeding to messages from the next sending node. 6. Each node then processes buffeirs received from all nodes, including itself, from node id 0 through 2d-1. The buffers must be handled in order of increasing originating node id to preserve the ordlering of less significant bits. For the same reason all buffers sent from node A b node B in each pass through the datal must be handed in the order in which they were sent. The last buffer from node k-1 must be processed before the lbst buffer from node k can be processed. A consequence of these constraints is that this implementation of the radix sort is stable. B 7. 
The keys are again placed into buckets corresponding to their destination node ids, but this time the next least significant d bits are used for the node id. Steps 4 through 6 are repeated until all bits of the keys have been used If d does not divide the length of the key evenly, the bits considered in the last pass should overlap the bits considered in the previous pass to avoid having any bits that are constantly zero which would lead to load imbalance. 8. The keys are sorted. To enumerate them, one need simply visit nodes 0 through 2d-1, in order, and print the contents of received buffers in originating node id order. A variation of this algorithm uses virtual node ids to decouple the number of bits processed in each step from the hypercube dimension. For example, to process 8 bits in each step on a 32-node system, each physical node would emulate 8 virtual nodes for a total of 256 (or 28) virtual nodes. This variation was necessary to obtain realistic speed-up measurements.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DMCC.1991.633213","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
This implementation of the radix sorting algorithm considers the nodes of the multicomputer to be buckets for receiving keys that correspond with their node identifiers. Sorting a list of 30-bit keys requires six passes on a 32-node hypercube, because five bits are considered in each pass. When the number of buckets is equal to the number of processors, superlinear speedups are obtained because, in addition to assigning smaller subsets of the data to each node, the number of passes required decreases when more bits are considered in each pass. True speedups close to linear are observed when the number of buckets is made independent of the number of processors by permitting multiple buckets per processor, so that a small hypercube can emulate a larger hypercube's ability to consider more bits during each pass through the data. Experiments on an iPSC/860 and the Touchstone Delta Prototype system show that the algorithm is well suited to multicomputer architectures and that it scales well for random distributions of keys.

* Supported in part by: Defense Advanced Research Projects Agency, Information Science and Technology Office, Research in Concurrent Computing Systems, ARPA Order No. 6402, 6402-1; Program Code No. 8E20 & 9E20. Issued by DARPA/CMO under Contract #MDA-972-89-C-0034.

Introduction

The radix sorting algorithm has a time complexity of O(m·n) for n keys, each m bits in length. This time complexity compares favorably with most of the popular O(n log n) algorithms, so radix is often the method of choice. In the context of a parallel machine, this continues to be true as long as the distribution of keys is nearly flat. On a multicomputer, the overhead associated with the straight radix sort [6] is that it requires more than one all-to-all message exchange. The number of exchanges can be up to the number of bits in a single key on a two-node system with a single bucket per node. On the Touchstone Delta prototype system, using 512 (or 2^9) processing nodes, this implementation of the straight radix sort processes 9 bits in each pass through the data, so a 32-bit integer is fully sorted in four passes and only four all-to-all message exchanges are required.

The radix algorithm is sensitive to uneven distributions of keys. If the bit patterns of the keys deviate too far from a random, even distribution, then some node(s) will require disproportionate amounts of memory. Most distributions, in practice, are more random in the low-order bits than the high-order bits. Therefore, this implementation uses the straight radix sort [6], or least significant digit [4], variation of the radix algorithm in order to postpone any load imbalances until the last pass through the data. A radix exchange sort, or most significant digit, implementation of the radix algorithm would require only one all-to-all message exchange, followed by a local sort on each node, but the method could be more prone to performance degradation due to load imbalance.
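To make the pass-count arithmetic in this section concrete, here is a minimal sketch (not taken from the paper; the C function names are illustrative only) of how many passes are needed for keys of a given width when d bits are examined per pass, and which bucket a key falls into on a given pass:

```c
#include <stdio.h>

/* Passes needed to cover key_bits when d bits are examined per pass:
 * the ceiling of key_bits / d. */
static unsigned num_passes(unsigned key_bits, unsigned d)
{
    return (key_bits + d - 1) / d;
}

/* Destination bucket (node id) for `key` on pass `pass`: the pass-th
 * group of d bits, counted from the least significant end. */
static unsigned bucket_of(unsigned key, unsigned pass, unsigned d)
{
    return (key >> (pass * d)) & ((1u << d) - 1u);
}

int main(void)
{
    /* Figures quoted in the text: 30-bit keys at 5 bits per pass need
     * 6 passes; 32-bit keys at 9 bits per pass need 4 passes. */
    printf("%u\n", num_passes(30, 5));        /* prints 6 */
    printf("%u\n", num_passes(32, 9));        /* prints 4 */
    printf("%u\n", bucket_of(0x2A7u, 1, 5));  /* second 5-bit digit */
    return 0;
}
```

On the 512-node Delta configuration described above, d is 9, which is exactly why a 32-bit key is finished after four exchanges.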
Related Work

The problem of sorting on hypercube architectures has been the subject of several papers in the last few years. Felten et al. [2,3] devised a distributed version of the Quicksort algorithm, sometimes called "hyperquicksort" [9], that utilizes global splitting points to partition keys into successively smaller subcubes until each range of keys is stored on a single node. Since each node stores a distinct range of keys, no global merge is necessary. After each node applies a local quicksort (or other sequential sort) to its data, the sort is complete.

Seidel and George [7] explored three variations of the binsort algorithm. Each method begins by assigning a subrange of keys to each node (based on an assumed even distribution or on the distribution observed in a sample of the data), and then breaking up each node's data into subsets destined for every other node. All messages are then sent simultaneously from initial sources to final destinations, using all the dimensions of the hypercube in one step. Each node then applies the quicksort algorithm to its local sub-range of keys.

Li and Tung [5] compared the performance of three different sorting algorithms on a Symult 2010 and found that the parallel Quicksort outperformed both the Bitonic and Shell sort algorithms for larger problem sizes (more than 64p, where p is the number of processors or nodes). Abali et al. [1] developed a load-balanced variation on the distributed quicksort algorithm similar to Seidel and George's [7], except that each node performs a quicksort of its own data before the sub-ranges are assigned to each node. This allows the nodes to determine the exact keys that most equally divide the data. An n-way merge is performed on each node after it receives sorted packages of keys in its subrange from the other nodes. Tang [8] implemented a sorting algorithm based on a local Quicksort of each node followed by a global Shell merge.

All of these papers concentrate on general-purpose sorting algorithms for sorting lists of unknown data distributions. The radix algorithm tends to sacrifice memory efficiency and time for uneven data distributions, but, as a specialized sort algorithm, it can sort data of known distributions very quickly.

The Radix Algorithm

The sequential radix sorting algorithm [4] can be implemented in parallel on a hypercube, or any multicomputer with a number of processors that is a power of two, as follows (a minimal sketch of one pass is given after the list):

1. Allocate approximately equal numbers of unsorted keys to each of the 2^d nodes.
2. Each node allocates a section of memory to buffer outgoing keys to be sent to every other node (and to itself).
3. Each key from the original data is placed in the buffer for the node whose node identifier is equal to the least significant d bits of the key, where d is the dimension of the hypercube.
4. If a buffer is full, it is marked incomplete and sent ahead to its destination before another key is stored in the buffer.
5. After all the keys are thus partitioned, all the buffers are marked complete and sent to their destinations. An empty buffer is still sent, because the receiving node requires a "complete" indication before proceeding to messages from the next sending node.
6. Each node then processes buffers received from all nodes, including itself, from node id 0 through 2^d - 1. The buffers must be handled in order of increasing originating node id to preserve the ordering of less significant bits. For the same reason, all buffers sent from node A to node B in each pass through the data must be handled in the order in which they were sent. The last buffer from node k-1 must be processed before the first buffer from node k can be processed. A consequence of these constraints is that this implementation of the radix sort is stable.
7. The keys are again placed into buckets corresponding to their destination node ids, but this time the next least significant d bits are used for the node id. Steps 4 through 6 are repeated until all bits of the keys have been used. If d does not divide the length of the key evenly, the bits considered in the last pass should overlap the bits considered in the previous pass, to avoid having any bits that are constantly zero, which would lead to load imbalance.
8. The keys are sorted. To enumerate them, one need simply visit nodes 0 through 2^d - 1, in order, and print the contents of received buffers in originating node id order.

A variation of this algorithm uses virtual node ids to decouple the number of bits processed in each step from the hypercube dimension. For example, to process 8 bits in each step on a 32-node system, each physical node would emulate 8 virtual nodes, for a total of 256 (or 2^8) virtual nodes. This variation was necessary to obtain realistic speed-up measurements.
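The list above maps naturally onto a per-pass partition/exchange/drain loop. The sketch below is not the paper's code: it runs in a single process, so the all-to-all exchange of steps 4 through 6 is collapsed into in-memory copies, buffer chunking and virtual node ids are omitted, and every constant and name is illustrative. It does, however, follow the same rules: bucket by the next d bits (step 3), let the last pass overlap the previous one when d does not divide the key length (step 7), and drain buckets in increasing originating-id order so the sort stays stable (step 6).

```c
/* Single-process sketch of the parallel straight (LSD) radix sort described
 * above. Each "node" is simulated by an array slot; real message passing on
 * the iPSC/860 or Delta is replaced by copying between arrays. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define D         3              /* bits per pass (hypercube dimension)  */
#define NODES     (1u << D)      /* 2^d nodes, i.e. buckets              */
#define KEY_BITS  10             /* not a multiple of D: last pass overlaps */
#define NKEYS     16

int main(void)
{
    unsigned keys[NKEYS] = { 841, 12, 777, 1023, 256, 9, 1000, 33,
                             512, 511, 640, 7, 300, 128, 95, 2 };
    unsigned next[NKEYS];
    unsigned passes = (KEY_BITS + D - 1) / D;   /* ceiling of KEY_BITS / D */

    for (unsigned pass = 0; pass < passes; pass++) {
        /* Step 7: the last pass overlaps the previous one so that no
         * examined bit position is constantly zero. */
        unsigned shift = pass * D;
        if (shift + D > KEY_BITS) shift = KEY_BITS - D;

        /* Steps 2-5: bucket every key by destination node id. One array per
         * destination stands in for that node's outgoing buffer. */
        unsigned *bucket[NODES];
        unsigned count[NODES] = { 0 };
        for (unsigned b = 0; b < NODES; b++)
            bucket[b] = malloc(NKEYS * sizeof(unsigned));
        for (unsigned i = 0; i < NKEYS; i++) {
            unsigned dest = (keys[i] >> shift) & (NODES - 1);
            bucket[dest][count[dest]++] = keys[i];
        }

        /* Step 6: drain buffers in increasing originating node id order,
         * preserving arrival order within each bucket, so the pass is
         * stable. Concatenating the buckets plays that role here. */
        unsigned n = 0;
        for (unsigned b = 0; b < NODES; b++) {
            memcpy(next + n, bucket[b], count[b] * sizeof(unsigned));
            n += count[b];
            free(bucket[b]);
        }
        memcpy(keys, next, sizeof(keys));
    }

    for (unsigned i = 0; i < NKEYS; i++)        /* step 8: keys are sorted */
        printf("%u%c", keys[i], i + 1 == NKEYS ? '\n' : ' ');
    return 0;
}
```

In the real implementation each bucket would be an outgoing message buffer, sent to the node (or virtual node) whose id matches the bucket number, possibly in several chunks marked incomplete and then complete as described in steps 4 and 5.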