{"title":"An Implementation of the Radix Sorting Algorithm on the Touchstone Delta Prototype","authors":"Marc Baber","doi":"10.1109/DMCC.1991.633213","DOIUrl":null,"url":null,"abstract":"This implementation of the radix sorting algorithm considers the nodes of the multicomputer to be buckets for receiving keys that correspond with their node identijiers. Sorting a list of 30-bit keys requires six passes on a 32-node hypercube, because five bits are considered in each pass. When the number of buckets is equal to the number of processors, superlinear speedups are obtained because, in addition to assigning smaller subsets of the data to each node, the number of passes required decreases when more bits are considered in each pass. True speed ups close to linear are observed when the number of buckets is made independent of the number of processors by permitting multiple buckets per processor so that a small hypercube can emulate a larger hypercube’s ability to consider more bits during each pass through the daa. Experiments on an iPSCl860 and the Touchstone Delta Prototype system show that the algorithm is well suited to multicomputer architectures and that i t scales well for random distributions of keys. Introduction The radix sorting algorithm has a time complexity mO(n) for n keys, each m bits in length. This time complexity compares favorably with most of the popular O(n log n) algorithms and so, radix is often the method of choice. In the context of a parallel machine, this continues to be true, as long as the distribution of keys is nearly flat. On a multicomputer, the overhead associated with the straight radix sort [6] is that it requires more than one allto-all message exchange. The number of exchanges can be up to the number of bits in a single key on a two-node * Supported in part by: Defense Advanced Research Projects Agency Information Science and Technology Office Research in Concurrent Computing Systems ARPA Order No. 6402.6402-1; Program Code No. 8E20 & 9E20 Issued by DARPNCMO under Contract #MDA-972-89-C-0034 system with a single bucket per node. On the Touchstone Delta prototype system, using 5 12 or 29 processing nodes, this implementation of the straight radix sort processes 9 bits in each pass through the data, so a 32-bit integer is fully sorted in four passes and only four all-to-all message exchanges are required. The radix algorithm is sensitive to uneven distributions of keys. If the bit patterns of the keys deviate too far from a random, even distribution, then some node(s) will require disproportionate amounts of memory. Most distributions, in practice, are more random in the low order bits than the high order bits. Therefore, this implementation uses the straight radix sort [6] , or least signiticant digit [4] variation of the radix algorithm in order to postpone any load imbalances until the last pass through the data. A radix exchange sort, or most significant digit implementation of the radix algorithm would require only one all-to-all message exchange, followed by a local sort on each node, but the method could be more prone to performance degradation due to load imbalance. Related Work The problem of sorting on hypercube architectures has been the subject of several papers in the last few years. Felten et. al.[2,31 devised a distributed version of the Quicksort algorithm, sometimes called “hyperquicksort” [9 1 that utilizes global splitting points to partition keys into successively smaller subcubes until each range of keys is stored on a single node. 
Since each node stores a distinct range of keys, no global merge is necessary. After each node applies a local quicksort (or other sequential sort) to its data, the sort is complete. Seidel and George 171 explored three variations of the binsort algorithm. Each method begins by assigning a subrange of keys to each node (based on assumed even distribution or on the distribution observed in a sample of the data), and then breaking up each node’s data into subsets destined for every other node. All messages are then sent simultaneously from initial sources to final destinations, using all the dimensions of the hypercube in one step. Each node then applies the quicksort algorithm to its local sub-range of keys. 458 0-8186-2290-3/91/0oO0/0458/$01 .oO Q 1991 IEEE Li and Tung [5] comlpared the performance of three different sorting algorithm on a Symult 2010 and found that the parallel Quicksort outperformed both the Bitonic and Shell sort algorithms for larger problem si~zes (more than 64p where p is the nuimber of processors 01 nodes). Abali, et al, [ 11 developed a load balanced variation on the distributed quicksort algorithm similar to :Seide1 and George’s [7], except that each node performs a quicksort of its own data before the sub-ranges are assigned to each node. This allows the nocles to determine the exact keys that most equally divide the data. An n-way merge is performed on each node after it receives sorted packages of keys in its subrange from the other nodes. Tang [8] implemented a sorting algorithm based on a local Quicksort of each node followed by a global Shell merge. All of these papers concentrate on genecal-purpose sorting algorithms for sorting lists of unknown data distributions. The radix algorithm tends to sacrifice memory efficiency and time for uneven data distributions, but, as a specialized sort algorithm, it can sort data of known distributions very quickly. The Radix Algoritlhm The sequential radix sorting algorithm [4], can be implemented in parallel on a hypercube or any multicomputer with a number of processors that is a power of two, as follows: 1. Allocate a proximately equal numbers of unsorted keys to each of 2 nodes. 2. Each node allocates a section of memory to buffer outgoing keys to be sent tcb every other node (and to itself). 3. Each key from the original data is placed in the buffer for the node whose node identifier is equal to the least significant d bits of tlhe key where d is the dimension of the hypercube. 4. If a buffer is full, it is marked iincomplete and sent ahead to its destination before another key is stored in the buffer. 5. After all the keys are thus partitioned, all the buffers are marked complete and sent to their destinations. An empty buffer is still sent because the receiving node requires a “complete” indication before proceeding to messages from the next sending node. 6. Each node then processes buffeirs received from all nodes, including itself, from node id 0 through 2d-1. The buffers must be handled in order of increasing originating node id to preserve the ordlering of less significant bits. For the same reason all buffers sent from node A b node B in each pass through the datal must be handed in the order in which they were sent. The last buffer from node k-1 must be processed before the lbst buffer from node k can be processed. A consequence of these constraints is that this implementation of the radix sort is stable. B 7. 
The keys are again placed into buckets corresponding to their destination node ids, but this time the next least significant d bits are used for the node id. Steps 4 through 6 are repeated until all bits of the keys have been used If d does not divide the length of the key evenly, the bits considered in the last pass should overlap the bits considered in the previous pass to avoid having any bits that are constantly zero which would lead to load imbalance. 8. The keys are sorted. To enumerate them, one need simply visit nodes 0 through 2d-1, in order, and print the contents of received buffers in originating node id order. A variation of this algorithm uses virtual node ids to decouple the number of bits processed in each step from the hypercube dimension. For example, to process 8 bits in each step on a 32-node system, each physical node would emulate 8 virtual nodes for a total of 256 (or 28) virtual nodes. This variation was necessary to obtain realistic speed-up measurements.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DMCC.1991.633213","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
This implementation of the radix sorting algorithm considers the nodes of the multicomputer to be buckets for receiving keys that correspond with their node identifiers. Sorting a list of 30-bit keys requires six passes on a 32-node hypercube, because five bits are considered in each pass. When the number of buckets is equal to the number of processors, superlinear speedups are obtained because, in addition to assigning smaller subsets of the data to each node, the number of passes required decreases when more bits are considered in each pass. True speedups close to linear are observed when the number of buckets is made independent of the number of processors by permitting multiple buckets per processor, so that a small hypercube can emulate a larger hypercube's ability to consider more bits during each pass through the data. Experiments on an iPSC/860 and the Touchstone Delta Prototype system show that the algorithm is well suited to multicomputer architectures and that it scales well for random distributions of keys.

* Supported in part by: Defense Advanced Research Projects Agency, Information Science and Technology Office, Research in Concurrent Computing Systems, ARPA Order No. 6402, 6402-1; Program Code No. 8E20 & 9E20. Issued by DARPA/CMO under Contract #MDA-972-89-C-0034.

Introduction

The radix sorting algorithm has a time complexity of O(m·n) for n keys, each m bits in length. This time complexity compares favorably with most of the popular O(n log n) algorithms, so radix is often the method of choice. In the context of a parallel machine, this continues to be true as long as the distribution of keys is nearly flat. On a multicomputer, the overhead associated with the straight radix sort [6] is that it requires more than one all-to-all message exchange. The number of exchanges can be up to the number of bits in a single key on a two-node system with a single bucket per node. On the Touchstone Delta prototype system, using 512 (or 2^9) processing nodes, this implementation of the straight radix sort processes 9 bits in each pass through the data, so a 32-bit integer is fully sorted in four passes and only four all-to-all message exchanges are required.

The radix algorithm is sensitive to uneven distributions of keys. If the bit patterns of the keys deviate too far from a random, even distribution, then some node(s) will require disproportionate amounts of memory. Most distributions, in practice, are more random in the low-order bits than the high-order bits. Therefore, this implementation uses the straight radix sort [6], or least significant digit [4], variation of the radix algorithm in order to postpone any load imbalances until the last pass through the data. A radix exchange sort, or most significant digit, implementation of the radix algorithm would require only one all-to-all message exchange, followed by a local sort on each node, but the method could be more prone to performance degradation due to load imbalance.
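To make the pass-count arithmetic in this section concrete, here is a minimal sketch (not taken from the paper; the C function names are illustrative only) of how many passes are needed for keys of a given width when d bits are examined per pass, and which bucket a key falls into on a given pass:

```c
#include <stdio.h>

/* Passes needed to cover key_bits when d bits are examined per pass:
 * the ceiling of key_bits / d. */
static unsigned num_passes(unsigned key_bits, unsigned d)
{
    return (key_bits + d - 1) / d;
}

/* Destination bucket (node id) for `key` on pass `pass`: the pass-th
 * group of d bits, counted from the least significant end. */
static unsigned bucket_of(unsigned key, unsigned pass, unsigned d)
{
    return (key >> (pass * d)) & ((1u << d) - 1u);
}

int main(void)
{
    /* Figures quoted in the text: 30-bit keys at 5 bits per pass need
     * 6 passes; 32-bit keys at 9 bits per pass need 4 passes. */
    printf("%u\n", num_passes(30, 5));        /* prints 6 */
    printf("%u\n", num_passes(32, 9));        /* prints 4 */
    printf("%u\n", bucket_of(0x2A7u, 1, 5));  /* second 5-bit digit */
    return 0;
}
```

On the 512-node Delta configuration described above, d is 9, which is exactly why a 32-bit key is finished after four exchanges.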
Related Work

The problem of sorting on hypercube architectures has been the subject of several papers in the last few years. Felten et al. [2,3] devised a distributed version of the Quicksort algorithm, sometimes called "hyperquicksort" [9], that utilizes global splitting points to partition keys into successively smaller subcubes until each range of keys is stored on a single node. Since each node stores a distinct range of keys, no global merge is necessary. After each node applies a local quicksort (or other sequential sort) to its data, the sort is complete.

Seidel and George [7] explored three variations of the binsort algorithm. Each method begins by assigning a subrange of keys to each node (based on an assumed even distribution or on the distribution observed in a sample of the data), and then breaking up each node's data into subsets destined for every other node. All messages are then sent simultaneously from initial sources to final destinations, using all the dimensions of the hypercube in one step. Each node then applies the quicksort algorithm to its local sub-range of keys.

Li and Tung [5] compared the performance of three different sorting algorithms on a Symult 2010 and found that the parallel Quicksort outperformed both the Bitonic and Shell sort algorithms for larger problem sizes (more than 64p, where p is the number of processors or nodes). Abali et al. [1] developed a load-balanced variation on the distributed quicksort algorithm similar to Seidel and George's [7], except that each node performs a quicksort of its own data before the sub-ranges are assigned to each node. This allows the nodes to determine the exact keys that most equally divide the data. An n-way merge is performed on each node after it receives sorted packages of keys in its subrange from the other nodes. Tang [8] implemented a sorting algorithm based on a local Quicksort of each node followed by a global Shell merge.

All of these papers concentrate on general-purpose sorting algorithms for sorting lists of unknown data distributions. The radix algorithm tends to sacrifice memory efficiency and time for uneven data distributions, but, as a specialized sort algorithm, it can sort data of known distributions very quickly.

The Radix Algorithm

The sequential radix sorting algorithm [4] can be implemented in parallel on a hypercube, or any multicomputer with a number of processors that is a power of two, as follows (a minimal sketch of one pass is given after the list):

1. Allocate approximately equal numbers of unsorted keys to each of the 2^d nodes.
2. Each node allocates a section of memory to buffer outgoing keys to be sent to every other node (and to itself).
3. Each key from the original data is placed in the buffer for the node whose node identifier is equal to the least significant d bits of the key, where d is the dimension of the hypercube.
4. If a buffer is full, it is marked incomplete and sent ahead to its destination before another key is stored in the buffer.
5. After all the keys are thus partitioned, all the buffers are marked complete and sent to their destinations. An empty buffer is still sent, because the receiving node requires a "complete" indication before proceeding to messages from the next sending node.
6. Each node then processes buffers received from all nodes, including itself, from node id 0 through 2^d - 1. The buffers must be handled in order of increasing originating node id to preserve the ordering of less significant bits. For the same reason, all buffers sent from node A to node B in each pass through the data must be handled in the order in which they were sent. The last buffer from node k-1 must be processed before the first buffer from node k can be processed. A consequence of these constraints is that this implementation of the radix sort is stable.
7. The keys are again placed into buckets corresponding to their destination node ids, but this time the next least significant d bits are used for the node id. Steps 4 through 6 are repeated until all bits of the keys have been used. If d does not divide the length of the key evenly, the bits considered in the last pass should overlap the bits considered in the previous pass, to avoid having any bits that are constantly zero, which would lead to load imbalance.
8. The keys are sorted. To enumerate them, one need simply visit nodes 0 through 2^d - 1, in order, and print the contents of received buffers in originating node id order.

A variation of this algorithm uses virtual node ids to decouple the number of bits processed in each step from the hypercube dimension. For example, to process 8 bits in each step on a 32-node system, each physical node would emulate 8 virtual nodes, for a total of 256 (or 2^8) virtual nodes. This variation was necessary to obtain realistic speed-up measurements.
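The list above maps naturally onto a per-pass partition/exchange/drain loop. The sketch below is not the paper's code: it runs in a single process, so the all-to-all exchange of steps 4 through 6 is collapsed into in-memory copies, buffer chunking and virtual node ids are omitted, and every constant and name is illustrative. It does, however, follow the same rules: bucket by the next d bits (step 3), let the last pass overlap the previous one when d does not divide the key length (step 7), and drain buckets in increasing originating-id order so the sort stays stable (step 6).

```c
/* Single-process sketch of the parallel straight (LSD) radix sort described
 * above. Each "node" is simulated by an array slot; real message passing on
 * the iPSC/860 or Delta is replaced by copying between arrays. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define D         3              /* bits per pass (hypercube dimension)  */
#define NODES     (1u << D)      /* 2^d nodes, i.e. buckets              */
#define KEY_BITS  10             /* not a multiple of D: last pass overlaps */
#define NKEYS     16

int main(void)
{
    unsigned keys[NKEYS] = { 841, 12, 777, 1023, 256, 9, 1000, 33,
                             512, 511, 640, 7, 300, 128, 95, 2 };
    unsigned next[NKEYS];
    unsigned passes = (KEY_BITS + D - 1) / D;   /* ceiling of KEY_BITS / D */

    for (unsigned pass = 0; pass < passes; pass++) {
        /* Step 7: the last pass overlaps the previous one so that no
         * examined bit position is constantly zero. */
        unsigned shift = pass * D;
        if (shift + D > KEY_BITS) shift = KEY_BITS - D;

        /* Steps 2-5: bucket every key by destination node id. One array per
         * destination stands in for that node's outgoing buffer. */
        unsigned *bucket[NODES];
        unsigned count[NODES] = { 0 };
        for (unsigned b = 0; b < NODES; b++)
            bucket[b] = malloc(NKEYS * sizeof(unsigned));
        for (unsigned i = 0; i < NKEYS; i++) {
            unsigned dest = (keys[i] >> shift) & (NODES - 1);
            bucket[dest][count[dest]++] = keys[i];
        }

        /* Step 6: drain buffers in increasing originating node id order,
         * preserving arrival order within each bucket, so the pass is
         * stable. Concatenating the buckets plays that role here. */
        unsigned n = 0;
        for (unsigned b = 0; b < NODES; b++) {
            memcpy(next + n, bucket[b], count[b] * sizeof(unsigned));
            n += count[b];
            free(bucket[b]);
        }
        memcpy(keys, next, sizeof(keys));
    }

    for (unsigned i = 0; i < NKEYS; i++)        /* step 8: keys are sorted */
        printf("%u%c", keys[i], i + 1 == NKEYS ? '\n' : ' ');
    return 0;
}
```

In the real implementation each bucket would be an outgoing message buffer, sent to the node (or virtual node) whose id matches the bucket number, possibly in several chunks marked incomplete and then complete as described in steps 4 and 5.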