A Cost-Effective and Scalable Merge Sorter Tree on FPGAs
T. Usui, Thiem Van Chu, Kenji Kise
2016 Fourth International Symposium on Computing and Networking (CANDAR), 2016-11-01
DOI: 10.1109/CANDAR.2016.0023 (https://doi.org/10.1109/CANDAR.2016.0023)
Citations: 16
Abstract
Sorting is an important computation kernel used in many fields, such as image processing, data compression, and database operations. There have been many attempts to accelerate sorting using FPGAs, most of them based on the merge sort algorithm. Merge sorter trees are tree-structured architectures for large-scale sorting. When a merge sorter tree with K input leaves is used to sort N elements, merge phases are performed recursively, so its time complexity is O(N log_K(N)). Hence, increasing the number of input leaves K is an effective way to achieve higher sorting performance. However, the hardware resource usage is O(K), which makes it difficult to implement a merge sorter tree with many input leaves efficiently. Ito et al. have recently proposed an algorithm that reduces the hardware complexity of a merge sorter tree with K input leaves from O(K) to O(log(K)), but they report evaluation results only for K = 8 and K = 16. In this paper, we propose a cost-effective and scalable merge sorter tree architecture based on their algorithm. We show that our design achieves almost the same performance as the conventional design, whose hardware complexity is O(K). We implement a merge sorter tree with 1,024 input leaves on a Xilinx XC7VX485T-2 FPGA and show that the proposed architecture has 52.4x better logic slice utilization with only 1.31x performance degradation compared with the conventional design. We also succeed in implementing a very large merge sorter tree with 4,096 input leaves, which cannot be implemented using the conventional design. This tree achieves a merging throughput of 149 million 64-bit elements per second while using 1.72% of the slices and 7.48% of the Block RAMs of the FPGA.
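The effect of K on the number of merge phases can be illustrated with a small software model. The Python sketch below is only an illustration under stated assumptions: it is neither the hardware architecture proposed in the paper nor the algorithm of Ito et al., and the function name `kway_merge_sort` and the use of `heapq.merge` as a stand-in for a K-leaf merge tree are hypothetical choices made for this example. It counts how many full passes over the data are needed for a given K, which is the log_K(N) factor in the complexity.

```python
import heapq
import random


def kway_merge_sort(data, k):
    """Sort `data` with repeated K-way merge passes.

    Returns (sorted_list, passes). Hypothetical helper for illustration;
    heapq.merge stands in for a merge sorter tree with k input leaves.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    # Start from runs of length 1; a single element is trivially sorted.
    runs = [[x] for x in data]
    passes = 0
    while len(runs) > 1:
        # One merge phase: each group of up to k sorted runs becomes one run.
        runs = [
            list(heapq.merge(*runs[i:i + k]))
            for i in range(0, len(runs), k)
        ]
        passes += 1
    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    random.seed(0)
    values = [random.randrange(1 << 16) for _ in range(4096)]
    for k in (2, 8, 16, 1024, 4096):
        result, passes = kway_merge_sort(values, k)
        assert result == sorted(values)
        print(f"K={k}: {passes} merge passes over N={len(values)} elements")
```

For N = 4,096 the model needs 12 passes with K = 2, 4 with K = 8, 3 with K = 16, 2 with K = 1,024, and a single pass with K = 4,096. Each pass touches all N elements, so fewer passes means less total data movement, which is why a tree with many input leaves is attractive if its hardware cost can be kept low.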