A Cost-Effective and Scalable Merge Sorter Tree on FPGAs

2016 Fourth International Symposium on Computing and Networking (CANDAR), Pub Date: 2016-11-01, DOI: 10.1109/CANDAR.2016.0023

T. Usui, Thiem Van Chu, Kenji Kise


Citations: 16

Abstract

Sorting is an important computation kernel used in many fields, such as image processing, data compression, and database operations. There have been many attempts to accelerate sorting using FPGAs, most of them based on the merge sort algorithm. Merge sorter trees are tree-structured architectures for large-scale sorting. If a merge sorter tree with K input leaves merges N elements, the merge phases are performed recursively, so its time complexity is O(N log_K(N)). Hence, increasing the number of input leaves K is an effective way to achieve higher sorting performance. However, the hardware resource usage is O(K), which makes it difficult to implement a merge sorter tree with many input leaves efficiently. Ito et al. have recently proposed an algorithm that can reduce the hardware complexity of a merge sorter tree with K input leaves from O(K) to O(log(K)). However, they report evaluation results only for K = 8 and K = 16. In this paper, we propose a cost-effective and scalable merge sorter tree architecture based on their algorithm. We show that our design achieves almost the same performance as the conventional design, whose hardware complexity is O(K). We implement a merge sorter tree with 1,024 input leaves on a Xilinx XC7VX485T-2 FPGA and show that the proposed architecture has 52.4x better logic slice utilization with only 1.31x performance degradation compared with the conventional design. We also succeed in implementing a very large merge sorter tree with 4,096 input leaves, which cannot be implemented using the conventional design. This tree achieves a merging throughput of 149 million 64-bit elements per second while using 1.72% of the slices and 7.48% of the Block RAMs of the FPGA.
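The relationship between the number of input leaves K and the number of merge phases can be illustrated with a small software model. The sketch below is only an assumption-laden illustration in Python: it is not the proposed hardware architecture and not the algorithm of Ito et al. The function names `kway_merge_sort` and `expected_passes` are hypothetical, and the model assumes the initial sorted runs have length 1. Each phase reads every element once, so the total work is proportional to N times the number of phases, which is ceil(log_K(N)).

```python
import heapq


def kway_merge_sort(data, k):
    """Sort `data` with repeated k-way merge phases.

    Returns (sorted_list, number_of_phases). Software model only:
    heapq.merge stands in for a merge tree with k input leaves.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    # Assumption: start from runs of length 1 (each is trivially sorted).
    runs = [[x] for x in data]
    phases = 0
    while len(runs) > 1:
        # One merge phase: each group of up to k sorted runs becomes one run.
        runs = [list(heapq.merge(*runs[i:i + k]))
                for i in range(0, len(runs), k)]
        phases += 1
    return (runs[0] if runs else []), phases


def expected_passes(n, k):
    """Smallest p with k**p >= n, i.e. ceil(log_k(n)), using integers only."""
    phases, runs = 0, n
    while runs > 1:
        runs = -(-runs // k)  # ceiling division: runs left after one phase
        phases += 1
    return phases
```

In this model, sorting N = 4,096 elements takes 12 phases with K = 2, 3 phases with K = 16, and a single phase with K = 4,096, which is why a larger K is attractive as long as the tree still fits on the device.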

