Sorting Large Data Sets with FPGA-Accelerated Samplesort

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date: 2019-04-01 DOI: 10.1109/FCCM.2019.00067

Han Chen, S. Madaminov, M. Ferdman, Peter Milder


Citations: 7

Abstract

Sorting is a fundamental operation in many applications such as databases, search, and social networks. Although FPGAs have been shown to be effective at sorting data sets that fit on chip, systems that sort larger data sets by shuffling data on and off chip are typically bottlenecked by costly merge operations or data transfer time. We propose a new approach to sorting large data sets by accelerating the samplesort algorithm using a server with a PCIe-connected FPGA. Samplesort works by randomly sampling the input to determine how to partition the data into approximately equal-sized, non-overlapping "buckets," sorting each bucket, and concatenating the results. Although samplesort can partition a large problem into smaller ones that fit in the FPGA's on-chip memory, partitioning in software is slow. Our system uses a novel parallel hardware partitioner whose supported data set size is limited only by the available FPGA hardware resources. After partitioning, each bucket is sorted using parallel sorting hardware. The CPU is responsible for sampling data, cleaning up any potential problems caused by variation in bucket size, and providing scalability by performing an initial coarse-grained partitioning when the input set is larger than the FPGA can sort. We prototype our design using Amazon Web Services FPGA instances, which pair a Xilinx Virtex UltraScale+ FPGA with a high-performance server. Our experiments demonstrate a 17.1x speedup over GNU parallel sort when sorting 2^23 key-value records and a speedup of 4.2x when sorting 2^30 records.
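
The samplesort steps described in the abstract (sample, partition into buckets, sort each bucket, concatenate) can be illustrated with a minimal software-only sketch. This is not the authors' implementation: in their system the partitioning and bucket sorting run in FPGA hardware, whereas here both are plain Python stand-ins. The function name `samplesort` and the parameters `num_buckets`, `oversample`, and `seed` are assumptions made for illustration only.

```python
import bisect
import random


def samplesort(records, num_buckets=8, oversample=4, seed=0):
    """Software-only sketch of samplesort; all names are illustrative."""
    # Tiny inputs are not worth partitioning.
    if num_buckets < 2 or len(records) <= num_buckets:
        return sorted(records)

    # 1. Sample: draw a random subset of the input and sort it.
    rng = random.Random(seed)
    sample_size = min(len(records), num_buckets * oversample)
    sample = sorted(rng.sample(records, sample_size))

    # 2. Choose num_buckets - 1 evenly spaced splitters from the sample.
    step = len(sample) / num_buckets
    splitters = [sample[int(i * step)] for i in range(1, num_buckets)]

    # 3. Partition: route each record to the bucket whose range holds it.
    #    Bucket ranges do not overlap, so buckets can be sorted independently.
    buckets = [[] for _ in range(num_buckets)]
    for rec in records:
        buckets[bisect.bisect_right(splitters, rec)].append(rec)

    # 4. Sort each bucket, then concatenate the buckets in splitter order.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

Calling `samplesort(list_of_key_value_tuples)` returns the records in ascending order. Because the splitters come from a random sample, bucket sizes are only approximately equal; a bucket that exceeds a fixed capacity would need separate handling, which is the kind of bucket-size variation the abstract assigns to the CPU.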

