FIFO-Based Hardware Sorters for High Bandwidth Memory

K. Nakano, Yasuaki Ito, J. Bordim
Published in: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Publication date: 2019-05-01
DOI: 10.1109/IPDPSW.2019.00112 (https://doi.org/10.1109/IPDPSW.2019.00112)

Abstract

The main contribution of this paper is to present efficient FIFO-based hardware sorters that sort n elements of w bits each, stored in a high bandwidth memory with modest access latency. We assume that each address of the high bandwidth memory stores p elements of w bits each, all of which can be read or written at the same time. The memory has access latency l: accessing the p elements at a specified address takes l clock cycles. Furthermore, burst mode is supported, so k (≥ 1) consecutive addresses can be accessed in k+l-1 clock cycles in a pipelined fashion. However, if the k addresses are not consecutive, kl clock cycles are necessary to access all of them. Clearly, all n elements arranged in n/p addresses can be duplicated in 2(n/p+l-1) clock cycles. We present two types of hardware sorters that sort n=rc elements stored in an r×c matrix in the high bandwidth memory. We first develop Three-Pass-Sort and Four-Pass-Sort, which sort an r×c matrix by reading from and writing to it three times and four times, respectively. We implement these two algorithms using FIFO-based mergers that can be configured in either pairwise mode or sliding mode. Our hardware sorter based on Three-Pass-Sort runs in 6n/p+3c^2/p^2l+O(c/p(l+log r)+r) clock cycles using a circuit of size O(rwp), provided that r≥c^2. Our hardware sorter based on Four-Pass-Sort runs in 8n/p+2c^2l+O(cl+log r+p) clock cycles using a circuit of size O(rw).
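
The memory access model in the abstract can be written down as a small cost calculator. The following is a minimal Python sketch, not code from the paper: the function names (burst_cycles, scattered_cycles, duplicate_cycles, streaming_leading_term) are hypothetical, and it assumes that p divides n evenly. The last function only reproduces the leading terms 6n/p and 8n/p, on the reading that each pass reads and writes all n/p addresses once; it ignores latency and every lower-order term in the stated running times.

```python
def burst_cycles(k: int, l: int) -> int:
    """Cycles to access k consecutive addresses in burst mode: k + l - 1."""
    if k < 1 or l < 1:
        raise ValueError("k and l must be at least 1")
    return k + l - 1


def scattered_cycles(k: int, l: int) -> int:
    """Cycles to access k non-consecutive addresses: k * l (no pipelining)."""
    if k < 1 or l < 1:
        raise ValueError("k and l must be at least 1")
    return k * l


def duplicate_cycles(n: int, p: int, l: int) -> int:
    """Cycles to copy n elements: one burst read plus one burst write.

    Assumption (not stated in the source): p divides n evenly.
    """
    if n % p != 0:
        raise ValueError("this sketch assumes p divides n")
    addresses = n // p
    return 2 * burst_cycles(addresses, l)  # 2(n/p + l - 1)


def streaming_leading_term(passes: int, n: int, p: int) -> int:
    """Leading term only: each pass reads and writes n/p addresses once.

    Gives 6n/p for three passes and 8n/p for four passes. Latency and the
    lower-order terms of the stated running times are deliberately ignored.
    """
    if n % p != 0:
        raise ValueError("this sketch assumes p divides n")
    return 2 * passes * (n // p)


if __name__ == "__main__":
    n, p, l = 1024, 8, 10  # illustrative values, not taken from the paper
    k = n // p             # 128 addresses
    print("burst access:    ", burst_cycles(k, l))
    print("scattered access:", scattered_cycles(k, l))
    print("duplicate all:   ", duplicate_cycles(n, p, l))
    print("three passes:    ", streaming_leading_term(3, n, p))
    print("four passes:     ", streaming_leading_term(4, n, p))
```

With these illustrative values, reading 128 consecutive addresses costs 137 cycles, while reading 128 scattered addresses costs 1280 cycles. This gap is why an algorithm in this model would favor long consecutive accesses.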