FIFO-Based Hardware Sorters for High Bandwidth Memory

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2019-05-01. DOI: 10.1109/IPDPSW.2019.00112

K. Nakano, Yasuaki Ito, J. Bordim



Abstract



The main contribution of this paper is to show efficient FIFO-based hardware sorters that sort n elements of w bits each, stored in a high bandwidth memory with modest access latency. We assume that each address of the high bandwidth memory stores p elements of w bits each, which can be read or written at the same time. Accessing the p elements at a specified address takes l clock cycles, where l is the access latency. Furthermore, burst mode is supported: k (≥ 1) consecutive addresses can be accessed in k+l-1 clock cycles in a pipelined fashion. However, if the k addresses are not consecutive, kl clock cycles are necessary to access all of them. Clearly, all n elements arranged in n/p consecutive addresses can be duplicated in 2(n/p+l-1) clock cycles: one burst read followed by one burst write. We present two types of hardware sorters that sort n=rc elements stored in an r×c matrix in the high bandwidth memory. We first develop Three-Pass-Sort and Four-Pass-Sort, which sort an r×c matrix by reading from and writing to it three times and four times, respectively. We then implement these two algorithms using FIFO-based mergers that can be configured in pairwise mode or sliding mode. Our hardware sorter based on Three-Pass-Sort runs in 6n/p+3c^2/p^2l+O(c/p(l+log r)+r) clock cycles using a circuit of size O(rwp), provided that r≥c^2. Also, our hardware sorter based on Four-Pass-Sort runs in 8n/p+2c^2l+O(cl+log r+p) clock cycles using a circuit of size O(rw).
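The memory access cost model stated in the abstract can be written down as a small calculator. The following is only a sketch: the function names are hypothetical, and rounding n/p up when p does not divide n is an assumption that the abstract does not address.

```python
def burst_cycles(k: int, l: int) -> int:
    """Cycles to access k consecutive addresses in burst (pipelined) mode."""
    if k < 1 or l < 1:
        raise ValueError("k and l must be at least 1")
    return k + l - 1


def scattered_cycles(k: int, l: int) -> int:
    """Cycles to access k non-consecutive addresses, each paying full latency l."""
    if k < 1 or l < 1:
        raise ValueError("k and l must be at least 1")
    return k * l


def copy_cycles(n: int, p: int, l: int) -> int:
    """Cycles to duplicate n elements: one burst read plus one burst write.

    Assumption (not from the abstract): n/p is rounded up when p does not divide n.
    """
    addresses = -(-n // p)  # integer ceiling of n / p
    return 2 * burst_cycles(addresses, l)


if __name__ == "__main__":
    n, p, l = 1024, 16, 8  # illustrative values only
    k = n // p             # 64 addresses hold all n elements
    print("burst read of all addresses:", burst_cycles(k, l))
    print("scattered read of all addresses:", scattered_cycles(k, l))
    print("duplicate all elements:", copy_cycles(n, p, l))
```

With these illustrative values, a burst over 64 consecutive addresses costs 71 cycles while 64 scattered accesses cost 512, which is why the running times above separate the streaming terms (multiples of n/p) from the terms multiplied by the latency l.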
