基于MinHash草图的高度并行近内存加速器

2022 IEEE 35th International System-on-Chip Conference (SOCC) Pub Date : 2022-09-05 DOI:10.1109/SOCC56010.2022.9908115

Aman Sinha, Jhih-Yong Mai, B. Lai

{"title":"基于MinHash草图的高度并行近内存加速器","authors":"Aman Sinha, Jhih-Yong Mai, B. Lai","doi":"10.1109/SOCC56010.2022.9908115","DOIUrl":null,"url":null,"abstract":"Genome Assembly is an important Big Data analytics which involves massive computations for similarity searches on sequence databases. Being major component of runtime, similarity searches require careful design for scalable performance. MinHash Sketching is an extensively used data structure in Long-read genome assembly pipelines, which involves generating, randomizing and minimizing a set of hashes for all the k-mers in genome sequences. Compute-hungry MinHash sketch processing on commercially available multi-threaded CPUs suffer from the limited bandwidth of the L1-cache, which causes the CPUs to stall. Near-Data Processing (NDP) is an emerging trend in data-bound Big Data analytics to harness the low-latency, highbandwidth available within the Dual In-line Memory Modules (DIMMs). While NDP architectures have generally been utilized for memory-bound computations, MinHash sketching is a potential application that can gain massive throughput by exploiting memory Banks as higher bandwidth L1-cache.In this work, we propose MSIM, a distributed, highly parallel and efficient hardware-software co-design for accelerating MinHash Sketch processing on light-weight components placed on the DRAM hierarchy. Multiple ASIC-based Processing Engines (PEs) placed at the bank-group-level in MSIM provide highparallelism for low-latency computations. The PEs sequentially access data from all Banks within their bank-group with the help of a dedicated Address calculator, which utilizes an optimal data mapping scheme. The PEs are controlled by a custom Arbiter, which is directly activated by the host CPU using general DDR commands, without requiring any modification to the memory controller or the DIMM standard buses. MSIM requires limited area and power overheads, while displaying up-to 384.9x speedup and 1088.4x energy reduction compared to the baseline multithreaded software solution in our experiments. MSIM achieves 4.26x speedup over high-end GPU, while consuming 26.4x lesser energy. Moreover, MSIM design is highly scalable and extendable in nature.","PeriodicalId":431451,"journal":{"name":"2022 IEEE 35th International System-on-Chip Conference (SOCC)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MSIM: A Highly Parallel Near-Memory Accelerator for MinHash Sketch\",\"authors\":\"Aman Sinha, Jhih-Yong Mai, B. Lai\",\"doi\":\"10.1109/SOCC56010.2022.9908115\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Genome Assembly is an important Big Data analytics which involves massive computations for similarity searches on sequence databases. Being major component of runtime, similarity searches require careful design for scalable performance. MinHash Sketching is an extensively used data structure in Long-read genome assembly pipelines, which involves generating, randomizing and minimizing a set of hashes for all the k-mers in genome sequences. Compute-hungry MinHash sketch processing on commercially available multi-threaded CPUs suffer from the limited bandwidth of the L1-cache, which causes the CPUs to stall. Near-Data Processing (NDP) is an emerging trend in data-bound Big Data analytics to harness the low-latency, highbandwidth available within the Dual In-line Memory Modules (DIMMs). While NDP architectures have generally been utilized for memory-bound computations, MinHash sketching is a potential application that can gain massive throughput by exploiting memory Banks as higher bandwidth L1-cache.In this work, we propose MSIM, a distributed, highly parallel and efficient hardware-software co-design for accelerating MinHash Sketch processing on light-weight components placed on the DRAM hierarchy. Multiple ASIC-based Processing Engines (PEs) placed at the bank-group-level in MSIM provide highparallelism for low-latency computations. The PEs sequentially access data from all Banks within their bank-group with the help of a dedicated Address calculator, which utilizes an optimal data mapping scheme. The PEs are controlled by a custom Arbiter, which is directly activated by the host CPU using general DDR commands, without requiring any modification to the memory controller or the DIMM standard buses. MSIM requires limited area and power overheads, while displaying up-to 384.9x speedup and 1088.4x energy reduction compared to the baseline multithreaded software solution in our experiments. MSIM achieves 4.26x speedup over high-end GPU, while consuming 26.4x lesser energy. Moreover, MSIM design is highly scalable and extendable in nature.\",\"PeriodicalId\":431451,\"journal\":{\"name\":\"2022 IEEE 35th International System-on-Chip Conference (SOCC)\",\"volume\":\"42 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE 35th International System-on-Chip Conference (SOCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SOCC56010.2022.9908115\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 35th International System-on-Chip Conference (SOCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SOCC56010.2022.9908115","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

基因组组装是一种重要的大数据分析方法，需要对序列数据库进行大量的相似性搜索。作为运行时的主要组成部分，相似性搜索需要仔细设计以实现可伸缩的性能。MinHash草图是长读基因组组装管道中广泛使用的数据结构，它涉及为基因组序列中的所有k-mers生成，随机化和最小化一组哈希。在商用多线程cpu上进行需要大量计算的MinHash草图处理会受到l1缓存带宽的限制，这会导致cpu停止运行。近数据处理(NDP)是数据绑定大数据分析的一个新兴趋势，它利用了双内联内存模块(dimm)内的低延迟、高带宽。虽然NDP架构通常用于内存约束计算，但MinHash草图是一个潜在的应用程序，可以通过利用内存库作为更高带宽的l1缓存来获得大量吞吐量。在这项工作中，我们提出了MSIM，一种分布式，高度并行和高效的软硬件协同设计，用于加速放在DRAM层次结构上的轻量级组件的MinHash草图处理。MSIM中位于银行组级别的多个基于asic的处理引擎(pe)为低延迟计算提供了高并行性。pe在专用地址计算器的帮助下依次访问其银行集团内所有银行的数据，该计算器利用最佳数据映射方案。pe由自定义Arbiter控制，主机CPU使用通用DDR命令直接激活该仲裁器，不需要修改内存控制器或DIMM标准总线。在我们的实验中，与基线多线程软件解决方案相比，MSIM需要有限的面积和功耗开销，同时显示高达384.9倍的加速和1088.4倍的能耗降低。与高端GPU相比，MSIM实现了4.26倍的加速，同时能耗降低26.4倍。此外，MSIM设计在本质上具有高度可扩展性和可扩展性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

MSIM: A Highly Parallel Near-Memory Accelerator for MinHash Sketch

Genome Assembly is an important Big Data analytics which involves massive computations for similarity searches on sequence databases. Being major component of runtime, similarity searches require careful design for scalable performance. MinHash Sketching is an extensively used data structure in Long-read genome assembly pipelines, which involves generating, randomizing and minimizing a set of hashes for all the k-mers in genome sequences. Compute-hungry MinHash sketch processing on commercially available multi-threaded CPUs suffer from the limited bandwidth of the L1-cache, which causes the CPUs to stall. Near-Data Processing (NDP) is an emerging trend in data-bound Big Data analytics to harness the low-latency, highbandwidth available within the Dual In-line Memory Modules (DIMMs). While NDP architectures have generally been utilized for memory-bound computations, MinHash sketching is a potential application that can gain massive throughput by exploiting memory Banks as higher bandwidth L1-cache.In this work, we propose MSIM, a distributed, highly parallel and efficient hardware-software co-design for accelerating MinHash Sketch processing on light-weight components placed on the DRAM hierarchy. Multiple ASIC-based Processing Engines (PEs) placed at the bank-group-level in MSIM provide highparallelism for low-latency computations. The PEs sequentially access data from all Banks within their bank-group with the help of a dedicated Address calculator, which utilizes an optimal data mapping scheme. The PEs are controlled by a custom Arbiter, which is directly activated by the host CPU using general DDR commands, without requiring any modification to the memory controller or the DIMM standard buses. MSIM requires limited area and power overheads, while displaying up-to 384.9x speedup and 1088.4x energy reduction compared to the baseline multithreaded software solution in our experiments. MSIM achieves 4.26x speedup over high-end GPU, while consuming 26.4x lesser energy. Moreover, MSIM design is highly scalable and extendable in nature.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2022 IEEE 35th International System-on-Chip Conference (SOCC)

自引率

0.00%

发文量