Accelerating linked-list traversal through near-data processing

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) | Pub Date: 2016-09-11 | DOI: 10.1145/2967938.2967958

B. Hong, Gwangsun Kim, Jung Ho Ahn, Yongkee Kwon, Hongsik Kim, John Kim


Citations: 34

Abstract

Recent advances in memory system design, together with 3D stacking, have made near-data processing (NDP) a more feasible way to accelerate a range of workloads. In this work, we explore NDP for a fundamental operation: linked-list traversal (LLT). We propose a new NDP architecture that neither changes the existing sequential programming model nor requires any modification to the processor microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT to NDP.

We target a system with multiple memory modules (e.g., hybrid memory cube (HMC) modules) interconnected by a memory network. Our initial evaluation shows that simply off-loading LLT computation to near-memory can actually reduce performance because of the additional off-chip memory network channel traversals. We therefore first propose NDP-aware data localization, which exploits locality, including locality within a single memory module and within a memory vault, to minimize latency and improve energy efficiency. To improve overall throughput and maximize parallelism, we further propose batching multiple LLT operations together, amortizing the cost of NDP by using the highly parallel execution of the NDP processing units and the high bandwidth of 3D-stacked DRAM. The combination of NDP-aware data localization and batching can provide significant improvements in performance and energy efficiency compared to host processing.
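To make the role of data localization concrete, the following is a minimal Python sketch, not the authors' architecture or evaluation. It assumes a toy model in which every list node is tagged with a hypothetical memory-module id, and it counts how often a traversal has to move from one module to another. The names `Node`, `build_list`, `traverse`, and `NUM_MODULES` are illustrative assumptions, and the crossing count is only a rough proxy for memory-network traffic, not a latency or energy estimate. Batching is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    key: int
    module: int                     # hypothetical id of the memory module holding this node
    next: Optional["Node"] = None


def build_list(keys: List[int], module_of: Callable[[int], int]) -> Optional[Node]:
    """Build a singly linked list; module_of(i) chooses the module of the i-th node."""
    head = None
    for i in reversed(range(len(keys))):
        head = Node(keys[i], module_of(i), head)
    return head


def traverse(head: Optional[Node], target: int) -> Tuple[bool, int, int]:
    """Pointer-chase until target is found.

    Returns (found, nodes_visited, module_crossings), where a crossing is
    counted each time consecutive nodes live in different modules.
    """
    visited = 0
    crossings = 0
    prev_module = None
    node = head
    while node is not None:
        visited += 1
        if prev_module is not None and node.module != prev_module:
            crossings += 1
        prev_module = node.module
        if node.key == target:
            return True, visited, crossings
        node = node.next
    return False, visited, crossings


NUM_MODULES = 4                     # assumed module count, for illustration only
keys = list(range(16))

# Scattered placement: consecutive nodes land in different modules.
scattered = build_list(keys, lambda i: i % NUM_MODULES)
# Localized placement: consecutive nodes share a module where possible.
localized = build_list(keys, lambda i: i // (len(keys) // NUM_MODULES))

scattered_result = traverse(scattered, 15)
localized_result = traverse(localized, 15)
print("scattered:", scattered_result)   # (True, 16, 15)
print("localized:", localized_result)   # (True, 16, 3)
```

Under these assumptions, both placements visit the same 16 nodes, but the scattered list changes module on every hop while the localized list changes module only three times. This is the kind of locality that the abstract's NDP-aware data localization is described as exploiting; how the paper actually places data is not specified in the abstract.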
