B. Hong, Gwangsun Kim, Jung Ho Ahn, Yongkee Kwon, Hongsik Kim, John Kim
{"title":"通过近数据处理加速链表遍历","authors":"B. Hong, Gwangsun Kim, Jung Ho Ahn, Yongkee Kwon, Hongsik Kim, John Kim","doi":"10.1145/2967938.2967958","DOIUrl":null,"url":null,"abstract":"Recent technology advances in memory system design, along with 3D stacking, have made near-data processing (NDP) more feasible to accelerate different workloads. In this work, we explore near-data processing for a fundamental operation - linked-list traversal (LLT). We propose a new NDP architecture that does not change the existing sequential programming model and does not require any modification to the processor microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT for NDP. We leverage a system with multiple memory modules (e.g., hybrid memory cube (HMC) modules) interconnected with a memory network and our initial evaluation shows that simply off-loading LLT computation to near-memory can actually reduce performance because of the additional off-chip memory network channel traversals. Thus, we first propose NDP-aware data localization to exploit locality - including locality within a single memory module and memory vault - to minimize latency and improve energy efficiency. In order to improve overall throughput and maximize parallelism, we propose batching multiple LLT operations together to amortize the cost of NDP by utilizing the highly parallel execution of NDP processing units and the high bandwidth of 3D stacked DRAM. The combination of NDP-aware data localization and batching can provide significant improvement in performance and energy efficiency compared to host-processing.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"11 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":"{\"title\":\"Accelerating linked-list traversal through near-data processing\",\"authors\":\"B. Hong, Gwangsun Kim, Jung Ho Ahn, Yongkee Kwon, Hongsik Kim, John Kim\",\"doi\":\"10.1145/2967938.2967958\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent technology advances in memory system design, along with 3D stacking, have made near-data processing (NDP) more feasible to accelerate different workloads. In this work, we explore near-data processing for a fundamental operation - linked-list traversal (LLT). We propose a new NDP architecture that does not change the existing sequential programming model and does not require any modification to the processor microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT for NDP. We leverage a system with multiple memory modules (e.g., hybrid memory cube (HMC) modules) interconnected with a memory network and our initial evaluation shows that simply off-loading LLT computation to near-memory can actually reduce performance because of the additional off-chip memory network channel traversals. Thus, we first propose NDP-aware data localization to exploit locality - including locality within a single memory module and memory vault - to minimize latency and improve energy efficiency. In order to improve overall throughput and maximize parallelism, we propose batching multiple LLT operations together to amortize the cost of NDP by utilizing the highly parallel execution of NDP processing units and the high bandwidth of 3D stacked DRAM. The combination of NDP-aware data localization and batching can provide significant improvement in performance and energy efficiency compared to host-processing.\",\"PeriodicalId\":407717,\"journal\":{\"name\":\"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)\",\"volume\":\"11 2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"34\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2967938.2967958\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2967938.2967958","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Accelerating linked-list traversal through near-data processing
Recent technology advances in memory system design, along with 3D stacking, have made near-data processing (NDP) more feasible to accelerate different workloads. In this work, we explore near-data processing for a fundamental operation - linked-list traversal (LLT). We propose a new NDP architecture that does not change the existing sequential programming model and does not require any modification to the processor microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT for NDP. We leverage a system with multiple memory modules (e.g., hybrid memory cube (HMC) modules) interconnected with a memory network and our initial evaluation shows that simply off-loading LLT computation to near-memory can actually reduce performance because of the additional off-chip memory network channel traversals. Thus, we first propose NDP-aware data localization to exploit locality - including locality within a single memory module and memory vault - to minimize latency and improve energy efficiency. In order to improve overall throughput and maximize parallelism, we propose batching multiple LLT operations together to amortize the cost of NDP by utilizing the highly parallel execution of NDP processing units and the high bandwidth of 3D stacked DRAM. The combination of NDP-aware data localization and batching can provide significant improvement in performance and energy efficiency compared to host-processing.