Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM

2015 International Conference on Parallel Architecture and Compilation (PACT). Pub Date: 2015-10-18. DOI: 10.1109/PACT.2015.51

Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, O. Mutlu


Citations: 114

Abstract

Memory channel contention is a critical performance bottleneck in modern systems that have highly parallelized processing units operating on large data sets. The memory channel is contended not only by requests from different user applications (CPU access) but also by system requests for peripheral data (IO access), usually controlled by Direct Memory Access (DMA) engines. Our goal, in this work, is to improve system performance by eliminating memory channel contention between CPU accesses and IO accesses. To this end, we propose a hardware-software cooperative data transfer mechanism, Decoupled DMA (DDMA), that provides a specialized low-cost memory channel for IO accesses. In our DDMA design, main memory has two independent data channels, of which one is connected to the processor (CPU channel) and the other to the IO devices (IO channel), enabling CPU and IO accesses to be served on different channels. System software or the compiler identifies which requests should be handled on the IO channel and communicates this to the DDMA engine, which then initiates the transfers on the IO channel. By doing so, our proposal increases the effective memory channel bandwidth, thereby either accelerating data transfers between system components, or providing opportunities to employ IO performance enhancement techniques (e.g., aggressive IO prefetching) without interfering with CPU accesses. We demonstrate the effectiveness of our DDMA framework in two scenarios: (i) CPU-GPU communication and (ii) in-memory communication (bulk data copy/initialization within the main memory). By effectively decoupling accesses for CPU-GPU communication and in-memory communication from CPU accesses, our DDMA-based design achieves significant performance improvement across a wide variety of system configurations (e.g., 20% average performance improvement on a typical 2-channel 2-rank memory system).
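The intuition behind serving CPU and IO accesses on different channels can be sketched with a toy model. This is only an illustration under simplifying assumptions, not the paper's evaluation methodology: the `Request` type, the burst counts, and both timing functions are hypothetical, and the model ignores DRAM timing constraints, bank conflicts, and scheduling policy.

```python
from dataclasses import dataclass


@dataclass
class Request:
    source: str  # "cpu" or "io" (hypothetical labels)
    bursts: int  # data bursts the request occupies on a channel


def shared_channel_time(requests):
    """One data channel: CPU and IO transfers serialize."""
    return sum(r.bursts for r in requests)


def decoupled_time(requests):
    """Two data channels: CPU and IO transfers proceed in parallel,
    so the busier channel determines the total time."""
    cpu = sum(r.bursts for r in requests if r.source == "cpu")
    io = sum(r.bursts for r in requests if r.source == "io")
    return max(cpu, io)


# Made-up workload, purely for illustration.
requests = [
    Request("cpu", 4),
    Request("io", 3),
    Request("cpu", 2),
    Request("io", 1),
]

print(shared_channel_time(requests))  # 10 bursts when everything shares one channel
print(decoupled_time(requests))       # 6 bursts when IO has its own channel
```

In this simplified view, the benefit is largest when IO traffic is a substantial fraction of total traffic; if IO transfers are rare, the two totals are nearly equal.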



