Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, O. Mutlu
{"title":"Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM","authors":"Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, O. Mutlu","doi":"10.1109/PACT.2015.51","DOIUrl":null,"url":null,"abstract":"Memory channel contention is a critical performance bottleneck in modern systems that have highly parallelized processing units operating on large data sets. The memory channel is contended not only by requests from different user applications (CPU access) but also by system requests for peripheral data (IO access), usually controlled by Direct Memory Access (DMA) engines. Our goal, in this work, is to improve system performance byeliminating memory channel contention between CPU accesses and IO accesses. To this end, we propose a hardware-software cooperative data transfer mechanism, Decoupled DMA (DDMA) that provides a specialized low-cost memory channel for IO accesses. In our DDMA design, main memoryhas two independent data channels, of which one is connected to the processor (CPU channel) and the other to the IO devices (IO channel), enabling CPU and IO accesses to be served on different channels. Systemsoftware or the compiler identifies which requests should be handled on the IO channel and communicates this to the DDMA engine, which then initiates the transfers on the IO channel. By doing so, our proposal increasesthe effective memory channel bandwidth, thereby either accelerating data transfers between system components, or providing opportunities to employ IO performance enhancement techniques (e.g., aggressive IO prefetching)without interfering with CPU accessesWe demonstrate the effectiveness of our DDMA framework in two scenarios: (i) CPU-GPU communication and (ii) in-memory communication (bulk datacopy/initialization within the main memory). By effectively decoupling accesses for CPU-GPU communication and in-memory communication from CPU accesses, our DDMA-based design achieves significant performanceimprovement across a wide variety of system configurations (e.g., 20% average performance improvement on a typical 2-channel 2-rank memory system).","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"114","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2015.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 114
Abstract
Memory channel contention is a critical performance bottleneck in modern systems that have highly parallelized processing units operating on large data sets. The memory channel is contended not only by requests from different user applications (CPU access) but also by system requests for peripheral data (IO access), usually controlled by Direct Memory Access (DMA) engines. Our goal, in this work, is to improve system performance byeliminating memory channel contention between CPU accesses and IO accesses. To this end, we propose a hardware-software cooperative data transfer mechanism, Decoupled DMA (DDMA) that provides a specialized low-cost memory channel for IO accesses. In our DDMA design, main memoryhas two independent data channels, of which one is connected to the processor (CPU channel) and the other to the IO devices (IO channel), enabling CPU and IO accesses to be served on different channels. Systemsoftware or the compiler identifies which requests should be handled on the IO channel and communicates this to the DDMA engine, which then initiates the transfers on the IO channel. By doing so, our proposal increasesthe effective memory channel bandwidth, thereby either accelerating data transfers between system components, or providing opportunities to employ IO performance enhancement techniques (e.g., aggressive IO prefetching)without interfering with CPU accessesWe demonstrate the effectiveness of our DDMA framework in two scenarios: (i) CPU-GPU communication and (ii) in-memory communication (bulk datacopy/initialization within the main memory). By effectively decoupling accesses for CPU-GPU communication and in-memory communication from CPU accesses, our DDMA-based design achieves significant performanceimprovement across a wide variety of system configurations (e.g., 20% average performance improvement on a typical 2-channel 2-rank memory system).