iCHAT: Inter-cache Hardware-Assistant Data Transfer for Heterogeneous Chip Multiprocessors

2014 9th IEEE International Conference on Networking, Architecture, and Storage
Pub Date: 2014-08-06
DOI: 10.1109/NAS.2014.43

Junli Gu, Bradford M. Beckmann, Ting Cao, Yu Hu


Citations: 3

Abstract

Modern heterogeneous multiprocessors integrate a CPU and a GPU to boost computational performance. Data sharing and communication between the CPU and GPU are critical to the final speedup. Tighter CPU-GPU integration makes it possible to share and move data more efficiently and thus to leverage the computational power that a GPU can provide. Initially, DMA or PCIe devices were used to transfer data between CPU and GPU, with low efficiency and little flexibility. More recently, heterogeneous architectures have adopted a single address space and coherent cache hierarchies to share data more efficiently. This poses a new challenge: understanding the communication overheads in this new context and improving communication efficiency for these architectures. This paper proposes a novel approach called iCHAT (inter-Cache Hardware-Assistant data Transfer) to manage data transfer between the CPU cache and the GPU cache efficiently. iCHAT detects communication patterns and eagerly evicts data from the owner's caches in preparation for the requestor's demand. We implement the iCHAT design in a simulator based on gem5 and an AMD in-house GPU simulator. Experimental results show that communication-related eviction traffic is reduced by 40% on average and total directory traffic by 8% on average. We also conduct a bounding experiment that quantitatively evaluates inter CPU-GPU transfers and requests to communication data; it indicates that iCHAT could achieve an average speedup of 1.4x on the Rodinia benchmark suite and 1.2x on AMD SDK applications.
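The abstract does not give implementation details, so the following is only a toy sketch of the general idea of eager eviction at a CPU-to-GPU handoff, not the iCHAT design itself. The `ToyDirectory` class, the `run_handoff` helper, and the one-probe-per-dirty-line cost model are all assumptions made for illustration; the counts it produces are not results from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDirectory:
    """Hypothetical directory: tracks which cache holds each line dirty."""
    owner: dict = field(default_factory=dict)  # line address -> owner name
    probes: int = 0       # demand-time probes sent to an owner cache
    writebacks: int = 0   # lines written back to the shared level

    def write(self, cache: str, addr: int) -> None:
        # The writer becomes the exclusive dirty owner of the line.
        self.owner[addr] = cache

    def eager_evict(self, cache: str, addrs) -> None:
        # Push dirty lines out of the owner's cache before they are requested.
        for addr in addrs:
            if self.owner.get(addr) == cache:
                del self.owner[addr]
                self.writebacks += 1

    def read(self, cache: str, addr: int) -> None:
        holder = self.owner.get(addr)
        if holder is not None and holder != cache:
            # Demand miss on a line that is dirty elsewhere: probe the owner,
            # which must write the line back before the reader is served.
            self.probes += 1
            self.writebacks += 1
            del self.owner[addr]


def run_handoff(n_lines: int, eager: bool) -> ToyDirectory:
    """CPU produces n_lines, then the GPU consumes them (assumed pattern)."""
    d = ToyDirectory()
    lines = range(n_lines)
    for a in lines:
        d.write("cpu", a)
    if eager:
        # Assumed trigger: the handoff point is known, e.g. a kernel launch.
        d.eager_evict("cpu", lines)
    for a in lines:
        d.read("gpu", a)
    return d


if __name__ == "__main__":
    for eager in (False, True):
        d = run_handoff(8, eager)
        print(f"eager={eager}: probes={d.probes}, writebacks={d.writebacks}")
```

In this toy model both runs write back the same number of lines; the difference is that with eager eviction no probe of the owner's cache remains on the requestor's demand path. The traffic reductions the paper reports come from its own simulator and are not reproduced by this sketch.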

