高效调度内存密集型GPGPU工作负载

2014 Design, Automation & Test in Europe Conference & Exhibition (DATE) Pub Date : 2014-03-24 DOI:10.7873/DATE.2014.032

Seokwoo Song, Minseok Lee, John Kim, Woong Seo, Yeon-Gon Cho, Soojung Ryu

{"title":"高效调度内存密集型GPGPU工作负载","authors":"Seokwoo Song, Minseok Lee, John Kim, Woong Seo, Yeon-Gon Cho, Soojung Ryu","doi":"10.7873/DATE.2014.032","DOIUrl":null,"url":null,"abstract":"High performance for a GPGPU workload is obtained by maximizing parallelism and fully utilizing the available resources. However, this is not necessarily energy efficient, especially for memory-intensive GPGPU workloads. In this work, we propose Throttle CTA (cooperative-thread array) Scheduling (TCS) where we leverage two type of throttling - throttling the number of actives cores and throttling of warp execution in the cores - to improve energy-efficiency for memory-intensive GPGPU workloads. The algorithm requires the global CTA or thread block scheduler to reduce the number of cores with assigned thread blocks while leveraging the local warp scheduler to throttle memory requests for some of the cores to further reduce power consumption. The proposed TCS scheduling does not require off-line analysis but can be done dynamically during execution. Instead of relying on conventional metrics such as miss-per-kilo-instruction (MPKI), we leverage the memory access latency metric to determine the memory intensity of the workloads. Our evaluations show that TCS reduces energy by up to 48% (38% on average) across different memory-intensive workload while having very little impact on performance for compute-intensive workloads.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"61 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Energy-efficient scheduling for memory-intensive GPGPU workloads\",\"authors\":\"Seokwoo Song, Minseok Lee, John Kim, Woong Seo, Yeon-Gon Cho, Soojung Ryu\",\"doi\":\"10.7873/DATE.2014.032\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"High performance for a GPGPU workload is obtained by maximizing parallelism and fully utilizing the available resources. However, this is not necessarily energy efficient, especially for memory-intensive GPGPU workloads. In this work, we propose Throttle CTA (cooperative-thread array) Scheduling (TCS) where we leverage two type of throttling - throttling the number of actives cores and throttling of warp execution in the cores - to improve energy-efficiency for memory-intensive GPGPU workloads. The algorithm requires the global CTA or thread block scheduler to reduce the number of cores with assigned thread blocks while leveraging the local warp scheduler to throttle memory requests for some of the cores to further reduce power consumption. The proposed TCS scheduling does not require off-line analysis but can be done dynamically during execution. Instead of relying on conventional metrics such as miss-per-kilo-instruction (MPKI), we leverage the memory access latency metric to determine the memory intensity of the workloads. Our evaluations show that TCS reduces energy by up to 48% (38% on average) across different memory-intensive workload while having very little impact on performance for compute-intensive workloads.\",\"PeriodicalId\":6550,\"journal\":{\"name\":\"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)\",\"volume\":\"61 1\",\"pages\":\"1-6\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-03-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.7873/DATE.2014.032\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.7873/DATE.2014.032","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 13

摘要

通过最大化并行性和充分利用可用资源来获得GPGPU工作负载的高性能。然而，这并不一定是节能的，特别是对于内存密集型的GPGPU工作负载。在这项工作中，我们提出节流CTA(合作线程阵列)调度(TCS)，其中我们利用两种类型的节流-节流活动内核的数量和节流内核中的warp执行-来提高内存密集型GPGPU工作负载的能源效率。该算法要求全局CTA或线程块调度器减少分配线程块的内核数量，同时利用本地warp调度器限制某些内核的内存请求，以进一步降低功耗。提出的TCS调度不需要离线分析，但可以在执行过程中动态完成。我们利用内存访问延迟度量来确定工作负载的内存强度，而不是依赖于诸如每千指令缺失量(MPKI)之类的传统度量。我们的评估表明，在不同的内存密集型工作负载中，TCS最多可减少48%(平均38%)的能耗，而对计算密集型工作负载的性能影响很小。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Energy-efficient scheduling for memory-intensive GPGPU workloads

High performance for a GPGPU workload is obtained by maximizing parallelism and fully utilizing the available resources. However, this is not necessarily energy efficient, especially for memory-intensive GPGPU workloads. In this work, we propose Throttle CTA (cooperative-thread array) Scheduling (TCS) where we leverage two type of throttling - throttling the number of actives cores and throttling of warp execution in the cores - to improve energy-efficiency for memory-intensive GPGPU workloads. The algorithm requires the global CTA or thread block scheduler to reduce the number of cores with assigned thread blocks while leveraging the local warp scheduler to throttle memory requests for some of the cores to further reduce power consumption. The proposed TCS scheduling does not require off-line analysis but can be done dynamically during execution. Instead of relying on conventional metrics such as miss-per-kilo-instruction (MPKI), we leverage the memory access latency metric to determine the memory intensity of the workloads. Our evaluations show that TCS reduces energy by up to 48% (38% on average) across different memory-intensive workload while having very little impact on performance for compute-intensive workloads.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)

自引率

0.00%

发文量