W. Heirman, Trevor E. Carlson, K. V. Craeynest, I. Hur, A. Jaleel, L. Eeckhout
2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014-02-01
DOI: 10.1109/HPCA.2014.6835975 (https://doi.org/10.1109/HPCA.2014.6835975)
Citations: 25
Undersubscribed threading on clustered cache architectures
Recent many-core processors such as Intel's Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while treating energy efficiency as a first-order design constraint. The traditional belief is that fully utilizing all available cores yields the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth, and show that undersubscription, that is, leaving some cores unused, often yields significant gains in both performance and energy efficiency. Based on a detailed shared working set analysis, we make the case for clustered cache architectures as an efficient design point that exploits both data sharing and undersubscription while providing low latency and ease of implementation in many-core processors. We then propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST), which dynamically matches an application's working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and by up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the performance optimum under the cores-versus-cache area tradeoff towards design points with more cores and less cache.
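To make the matching idea concrete, the following is a minimal sketch of choosing a per-cluster thread count so that the combined working set fits the cluster's cache and the bandwidth demand stays within budget. It is an illustrative toy model, not the CRUST algorithm itself, which the abstract does not spell out. The function name, all parameters, and the linear cost model (one shared working set plus one private working set per thread, bandwidth demand proportional to thread count) are assumptions introduced here.

```python
def pick_threads_per_cluster(cores_per_cluster, cluster_cache_kb,
                             private_ws_kb, shared_ws_kb,
                             bw_per_thread_gbs, cluster_bw_gbs):
    """Return the largest thread count per cluster that fits both budgets.

    Toy model (an assumption, not the paper's method): the combined
    working set is the shared part plus one private part per thread,
    and bandwidth demand grows linearly with the thread count.
    """
    # Try full subscription first, then progressively undersubscribe.
    for threads in range(cores_per_cluster, 0, -1):
        fits_cache = shared_ws_kb + threads * private_ws_kb <= cluster_cache_kb
        fits_bw = threads * bw_per_thread_gbs <= cluster_bw_gbs
        if fits_cache and fits_bw:
            return threads
    # Nothing fits: fall back to a single thread per cluster.
    return 1


if __name__ == "__main__":
    # Hypothetical cluster: 4 cores sharing 1 MB of cache and 8 GB/s.
    print(pick_threads_per_cluster(4, 1024, 300, 100, 1.0, 8.0))
```

In this model a cache-bound application (large private working sets) and a bandwidth-bound one (high per-thread traffic) both end up with fewer threads than cores, which is the undersubscription effect the abstract describes.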