D. Guttman, M. Kandemir, Meenakshi Arunachalam, R. Khanna
{"title":"改进数据预取的机器学习技术","authors":"D. Guttman, M. Kandemir, Meenakshi Arunachalam, R. Khanna","doi":"10.1109/ICEAC.2015.7352208","DOIUrl":null,"url":null,"abstract":"With the advent of teraflop-scale computing on both a single coprocessor and many-core designs, there is tremendous need for techniques to fully utilize the compute power by keeping cores fed with data. Data prefetching has been used as a popular method to hide memory latencies by fetching data proactively before the processor needs the data. Fetching data ahead of time from the memory subsystem into faster caches reduces observable latencies or wait times on the processor end and this improves overall program execution times. We study two types of prefetching techniques that are available on a 61-core Intel Xeon Phi co-processor, namely software (compiler-guided) prefetching and hardware prefetching on a variety of workloads. Using machine learning techniques, we synthesize workload phases and the sequence of phase patterns using raw performance data from hardware counters such as memory bandwidth, miss ratios, prefetches issued, etc. Furthermore, we use performance data from workloads with different impacts and behaviors under various prefetcher settings. 
Our contribution can help in future prefetching design in the following ways: (1) to identify phases within workloads that have different characteristics and behaviors and help dynamically modify prefetch types and intensities to suit the phase; (2) to manage auto setting of prefetcher knobs without great effort from the user; (3) to influence software and hardware prefetching interaction designs in future processors; and (4) to use valuable insights and performance data in many areas such as power provisioning for the nodes in a large cluster to maximize both energy and performance efficiencies.","PeriodicalId":334594,"journal":{"name":"5th International Conference on Energy Aware Computing Systems & Applications","volume":"90 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Machine learning techniques for improved data prefetching\",\"authors\":\"D. Guttman, M. Kandemir, Meenakshi Arunachalam, R. Khanna\",\"doi\":\"10.1109/ICEAC.2015.7352208\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the advent of teraflop-scale computing on both a single coprocessor and many-core designs, there is tremendous need for techniques to fully utilize the compute power by keeping cores fed with data. Data prefetching has been used as a popular method to hide memory latencies by fetching data proactively before the processor needs the data. Fetching data ahead of time from the memory subsystem into faster caches reduces observable latencies or wait times on the processor end and this improves overall program execution times. We study two types of prefetching techniques that are available on a 61-core Intel Xeon Phi co-processor, namely software (compiler-guided) prefetching and hardware prefetching on a variety of workloads. 
Using machine learning techniques, we synthesize workload phases and the sequence of phase patterns using raw performance data from hardware counters such as memory bandwidth, miss ratios, prefetches issued, etc. Furthermore, we use performance data from workloads with different impacts and behaviors under various prefetcher settings. Our contribution can help in future prefetching design in the following ways: (1) to identify phases within workloads that have different characteristics and behaviors and help dynamically modify prefetch types and intensities to suit the phase; (2) to manage auto setting of prefetcher knobs without great effort from the user; (3) to influence software and hardware prefetching interaction designs in future processors; and (4) to use valuable insights and performance data in many areas such as power provisioning for the nodes in a large cluster to maximize both energy and performance efficiencies.\",\"PeriodicalId\":334594,\"journal\":{\"name\":\"5th International Conference on Energy Aware Computing Systems & Applications\",\"volume\":\"90 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-03-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"5th International Conference on Energy Aware Computing Systems & Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICEAC.2015.7352208\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"5th International Conference on Energy Aware Computing Systems & 
Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICEAC.2015.7352208","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Machine learning techniques for improved data prefetching
With the advent of teraflop-scale computing on both single-coprocessor and many-core designs, there is a tremendous need for techniques that fully utilize the available compute power by keeping cores fed with data. Data prefetching is a popular method for hiding memory latency: data is fetched proactively, before the processor needs it, from the memory subsystem into faster caches, which reduces observable latencies (wait times) on the processor side and improves overall program execution time. We study the two types of prefetching available on a 61-core Intel Xeon Phi coprocessor, software (compiler-guided) prefetching and hardware prefetching, across a variety of workloads. Using machine learning techniques, we synthesize workload phases and the sequence of phase patterns from raw hardware-counter data such as memory bandwidth, miss ratios, and prefetches issued. Furthermore, we use performance data from workloads with different impacts and behaviors under various prefetcher settings. Our contributions can help future prefetching design in the following ways: (1) identifying phases within a workload that have distinct characteristics and behaviors, so that prefetch types and intensities can be modified dynamically to suit each phase; (2) setting prefetcher knobs automatically, without great effort from the user; (3) influencing the design of software-hardware prefetching interaction in future processors; and (4) applying these insights and performance data in areas such as power provisioning for the nodes of a large cluster, to maximize both energy and performance efficiency.
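To illustrate why prefetching hides memory latency, the following toy model (not from the paper; the cache model and names are hypothetical) simulates a sequential, next-line hardware prefetcher: on each miss it also pulls the following cache line, so a streaming access pattern misses on only every other line.

```python
# Toy model of a sequential (next-line) hardware prefetcher.
# The "cache" is an unbounded set of resident line tags; on a miss,
# the prefetcher optionally fetches the next line as well.

def run_trace(addresses, line_size=64, prefetch=False):
    cache = set()   # resident cache-line tags
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line not in cache:
            misses += 1
            cache.add(line)
            if prefetch:
                cache.add(line + 1)   # proactively fetch the next line
    return misses

# Streaming read of 1 KiB, one byte at a time: 16 distinct lines touched.
trace = list(range(1024))
print(run_trace(trace))                  # 16 misses without prefetching
print(run_trace(trace, prefetch=True))   # 8 misses: every other line hits
```

Real hardware prefetchers track multiple streams, strides, and prefetch distances, which is exactly why per-phase tuning of their knobs matters.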
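The abstract does not specify which learning algorithm synthesizes the workload phases; as a sketch under that caveat, phase identification from hardware-counter samples can be illustrated with a small k-means clustering over per-interval (memory bandwidth, miss ratio) vectors. All counter values below are synthetic, and the implementation is a minimal stand-in for whatever the paper actually uses.

```python
# Cluster per-interval hardware-counter samples (memory bandwidth in GB/s,
# L2 miss ratio) into workload phases with a tiny k-means; the resulting
# label sequence is the kind of phase pattern a prefetch controller could
# react to by retuning prefetcher knobs.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(samples, k, iters=20):
    # Deterministic init: spread the initial centers across the trace.
    step = len(samples) // k
    centers = [samples[i * step] for i in range(k)]
    labels = [0] * len(samples)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(s, centers[c]))
                  for s in samples]
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels

# Synthetic trace: a bandwidth-bound phase, then a cache-friendly phase.
trace = [(80.0 + 0.5 * i, 0.40) for i in range(10)] + \
        [(5.0 + 0.1 * i, 0.02) for i in range(10)]
labels = kmeans(trace, k=2)
print(labels)  # the two phases receive distinct, internally consistent labels
```

A runtime controller could then map each phase label to a prefetcher setting (e.g., higher intensity for the bandwidth-bound phase, conservative prefetching for the cache-friendly one).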