D. Guttman, M. Kandemir, Meenakshi Arunachalam, R. Khanna
{"title":"改进数据预取的机器学习技术","authors":"D. Guttman, M. Kandemir, Meenakshi Arunachalam, R. Khanna","doi":"10.1109/ICEAC.2015.7352208","DOIUrl":null,"url":null,"abstract":"With the advent of teraflop-scale computing on both a single coprocessor and many-core designs, there is tremendous need for techniques to fully utilize the compute power by keeping cores fed with data. Data prefetching has been used as a popular method to hide memory latencies by fetching data proactively before the processor needs the data. Fetching data ahead of time from the memory subsystem into faster caches reduces observable latencies or wait times on the processor end and this improves overall program execution times. We study two types of prefetching techniques that are available on a 61-core Intel Xeon Phi co-processor, namely software (compiler-guided) prefetching and hardware prefetching on a variety of workloads. Using machine learning techniques, we synthesize workload phases and the sequence of phase patterns using raw performance data from hardware counters such as memory bandwidth, miss ratios, prefetches issued, etc. Furthermore, we use performance data from workloads with different impacts and behaviors under various prefetcher settings. 
Our contribution can help in future prefetching design in the following ways: (1) to identify phases within workloads that have different characteristics and behaviors and help dynamically modify prefetch types and intensities to suit the phase; (2) to manage auto setting of prefetcher knobs without great effort from the user; (3) to influence software and hardware prefetching interaction designs in future processors; and (4) to use valuable insights and performance data in many areas such as power provisioning for the nodes in a large cluster to maximize both energy and performance efficiencies.","PeriodicalId":334594,"journal":{"name":"5th International Conference on Energy Aware Computing Systems & Applications","volume":"90 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Machine learning techniques for improved data prefetching\",\"authors\":\"D. Guttman, M. Kandemir, Meenakshi Arunachalam, R. Khanna\",\"doi\":\"10.1109/ICEAC.2015.7352208\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the advent of teraflop-scale computing on both a single coprocessor and many-core designs, there is tremendous need for techniques to fully utilize the compute power by keeping cores fed with data. Data prefetching has been used as a popular method to hide memory latencies by fetching data proactively before the processor needs the data. Fetching data ahead of time from the memory subsystem into faster caches reduces observable latencies or wait times on the processor end and this improves overall program execution times. We study two types of prefetching techniques that are available on a 61-core Intel Xeon Phi co-processor, namely software (compiler-guided) prefetching and hardware prefetching on a variety of workloads. 
Using machine learning techniques, we synthesize workload phases and the sequence of phase patterns using raw performance data from hardware counters such as memory bandwidth, miss ratios, prefetches issued, etc. Furthermore, we use performance data from workloads with different impacts and behaviors under various prefetcher settings. Our contribution can help in future prefetching design in the following ways: (1) to identify phases within workloads that have different characteristics and behaviors and help dynamically modify prefetch types and intensities to suit the phase; (2) to manage auto setting of prefetcher knobs without great effort from the user; (3) to influence software and hardware prefetching interaction designs in future processors; and (4) to use valuable insights and performance data in many areas such as power provisioning for the nodes in a large cluster to maximize both energy and performance efficiencies.\",\"PeriodicalId\":334594,\"journal\":{\"name\":\"5th International Conference on Energy Aware Computing Systems & Applications\",\"volume\":\"90 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-03-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"5th International Conference on Energy Aware Computing Systems & Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICEAC.2015.7352208\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"5th International Conference on Energy Aware Computing Systems & 
Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICEAC.2015.7352208","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Machine learning techniques for improved data prefetching
With the advent of teraflop-scale computing on both single-coprocessor and many-core designs, there is a tremendous need for techniques that fully utilize the available compute power by keeping cores fed with data. Data prefetching is a popular method for hiding memory latency: data is fetched proactively, before the processor needs it, from the memory subsystem into faster caches, which reduces observable latencies (wait times) on the processor side and improves overall program execution time. We study the two types of prefetching available on a 61-core Intel Xeon Phi coprocessor, software (compiler-guided) prefetching and hardware prefetching, across a variety of workloads. Using machine learning techniques, we synthesize workload phases and the sequence of phase patterns from raw hardware-counter data such as memory bandwidth, miss ratios, and prefetches issued. Furthermore, we use performance data from workloads with different impacts and behaviors under various prefetcher settings. Our contributions can help future prefetching design in the following ways: (1) identifying phases within a workload that have distinct characteristics and behaviors, so that prefetch types and intensities can be modified dynamically to suit each phase; (2) setting prefetcher knobs automatically, without great effort from the user; (3) influencing the design of software-hardware prefetching interaction in future processors; and (4) applying these insights and performance data in areas such as power provisioning for the nodes of a large cluster, to maximize both energy and performance efficiency.
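To illustrate why prefetching hides memory latency, the following toy model (not from the paper; the cache model and names are hypothetical) simulates a sequential, next-line hardware prefetcher: on each miss it also pulls the following cache line, so a streaming access pattern misses on only every other line.

```python
# Toy model of a sequential (next-line) hardware prefetcher.
# The "cache" is an unbounded set of resident line tags; on a miss,
# the prefetcher optionally fetches the next line as well.

def run_trace(addresses, line_size=64, prefetch=False):
    cache = set()   # resident cache-line tags
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line not in cache:
            misses += 1
            cache.add(line)
            if prefetch:
                cache.add(line + 1)   # proactively fetch the next line
    return misses

# Streaming read of 1 KiB, one byte at a time: 16 distinct lines touched.
trace = list(range(1024))
print(run_trace(trace))                  # 16 misses without prefetching
print(run_trace(trace, prefetch=True))   # 8 misses: every other line hits
```

Real hardware prefetchers track multiple streams, strides, and prefetch distances, which is exactly why per-phase tuning of their knobs matters.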
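The abstract does not specify which learning algorithm synthesizes the workload phases; as a sketch under that caveat, phase identification from hardware-counter samples can be illustrated with a small k-means clustering over per-interval (memory bandwidth, miss ratio) vectors. All counter values below are synthetic, and the implementation is a minimal stand-in for whatever the paper actually uses.

```python
# Cluster per-interval hardware-counter samples (memory bandwidth in GB/s,
# L2 miss ratio) into workload phases with a tiny k-means; the resulting
# label sequence is the kind of phase pattern a prefetch controller could
# react to by retuning prefetcher knobs.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(samples, k, iters=20):
    # Deterministic init: spread the initial centers across the trace.
    step = len(samples) // k
    centers = [samples[i * step] for i in range(k)]
    labels = [0] * len(samples)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(s, centers[c]))
                  for s in samples]
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels

# Synthetic trace: a bandwidth-bound phase, then a cache-friendly phase.
trace = [(80.0 + 0.5 * i, 0.40) for i in range(10)] + \
        [(5.0 + 0.1 * i, 0.02) for i in range(10)]
labels = kmeans(trace, k=2)
print(labels)  # the two phases receive distinct, internally consistent labels
```

A runtime controller could then map each phase label to a prefetcher setting (e.g., higher intensity for the bandwidth-bound phase, conservative prefetching for the cache-friendly one).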