{"title":"Using Dynamic Duty Cycle Modulation to Improve Energy Efficiency in High Performance Computing","authors":"Sridutt Bhalachandra, Allan Porterfield, J. Prins","doi":"10.1109/IPDPSW.2015.144","DOIUrl":null,"url":null,"abstract":"Power is increasingly the limiting factor in High Performance Computing (HPC). Growing core counts in each generation increase power and energy demands. In the future, strict power and energy budgets will be used to control the operating costs of supercomputer centers. Every node needs to use energy wisely. Energy efficiency can either be improved by taking less time or running at lower power. In this paper, we use Dynamic Duty Cycle Modulation (DDCM) to improve energy efficiency by improving performance under a power bound. When the power is not capped, DDCM reduces processor power, saving energy and reducing processor temperature. DDCM allows the clock frequency to be controlled for each individual core with very low overhead. Any situation where the individual threads on a processor are exhibiting imbalance, a more balanced execution can be obtained by slowing the \"fast\" threads. We use time between MPI collectives and the waiting time at the collective to determine a thread's \"near optimal\" frequency. All changes are within the MPI library, introducing no user code changes or additional communication/synchronization. To test DDCM, a set of synthetic MPI programs with load imbalance were created. In addition, a couple of HPC MPI benchmarks with load imbalance were examined. In our experiments, DDCM saves up to 13.5% processor energy on one node and 20.8% on 16 nodes. By applying a power cap, DDCM effectively shifts power consumption between cores and improves overall performance. Performance improvements of 6.0% and 5.6% on one and 16 nodes, respectively, were observed.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"141 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2015.144","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 24
Abstract
Power is increasingly the limiting factor in High Performance Computing (HPC). Growing core counts in each generation increase power and energy demands. In the future, strict power and energy budgets will be used to control the operating costs of supercomputer centers, so every node needs to use energy wisely. Energy efficiency can be improved either by taking less time or by running at lower power. In this paper, we use Dynamic Duty Cycle Modulation (DDCM) to improve energy efficiency by improving performance under a power bound. When power is not capped, DDCM reduces processor power, saving energy and lowering processor temperature. DDCM allows the clock frequency of each individual core to be controlled with very low overhead. In any situation where the threads on a processor exhibit imbalance, a more balanced execution can be obtained by slowing the "fast" threads. We use the time between MPI collectives and the waiting time at each collective to determine a thread's "near optimal" frequency. All changes are confined to the MPI library, requiring no user code changes and no additional communication or synchronization. To test DDCM, a set of synthetic MPI programs with load imbalance was created; two HPC MPI benchmarks with load imbalance were also examined. In our experiments, DDCM saves up to 13.5% processor energy on one node and 20.8% on 16 nodes. Under a power cap, DDCM effectively shifts power consumption between cores and improves overall performance: improvements of 6.0% and 5.6% were observed on one and 16 nodes, respectively.
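To make the mechanism concrete, below is a minimal C sketch of the idea the abstract describes: interposing an MPI collective via the PMPI profiling layer and setting a per-core duty-cycle level from the ratio of compute time to wait time. This is not the authors' implementation. It assumes Intel duty-cycle modulation exposed through the IA32_CLOCK_MODULATION MSR (0x19A) written via the Linux msr driver (/dev/cpu/N/msr, which requires root and the msr kernel module), the extended 4-bit duty-cycle field (16 levels of 6.25%, available when CPUID.06H:EAX[5] = 1), and one MPI rank pinned per core. The policy details are assumptions as well: the time spent inside the collective is used as a crude proxy for a rank's wait (slack) time, and MIN_LEVEL is an arbitrary floor.

```c
/* ddcm_allreduce.c -- illustrative DDCM-at-collectives sketch (assumptions
 * noted above; not the paper's code).
 * Build: mpicc -O2 ddcm_allreduce.c -o ddcm_test  (link before libmpi so
 * this MPI_Allreduce shadows the library's symbol)
 */
#define _GNU_SOURCE
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>

#define MSR_CLOCK_MODULATION 0x19A  /* IA32_CLOCK_MODULATION */
#define DDCM_ENABLE          (1u << 4)
#define MIN_LEVEL            4      /* floor of 25% duty cycle (assumption) */

static double t_phase_start = 0.0;  /* start of the current compute phase */

/* Write the duty-cycle MSR of the core this rank is pinned to.
 * level in [1,15] selects a level/16 duty cycle; >= 16 disables modulation. */
static void set_duty_level(int level)
{
    char path[64];
    uint64_t val = (level >= 16) ? 0 : (DDCM_ENABLE | (uint64_t)level);
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", sched_getcpu());
    int fd = open(path, O_WRONLY);
    if (fd < 0) return;             /* no msr driver / no permission: no-op */
    (void)pwrite(fd, &val, sizeof val, MSR_CLOCK_MODULATION);
    close(fd);
}

/* PMPI interposition: at every MPI_Allreduce, pick a new duty level from
 * the compute/(compute + wait) ratio of the phase that just ended. */
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    /* Time since the last collective = this rank's compute time. On the
     * first call t_phase_start is 0, the ratio saturates, and modulation
     * stays disabled. */
    double t_compute = MPI_Wtime() - t_phase_start;

    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    double t_wait = MPI_Wtime() - t0;  /* proxy: time in collective ~ slack */

    /* A rank that waited long was "fast": give it a lower duty cycle. */
    double frac = t_compute / (t_compute + t_wait + 1e-9);
    int level = (int)(frac * 16.0 + 0.5);
    if (level < MIN_LEVEL) level = MIN_LEVEL;
    set_duty_level(level);

    t_phase_start = MPI_Wtime();       /* next compute phase begins */
    return rc;
}
```

Because the wrapper lives entirely at the MPI layer, an application only needs to be relinked against it and run with ranks pinned to cores (e.g., mpirun --bind-to core), matching the abstract's claim that no user code changes or extra communication are introduced. A production version would presumably keep the MSR file descriptor open, smooth the ratio across phases, and interpose the other collectives as well.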