Archit Patke, Saurabh Jha, Haoran Qiu, J. Brandt, A. Gentile, Joe Greenseid, Z. Kalbarczyk, R. Iyer
{"title":"Delay sensitivity-driven congestion mitigation for HPC systems","authors":"Archit Patke, Saurabh Jha, Haoran Qiu, J. Brandt, A. Gentile, Joe Greenseid, Z. Kalbarczyk, R. Iyer","doi":"10.1145/3447818.3460362","DOIUrl":null,"url":null,"abstract":"Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. Delay sensitivity of an application is used to quantify the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism to selectively throttle applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3x while improving the median system utility by 12%.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"116 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447818.3460362","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. Delay sensitivity of an application is used to quantify the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism to selectively throttle applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3x while improving the median system utility by 12%.