Ming Liu, Tao Li, Neo Jia, Andrew Currid, Vladimir Troy
{"title":"理解GaaS云中横向扩展通过gpu的虚拟化“税”:一项实证研究","authors":"Ming Liu, Tao Li, Neo Jia, Andrew Currid, Vladimir Troy","doi":"10.1109/HPCA.2015.7056038","DOIUrl":null,"url":null,"abstract":"Pass-through techniques enable virtual machines to directly access hardware GPU resources in an exclusive mode, delivering extraordinary graphics performance for client users in GaaS clouds. However, the virtualization overheads of pass-through GPUs may decrease the frame rate of graphics workloads by reducing the occupancy rate of the GPU working queue. In this work, we make the first attempt to characterize pass-through GPUs running in different consolidation scenarios and uncover the root causes of these overheads. Towards this end, we set up state-of-the-art empirical platforms equipped with NVIDIA GRID GPUs and execute graphics intensive workloads running in GaaS clouds. We first demonstrate the existence of virtualization overheads, which can slow down the GPU command generation rate. Compared with a bare-metal system, the performance of pass-through GPUs degrades 9.0% and 21.5% under a single VM and 8-VMs respectively. We analyze the workflow of Windows display driver model and VMEXIT events distribution and identify four factors (i.e. HLT instruction and idle domain, external interrupt delivery, IOMMU, and memory subsystem) that contribute to the performance degradation. Our evaluation results show that: (1) the VM-VMM context switch caused by a HLT instruction and wake-up interrupt injection of an idle domain result in 66. 7% idle time for a single pass-through GPU; (2) the external interrupt delivery and tasklet processing cause additional overheads. When 8 VMs are consolidated, the interrupt delivery processing time and interrupt frequency rise 30.7% and 127.3%, respectively; (3) the existing IOMMU design scales well with pass-through GPUs; and (4) interactions of domain guest's software stacks impact the hardware prefetching mechanism so that it fails to compensate the rapidly growing LLC miss rate when more pass-through GPU VMs are added. To the best of our knowledge, this is the first work that characterizes pass-through GPU virtualization overheads and underlying reasons. This study highlights valuable insights for improving the performance of future virtualized GPU systems.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"102 1","pages":"259-270"},"PeriodicalIF":0.0000,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Understanding the virtualization \\\"Tax\\\" of scale-out pass-through GPUs in GaaS clouds: An empirical study\",\"authors\":\"Ming Liu, Tao Li, Neo Jia, Andrew Currid, Vladimir Troy\",\"doi\":\"10.1109/HPCA.2015.7056038\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Pass-through techniques enable virtual machines to directly access hardware GPU resources in an exclusive mode, delivering extraordinary graphics performance for client users in GaaS clouds. However, the virtualization overheads of pass-through GPUs may decrease the frame rate of graphics workloads by reducing the occupancy rate of the GPU working queue. In this work, we make the first attempt to characterize pass-through GPUs running in different consolidation scenarios and uncover the root causes of these overheads. 
Towards this end, we set up state-of-the-art empirical platforms equipped with NVIDIA GRID GPUs and execute graphics intensive workloads running in GaaS clouds. We first demonstrate the existence of virtualization overheads, which can slow down the GPU command generation rate. Compared with a bare-metal system, the performance of pass-through GPUs degrades 9.0% and 21.5% under a single VM and 8-VMs respectively. We analyze the workflow of Windows display driver model and VMEXIT events distribution and identify four factors (i.e. HLT instruction and idle domain, external interrupt delivery, IOMMU, and memory subsystem) that contribute to the performance degradation. Our evaluation results show that: (1) the VM-VMM context switch caused by a HLT instruction and wake-up interrupt injection of an idle domain result in 66. 7% idle time for a single pass-through GPU; (2) the external interrupt delivery and tasklet processing cause additional overheads. When 8 VMs are consolidated, the interrupt delivery processing time and interrupt frequency rise 30.7% and 127.3%, respectively; (3) the existing IOMMU design scales well with pass-through GPUs; and (4) interactions of domain guest's software stacks impact the hardware prefetching mechanism so that it fails to compensate the rapidly growing LLC miss rate when more pass-through GPU VMs are added. To the best of our knowledge, this is the first work that characterizes pass-through GPU virtualization overheads and underlying reasons. This study highlights valuable insights for improving the performance of future virtualized GPU systems.\",\"PeriodicalId\":6593,\"journal\":{\"name\":\"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)\",\"volume\":\"102 1\",\"pages\":\"259-270\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-03-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCA.2015.7056038\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2015.7056038","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Understanding the virtualization "Tax" of scale-out pass-through GPUs in GaaS clouds: An empirical study
Pass-through techniques enable virtual machines to directly access hardware GPU resources in exclusive mode, delivering extraordinary graphics performance for client users in GaaS clouds. However, the virtualization overheads of pass-through GPUs may decrease the frame rate of graphics workloads by reducing the occupancy rate of the GPU working queue. In this work, we make the first attempt to characterize pass-through GPUs running in different consolidation scenarios and uncover the root causes of these overheads. Towards this end, we set up state-of-the-art empirical platforms equipped with NVIDIA GRID GPUs and execute graphics-intensive workloads running in GaaS clouds. We first demonstrate the existence of virtualization overheads, which can slow down the GPU command generation rate. Compared with a bare-metal system, the performance of pass-through GPUs degrades by 9.0% and 21.5% under a single VM and 8 VMs, respectively. We analyze the workflow of the Windows Display Driver Model and the distribution of VMEXIT events, and identify four factors (i.e., the HLT instruction and idle domain, external interrupt delivery, the IOMMU, and the memory subsystem) that contribute to the performance degradation. Our evaluation results show that: (1) the VM-VMM context switches caused by HLT instructions and the wake-up interrupt injection of an idle domain result in 66.7% idle time for a single pass-through GPU; (2) external interrupt delivery and tasklet processing cause additional overheads; when 8 VMs are consolidated, the interrupt delivery processing time and interrupt frequency rise by 30.7% and 127.3%, respectively; (3) the existing IOMMU design scales well with pass-through GPUs; and (4) interactions among the guest domains' software stacks impact the hardware prefetching mechanism, so it fails to compensate for the rapidly growing LLC miss rate as more pass-through GPU VMs are added. To the best of our knowledge, this is the first work that characterizes pass-through GPU virtualization overheads and their underlying causes. This study offers valuable insights for improving the performance of future virtualized GPU systems.
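The idle-time and interrupt-frequency figures quoted above come from analyzing per-VM VMEXIT traces. As a rough illustration of that kind of analysis (a minimal sketch, not the authors' tooling or trace format, neither of which appears in this record), the hypothetical Python snippet below takes a list of VMEXIT records and computes the share of a measurement window spent handling HLT exits, i.e., time the vCPU sits idle waiting for a wake-up interrupt to be injected, together with the rate of external-interrupt exits.

```python
# Hypothetical sketch (not the paper's tooling): summarize a vCPU's VMEXIT
# trace to estimate the HLT-induced idle share and the external-interrupt
# exit rate. The (timestamp_us, exit_reason, handling_us) record format is
# an assumption for illustration; real data would come from a hypervisor
# tracer on the measurement host.
from collections import Counter

def summarize_vmexits(trace, window_us):
    """trace: iterable of (timestamp_us, exit_reason, handling_us) tuples.
    window_us: wall-clock length of the observation window, in microseconds."""
    time_by_reason = Counter()
    count_by_reason = Counter()
    for _, reason, handling_us in trace:
        time_by_reason[reason] += handling_us
        count_by_reason[reason] += 1
    hlt_idle_share = time_by_reason["HLT"] / window_us
    ext_int_rate_hz = count_by_reason["EXTERNAL_INTERRUPT"] / (window_us / 1e6)
    return hlt_idle_share, ext_int_rate_hz

if __name__ == "__main__":
    # Toy 1 ms window: one long HLT exit (the guest halts and the VMM schedules
    # away until a wake-up interrupt is injected), plus two short exits forced
    # by external interrupts arriving on the host.
    toy_trace = [
        (0,   "HLT",                660),
        (700, "EXTERNAL_INTERRUPT",  20),
        (800, "EXTERNAL_INTERRUPT",  20),
    ]
    idle_share, int_rate = summarize_vmexits(toy_trace, window_us=1_000)
    print(f"HLT-induced idle share: {idle_share:.1%}")        # 66.0% for this toy trace
    print(f"External-interrupt exit rate: {int_rate:.0f} per second")
```

The toy numbers are invented; the point is only that HLT handling time over the window gives the idle share reported per GPU VM, while counting EXTERNAL_INTERRUPT exits over the window gives the interrupt frequency whose growth under 8-VM consolidation the abstract quantifies.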