Understanding the virtualization "Tax" of scale-out pass-through GPUs in GaaS clouds: An empirical study

Ming Liu, Tao Li, Neo Jia, Andrew Currid, Vladimir Troy
{"title":"理解GaaS云中横向扩展通过gpu的虚拟化“税”:一项实证研究","authors":"Ming Liu, Tao Li, Neo Jia, Andrew Currid, Vladimir Troy","doi":"10.1109/HPCA.2015.7056038","DOIUrl":null,"url":null,"abstract":"Pass-through techniques enable virtual machines to directly access hardware GPU resources in an exclusive mode, delivering extraordinary graphics performance for client users in GaaS clouds. However, the virtualization overheads of pass-through GPUs may decrease the frame rate of graphics workloads by reducing the occupancy rate of the GPU working queue. In this work, we make the first attempt to characterize pass-through GPUs running in different consolidation scenarios and uncover the root causes of these overheads. Towards this end, we set up state-of-the-art empirical platforms equipped with NVIDIA GRID GPUs and execute graphics intensive workloads running in GaaS clouds. We first demonstrate the existence of virtualization overheads, which can slow down the GPU command generation rate. Compared with a bare-metal system, the performance of pass-through GPUs degrades 9.0% and 21.5% under a single VM and 8-VMs respectively. We analyze the workflow of Windows display driver model and VMEXIT events distribution and identify four factors (i.e. HLT instruction and idle domain, external interrupt delivery, IOMMU, and memory subsystem) that contribute to the performance degradation. Our evaluation results show that: (1) the VM-VMM context switch caused by a HLT instruction and wake-up interrupt injection of an idle domain result in 66. 7% idle time for a single pass-through GPU; (2) the external interrupt delivery and tasklet processing cause additional overheads. When 8 VMs are consolidated, the interrupt delivery processing time and interrupt frequency rise 30.7% and 127.3%, respectively; (3) the existing IOMMU design scales well with pass-through GPUs; and (4) interactions of domain guest's software stacks impact the hardware prefetching mechanism so that it fails to compensate the rapidly growing LLC miss rate when more pass-through GPU VMs are added. To the best of our knowledge, this is the first work that characterizes pass-through GPU virtualization overheads and underlying reasons. This study highlights valuable insights for improving the performance of future virtualized GPU systems.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"102 1","pages":"259-270"},"PeriodicalIF":0.0000,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Understanding the virtualization \\\"Tax\\\" of scale-out pass-through GPUs in GaaS clouds: An empirical study\",\"authors\":\"Ming Liu, Tao Li, Neo Jia, Andrew Currid, Vladimir Troy\",\"doi\":\"10.1109/HPCA.2015.7056038\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Pass-through techniques enable virtual machines to directly access hardware GPU resources in an exclusive mode, delivering extraordinary graphics performance for client users in GaaS clouds. However, the virtualization overheads of pass-through GPUs may decrease the frame rate of graphics workloads by reducing the occupancy rate of the GPU working queue. In this work, we make the first attempt to characterize pass-through GPUs running in different consolidation scenarios and uncover the root causes of these overheads. 
Towards this end, we set up state-of-the-art empirical platforms equipped with NVIDIA GRID GPUs and execute graphics intensive workloads running in GaaS clouds. We first demonstrate the existence of virtualization overheads, which can slow down the GPU command generation rate. Compared with a bare-metal system, the performance of pass-through GPUs degrades 9.0% and 21.5% under a single VM and 8-VMs respectively. We analyze the workflow of Windows display driver model and VMEXIT events distribution and identify four factors (i.e. HLT instruction and idle domain, external interrupt delivery, IOMMU, and memory subsystem) that contribute to the performance degradation. Our evaluation results show that: (1) the VM-VMM context switch caused by a HLT instruction and wake-up interrupt injection of an idle domain result in 66. 7% idle time for a single pass-through GPU; (2) the external interrupt delivery and tasklet processing cause additional overheads. When 8 VMs are consolidated, the interrupt delivery processing time and interrupt frequency rise 30.7% and 127.3%, respectively; (3) the existing IOMMU design scales well with pass-through GPUs; and (4) interactions of domain guest's software stacks impact the hardware prefetching mechanism so that it fails to compensate the rapidly growing LLC miss rate when more pass-through GPU VMs are added. To the best of our knowledge, this is the first work that characterizes pass-through GPU virtualization overheads and underlying reasons. This study highlights valuable insights for improving the performance of future virtualized GPU systems.\",\"PeriodicalId\":6593,\"journal\":{\"name\":\"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)\",\"volume\":\"102 1\",\"pages\":\"259-270\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-03-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCA.2015.7056038\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2015.7056038","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 13

Abstract

Pass-through techniques enable virtual machines to directly access hardware GPU resources in an exclusive mode, delivering extraordinary graphics performance for client users in GaaS clouds. However, the virtualization overheads of pass-through GPUs may decrease the frame rate of graphics workloads by reducing the occupancy rate of the GPU working queue. In this work, we make the first attempt to characterize pass-through GPUs running in different consolidation scenarios and uncover the root causes of these overheads. To this end, we set up state-of-the-art empirical platforms equipped with NVIDIA GRID GPUs and execute graphics-intensive workloads running in GaaS clouds. We first demonstrate the existence of virtualization overheads, which can slow down the GPU command generation rate. Compared with a bare-metal system, the performance of pass-through GPUs degrades by 9.0% and 21.5% under a single VM and 8 VMs, respectively. We analyze the workflow of the Windows display driver model and the distribution of VMEXIT events, and identify four factors (i.e., the HLT instruction and idle domain, external interrupt delivery, the IOMMU, and the memory subsystem) that contribute to the performance degradation. Our evaluation results show that: (1) the VM-VMM context switches caused by HLT instructions and the wake-up interrupt injection of an idle domain result in 66.7% idle time for a single pass-through GPU; (2) external interrupt delivery and tasklet processing cause additional overheads; when 8 VMs are consolidated, the interrupt delivery processing time and interrupt frequency rise by 30.7% and 127.3%, respectively; (3) the existing IOMMU design scales well with pass-through GPUs; and (4) interactions among guest domains' software stacks impact the hardware prefetching mechanism so that it fails to compensate for the rapidly growing LLC miss rate when more pass-through GPU VMs are added. To the best of our knowledge, this is the first work that characterizes pass-through GPU virtualization overheads and their underlying causes. This study highlights valuable insights for improving the performance of future virtualized GPU systems.
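The abstract attributes much of the single-VM overhead to VM-VMM context switches triggered by the guest's HLT instruction and the subsequent wake-up interrupt injection. As a rough illustration of that kind of VMEXIT accounting (not the paper's methodology, which is based on its own NVIDIA GRID platform and may use a different hypervisor), the sketch below samples the aggregate exit counters that a Linux/KVM host exposes under /sys/kernel/debug/kvm/ and reports what share of exits in an interval are HLT-, interrupt-, I/O-, or MMIO-related. The counter file names are standard KVM debugfs statistics, but their availability varies by kernel version and requires root.

```python
#!/usr/bin/env python3
"""Illustrative sketch: break down the causes of VM exits on a KVM host.

Assumes aggregate KVM counters are exposed under /sys/kernel/debug/kvm/
(e.g. exits, halt_exits, irq_exits); run as root on the hypervisor host.
"""

import time
from pathlib import Path

KVM_DEBUGFS = Path("/sys/kernel/debug/kvm")  # standard KVM debugfs location
COUNTERS = ["exits", "halt_exits", "irq_exits", "io_exits", "mmio_exits"]


def read_counters() -> dict[str, int]:
    """Read the selected aggregate KVM exit counters."""
    return {name: int((KVM_DEBUGFS / name).read_text()) for name in COUNTERS}


def sample_exit_mix(interval_s: float = 5.0) -> None:
    """Sample the counters twice and report the per-cause share of new exits."""
    before = read_counters()
    time.sleep(interval_s)
    after = read_counters()

    deltas = {name: after[name] - before[name] for name in COUNTERS}
    total = max(deltas["exits"], 1)  # avoid division by zero on an idle host
    print(f"VM exits in {interval_s:.0f}s: {total}")
    for name in COUNTERS[1:]:
        share = 100.0 * deltas[name] / total
        print(f"  {name:<12} {deltas[name]:>10}  ({share:5.1f}%)")


if __name__ == "__main__":
    sample_exit_mix()
```

A dominant halt_exits share during a nominally busy graphics workload would point to the same idle-domain behavior the paper quantifies; per-VM breakdowns or tools such as perf kvm would be needed for finer-grained analysis.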