Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription

Hang Huang, J. Rao, Song Wu, Hai Jin, Hong Jiang, Hao Che, Xiaofeng Wu
{"title":"Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription","authors":"Hang Huang, J. Rao, Song Wu, Hai Jin, Hong Jiang, Hao Che, Xiaofeng Wu","doi":"10.1145/3431379.3460641","DOIUrl":null,"url":null,"abstract":"Elasticity is an essential feature of cloud computing, which allows users to dynamically add or remove resources in response to workload changes. However, building applications that truly exploit elasticity is non-trivial. Traditional applications need to be modified to efficiently utilize variable resources. This paper explores thread oversubscription, i.e., provisioning more threads than the available cores, to exploit CPU elasticity in the cloud. While maintaining sufficient concurrency allows applications to utilize additional CPUs when more are made available, it is widely believed that thread oversubscription introduces prohibitive overheads due to excessive context switches, loss of locality, and contention on shared resources. In this paper, we conduct a comprehensive study of the overhead of thread oversubscription. We find that 1) the direct cost of context switching (i.e., 1-2 μs on modern processors) does not cause noticeable performance slow down to most applications; 2) oversubscription can be both constructive and destructive to the performance of CPU caches and TLB. We identify two previously under-studied issues that are responsible for drastic slowdowns in many applications under oversubscription. First, the existing thread sleep and wakeup process in the OS kernel is inefficient in handling oversubscribed threads. Second, pervasive busy-waiting operations in program code can waste CPU and starve critical threads. To this end, we devise two OS mechanisms, virtual blocking and busy-waiting detection, to enable efficient thread oversubscription without requiring program code changes. Experimental results show that our approaches can achieve an efficiency close to that in under-subscribed scenarios while preserving the capability to expand to many more CPUs. The performance gain is up to 77% for blocking- and 19x for busy-waiting-based applications compared to the vanilla Linux.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3431379.3460641","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Elasticity is an essential feature of cloud computing, which allows users to dynamically add or remove resources in response to workload changes. However, building applications that truly exploit elasticity is non-trivial. Traditional applications need to be modified to efficiently utilize variable resources. This paper explores thread oversubscription, i.e., provisioning more threads than the available cores, to exploit CPU elasticity in the cloud. While maintaining sufficient concurrency allows applications to utilize additional CPUs when more are made available, it is widely believed that thread oversubscription introduces prohibitive overheads due to excessive context switches, loss of locality, and contention on shared resources. In this paper, we conduct a comprehensive study of the overhead of thread oversubscription. We find that 1) the direct cost of context switching (i.e., 1-2 μs on modern processors) does not cause a noticeable performance slowdown for most applications; 2) oversubscription can be both constructive and destructive to the performance of CPU caches and the TLB. We identify two previously under-studied issues that are responsible for drastic slowdowns in many applications under oversubscription. First, the existing thread sleep and wakeup process in the OS kernel is inefficient in handling oversubscribed threads. Second, pervasive busy-waiting operations in program code can waste CPU and starve critical threads. To this end, we devise two OS mechanisms, virtual blocking and busy-waiting detection, to enable efficient thread oversubscription without requiring program code changes. Experimental results show that our approaches can achieve an efficiency close to that in under-subscribed scenarios while preserving the capability to expand to many more CPUs. The performance gain is up to 77% for blocking-based and 19x for busy-waiting-based applications compared to vanilla Linux.
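The abstract contrasts blocking synchronization (the kernel's sleep/wakeup path) with busy-waiting under oversubscription. The sketch below is not from the paper; it is a minimal illustration, using standard POSIX threads, of what "provisioning more threads than the available cores" looks like and of the two waiting styles the authors discuss: a busy-wait spin on a shared flag, which keeps the thread runnable and burns CPU time that other oversubscribed threads could use, versus a blocking wait on a condition variable, which releases the core until the thread is woken. The 4x oversubscription factor and the function names worker_spin and worker_block are illustrative assumptions only.

```c
// Minimal illustration (not the paper's code): oversubscribe a machine by
// spawning 4x as many threads as online cores, and compare two waiting
// styles: busy-waiting on a flag vs. blocking on a condition variable.
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static atomic_int ready = 0;                 // flag the spinning workers poll
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

// Busy-waiting: the thread stays runnable and consumes its time slice while
// polling, potentially starving other oversubscribed threads of CPU.
static void *worker_spin(void *arg) {
    (void)arg;
    while (!atomic_load(&ready))
        ;                                    // spin until the flag is set
    return NULL;
}

// Blocking: the thread sleeps in the kernel and frees the core for other
// runnable threads until it is explicitly woken up.
static void *worker_block(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!atomic_load(&ready))
        pthread_cond_wait(&cond, &lock);     // sleep until signaled
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    long nthreads = 4 * cores;               // 4x oversubscription (assumed factor)
    pthread_t *tid = malloc(sizeof(pthread_t) * (size_t)nthreads);

    for (long i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL,
                       (i % 2) ? worker_spin : worker_block, NULL);

    sleep(1);                                // let the oversubscribed workers pile up

    pthread_mutex_lock(&lock);
    atomic_store(&ready, 1);                 // release all workers
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);

    for (long i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    free(tid);
    printf("ran %ld threads on %ld cores\n", nthreads, cores);
    return 0;
}
```

Per the abstract, the paper's virtual blocking and busy-waiting detection mechanisms target exactly these two paths inside the OS, so that neither the blocking nor the spinning code above would need to be modified to run efficiently when oversubscribed.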