Proceedings of the Seventeenth European Conference on Computer Systems (EuroSys '22)

Title: OS scheduling with nest: keeping tasks close together on warm cores
Authors: J. Lawall, Himadri Chhaya-Shailesh, Jean-Pierre Lozi, Baptiste Lepers, W. Zwaenepoel, Gilles Muller
DOI: https://doi.org/10.1145/3492321.3519585
Abstract: To best support highly parallel applications, Linux's CFS scheduler tends to spread tasks across the machine on task creation and wakeup. It has been observed, however, that in a server environment, such a strategy leads to tasks being unnecessarily placed on long-idle cores that are running at lower frequencies, reducing performance, and to tasks being unnecessarily distributed across sockets, consuming more energy. In this paper, we propose to exploit the principle of core reuse by constructing a nest of cores to be used preferentially for task scheduling, thus obtaining higher frequencies and using fewer sockets. We implement the Nest scheduler in the Linux kernel. While performance and energy usage are comparable to CFS for highly parallel applications, for a range of applications using fewer tasks than cores, Nest improves performance by 10%--2× and can reduce energy usage.

Title: Unicorn: reasoning about configurable system performance through the lens of causality
Authors: Md Shahriar Iqbal, R. Krishna, Mohammad Ali Javidian, Baishakhi Ray, Pooyan Jamshidi
DOI: https://doi.org/10.1145/3492321.3519575
Abstract: Modern computer systems are highly configurable, with a total variability space sometimes larger than the number of atoms in the universe. Understanding and reasoning about the performance behavior of highly configurable systems over such a vast and variable space is challenging. State-of-the-art methods for performance modeling and analysis rely on predictive machine-learning models and therefore (i) become unreliable in unseen environments (e.g., different hardware or workloads) and (ii) may produce incorrect explanations. To tackle this, we propose a new method, called Unicorn, which (i) captures intricate interactions between configuration options across the software-hardware stack and (ii) describes how such interactions can impact performance variations via causal inference. We evaluated Unicorn on six highly configurable systems: three on-device machine learning systems, a video encoder, a database management system, and a data analytics pipeline. The experimental results indicate that Unicorn outperforms state-of-the-art performance debugging and optimization methods in finding effective repairs for performance faults and in finding configurations with near-optimal performance. Further, unlike the existing methods, the learned causal performance models reliably predict performance in new environments.

Title: Varuna: scalable, low-cost training of massive deep learning models
Authors: Sanjith Athlur, Nitika Saran, Muthian Sivathanu, R. Ramjee, Nipun Kwatra
DOI: https://doi.org/10.1145/3492321.3519584
Abstract: Systems for training massive deep learning models (billions of parameters) today assume and require specialized "hyperclusters": hundreds or thousands of GPUs wired with specialized high-bandwidth interconnects such as NVLink and InfiniBand. Besides being expensive, such dependence on hyperclusters and custom high-speed interconnects limits the size of such clusters, creating (a) scalability limits on job parallelism and (b) resource fragmentation across hyperclusters. In this paper, we present Varuna, a new system that enables training massive deep learning models on commodity networking. Varuna makes thrifty use of networking resources and automatically configures the user's training job to efficiently use any given set of resources. Varuna is therefore able to leverage "low-priority" VMs that cost about 5× less than dedicated GPUs, significantly reducing the cost of training massive models. We demonstrate the efficacy of Varuna by training massive models, including a 200-billion-parameter model, on 5× cheaper "spot VMs", while maintaining high training throughput. Varuna improves end-to-end training time for language models like BERT and GPT-2 by up to 18× compared to other model-parallel approaches and by up to 26% compared to other pipeline-parallel approaches on commodity VMs. The code for Varuna is available at https://github.com/microsoft/varuna.

Title: Nyx-Net: network fuzzing with incremental snapshots
Authors: Sergej Schumilo, Cornelius Aschermann, Andrea Jemmett, A. Abbasi, Thorsten Holz
DOI: https://doi.org/10.1145/3492321.3519591
Abstract: Coverage-guided fuzz testing ("fuzzing") has become mainstream, and this research area has seen much progress recently. However, it is still challenging to efficiently test network services with existing coverage-guided fuzzing methods. In this paper, we introduce the design and implementation of Nyx-Net, a novel snapshot-based fuzzing approach that can successfully fuzz a wide range of targets spanning servers, clients, games, and even Firefox's Inter-Process Communication (IPC) interface. Compared to state-of-the-art methods, Nyx-Net improves test throughput by up to 300× and coverage found by up to 70%. Additionally, Nyx-Net is able to find crashes in two of ProFuzzBench's targets that no other fuzzer found previously. When used to play the game Super Mario, Nyx-Net shows speedups of 10--30× compared to existing work. Moreover, Nyx-Net is able to find previously unknown bugs in servers such as Lighttpd, clients such as the MySQL client, and even Firefox's IPC mechanism, demonstrating the strength and versatility of the proposed approach. Lastly, our prototype implementation was awarded a $20,000 bug bounty for enabling fuzzing on previously unfuzzable code in Firefox and solving a long-standing problem at Mozilla.
{"title":"Minimum viable device drivers for ARM trustzone","authors":"Liwei Guo, F. Lin","doi":"10.1145/3492321.3519565","DOIUrl":"https://doi.org/10.1145/3492321.3519565","url":null,"abstract":"While TrustZone can isolate IO hardware, it lacks drivers for modern IO devices. Rather than porting drivers, we propose a novel approach to deriving minimum viable drivers: developers exercise a full driver and record the driver/device interactions; the processed recordings, dubbed driverlets, are replayed in the TEE at run time to access IO devices. Driverlets address two key challenges: correctness and expressiveness, for which they build on a key construct called interaction template. The interaction template ensures faithful reproduction of recorded IO jobs (albeit on new IO data); it accepts dynamic input values; it tolerates nondeterministic device behaviors. We demonstrate driverlets on a series of sophisticated devices, making them accessible to Trust-Zone for the first time to our knowledge. Our experiments show that driverlets are secure, easy to build, and incur acceptable overhead (1.4×-2.7× compared to native drivers). Driverlets fill a critical gap in the TrustZone TEE, realizing its long-promised vision of secure IO.","PeriodicalId":196414,"journal":{"name":"Proceedings of the Seventeenth European Conference on Computer Systems","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121227747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Multi-objective congestion control
Authors: Yiqing Ma, Han Tian, Xudong Liao, Junxue Zhang, Weiyan Wang, Kai Chen, Xin Jin
DOI: https://doi.org/10.1145/3492321.3519593
Abstract: Decades of research on Internet congestion control (CC) have produced a plethora of algorithms that optimize for different performance objectives. Applications face the challenge of choosing the most suitable algorithm for their needs, and it takes tremendous effort and expertise to customize CC algorithms when new demands emerge. In this paper, we explore a basic question: can we design a single CC algorithm that satisfies different objectives? We propose MOCC, the first multi-objective congestion control algorithm that attempts to address this question. The core of MOCC is a novel multi-objective reinforcement learning framework for CC that automatically learns the correlations between different application requirements and the corresponding optimal control policies. Under this framework, MOCC further applies transfer learning to carry knowledge from past experience over to new applications, quickly adapting itself to a new objective even if it is unforeseen. We provide both user-space and kernel-space implementations of MOCC. Real-world Internet experiments and extensive simulations show that MOCC supports multiple objectives well, competing with or outperforming the best existing CC algorithms on each individual objective, and adapts to new application objectives in 288 seconds (14.2× faster than prior work) without compromising old ones.

Title: Narwhal and Tusk: a DAG-based mempool and efficient BFT consensus
Authors: G. Danezis, Eleftherios Kokoris-Kogias, A. Sonnino, A. Spiegelman
DOI: https://doi.org/10.1145/3492321.3519594
Abstract: We propose separating the task of reliable transaction dissemination from transaction ordering to enable high-performance Byzantine fault-tolerant quorum-based consensus. We design and evaluate a mempool protocol, Narwhal, specializing in high-throughput reliable dissemination and storage of causal histories of transactions. Narwhal tolerates an asynchronous network and maintains high performance despite failures. Narwhal is designed to scale out easily using multiple workers at each validator, and we demonstrate that there is no foreseeable limit to the throughput we can achieve. Composing Narwhal with a partially synchronous consensus protocol (Narwhal-HotStuff) yields significantly better throughput even in the presence of faults or intermittent loss of liveness due to asynchrony. However, loss of liveness can result in higher latency. To achieve good overall performance when faults occur, we design Tusk, a zero-message-overhead asynchronous consensus protocol, to work with Narwhal. We demonstrate its high performance under a variety of configurations and faults. In summary, on a WAN, Narwhal-HotStuff achieves over 130,000 tx/sec at less than 2 sec latency, compared with 1,800 tx/sec at 1 sec latency for HotStuff. Additional workers increase throughput linearly to 600,000 tx/sec without any latency increase. Tusk achieves 160,000 tx/sec at about 3 seconds of latency. Under faults, both protocols maintain high throughput, but Narwhal-HotStuff suffers from increased latency.

Title: Isolating functions at the hardware limit with virtines
Authors: Nicholas C. Wanninger, Josh Bowden, K. Shetty, Ayush Garg, Kyle C. Hale
DOI: https://doi.org/10.1145/3492321.3519553
Abstract: An important class of applications, including programs that leverage third-party libraries, programs that use user-defined functions in databases, and serverless applications, benefits from isolating the execution of untrusted code at the granularity of individual functions or function invocations. However, existing isolation mechanisms were not designed for this use case; rather, they have been adapted to it. We introduce virtines, a new abstraction designed specifically for function-granularity isolation, and describe how we build virtines from the ground up by pushing hardware virtualization to its limits. Virtines give developers fine-grained control in deciding which functions should run in isolated environments and which should not. The virtine abstraction is a general one, and we demonstrate a prototype that adds extensions to the C language. We present a detailed analysis of the overheads of running individual functions in isolated VMs, and guided by those findings, we present Wasp, an embeddable hypervisor that allows programmers to easily use virtines. We describe several representative scenarios that employ individual function isolation and demonstrate that virtines can be applied in these scenarios with only a few lines of changes to existing codebases and with acceptable slowdowns.