{"title":"Efficient LLVM-based dynamic binary translation","authors":"A. Engelke, Dominik Okwieka, M. Schulz","doi":"10.1145/3453933.3454022","DOIUrl":"https://doi.org/10.1145/3453933.3454022","url":null,"abstract":"Emulation of other or newer processor architectures is necessary for a wide variety of use cases, from ensuring compatibility to offering a vehicle for computer architecture research. This problem is usually approached using dynamic binary translation, where machine code is translated, on the fly, to the host architecture during program execution. Existing systems, like QEMU, usually focus on translation performance rather than the overall program execution, and extensions, like HQEMU, are limited by their underlying implementation. Conversely, performance-focused systems are typically used for binary instrumentation. E.g., DynamoRIO reuses original instructions where possible, while Instrew utilizes the LLVM compiler infrastructure, but only supports same-architecture code generation. In this short paper, we generalize Instrew to support different guest and host architectures by refactoring the lifter and by implementing target-independent optimizations to re-use host hardware features for emulated code. We demonstrate this flexibility by adding support for RISC-V as guest architecture and AArch64 as host architecture. Our performance results on SPEC CPU2017 show significant improvements compared to QEMU, HQEMU as well as the original Instrew.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114898529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How to design a library OS for practical containers?","authors":"Hajime Tazaki, Akira Moroo, Yohei Kuga, Ryo Nakamura","doi":"10.1145/3453933.3454011","DOIUrl":"https://doi.org/10.1145/3453933.3454011","url":null,"abstract":"Container engines with operating-system virtualization have been widely used and now offer extensions to replace core functionalities that are derived from the host kernel. Because such extensions with an alternate kernel, which is often implemented in a library operating system (libOS), can be designed to have free choice, developers are tempted to take a clean-slate approach, i.e., implement the kernels from scratch. However, this design decision makes it difficult to cover broad features of the original Linux kernel, and some application programs may not work on such kernels. Precise emulation of the huge codebase and rich feature set of the Linux kernel is not easily possible. In this paper, we have tried to improve the level of compatibility in a libOS by using the source code of the Linux kernel as the container kernel. We present µKontainer, an alternate container kernel based on a libOS by extending the existing open-source software, Linux Kernel Library, while preserving the lightweight property of conventional containers. We have studied the level of compatibility with the conformance tests of network protocol implementation of nine different libOSs, and µKontainer performs identically like the Linux kernel. The network-related benchmark shows mostly comparable results with a conventional container and a native Linux host; in the best case, the goodput of the short-sized packet is up to 84% faster than that of a native Linux host. This paper sheds light on the design space of the libOS when we introduced the extended container kernel.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115207291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michail Papadimitriou, J. Fumero, Athanasios Stratikopoulos, Christos Kotselidis
{"title":"Automatically exploiting the memory hierarchy of GPUs through just-in-time compilation","authors":"Michail Papadimitriou, J. Fumero, Athanasios Stratikopoulos, Christos Kotselidis","doi":"10.1145/3453933.3454014","DOIUrl":"https://doi.org/10.1145/3453933.3454014","url":null,"abstract":"Although Graphics Processing Units (GPUs) have become pervasive for data-parallel workloads, the efficient exploitation of their tiered memory hierarchy requires explicit programming. The efficient utilization of different GPU memory tiers can yield higher performance at the expense of programmability since developers must have extended knowledge of the architectural details in order to utilize them. In this paper, we propose an alternative approach based on Just-In-Time (JIT) compilation to automatically and transparently exploit local memory allocation and data locality on GPUs. In particular, we present a set of compiler extensions that allow arbitrary Java programs to utilize local memory on GPUs without explicit programming. We prototype and evaluate our proposed solution in the context of TornadoVM against a set of benchmarks and GPU architectures, showcasing performance speedups of up to 2.5x compared to equivalent baseline implementations that do not utilize local memory or data locality. In addition, we compare our proposed solution against hand-written optimized OpenCL code to assess the upper bound of performance improvements that can be transparently achieved by JIT compilation without trading programmability. The results showcase that the proposed extensions can achieve up to 94% of the performance of the native code, highlighting the efficiency of the generated code.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"350 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114823593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stella Bitchebe, Djob Mvondo, Laurent Réveillère, Noël de Palma, A. Tchana
{"title":"Extending Intel PML for hardware-assisted working set size estimation of VMs","authors":"Stella Bitchebe, Djob Mvondo, Laurent Réveillère, Noël de Palma, A. Tchana","doi":"10.1145/3453933.3454018","DOIUrl":"https://doi.org/10.1145/3453933.3454018","url":null,"abstract":"Intel page modification logging (PML) is a hardware feature introduced in 2015 for tracking modified memory pages of virtual machines (VMs). Although initially designed to improve VMs checkpointing and live migration, we present in this paper how we can take advantage of this virtualization technology to efficiently estimate the working set size (WSS) of a VM. To this end, we first conduct a study of PML with the Xen hypervisor to investigate its performance impact on VMs and the accuracy of a WSS estimation system that relies on the current version of PML. Our three main findings are as follows. (1) PML reduces by up to 10.18% the time of both VM live migration and checkpointing. (2) PML slightly reduces the negative impact of live migration on application performance by up to 0.95%. (3) A WSS estimation system based on the current version of PML provides inaccurate results. Moreover, our experiments show that write-intensive applications are negatively impacted, with up to 34.9% of performance degradation, when using PML to estimate the WSS of a VM that runs these applications. Based on the aforementioned findings, we introduce page reference logging (PRL), an extended version of PML that allows both read and write memory accesses to be tracked without impacting user VMs, thus more suitable for WSS estimation. We propose a WSS estimation system that leverages PRL and show how it can be used in a data center exploiting memory overcommitment. We implement PRL and the underlying WSS estimation system under Gem5, a popular open-source computer architecture simulator. Evaluation results validate the accuracy of the WSS estimation system and show that PRL does not incur more performance degradation on user’s VMs.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122628665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Boris Teabe, Peterson Yuhala, A. Tchana, Fabien Hermenier, D. Hagimont, Gilles Muller
{"title":"(No)Compromis: paging virtualization is not a fatality","authors":"Boris Teabe, Peterson Yuhala, A. Tchana, Fabien Hermenier, D. Hagimont, Gilles Muller","doi":"10.1145/3453933.3454013","DOIUrl":"https://doi.org/10.1145/3453933.3454013","url":null,"abstract":"Nested/Extended Page Table (EPT) is the current hardware solution for virtualizing memory in virtualized systems. It induces a significant performance overhead due to the 2D page walk it requires, thus 24 memory accesses on a TLB miss (instead of 4 memory accesses in a native system). This 2D page walk constraint comes from the utilization of paging for managing virtual machine (VM) memory. This paper shows that paging is not necessary in the hypervisor. Our solution Compromis, a novel Memory Management Unit, uses direct segments for VM memory management combined with paging for VM's processes. This is the first time that a direct segment based solution is shown to be applicable to the entire VM memory while keeping applications unchanged. Relying on the 310 studied datacenter traces, the paper shows that it is possible to provision up to 99.99% of the VMs using a single memory segment. The paper presents a systematic methodology for implementing Compromis in the hardware, the hypervisor and the datacenter scheduler. Evaluation results show that Compromis outperforms the two popular memory virtualization solutions: shadow paging and EPT by up to 30% and 370% respectively.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127524422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sai Sha, Yi Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang
{"title":"Swift shadow paging (SSP): no write-protection but following TLB flushing","authors":"Sai Sha, Yi Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang","doi":"10.1145/3453933.3454012","DOIUrl":"https://doi.org/10.1145/3453933.3454012","url":null,"abstract":"Virtualization is a key technique for supporting cloud services and memory virtualization is a major component of virtualization technology. Common memory virtualization mechanisms include shadow paging and hardware-assisted paging. The shadow paging model needs to synchronize shadow/guest page tables whenever there is a guest page table update. In the design of traditional shadow paging (TSP), the guest page table pages are write-protected so the updates can be intercepted by the hypervisor to ensure synchronization. Frequent page table updates cause lots of VM_Exits. Researchers have developed hardware-assisted paging to eliminate this overhead. However, address translation needs to walk a two-dimensional page table. This design significantly increases the overhead of page walk. This paper proposes SSP, a Swift Shadow Paging model which leverages the privileged hardware mode. In this design, the write protection mechanism is no longer needed. Rather, SSP accomplishes lazy page table synchronization by intercepting TLB flushing, which must be initiated by the guest OS when there is a page table update. The hardware mode, such as RISC-V’s machine mode and Sunway’s hardware mode, with the highest privilege, opens a new door for communication between the host OS and a guest OS. In addition, by using a shadow page table base address buffer, SSP eliminates the VM_Exits generated by guest process context switching. SSP inherits the advantage of TSP as it remains as a software-only solution and does not incur the excessive overhead of page walk when compared to hardware-assisted paging. We implement SSP in a Sunway machine. Our evaluation demonstrates SSP’s advantage for multiple workloads. Compared with TSP, SSP reduces VM_Exits caused by memory virtualization by 23%-56%. And the virtualization overhead of SSP is less than 5.5% for all workloads.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128939832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective exploitation of SIMD resources in cross-ISA virtualization","authors":"Jin Wu, Jian Dong, Ruili Fang, Ziyi Zhao, Xiaoli Gong, Wenwen Wang, Decheng Zuo","doi":"10.1145/3453933.3454016","DOIUrl":"https://doi.org/10.1145/3453933.3454016","url":null,"abstract":"System virtualization is a fundamental technology that enables many important applications. However, existing virtualization techniques suffer from a critical limitation due to their limited exploitation of host SIMD hardware resources, especially when a guest application does not have inherently fine-grained data-level parallelism. To bridge this utilization gap and unleash the full potential of host SIMD resources, this paper proposes an effective and unconventional SIMD exploitation technique. The proposed exploitation takes advantage of ample host SIMD registers and powerful host SIMD instructions to generate more efficient host binary code for guest applications even without any fine-grained data-level parallelism. It also mitigates the shortage of general-purpose registers on the host platform, as well as improves the efficiency of accessing guest registers. We have implemented the exploitation in an extensively-used virtualization platform, QEMU. Experimental results on a comprehensive list of benchmarks from PARSEC, SPEC-CPU2017, and Google Octane JavaScript benchmark suite show that an average of 2.2X performance speedup can be achieved for AArch64 binaries on an x86-64 host machine. We believe the proposed technique will provide a new perspective for our community to rethink the exploitation of SIMD hardware resources.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115726924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated bug localization in JIT compilers","authors":"HeuiChan Lim, S. Debray","doi":"10.1145/3453933.3454021","DOIUrl":"https://doi.org/10.1145/3453933.3454021","url":null,"abstract":"Many widely-deployed modern programming systems use just-in-time (JIT) compilers to improve performance. The size and complexity of JIT-based systems, combined with the dynamic nature of JIT-compiler optimizations, make it challenging to locate and fix JIT compiler bugs quickly. At the same time, JIT compiler bugs can result in exploitable security vulnerabilities, making rapid bug localization important. Existing work on automated bug localization focuses on static code, i.e., code that is not generated at runtime, and so cannot handle bugs in JIT compilers that generate incorrect code during optimization. This paper describes an approach to automated bug localization in JIT compilers, down to the level of distinct optimization phases, starting with a single initial Proof-of-Concept (PoC) input that demonstrates the bug. Experiments using a prototype implementation of our ideas on Google’s V8 JavaScript interpreter and TurboFan JIT compiler demonstrates that it can successfully identify buggy optimization phases.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114891712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive live migration of virtual machines under limited network bandwidth","authors":"Handong Li, Guangrong Xiao, Yulei Zhang, Ping Gao, Q. Lu, Jianguo Yao","doi":"10.1145/3453933.3454017","DOIUrl":"https://doi.org/10.1145/3453933.3454017","url":null,"abstract":"Live migration is a crucial feature in existing virtualization platforms. Since memory is dirtied rapidly during the execution of a virtual machine (VM), boosting memory migration speed becomes a significant factor in guaranteeing a high-level success ratio and efficiency. However, the statically-configured migration strategy cannot cope with various workloads running in VMs, resulting in frequently aborted migration processes and low success ratio. This paper proposed a one-for-all migration architecture called Adaptive Live Migration (AdaMig) to address these issues. This QEMU-based solution dynamically switches migration methods and tunes related parameters by monitoring the run-time statistics from the migration process and the physical host. Once AdaMig detects the tendency that migration cannot converge, it will switch to another migration method to synchronize remaining dirty pages. During the whole process, AdaMig also dynamically tunes migration parameters according to current resources available in the physical host and migration efficiency. Experimental results reflect that AdaMig improves the success ratio from 26.7% to 93.3% over various workloads, and migration time is reduced by up to 45.5% in comparison with the original solution in QEMU.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133380803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kele Huang, Fuxing Zhang, Cun Li, Gen Niu, Junrong Wu, Tianyi Liu
{"title":"BTMMU: an efficient and versatile cross-ISA memory virtualization","authors":"Kele Huang, Fuxing Zhang, Cun Li, Gen Niu, Junrong Wu, Tianyi Liu","doi":"10.1145/3453933.3454015","DOIUrl":"https://doi.org/10.1145/3453933.3454015","url":null,"abstract":"Full system dynamic binary translation (DBT) has many important applications, but it is typically much slower than the native host. One major overhead in full system DBT comes from cross-ISA memory virtualization, where multi-level memory address translation is needed to map guest virtual address into host physical address. Like the SoftMMU used in the popular open-source emulator QEMU, software-based memory virtualization solutions are not efficient. Meanwhile, mature techniques for same-ISA virtualization such as shadow page table or second level address translation are not directly applicable due to cross-ISA difficulties. Some previous studies achieved significant speedup by utilizing existing hardware (TLB or virtualization hardware) of the host. However, since the hardware is not designed with cross-ISA in mind, those solutions had some limitations that were hard to overcome. Most of them only supported guests with smaller virtual address space than the host. Some supported only guests with the same page size. And some did not support privileged memory accesses. This paper proposes a new solution named BTMMU (Binary Translation Memory Management Unit). BTMMU composes of a low-cost hardware extension of host MMU, a kernel module and a patched QEMU version. BTMMU is able to solve most known limitations of previous hardware-assisted solutions and thus versatile enough for real deployments. Meanwhile, BTMMU achieves high efficiency by directly accessing guest address space, implementing shadow page table in kernel module, utilizing dedicated entrance for guest-related MMU exceptions and various software optimizations. Evaluations on SPEC CINT2006 benchmark suite and some real-world applications show that BTMMU achieves 1.40x and 1.36x speedup on IA32-to-MIPS64 and X86_64-to-MIPS64 configurations respectively when comparing with the base QEMU version. The result is compared to a representative previous work and shows its advantage.","PeriodicalId":322034,"journal":{"name":"Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments","volume":"516 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133070726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}