Ziyi Zhao, Zhang Jiang, Ying Chen, Xiaoli Gong, Wenwen Wang, P. Yew
{"title":"Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation","authors":"Ziyi Zhao, Zhang Jiang, Ying Chen, Xiaoli Gong, Wenwen Wang, P. Yew","doi":"10.1109/CGO51591.2021.9370312","DOIUrl":null,"url":null,"abstract":"Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT, it is important to provide equivalent semantics for atomic synchronization instructions such as Load - Link / Store - Conditional (LL/SC), which are mostly included in the reduced-instruction set architectures (RISC) and Compare-and-Swap(CAS), which is mostly in the complex instruction set architectures (CISC). However, the state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular, from RISC atomic instructions (i.e. LL/SC) to CISC atomic instructions (i.e. CAS), due to performance concerns. As a result, some may cause the well-known ABA problem, which could lead to wrong results or program crashes. In our experimental studies on QEMU, a state-of-the-art DBT, that runs multi-threaded lock-free stack operations implemented with ARM instruction set (i.e. using LL/SC) on Intel x86 platforms (i.e. using CAS), it often crashes within 2 seconds. Although attempts have been made to provide correct emulation for such atomic instructions, they either result in heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address those issues and implement them on QEMU to evaluate their performance overheads. The results show that all of the proposed schemes can provide correct emulation and, for the best solution, can achieve a min, max, geomean speedup of 1.25x, 3.21x, 2.03x respectively, over the best existing software-based scheme.","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"154 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CGO51591.2021.9370312","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5
Abstract
Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT, it is important to provide equivalent semantics for atomic synchronization instructions such as Load - Link / Store - Conditional (LL/SC), which are mostly included in the reduced-instruction set architectures (RISC) and Compare-and-Swap(CAS), which is mostly in the complex instruction set architectures (CISC). However, the state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular, from RISC atomic instructions (i.e. LL/SC) to CISC atomic instructions (i.e. CAS), due to performance concerns. As a result, some may cause the well-known ABA problem, which could lead to wrong results or program crashes. In our experimental studies on QEMU, a state-of-the-art DBT, that runs multi-threaded lock-free stack operations implemented with ARM instruction set (i.e. using LL/SC) on Intel x86 platforms (i.e. using CAS), it often crashes within 2 seconds. Although attempts have been made to provide correct emulation for such atomic instructions, they either result in heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address those issues and implement them on QEMU to evaluate their performance overheads. The results show that all of the proposed schemes can provide correct emulation and, for the best solution, can achieve a min, max, geomean speedup of 1.25x, 3.21x, 2.03x respectively, over the best existing software-based scheme.
动态二进制转换(DBT)是跨isa仿真、系统虚拟化、运行时检测和许多其他重要应用程序的关键支持因素。在DBT的几个关键需求中,重要的是为原子同步指令(如Load - Link / Store - Conditional (LL/SC))提供等效语义,这些指令主要包含在精简指令集体系结构(RISC)和比较与交换(CAS)中,它们主要存在于复杂指令集体系结构(CISC)中。然而,由于性能问题,最先进的DBT工具通常不能提供这些原子指令的完全正确的转换,特别是从RISC原子指令(即LL/SC)到CISC原子指令(即CAS)。因此,有些可能会导致众所周知的ABA问题,这可能导致错误的结果或程序崩溃。在我们对QEMU的实验研究中,QEMU是一种最先进的DBT,在Intel x86平台(即使用CAS)上运行使用ARM指令集(即使用LL/SC)实现的多线程无锁堆栈操作,它经常在2秒内崩溃。尽管已经尝试为这些原子指令提供正确的模拟,但它们要么导致沉重的执行开销,要么需要额外的硬件支持。在本文中,我们提出了几个方案来解决这些问题,并在QEMU上实现它们,以评估它们的性能开销。结果表明,所提方案均能提供正确的仿真结果,且与现有的最佳软件方案相比,最优方案的最小、最大和几何速度分别提高1.25倍、3.21倍和2.03倍。