Limits of region-based dynamic binary parallelization

International Conference on Virtual Execution Environments. Pub Date: 2013-03-16. DOI: 10.1145/2451512.2451518

T. Koch, Björn Franke


Citations: 11

Abstract

Efficiently executing sequential legacy binaries on chip multi-processors (CMPs) composed of many small cores is one of today's most pressing problems. Single-threaded execution is suboptimal because of CMPs' lower single-core performance, while multi-threaded execution relies on prior parallelization, which is severely hampered by the low-level binary representation of applications compiled and optimized for a single-core target. A recent technology that addresses this problem is Dynamic Binary Parallelization (DBP), which creates a Virtual Execution Environment (VEE) that takes advantage of the underlying multicore host to transparently parallelize the sequential binary executable. While still in its infancy, DBP has received broad interest within the research community. The combined use of DBP and thread-level speculation (TLS) has been proposed as a technique to accelerate legacy uniprocessor code on modern CMPs. In this paper, we investigate the limits of DBP and seek to understand the factors contributing to these limits, as well as the costs and overheads of its implementation. We have performed an extensive evaluation using a parameterizable DBP system targeting a CMP with light-weight architectural TLS support. We demonstrate that there is room for a significant reduction, of up to 54%, in the number of instructions on the critical paths of legacy SPEC CPU2006 benchmarks. However, we show that it is much harder to translate these savings into actual performance improvements, with a realistic hardware-supported implementation achieving a speedup of 1.09 on average.
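
To put the two headline figures side by side, the following is a minimal back-of-the-envelope sketch, not part of the paper's methodology. It assumes, purely for illustration, that run time scales linearly with the number of instructions on the critical path; the function names `ideal_speedup` and `implied_reduction` are hypothetical. Note also that the 54% figure is a best case ("up to") while 1.09 is an average, so the comparison is indicative only.

```python
def ideal_speedup(critical_path_reduction: float) -> float:
    """Upper-bound speedup, ASSUMING run time is proportional to the
    critical-path instruction count (an illustrative simplification)."""
    if not 0.0 <= critical_path_reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - critical_path_reduction)


def implied_reduction(speedup: float) -> float:
    """Fraction of run time saved that corresponds to a given speedup."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Both input figures are taken from the abstract.
best_case = ideal_speedup(0.54)      # "up to 54%" critical-path reduction
realized = implied_reduction(1.09)   # average speedup of 1.09

print(f"ideal speedup at 54% reduction: {best_case:.2f}x")
print(f"run-time reduction at 1.09x speedup: {realized:.1%}")
```

Under that simplifying assumption, a 54% shorter critical path would bound the speedup at roughly 2.17x, whereas an average speedup of 1.09 corresponds to only about 8.3% less run time. The gap between the two is what the abstract attributes to the costs and overheads of a realistic implementation.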


