Precise Runahead Execution

Ajeya Naithani, Josué Feliu, Almutaz Adileh, L. Eeckhout
{"title":"Precise Runahead Execution","authors":"Ajeya Naithani, Josué Feliu, Almutaz Adileh, L. Eeckhout","doi":"10.1109/HPCA47549.2020.00040","DOIUrl":null,"url":null,"abstract":"Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests. Unfortunately, all prior runahead proposals have shortcomings that limit performance and energy efficiency because they release processor state when entering runahead mode and then need to refill the pipeline to restart normal operation. Moreover, runahead buffer limits prefetch coverage by tracking only a single chain of instructions that leads to the same long-latency load. We propose precise runahead execution (PRE) which builds on the key observation that when entering runahead mode, the processor has enough issue queue and physical register file resources to speculatively execute instructions. This mitigates the need to release and re-fill processor state in the ROB, issue queue, and physical register file. In addition, PRE pre-executes only those instructions in runahead mode that lead to full-window stalls, using a novel register renaming mechanism to quickly free physical registers in runahead mode, further improving efficiency and effectiveness. Finally, PRE optionally buffers decoded runahead micro-ops in the frontend to save energy. Our experimental evaluation using a set of memory-intensive applications shows that PRE achieves an additional 18.2% performance improvement over the recent runahead proposals while at the same time reducing energy consumption by 6.8%.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA47549.2020.00040","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests. Unfortunately, all prior runahead proposals have shortcomings that limit performance and energy efficiency because they release processor state when entering runahead mode and then need to refill the pipeline to restart normal operation. Moreover, the runahead buffer limits prefetch coverage by tracking only a single chain of instructions that leads to the same long-latency load. We propose precise runahead execution (PRE), which builds on the key observation that, when entering runahead mode, the processor has enough issue queue and physical register file resources to speculatively execute instructions. This mitigates the need to release and refill processor state in the ROB, issue queue, and physical register file. In addition, PRE pre-executes only those instructions in runahead mode that lead to full-window stalls, using a novel register renaming mechanism to quickly free physical registers in runahead mode, further improving efficiency and effectiveness. Finally, PRE optionally buffers decoded runahead micro-ops in the frontend to save energy. Our experimental evaluation on a set of memory-intensive applications shows that PRE achieves an additional 18.2% performance improvement over recent runahead proposals while reducing energy consumption by 6.8%.
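To make the mechanism concrete, the following is a minimal Python sketch of the control flow the abstract describes: on a full-window stall, only the dependence chain (the "stalling slice") that feeds upcoming long-latency loads is speculatively executed with leftover back-end resources, the computed load addresses are turned into prefetches, and every speculative result is discarded so registers are freed quickly. The instruction format, register model, and `prefetch` callback are illustrative assumptions for this example, not the paper's hardware implementation.

```python
# A minimal Python sketch of the PRE control flow described above -- an
# illustrative model, not the authors' hardware design. The instruction
# format, register model, and the prefetch callback are assumptions made
# for the example.

from dataclasses import dataclass

@dataclass
class Inst:
    dst: str                      # destination register
    srcs: tuple = ()              # source registers
    is_load: bool = False         # True for a (potentially long-latency) load
    addr_fn: callable = None      # computes the load address from a register map

def stalling_slice(window):
    """Return only the instructions needed to compute the addresses of the
    loads in `window` -- the chains that lead to full-window stalls."""
    needed, live = set(), set()
    for i in reversed(range(len(window))):
        inst = window[i]
        if inst.is_load or inst.dst in live:
            needed.add(i)
            live.discard(inst.dst)
            live.update(inst.srcs)
    return [window[i] for i in sorted(needed)]

def precise_runahead(window, regfile, prefetch):
    """On a full-window stall, speculatively execute just the stalling slice
    to trigger prefetches; all speculative results are dropped on exit,
    mimicking how PRE recycles physical registers instead of flushing and
    refilling the ROB, issue queue, and register file."""
    spec = dict(regfile)                       # speculative copy, never committed
    for inst in stalling_slice(window):
        if inst.is_load:
            prefetch(inst.addr_fn(spec))       # issue an accurate prefetch
            spec[inst.dst] = 0                 # data unknown until the miss returns
        else:                                  # toy ALU op: sum of sources
            spec[inst.dst] = sum(spec.get(s, 0) for s in inst.srcs)

# Example: a strided-array loop body; the slice contains only the index and
# address updates plus the load itself, so the consumer of the loaded value
# is never pre-executed.
window = [
    Inst(dst="r1", srcs=("r1",)),                      # r1 = r1 + stride
    Inst(dst="r2", srcs=("r0", "r1")),                 # r2 = base + index
    Inst(dst="r3", is_load=True, srcs=("r2",),
         addr_fn=lambda r: r["r2"]),                   # r3 = load [r2]
    Inst(dst="r4", srcs=("r3", "r3")),                 # consumer, not in slice
]
precise_runahead(window, {"r0": 0x1000, "r1": 8, "r2": 0}, print)
```

Discarding `spec` on return is the software analogue of the register-management idea in the abstract: runahead results never need to be preserved, so physical registers can be reclaimed as soon as runahead execution is done with them.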