Late-binding: enabling unordered load-store queues

Proceedings. International Symposium on Computer Architecture. Pub Date: 2007-06-09. DOI: 10.1145/1250662.1250705

S. Sethumadhavan, Franziska Roesner, J. Emer, D. Burger, S. Keckler

Volume 33 (1), pages 347-357

Citations: 57

Abstract

Conventional load/store queues (LSQs) are an impediment to both power-efficient execution in superscalar processors and scaling to large-window designs. In this paper, we propose techniques to improve the area and power efficiency of LSQs by allocating entries when instructions issue ("late binding"), rather than when they are dispatched. This approach enables lower occupancy and thus smaller LSQs. Efficient implementations of late-binding LSQs, however, require the entries in the LSQ to be unordered with respect to age. In this paper, we show how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses. We show that late-binding, unordered LSQs work well for small-window superscalar processors, but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks. To handle the increased overflows, we apply classic network flow control techniques to the processor micronetworks, enabling low-overhead recovery mechanisms from bank overflows. We evaluate three such mechanisms: instruction replay, skid buffers, and virtual-channel buffering in the on-chip memory network. We show that for an 80-instruction window, the LSQ can be reduced to 32 entries. For a 1,024-instruction window, the unordered, late-binding LSQ works well with four banks of 48 entries each. By applying a Bloom filter as well, this design achieves full hardware memory disambiguation for a 1,024-instruction window while requiring low average power per load and store access of 8 and 12 CAM entries, respectively.
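As an illustration of the idea in the abstract, the following is a minimal software sketch of an age-unordered LSQ with late binding. It is not the paper's hardware design: the class and method names (`UnorderedLSQ`, `allocate`, `forward`, `violated_loads`, `commit`) are hypothetical, addresses are compared exactly rather than by byte range, and the violation check is deliberately conservative. The point it shows is that, once slot position no longer encodes program order, each entry must carry an explicit age tag, and the queue can report overflow when no slot is free at issue time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    age: int                      # program-order sequence number (explicit age tag)
    is_store: bool
    addr: int
    value: Optional[int] = None   # store data; unused for loads


class UnorderedLSQ:
    """Toy model: slots are filled in issue order, not program order."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Optional[Entry]] = [None] * capacity

    def allocate(self, entry: Entry) -> bool:
        """Late binding: called at issue time; takes any free slot.

        Returns False on overflow, which a real design would handle with
        a recovery mechanism (for example replay or buffering).
        """
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = entry
                return True
        return False

    def forward(self, load_age: int, addr: int) -> Optional[int]:
        """Value of the youngest store older than the load at the same address."""
        best: Optional[Entry] = None
        for e in self.slots:
            if e is not None and e.is_store and e.addr == addr and e.age < load_age:
                if best is None or e.age > best.age:
                    best = e
        return best.value if best is not None else None

    def violated_loads(self, store_age: int, addr: int) -> List[int]:
        """Ages of already-issued younger loads to the same address.

        Conservative: ignores whether an intervening store supplied the value.
        """
        return sorted(
            e.age
            for e in self.slots
            if e is not None and not e.is_store and e.addr == addr and e.age > store_age
        )

    def commit(self, age: int) -> None:
        """Free the slot held by the instruction with the given age."""
        for i, e in enumerate(self.slots):
            if e is not None and e.age == age:
                self.slots[i] = None
```

In this sketch the search over all slots stands in for the associative (CAM) lookup; the age comparison replaces the positional priority logic that an age-ordered queue would use.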


