LiteTM: Reducing transactional state overhead

Syed Ali Raza Jafri, Mithuna Thottethodi, T. N. Vijaykumar
{"title":"LiteTM: Reducing transactional state overhead","authors":"Syed Ali, Raza Jafri, Mithuna Thottethodi, T. N. Vijaykumar","doi":"10.1109/HPCA.2010.5416653","DOIUrl":null,"url":null,"abstract":"Transactional memory (TM) has been proposed to address some of the programmability issues of chip multiprocessors. Hardware implementations of transactional memory (HTMs) have made significant progress in providing support for features such as long transactions that spill out of the cache, and context switches, page and thread migration in the middle of transactions. While essential for the adoption of HTMs in real products, supporting these features has resulted in significant state overhead. For instance, TokenTM adds at least 16 bits per block in the caches which is significant in absolute terms, and steals 16 of 64 (25%) memory ECC bits per block, weakening error protection. Also, the state bits nearly double the tag array size. These significant and practical concerns may impede the adoption of HTMs, squandering the progress achieved by HTMs. The overhead comes from tracking the thread identifier and the transactional read-sharer count at the L1-block granularity. The thread identifier is used to identify the transaction, if only one, to which an L1-evicted block belongs. The read-sharer count is used to identify conflicts involving multiple readers (i.e., write to a block with non-zero count). To reduce this overhead, we observe that the thread identifiers and read-sharer counts are not needed in a majority of cases. (1) Repeated misses to the same blocks are rare within a transaction (i.e., locality holds). (2) Transactional read-shared blocks that both are evicted from multiple sharers' L1s and are involved in conflicts are rare. Exploiting these observations, we propose a novel HTM, called LiteTM, which completely eliminates the count and identifier and uses software to infer the lost information. Using simulations of the STAMP benchmarks running on 8 cores, we show that LiteTM reduces TokenTM's state overhead by about 87% while performing within 4%, on average, and 10%, in the worst case, of To ke nTM.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2010.5416653","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 12

Abstract

Transactional memory (TM) has been proposed to address some of the programmability issues of chip multiprocessors. Hardware implementations of transactional memory (HTMs) have made significant progress in providing support for features such as long transactions that spill out of the cache, and context switches, page and thread migration in the middle of transactions. While essential for the adoption of HTMs in real products, supporting these features has resulted in significant state overhead. For instance, TokenTM adds at least 16 bits per block in the caches, which is significant in absolute terms, and steals 16 of 64 (25%) memory ECC bits per block, weakening error protection. Also, the state bits nearly double the tag array size. These significant and practical concerns may impede the adoption of HTMs, squandering the progress achieved by HTMs. The overhead comes from tracking the thread identifier and the transactional read-sharer count at the L1-block granularity. The thread identifier is used to identify the transaction, if only one, to which an L1-evicted block belongs. The read-sharer count is used to identify conflicts involving multiple readers (i.e., a write to a block with a non-zero count). To reduce this overhead, we observe that the thread identifiers and read-sharer counts are not needed in a majority of cases. (1) Repeated misses to the same blocks are rare within a transaction (i.e., locality holds). (2) Transactional read-shared blocks that both are evicted from multiple sharers' L1s and are involved in conflicts are rare. Exploiting these observations, we propose a novel HTM, called LiteTM, which completely eliminates the count and identifier and uses software to infer the lost information. Using simulations of the STAMP benchmarks running on 8 cores, we show that LiteTM reduces TokenTM's state overhead by about 87% while performing within 4%, on average, and 10%, in the worst case, of TokenTM.
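To make the role of this per-block state concrete, the sketch below shows a hypothetical TokenTM-style metadata record and the reader-writer conflict check the abstract describes (a transactional write to a block with a non-zero read-sharer count signals a conflict). This is not code from the paper; the struct fields, field widths, and function names are illustrative assumptions, and the check is simplified (it ignores the case where the writer is itself the only read-sharer).

/* Hypothetical sketch, not from the paper: per-block transactional state
 * of the kind TokenTM keeps (thread identifier + read-sharer count), and
 * the conflict check described in the abstract. Names and widths are
 * illustrative assumptions only. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t read_sharers;  /* transactional read-sharer count for this block */
    uint16_t owner_tid;     /* thread identifier of the single transactional owner, if any */
    bool     tx_written;    /* block has been written transactionally */
} tx_block_state_t;

/* A transactional write conflicts if other transactions hold the block as
 * read-sharers (non-zero count) or a different transaction has written it.
 * Simplified: does not exclude the writer's own read-sharing. */
static bool write_conflicts(const tx_block_state_t *b, uint16_t writer_tid)
{
    if (b->read_sharers > 0)
        return true;                               /* readers vs. writer */
    if (b->tx_written && b->owner_tid != writer_tid)
        return true;                               /* writer vs. writer */
    return false;
}

LiteTM's point is that both per-block fields (roughly 16 bits per L1 block in TokenTM) can be eliminated entirely: because repeated misses to the same block within a transaction are rare, and read-shared blocks that are both evicted from several sharers' L1s and later involved in conflicts are rare, software can reconstruct the lost identifier and count in the uncommon cases where they are actually needed.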