Hardware for speculative parallelization of partially-parallel loops in DSM multiprocessors

Proceedings Fifth International Symposium on High-Performance Computer Architecture. Pub Date: 1999-01-09. DOI: 10.1109/HPCA.1999.744351

Ye Zhang, Lawrence Rauchwerger, J. Torrellas


Citations: 57

Abstract

Recently, we introduced a novel framework for speculative parallelization in hardware (Y. Zhang et al., 1998). The scheme is based on a software-based run-time parallelization scheme that we proposed earlier (L. Rauchwerger and D. Padua, 1995). The idea is to execute the code (loops) speculatively in parallel. As parallel execution proceeds, extra hardware added to the directory-based cache coherence protocol of the DSM machine detects whether a dependence violation has occurred. If such a violation occurs, execution is interrupted, the state is rolled back in software to the most recent safe state, and the code is re-executed serially from that point. The safe state is typically established at the beginning of the loop. Such a scheme is somewhat related to speculative parallelization inside a multiprocessor chip, which also relies on extending the cache coherence protocol to detect dependence violations. Our scheme, however, is targeted to large-scale DSM parallelism. In addition, it does not have some of the limitations of the proposed chip-multiprocessor schemes. Such limitations include the need to bound the size of the speculative state so that it fits in a buffer or L1 cache, and a strict in-order task commit policy that may result in load imbalance among processors. Unfortunately, our scheme has higher recovery costs if a dependence violation is detected, because execution has to backtrack to a safe state that is usually the beginning of the loop. Therefore, the aim of the paper is to extend our previous hardware scheme to effectively handle codes (loops) with a modest number of cross-iteration dependences.
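
The speculate, detect, roll back, and re-execute cycle described in the abstract can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not the hardware scheme of the paper: the names `TrackedArray`, `run_speculatively`, `double`, and `prefix_sum` are hypothetical, parallel execution is imitated by running iterations in reverse order, and the dependence check is a conservative one performed after the loop finishes rather than by the coherence protocol while the loop runs.

```python
from typing import Callable, List, Tuple


class TrackedArray:
    """Shared array that logs which iteration reads or writes each element.

    Hypothetical helper for illustration; it stands in for the per-element
    access information that the paper's hardware would keep.
    """

    def __init__(self, data: List[int]):
        self.data = list(data)
        self.readers = [set() for _ in data]
        self.writers = [set() for _ in data]
        self.current_iter = -1

    def read(self, idx: int) -> int:
        self.readers[idx].add(self.current_iter)
        return self.data[idx]

    def write(self, idx: int, value: int) -> None:
        self.writers[idx].add(self.current_iter)
        self.data[idx] = value

    def has_violation(self) -> bool:
        # Conservative test (an assumption of this sketch): an element is
        # unsafe if two iterations write it, or if one iteration writes it
        # and a different iteration reads it.
        for readers, writers in zip(self.readers, self.writers):
            if len(writers) > 1:
                return True
            if writers and (readers - writers):
                return True
        return False


LoopBody = Callable[[TrackedArray, int], None]


def run_speculatively(data: List[int], body: LoopBody, n_iters: int) -> Tuple[List[int], bool]:
    """Return (final array, True if the speculative run was kept)."""
    checkpoint = list(data)  # safe state, established at the beginning of the loop

    # Speculative phase: reverse order imitates iterations running out of order.
    shared = TrackedArray(data)
    for i in reversed(range(n_iters)):
        shared.current_iter = i
        body(shared, i)
    if not shared.has_violation():
        return shared.data, True

    # Recovery: discard speculative state, restore the checkpoint, run serially.
    serial = TrackedArray(checkpoint)
    for i in range(n_iters):
        serial.current_iter = i
        body(serial, i)
    return serial.data, False


def double(a: TrackedArray, i: int) -> None:
    """Fully parallel body: iteration i touches only element i."""
    a.write(i, a.read(i) * 2)


def prefix_sum(a: TrackedArray, i: int) -> None:
    """Body with a cross-iteration dependence: iteration i reads element i-1."""
    if i > 0:
        a.write(i, a.read(i) + a.read(i - 1))
```

Calling `run_speculatively([1, 2, 3, 4], double, 4)` keeps the speculative result, whereas `run_speculatively([1, 2, 3, 4], prefix_sum, 4)` detects the dependence, restores the checkpoint, and recomputes the loop serially. The second case shows the recovery cost the abstract refers to: all speculative work is thrown away and the loop restarts from its beginning.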

