Embedded Tutorial ET1: Better-than-Worst-Case Timing Designs

2015 28th International Conference on VLSI Design, Pub Date: 2015-02-05, DOI: 10.1109/VLSID.2015.118

A. Singh



Abstract

Achieving high performance within stringent power budgets is emerging as one of the most difficult challenges in the design of current-generation digital systems. In synchronous systems, switching signals are typically allowed a fixed amount of time to settle within each clock cycle, with the clock period selected to accommodate the worst-case switching delay. Some additional timing margin, typically 10-20% of the clock period, is allowed beyond the nominal critical path delays to accommodate timing uncertainties introduced by process, voltage, and temperature (PVT) variations; these uncertainties appear to be increasing significantly in highly scaled CMOS technologies. Unfortunately, despite the lack of switching activity, the circuit continues to consume significant static power during these timing margins, which consequently results in an unwanted loss of both power and performance. Furthermore, since worst-case signal paths in CMOS are highly input dependent and generally not activated in every clock cycle, this wasteful window of circuit inactivity in a typical cycle is often longer than just the timing margin. This is particularly true for circuits with a wide distribution of path delays, where the few long paths are infrequently activated; in most clock cycles the signals stabilize, and the computation completes, quite early. Clearly, significantly higher computational throughput and power efficiency could be achieved if the resulting window of circuit inactivity during the remainder of the clock cycle could be minimized or even eliminated.

Asynchronous and data flow designs and architectures have long tried to exploit this statistical variability in the delays of circuit functional blocks by building in a capability for signaling the completion of each operation. This can potentially allow execution to proceed as soon as a functional result is available, instead of waiting out the worst-case delay for each functional block.
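
As a rough illustration of the inactivity window described above, the following Python sketch estimates the average share of each cycle in which the circuit sits idle. It is only a back-of-the-envelope model: the function names, the per-cycle settle times, and the 20% margin are assumptions chosen for illustration and do not come from the tutorial.

```python
from statistics import mean


def clock_period(worst_case_delay: float, margin_fraction: float) -> float:
    """Clock period when the margin is a fraction of the period itself.

    Solves: period = worst_case_delay + margin_fraction * period.
    """
    return worst_case_delay / (1.0 - margin_fraction)


def idle_fraction(settle_times, worst_case_delay, margin_fraction):
    """Average share of a cycle during which no signal is still switching."""
    period = clock_period(worst_case_delay, margin_fraction)
    return mean((period - t) / period for t in settle_times)


# Hypothetical settle times in ns: the 1.0 ns critical path is exercised
# in only one of the eight cycles; the other cycles finish much earlier.
settle_times_ns = [0.4, 0.5, 0.4, 0.6, 0.5, 1.0, 0.4, 0.6]

# Worst-case delay 1.0 ns plus a margin of 20% of the clock period.
print(f"{idle_fraction(settle_times_ns, 1.0, 0.2):.2f}")  # prints 0.56
```

Under these made-up numbers the circuit is inactive for more than half of an average cycle, well beyond the 20% margin alone, which is the gap that better-than-worst-case designs try to reclaim.
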
An early and classic example is carry completion signaling in ripple carry adders, which provides an indication as soon as the carry signals have stabilized and the result is valid following the application of each new set of inputs. Unfortunately, the efficient design of fully asynchronous and data flow systems has proved extremely challenging. Consequently, elements of asynchronous operation have sometimes been incorporated into traditional clock-based designs using some form of handshaking control protocol. Typically, such designs dynamically allow functional units a varying number of system clock periods to complete their operation, thereby avoiding worst-case delays in every instance.

The mechanisms employed to ensure that a functional block gets sufficient time to correctly complete its operation broadly take three forms: (1) completion signaling, where the function is designed with redundant outputs (or output coding) that directly indicate when the result is valid; (2) input-based timing prediction, where (a subset of) the inputs are decoded to quickly determine whether the circuit will need one, two, or more cycles for those inputs; and (3) error-detection-based recovery, where error detection circuits check the results at the end of every clock cycle and, if the aggressive clock timing has caused an error, initiate a recovery that requires additional cycles.

In this presentation we discuss a number of better-than-worst-case design approaches that have been proposed in the literature. We focus not only on the various low-cost error detection and recovery techniques that have been proposed, but also on other major challenges in implementing such designs. Key among them is addressing potential flip-flop metastability, which can occur if the flip-flop inputs are allowed to arrive at arbitrary times relative to the clock signal, as is the case in such overclocked designs.
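
To make the completion-signaling idea concrete, here is a minimal behavioral sketch of the carry completion adder mentioned earlier. It is a simplified model written for this note, not a circuit from the tutorial: it assumes a unit delay per adder stage, a carry-in that is known at time zero, and dual-rail carries so that each stage can tell when its carry-out has settled. The function name `carry_completion_time` is hypothetical.

```python
import random


def carry_completion_time(a: int, b: int, width: int, carry_in: int = 0):
    """Return (sum, stage delays until every carry has settled).

    Idealized model: a stage whose operand bits are equal generates (1,1)
    or kills (0,0) its carry after one stage delay, independent of the
    incoming carry; a stage whose bits differ must wait for the incoming
    carry and then adds one stage delay.
    """
    carry = carry_in
    ready = 0   # time at which the carry into the current stage is known
    done = 0    # time at which the slowest carry seen so far has settled
    total = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        if ai == bi:
            carry, ready = ai, 1   # generate or kill: known locally
        else:
            ready += 1             # propagate: follows the incoming carry
        done = max(done, ready)
    return total | (carry << width), done


# Average completion time for random 32-bit operands (assumed workload).
random.seed(0)
samples = [
    carry_completion_time(random.getrandbits(32), random.getrandbits(32), 32)[1]
    for _ in range(2000)
]
print(sum(samples) / len(samples))
```

For uniformly random operands the average completion time in this model should come out far below the 32-stage worst case, because long propagate chains are rare. The worst case still occurs, for example when one operand is all ones and the other is zero.
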
Another challenge, associated with the commonly used flip-flop-duplication-based timing error detection approach, is false error indications caused by the activation of short paths. Mitigation of this problem using path buffering and hold latches is discussed. Recent results from experimental circuits prototyped by ARM and Intel are also presented. Finally, we discuss promising new research, including efficient application of the better-than-worst-case design concept to arithmetic circuits.
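
The short-path problem can be sketched with a small timing model. The sketch assumes one common form of flip-flop duplication: a main flip-flop that samples at the aggressive clock edge and a shadow flip-flop that samples a fixed window later, with a mismatch between the two flagged as an error. The abstract does not specify this arrangement, so it is an assumption, as are the function names and all numbers. The model ignores setup and hold times and metastability, and it assumes the data value changes every cycle.

```python
def classify_cycle(data_arrival, short_path_delay, period, window):
    """Classify one cycle of a main/shadow flip-flop pair (idealized).

    data_arrival:     time at which this cycle's data settles at the input.
    short_path_delay: earliest time, after the clock edge, at which the
                      NEXT cycle's data can reach the same input.
    period:           aggressive clock period; the main flip-flop samples here.
    window:           extra delay before the shadow flip-flop samples.
    """
    main_correct = data_arrival <= period
    shadow_correct = (data_arrival <= period + window
                      and short_path_delay >= window)
    if main_correct and shadow_correct:
        return "ok"
    if main_correct:
        return "false error"            # a short path overwrote the shadow sample
    if shadow_correct:
        return "timing error detected"  # shadow value is usable for recovery
    return "undetected or unrecoverable"


def pad_short_path(short_path_delay, window):
    """Path buffering: delay a short path so that it outlasts the window."""
    return max(short_path_delay, window)


PERIOD, WINDOW = 1.0, 0.3   # hypothetical values in ns
print(classify_cycle(0.8, 0.5, PERIOD, WINDOW))   # ok
print(classify_cycle(1.2, 0.5, PERIOD, WINDOW))   # timing error detected
print(classify_cycle(0.8, 0.1, PERIOD, WINDOW))   # false error
print(classify_cycle(0.8, pad_short_path(0.1, WINDOW), PERIOD, WINDOW))  # ok
```

In this model buffering removes the false error by padding the 0.1 ns path out to the full detection window. A wider window catches later arrivals but forces more short paths to be padded, so the two pull against each other.
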
