Cores that don't count

Proceedings of the Workshop on Hot Topics in Operating Systems. Pub Date: 2021-06-01. DOI: 10.1145/3458336.3465297

P. Hochschild, Paul Turner, J. Mogul, R. Govindaraju, Parthasarathy Ranganathan, D. Culler, A. Vahdat

Citations: 68

Abstract

We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent" - the only symptom is an erroneous computation. We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we can observe the disruption they cause, often enough to see them as a distinct problem - one that will require collaboration between hardware designers, processor vendors, and systems software architects. This paper is a call-to-action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolating mechanisms, to methods for tolerating the silent data corruption they cause.
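
As a rough illustration of the kind of software-based detection the abstract alludes to, the sketch below runs the same computation on several replicas and takes a majority vote, flagging any replica whose answer disagrees. This is not the authors' method; it is a generic redundant-execution sketch. The names `vote`, `run_redundantly`, `healthy`, and `mercurial` are hypothetical, and the "mercurial" replica is a software simulation of a core that silently miscomputes on some inputs. In a real system each replica would have to run on a different physical core for the comparison to mean anything.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple


def vote(results: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Return the strict-majority result and the indices of dissenting replicas."""
    winner, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # Without a strict majority we cannot tell which replica is wrong.
        raise RuntimeError("no majority among replica results")
    suspects = [i for i, r in enumerate(results) if r != winner]
    return winner, suspects


def run_redundantly(
    replicas: Sequence[Callable[[int], int]], x: int
) -> Tuple[int, List[int]]:
    """Run every replica on the same input and vote on the outputs."""
    return vote([replica(x) for replica in replicas])


def healthy(x: int) -> int:
    """Stand-in for a correct core: computes x squared."""
    return x * x


def mercurial(x: int) -> int:
    """Simulated faulty core: silently off by one when x is a multiple of 7."""
    return x * x + (1 if x % 7 == 0 else 0)


if __name__ == "__main__":
    replicas = [healthy, mercurial, healthy]
    for x in (3, 14):
        value, suspects = run_redundantly(replicas, x)
        print(f"x={x}: value={value}, suspect replicas={suspects}")
```

Voting both tolerates the bad result (the majority value is returned) and yields a list of suspect replicas that could feed an isolation policy. The cost is at least triple the compute, which is why such checks would likely be reserved for critical computations or periodic screening.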
