通过FreeFault平衡可靠性、成本和性能

2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2015-03-09 DOI:10.1109/HPCA.2015.7056053

Dong-Wan Kim, M. Erez

{"title":"通过FreeFault平衡可靠性、成本和性能","authors":"Dong-Wan Kim, M. Erez","doi":"10.1109/HPCA.2015.7056053","DOIUrl":null,"url":null,"abstract":"Memory errors have been a major source of system failures and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mechanisms or significantly increase their overhead. In this paper, we present FreeFault as a hardware-only, transparent, and nearly-free resilience mechanism that is implemented entirely within a processor and can tolerate the majority of DRAM faults. FreeFault repurposes portions of the last-level cache for storing retired memory regions and augments a hardware memory scrubber to monitor memory health and aid retirement decisions. Because it relies on existing structures (cache associativity) for retirement/remapping type repair, FreeFault has essentially no hardware overhead. Because it requires a very modest portion of the cache (as small as 8KB) to cover a large fraction of DRAM faults, FreeFault has almost no impact on performance. We explain how FreeFault adds an attractive layer in an overall resilience scheme of highly-reliable and highly-available systems by delaying, and even entirely avoiding, calling upon software to make tradeoff decisions between memory capacity, performance, and reliability.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"42 1","pages":"439-450"},"PeriodicalIF":0.0000,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Balancing reliability, cost, and performance tradeoffs with FreeFault\",\"authors\":\"Dong-Wan Kim, M. Erez\",\"doi\":\"10.1109/HPCA.2015.7056053\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Memory errors have been a major source of system failures and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mechanisms or significantly increase their overhead. In this paper, we present FreeFault as a hardware-only, transparent, and nearly-free resilience mechanism that is implemented entirely within a processor and can tolerate the majority of DRAM faults. FreeFault repurposes portions of the last-level cache for storing retired memory regions and augments a hardware memory scrubber to monitor memory health and aid retirement decisions. Because it relies on existing structures (cache associativity) for retirement/remapping type repair, FreeFault has essentially no hardware overhead. Because it requires a very modest portion of the cache (as small as 8KB) to cover a large fraction of DRAM faults, FreeFault has almost no impact on performance. We explain how FreeFault adds an attractive layer in an overall resilience scheme of highly-reliable and highly-available systems by delaying, and even entirely avoiding, calling upon software to make tradeoff decisions between memory capacity, performance, and reliability.\",\"PeriodicalId\":6593,\"journal\":{\"name\":\"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)\",\"volume\":\"42 1\",\"pages\":\"439-450\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-03-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCA.2015.7056053\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2015.7056053","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

内存错误一直是系统故障的主要来源，并且随着内存的不断扩展，故障率可能会进一步上升。这种不断增加的故障率，特别是与集成封装内存储器的出现相结合时，可能会超出传统容错机制的能力，或者显著增加它们的开销。在本文中，我们将FreeFault作为一种完全在处理器内实现的纯硬件、透明和几乎免费的弹性机制，并且可以容忍大多数DRAM故障。FreeFault将最后一级缓存的部分重新用于存储退役的内存区域，并增加了一个硬件内存清除器来监视内存运行状况并帮助做出退役决策。因为FreeFault依赖于现有的结构(缓存关联性)来进行退役/重新映射类型修复，所以FreeFault基本上没有硬件开销。因为FreeFault只需要很小一部分缓存(小至8KB)来覆盖大部分DRAM故障，所以FreeFault对性能几乎没有影响。我们解释了FreeFault如何通过延迟甚至完全避免调用软件在内存容量、性能和可靠性之间做出权衡决定，在高可靠性和高可用性系统的整体弹性方案中添加一个有吸引力的层。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Balancing reliability, cost, and performance tradeoffs with FreeFault

Memory errors have been a major source of system failures and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mechanisms or significantly increase their overhead. In this paper, we present FreeFault as a hardware-only, transparent, and nearly-free resilience mechanism that is implemented entirely within a processor and can tolerate the majority of DRAM faults. FreeFault repurposes portions of the last-level cache for storing retired memory regions and augments a hardware memory scrubber to monitor memory health and aid retirement decisions. Because it relies on existing structures (cache associativity) for retirement/remapping type repair, FreeFault has essentially no hardware overhead. Because it requires a very modest portion of the cache (as small as 8KB) to cover a large fraction of DRAM faults, FreeFault has almost no impact on performance. We explain how FreeFault adds an attractive layer in an overall resilience scheme of highly-reliable and highly-available systems by delaying, and even entirely avoiding, calling upon software to make tradeoff decisions between memory capacity, performance, and reliability.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)

自引率

0.00%

发文量