Exploration of system availability during software-based self-testing in many-core systems under test latency constraints

2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT). Pub Date: 2014-11-24. DOI: 10.1109/DFT.2014.6962088

M. Skitsas, C. Nicopoulos, M. Michael


Citations: 5

Abstract

As technology scales, the increased vulnerability of modern systems to unreliable components becomes a major problem in the era of multi-/many-core architectures. Recently, several on-line testing techniques have been proposed, aimed at detecting errors caused by wear-out/aging-related defects that can appear during the lifetime of a system. In this work, we investigate the relation between system test latency and test-time overhead in multi-/many-core systems with a shared Last-Level Cache (LLC) for periodic Software-Based Self-Testing (SBST), under different test scheduling policies. The investigated scheduling policies primarily vary the number of cores concurrently under test in the overall system testing session. Our extensive, workload-driven dynamic exploration reveals an inverse relation between the two test measures: as the number of cores concurrently under test increases, system test latency decreases, but at the cost of significantly increased test time, which sacrifices system availability for running normal workloads. Under given system test latency constraints, which should be used to control system recovery time in the event of an error detection, our exploration framework identifies the scheduling policy under which overall test-time overhead is minimized and, hence, system availability is maximized. Without any loss of generality, a 16-core system is explored in a full-system, execution-driven simulation framework running multi-threaded PARSEC workloads [1].
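The trade-off described in the abstract can be illustrated with a small Python sketch. It is not the authors' exploration framework, which relies on full-system simulation. The names `PolicyCost`, `evaluate_policy`, and `best_policy` are hypothetical, and the linear contention model (each additional core under test lengthens every test by a fixed fraction, standing in for shared-LLC interference) is an assumption chosen only to reproduce the inverse relation between test latency and test-time overhead.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyCost:
    cores_under_test: int      # k: cores tested concurrently (the scheduling policy)
    test_latency: float        # time until every core has been tested once
    test_time_overhead: float  # total core-time spent testing instead of running workloads


def evaluate_policy(num_cores, k, base_test_time, contention):
    """Cost of testing `num_cores` cores in batches of `k`.

    ASSUMPTION (not from the paper): each extra core under test
    stretches every test by a fixed fraction `contention`.
    """
    per_core_time = base_test_time * (1.0 + contention * (k - 1))
    batches = math.ceil(num_cores / k)
    return PolicyCost(
        cores_under_test=k,
        test_latency=batches * per_core_time,
        test_time_overhead=num_cores * per_core_time,
    )


def best_policy(num_cores, latency_limit, base_test_time=1.0, contention=0.1):
    """Return the policy with the least overhead that meets the latency limit,
    or None when no policy satisfies the constraint."""
    candidates = [
        evaluate_policy(num_cores, k, base_test_time, contention)
        for k in range(1, num_cores + 1)
    ]
    feasible = [c for c in candidates if c.test_latency <= latency_limit]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.test_time_overhead)


if __name__ == "__main__":
    # 16-core system, as in the abstract; the limit of 6.0 time units is illustrative.
    print(best_policy(num_cores=16, latency_limit=6.0))
```

Under these assumed parameters, tightening the latency limit pushes the selection toward more cores under test at once and therefore toward higher overhead, while a loose limit lets the one-core-at-a-time policy win.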
