D. Tang, P. Carruthers, Zuheir Totari, Michael W. Shapiro
{"title":"硬件故障时内存页退役对系统RAS影响的评估","authors":"D. Tang, P. Carruthers, Zuheir Totari, Michael W. Shapiro","doi":"10.1109/DSN.2006.13","DOIUrl":null,"url":null,"abstract":"The Solaris 10 operating system includes a number of new features for predictive self-healing. One such feature is the ability of the fault management software to diagnose memory errors and drive automatic memory page retirement (MPR), intended to reduce the negative impact of permanent memory faults that generate either correctable or uncorrectable errors on system reliability, availability, and serviceability (RAS). The MPR technique allows memory pages suffering from correctable errors and relocatable clean pages suffering from uncorrectable errors to be removed from use in the virtual memory system without interrupting user applications. It also allows relocatable dirty pages associated with uncorrectable errors to be isolated with limited impact on affected user processes, avoiding an outage for the entire system. This study applies analytical models, with parameters calibrated by field experience, to quantify the reduction that can be made by this operating system self-healing technique on the system interruptions, yearly downtime, and number of services introduced by hardware permanent faults, for typical low-end and mid-range server systems. The results show that significant improvements can be made on these three system RAS metrics by deploying the MPR capability","PeriodicalId":228470,"journal":{"name":"International Conference on Dependable Systems and Networks (DSN'06)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"73","resultStr":"{\"title\":\"Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults\",\"authors\":\"D. Tang, P. Carruthers, Zuheir Totari, Michael W. Shapiro\",\"doi\":\"10.1109/DSN.2006.13\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Solaris 10 operating system includes a number of new features for predictive self-healing. One such feature is the ability of the fault management software to diagnose memory errors and drive automatic memory page retirement (MPR), intended to reduce the negative impact of permanent memory faults that generate either correctable or uncorrectable errors on system reliability, availability, and serviceability (RAS). The MPR technique allows memory pages suffering from correctable errors and relocatable clean pages suffering from uncorrectable errors to be removed from use in the virtual memory system without interrupting user applications. It also allows relocatable dirty pages associated with uncorrectable errors to be isolated with limited impact on affected user processes, avoiding an outage for the entire system. This study applies analytical models, with parameters calibrated by field experience, to quantify the reduction that can be made by this operating system self-healing technique on the system interruptions, yearly downtime, and number of services introduced by hardware permanent faults, for typical low-end and mid-range server systems. The results show that significant improvements can be made on these three system RAS metrics by deploying the MPR capability\",\"PeriodicalId\":228470,\"journal\":{\"name\":\"International Conference on Dependable Systems and Networks (DSN'06)\",\"volume\":\"22 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"73\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Dependable Systems and Networks (DSN'06)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSN.2006.13\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Dependable Systems and Networks (DSN'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN.2006.13","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 73
摘要
Solaris 10操作系统包括许多预测性自修复的新特性。其中一个特性是故障管理软件诊断内存错误和驱动自动内存页面退役(MPR)的能力,其目的是减少永久内存故障对系统可靠性、可用性和可服务性(RAS)产生的可纠正或不可纠正错误的负面影响。MPR技术允许在不中断用户应用程序的情况下,从虚拟内存系统中删除具有可纠正错误的内存页和具有不可纠正错误的可重新定位的干净页。它还允许隔离与不可纠正错误相关联的可重定位脏页面,对受影响的用户进程的影响有限,从而避免整个系统的中断。本研究应用分析模型,通过现场经验校准参数,量化这种操作系统自修复技术对系统中断、年度停机时间和硬件永久故障引入的服务数量的减少,适用于典型的中低端服务器系统。结果表明,通过部署MPR能力,可以显著改善这三个系统RAS指标
Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults
The Solaris 10 operating system includes a number of new features for predictive self-healing. One such feature is the ability of the fault management software to diagnose memory errors and drive automatic memory page retirement (MPR), intended to reduce the negative impact of permanent memory faults that generate either correctable or uncorrectable errors on system reliability, availability, and serviceability (RAS). The MPR technique allows memory pages suffering from correctable errors and relocatable clean pages suffering from uncorrectable errors to be removed from use in the virtual memory system without interrupting user applications. It also allows relocatable dirty pages associated with uncorrectable errors to be isolated with limited impact on affected user processes, avoiding an outage for the entire system. This study applies analytical models, with parameters calibrated by field experience, to quantify the reduction that can be made by this operating system self-healing technique on the system interruptions, yearly downtime, and number of services introduced by hardware permanent faults, for typical low-end and mid-range server systems. The results show that significant improvements can be made on these three system RAS metrics by deploying the MPR capability