Networked Windows NT System Field Failure Data Analysis

Proceedings 1999 Pacific Rim International Symposium on Dependable Computing. Pub Date: 1999-12-16. DOI: 10.1109/PRDC.1999.816227

Jun Xu, Z. Kalbarczyk, R. Iyer


Citations: 154


Networked Windows NT system field failure data analysis

This paper presents a measurement-based dependability study of a networked Windows NT system, based on field data collected from the NT System Logs of 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to total system downtime (22% and 10%, respectively), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most reboots (58%) are unclassified, indicating the need for better logging techniques, and (7) maintenance and configuration account for 24% of system downtime.
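As a rough illustration of how a per-server availability figure such as the "over 99%" in observation (4) could be derived from reboot records, the sketch below computes availability as the fraction of the observation period not spent in outages. It is not the authors' method: the function name `availability`, the representation of an outage as a `(went_down, came_up)` pair, and the sample timestamps are all hypothetical, and the sketch assumes outages do not overlap.

```python
from datetime import datetime, timedelta


def availability(outages, period_start, period_end):
    """Return the fraction of the observation period the server was up.

    `outages` is a list of (went_down, came_up) datetime pairs. This layout
    is an assumption for illustration, not the NT System Log format.
    Outages are assumed not to overlap one another.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("observation period must have positive length")
    down = 0.0
    for went_down, came_up in outages:
        # Clip each outage to the observation window before counting it.
        start = max(went_down, period_start)
        end = min(came_up, period_end)
        if end > start:
            down += (end - start).total_seconds()
    return 1.0 - down / total


# Made-up sample data: a 120-day window with a 30-minute and a 2-hour outage.
period_start = datetime(1999, 1, 1)
period_end = period_start + timedelta(days=120)
outages = [
    (datetime(1999, 1, 10, 3, 0), datetime(1999, 1, 10, 3, 30)),
    (datetime(1999, 2, 20, 12, 0), datetime(1999, 2, 20, 14, 0)),
]
example_availability = availability(outages, period_start, period_end)
```

With these invented inputs the total downtime is 9,000 seconds out of 10,368,000, so the result is slightly above 0.999. Attributing downtime to categories (hardware, system software, maintenance, and so on) would additionally require a classification of each reboot, which the abstract notes is missing for 58% of them.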
