Networked Windows NT system field failure data analysis

Jun Xu, Z. Kalbarczyk, R. Iyer
Proceedings 1999 Pacific Rim International Symposium on Dependable Computing. Published 1999-12-16. DOI: 10.1109/PRDC.1999.816227 (https://doi.org/10.1109/PRDC.1999.816227)
Citations: 154

Abstract

This paper presents a measurement-based dependability study of a networked Windows NT system, based on field data collected from the NT system logs of 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to the total system downtime (22% and 10%, respectively); (2) recovery from application software failures is usually quick; (3) in many cases, more than one reboot is required to recover from a failure; (4) the average availability of an individual server is over 99%; (5) there is a strong indication of error dependency or error propagation across the network; (6) most reboots (58%) are unclassified, indicating the need for better logging techniques; (7) maintenance and configuration contribute 24% of system downtime.
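As a rough illustration of how a per-server availability figure such as the "over 99%" in observation (4) could be derived from reboot-only logs, the sketch below treats each outage as a (down, up) timestamp pair and computes availability as one minus the fraction of the observation window spent down. The record layout, the function name `availability`, and the sample timestamps are all hypothetical; the abstract does not describe the paper's actual log format or computation.

```python
from datetime import datetime, timedelta


def availability(outages, window_start, window_end):
    """Return the fraction of the observation window the server was up.

    `outages` is a list of (down_at, up_at) datetime pairs. This layout is
    an assumption made for illustration, not the format used in the paper.
    """
    total = (window_end - window_start).total_seconds()
    down = 0.0
    for down_at, up_at in outages:
        # Clip each outage to the observation window before summing.
        start = max(down_at, window_start)
        end = min(up_at, window_end)
        if end > start:
            down += (end - start).total_seconds()
    return 1.0 - down / total


# Hypothetical 120-day window (roughly four months) with two outages.
window_start = datetime(2000, 1, 1)
window_end = window_start + timedelta(days=120)
outages = [
    (datetime(2000, 1, 10, 3, 0), datetime(2000, 1, 10, 3, 30)),   # 30 minutes
    (datetime(2000, 2, 20, 14, 0), datetime(2000, 2, 20, 16, 0)),  # 2 hours
]

server_availability = availability(outages, window_start, window_end)
print(f"availability = {server_availability:.4%}")
```

With these made-up numbers the server is down for 2.5 hours out of 120 days, so the result sits well above 99%. Averaging the same quantity over all servers would give a fleet-level figure, again under the assumption that downtime can be bracketed by paired timestamps.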