日志汇总和异常检测,用于故障排除分布式系统

D. Gunter, B. Tierney, Aaron Brown, D. M. Swany, J. Bresnahan, J. Schopf
{"title":"日志汇总和异常检测,用于故障排除分布式系统","authors":"D. Gunter, B. Tierney, Aaron Brown, D. M. Swany, J. Bresnahan, J. Schopf","doi":"10.1109/GRID.2007.4354137","DOIUrl":null,"url":null,"abstract":"Today's system monitoring tools are capable of detecting system failures such as host failures, OS errors, and network partitions in near-real time. Unfortunately, the same cannot yet be said of the end-to-end distributed software stack. Any given action, for example, reliably transferring a directory of files, can involve a wide range of complex and interrelated actions across multiple pieces of software: checking user certificates and permissions, getting details for all files, performing third-party transfers, understanding re-try policy decisions, etc. We present an infrastructure for troubleshooting complex middleware, a general purpose technique for configurable log summarization, and an anomaly detection technique that works in near-real time on running Grid middleware. We present results gathered using this infrastructure from instrumented Grid middleware and applications running on the Emulab testbed. From these results, we analyze the effectiveness of several algorithms at accurately detecting a variety of performance anomalies.","PeriodicalId":304508,"journal":{"name":"2007 8th IEEE/ACM International Conference on Grid Computing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":"{\"title\":\"Log summarization and anomaly detection for troubleshooting distributed systems\",\"authors\":\"D. Gunter, B. Tierney, Aaron Brown, D. M. Swany, J. Bresnahan, J. Schopf\",\"doi\":\"10.1109/GRID.2007.4354137\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Today's system monitoring tools are capable of detecting system failures such as host failures, OS errors, and network partitions in near-real time. Unfortunately, the same cannot yet be said of the end-to-end distributed software stack. Any given action, for example, reliably transferring a directory of files, can involve a wide range of complex and interrelated actions across multiple pieces of software: checking user certificates and permissions, getting details for all files, performing third-party transfers, understanding re-try policy decisions, etc. We present an infrastructure for troubleshooting complex middleware, a general purpose technique for configurable log summarization, and an anomaly detection technique that works in near-real time on running Grid middleware. We present results gathered using this infrastructure from instrumented Grid middleware and applications running on the Emulab testbed. From these results, we analyze the effectiveness of several algorithms at accurately detecting a variety of performance anomalies.\",\"PeriodicalId\":304508,\"journal\":{\"name\":\"2007 8th IEEE/ACM International Conference on Grid Computing\",\"volume\":\"43 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"50\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2007 8th IEEE/ACM International Conference on Grid Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/GRID.2007.4354137\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2007 8th IEEE/ACM International Conference on Grid Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/GRID.2007.4354137","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 50

摘要

如今的系统监控工具能够近乎实时地检测系统故障,如主机故障、操作系统错误和网络分区。不幸的是,端到端分布式软件堆栈的情况并非如此。任何给定的操作(例如,可靠地传输一个文件目录)都可能涉及跨多个软件部分的广泛复杂且相互关联的操作:检查用户证书和权限、获取所有文件的详细信息、执行第三方传输、理解重试策略决策等。我们提供了一种用于排除复杂中间件故障的基础结构,一种用于配置日志摘要的通用技术,以及一种可以在运行中的网格中间件上近乎实时地工作的异常检测技术。我们展示了使用该基础设施从运行在Emulab测试台上的仪表化网格中间件和应用程序中收集到的结果。根据这些结果,我们分析了几种算法在准确检测各种性能异常方面的有效性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Log summarization and anomaly detection for troubleshooting distributed systems
Today's system monitoring tools are capable of detecting system failures such as host failures, OS errors, and network partitions in near-real time. Unfortunately, the same cannot yet be said of the end-to-end distributed software stack. Any given action, for example, reliably transferring a directory of files, can involve a wide range of complex and interrelated actions across multiple pieces of software: checking user certificates and permissions, getting details for all files, performing third-party transfers, understanding re-try policy decisions, etc. We present an infrastructure for troubleshooting complex middleware, a general purpose technique for configurable log summarization, and an anomaly detection technique that works in near-real time on running Grid middleware. We present results gathered using this infrastructure from instrumented Grid middleware and applications running on the Emulab testbed. From these results, we analyze the effectiveness of several algorithms at accurately detecting a variety of performance anomalies.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信