Log summarization and anomaly detection for troubleshooting distributed systems

2007 8th IEEE/ACM International Conference on Grid Computing. Pub Date: 2007-08-01. DOI: 10.1109/GRID.2007.4354137

D. Gunter, B. Tierney, Aaron Brown, D. M. Swany, J. Bresnahan, J. Schopf

Citations: 50

Abstract

Today's system monitoring tools are capable of detecting system failures such as host failures, OS errors, and network partitions in near-real time. Unfortunately, the same cannot yet be said of the end-to-end distributed software stack. Any given action, for example, reliably transferring a directory of files, can involve a wide range of complex and interrelated actions across multiple pieces of software: checking user certificates and permissions, getting details for all files, performing third-party transfers, understanding re-try policy decisions, etc. We present an infrastructure for troubleshooting complex middleware, a general purpose technique for configurable log summarization, and an anomaly detection technique that works in near-real time on running Grid middleware. We present results gathered using this infrastructure from instrumented Grid middleware and applications running on the Emulab testbed. From these results, we analyze the effectiveness of several algorithms at accurately detecting a variety of performance anomalies.
