Enhancing HPC System Log Analysis by Identifying Message Origin in Source Code
Megan Hickman, Dakota Fulp, Elisabeth Baseman, S. Blanchard, Hugh Greenberg, William M. Jones, Nathan Debardeleben
2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), October 2018
DOI: 10.1109/ISSREW.2018.00-23
Citations: 3
Abstract
Supercomputers, high performance computers, and clusters are composed of very large numbers of independent operating systems, each generating its own system logs. Messages are generated locally on each host and are usually forwarded to a central logging infrastructure, which keeps a master record of the system as a whole. At Los Alamos National Laboratory (LANL), a collection of open source cloud tools logs over a hundred million system log messages per day from over a dozen such systems. Understanding what source code created those messages can be extremely useful to system administrators troubleshooting these complex systems, as it can give insight into a subsystem (disk, network, etc.) or even the specific source line. Oftentimes, debugging supercomputers is done in environments where open access cannot be provided to all individuals due to security concerns. As such, providing a means of linking system log messages to source code lines enables communication between system administrators and source developers or supercomputer vendors. In this work, we demonstrate a prototype tool which aims to provide such an expert system. We leverage capabilities from ElasticSearch, one of the open source cloud tools deployed at LANL, and with our own metrics develop a means of matching source code lines, as well as files, with high confidence. We discuss confidence metrics and show that in our experiments 92% of syslog lines were correctly matched. For any future samples, we predict with 95% confidence that the correct file will be detected between 88.2% and 95.8% of the time. Finally, we discuss enhancements that are underway to improve the tool and study it on a larger dataset.
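The core idea — matching a runtime log message back to the format string in the source that emitted it — can be sketched with a simple heuristic: the constant fragments of a printf-style format string must appear in the message. The sketch below is purely illustrative; the function names, the candidate list, and the token-overlap score are assumptions, not the authors' actual metrics or their ElasticSearch-based pipeline.

```python
# Hypothetical sketch of matching a syslog message to candidate source lines.
# The scoring heuristic (fraction of constant tokens found) is an assumption
# for illustration, not the paper's confidence metric.
import re

def score(log_message: str, format_string: str) -> float:
    """Score a printf-style format string against a log message.

    Split the format string on %-conversion specifiers; the score is the
    fraction of the remaining constant tokens that occur in the message.
    """
    fragments = re.split(r'%[-+ #0]*\d*(?:\.\d+)?[a-zA-Z]', format_string)
    tokens = [t for frag in fragments for t in frag.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in log_message)
    return hits / len(tokens)

def best_match(log_message, candidates):
    """Return the (source_location, score) pair with the highest score."""
    return max(((loc, score(log_message, fmt)) for loc, fmt in candidates),
               key=lambda pair: pair[1])

# Candidate (location, format string) pairs, e.g. harvested from source files.
candidates = [
    ("net/driver.c:812", "link down on interface %s"),
    ("fs/journal.c:301", "journal commit took %d ms on %s"),
    ("mm/oom.c:77",      "killed process %d (%s) due to memory pressure"),
]

loc, s = best_match("kernel: journal commit took 5081 ms on sda1", candidates)
```

A production system would index the harvested format strings (e.g. in ElasticSearch, as the paper does) rather than scanning them linearly, and would attach a calibrated confidence to each match instead of a raw overlap ratio.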