Proceedings of the Fourth International Workshop on HPC User Support Tools最新文献

筛选
英文 中文
Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs 基于马尔可夫链模型的高性能计算系统日志异常检测
Proceedings of the Fourth International Workshop on HPC User Support Tools Pub Date : 2017-11-12 DOI: 10.1145/3152493.3152559
Abida Haque, Alexandra DeLucia, Elisabeth Baseman
{"title":"Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs","authors":"Abida Haque, Alexandra DeLucia, Elisabeth Baseman","doi":"10.1145/3152493.3152559","DOIUrl":"https://doi.org/10.1145/3152493.3152559","url":null,"abstract":"As high performance computing approaches the exascale era, analyzing the massive amount of monitoring data generated by supercomputers is quickly becoming intractable for human analysts. In particular, system logs, which are a crucial source of information regarding machine health and root cause analysis of problems and failures, are becoming far too large for a human to review by hand. We take a step toward mitigating this problem through mathematical modeling of textual system log data in order to automatically capture normal behavior and identify anomalous and potentially interesting log messages. We learn a Markov chain model from average case system logs and use it to generate synthetic system log data. We present a variety of evaluation metrics for scoring similarity between the synthetic logs and the real logs, thus defining and quantifying normal behavior. Then, we explore the abilities of this learned model to identify anomalous behavior by evaluating its ability to catch inserted and missing log messages. We evaluate our model and its performance on the anomaly detection task using a large set of system log files from two institutional computing clusters at Los Alamos National Laboratory. We find that while our model seems to pick up on key features of normal behavior, its ability to detect anomalies varies greatly by anomaly type and the training and test data used. Overall, we find mathematical modeling of system logs to be a promising area for further work, particularly with the goal of aiding human operators in troubleshooting tasks.","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115487493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
ITALC: Interactive Tool for Application-Level Checkpointing 用于应用程序级检查点的交互式工具
Proceedings of the Fourth International Workshop on HPC User Support Tools Pub Date : 2017-11-12 DOI: 10.1145/3152493.3152558
R. Arora, Trung Nguyen Ba
{"title":"ITALC: Interactive Tool for Application-Level Checkpointing","authors":"R. Arora, Trung Nguyen Ba","doi":"10.1145/3152493.3152558","DOIUrl":"https://doi.org/10.1145/3152493.3152558","url":null,"abstract":"The computational resources at open-science supercomputing centers are shared among multiple users at a given time, and hence are governed by policies that ensure their fair and optimal usage. Such policies can impose upper-limits on (1) the number of compute-nodes, and (2) the wall-clock time that can be requested per computational job. Given these limits on computational jobs, several applications may not run to completion in a single session. Therefore, as a workaround, the users are advised to take advantage of the checkpoint-and-restart technique and spread their computations across multiple interdependent computational jobs. The checkpoint-and-restart technique helps in saving the execution state of the applications periodically. A saved state is known as a checkpoint. When their computational jobs time-out after running for the maximum wall-clock time, while leaving their computations incomplete, the users can submit new jobs to resume their computations using the checkpoints saved during their previous job runs. The checkpoint-and-restart technique can also be useful for making the applications tolerant to certain types of faults, viz., network and compute-node failures. When this technique is built within an application itself, it is called Application-Level Checkpointing (ALC). We are developing an interactive tool to assist the users in semi-automatically inserting the ALC mechanism into their existing applications without doing any manual reengineering. As compared to other approaches for checkpointing, the checkpoints written with our tool have smaller memory footprint, and thus, incur a smaller I/O overhead.","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132319835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Testpilot: A Flexible Framework for User-centric Testing of HPC Clusters Testpilot:一个以用户为中心的HPC集群测试的灵活框架
Proceedings of the Fourth International Workshop on HPC User Support Tools Pub Date : 2017-11-12 DOI: 10.1145/3152493.3152555
K. Colby, A. Maji, Jason Rahman, J. Bottum
{"title":"Testpilot: A Flexible Framework for User-centric Testing of HPC Clusters","authors":"K. Colby, A. Maji, Jason Rahman, J. Bottum","doi":"10.1145/3152493.3152555","DOIUrl":"https://doi.org/10.1145/3152493.3152555","url":null,"abstract":"HPC systems are made of many complex hardware and software components, and interaction between these components can often break, leading to job failures and customer dissatisfaction. Testing focused on individual components is often inadequate to identify broken inter-component interactions, therefore, to detect and avoid these, a holistic testing framework is needed which can test the full functionality and performance of a cluster from a user's perspective. Existing tools for HPC cluster testing are either rigid (i.e. works within the context of a single cluster) or are focused on system components (i.e., OS and middleware). In this paper, we present Testpilot---a flexible, holistic, and user-centric testing framework which can be used by system administrators, support staff, or even by users themselves. Testpilot can be used in various testing scenarios such as application testing, application update, OS update, or for continuous monitoring of cluster health. The authors have found Testpilot to be invaluable for regression testing at their HPC site and it has caught many issues that would have otherwise gone into production unnoticed.","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125338234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Nix as HPC package management system Nix作为HPC包管理系统
Proceedings of the Fourth International Workshop on HPC User Support Tools Pub Date : 2017-11-12 DOI: 10.1145/3152493.3152556
B. Bzeznik, O. Henriot, Valentin Reis, Olivier Richard, Laure Tavard
{"title":"Nix as HPC package management system","authors":"B. Bzeznik, O. Henriot, Valentin Reis, Olivier Richard, Laure Tavard","doi":"10.1145/3152493.3152556","DOIUrl":"https://doi.org/10.1145/3152493.3152556","url":null,"abstract":"Modern High Performance Computing systems are becoming larger and more heterogeneous. The proper management of software for the users of such systems poses a significant challenge. These users run very diverse applications that may be compiled with proprietary tools for specialized hardware. Moreover, the application life-cycle of these software may exceed the lifetime of the HPC systems themselves. These difficulties motivate the use of specialized package management systems. In this paper, we outline an approach to HPC package development, deployment, management, sharing, and reuse based on the Nix functional package manager. We report our experience with this approach inside the GRICAD HPC center[GRICAD 2017a] in Grenoble over a 12 month period and compare it to other existing approaches.","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121832563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
An Edge Service for Managing HPC Workflows 管理HPC工作流的边缘服务
Proceedings of the Fourth International Workshop on HPC User Support Tools Pub Date : 2017-11-12 DOI: 10.1145/3152493.3152557
J. T. Childers, T. Uram, D. Benjamin, T. LeCompte, M. Papka
{"title":"An Edge Service for Managing HPC Workflows","authors":"J. T. Childers, T. Uram, D. Benjamin, T. LeCompte, M. Papka","doi":"10.1145/3152493.3152557","DOIUrl":"https://doi.org/10.1145/3152493.3152557","url":null,"abstract":"Large experimental collaborations, such as those at the Large Hadron Collider at CERN, have developed large job management systems running hundreds of thousands of jobs across worldwide computing grids. HPC facilities are becoming more important to these data-intensive workflows and integrating them into the experiment job management system is non-trivial due to increased security and heterogeneous computing environments. The following article describes a common edge service developed and deployed on DOE supercomputers for both small users and large collaborations. This edge service provides a uniform interaction across many different supercomputers. Example users are described with the related performance.","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129910639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Proceedings of the Fourth International Workshop on HPC User Support Tools 第四届HPC用户支持工具国际研讨会论文集
{"title":"Proceedings of the Fourth International Workshop on HPC User Support Tools","authors":"","doi":"10.1145/3152493","DOIUrl":"https://doi.org/10.1145/3152493","url":null,"abstract":"","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114272716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信