使用轻量级ML技术识别数据密集型工作流的执行异常

2020 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2020-09-22 DOI:10.1109/HPEC43674.2020.9286139

Cong Wang, G. Papadimitriou, M. Kiran, A. Mandal, E. Deelman

{"title":"使用轻量级ML技术识别数据密集型工作流的执行异常","authors":"Cong Wang, G. Papadimitriou, M. Kiran, A. Mandal, E. Deelman","doi":"10.1109/HPEC43674.2020.9286139","DOIUrl":null,"url":null,"abstract":"Today's computational science applications are increasingly dependent on many complex, data-intensive operations on distributed datasets that originate from a variety of scientific instruments and repositories. To manage this complexity, science workflows are created to automate the execution of these computational and data transfer tasks, which significantly improves scientific productivity. As the scale of workflows rapidly increases, detecting anomalous behaviors in workflow executions has become critical to ensure timely and accurate science products. In this paper, we present a set of lightweight machine learning-based techniques, including both supervised and unsupervised algorithms, to identify anomalous workflow behaviors. We perform anomaly analysis on both workflow-level and task-level datasets collected from real workflow executions on a distributed cloud testbed. Results show that the workflow-level analysis employing k-means clustering can accurately cluster anomalous, i.e. failure-prone and poorly performing workflows into statistically similar classes with a reasonable quality of clustering, achieving over 0.7 for Normalized Mutual Information and Completeness scores. These results affirm the selection of the workflow-level features for workflow anomaly analysis. For task-level analysis, the Decision Tree classifier achieves >80% accuracy, while other tested classifiers can achieve >50% accuracy in most cases. We believe that these promising results can be a foundation for future research on anomaly detection and failure prediction for scientific workflows running in production environments.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Identifying Execution Anomalies for Data Intensive Workflows Using Lightweight ML Techniques\",\"authors\":\"Cong Wang, G. Papadimitriou, M. Kiran, A. Mandal, E. Deelman\",\"doi\":\"10.1109/HPEC43674.2020.9286139\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Today's computational science applications are increasingly dependent on many complex, data-intensive operations on distributed datasets that originate from a variety of scientific instruments and repositories. To manage this complexity, science workflows are created to automate the execution of these computational and data transfer tasks, which significantly improves scientific productivity. As the scale of workflows rapidly increases, detecting anomalous behaviors in workflow executions has become critical to ensure timely and accurate science products. In this paper, we present a set of lightweight machine learning-based techniques, including both supervised and unsupervised algorithms, to identify anomalous workflow behaviors. We perform anomaly analysis on both workflow-level and task-level datasets collected from real workflow executions on a distributed cloud testbed. Results show that the workflow-level analysis employing k-means clustering can accurately cluster anomalous, i.e. failure-prone and poorly performing workflows into statistically similar classes with a reasonable quality of clustering, achieving over 0.7 for Normalized Mutual Information and Completeness scores. These results affirm the selection of the workflow-level features for workflow anomaly analysis. For task-level analysis, the Decision Tree classifier achieves >80% accuracy, while other tested classifiers can achieve >50% accuracy in most cases. We believe that these promising results can be a foundation for future research on anomaly detection and failure prediction for scientific workflows running in production environments.\",\"PeriodicalId\":168544,\"journal\":{\"name\":\"2020 IEEE High Performance Extreme Computing Conference (HPEC)\",\"volume\":\"49 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE High Performance Extreme Computing Conference (HPEC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPEC43674.2020.9286139\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC43674.2020.9286139","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

今天的计算科学应用越来越依赖于许多复杂的、数据密集型的分布式数据集操作，这些数据集来自各种科学仪器和存储库。为了管理这种复杂性，创建了科学工作流来自动执行这些计算和数据传输任务，这大大提高了科学生产力。随着工作流规模的迅速增加，检测工作流执行中的异常行为对于确保及时准确的科学产品变得至关重要。在本文中，我们提出了一组轻量级的基于机器学习的技术，包括监督和无监督算法，以识别异常的工作流行为。我们在分布式云测试平台上对从实际工作流执行中收集的工作流级和任务级数据集进行异常分析。结果表明，采用k-means聚类的工作流级分析可以准确地将异常(即易故障和性能差的工作流)聚到统计上相似的类中，聚类质量合理，归一化互信息和完整性得分超过0.7。这些结果肯定了工作流异常分析中工作流级特征的选择。对于任务级分析，决策树分类器达到了>80%的准确率，而其他经过测试的分类器在大多数情况下可以达到>50%的准确率。我们相信这些有希望的结果可以为未来在生产环境中运行的科学工作流的异常检测和故障预测研究奠定基础。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Identifying Execution Anomalies for Data Intensive Workflows Using Lightweight ML Techniques

Today's computational science applications are increasingly dependent on many complex, data-intensive operations on distributed datasets that originate from a variety of scientific instruments and repositories. To manage this complexity, science workflows are created to automate the execution of these computational and data transfer tasks, which significantly improves scientific productivity. As the scale of workflows rapidly increases, detecting anomalous behaviors in workflow executions has become critical to ensure timely and accurate science products. In this paper, we present a set of lightweight machine learning-based techniques, including both supervised and unsupervised algorithms, to identify anomalous workflow behaviors. We perform anomaly analysis on both workflow-level and task-level datasets collected from real workflow executions on a distributed cloud testbed. Results show that the workflow-level analysis employing k-means clustering can accurately cluster anomalous, i.e. failure-prone and poorly performing workflows into statistically similar classes with a reasonable quality of clustering, achieving over 0.7 for Normalized Mutual Information and Completeness scores. These results affirm the selection of the workflow-level features for workflow anomaly analysis. For task-level analysis, the Decision Tree classifier achieves >80% accuracy, while other tested classifiers can achieve >50% accuracy in most cases. We believe that these promising results can be a foundation for future research on anomaly detection and failure prediction for scientific workflows running in production environments.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE High Performance Extreme Computing Conference (HPEC)

自引率

0.00%

发文量