Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems

2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW). Pub Date: 2019-11-01. DOI: 10.1109/PDSW49588.2019.00008

Bing Xie, Zilong Tan, P. Carns, J. Chase, K. Harms, J. Lofstead, S. Oral, Sudharshan S. Vazhkudai, Feiyi Wang


Citations: 10

Abstract

In high-performance computing (HPC), I/O performance prediction offers the potential to improve the efficiency of scientific computing. In particular, accurate prediction can make runtime estimates more precise, guide users toward optimal checkpoint strategies, and better inform facility provisioning and scheduling policies. HPC I/O performance is notoriously difficult to predict and model, however, in large part because of inherent variability and a lack of transparency in the behaviors of constituent storage system components. In this work we seek to advance the state of the art in HPC I/O performance prediction by (1) modeling the mean performance to address high variability, (2) deriving model features from write patterns, system architecture, and system configurations, and (3) employing a Lasso regression model to improve model accuracy. We demonstrate the efficacy of our approach by applying it to a crucial subset of common HPC I/O motifs, namely, file-per-process checkpoint write workloads. We conduct experiments on two distinct production HPC platforms (Titan at the Oak Ridge Leadership Computing Facility and Cetus at the Argonne Leadership Computing Facility) to train and evaluate our models. We find that we can attain ≤ 30% relative error for 92.79% and 99.64% of the samples in our test set on Titan and Cetus, respectively.
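The three ingredients named in the abstract (a mean-performance target, a feature vector, and Lasso regression scored by relative error) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the four feature columns are hypothetical stand-ins for write-pattern and configuration features, the penalty `lam=0.05` is an arbitrary choice, and the coordinate-descent solver is a generic textbook method chosen only to keep the example dependency-free.

```python
import random

def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def fit_lasso(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - b0 - Xw||^2 + lam*||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center the data so the intercept can be recovered after fitting.
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]  # residual for w = 0
    z = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue
            # Correlate column j with the residual that excludes w[j]'s own contribution.
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / z[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                w[j] = new_w
    b0 = y_mean - sum(w[j] * col_mean[j] for j in range(p))
    return b0, w

def predict(b0, w, row):
    return b0 + sum(wj * xj for wj, xj in zip(w, row))

def fraction_within(preds, actuals, tol=0.30):
    """Share of samples whose relative error |pred - actual| / actual is <= tol."""
    hits = sum(1 for p, a in zip(preds, actuals) if abs(p - a) / a <= tol)
    return hits / len(actuals)

# Synthetic data (an assumption for illustration, not measurements from Titan or Cetus).
random.seed(0)
X, y = [], []
for _ in range(200):
    row = [random.uniform(0.0, 1.0) for _ in range(4)]  # hypothetical features
    true_mean = 10.0 + 5.0 * row[0] + 3.0 * row[1]       # columns 2 and 3 are irrelevant
    runs = [true_mean + random.gauss(0.0, 1.0) for _ in range(5)]
    X.append(row)
    y.append(sum(runs) / len(runs))  # target is the mean over repeated runs

X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]
b0, w = fit_lasso(X_train, y_train, lam=0.05)
preds = [predict(b0, w, row) for row in X_test]
share = fraction_within(preds, y_test, tol=0.30)
print("coefficients:", [round(v, 3) for v in w])
print("share of test samples within 30% relative error:", share)
```

Averaging repeated runs before fitting reduces the noise in the target, and the L1 penalty shrinks the coefficients of uninformative features toward zero; the final metric mirrors the form of the accuracy figure the abstract reports (share of test samples at or below 30% relative error).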

