Reduction of Workflow Resource Consumption Using a Density-based Clustering Model

Qimin Zhang, Nathaniel Kremer-Herman, Benjamín Tovar, D. Thain
{"title":"Reduction of Workflow Resource Consumption Using a Density-based Clustering Model","authors":"Qimin Zhang, Nathaniel Kremer-Herman, Benjamín Tovar, D. Thain","doi":"10.1109/WORKS.2018.00006","DOIUrl":null,"url":null,"abstract":"Often times, a researcher running a scientific workflow will ask for orders of magnitude too few or too many resources to run their workflow. If the resource requisition is too small, the job may fail due to resource exhaustion; if it is too large, resources will be wasted though job may succeed. It would be ideal to achieve a near-optimal number of resources the workflow runs to ensure all jobs succeed and minimize resource waste. We present a strategy for solving the resource allocation problem: (1) resources consumed by each job are recorded by a resource monitor tool; (2) a density-based clustering model is proposed for discovering clusters in all jobs; (3) a maximal resource requisition is calculated as the ideal number of each cluster. We ran experiments with a synthetic workflow of homogeneous tasks as well as the bioinformatics tools Lifemapper, SHRIMP, BWA and BWA-GATK to capture the inherent nature of resource consumption of a workflow, the clustering allowed by the model, and its usefulness in real workflows. In Lifemapper, the least time saving, cores saving, memory saving, and disk saving are 13.82%, 16.62%, 49.15%, and 93.89%, respectively. In SHRIMP, BWA, and BWA-GATK, the least cores saving, memory saving and disk saving are 50%, 90.14%, and 51.82%, respectively. Compared with fixed resource allocation strategy, our approach provide a noticeable reduction of workflow resource consumption.","PeriodicalId":154317,"journal":{"name":"2018 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WORKS.2018.00006","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Oftentimes, a researcher running a scientific workflow will request orders of magnitude too few or too many resources for their workflow. If the resource requisition is too small, jobs may fail due to resource exhaustion; if it is too large, jobs may succeed but resources are wasted. Ideally, the workflow would run with a near-optimal resource allocation that ensures all jobs succeed while minimizing waste. We present a strategy for solving this resource allocation problem: (1) the resources consumed by each job are recorded by a resource monitor tool; (2) a density-based clustering model is proposed to discover clusters among all jobs; (3) a maximal resource requisition is calculated as the ideal allocation for each cluster. We ran experiments with a synthetic workflow of homogeneous tasks as well as the bioinformatics tools Lifemapper, SHRIMP, BWA, and BWA-GATK to capture the inherent nature of a workflow's resource consumption, the clustering allowed by the model, and its usefulness in real workflows. In Lifemapper, the smallest savings in time, cores, memory, and disk are 13.82%, 16.62%, 49.15%, and 93.89%, respectively. In SHRIMP, BWA, and BWA-GATK, the smallest savings in cores, memory, and disk are 50%, 90.14%, and 51.82%, respectively. Compared with a fixed resource allocation strategy, our approach provides a noticeable reduction in workflow resource consumption.
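The abstract does not specify the clustering model beyond "density-based," but the three-step strategy can be illustrated with a minimal sketch, assuming DBSCAN as the density-based method and made-up per-job measurements (cores, memory, disk) standing in for the resource monitor's output; the maximum within each cluster is then taken as that cluster's requisition.

```python
# Illustrative sketch only (not the authors' exact model): cluster per-job
# resource records with DBSCAN and take the per-cluster maximum as the
# resource requisition. The sample data and eps/min_samples values are
# assumptions for demonstration.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical resource-monitor output: one row per job
# columns: cores, memory (MB), disk (MB)
jobs = np.array([
    [1,  512,  1000],
    [1,  530,  1100],
    [4, 4096, 20000],
    [4, 4200, 21000],
    [1,  500,   950],
])

# Step 2: density-based clustering on standardized measurements
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(
    StandardScaler().fit_transform(jobs)
)

# Step 3: maximal requisition per cluster (noise points, label -1, excluded)
for cluster in sorted(set(labels) - {-1}):
    peak = jobs[labels == cluster].max(axis=0)
    print(f"cluster {cluster}: cores={peak[0]}, memory={peak[1]} MB, disk={peak[2]} MB")
```

The idea behind taking the per-cluster maximum is that every job in a cluster is guaranteed to fit its allocation, while jobs in small-footprint clusters are no longer sized to the workflow's global peak consumer.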