Dynamic Task Shaping for High Throughput Data Analysis Applications in High Energy Physics

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Pub Date: 2022-05-01, DOI: 10.1109/ipdps53621.2022.00041

Benjamín Tovar, Ben Lyons, K. Mohrman, Barry Sly-Delgado, K. Lannon, D. Thain


Citations: 3

Abstract

Distributed data analysis frameworks are widely used for processing large datasets generated by instruments in scientific fields such as astronomy, genomics, and particle physics. Such frameworks partition petabyte-size datasets into chunks and execute many parallel tasks to search for common patterns, locate unusual signals, or compute aggregate properties. When well-configured, such frameworks make it easy to churn through large quantities of data on large clusters. However, configuring frameworks presents a challenge for end users, who must select a variety of parameters such as the blocking of the input data, the number of tasks, the resources allocated to each task, and the size of nodes on which they run. If poorly configured, the application may perform many orders of magnitude worse than optimal, or it may even fail to make progress at all. Even if a good configuration is found through painstaking observation, the performance may change drastically when the input data or analysis kernel changes. This paper considers the problem of automatically configuring a data analysis application for high energy physics (TopEFT) built upon standard frameworks for physics analysis (Coffea) and distributed tasking (Work Queue). We observe the inherent variability within the application, demonstrate the problems of poor configuration, and then develop several techniques for automatically sizing tasks to meet goals for resource consumption and overall application completion.
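
To make the idea of automatically sizing tasks concrete, the following is a minimal sketch of one possible feedback rule. It is not the method of the paper and does not use the Coffea or Work Queue APIs. Every name in it (`ChunkSizer`, `split`, `measure`, `target_memory_mb`) is hypothetical, and it assumes that peak memory grows roughly in proportion to the number of events in a task, ignoring any fixed per-task overhead.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkSizer:
    """Hypothetical helper: picks the next chunk size from past task measurements."""
    target_memory_mb: float      # memory budget each task should stay under
    chunk_size: int = 1000       # initial guess: events per task
    min_chunk: int = 1
    max_chunk: int = 1_000_000
    history: list = field(default_factory=list)  # (events, peak_memory_mb) pairs

    def record(self, events: int, peak_memory_mb: float) -> None:
        """Store one finished task's measurement and update the chunk size."""
        self.history.append((events, peak_memory_mb))
        # Assumption: memory scales linearly with events. Use the most
        # expensive per-event cost seen so far as a conservative estimate.
        worst_mb_per_event = max(mem / ev for ev, mem in self.history)
        proposed = int(self.target_memory_mb / worst_mb_per_event)
        # Clamp the proposal to the allowed range.
        self.chunk_size = max(self.min_chunk, min(self.max_chunk, proposed))


def split(total_events: int, sizer: ChunkSizer, measure) -> list:
    """Walk through total_events in chunks, resizing after each task.

    `measure(n)` stands in for running a task on n events and returning
    its observed peak memory in MB.
    """
    start = 0
    sizes = []
    while start < total_events:
        n = min(sizer.chunk_size, total_events - start)
        sizer.record(n, measure(n))
        sizes.append(n)
        start += n
    return sizes


if __name__ == "__main__":
    sizer = ChunkSizer(target_memory_mb=2000)
    # Simulated workload: each event costs 0.5 MB of peak memory.
    print(split(10_000, sizer, lambda n: 0.5 * n))
```

In this sketch the first task runs with the initial guess, and later tasks grow or shrink toward the memory budget as measurements arrive. A real system would also need to handle tasks that exceed their allocation and must be retried, which this sketch leaves out.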

