A Case Study of Data Management Challenges Presented in Large-Scale Machine Learning Workflows

Claire Songhyun Lee, V. Hewes, G. Cerati, J. Kowalkowski, Adam Aurisano, Ankit Agrawal, Alok Ratan Choudhary, W. Liao
DOI: 10.1109/CCGrid57682.2023.00017
Published in: 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
Publication date: May 2023
Citations: 0

Abstract

Running scientific workflow applications on high-performance computing systems provides promising results in terms of accuracy and scalability. An example is the particle track reconstruction research in high-energy physics that consists of multiple machine-learning tasks. However, as the modern HPC system scales up, researchers spend more effort on coordinating the individual workflow tasks due to their increasing demands on computational power, large memory footprint, and data movement among various storage devices. These issues are further exacerbated when intermediate result data must be shared among different tasks and each is optimized to fulfill its own design goals, such as the shortest time or minimal memory footprint. In this paper, we investigate the data management challenges presented in scientific workflows. We observe that individual tasks, such as data generation, data curation, model training, and inference, often use data layouts only best for one's I/O performance but orthogonal to its successive tasks. We propose various solutions by employing alternative data structures and layouts in consideration of two tasks running consecutively in the workflow. Our experimental results show up to a 16.46x and 3.42x speedup for initialization time and I/O time respectively, compared to previous approaches.
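The abstract's central observation — that a layout optimal for one task's I/O can be orthogonal to the access pattern of the task that consumes its output — can be sketched with a small, hypothetical example. The producer/consumer split, the function names, and the use of NumPy below are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

# Hypothetical sketch: a producer task emits per-event feature vectors,
# and a later consumer task reads one feature across all events.
# If the producer stores the data event-major (rows = events), the
# consumer's per-feature reads are strided; storing the transpose
# (feature-major) makes those reads contiguous instead.

def producer_layout(events: np.ndarray, feature_major: bool) -> np.ndarray:
    """Choose the stored layout with the downstream reader in mind."""
    if feature_major:
        return np.ascontiguousarray(events.T)  # rows = features
    return np.ascontiguousarray(events)        # rows = events

def consumer_read_feature(stored: np.ndarray, feature_major: bool,
                          j: int) -> np.ndarray:
    """Read feature j for all events from the stored layout."""
    col = stored[j] if feature_major else stored[:, j]
    return np.ascontiguousarray(col)

events = np.arange(12.0).reshape(4, 3)  # 4 events x 3 features
for fm in (False, True):
    stored = producer_layout(events, feature_major=fm)
    f1 = consumer_read_feature(stored, feature_major=fm, j=1)
    # Both layouts yield the same values; only the access pattern
    # (contiguous vs. strided) differs.
    assert np.array_equal(f1, events[:, 1])
```

The point of the sketch is the one the abstract makes: neither layout is wrong in isolation, but choosing it with only the producer's I/O in mind penalizes the consumer, so the two consecutive tasks should be considered together.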