Pangeo Benchmarking Analysis: Object Storage vs. POSIX File System

2020 IEEE/ACM Fifth International Parallel Data Systems Workshop (PDSW). Pub Date: 2020-10-21. DOI: 10.1109/PDSW51947.2020.00012

Haiying Xu, Kevin Paul, Anderson Banihirwe


Citations: 0

Abstract

Pangeo is a community of scientists and software developers collaborating to enable interactive Big Data geoscience analysis in the public cloud and on high-performance computing (HPC) systems. At the core of the Pangeo software stack are (1) Xarray, which adds labels and metadata, in the form of dimensions, coordinates, and attributes, to raw array-oriented data; (2) Dask, which provides parallel computation and out-of-core memory capabilities; and (3) JupyterLab, which provides the web-based interactive environment for the Pangeo platform. Geoscientists now have a strong candidate software stack for analyzing large datasets, and they want to know how the Zarr and NetCDF4 data formats differ in performance on both traditional file storage systems and object storage. We have written a benchmarking suite for the Pangeo stack that measures the scalability and performance of both input/output (I/O) throughput and computation. We describe how we performed these benchmarks and analyzed the results, and we discuss the pros and cons of the Pangeo software stack in terms of I/O scalability on both cloud and HPC storage systems.
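The abstract does not show the benchmarking suite itself, so the following is only an illustrative sketch of the kind of read-throughput measurement it describes, not the authors' code. It uses only the Python standard library and compares reading one monolithic file against reading the same bytes split into many chunk files, a rough stand-in for a single-file format versus a chunked store. All function names, sizes, and the layout labels are assumptions made for illustration; a real comparison would use Xarray and Dask against actual Zarr and NetCDF4 data and would need to control for operating-system caching.

```python
import os
import tempfile
import time


def throughput_mb_s(num_bytes, seconds):
    """Return throughput in MB/s (1 MB = 1e6 bytes)."""
    # Guard against a zero interval on very fast reads.
    return num_bytes / 1e6 / max(seconds, 1e-9)


def write_single_file(directory, payload):
    """Store the payload as one file (hypothetical monolithic layout)."""
    path = os.path.join(directory, "data.bin")
    with open(path, "wb") as f:
        f.write(payload)
    return path


def write_chunked(directory, payload, chunk_size):
    """Store the payload as many chunk files (hypothetical chunked layout)."""
    paths = []
    for i, start in enumerate(range(0, len(payload), chunk_size)):
        path = os.path.join(directory, f"chunk.{i}")
        with open(path, "wb") as f:
            f.write(payload[start:start + chunk_size])
        paths.append(path)
    return paths


def timed_read(paths):
    """Read every file in `paths`; return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(f.read())
    return total, time.perf_counter() - start


def run_benchmark(size=4_000_000, chunk_size=500_000):
    """Time sequential reads of both layouts and report throughput for each."""
    payload = os.urandom(size)
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        single = write_single_file(tmp, payload)
        chunks = write_chunked(tmp, payload, chunk_size)
        for name, paths in (("single_file", [single]), ("chunked", chunks)):
            nbytes, seconds = timed_read(paths)
            results[name] = {
                "bytes": nbytes,
                "mb_per_s": throughput_mb_s(nbytes, seconds),
            }
    return results


if __name__ == "__main__":
    for layout, stats in run_benchmark().items():
        print(f"{layout}: {stats['mb_per_s']:.1f} MB/s over {stats['bytes']} bytes")
```

Because the files are read immediately after being written, the numbers mostly reflect the page cache rather than the storage system; the sketch shows the shape of the measurement (bytes read divided by elapsed time, per layout), not a meaningful result.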

