Using Blosc2 NDim As A Fast Explorer Of The Milky Way (Or Any Other NDim Dataset)

Proceedings of the Python in Science Conference. Pub Date: 1900-01-01. DOI: 10.25080/gerudo-f2bc6f59-000

Project Blosc, Francesc Alted, Marta Iborra, Oscar Guiñón, David Ibáñez, S. Barrachina


Abstract

Large multidimensional datasets are widely used in engineering and scientific applications. Prompt access to subsets of these datasets is crucial for an efficient exploration experience. To facilitate this, we have added support for large multidimensional datasets to Blosc2, a compression and format library. The extension includes a special encoding of zeros that allows for efficient handling of sparse datasets. Additionally, the new two-level data partition used in Blosc2 reduces the need to decompress unnecessary data, further accelerating slicing. The Blosc2 NDim layer enables the creation and reading of n-dimensional datasets in an extremely efficient manner. This is due to a completely general n-dimensional two-level partitioning, which allows slicing and dicing of arbitrarily large (and compressed) data in a more fine-grained way. Having a second partition provides more flexibility to fit the different partitions into the different CPU cache levels, making compression even more efficient. Additionally, Blosc2 can make use of Btune, a library that automatically finds the optimal combination of compression parameters to suit user needs. Btune employs various techniques, such as a genetic algorithm and a neural network model, to discover the best parameters for a given dataset much more quickly than the traditional trial-and-error method, which can take hours or even days. As an example, we will demonstrate how Blosc2 NDim enables fast exploration of the Milky Way using the Gaia DR3 dataset.
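To make the benefit of the two-level partition concrete, the following is a minimal sketch in plain Python, not Blosc2's actual implementation or API. It counts how many partitions a rectangular slice overlaps and how much uncompressed data would have to be decoded if the smallest decodable unit were a first-level partition (a chunk) versus a second-level partition (a block). The function names, shapes, and item size are hypothetical, and the sketch assumes each chunk dimension is an exact multiple of the corresponding block dimension, so that blocks can be addressed on a single global grid.

```python
from itertools import product
from math import prod


def touched_partitions(start, stop, part_shape):
    """Grid coordinates of the partitions overlapped by the slice [start, stop).

    `start`, `stop` and `part_shape` are tuples with one entry per dimension.
    Hypothetical helper for illustration only.
    """
    ranges = []
    for lo, hi, size in zip(start, stop, part_shape):
        first = lo // size          # partition holding the first selected item
        last = (hi - 1) // size     # partition holding the last selected item
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


def bytes_to_decode(start, stop, part_shape, itemsize):
    """Uncompressed bytes decoded if whole partitions must be decompressed."""
    n_parts = len(touched_partitions(start, stop, part_shape))
    return n_parts * prod(part_shape) * itemsize


# Assumed example: a 2-D array of 4-byte items with 200x200 chunks,
# each split into 50x50 blocks; the slice is [210:230, 10:60].
start, stop = (210, 10), (230, 60)
chunk_shape, block_shape, itemsize = (200, 200), (50, 50), 4

print(bytes_to_decode(start, stop, chunk_shape, itemsize))  # 160000 (1 chunk)
print(bytes_to_decode(start, stop, block_shape, itemsize))  # 20000 (2 blocks)
```

In this assumed configuration the slice itself holds 20 x 50 x 4 = 4000 bytes. Decoding at block granularity touches 20000 bytes instead of the 160000 bytes of the enclosing chunk, which is the kind of saving the finer second partition is meant to provide.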
