Searching for millions of objects in the BOSS spectroscopic survey data with H5Boss

Jialin Liu, D. Bard, Q. Koziol, Stephen Bailey, Prabhat
Published in: 2017 New York Scientific Data Summit (NYSDS), 2017-08-01. DOI: 10.1109/NYSDS.2017.8085044 (https://doi.org/10.1109/NYSDS.2017.8085044)
Citations: 11

Abstract

The Baryon Oscillation Spectroscopic Survey (BOSS), part of the Sloan Digital Sky Survey (SDSS), typically produces a single data file per observed object in the FITS format. FITS has been a default file format in this field of astronomy for many years. None of the FITS I/O libraries support parallel I/O, so the format is a poor fit for today's high-performance computing (HPC). The issue becomes more severe as the size of the data and the number of files keep increasing. In this paper, we introduce an alternative file format and build a parallel Python tool based on H5py. The resulting H5Boss library supports efficient file conversion, large-scale data query, and parallel I/O. Given the typical analytics pattern, we are able to scale H5Boss to queries over millions of objects with minimal I/O and communication overhead. This study presents a clear picture of BOSS data analytics and data management with HDF5, an HPC-friendly file format.
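The abstract does not show how a large object query is split across processes, so the following is only a minimal sketch of one plausible planning step, not the H5Boss API. It assumes, hypothetically, that each object is identified by a (plate, mjd, fiber) triple and that the converted data is organized as one file per (plate, mjd). The function names `group_by_file` and `assign_to_ranks` are invented for illustration. The idea is to group the queried objects by file so each file is opened once, then hand whole files to ranks so no two ranks touch the same file and no data has to be exchanged between them.

```python
from collections import defaultdict


def group_by_file(queries):
    """Group (plate, mjd, fiber) queries by the (plate, mjd) file assumed to hold them.

    Duplicate requests for the same fiber are collapsed so each object is read once.
    """
    by_file = defaultdict(set)
    for plate, mjd, fiber in queries:
        by_file[(plate, mjd)].add(fiber)
    return {key: sorted(fibers) for key, fibers in by_file.items()}


def assign_to_ranks(by_file, n_ranks):
    """Deal whole files to ranks round-robin (hypothetical scheduling choice).

    Each rank gets a dict mapping (plate, mjd) to the fibers it should read.
    """
    plans = [dict() for _ in range(n_ranks)]
    for i, key in enumerate(sorted(by_file)):
        plans[i % n_ranks][key] = by_file[key]
    return plans


# Made-up identifiers, used only to exercise the sketch.
queries = [
    (3665, 55247, 12),
    (3665, 55247, 7),
    (4000, 55500, 1),
    (3665, 55247, 12),  # duplicate request
    (4100, 55600, 3),
]

by_file = group_by_file(queries)
plans = assign_to_ranks(by_file, n_ranks=2)
for rank, plan in enumerate(plans):
    print(rank, plan)
```

In a real parallel run, each rank would open only the files in its own plan (for example through h5py) and read the listed fibers. Round-robin assignment is the simplest choice here; balancing by the number of fibers per file would be a natural refinement when the query is skewed toward a few files.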