Jialin Liu, D. Bard, Q. Koziol, Stephen Bailey, Prabhat
Searching for millions of objects in the BOSS spectroscopic survey data with H5Boss
2017 New York Scientific Data Summit (NYSDS), published 2017-08-01. DOI: 10.1109/NYSDS.2017.8085044 (https://doi.org/10.1109/NYSDS.2017.8085044)
Citations: 11
Abstract
The Baryon Oscillation Spectroscopic Survey (BOSS), part of the Sloan Digital Sky Survey (SDSS), typically produces a single data file per observed object in the FITS format. FITS has been the default file format in this field of astronomy for many years. None of the FITS I/O libraries supports parallel I/O, which makes the format a poor fit for today's high-performance computing. The issue grows more severe as the size of the data and the number of files keep increasing. In this paper, we introduce an alternative file format and build a parallel Python tool based on h5py. The resulting H5Boss library supports efficient file conversion, large-scale data queries, and parallel I/O. Given the typical analytics pattern, we are able to scale H5Boss to queries of millions of objects with minimal I/O and communication overhead. This study presents a clear picture of BOSS data analytics and data management with an HPC-friendly file format, HDF5.