Searching for millions of objects in the BOSS spectroscopic survey data with H5Boss
Jialin Liu, D. Bard, Q. Koziol, Stephen Bailey, Prabhat
2017 New York Scientific Data Summit (NYSDS), August 2017. DOI: 10.1109/NYSDS.2017.8085044 (https://doi.org/10.1109/NYSDS.2017.8085044)

Abstract: The Baryon Oscillation Spectroscopic Survey (BOSS), part of the Sloan Digital Sky Survey (SDSS), typically produces a single data file per observed object in the FITS format, which has been the default file format in this field of astronomy for many years. None of the FITS I/O libraries supports parallel I/O, making the format a poor fit for today's high-performance computing, and the issue grows more severe as the size of the data and the number of files keep increasing. In this paper, we introduce an alternative file format and build a parallel Python tool based on H5py. The resulting H5Boss library supports efficient file conversion, large-scale data query, and parallel I/O. Given the typical analytics pattern, we are able to scale H5Boss to queries over millions of objects with minimal I/O and communication overhead. This study presents a clear picture of BOSS data analytics and data management with an HPC-friendly file format, HDF5.
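The abstract describes consolidating many per-object FITS files into a single HDF5 container and then querying batches of objects through h5py. A minimal sketch of that idea is below; the plate/mjd/fiber group layout, dataset names, and array sizes are illustrative assumptions, not the actual H5Boss schema or API.

```python
import numpy as np
import h5py

# Hypothetical sketch: consolidate per-object spectra (one FITS file each in
# BOSS) into one HDF5 file keyed by plate/mjd/fiber, then query a batch of
# objects. Group layout and dataset names are assumptions for illustration.

def build_catalog(path, objects):
    """Write one HDF5 group per object, each holding a 'spectrum' dataset."""
    with h5py.File(path, "w") as f:
        for (plate, mjd, fiber), spectrum in objects.items():
            grp = f.create_group(f"{plate}/{mjd}/{fiber}")
            grp.create_dataset("spectrum", data=spectrum, compression="gzip")

def query(path, keys):
    """Fetch the spectra for a batch of (plate, mjd, fiber) keys."""
    with h5py.File(path, "r") as f:
        return {k: f[f"{k[0]}/{k[1]}/{k[2]}/spectrum"][...] for k in keys}

rng = np.random.default_rng(0)
objects = {(3586, 55181, fib): rng.normal(size=4640) for fib in range(1, 4)}
build_catalog("boss_subset.h5", objects)
spectra = query("boss_subset.h5", [(3586, 55181, 1), (3586, 55181, 2)])
```

In a parallel setting, each MPI rank would convert or query a disjoint subset of objects against the same file; the single-container layout is what removes the per-file open/close cost that dominates with millions of FITS files.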
Robust and scalable deep learning for X-ray synchrotron image analysis
Nicole Meister, Ziqiao Guan, Jinzhen Wang, Ronald Lashley, Jiliang Liu, Julien Lhermitte, K. Yager, Hong Qin, Bo Sun, Dantong Yu
2017 New York Scientific Data Summit (NYSDS), August 2017. DOI: 10.1109/NYSDS.2017.8085045 (https://doi.org/10.1109/NYSDS.2017.8085045)

Abstract: X-ray scattering is a key technique in modern synchrotron facilities for material analysis and discovery via structural characterization at the molecular scale and nano-scale. Image classification and tagging play a crucial role in recognizing patterns, inferring meaningful physical properties from samples, and guiding subsequent experimental steps. We designed deep-learning-based image classification pipelines and gained significant improvements in both accuracy and speed. Constrained by the available computing resources and optimization libraries, we must make trade-offs among computational efficiency, input image size and volume, and the flexibility and stability of processing images with different levels of quality and artifacts. Consequently, our deep-learning framework requires careful data preprocessing to down-sample images and extract true image signals. However, X-ray scattering images contain varying levels of noise, numerous gaps, rotations, and defects arising from detector limitations, sample (mis)alignment, and the experimental configuration. Traditional methods of healing X-ray scattering images make strong assumptions about these artifacts and require hand-crafted procedures and experiment metadata to de-noise, interpolate measured data to eliminate gaps, and rotate and translate images to align the centers of samples with the centers of images. These manual procedures are error-prone, experience-driven, and isolated from the intended image prediction, and are consequently not scalable to the data rate of X-ray images from modern detectors.

We aim to explore deep-learning-based image classification techniques that are robust and capable of leveraging high-definition experimental images with rich variations, even in a production environment that is not defect-free, and ultimately to automate labor-intensive data preprocessing tasks and integrate them seamlessly into our TensorFlow-based experimental data analysis framework.
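One preprocessing step the abstract singles out is down-sampling detector images that contain gaps from detector module boundaries. A minimal sketch of gap-aware block-mean down-sampling is below; the image shape, the NaN-as-gap masking convention, and the reduction factor are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

# Hypothetical sketch: block-mean down-sampling of a detector image that
# averages only valid (non-NaN) pixels, so inter-module gaps do not bias
# the reduced image handed to the classifier. The NaN-masking convention
# and sizes are assumptions for illustration.

def downsample_masked(image, factor):
    """Reduce a 2D image by `factor`, averaging valid pixels per block."""
    h, w = image.shape
    assert h % factor == 0 and w % factor == 0
    blocks = image.reshape(h // factor, factor, w // factor, factor)
    valid = ~np.isnan(blocks)
    total = np.where(valid, blocks, 0.0).sum(axis=(1, 3))
    count = valid.sum(axis=(1, 3))
    # Blocks that are entirely gap stay NaN rather than becoming fake signal.
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)

img = np.ones((256, 256))
img[:, 120:136] = np.nan           # simulated vertical detector gap
small = downsample_masked(img, 4)  # 64x64 reduced image; gap stays masked
```

Folding the masking into the reduction, rather than interpolating over gaps beforehand, avoids exactly the hand-crafted healing assumptions the abstract criticizes: no synthetic pixel values are invented, and the downstream model can learn to be robust to the remaining masked regions.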