Parallel and Distributed Astronomical Data Analysis on Grid Datafarm

N. Yamamoto, O. Tatebe, S. Sekiguchi
Fifth IEEE/ACM International Workshop on Grid Computing, published 2004-11-08. DOI: 10.1109/GRID.2004.47 (https://doi.org/10.1109/GRID.2004.47)
Citations: 22

Abstract

A comprehensive study of the entire petabyte-scale archival data of astronomical observatories could yield new science and new knowledge in the field, but it has not been feasible so far for lack of an adequate data analysis environment. The Grid Datafarm architecture is designed for global petabyte-scale data-intensive computing. It provides a Grid file system with file replica management for fault tolerance and load balancing, together with parallel and distributed computing support for sets of files, to meet the requirements of such a comprehensive study. In this paper, we discuss worldwide parallel and distributed data analysis in observational astronomy. The archival data is stored, replicated, and dispersed in a Gfarm file system. All the astronomical data analysis tools access files in the Gfarm file system without any code modification, regardless of file replica locations, by using a syscall hooking library. Performance evaluation of the parallel data analysis in several configurations shows that file-affinity process scheduling plays an essential role in scalable and efficient parallel file I/O. A data calibration tool shows scalable file I/O performance, achieving 5.9 GB/sec and 4.0 GB/sec for reading and writing FITS files, respectively, using 30 cluster nodes (60 CPUs). On-demand file replica creation mitigates the overhead of access concentration. Another tool shows a sixfold performance improvement for reading a shared file by creating file replicas.
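The idea behind file-affinity process scheduling can be illustrated with a minimal sketch. This is not the Gfarm scheduler or its API. The function name, the replica map, and the node names below are hypothetical, and the least-loaded tie-breaking rule is an assumption made only for illustration. The sketch assigns each file's processing to a node that already stores a replica, so that reads stay local, and falls back to a remote read only when no replica holder is available.

```python
def schedule_by_file_affinity(replicas, nodes):
    """Assign each file to a node holding a replica, preferring the least-loaded one.

    replicas: dict mapping file name -> list of node names storing a replica
              (hypothetical layout, not Gfarm metadata).
    nodes:    list of all compute node names.
    Returns a dict mapping file name -> chosen node.
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for fname in sorted(replicas):
        # Prefer nodes that already hold a replica, so the read is local.
        candidates = [n for n in replicas[fname] if n in load]
        if not candidates:
            # No replica holder is available: fall back to any node (remote read).
            candidates = nodes
        # Pick the least-loaded candidate; break ties by node name.
        chosen = min(candidates, key=lambda n: (load[n], n))
        load[chosen] += 1
        assignment[fname] = chosen
    return assignment


# Hypothetical replica layout for four FITS files on three nodes.
replicas = {
    "a.fits": ["node1", "node2"],
    "b.fits": ["node1"],
    "c.fits": ["node2", "node3"],
    "d.fits": [],
}
nodes = ["node1", "node2", "node3"]
plan = schedule_by_file_affinity(replicas, nodes)
print(plan)
```

In this toy layout, every file with a replica is processed where a copy resides, and only `d.fits` requires a remote read. Creating additional replicas of a heavily shared file enlarges its candidate list, which is one way to picture how replica creation spreads concentrated access across nodes.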