A Distributed-Processing System for Accelerating Biological Research Using Data-Staging

Y. Kido, S. Seno, S. Date, Y. Takenaka, H. Matsuda
IPSJ Digital Courier, published 2008-03-15. DOI: 10.2197/IPSJDC.4.250 (https://doi.org/10.2197/IPSJDC.4.250)

Abstract

The number of biological databases has been increasing rapidly as a result of progress in biotechnology. As the amount and heterogeneity of biological data increase, it becomes more difficult to manage the data in a few centralized databases. Moreover, the number of sites storing these databases is growing, and their geographic distribution is widening. In addition, biological research tends to require a large amount of computational resources, i.e., a large number of computing nodes, and this computational demand has been increasing with the rapid progress of biological research. Methods that enable computing nodes to use such widely distributed database sites effectively are therefore needed. In this paper, we propose a method for providing data from the database sites to computing nodes. Since it is difficult to decide in advance which program will run on which node and which data it will request as input, we have introduced the notion of “data-staging” in the proposed method. Data-staging dynamically searches the database sites for the input data and transfers the data to the node where the program runs. We have developed a prototype system with data-staging using grid middleware. The effectiveness of the prototype system is demonstrated by measuring the execution time of a similarity search of several hundred gene sequences against data for 527 prokaryotic genomes.
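
To make the data-staging idea concrete, the following is a minimal Python sketch, not the authors' prototype. It assumes an in-memory model in which each database site is a dictionary of named datasets; the names `DatabaseSite`, `ComputeNode`, `find_site`, and `stage_inputs`, as well as the site and dataset identifiers, are hypothetical and chosen only for illustration. The abstract does not describe the grid middleware's interface, so none of it is modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatabaseSite:
    """A remote site holding named datasets (hypothetical in-memory model)."""
    name: str
    datasets: dict[str, bytes] = field(default_factory=dict)


@dataclass
class ComputeNode:
    """A computing node with local storage for staged input data."""
    name: str
    local_store: dict[str, bytes] = field(default_factory=dict)


def find_site(dataset_id: str, sites: list[DatabaseSite]) -> DatabaseSite:
    """Search the known sites at run time for one that holds the dataset."""
    for site in sites:
        if dataset_id in site.datasets:
            return site
    raise LookupError(f"no site holds dataset {dataset_id!r}")


def stage_inputs(node: ComputeNode, dataset_ids: list[str],
                 sites: list[DatabaseSite]) -> list[str]:
    """Copy each requested dataset to the node; return the source site names."""
    sources = []
    for dataset_id in dataset_ids:
        site = find_site(dataset_id, sites)
        # The "transfer" is a plain copy here; a real system would move
        # the data over the network to the node where the program runs.
        node.local_store[dataset_id] = site.datasets[dataset_id]
        sources.append(site.name)
    return sources


# Usage: the program's inputs are only known when it is scheduled on a node.
sites = [
    DatabaseSite("site-1", {"genome_A": b"ACGT"}),
    DatabaseSite("site-2", {"genome_B": b"TTGA"}),
]
node = ComputeNode("node-01")
sources = stage_inputs(node, ["genome_B", "genome_A"], sites)
print(sources)
```

The point of the sketch is the ordering: the lookup happens only once the program and its inputs are known, so no dataset has to be placed on a node ahead of time.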