Optimization of Hadoop cluster for analyzing large-scale sequence data in bioinformatics

IF 0.3 Q4 MATHEMATICS
Ádám Tóth, Ramin Karimi
Annales Mathematicae et Informaticae, published 2019-01-01. DOI: 10.33039/AMI.2019.01.002 (https://doi.org/10.33039/AMI.2019.01.002)
Citations: 1

Abstract

The unexpected growth of high-throughput sequencing platforms in recent years has affected virtually all areas of modern biology. However, the ability to produce data continues to outpace the ability to analyze it. Continuous effort is therefore needed to improve bioinformatics applications so that these research opportunities can be put to better use. Because of the complexity and diversity of its data, metagenomics has been a particularly challenging field of bioinformatics. Sequence-based identification methods, such as those using DNA signatures (unique k-mers), are among the most recent popular methods for real-time analysis of raw sequencing data. DNA signature discovery is compute-intensive and time-consuming. Hadoop, a framework for parallel and distributed computing, is one of the popular tools for analyzing large-scale data in bioinformatics. The main goals of this paper are to optimize time consumption and the use of computational resources such as CPU and memory, along with the management of the Hadoop cluster nodes.
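To make the notion of a DNA signature (unique k-mer) concrete, the following is a minimal in-memory sketch written in a map/reduce style. It is not the authors' implementation and does not run on Hadoop. The function names (`map_kmers`, `reduce_signatures`, `find_signatures`) and the toy sequences are hypothetical, and the sketch assumes a signature is a k-mer that occurs in exactly one input sequence, ignoring reverse complements.

```python
from collections import defaultdict


def map_kmers(seq_id, sequence, k):
    """Map step (hypothetical): emit a (k-mer, seq_id) pair for each window of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], seq_id


def reduce_signatures(pairs):
    """Reduce step (hypothetical): keep only k-mers seen in exactly one sequence."""
    owners = defaultdict(set)
    for kmer, seq_id in pairs:
        owners[kmer].add(seq_id)

    signatures = defaultdict(set)
    for kmer, ids in owners.items():
        if len(ids) == 1:  # the k-mer is unique to a single sequence
            signatures[next(iter(ids))].add(kmer)
    return dict(signatures)


def find_signatures(genomes, k):
    """Run the map and reduce steps over a dict of {seq_id: sequence}."""
    pairs = (
        pair
        for seq_id, sequence in genomes.items()
        for pair in map_kmers(seq_id, sequence, k)
    )
    return reduce_signatures(pairs)


if __name__ == "__main__":
    # Toy input, assumed for illustration only.
    toy_genomes = {"a": "ACGT", "b": "CGTA"}
    print(find_signatures(toy_genomes, 3))
```

In a distributed setting the grouping by k-mer would be done by the framework's shuffle phase rather than by an in-memory dictionary, which is where the time, CPU, and memory costs mentioned in the abstract would arise.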