Evolutionary placement of short sequence reads on multi-core architectures

ACS/IEEE International Conference on Computer Systems and Applications - AICCSA 2010. Pub Date: 2010-05-16. DOI: 10.1109/AICCSA.2010.5586973

A. Stamatakis, Zsolt Komornik, S. Berger

Citations: 14

Abstract

The application of high-performance computing methods in bioinformatics is becoming increasingly important because of the masses of data generated by novel short-read DNA sequencers. One important application of such short reads is the analysis of microbial communities, where the anonymous short reads need to be identified by sequence comparison to a set of reference sequences. This identification is required to analyze the microbial composition and biological diversity of the sample. We briefly introduce a new algorithm for evolutionary (phylogenetic) placement of short reads under the Maximum Likelihood criterion and implement it in RAxML. While this algorithm is significantly more accurate than plain pair-wise sequence comparison, it can become highly compute-intensive when a typical number of 100,000 or more reads need to be placed into an existing phylogenetic tree. Therefore, we deploy multi-grain parallelism to improve the parallel efficiency of this algorithm on 16-core and 32-core architectures. Via this multi-grain approach, we achieve parallel execution time improvements of 25% and super-linear speedups on 16 cores, as well as near-linear speedups and improvements exceeding 50% on 32 cores, on two large real-world microbial datasets. Evolutionary placement of 100,000 reads into a tree with more than 4,000 taxa now requires less than 2 hours of execution time on 32 cores.
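The abstract does not spell out how the two grains of parallelism are organized, so the following is only a toy sketch of the general idea, not the RAxML implementation. It assumes that the coarse grain distributes independent reads across workers and that the fine grain splits the alignment columns of a single score computation into independent slices. The branch sequences, the function names, and the match-count score (standing in for a Maximum Likelihood evaluation) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: each "branch" of the reference tree is represented by
# one aligned sequence. A real ML placement would evaluate a likelihood on the
# tree for every candidate insertion branch instead.
BRANCHES = {
    "branch_a": "ACGTACGTAC",
    "branch_b": "ACGTTCGTAA",
    "branch_c": "TCGAACGGAC",
}


def score_columns(read, ref, start, stop):
    """Fine grain: score one slice of alignment columns ('-' is a gap in the read)."""
    return sum(
        1
        for r, b in zip(read[start:stop], ref[start:stop])
        if r != "-" and r == b
    )


def score(read, ref, chunks=2):
    """Sum the per-slice scores; slices are independent and could run in parallel."""
    n = len(ref)
    step = -(-n // chunks)  # ceiling division
    return sum(
        score_columns(read, ref, i, min(i + step, n)) for i in range(0, n, step)
    )


def place(read):
    """Return the name of the best-scoring branch for one read."""
    return max(BRANCHES, key=lambda name: score(read, BRANCHES[name]))


def place_all(reads, workers=4):
    """Coarse grain: reads are independent, so distribute them across workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(place, reads))


reads = ["ACGTAC----", "----TCGTAA", "TCGAA-----"]
placements = place_all(reads)
print(placements)
```

Threads keep the sketch free of dependencies; CPU-bound Python code would need processes or native code to obtain an actual speedup.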
