The Multi-Processor Scheduling Problem in Phylogenetics

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum Pub Date : 2012-05-21 DOI:10.1109/IPDPSW.2012.86

Jiajie Zhang, A. Stamatakis

{"title":"The Multi-Processor Scheduling Problem in Phylogenetics","authors":"Jiajie Zhang, A. Stamatakis","doi":"10.1109/IPDPSW.2012.86","DOIUrl":null,"url":null,"abstract":"Advances in wet-lab sequencing techniques allow for sequencing between 100 genomes up to 1000 full transcriptomes of species whose evolutionary relationships shall be disentangled by means of phylogenetic analyses. Likelihood-based evolutionary models allow for partitioning such broad phylogenomic datasets, for instance into gene regions, for which likelihood model parameters (except for the tree itself) can be estimated independently. Present day phylogenomic datasets are typically split up into 1000-10,000 distinct partitions. While the likelihood on such datasets needs to be computed in parallel because of the high memory requirements, it has not yet been assessed how to optimally distribute partitions and/or alignment sites to processors, in particular when the number of cores is significantly smaller than the number of partitions. We find that, by distributing partitions (of varying lengths) monolithically to processors, the induced load distribution problem essentially corresponds to the well-known multiprocessor scheduling problem. By implementing the simple Longest Processing Time (LPT) heuristics in the PThreads and MPI version of RAxML-Light, we were able to accelerate run times by up to one order of magnitude. Other heuristics for multi-processor scheduling such as improved MultiFit, improved Zero-One, or the Three Phase approach did not yield notable performance improvements.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2012.86","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

Abstract

Advances in wet-lab sequencing techniques allow for sequencing between 100 genomes up to 1000 full transcriptomes of species whose evolutionary relationships shall be disentangled by means of phylogenetic analyses. Likelihood-based evolutionary models allow for partitioning such broad phylogenomic datasets, for instance into gene regions, for which likelihood model parameters (except for the tree itself) can be estimated independently. Present day phylogenomic datasets are typically split up into 1000-10,000 distinct partitions. While the likelihood on such datasets needs to be computed in parallel because of the high memory requirements, it has not yet been assessed how to optimally distribute partitions and/or alignment sites to processors, in particular when the number of cores is significantly smaller than the number of partitions. We find that, by distributing partitions (of varying lengths) monolithically to processors, the induced load distribution problem essentially corresponds to the well-known multiprocessor scheduling problem. By implementing the simple Longest Processing Time (LPT) heuristics in the PThreads and MPI version of RAxML-Light, we were able to accelerate run times by up to one order of magnitude. Other heuristics for multi-processor scheduling such as improved MultiFit, improved Zero-One, or the Three Phase approach did not yield notable performance improvements.

查看原文本刊更多论文

系统发育中的多处理器调度问题

湿实验室测序技术的进步允许对物种的100个基因组到1000个完整转录组进行测序，这些物种的进化关系将通过系统发育分析来解开。基于似然的进化模型允许将如此广泛的系统基因组数据集划分为基因区域，这样似然模型参数(除了树本身)可以独立估计。目前的系统基因组数据集通常被分成1000- 10000个不同的分区。虽然由于高内存需求，这些数据集上的可能性需要并行计算，但尚未评估如何最优地将分区和/或对齐位置分配给处理器，特别是当内核数量明显小于分区数量时。我们发现，通过将(不同长度的)分区单片地分配给处理器，诱导的负载分配问题本质上对应于众所周知的多处理器调度问题。通过在RAxML-Light的PThreads和MPI版本中实现简单的最长处理时间(LPT)启发式，我们能够将运行时间加快一个数量级。其他用于多处理器调度的启发式方法，如改进的MultiFit、改进的Zero-One或Three Phase方法，并没有产生显著的性能改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

自引率

0.00%

发文量