F1ALA: ultrafast and memory-efficient ancestral lineage annotation applied to the huge SARS-CoV-2 phylogeny

IF 5.5 | Zone 2 (Medicine) | JCR Q1 (Virology)
Virus Evolution Pub Date : 2024-07-25 DOI:10.1093/ve/veae056
Yongtao Ye, Marcus H Shum, Isaac Wu, Carlos Chau, Ningqi Zhao, David K Smith, Joseph T Wu, Tommy T Lam

Abstract

The unprecedentedly large size of the global SARS-CoV-2 phylogeny makes any computation on the tree difficult. Lineage identification (e.g. the PANGO nomenclature for SARS-CoV-2) and assignment are key to tracking virus evolution. This requires assigning the clade root of each lineage to an unlabeled ancestral node in a phylogenetic tree, so that the descendant samples under each clade root can be inferred to belong to the corresponding lineage. This is the ancestral lineage annotation problem, for which matUtils (a package in pUShER) and PastML are commonly used methods. However, their computational tractability is a challenge, and their accuracy on huge SARS-CoV-2 phylogenies needs further exploration. We have developed an efficient and accurate method, called ‘F1ALA’, that uses the F1-score to evaluate the confidence with which a specific ancestral node can be annotated as the clade root of a lineage, given the lineage labels of a set of taxa in a rooted tree. Compared to these methods, F1ALA was roughly an order of magnitude faster while using ~12% of their memory when annotating 2,277 PANGO lineages in a phylogeny of 5.26 million taxa. F1ALA allows real-time lineage tracking to be performed on a laptop computer. F1ALA outperformed matUtils (pUShER) with statistical significance, and had accuracy comparable to PastML in tests on empirical and simulated data. F1ALA also enables tree refinement by pruning taxa whose labels are inconsistent with their closest annotation nodes and re-inserting them into the pruned tree, yielding a SARS-CoV-2 phylogeny with both a higher log-likelihood and a lower parsimony score. Given its ultrafast speed and high accuracy, we anticipate that F1ALA will also be useful for large phylogenies of other viruses. Code and benchmark datasets are publicly available at https://github.com/id-bioinfo/F1ALA.
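
The idea of scoring a candidate clade root with the F1-score can be illustrated with a minimal sketch. The abstract does not give the exact computation, so the definitions below are assumptions rather than the F1ALA implementation: precision is taken as the fraction of leaves under a node that carry the lineage label, recall as the fraction of all leaves with that label that fall under the node, and only internal nodes are considered as candidates. The function name `annotate_clade_roots` and the dictionary-based tree format are hypothetical.

```python
from collections import Counter


def annotate_clade_roots(children, root, leaf_label):
    """Pick, for each lineage, the internal node with the highest F1-score.

    children:   dict mapping each internal node to a list of its child nodes
    leaf_label: dict mapping each labeled leaf to its lineage label
    (Both input formats are assumptions made for this sketch.)
    """
    total = Counter(leaf_label.values())  # labeled leaves per lineage, whole tree
    counts = {}  # node -> (label counts below node, number of leaves below node)
    best = {}    # lineage -> (best node so far, its F1-score)

    # Iterative post-order traversal, so very deep trees do not hit
    # Python's recursion limit.
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not kids:  # leaf
            label = leaf_label.get(node)
            counts[node] = (Counter([label]) if label is not None else Counter(), 1)
        elif not expanded:
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
        else:
            merged, n_leaves = Counter(), 0
            for kid in kids:
                kid_counts, kid_leaves = counts.pop(kid)  # free child data early
                merged.update(kid_counts)
                n_leaves += kid_leaves
            counts[node] = (merged, n_leaves)
            for lineage, hits in merged.items():
                precision = hits / n_leaves
                recall = hits / total[lineage]
                f1 = 2 * precision * recall / (precision + recall)
                if f1 > best.get(lineage, (None, 0.0))[1]:
                    best[lineage] = (node, f1)
    return best


# Toy rooted tree: ((a, b)n1, (c, (d, e)n3)n2)root
children = {"root": ["n1", "n2"], "n1": ["a", "b"],
            "n2": ["c", "n3"], "n3": ["d", "e"]}
leaf_label = {"a": "A", "b": "A", "c": "B", "d": "B.1", "e": "B.1"}
result = annotate_clade_roots(children, "root", leaf_label)
```

On this toy tree, lineages `A` and `B.1` are assigned to `n1` and `n3`, each with an F1-score of 1. The parent lineage `B` is assigned to `n2` with a lower score, because this flat sketch counts the `B.1` leaves under `n2` as false positives; how nested lineages are actually handled is not stated in the abstract.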
Source journal: Virus Evolution (Immunology and Microbiology-Microbiology)
CiteScore: 10.50
Self-citation rate: 5.70%
Articles published: 108
Review time: 14 weeks
Journal description: Virus Evolution is a new Open Access journal focusing on the long-term evolution of viruses, viruses as a model system for studying evolutionary processes, viral molecular epidemiology and environmental virology. The aim of the journal is to provide a forum for original research papers, reviews and commentaries, and a venue for in-depth discussion of topics relevant to virus evolution.