ConvexML: Fast and accurate branch length estimation under irreversible mutation models, illustrated through applications to CRISPR/Cas9-based lineage tracing

IF 5.7 | Zone 1, Biology | Q1, Evolutionary Biology
Sebastian Prillo, Akshay Ravoor, Nir Yosef, Yun S Song
Systematic Biology. Published 2025-08-12. DOI: 10.1093/sysbio/syaf054 (https://doi.org/10.1093/sysbio/syaf054)

Abstract

Branch length estimation is a fundamental problem in Statistical Phylogenetics and a core component of tree reconstruction algorithms. Traditionally, general time-reversible mutation models are employed, and many software tools exist for this scenario. With the advent of CRISPR/Cas9 lineage tracing technologies, there has been significant interest in the study of branch length estimation under irreversible mutation models. Under the CRISPR/Cas9 mutation model, irreversible mutations – in the form of DNA insertions or deletions – are accrued during the experiment, which are then read out at the single-cell level to reconstruct the cell lineage tree. However, most of the analyses of CRISPR/Cas9 lineage tracing data have so far been limited to the reconstruction of single-cell tree topologies, which depict lineage relationships between cells, but not the amount of time that has passed between ancestral cell states and the present. Time-resolved trees, known as chronograms, would allow one to study the evolutionary dynamics of cell populations at an unprecedented level of resolution. Indeed, time-resolved trees would reveal the timing of events on the tree, the relative fitness of subclones, and the dynamics underlying phenotypic changes in the cell population – among other important applications. In this work, we introduce the first scalable and accurate method to refine any given single-cell tree topology into a single-cell chronogram by estimating its branch lengths. To do this, we perform regularized maximum likelihood estimation under a general irreversible mutation model, paired with a conservative version of maximum parsimony that reconstructs only the ancestral states that we are confident about. To deal with the particularities of CRISPR/Cas9 lineage tracing data – such as double-resection events which affect runs of consecutive sites – we avoid making our model more complex and instead opt for using a simple but effective data encoding scheme. 
Similarly, we avoid explicitly modeling the missing data mechanisms (such as heritable missing data) and instead assume that the data are missing completely at random. We stabilize estimates in low-information regimes by using a simple penalized version of maximum likelihood estimation (MLE) with a minimum branch length constraint and pseudocounts. All this leads to a convex MLE problem that can be readily solved in seconds with off-the-shelf convex optimization solvers. We benchmark our method using both simulations and real lineage tracing data, and show that it performs well on several tasks, matching or outperforming competing methods such as TiDeTree and LAML in terms of accuracy, while being 10 to 100× faster. Notably, our statistical model is simpler and more general, as we do not explicitly model the intricacies of CRISPR/Cas9 lineage tracing data. In this sense, our contribution is twofold: (1) a fast and robust method for branch length estimation under a general irreversible mutation model, and (2) a data encoding scheme specific to CRISPR/Cas9 lineage tracing data that makes such data amenable to the general model. Our branch length estimation method, which we call ‘ConvexML’, should be broadly applicable to any evolutionary model with irreversible mutations (ideally, with high diversity) and an approximately ignorable missing data mechanism. ‘ConvexML’ is available through the convexml open-source Python package.
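
To make the idea of a penalized, convex branch length MLE concrete, here is a minimal toy sketch. It is not the convexml package and does not use its API. It assumes a single "cherry" (a root branch of length t leading to an internal node with two leaves, total tree depth fixed at 1), a known per-site mutation rate of 1.0, and fully known ancestral states. The function names, the pseudocount value, and the example counts are hypothetical. Under an irreversible model, a site that is unmutated at the start of a branch stays unmutated with probability exp(-rate * t), so the log-likelihood is concave in t and a simple ternary search finds the optimum.

```python
import math


def branch_loglik(t, n_unmutated, n_mutated, rate=1.0):
    """Log-likelihood of one branch of length t under an irreversible model.

    Each site that is unmutated at the top of the branch stays unmutated with
    probability exp(-rate * t) and mutates with probability 1 - exp(-rate * t).
    Terms that do not depend on t (e.g. which indel arose) are dropped.
    """
    return -rate * t * n_unmutated + n_mutated * math.log1p(-math.exp(-rate * t))


def cherry_objective(t, root_counts, leaf_counts, pseudocount=0.5):
    """Penalized log-likelihood of a cherry whose internal node sits at time t.

    root_counts is (unmutated, mutated) on the root branch; leaf_counts holds
    one such pair per leaf branch, each of length 1 - t. The pseudocount adds
    the same fractional number of mutated and unmutated sites to every branch
    (a hypothetical regularizer chosen for this illustration).
    """
    u0, m0 = root_counts
    total = branch_loglik(t, u0 + pseudocount, m0 + pseudocount)
    for u, m in leaf_counts:
        total += branch_loglik(1.0 - t, u + pseudocount, m + pseudocount)
    return total


def fit_internal_time(root_counts, leaf_counts, min_branch=0.01,
                      pseudocount=0.5, iters=200):
    """Maximize the concave objective over t in [min_branch, 1 - min_branch]."""
    lo, hi = min_branch, 1.0 - min_branch
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        fa = cherry_objective(a, root_counts, leaf_counts, pseudocount)
        fb = cherry_objective(b, root_counts, leaf_counts, pseudocount)
        if fa < fb:
            lo = a  # the maximum cannot lie to the left of a
        else:
            hi = b  # the maximum cannot lie to the right of b
    return 0.5 * (lo + hi)


# Hypothetical data: 10 sites, 3 mutate on the root branch; of the 7 sites
# still unmutated, 2 mutate on the way to leaf A and 1 on the way to leaf B.
root_counts = (7, 3)
leaf_counts = [(5, 2), (6, 1)]
t_hat = fit_internal_time(root_counts, leaf_counts)
print(f"estimated internal node time: {t_hat:.3f}")
```

The minimum branch length keeps the search away from t = 0 and t = 1, where the log term diverges, and the pseudocounts make the objective strictly concave even when a branch has no observed mutations. On a full tree the same objective is summed over all branches and a general-purpose convex solver replaces the one-dimensional search.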
Source journal
Systematic Biology (Biology: Evolutionary Biology)
CiteScore: 13.00
Self-citation rate: 7.70%
Articles published: 70
Review time: 6-12 weeks
About the journal: Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.