Current genomic deep learning models display decreased performance in cell type-specific accessible regions

IF 10.1, Zone 1 (Biology), Q1 Biotechnology & Applied Microbiology

Genome Biology, Pub Date: 2024-08-01, DOI: 10.1186/s13059-024-03335-2

Pooja Kathail, Richard W. Shuai, Ryan Chung, Chun Jimmie Ye, Gabriel B. Loeb, Nilah M. Ioannidis

Citations: 0

Abstract

A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis-regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type-specific CREs contain a large proportion of complex disease heritability. We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general-purpose models trained across thousands of outputs (cell types and epigenetic marks) and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general-purpose models (Enformer and Sei), varies across the genome and is reduced in cell type-specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type-specific regulatory syntax, through single-task learning or high-capacity multi-task models, can improve performance in cell type-specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants. Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type-specific accessible regions. We also identify strategies to maximize performance in cell type-specific accessible regions.
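The stratified evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `stratified_pearson`, the specificity proxy (the fraction of cell types in which a region is accessible), the `threshold` value, and the equal-width binning are all assumptions made for illustration. The idea is simply to score predictions separately for regions grouped by how cell type-specific they are, rather than reporting a single genome-wide number.

```python
import numpy as np


def stratified_pearson(observed, predicted, n_bins=3, threshold=0.5):
    """Pearson r between observed and predicted accessibility, per specificity bin.

    observed, predicted: arrays of shape (n_regions, n_cell_types).
    Specificity proxy (an assumption, not the paper's definition): the fraction
    of cell types in which a region is accessible (observed > threshold).
    Bin 0 holds the most cell type-specific regions, the last bin the most
    broadly accessible ones.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if observed.shape != predicted.shape:
        raise ValueError("observed and predicted must have the same shape")

    # Fraction of cell types in which each region is called accessible.
    frac_open = (observed > threshold).mean(axis=1)

    # Equal-width bins over [0, 1]; interior edges only, right-inclusive.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(frac_open, edges[1:-1], right=True)

    results = {}
    for b in range(n_bins):
        mask = bin_idx == b
        x = observed[mask].ravel()
        y = predicted[mask].ravel()
        # Correlation is undefined for empty or constant inputs.
        if x.size < 2 or x.std() == 0 or y.std() == 0:
            results[b] = float("nan")
        else:
            results[b] = float(np.corrcoef(x, y)[0, 1])
    return results


if __name__ == "__main__":
    # Synthetic data only; real inputs would be measured and predicted signals.
    rng = np.random.default_rng(0)
    obs = rng.random((200, 8))
    pred = obs + rng.normal(scale=0.2, size=obs.shape)
    print(stratified_pearson(obs, pred))
```

Reporting one score per bin makes a drop in the most cell type-specific bin visible, where a single pooled correlation would be dominated by the far more numerous broadly accessible or inaccessible regions.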

Source journal

Genome Biology Biochemistry, Genetics and Molecular Biology-Genetics

CiteScore: 21.00

Self-citation rate: 3.30%

Articles published: 241

Review time: 2 months

About the journal: Genome Biology stands as a premier platform for exceptional research across all domains of biology and biomedicine, explored through a genomic and post-genomic lens. With an impact factor of 12.3 (2022), the journal is the 3rd-ranked research journal in the Genetics and Heredity category and the 2nd-ranked research journal in the Biotechnology and Applied Microbiology category by Thomson Reuters. Notably, Genome Biology is the highest-ranked open-access journal in this category. Our dedicated team of highly trained in-house Editors collaborates closely with our esteemed Editorial Board of international experts, ensuring the journal remains at the forefront of scientific advances and community standards. Regular engagement with researchers at conferences and institute visits underscores our commitment to staying abreast of the latest developments in the field.