Prediction of human pathogenic start loss variants based on self-supervised contrastive learning.

IF 4.5 | Zone 1, Biology | JCR Q1, BIOLOGY

BMC Biology Pub Date : 2025-08-08 DOI:10.1186/s12915-025-02348-y

Jie Liu, Henghui Fan, Na Cheng, Yansen Su, Junfeng Xia

BMC Biology, volume 23(1), article 250. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12333246/pdf/

Citations: 0

Abstract

Background: Start loss variants are a class of genetic variants that affect the bases of the start codon, disrupting the normal translation initiation process and leading to protein deletions or the production of different proteins. Accurate assessment of the pathogenicity of these variants is crucial for deciphering disease mechanisms and integrating genomics into clinical practice. However, among the tens of thousands of start loss variants in the human genome, only about 1% have been classified as pathogenic or benign. Computational methods that rely solely on small amounts of labeled data often lack sufficient generalization capabilities, restricting their effectiveness in predicting the impact of start loss variants.
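To make the definition above concrete, here is a minimal sketch of how a single-nucleotide substitution could be flagged as a start loss variant. It is an illustration only, not part of StartCLR. It assumes the canonical ATG start codon, a coding sequence given on the coding strand, and 0-based positions. The function name `is_start_loss` is hypothetical.

```python
START_CODON = "ATG"  # assumption: only the canonical start codon is considered


def is_start_loss(cds: str, pos: int, alt: str) -> bool:
    """Return True if substituting `alt` at 0-based `pos` destroys the start codon.

    `cds` is a coding sequence on the coding strand that begins with ATG.
    Only single-nucleotide substitutions are handled in this sketch.
    """
    cds = cds.upper()
    alt = alt.upper()
    if not cds.startswith(START_CODON):
        raise ValueError("coding sequence must begin with ATG")
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single base: A, C, G, or T")
    if not 0 <= pos < len(cds):
        raise ValueError("pos is outside the coding sequence")
    if pos > 2:
        # The substitution lies downstream of the start codon.
        return False
    mutated = cds[:pos] + alt + cds[pos + 1:]
    return mutated[:3] != START_CODON


print(is_start_loss("ATGGCC", 2, "A"))  # ATG becomes ATA
print(is_start_loss("ATGGCC", 3, "A"))  # start codon untouched
```

Whether a flagged variant is pathogenic is a separate question, and that is the prediction task the abstract addresses.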

Results: Here, we introduce StartCLR, a novel prediction method specifically designed for identifying pathogenic start loss variants. StartCLR captures variant context information from different dimensions by integrating embedding features from diverse DNA language models. Moreover, it employs self-supervised pre-training combined with supervised fine-tuning, enabling the effective utilization of both a large amount of unlabeled data and a small amount of labeled data to enhance prediction accuracy. Our experimental results show that StartCLR exhibits strong generalization and superior prediction performance across different test sets. Notably, when trained exclusively on high-confidence labeled data, StartCLR retains or even improves the prediction accuracy despite the reduced amount of labeled data.
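The abstract does not specify how the embeddings are fused or which contrastive objective is used. The sketch below therefore rests on two labeled assumptions: embeddings from different DNA language models are fused by simple concatenation, and pre-training uses a SimCLR-style NT-Xent loss over two views of each unlabeled variant. The names `concat_embeddings` and `nt_xent_loss` are hypothetical, and the code should be read as an illustration of self-supervised contrastive learning in general, not as StartCLR's implementation. The supervised fine-tuning stage is not shown.

```python
import math
from typing import List, Sequence

Vector = List[float]


def concat_embeddings(per_model: Sequence[Vector]) -> Vector:
    """Fuse one variant's embeddings from several models by concatenation (assumption)."""
    fused: Vector = []
    for emb in per_model:
        fused.extend(emb)
    return fused


def _normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def nt_xent_loss(view_a: Sequence[Vector], view_b: Sequence[Vector],
                 temperature: float = 0.5) -> float:
    """NT-Xent loss: view_a[i] and view_b[i] are two views of the same variant.

    Each view is pulled toward its partner and pushed away from all other
    views in the batch. No labels are required.
    """
    if len(view_a) != len(view_b) or len(view_a) == 0:
        raise ValueError("views must be non-empty and of equal length")
    n = len(view_a)
    z = [_normalize(list(v)) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the partner view
        # Similarities to every other item; the partner is included here.
        logits = [_dot(z[i], z[j]) / temperature for j in range(2 * n) if j != i]
        pos_logit = _dot(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Two toy variants, each with two views; matching views should give a lower loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
print(nt_xent_loss(anchors, matched) < nt_xent_loss(anchors, swapped))
```

In a two-stage setup of the kind the abstract describes, an encoder trained with such a loss on the large unlabeled pool would then be fine-tuned on the small set of variants labeled pathogenic or benign.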

Conclusions: Collectively, these findings highlight the potential of integrating self-supervised contrastive learning with unlabeled data to mitigate the challenge posed by the scarcity of labeled start loss variants.


Source journal

BMC Biology (Biology)

CiteScore: 7.80

Self-citation rate: 1.90%

Articles published: 260

Review time: 3 months

Journal description: BMC Biology is a broad-scope journal covering all areas of biology. Our content includes research articles, new methods, and tools. BMC Biology also publishes reviews, Q&A, and commentaries.