Inferring protein from transcript abundances using convolutional neural networks.

IF 6.1 | Region 3, Biology | Q1 | MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining. Pub Date: 2025-02-27. DOI: 10.1186/s13040-025-00434-z

Patrick Maximilian Schwehn, Pascal Falter-Braun

Biodata Mining, volume 18 (1), page 18. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11866710/pdf/

Citations: 0

Abstract


Inferring protein from transcript abundances using convolutional neural networks.


Background: Although transcript abundance is often used as a proxy for protein abundance, it is an unreliable predictor. As proteins execute biological functions and their expression levels influence phenotypic outcomes, we developed a convolutional neural network (CNN) to predict protein abundances from mRNA abundances, protein sequence, and mRNA sequence in Homo sapiens (H. sapiens) and the reference plant Arabidopsis thaliana (A. thaliana).
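
To make the idea of a sequence-based convolutional module concrete, the following minimal Python sketch one-hot encodes an mRNA string and slides a single filter along it. It is an illustration only and not the authors' architecture: the `NUCLEOTIDES` alphabet, the helper names, the toy sequence, and the hand-set "AUG" filter are all hypothetical, and a trained network would learn many such filters from data.

```python
# Illustrative sketch only: NOT the authors' model or code.
NUCLEOTIDES = "ACGU"  # assumed mRNA alphabet (hypothetical choice)


def one_hot(seq):
    """Encode a sequence as one 4-element 0/1 row per base."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]


def conv1d(encoded, kernel):
    """Slide one filter along the sequence (no padding, stride 1)."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(NUCLEOTIDES)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores


def global_max_pool(scores):
    """Keep only the strongest filter response over the whole sequence."""
    return max(scores)


# Hypothetical hand-set filter that responds most strongly to "AUG".
aug_filter = one_hot("AUG")
scores = conv1d(one_hot("GCAUGGC"), aug_filter)
best = global_max_pool(scores)
```

In this toy run the filter response peaks at the window containing "AUG". A filter of this kind acts as a motif detector, which is why the weights of a trained convolutional layer can be inspected for sequence motifs.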

Results: After hyperparameter optimization and initial data exploration, we implemented distinct training modules for value-based and sequence-based data. By analyzing the learned weights, we revealed common and organism-specific sequence features that influence protein-to-mRNA ratios (PTRs), including known and putative sequence motifs. Adding condition-specific protein interaction information identified genes correlated with many PTRs but did not improve predictions, likely due to insufficient data. The integrated model predicted protein abundance on unseen genes with a coefficient of determination (r²) of 0.30 in H. sapiens and 0.32 in A. thaliana.
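
The two quantities named above, the protein-to-mRNA ratio (PTR) and the coefficient of determination, can be sketched in a few lines of Python. This is a hedged illustration with hypothetical function names and made-up toy values. The abstract does not say whether abundances were log-transformed or exactly how r² was computed, so the log10 ratio and the 1 - SS_res/SS_tot definition used here are assumptions.

```python
import math


def log_ptr(protein, mrna):
    """log10 protein-to-mRNA ratio for one gene (assumes both inputs > 0)."""
    return math.log10(protein / mrna)


def r_squared(observed, predicted):
    """Coefficient of determination, defined here as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy values, not data from the paper.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, observed)       # predictions match exactly
baseline = r_squared(observed, [2.5] * 4)     # always predict the mean
ratio = log_ptr(1000.0, 10.0)                 # 100-fold more protein than mRNA
```

Under this definition a model that always predicts the mean scores 0 and a perfect model scores 1, so the reported values of 0.30 and 0.32 would correspond to explaining roughly a third of the variance in protein abundance on unseen genes.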

Conclusions: For H. sapiens, our model improves prediction performance by nearly 50% compared to previous sequence-based approaches, and for A. thaliana it represents the first model of its kind. The model's learned motifs recapitulate known regulatory elements, supporting its utility in systems-level and hypothesis-driven research approaches related to protein regulation.


Source journal: Biodata Mining (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 7.90

Self-citation rate: 0.00%

Articles published: not listed

Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:

- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.