Robust RNA secondary structure prediction with a mixture of deep learning and physics-based experts.

Impact factor: 1.3; JCR: Q3 (Biochemical Research Methods)

Biology Methods and Protocols. Pub Date: 2025-01-06. eCollection Date: 2025-01-01. DOI: 10.1093/biomethods/bpae097

Xiangyun Qiu

Volume 10, issue 1, article bpae097. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11729747/pdf/

Citations: 0

Abstract



A mixture-of-experts (MoE) approach has been developed to mitigate the poor out-of-distribution (OOD) generalization of deep learning (DL) models for single-sequence-based prediction of RNA secondary structure. The main idea is to use DL models for in-distribution (ID) test sequences, leveraging their superior ID performance, while relying on physics-based models for OOD sequences to ensure robust predictions. One key ingredient of the pipeline, named MoEFold2D, is automated ID/OOD detection via consensus analysis of an ensemble of DL model predictions, which requires no access to training data during inference. Specifically, motivated by the clustered distribution of known RNA structures, a collection of distinct DL models is trained by iteratively leaving one cluster out. Each DL model hence serves as an expert on all but one cluster in the training data. Consequently, for an ID sequence, all but one of the DL models make accurate predictions consistent with one another, whereas an OOD sequence yields highly inconsistent predictions across all DL models. Through consensus analysis of DL predictions, test sequences are categorized as ID or OOD. ID sequences are subsequently predicted by averaging the DL models in consensus, and OOD sequences are predicted using physics-based models. Rather than remediating generalization gaps with alternative approaches such as transfer learning and sequence alignment, MoEFold2D circumvents unpredictable ID-OOD gaps and combines the strengths of DL and physics-based models to achieve accurate ID and robust OOD predictions.
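The routing logic described in the abstract can be illustrated with a small sketch. This is not MoEFold2D's actual code: the abstract does not say how agreement is measured or thresholded, so the F1 overlap between binarized base-pair maps, the `agree` and `pair_cutoff` values, and every function name below are assumptions made for illustration. The physics-based model is represented by a hypothetical callable, `physics_predict`. The sketch requires NumPy.

```python
import numpy as np


def leave_one_cluster_out(cluster_ids):
    """Yield (held_out_cluster, training_indices), one split per cluster."""
    cluster_ids = np.asarray(cluster_ids)
    for held_out in np.unique(cluster_ids):
        yield held_out, np.flatnonzero(cluster_ids != held_out)


def pairwise_f1(a, b):
    """F1 overlap of two binary base-pair matrices (1.0 if both are empty)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom


def route(prob_maps, physics_predict, sequence, agree=0.8, pair_cutoff=0.5):
    """Label one sequence as ID or OOD and return (label, pair_probabilities).

    prob_maps: list of LxL base-pair probability matrices, one per DL expert.
    physics_predict: hypothetical fallback standing in for a physics-based model.
    agree, pair_cutoff: assumed thresholds, not values from the paper.
    """
    n = len(prob_maps)
    if n < 3:
        raise ValueError("consensus analysis needs at least three experts")
    binary = [p >= pair_cutoff for p in prob_maps]
    agreement = np.array(
        [[pairwise_f1(binary[i], binary[j]) for j in range(n)] for i in range(n)]
    )
    # Number of *other* experts each expert agrees with (self-agreement removed).
    partners = (agreement >= agree).sum(axis=1) - 1
    # An expert is in consensus if it agrees with all others except at most one.
    in_consensus = partners >= n - 2
    # ID: at most one expert (the one that never saw this cluster) is left out.
    if in_consensus.sum() >= n - 1:
        members = [p for p, keep in zip(prob_maps, in_consensus) if keep]
        return "ID", np.mean(members, axis=0)
    return "OOD", physics_predict(sequence)
```

Counting agreeing partners, rather than averaging agreement scores, keeps a single dissenting expert from dragging the others below the threshold. That matters here because one expert is expected to disagree on every ID sequence by construction.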


Source journal

Biology Methods and Protocols (Agricultural and Biological Sciences: all)

CiteScore: 3.80

Self-citation rate: 2.80%

Review time: 19 weeks