VTrans: A VAE-Based Pre-Trained Transformer Method for Microbiome Data Analysis.

IF 1.4 4区生物学 Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology Pub Date : 2025-04-28 DOI:10.1089/cmb.2024.0884

Xinyuan Shi, Fangfang Zhu, Wenwen Min

{"title":"VTrans: A VAE-Based Pre-Trained Transformer Method for Microbiome Data Analysis.","authors":"Xinyuan Shi, Fangfang Zhu, Wenwen Min","doi":"10.1089/cmb.2024.0884","DOIUrl":null,"url":null,"abstract":"<p><p>Predicting the survival outcomes and assessing the risk of patients play a pivotal role in comprehending the microbial composition across various stages of cancer. With the ongoing advancements in deep learning, it has been substantiated that deep learning holds the potential to analyze patient survival risks based on microbial data. However, confronting a common challenge in individual cancer datasets involves the limited sample size and the high dimensionality of the feature space. This predicament often leads to overfitting issues in deep learning models, hindering their ability to effectively extract profound data representations and resulting in suboptimal model performance. To overcome these challenges, we advocate the utilization of pretraining and fine-tuning strategies, which have proven effective in addressing the constraint of having a smaller sample size in individual cancer datasets. In this study, we propose a deep learning model that amalgamates Transformer encoder and variational autoencoder (VAE), VTrans, employing both pre-training and fine-tuning strategies to predict the survival risk of cancer patients using microbial data. Furthermore, we highlight the potential of extending VTrans to integrate microbial multi-omics data. Our method is assessed on three distinct cancer datasets from The Cancer Genome Atlas Program, and the research findings demonstrated that (1) VTrans excels in terms of performance compared to conventional machine learning and other deep learning models. (2) The utilization of pretraning significantly enhances its performance. (3) In contrast to positional encoding, employing VAE encoding proves to be more effective in enriching data representation. (4) Using the idea of saliency map, it is possible to observe which microbes have a high contribution to the classification results. These results demonstrate the effectiveness of VTrans in prediting patient survival risk. Source code and all datasets used in this paper are available at https://github.com/wenwenmin/VTrans and https://doi.org/10.5281/zenodo.14166580.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computational Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1089/cmb.2024.0884","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Predicting the survival outcomes and assessing the risk of patients play a pivotal role in comprehending the microbial composition across various stages of cancer. With the ongoing advancements in deep learning, it has been substantiated that deep learning holds the potential to analyze patient survival risks based on microbial data. However, confronting a common challenge in individual cancer datasets involves the limited sample size and the high dimensionality of the feature space. This predicament often leads to overfitting issues in deep learning models, hindering their ability to effectively extract profound data representations and resulting in suboptimal model performance. To overcome these challenges, we advocate the utilization of pretraining and fine-tuning strategies, which have proven effective in addressing the constraint of having a smaller sample size in individual cancer datasets. In this study, we propose a deep learning model that amalgamates Transformer encoder and variational autoencoder (VAE), VTrans, employing both pre-training and fine-tuning strategies to predict the survival risk of cancer patients using microbial data. Furthermore, we highlight the potential of extending VTrans to integrate microbial multi-omics data. Our method is assessed on three distinct cancer datasets from The Cancer Genome Atlas Program, and the research findings demonstrated that (1) VTrans excels in terms of performance compared to conventional machine learning and other deep learning models. (2) The utilization of pretraning significantly enhances its performance. (3) In contrast to positional encoding, employing VAE encoding proves to be more effective in enriching data representation. (4) Using the idea of saliency map, it is possible to observe which microbes have a high contribution to the classification results. These results demonstrate the effectiveness of VTrans in prediting patient survival risk. Source code and all datasets used in this paper are available at https://github.com/wenwenmin/VTrans and https://doi.org/10.5281/zenodo.14166580.

查看原文本刊更多论文

VTrans：一种基于vae的微生物组数据分析预训练变压器方法。

预测生存结果和评估患者风险在了解不同癌症阶段的微生物组成方面起着关键作用。随着深度学习的不断进步，已经证实深度学习具有基于微生物数据分析患者生存风险的潜力。然而，单个癌症数据集面临的一个共同挑战是有限的样本量和特征空间的高维。这种困境经常导致深度学习模型中的过拟合问题，阻碍了它们有效提取深度数据表示的能力，并导致模型性能次优。为了克服这些挑战，我们提倡使用预训练和微调策略，这些策略已被证明在解决单个癌症数据集中样本量较小的限制方面是有效的。在这项研究中，我们提出了一个深度学习模型，该模型结合了变压器编码器和变分自编码器（VAE）， VTrans，采用预训练和微调策略，利用微生物数据预测癌症患者的生存风险。此外，我们强调了扩展VTrans以整合微生物多组学数据的潜力。我们的方法在来自癌症基因组图谱计划的三个不同的癌症数据集上进行了评估，研究结果表明：(1)与传统机器学习和其他深度学习模型相比，VTrans在性能方面表现出色。(2)预训练的使用显著提高了其性能。(3)与位置编码相比，采用VAE编码在丰富数据表示方面更为有效。(4)利用显著性图的思想，可以观察到哪些微生物对分类结果的贡献较大。这些结果证明了VTrans在预测患者生存风险方面的有效性。本文中使用的源代码和所有数据集可在https://github.com/wenwenmin/VTrans和https://doi.org/10.5281/zenodo.14166580上获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Computational Biology 生物-计算机：跨学科应用

CiteScore

3.60

自引率

5.90%

发文量

113

审稿时长

6-12 weeks

期刊介绍： Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics. Journal of Computational Biology coverage includes: -Genomics -Mathematical modeling and simulation -Distributed and parallel biological computing -Designing biological databases -Pattern matching and pattern detection -Linking disparate databases and data -New tools for computational biology -Relational and object-oriented database technology for bioinformatics -Biological expert system design and use -Reasoning by analogy, hypothesis formation, and testing by machine -Management of biological databases