Multimodal AI/ML for discovering novel biomarkers and predicting disease using multi-omics profiles of patients with cardiovascular diseases

bioRxiv - Genomics | Pub Date: 2024-08-09 | DOI: 10.1101/2024.08.07.607041

William DeGroat, Habiba Abdelhalim, Elizabeth Peker, Neev Sheth, Rishabh Narayanan, Saman Zeeshan, Bruce T. Liang, Zeeshan Ahmed


Citations: 0

Abstract


Multimodal AI/ML for discovering novel biomarkers and predicting disease using multi-omics profiles of patients with cardiovascular diseases

Cardiovascular diseases (CVDs) are multifactorial and require personalized assessment and treatment. Advances in multi-omics technologies, namely RNA-seq and whole-genome sequencing, have given translational researchers a comprehensive view of the human genome; with these data, we can reveal novel biomarkers and segment patient populations by personalized risk factors. The limitations of these technologies in capturing disease complexity can be addressed with an integrated approach that characterizes variants alongside expression related to emerging phenotypes. The data analytics methodology we designed and implemented combines orthodox bioinformatics, classical statistics, and multimodal artificial intelligence and machine learning (AI/ML) techniques. Our approach has the potential to reveal the intricate mechanisms of CVD, which can facilitate patient-specific disease risk and response profiling. We sourced transcriptomic expression and variants from CVD and control subjects. By integrating these multi-omics datasets with clinical demographics, we generated patient-specific profiles. Using a robust feature selection approach, we report a signature of 27 transcripts and variants that efficiently predicts CVD. Differential expression analysis and minimum redundancy maximum relevance (mRMR) feature selection identified biomarkers that explain the disease phenotype. We used Combined Annotation Dependent Depletion (CADD) scores and allele frequencies to identify variants with pathogenic characteristics in CVD patients. Classification models trained on this signature predicted CVDs with high accuracy. Overall, an XGBoost model whose hyperparameters were tuned with Bayesian optimization performed best (AUC 1.0). Using SHapley Additive exPlanations (SHAP), we compiled patient-level risk assessments that can further contextualize these predictions in a clinical setting.

We discovered a 27-component signature that explains phenotypic differences between CVD patients and healthy controls, using a feature selection approach that prioritizes both biological relevance and efficiency in machine learning. Literature review revealed previous CVD associations for a majority of these diagnostic biomarkers. Classification models trained on this signature predicted CVD in patients with high accuracy. We propose a framework generalizable to other diseases and disorders.
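To make the minimum redundancy maximum relevance step more concrete, the following is a minimal sketch of greedy selection in pure Python. It is not the authors' implementation: the abstract does not say how relevance and redundancy were scored, so absolute Pearson correlation is used here as a stand-in for the mutual information of the classic formulation, and the feature names, values, and the function names `pearson` and `mrmr_select` are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def mrmr_select(features, labels, k):
    """Greedy selection: repeatedly add the feature with the best
    relevance-minus-mean-redundancy score.

    features: dict mapping feature name -> list of values (one per sample)
    labels:   list of 0/1 class labels (e.g. control / CVD)
    Assumption: |Pearson r| stands in for mutual information.
    """
    relevance = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    selected = []
    candidates = set(features)
    while candidates and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = mean(
                abs(pearson(features[name], features[s])) for s in selected
            )
            return relevance[name] - redundancy

        # sorted() makes tie-breaking deterministic
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: 6 samples, hypothetical feature names and values.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],       # strongly tracks the label
    "gene_a_copy": [1.1, 1.0, 1.2, 2.8, 3.3, 2.9],  # nearly redundant with gene_a
    "gene_b": [5.0, 1.0, 3.0, 4.0, 6.0, 5.5],       # weaker signal, less correlated with gene_a
}
selected = mrmr_select(features, labels, k=2)
print(selected)  # ['gene_a', 'gene_b']
```

On this toy data the near-duplicate `gene_a_copy` is skipped in favor of `gene_b`, even though it is individually more predictive. That is the behavior that lets a compact signature cover complementary signals rather than repeating one.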
