BioData Mining: latest articles

Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-02-15 DOI: 10.1186/s13040-025-00430-3
Christel Sirocchi, Martin Urschler, Bastian Pfeifer
{"title":"Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping.","authors":"Christel Sirocchi, Martin Urschler, Bastian Pfeifer","doi":"10.1186/s13040-025-00430-3","DOIUrl":"10.1186/s13040-025-00430-3","url":null,"abstract":"<p><p>Explainable and interpretable machine learning has emerged as essential in leveraging artificial intelligence within high-stakes domains such as healthcare to ensure transparency and trustworthiness. Feature importance analysis plays a crucial role in improving model interpretability by pinpointing the most relevant input features, particularly in disease subtyping applications, aimed at stratifying patients based on a small set of signature genes and biomarkers. While clustering methods, including unsupervised random forests, have demonstrated good performance, approaches for evaluating feature contributions in an unsupervised regime are notably scarce. To address this gap, we introduce a novel methodology to enhance the interpretability of unsupervised random forests by elucidating feature contributions through the construction of feature graphs, both over the entire dataset and individual clusters, that leverage parent-child node splits within the trees. Feature selection strategies to derive effective feature combinations from these graphs are presented and extensively evaluated on synthetic and benchmark datasets against state-of-the-art methods, standing out for performance, computational efficiency, reliability, versatility and ability to provide cluster-specific insights. In a disease subtyping application, clustering kidney cancer gene expression data over a feature subset selected with our approach reveals three patient groups with different survival outcomes. 
Cluster-specific analysis identifies distinctive feature contributions and interactions, essential for devising targeted interventions, conducting personalised risk assessments, and enhancing our understanding of the underlying molecular complexities.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"15"},"PeriodicalIF":4.0,"publicationDate":"2025-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11829558/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143426202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
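The feature-graph idea described in the abstract above (graphs built from parent-child node splits within the trees, then analysed for centrality) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy tree representation, the feature names, the use of raw co-occurrence counts as edge weights, and weighted degree as the centrality measure. The paper may define the weights and centrality differently.

```python
from collections import Counter

# Hypothetical toy input: each tree is a list of (parent_feature, child_feature)
# pairs, one pair per parent-child split. Names and values are illustrative only.
trees = [
    [("g1", "g2"), ("g1", "g3"), ("g2", "g3")],
    [("g2", "g1"), ("g1", "g3")],
    [("g1", "g2"), ("g2", "g4")],
]

def build_feature_graph(trees):
    """Count how often each parent feature is directly followed by each child feature."""
    edges = Counter()
    for tree in trees:
        for parent, child in tree:
            edges[(parent, child)] += 1
    return edges

def weighted_degree(edges):
    """Sum the weights of all edges touching each feature (incoming plus outgoing)."""
    degree = Counter()
    for (parent, child), weight in edges.items():
        degree[parent] += weight
        degree[child] += weight
    return degree

edges = build_feature_graph(trees)
centrality = weighted_degree(edges)
# Rank features by centrality, breaking ties alphabetically.
ranked = sorted(centrality, key=lambda f: (-centrality[f], f))
```

Under these assumptions, the highest-ranked features would be candidates for a reduced feature subset, and the heaviest edges would point to candidate feature interactions.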
Agenda setting for health equity assessment through the lenses of social determinants of health using machine learning approach: a framework and preliminary pilot study.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-02-10 DOI: 10.1186/s13040-025-00428-x
Maryam Ramezani, Mohammadreza Mobinizadeh, Ahad Bakhtiari, Hamid R Rabiee, Maryam Ramezani, Hakimeh Mostafavi, Alireza Olyaeemanesh, Ali Akbar Fazaeli, Alireza Atashi, Saharnaz Sazgarnejad, Efat Mohamadi, Amirhossein Takian
{"title":"Agenda setting for health equity assessment through the lenses of social determinants of health using machine learning approach: a framework and preliminary pilot study.","authors":"Maryam Ramezani, Mohammadreza Mobinizadeh, Ahad Bakhtiari, Hamid R Rabiee, Maryam Ramezani, Hakimeh Mostafavi, Alireza Olyaeemanesh, Ali Akbar Fazaeli, Alireza Atashi, Saharnaz Sazgarnejad, Efat Mohamadi, Amirhossein Takian","doi":"10.1186/s13040-025-00428-x","DOIUrl":"10.1186/s13040-025-00428-x","url":null,"abstract":"<p><strong>Introduction: </strong>The integration of Artificial Intelligence (AI) and Machine Learning (ML) is transforming public health by enhancing the assessment and mitigation of health inequities. As the use of AI tools, especially ML techniques, rises, they play a pivotal role in informing policies that promote a more equitable society. This study aims to develop a framework utilizing ML to analyze health system data and set agendas for health equity interventions, focusing on social determinants of health (SDH).</p><p><strong>Method: </strong>This study utilized the CRISP-ML(Q) model to introduce a platform for health equity assessment, facilitating its design and implementation in health systems. Initially, a conceptual model was developed through a comprehensive literature review and document analysis. A pilot implementation was conducted to test the feasibility and effectiveness of using ML algorithms in assessing health equity. Life expectancy was chosen as the health outcome for this pilot; data from 2000 to 2020 with 140 features was cleaned, transformed, and prepared for modeling. Multiple ML models were developed and evaluated using SPSS Modeler software version 18.0.</p><p><strong>Results: </strong>ML algorithms effectively identified key SDH influencing life expectancy. 
Among algorithms, the Linear Discriminant algorithm as classification model was selected as the best model due to its high accuracy in both testing and training phases, its strong performance in identifying key features, and its good generalizability to new data. Additionally, CHAID in numeric models was the best for predicting the actual value of life expectancy based on various features. These models highlighted the importance of features like current health expenditure, domestic general government health expenditure, and GDP in predicting life expectancy.</p><p><strong>Conclusion: </strong>The findings underscore the significance of employing innovative methods like CRISP-ML(Q) and ML algorithms to enhance health equity. Integrating this platform into health systems can help countries better prioritize and address health inequities. The pilot implementation demonstrated these methods' practical applicability and effectiveness, aiding policymakers in making informed decisions to improve health equity.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"14"},"PeriodicalIF":4.0,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11808983/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143392203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Immune cell profiles and predictive modeling in osteoporotic vertebral fractures using XGBoost machine learning algorithms.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-02-04 DOI: 10.1186/s13040-025-00427-y
Yi-Chou Chen, Hui-Chen Su, Shih-Ming Huang, Ching-Hsiao Yu, Jen-Huei Chang, Yi-Lin Chiu
{"title":"Immune cell profiles and predictive modeling in osteoporotic vertebral fractures using XGBoost machine learning algorithms.","authors":"Yi-Chou Chen, Hui-Chen Su, Shih-Ming Huang, Ching-Hsiao Yu, Jen-Huei Chang, Yi-Lin Chiu","doi":"10.1186/s13040-025-00427-y","DOIUrl":"10.1186/s13040-025-00427-y","url":null,"abstract":"<p><strong>Background: </strong>Osteoporosis significantly increases the risk of vertebral fractures, particularly among postmenopausal women, decreasing their quality of life. These fractures, often undiagnosed, can lead to severe health consequences and are influenced by bone mineral density and abnormal loads. Management strategies range from non-surgical interventions to surgical treatments. Moreover, the interaction between immune cells and bone cells plays a crucial role in bone repair processes, highlighting the importance of osteoimmunology in understanding and treating bone pathologies.</p><p><strong>Methods: </strong>This study aims to investigate the xCell signature-based immune cell profiles in osteoporotic patients with and without vertebral fractures, utilizing advanced predictive modeling through the XGBoost algorithm.</p><p><strong>Results: </strong>Our findings reveal an increased presence of CD4 + naïve T cells and central memory T cells in VF patients, indicating distinct adaptive immune responses. The XGBoost model identified Th1 cells, CD4 memory T cells, and hematopoietic stem cells as key predictors of VF. Notably, VF patients exhibited a reduction in Th1 cells and an enrichment of Th17 cells, which promote osteoclastogenesis and bone resorption. Gene expression analysis further highlighted an upregulation of osteoclast-related genes and a downregulation of osteoblast-related genes in VF patients, emphasizing the disrupted balance between bone formation and resorption. 
These findings underscore the critical role of immune cells in the pathogenesis of osteoporotic fractures and highlight the potential of XGBoost in identifying key biomarkers and therapeutic targets for mitigating fracture risk in osteoporotic patients.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"13"},"PeriodicalIF":4.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11792337/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143191123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-02-03 DOI: 10.1186/s13040-024-00415-8
Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A AlQahtani, Nijad Ahmad
{"title":"XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites.","authors":"Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A AlQahtani, Nijad Ahmad","doi":"10.1186/s13040-024-00415-8","DOIUrl":"10.1186/s13040-024-00415-8","url":null,"abstract":"<p><p>Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, significantly affecting gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical role in protein function. Identifying sumoylation sites is particularly important due to their links to Parkinson's and Alzheimer's. This study introduces XGBoost-Sumo, a robust model to predict sumoylation sites by integrating protein structure and sequence data. The model utilizes a transformer-based attention mechanism to encode peptides and extract evolutionary features through the PsePSSM-DWT approach. By fusing word embeddings with evolutionary descriptors, it applies the SHapley Additive exPlanations (SHAP) algorithm for optimal feature selection and uses eXtreme Gradient Boosting (XGBoost) for classification. XGBoost-Sumo achieved an impressive accuracy of 99.68% on benchmark datasets using 10-fold cross-validation and 96.08% on independent samples. This marks a significant improvement, outperforming existing models by 10.31% on training data and 2.74% on independent tests. 
The model's reliability and high performance make it a valuable resource for researchers, with strong potential for applications in pharmaceutical development.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"12"},"PeriodicalIF":4.0,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11792219/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143123566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MiCML: a causal machine learning cloud platform for the analysis of treatment effects using microbiome profiles.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-01-30 DOI: 10.1186/s13040-025-00422-3
Hyunwook Koh, Jihun Kim, Hyojung Jang
{"title":"MiCML: a causal machine learning cloud platform for the analysis of treatment effects using microbiome profiles.","authors":"Hyunwook Koh, Jihun Kim, Hyojung Jang","doi":"10.1186/s13040-025-00422-3","DOIUrl":"10.1186/s13040-025-00422-3","url":null,"abstract":"<p><strong>Background: </strong>The treatment effects are heterogenous across patients due to the differences in their microbiomes, which in turn implies that we can enhance the treatment effect by manipulating the patient's microbiome profile. Then, the coadministration of microbiome-based dietary supplements/therapeutics along with the primary treatment has been the subject of intensive investigation. However, for this, we first need to comprehend which microbes help (or prevent) the treatment to cure the patient's disease.</p><p><strong>Results: </strong>In this paper, we introduce a cloud platform, named microbiome causal machine learning (MiCML), for the analysis of treatment effects using microbiome profiles on user-friendly web environments. MiCML is in particular unique with the up-to-date features of (i) batch effect correction to mitigate systematic variation in collective large-scale microbiome data due to the differences in their underlying batches, and (ii) causal machine learning to estimate treatment effects with consistency and then discern microbial taxa that enhance (or lower) the efficacy of the primary treatment. We also stress that MiCML can handle the data from either randomized controlled trials or observational studies.</p><p><strong>Conclusion: </strong>We describe MiCML as a useful analytic tool for microbiome-based personalized medicine. MiCML is freely available on our web server ( http://micml.micloud.kr ). 
MiCML can also be implemented locally on the user's computer through our GitHub repository ( https://github.com/hk1785/micml ).</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"10"},"PeriodicalIF":4.0,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11783787/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143068960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A deep learning approach for classifying and predicting children's nutritional status in Ethiopia using LSTM-FC neural networks.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-01-30 DOI: 10.1186/s13040-025-00425-0
Getnet Bogale Begashaw, Temesgen Zewotir, Haile Mekonnen Fenta
{"title":"A deep learning approach for classifying and predicting children's nutritional status in Ethiopia using LSTM-FC neural networks.","authors":"Getnet Bogale Begashaw, Temesgen Zewotir, Haile Mekonnen Fenta","doi":"10.1186/s13040-025-00425-0","DOIUrl":"10.1186/s13040-025-00425-0","url":null,"abstract":"<p><strong>Background: </strong>This study employs an LSTM-FC neural network to address the critical public health issue of child undernutrition in Ethiopia. By employing this method, the study aims to classify children's nutritional status and predict transitions between different undernutrition states over time. This analysis is based on longitudinal data extracted from the Young Lives cohort study, which tracked 1,997 Ethiopian children across five survey rounds conducted from 2002 to 2016. This paper applies rigorous data preprocessing, including handling missing values, normalization, and balancing, to ensure optimal model performance. Feature selection was performed using SHapley Additive exPlanations to identify key factors influencing nutritional status predictions. Hyperparameter tuning was thoroughly applied during model training to optimize performance. Furthermore, this paper compares the performance of LSTM-FC with existing baseline models to demonstrate its superiority. We used Python's TensorFlow and Keras libraries on a GPU-equipped system for model training.</p><p><strong>Results: </strong>LSTM-FC demonstrated superior predictive accuracy and long-term forecasting compared to baseline models for assessing child nutritional status. The classification and prediction performance of the model showed high accuracy rates above 93%, with perfect predictions for Normal (N) and Stunted & Wasted (SW) categories, minimal errors in most other nutritional statuses, and slight over- or underestimations in a few instances. 
The LSTM-FC model demonstrates strong generalization performance across multiple folds, with high recall and consistent F1-scores, indicating its robustness in predicting nutritional status. We analyzed the prevalence of children's nutritional status during their transition from late adolescence to early adulthood. The results show a notable decline in normal nutritional status among males, decreasing from 58.3% at age 5 to 33.5% by age 25. At the same time, the risk of severe undernutrition, including conditions of being underweight, stunted, and wasted (USW), increased from 1.3% to 9.4%.</p><p><strong>Conclusions: </strong>The LSTM-FC model outperforms baseline methods in classifying and predicting Ethiopian children's nutritional statuses. The findings reveal a critical rise in undernutrition, emphasizing the need for urgent public health interventions.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"11"},"PeriodicalIF":4.0,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11783927/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143068942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A generative deep neural network for pan-digestive tract cancer survival analysis.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-01-27 DOI: 10.1186/s13040-025-00426-z
Lekai Xu, Tianjun Lan, Yiqian Huang, Liansheng Wang, Junqi Lin, Xinpeng Song, Hui Tang, Haotian Cao, Hua Chai
{"title":"A generative deep neural network for pan-digestive tract cancer survival analysis.","authors":"Lekai Xu, Tianjun Lan, Yiqian Huang, Liansheng Wang, Junqi Lin, Xinpeng Song, Hui Tang, Haotian Cao, Hua Chai","doi":"10.1186/s13040-025-00426-z","DOIUrl":"10.1186/s13040-025-00426-z","url":null,"abstract":"<p><strong>Background: </strong>The accurate identification of molecular subtypes in digestive tract cancer (DTC) is crucial for making informed treatment decisions and selecting potential biomarkers. With the rapid advancement of artificial intelligence, various machine learning algorithms have been successfully applied in this field. However, the complexity and high dimensionality of the data features may lead to overlapping and ambiguous subtypes during clustering.</p><p><strong>Results: </strong>In this study, we propose GDEC, a multi-task generative deep neural network designed for precise digestive tract cancer subtyping. The network optimization process involves employing an integrated loss function consisting of two modules: the generative-adversarial module facilitates spatial data distribution understanding for extracting high-quality information, while the clustering module aids in identifying disease subtypes. The experiments conducted on digestive tract cancer datasets demonstrate that GDEC exhibits exceptional performance compared to other advanced methodologies and can separate different cancer molecular subtypes that possess both statistical and biological significance. Subsequently, 21 hub genes related to pan-DTC heterogeneity and prognosis were identified based on the subtypes clustered by GDEC. 
The following drug analysis suggested Dasatinib and YM155 as potential therapeutic agents for improving the prognosis of patients in pan-DTC immunotherapy, thereby contributing to the enhancement of cancer patient survival.</p><p><strong>Conclusions: </strong>The experiments indicate that GDEC outperforms other deep-learning-based methods, and the interpretable algorithm can select biologically significant genes and potential drugs for DTC treatment.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"9"},"PeriodicalIF":4.0,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11771125/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143054000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Motif clustering and digital biomarker extraction for free-living physical activity analysis.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-01-22 DOI: 10.1186/s13040-025-00424-1
Ya-Ting Liang, Charlotte Wang
{"title":"Motif clustering and digital biomarker extraction for free-living physical activity analysis.","authors":"Ya-Ting Liang, Charlotte Wang","doi":"10.1186/s13040-025-00424-1","DOIUrl":"10.1186/s13040-025-00424-1","url":null,"abstract":"<p><strong>Background: </strong>Analyzing free-living physical activity (PA) data presents challenges due to variability in daily routines and the lack of activity labels. Traditional approaches often rely on summary statistics, which may not capture the nuances of individual activity patterns. To address these limitations and advance our understanding of the relationship between PA patterns and health outcomes, we propose a novel motif clustering algorithm that identifies and characterizes specific PA patterns.</p><p><strong>Methods: </strong>This paper proposes an elastic distance-based motif clustering algorithm for identifying specific PA patterns (motifs) in free-living PA data. The algorithm segments long-term PA curves into short-term segments and utilizes elastic shape analysis to measure the similarity between activity segments. This enables the discovery of recurring motifs through pattern clustering. Functional principal component analysis (FPCA) is then used to extract digital biomarkers from each motif. These digital biomarkers can subsequently be used to explore the relationship between PA and health outcomes of interest.</p><p><strong>Results: </strong>We demonstrate the efficacy of our method through three real-world applications. Results show that digital biomarkers derived from these motifs effectively capture the association between PA patterns and disease outcomes, improving the accuracy of patient classification.</p><p><strong>Conclusions: </strong>This study introduced a novel approach to analyzing free-living PA data by identifying and characterizing specific activity patterns (motifs). 
The derived digital biomarkers provide a more nuanced understanding of PA and its impact on health, with potential applications in personalized health assessment and disease detection, offering a promising future for healthcare.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"8"},"PeriodicalIF":4.0,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11753168/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143025254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
An ensemble machine learning-based performance evaluation identifies top In-Silico pathogenicity prediction methods that best classify driver mutations in cancer.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-01-20 DOI: 10.1186/s13040-024-00420-x
Subrata Das, Vatsal Patel, Shouvik Chakravarty, Arnab Ghosh, Anirban Mukhopadhyay, Nidhan K Biswas
{"title":"An ensemble machine learning-based performance evaluation identifies top In-Silico pathogenicity prediction methods that best classify driver mutations in cancer.","authors":"Subrata Das, Vatsal Patel, Shouvik Chakravarty, Arnab Ghosh, Anirban Mukhopadhyay, Nidhan K Biswas","doi":"10.1186/s13040-024-00420-x","DOIUrl":"10.1186/s13040-024-00420-x","url":null,"abstract":"<p><strong>Background and objective: </strong>Accurate identification and prioritization of driver-mutations in cancer is critical for effective patient management. Despite the presence of numerous bioinformatic algorithms for estimating mutation pathogenicity, there is significant variation in their assessments. This inconsistency is evident even for well-established cancer driver mutations. This study aims to develop an ensemble machine learning approach to evaluate the performance (rank) of pathogenic and conservation scoring algorithms (PCSAs) based on their ability to distinguish pathogenic driver mutations from benign passenger (non-driver) mutations in head and neck squamous cell carcinoma (HNSC).</p><p><strong>Methods: </strong>The study used a dataset from 502 HNSC patients, classifying mutations based on 299 known high-confidence cancer driver genes. Missense somatic mutations in driver genes were treated as driver mutations, while non-driver mutations were randomly selected from other genes. Each mutation was annotated with 41 PCSAs. Three machine learning algorithms-logistic regression, random forest, and support vector machine-along with recursive feature elimination, were used to rank these PCSAs. The final ranking of the PCSAs was determined using rank-average-sort and rank-sum-sort methods.</p><p><strong>Results: </strong>The random forest algorithm emerged as the top performer among the three tested ML algorithms, with an AUC-ROC of 0.89, compared to 0.83 for the other two, in distinguishing pathogenic driver mutations from benign passenger mutations using all 41 PCSAs. 
The top 11 PCSAs were selected based on the first quintile cut-off from the final rank-sum distribution. Classifiers built using these top 11 PCSAs (DEOGEN2, Integrated_fitCons, MVP, etc.) demonstrated significantly higher performance (p-value < 2.22e-16) compared to those using the remaining 30 PCSAs across all three ML algorithms, in separating pathogenic driver from benign passenger mutations. The top PCSAs demonstrated strong performance on a validation cohort including independent HNSC and other cancer types: breast, lung, and colorectal - reflecting its consistency, robustness and generalizability.</p><p><strong>Conclusions: </strong>The ensemble machine learning approach effectively evaluates the performance of PCSAs based on their ability to differentiate pathogenic drivers from benign passenger mutations in HNSC and other cancer types. Notably, some well-known PCSAs performed poorly, underscoring the importance of data-driven selection over relying solely on popularity.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"7"},"PeriodicalIF":4.0,"publicationDate":"2025-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11744934/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143014566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
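The rank-sum step mentioned in the abstract above (several classifiers each rank the scoring algorithms, the ranks are summed, and the top group is kept) can be sketched as follows. The scoring-algorithm names, the rank values, and the `top_fraction` helper are hypothetical. Taking the first 20% of items by count is a simplification assumed here; the paper derives its cut-off from the first quintile of the rank-sum distribution, which need not give the same selection.

```python
# Hypothetical ranks (1 = best) given to five illustrative scoring algorithms
# by three classifiers. Names and numbers are made up for this sketch.
ranks = {
    "logistic_regression": {"A": 1, "B": 3, "C": 2, "D": 5, "E": 4},
    "random_forest":       {"A": 2, "B": 1, "C": 3, "D": 4, "E": 5},
    "svm":                 {"A": 1, "B": 2, "C": 4, "D": 3, "E": 5},
}

def rank_sum_sort(ranks):
    """Sum each item's rank across classifiers and sort ascending (lower is better)."""
    totals = {}
    for per_model in ranks.values():
        for name, rank in per_model.items():
            totals[name] = totals.get(name, 0) + rank
    # Ties on the rank sum are broken alphabetically.
    return sorted(totals.items(), key=lambda kv: (kv[1], kv[0]))

def top_fraction(ordered, fraction=0.2):
    """Keep the leading share of the sorted list (at least one item)."""
    cutoff = max(1, round(len(ordered) * fraction))
    return [name for name, _ in ordered[:cutoff]]

ordered = rank_sum_sort(ranks)
selected = top_fraction(ordered)
```

A rank-average variant would divide each total by the number of classifiers, which leaves the ordering unchanged when every classifier ranks every item.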
Enriched phenotypes in rare variant carriers suggest pathogenic mechanisms in rare disease patients.
IF 4, Zone 3, Biology
Biodata Mining Pub Date : 2025-01-17 DOI: 10.1186/s13040-024-00418-5
Lane Fitzsimmons, Brett Beaulieu-Jones, Shilpa Nadimpalli Kobren
{"title":"Enriched phenotypes in rare variant carriers suggest pathogenic mechanisms in rare disease patients.","authors":"Lane Fitzsimmons, Brett Beaulieu-Jones, Shilpa Nadimpalli Kobren","doi":"10.1186/s13040-024-00418-5","DOIUrl":"10.1186/s13040-024-00418-5","url":null,"abstract":"<p><strong>Background: </strong>The mechanistic pathways that give rise to the extreme symptoms exhibited by rare disease patients are complex, heterogeneous, and difficult to discern. Understanding these mechanisms is critical for developing treatments that address the underlying causes of diseases rather than merely the presenting symptoms. Moreover, the same dysfunctional series of interrelated symptoms implicated in rare recessive diseases may also lead to milder and potentially preventable symptoms in carriers in the general population. Seizures are a common and extreme phenotype that can result from diverse and often elusive pathways in patients with ultrarare or undiagnosed disorders.</p><p><strong>Methods: </strong>In this pilot study, we present an approach to understand the underlying pathways leading to seizures in patients from the Undiagnosed Diseases Network (UDN) by analyzing aggregated genotype and phenotype data from the UK Biobank (UKB). Specifically, we look for enriched phenotypes across UKB participants who harbor rare variants in the same gene known or suspected to be causally implicated in a UDN patient's recessively manifesting disorder. Analyzing these milder but related associated phenotypes in UKB participants can provide insight into the disease-causing mechanisms at play in rare disease UDN patients.</p><p><strong>Results: </strong>We present six vignettes of undiagnosed patients experiencing seizures as part of their recessive genetic condition. For each patient, we analyze a gene of interest: MPO, P2RX7, SQSTM1, COL27A1, PIGQ, or CACNA2D2, and find relevant symptoms associated with UKB participants. 
We discuss the potential mechanisms by which the digestive, skeletal, circulatory, and immune system abnormalities found in the UKB patients may contribute to the severe presentations exhibited by UDN patients. We find that in our set of rare disease patients, seizures may result from diverse, multi-step pathways that involve multiple body systems.</p><p><strong>Conclusions: </strong>Analyses of large-scale population cohorts such as the UKB can be a critical tool to further our understanding of rare diseases in general. Continued research in this area could lead to more precise diagnostics and personalized treatment strategies for patients with rare and undiagnosed conditions.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"6"},"PeriodicalIF":4.0,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11740427/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143014569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0