Biodata Mining. Pub Date: 2025-03-07. DOI: 10.1186/s13040-025-00435-y
Belén Serrano-Antón, Manuel Insúa Villa, Santiago Pendón-Minguillón, Santiago Paramés-Estévez, Alberto Otero-Cacho, Diego López-Otero, Brais Díaz-Fernández, María Bastos-Fernández, José R González-Juanatey, Alberto P Muñuzuri
{"title":"Unsupervised clustering based coronary artery segmentation.","authors":"Belén Serrano-Antón, Manuel Insúa Villa, Santiago Pendón-Minguillón, Santiago Paramés-Estévez, Alberto Otero-Cacho, Diego López-Otero, Brais Díaz-Fernández, María Bastos-Fernández, José R González-Juanatey, Alberto P Muñuzuri","doi":"10.1186/s13040-025-00435-y","DOIUrl":"10.1186/s13040-025-00435-y","url":null,"abstract":"<p><strong>Background: </strong>The acquisition of 3D geometries of coronary arteries from computed tomography coronary angiography (CTCA) is crucial for clinicians, enabling visualization of lesions and supporting decision-making processes. Manual segmentation of coronary arteries is time-consuming and prone to errors. There is growing interest in automatic segmentation algorithms, particularly those based on neural networks, which require large datasets and significant computational resources for training. This paper proposes an automatic segmentation methodology based on clustering algorithms and a graph structure, which integrates data from both the clustering process and the original images.</p><p><strong>Results: </strong>The study compares two approaches: a 2.5D version using axial, sagittal, and coronal slices (3Axis), and a perpendicular version (Perp), which uses the cross-section of each vessel. The methodology was tested on two patient groups: a test set of 10 patients and an additional set of 22 patients with clinically diagnosed lesions. The 3Axis method achieved a Dice score of 0.88 in the test set and 0.83 in the lesion set, while the Perp method obtained Dice scores of 0.81 in the test set and 0.82 in the lesion set, decreasing to 0.79 and 0.80 in the lesion region, respectively. 
These results are competitive with current state-of-the-art methods.</p><p><strong>Conclusions: </strong>This clustering-based segmentation approach offers a robust framework that can be easily integrated into clinical workflows, improving both accuracy and efficiency in coronary artery analysis. Additionally, the ability to visualize clusters and graphs from any cross-section enhances the method's explainability, providing clinicians with deeper insights into vascular structures. The study demonstrates the potential of clustering algorithms for improving segmentation performance in coronary artery imaging.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"21"},"PeriodicalIF":4.0,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11887207/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143587591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
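The Dice scores reported in the abstract above measure the overlap between a predicted segmentation and a reference segmentation. The following is a minimal sketch of that metric, assuming binary masks flattened into 0/1 sequences; the function name `dice_score` and the toy masks are hypothetical and are not taken from the paper.

```python
def dice_score(pred, truth):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled as vessel in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total number of vessel voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention chosen here (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks, purely illustrative.
predicted = [1, 1, 1, 0, 0, 0, 1, 0]
reference = [1, 1, 0, 0, 0, 0, 1, 1]
print(dice_score(predicted, reference))
```

A value of 1 means the masks coincide and 0 means they share no foreground voxel; real evaluations would apply the same formula to full 3D volumes.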
Biodata Mining. Pub Date: 2025-03-04. DOI: 10.1186/s13040-025-00436-x
Onur Erdogan, Cem Iyigun, Yeşim Aydın Son
{"title":"EnSCAN: ENsemble Scoring for prioritizing CAusative variaNts across multiplatform GWASs for late-onset Alzheimer's disease.","authors":"Onur Erdogan, Cem Iyigun, Yeşim Aydın Son","doi":"10.1186/s13040-025-00436-x","DOIUrl":"10.1186/s13040-025-00436-x","url":null,"abstract":"<p><p>Late-onset Alzheimer's disease (LOAD) is a progressive and complex neurodegenerative disorder of the aging population. LOAD is characterized by cognitive decline, such as deterioration of memory, loss of intellectual abilities, and impairment of other cognitive domains resulting from traumatic brain injuries. Alzheimer's Disease (AD) presents a complex genetic etiology that is still unclear, which limits its early or differential diagnosis. Genome-Wide Association Studies (GWAS) enable the exploration of individual variants' statistical interactions at candidate loci, but univariate analysis overlooks interactions between variants. Machine learning (ML) algorithms can capture hidden, novel, and significant patterns while considering nonlinear interactions between variants to understand the genetic predisposition for complex genetic disorders. When working on different platforms, majority voting cannot be applied because the attributes differ. Hence, a new post-ML ensemble approach was developed to select significant SNVs via multiple genotyping platforms. We proposed the EnSCAN framework using a new algorithm to ensemble selected variants even from different platforms to prioritize candidate causative loci, which consequently helps improve ML results by combining the prior information captured from each dataset. The proposed ensemble algorithm utilizes the chromosomal locations of SNVs by mapping to cytogenetic bands, along with the proximities between pairs and multimodel Random Forest (RF) validations to prioritize SNVs and candidate causative genes for LOAD. The scoring method is scalable and can be applied to any multiplatform genotyping study. 
We present how the proposed EnSCAN scoring algorithm prioritizes candidate causative variants related to LOAD among three GWAS datasets.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"20"},"PeriodicalIF":4.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11881353/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-03-03. DOI: 10.1186/s13040-025-00433-0
Lanfang Zhang, Yuan Cai, Lin Li, Jie Hu, Changsha Jia, Xu Kuang, Yi Zhou, Zhiai Lan, Chunyan Liu, Feng Jiang, Nana Sun, Ni Zeng
{"title":"Analysis of global trends and hotspots of skin microbiome in acne: a bibliometric perspective.","authors":"Lanfang Zhang, Yuan Cai, Lin Li, Jie Hu, Changsha Jia, Xu Kuang, Yi Zhou, Zhiai Lan, Chunyan Liu, Feng Jiang, Nana Sun, Ni Zeng","doi":"10.1186/s13040-025-00433-0","DOIUrl":"10.1186/s13040-025-00433-0","url":null,"abstract":"<p><strong>Background: </strong>Acne is a chronic inflammatory condition affecting the hair follicles and sebaceous glands. Recent research has revealed significant advances in the study of the acne skin microbiome. Systematic analysis of research trends and hotspots in the acne skin microbiome is lacking. This study utilized bibliometric methods to conduct in-depth research on the recognition structure of the acne skin microbiome, identifying hot trends and emerging topics.</p><p><strong>Methods: </strong>We performed a topic search to retrieve articles about skin microbiome in acne from the Web of Science Core Collection. Bibliometric research was conducted using CiteSpace, VOSviewer, and R language.</p><p><strong>Results: </strong>This study analyzed 757 articles from 1362 institutions in 68 countries, with the United States leading the research efforts. Notably, Brigitte Dréno from the University of Nantes emerged as the most prolific author in this field, with 19 papers and 334 co-citations. The research output on the skin microbiome of acne continues to increase, with Experimental Dermatology being the journal with the highest number of published articles. The primary focus is investigating the skin microbiome's mechanisms in acne development and exploring treatment strategies. These findings have important implications for developing microbiome-targeted therapies, which could provide new, personalized treatment options for patients with acne. 
Emerging research hotspots include skincare, gut microbiome, and treatment.</p><p><strong>Conclusion: </strong>The study's findings indicate a thriving research interest in the skin microbiome and its relationship to acne, focusing on acne treatment through the regulation of the skin microbiome balance. Currently, the development of skincare products targeting the regulation of the skin microbiome represents a research hotspot, reflecting the transition from basic scientific research to clinical practice.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"19"},"PeriodicalIF":4.0,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11874858/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143544184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-02-27. DOI: 10.1186/s13040-025-00434-z
Patrick Maximilian Schwehn, Pascal Falter-Braun
{"title":"Inferring protein from transcript abundances using convolutional neural networks.","authors":"Patrick Maximilian Schwehn, Pascal Falter-Braun","doi":"10.1186/s13040-025-00434-z","DOIUrl":"10.1186/s13040-025-00434-z","url":null,"abstract":"<p><strong>Background: </strong>Although transcript abundance is often used as a proxy for protein abundance, it is an unreliable predictor. As proteins execute biological functions and their expression levels influence phenotypic outcomes, we developed a convolutional neural network (CNN) to predict protein abundances from mRNA abundances, protein sequence, and mRNA sequence in Homo sapiens (H. sapiens) and the reference plant Arabidopsis thaliana (A. thaliana).</p><p><strong>Results: </strong>After hyperparameter optimization and initial data exploration, we implemented distinct training modules for value-based and sequence-based data. By analyzing the learned weights, we revealed common and organism-specific sequence features that influence protein-to-mRNA ratios (PTRs), including known and putative sequence motifs. Adding condition-specific protein interaction information identified genes correlated with many PTRs but did not improve predictions, likely due to insufficient data. The integrated model predicted protein abundance on unseen genes with a coefficient of determination (r<sup>2</sup>) of 0.30 in H. sapiens and 0.32 in A. thaliana.</p><p><strong>Conclusions: </strong>For H. sapiens, our model improves prediction performance by nearly 50% compared to previous sequence-based approaches, and for A. thaliana it represents the first model of its kind. 
The model's learned motifs recapitulate known regulatory elements, supporting its utility in systems-level and hypothesis-driven research approaches related to protein regulation.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"18"},"PeriodicalIF":4.0,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11866710/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143525013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
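The abstract above reports performance as a coefficient of determination. As a rough illustration, the sketch below computes R² in its common form, 1 - SS_res / SS_tot; the abstract does not state which definition the authors used (it may instead be a squared Pearson correlation), so the formula choice, the function name `r_squared`, and the toy values are all assumptions.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Toy abundances, purely illustrative (not data from the study).
observed_abundance = [1.0, 2.0, 3.0, 4.0]
predicted_abundance = [1.5, 1.5, 3.5, 3.5]
print(r_squared(observed_abundance, predicted_abundance))
```

Under this definition, a model that always predicts the mean of the observations scores 0, so a value near 0.3 means roughly 30% of the variance in the held-out targets is explained.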
Biodata Mining. Pub Date: 2025-02-18. DOI: 10.1186/s13040-025-00429-w
Tayo Obafemi-Ajayi, Steven F Jennings, Yu Zhang, Kara Li Liu, Joan Peckham, Jason H Moore
{"title":"AI as an accelerator for defining new problems that transcends boundaries.","authors":"Tayo Obafemi-Ajayi, Steven F Jennings, Yu Zhang, Kara Li Liu, Joan Peckham, Jason H Moore","doi":"10.1186/s13040-025-00429-w","DOIUrl":"10.1186/s13040-025-00429-w","url":null,"abstract":"<p><p>Interdisciplinary, transdisciplinary, convergence, and No-Boundary Thinking (NBT) research are methodology and technology-agnostic approaches to problem solving. The focus is on defining problems informed by access to multiple knowledge sources and expert perspectives across different domains, with the goal of accessing all available knowledge sources and perspectives. While access to all available knowledge sources and perspectives could be seen as a difficult-to-attain objective, with the recent rise of AI we might be closer to approaching this goal. We review several examples of methodologies and technologies that have been used to put these strategies into action, but the primary focus of this paper is on how recent advances in AI now enable a quantum leap forward in defining new problems. By leveraging the capacity of AI to synthesize knowledge from multiple domains, these tools can be used to propose multiple candidate problem definitions. AI is uniquely able to draw upon many more knowledge sources than any individual, or even a very large team, could. 
Coupled with human intelligence, better problems can be defined to address complex scholarly or societal challenges.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"17"},"PeriodicalIF":4.0,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11837601/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143450623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-02-17. DOI: 10.1186/s13040-025-00431-2
Arezoo Abasi, Ahmad Nazari, Azar Moezy, Seyed Ali Fatemi Aghda
{"title":"Machine learning models for reinjury risk prediction using cardiopulmonary exercise testing (CPET) data: optimizing athlete recovery.","authors":"Arezoo Abasi, Ahmad Nazari, Azar Moezy, Seyed Ali Fatemi Aghda","doi":"10.1186/s13040-025-00431-2","DOIUrl":"10.1186/s13040-025-00431-2","url":null,"abstract":"<p><strong>Background: </strong>Cardiopulmonary Exercise Testing (CPET) provides detailed insights into athletes' cardiovascular and pulmonary function, making it a valuable tool in assessing recovery and injury risks. However, traditional statistical models often fail to leverage the full potential of CPET data in predicting reinjury. Machine learning (ML) algorithms offer promising capabilities in uncovering complex patterns within this data, allowing for more accurate injury risk assessment.</p><p><strong>Objective: </strong>This study aimed to develop machine learning models to predict reinjury risk among elite soccer players using CPET data. Specifically, we sought to identify key physiological and performance variables that correlate with reinjury and to evaluate the performance of various ML algorithms in generating accurate predictions.</p><p><strong>Methods: </strong>A dataset of 256 elite soccer players from 16 national and top-tier teams in Iran was analyzed, incorporating physiological variables and categorical data. Several machine learning models, including CatBoost, SVM, Random Forest, and XGBoost, were employed to predict reinjury risk. Model performance was assessed using metrics such as accuracy, precision, recall, F1-score, AUC, and SHAP values to ensure robust evaluation and interpretability.</p><p><strong>Results: </strong>CatBoost and SVM exhibited the best performance, with CatBoost achieving the highest accuracy (0.9138) and F1-score (0.9148), and SVM achieving the highest AUC (0.9725). 
A significant association was found between a history of concussion and reinjury risk (χ² = 13.0360, p = 0.0015), highlighting the importance of neurological recovery in preventing future injuries. Heart rate metrics, particularly HRmax and HR2, were also significantly lower in players who experienced reinjury, indicating reduced cardiovascular capacity in this group.</p><p><strong>Conclusion: </strong>Machine learning models, particularly CatBoost and SVM, provide promising tools for predicting reinjury risk using CPET data. These models offer clinicians more precise, data-driven insights into athlete recovery and risk management. Future research should explore the integration of external factors such as training load and psychological readiness to further refine these predictions and enhance injury prevention protocols.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"16"},"PeriodicalIF":4.0,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11834553/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143442544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
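The accuracy, precision, recall, and F1-score named in the abstract above can all be derived from the four confusion-matrix counts of a binary classifier. The sketch below shows one way to compute them; the function name `classification_metrics`, the 1 = reinjury label encoding, and the toy labels are assumptions made for illustration and do not come from the study.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    # Ratios with an empty denominator are reported as 0.0 (a convention chosen here).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, purely illustrative: 1 = reinjury, 0 = no reinjury.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(classification_metrics(actual, predicted_labels))
```

AUC and SHAP values, also mentioned in the abstract, need predicted scores and a fitted model respectively, so they are left out of this sketch.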
Biodata Mining. Pub Date: 2025-02-15. DOI: 10.1186/s13040-025-00430-3
Christel Sirocchi, Martin Urschler, Bastian Pfeifer
{"title":"Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping.","authors":"Christel Sirocchi, Martin Urschler, Bastian Pfeifer","doi":"10.1186/s13040-025-00430-3","DOIUrl":"10.1186/s13040-025-00430-3","url":null,"abstract":"<p><p>Explainable and interpretable machine learning has emerged as essential in leveraging artificial intelligence within high-stakes domains such as healthcare to ensure transparency and trustworthiness. Feature importance analysis plays a crucial role in improving model interpretability by pinpointing the most relevant input features, particularly in disease subtyping applications, aimed at stratifying patients based on a small set of signature genes and biomarkers. While clustering methods, including unsupervised random forests, have demonstrated good performance, approaches for evaluating feature contributions in an unsupervised regime are notably scarce. To address this gap, we introduce a novel methodology to enhance the interpretability of unsupervised random forests by elucidating feature contributions through the construction of feature graphs, both over the entire dataset and individual clusters, that leverage parent-child node splits within the trees. Feature selection strategies to derive effective feature combinations from these graphs are presented and extensively evaluated on synthetic and benchmark datasets against state-of-the-art methods, standing out for performance, computational efficiency, reliability, versatility and ability to provide cluster-specific insights. In a disease subtyping application, clustering kidney cancer gene expression data over a feature subset selected with our approach reveals three patient groups with different survival outcomes. 
Cluster-specific analysis identifies distinctive feature contributions and interactions, essential for devising targeted interventions, conducting personalised risk assessments, and enhancing our understanding of the underlying molecular complexities.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"15"},"PeriodicalIF":4.0,"publicationDate":"2025-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11829558/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143426202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-02-10. DOI: 10.1186/s13040-025-00428-x
Maryam Ramezani, Mohammadreza Mobinizadeh, Ahad Bakhtiari, Hamid R Rabiee, Maryam Ramezani, Hakimeh Mostafavi, Alireza Olyaeemanesh, Ali Akbar Fazaeli, Alireza Atashi, Saharnaz Sazgarnejad, Efat Mohamadi, Amirhossein Takian
{"title":"Agenda setting for health equity assessment through the lenses of social determinants of health using machine learning approach: a framework and preliminary pilot study.","authors":"Maryam Ramezani, Mohammadreza Mobinizadeh, Ahad Bakhtiari, Hamid R Rabiee, Maryam Ramezani, Hakimeh Mostafavi, Alireza Olyaeemanesh, Ali Akbar Fazaeli, Alireza Atashi, Saharnaz Sazgarnejad, Efat Mohamadi, Amirhossein Takian","doi":"10.1186/s13040-025-00428-x","DOIUrl":"10.1186/s13040-025-00428-x","url":null,"abstract":"<p><strong>Introduction: </strong>The integration of Artificial Intelligence (AI) and Machine Learning (ML) is transforming public health by enhancing the assessment and mitigation of health inequities. As the use of AI tools, especially ML techniques, rises, they play a pivotal role in informing policies that promote a more equitable society. This study aims to develop a framework utilizing ML to analyze health system data and set agendas for health equity interventions, focusing on social determinants of health (SDH).</p><p><strong>Method: </strong>This study utilized the CRISP-ML(Q) model to introduce a platform for health equity assessment, facilitating its design and implementation in health systems. Initially, a conceptual model was developed through a comprehensive literature review and document analysis. A pilot implementation was conducted to test the feasibility and effectiveness of using ML algorithms in assessing health equity. Life expectancy was chosen as the health outcome for this pilot; data from 2000 to 2020 with 140 features was cleaned, transformed, and prepared for modeling. Multiple ML models were developed and evaluated using SPSS Modeler software version 18.0.</p><p><strong>Results: </strong>ML algorithms effectively identified key SDH influencing life expectancy. 
Among the classification models, the Linear Discriminant algorithm was selected as the best due to its high accuracy in both the training and testing phases, its strong performance in identifying key features, and its good generalizability to new data. Additionally, among the numeric models, CHAID was the best for predicting the actual value of life expectancy based on various features. These models highlighted the importance of features like current health expenditure, domestic general government health expenditure, and GDP in predicting life expectancy.</p><p><strong>Conclusion: </strong>The findings underscore the significance of employing innovative methods like CRISP-ML(Q) and ML algorithms to enhance health equity. Integrating this platform into health systems can help countries better prioritize and address health inequities. The pilot implementation demonstrated these methods' practical applicability and effectiveness, aiding policymakers in making informed decisions to improve health equity.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"14"},"PeriodicalIF":4.0,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11808983/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143392203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-02-04. DOI: 10.1186/s13040-025-00427-y
Yi-Chou Chen, Hui-Chen Su, Shih-Ming Huang, Ching-Hsiao Yu, Jen-Huei Chang, Yi-Lin Chiu
{"title":"Immune cell profiles and predictive modeling in osteoporotic vertebral fractures using XGBoost machine learning algorithms.","authors":"Yi-Chou Chen, Hui-Chen Su, Shih-Ming Huang, Ching-Hsiao Yu, Jen-Huei Chang, Yi-Lin Chiu","doi":"10.1186/s13040-025-00427-y","DOIUrl":"10.1186/s13040-025-00427-y","url":null,"abstract":"<p><strong>Background: </strong>Osteoporosis significantly increases the risk of vertebral fractures, particularly among postmenopausal women, decreasing their quality of life. These fractures, often undiagnosed, can lead to severe health consequences and are influenced by bone mineral density and abnormal loads. Management strategies range from non-surgical interventions to surgical treatments. Moreover, the interaction between immune cells and bone cells plays a crucial role in bone repair processes, highlighting the importance of osteoimmunology in understanding and treating bone pathologies.</p><p><strong>Methods: </strong>This study aims to investigate the xCell signature-based immune cell profiles in osteoporotic patients with and without vertebral fractures, utilizing advanced predictive modeling through the XGBoost algorithm.</p><p><strong>Results: </strong>Our findings reveal an increased presence of CD4+ naïve T cells and central memory T cells in patients with vertebral fractures (VF), indicating distinct adaptive immune responses. The XGBoost model identified Th1 cells, CD4 memory T cells, and hematopoietic stem cells as key predictors of VF. Notably, VF patients exhibited a reduction in Th1 cells and an enrichment of Th17 cells, which promote osteoclastogenesis and bone resorption. Gene expression analysis further highlighted an upregulation of osteoclast-related genes and a downregulation of osteoblast-related genes in VF patients, emphasizing the disrupted balance between bone formation and resorption. 
These findings underscore the critical role of immune cells in the pathogenesis of osteoporotic fractures and highlight the potential of XGBoost in identifying key biomarkers and therapeutic targets for mitigating fracture risk in osteoporotic patients.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"13"},"PeriodicalIF":4.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11792337/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143191123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-02-03. DOI: 10.1186/s13040-024-00415-8
Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A AlQahtani, Nijad Ahmad
{"title":"XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites.","authors":"Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A AlQahtani, Nijad Ahmad","doi":"10.1186/s13040-024-00415-8","DOIUrl":"10.1186/s13040-024-00415-8","url":null,"abstract":"<p><p>Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, significantly affecting gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical role in protein function. Identifying sumoylation sites is particularly important due to their links to Parkinson's and Alzheimer's. This study introduces XGBoost-Sumo, a robust model to predict sumoylation sites by integrating protein structure and sequence data. The model utilizes a transformer-based attention mechanism to encode peptides and extract evolutionary features through the PsePSSM-DWT approach. By fusing word embeddings with evolutionary descriptors, it applies the SHapley Additive exPlanations (SHAP) algorithm for optimal feature selection and uses eXtreme Gradient Boosting (XGBoost) for classification. XGBoost-Sumo achieved an impressive accuracy of 99.68% on benchmark datasets using 10-fold cross-validation and 96.08% on independent samples. This marks a significant improvement, outperforming existing models by 10.31% on training data and 2.74% on independent tests. 
The model's reliability and high performance make it a valuable resource for researchers, with strong potential for applications in pharmaceutical development.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"12"},"PeriodicalIF":4.0,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11792219/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143123566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}