Biodata Mining | Pub Date: 2025-05-13 | DOI: 10.1186/s13040-025-00450-z
Berit Hunsdieck, Christian Bender, Katja Ickstadt, Johanna Mielke
{"title":"Joint models in big data: simulation-based guidelines for required data quality in longitudinal electronic health records.","authors":"Berit Hunsdieck, Christian Bender, Katja Ickstadt, Johanna Mielke","doi":"10.1186/s13040-025-00450-z","DOIUrl":"10.1186/s13040-025-00450-z","url":null,"abstract":"<p><strong>Background: </strong>Over the past decade an increase in usage of electronic health data (EHR) by office-based physicians and hospitals has been reported. However, these data types come with challenges regarding completeness and data quality, and it is unclear, especially for more complex models, how these characteristics influence performance.</p><p><strong>Methods: </strong>In this paper, we focus on joint models, which combine longitudinal modelling with survival modelling to incorporate all available information. The aim of this paper is to establish simulation-based guidelines for the necessary quality of longitudinal EHR data so that joint models perform better than Cox models. We conducted an extensive simulation study by systematically and transparently varying different characteristics of data quality, e.g., measurement frequency, noise, and heterogeneity between patients. We apply the joint models and evaluate their performance relative to traditional Cox survival modelling techniques.</p><p><strong>Results: </strong>Key findings suggest that biomarker changes before disease onset must be consistent within similar patient groups. With increasing noise and a higher measurement density, the joint model surpasses the traditional Cox regression model in terms of model performance. 
We illustrate the usefulness and limitations of the guidelines with two real-world examples, namely the influence of serum bilirubin on primary biliary liver cirrhosis and the influence of the estimated glomerular filtration rate on chronic kidney disease.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"35"},"PeriodicalIF":4.0,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12070788/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143993927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-05-12 | DOI: 10.1186/s13040-025-00449-6
Ya-Ting Liang, Charlotte Wang
{"title":"Correction: Motif clustering and digital biomarker extraction for free-living physical activity analysis.","authors":"Ya-Ting Liang, Charlotte Wang","doi":"10.1186/s13040-025-00449-6","DOIUrl":"10.1186/s13040-025-00449-6","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"34"},"PeriodicalIF":4.0,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12067653/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144008381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-05-06 | DOI: 10.1186/s13040-025-00447-8
Sulaiman Mohammed Alnasser
{"title":"Revisiting the approaches to DNA damage detection in genetic toxicology: insights and regulatory implications.","authors":"Sulaiman Mohammed Alnasser","doi":"10.1186/s13040-025-00447-8","DOIUrl":"https://doi.org/10.1186/s13040-025-00447-8","url":null,"abstract":"<p><p>Genetic toxicology is crucial for evaluating the potential risks of chemicals and drugs to human health and the environment. The emergence of high-throughput technologies has transformed this field, providing more efficient, cost-effective, and ethically sound methods for genotoxicity testing. It utilizes advanced screening techniques, including automated in vitro assays and computational models to rapidly assess the genotoxic potential of thousands of compounds simultaneously. This review explores the transformation of traditional in vitro and in vivo methods into computational models for genotoxicity assessment. By leveraging advances in machine learning, artificial intelligence, and high-throughput screening, computational approaches are increasingly replacing conventional methods. Coupling conventional screening with artificial intelligence (AI) and machine learning (ML) models has significantly enhanced their predictive capabilities, enabling the identification of genotoxicity signatures tied to molecular structures and biological pathways. Regulatory agencies increasingly support such methodologies as humane alternatives to traditional animal models, provided they are validated and exhibit strong predictive power. Standardization efforts, including the establishment of common endpoints across testing approaches, are pivotal for enhancing comparability and fostering consensus in toxicological assessments. Initiatives like ToxCast exemplify the successful incorporation of HTS data into regulatory decision-making, demonstrating that well-interpreted in vitro results can align with in vivo outcomes. 
Innovations in testing methodologies, global data sharing, and real-time monitoring continue to refine the precision and personalization of risk assessments, promising a transformative impact on safety evaluations and regulatory frameworks.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"33"},"PeriodicalIF":4.0,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12054138/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144051469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-05-02 | DOI: 10.1186/s13040-025-00444-x
Suruthy Sivanathan, Ting Hu
{"title":"Learning the therapeutic targets of acute myeloid leukemia through multiscale human interactome network and community analysis.","authors":"Suruthy Sivanathan, Ting Hu","doi":"10.1186/s13040-025-00444-x","DOIUrl":"https://doi.org/10.1186/s13040-025-00444-x","url":null,"abstract":"<p><p>Acute myeloid leukemia (AML) is caused by proliferation of mutated myeloid progenitor cells. The standard chemotherapy regimen does not efficiently cause remission as there is a high relapse rate. Resistance acquired by leukemic stem cells is suggested to be one of the root causes of relapse. Therefore, there is an urgent need to develop new drugs for therapy. Repurposing approved drugs for AML can provide a cost-effective and time-efficient alternative. The multiscale interactome network is a computational tool that can identify potential therapeutic candidates by comparing mechanisms of the drug and disease. Communities that could be potentially experimentally validated are detected in the multiscale interactome network using the algorithm CRank. The results are evaluated through literature search and Gene Ontology (GO) enrichment analysis. In this research, we identify therapeutic candidates for AML and their mechanisms from the interactome, and isolate prioritized communities that are dominant in the therapeutic mechanism that could potentially be used as a prompt for pre-clinical/translational research (e.g., bioinformatics, laboratory research) to focus on biological functions and mechanisms that are associated with the disease and drug. 
This method may allow for an efficient and accelerated discovery of potential candidates for AML, a rapidly progressing disease.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"32"},"PeriodicalIF":4.0,"publicationDate":"2025-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12049071/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144052657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-04-16 | DOI: 10.1186/s13040-025-00445-w
Zhixuan Zeng, Yang Liu, Shuo Yao, Minjie Lin, Xu Cai, Wenbin Nan, Yiyang Xie, Xun Gong
{"title":"Inter-organ correlation based multi-task deep learning model for dynamically predicting functional deterioration in multiple organ systems of ICU patients.","authors":"Zhixuan Zeng, Yang Liu, Shuo Yao, Minjie Lin, Xu Cai, Wenbin Nan, Yiyang Xie, Xun Gong","doi":"10.1186/s13040-025-00445-w","DOIUrl":"https://doi.org/10.1186/s13040-025-00445-w","url":null,"abstract":"<p><strong>Background: </strong>Functional deterioration (FD) of various organ systems is the major cause of death in ICU patients, but few studies propose an effective multi-task (MT) model to predict FD of multiple organs simultaneously. This study proposes an MT deep learning model, named inter-organ correlation based multi-task model (IOC-MT), to dynamically predict FD in six organ systems.</p><p><strong>Methods: </strong>Three public ICU databases were used for model training and validation. The IOC-MT was designed based on the routine MT deep learning framework, but it used a Graph Attention Networks (GAT) module to capture inter-organ correlation and an adaptive adjustment mechanism (AAM) to adjust prediction. We compared the IOC-MT to five single-task (ST) baseline models, including three deep models (LSTM-ST, GRU-ST, Transformer-ST) and two machine learning models (GRU-ST, RF-ST), and performed an ablation study to assess the contribution of important components in IOC-MT. Model discrimination was evaluated by AUROC and AUPRC, and model calibration was assessed by the calibration curve. The attention weight and adjustment coefficient were analyzed at both the overall and individual levels to show the AAM of IOC-MT.</p><p><strong>Results: </strong>The IOC-MT had comparable discrimination and calibration to LSTM-ST, GRU-ST and Transformer-ST for most organs under different gap windows in the internal and external validation, and clearly outperformed GRU-ST and RF-ST. The ablation study showed that the GAT, AAM and missing indicator could improve the overall performance of the model. 
Furthermore, the inter-organ correlation and prediction adjustment of IOC-MT were intuitive and comprehensible, and also had biological plausibility.</p><p><strong>Conclusions: </strong>The IOC-MT is a promising MT model for dynamically predicting FD in six organ systems. It can capture inter-organ correlation and adjust the prediction for one organ based on aggregated information from the other organs.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"31"},"PeriodicalIF":4.0,"publicationDate":"2025-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12001458/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144043336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-04-15 | DOI: 10.1186/s13040-025-00446-9
Xi Li, Jui-Hsuan Chang, Mythreye Venkatesan, Zhiping Paul Wang, Jason H Moore
{"title":"Enhancing clinical outcome predictions through effective sample size evaluation in graph-based digital twin modeling.","authors":"Xi Li, Jui-Hsuan Chang, Mythreye Venkatesan, Zhiping Paul Wang, Jason H Moore","doi":"10.1186/s13040-025-00446-9","DOIUrl":"https://doi.org/10.1186/s13040-025-00446-9","url":null,"abstract":"<p><p>Digital twins in healthcare offer an innovative approach to precision diagnosis, prognosis, and treatment. SynTwin, a novel computational methodology to generate digital twins using synthetic data and network science, has previously shown promise for improving prediction of breast cancer mortality. In this study, we validate SynTwin using population-level data for different cancer types from the Surveillance, Epidemiology, and End Results (SEER) program from the National Cancer Institute (USA). We assess its predictive accuracy across cancer types of varying sample sizes (n = 1,000 to 30,000 records), mortality rates (35% to 60%), and study designs, revealing insights into the strengths and limitations of digital twins derived from synthetic data in mortality prediction. We also evaluate the effect of sample size (n = 1,000 to 70,000 records) on predictive accuracy for selected cancers (non-Hodgkin lymphoma, bladder, and colorectal cancers). Our results indicate that for larger datasets (n > 10,000) including digital twins in the nearest network neighbor prediction model significantly improves the performance compared to using real patients alone. Specifically, AUROCs ranged from 0.828 to 0.884 for cancers such as cervix uteri and ovarian cancer with digital twins, compared to 0.720 to 0.858 when using real patient data. Similarly, among the selected three cancers, AUROCs using digital twins exceeded AUROCs using real patients alone by at least 0.06 with narrowing variance in performance as the sample size increased. 
These results highlight the benefit of network-based digital twins, while emphasizing the importance of considering effective sample size when developing predictive models like SynTwin.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"30"},"PeriodicalIF":4.0,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11998210/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144057855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-04-11 | DOI: 10.1186/s13040-025-00439-8
Amr Eledkawy, Taher Hamza, Sara El-Metwally
{"title":"Towards precision oncology: a multi-level cancer classification system integrating liquid biopsy and machine learning.","authors":"Amr Eledkawy, Taher Hamza, Sara El-Metwally","doi":"10.1186/s13040-025-00439-8","DOIUrl":"https://doi.org/10.1186/s13040-025-00439-8","url":null,"abstract":"<p><strong>Background: </strong>Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver.</p><p><strong>Results: </strong>The proposed system employs a multi-stage binary classification framework where each stage is customized for a specific cancer type. A majority vote feature selection process is employed by combining six feature selectors: Information Value, Chi-Square, Random Forest Feature Importance, Extra Tree Feature Importance, Recursive Feature Elimination, and L1 Regularization. Following the feature selection process, classifiers (including eXtreme Gradient Boosting, Random Forest, Extra Tree, and Quadratic Discriminant Analysis) are customized for each cancer type individually or in an ensemble soft voting setup to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility of the results, the trained models and the dataset used in this study are made publicly available via the GitHub repository (https://github.com/SaraEl-Metwally/Towards-Precision-Oncology).</p><p><strong>Conclusion: </strong>The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. 
The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"29"},"PeriodicalIF":4.0,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11987386/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144023569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-04-04 | DOI: 10.1186/s13040-025-00443-y
Di Zhao, Wenxuan Mu, Xiangxing Jia, Shuang Liu, Yonghe Chu, Jiana Meng, Hongfei Lin
{"title":"Few-shot biomedical NER empowered by LLMs-assisted data augmentation and multi-scale feature extraction.","authors":"Di Zhao, Wenxuan Mu, Xiangxing Jia, Shuang Liu, Yonghe Chu, Jiana Meng, Hongfei Lin","doi":"10.1186/s13040-025-00443-y","DOIUrl":"10.1186/s13040-025-00443-y","url":null,"abstract":"<p><p>Named Entity Recognition (NER) is a fundamental task in processing biomedical text. Due to the limited availability of labeled data, researchers have investigated few-shot learning methods to tackle this challenge. However, replicating the performance of fully supervised methods remains difficult in few-shot scenarios. This paper addresses two main issues. In terms of data augmentation, existing methods primarily focus on replacing content in the original text, which can potentially distort the semantics. Furthermore, current approaches often neglect sentence features at multiple scales. To overcome these challenges, we utilize ChatGPT to generate enriched data with distinct semantics for the same entities, thereby reducing noisy data. Simultaneously, we employ dynamic convolution to capture multi-scale semantic information in sentences and enhance feature representation based on PubMedBERT. We evaluated the experiments on four biomedical NER datasets (BC5CDR-Disease, NCBI, BioNLP11EPI, BioNLP13GE), and the results exceeded the current state-of-the-art models in most few-shot scenarios, including mainstream large language models like ChatGPT. 
The results confirm the effectiveness of the proposed method in data augmentation and model generalization.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"28"},"PeriodicalIF":4.0,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11969866/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143781479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-03-28 | DOI: 10.1186/s13040-025-00441-0
Patrizia Ribino, Claudia Di Napoli, Giovanni Paragliola, Davide Chicco, Francesca Gasparini
{"title":"Multivariate longitudinal clustering reveals neuropsychological factors as dementia predictors in an Alzheimer's disease progression study.","authors":"Patrizia Ribino, Claudia Di Napoli, Giovanni Paragliola, Davide Chicco, Francesca Gasparini","doi":"10.1186/s13040-025-00441-0","DOIUrl":"https://doi.org/10.1186/s13040-025-00441-0","url":null,"abstract":"<p><p>Dementia due to Alzheimer's disease (AD) is a multifaceted neurodegenerative disorder characterized by various cognitive and behavioral decline factors. In this work, we propose an extension of the traditional k-means clustering for multivariate time series data to cluster joint trajectories of different features describing progression over time. The algorithm we propose here enables the joint analysis of various longitudinal features to explore co-occurring trajectory factors among markers indicative of cognitive decline in individuals participating in an AD progression study. By examining how multiple variables co-vary and evolve together, we identify distinct subgroups within the cohort based on their longitudinal trajectories. Our clustering method enhances the understanding of individual development across multiple dimensions and provides deeper medical insights into the trajectories of cognitive decline. In addition, the proposed algorithm is also able to make a selection of the most significant features in separating clusters by considering trajectories over time. This process, together with a preliminary pre-processing on the OASIS-3 dataset, reveals an important role of some neuropsychological factors. In particular, the proposed method has identified a significant profile compatible with a syndrome known as Mild Behavioral Impairment (MBI), displaying behavioral manifestations of individuals that may precede the cognitive symptoms typically observed in AD patients. 
The findings underscore the importance of considering multiple longitudinal features in clinical modeling, ultimately supporting more effective and individualized patient management strategies.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"26"},"PeriodicalIF":4.0,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951806/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143744332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining | Pub Date: 2025-03-28 | DOI: 10.1186/s13040-025-00442-z
Wei Jiang, Weicai Ye, Xiaoming Tan, Yun-Juan Bao
{"title":"Network-based multi-omics integrative analysis methods in drug discovery: a systematic review.","authors":"Wei Jiang, Weicai Ye, Xiaoming Tan, Yun-Juan Bao","doi":"10.1186/s13040-025-00442-z","DOIUrl":"https://doi.org/10.1186/s13040-025-00442-z","url":null,"abstract":"<p><p>The integration of multi-omics data from diverse high-throughput technologies has revolutionized drug discovery. While various network-based methods have been developed to integrate multi-omics data, systematic evaluation and comparison of these methods remain challenging. This review aims to analyze network-based approaches for multi-omics integration and evaluate their applications in drug discovery. We conducted a comprehensive review of literature (2015-2024) on network-based multi-omics integration methods in drug discovery, and categorized methods into four primary types: network propagation/diffusion, similarity-based approaches, graph neural networks, and network inference models. We also discussed the applications of the methods in three scenarios of drug discovery, including drug target identification, drug response prediction, and drug repurposing, and finally evaluated the performance of the methods by highlighting their advantages and limitations in specific applications. While network-based multi-omics integration has shown promise in drug discovery, challenges remain in computational scalability, data integration, and biological interpretation. 
Future developments should focus on incorporating temporal and spatial dynamics, improving model interpretability, and establishing standardized evaluation frameworks.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"27"},"PeriodicalIF":4.0,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11954193/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143744334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}