Alessio Zanga , Alice Bernasconi , Peter J.F. Lucas , Hanny Pijnenborg , Casper Reijnen , Marco Scutari , Anthony C. Constantinou
{"title":"Federated causal discovery with missing data in a multicentric study on endometrial cancer","authors":"Alessio Zanga , Alice Bernasconi , Peter J.F. Lucas , Hanny Pijnenborg , Casper Reijnen , Marco Scutari , Anthony C. Constantinou","doi":"10.1016/j.jbi.2025.104877","DOIUrl":"10.1016/j.jbi.2025.104877","url":null,"abstract":"<div><h3>Objectives:</h3><div>Establishing causal dependencies is crucial in applied domains, such as medicine and healthcare, where decision-making must be explainable. In these settings, small sample sizes and missing data call for federated approaches to maximise the amount of information we can use.</div></div><div><h3>Methods:</h3><div>We propose a novel federated causal discovery algorithm capable of pooling information from multiple sources with heterogeneous missing data to learn a graph representing cause–effect relationships. In particular, we learn a causal graph on a centralised server while taking into account both prior knowledge and missingness mechanism specific to each client.</div></div><div><h3>Results:</h3><div>We applied the proposed algorithm to synthetic data and real-world data from a multicentric study on endometrial cancer, validating the obtained causal graph through quantitative analyses and a clinical literature review.</div></div><div><h3>Conclusion:</h3><div>Our approach learns an accurate model despite data missing not-at-random.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104877"},"PeriodicalIF":4.5,"publicationDate":"2025-07-22","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144707575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A scoping review of natural language processing in addressing medically inaccurate information: Errors, misinformation, and hallucination","authors":"Zhaoyi Sun , Wen-Wai Yim , Özlem Uzuner , Fei Xia , Meliha Yetisgen","doi":"10.1016/j.jbi.2025.104866","DOIUrl":"10.1016/j.jbi.2025.104866","url":null,"abstract":"<div><h3>Objective:</h3><div>This review aims to explore the potential and challenges of using Natural Language Processing (NLP) to detect, correct, and mitigate medically inaccurate information, including errors, misinformation, and hallucination. By unifying these concepts, the review emphasizes their shared methodological foundations and their distinct implications for healthcare. Our goal is to advance patient safety, improve public health communication, and support the development of more reliable and transparent NLP applications in healthcare.</div></div><div><h3>Methods:</h3><div>A scoping review was conducted following PRISMA-ScR guidelines, analyzing studies from 2020 to 2024 across five databases. Studies were selected based on their use of NLP to address medically inaccurate information and were categorized by topic, tasks, document types, datasets, models, and evaluation metrics.</div></div><div><h3>Results:</h3><div>NLP has shown potential in addressing medically inaccurate information on the following tasks: (1) error detection (2) error correction (3) misinformation detection (4) misinformation correction (5) hallucination detection (6) hallucination mitigation. However, challenges remain with data privacy, context dependency, and evaluation standards.</div></div><div><h3>Conclusion:</h3><div>This review highlights the advancements in applying NLP to tackle medically inaccurate information while underscoring the need to address persistent challenges. 
Future efforts should focus on developing real-world datasets, refining contextual methods, and improving hallucination management to ensure reliable and transparent healthcare applications.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104866"},"PeriodicalIF":4.0,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144686639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hao Dai , Yu Huang , Yuxi Liu , Xing He , Jingchuan Guo , Mattia Prosperi , Jiang Bian
{"title":"Variational temporal deconfounder network for individualized treatment effect estimation with longitudinal observational data","authors":"Hao Dai , Yu Huang , Yuxi Liu , Xing He , Jingchuan Guo , Mattia Prosperi , Jiang Bian","doi":"10.1016/j.jbi.2025.104880","DOIUrl":"10.1016/j.jbi.2025.104880","url":null,"abstract":"<div><h3>Objective</h3><div>By leveraging real-world electronic health record (EHR) data, this study set out to estimate individualized treatment effects (ITE) in longitudinal observational settings to advance personalized medicine, addressing key challenges that are often observed in real-world clinical scenarios and pose statistical challenges, including hidden confounding and dynamic treatment regimens.</div></div><div><h3>Methods</h3><div>We propose the Variational Temporal Deconfounder Network (VTDNet), a novel framework designed to account for time-varying hidden confounding using a variational recurrent transformer-based autoencoder. Specifically, VTDNet comprises three critical components: a temporal Encoder-Decoder structure to capture hidden representation, a Treatment Block that captures interdependencies among multiple treatments, and a Potential Outcome Block that predicts both factual and counterfactual outcomes. We assess the effectiveness of the proposed framework using a synthetic dataset and two real-world datasets: MIMIC-III, an EHR dataset focusing on intensive care settings, and NACC, emphasizing neurodegenerative disease, collected using a standardized protocol from participants enrolled in Alzheimer’s Disease Research Center (ADRC) clinical cores.</div></div><div><h3>Results</h3><div>Experimental results on the synthetic dataset demonstrate superior accuracy under varying levels of confounding. 
On real-world EHR datasets, VTDNet achieves lower root mean squared error, mean absolute error, and influence function precision in the estimation of heterogeneous effects compared to existing state-of-the-art methods.</div></div><div><h3>Conclusion</h3><div>The proposed VTDNet offers a robust framework for estimating individualized treatment effects in longitudinal settings, effectively accommodating irregular time points and high-dimensional data while addressing hidden confounders through a deep generative approach. It holds significant potential to advance personalized medicine and support real-world evidence generation. Future work will aim to extend VTDNet to continuous treatment scenarios, such as dose–response analysis, to further broaden its applicability in clinical practice.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104880"},"PeriodicalIF":4.0,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144695011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Richard Wyss , Jie Yang , Sebastian Schneeweiss , Joseph M. Plasek , Li Zhou , Thomas Deramus , Janick G. Weberpals , Kerry Ngan , Theodore N. Tsacogianis , Kueiyu Joshua Lin
{"title":"Natural language processing for scalable feature engineering and ultra-high-dimensional confounding adjustment in healthcare database studies","authors":"Richard Wyss , Jie Yang , Sebastian Schneeweiss , Joseph M. Plasek , Li Zhou , Thomas Deramus , Janick G. Weberpals , Kerry Ngan , Theodore N. Tsacogianis , Kueiyu Joshua Lin","doi":"10.1016/j.jbi.2025.104882","DOIUrl":"10.1016/j.jbi.2025.104882","url":null,"abstract":"<div><h3>Background</h3><div>To improve confounding control in healthcare database studies, data-driven algorithms may empirically identify and adjust for large numbers of pre-exposure variables that indirectly capture information on unmeasured confounding factors (‘proxy’ confounders). Current approaches for high-dimensional proxy adjustment do not leverage free-text notes from electronic health records (EHRs). Unsupervised natural language processing (NLP) technology can scale to generate large numbers of structured features from unstructured notes.</div></div><div><h3>Objective</h3><div>To assess the impact of supplementing claims data analyses with large numbers of NLP generated features for high-dimensional proxy adjustment.</div></div><div><h3>Methods</h3><div>We linked Medicare claims with EHR data to generate three cohorts comparing different classes of medications on the 6-month risk of cardiovascular outcomes. We used various NLP methods to generate structured features from free-text EHR notes and used least absolute shrinkage and selection operator (LASSO) regression to fit several propensity score (PS) models that included different covariate sets as candidate predictors. Covariate sets included features generated from claims data only, and claims data plus NLP-generated EHR features.</div></div><div><h3>Results</h3><div>Including both claims codes and NLP-generated EHR features as candidate predictors improved overall covariate balance with standardized differences being < 0.1 for all variables. 
While overall balance improved, the impact on estimated treatment effects was more nuanced: adjustment for NLP-generated features moved effect estimates further in the expected direction in two of the empirical studies but had no impact on the third.</div></div><div><h3>Conclusion</h3><div>Supplementing administrative claims with large numbers of NLP-generated features for ultra-high-dimensional proxy confounder adjustment improved overall covariate balance and may provide a modest benefit in terms of capturing confounder information.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104882"},"PeriodicalIF":4.0,"publicationDate":"2025-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144682653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ba-Hoang Tran , Hung-Manh Hoang , Binh-Nguyen Nguyen , Duy-Cat Can , Hoang-Quynh Le
{"title":"A multifaceted approach to drug–drug interaction extraction with fusion strategies","authors":"Ba-Hoang Tran , Hung-Manh Hoang , Binh-Nguyen Nguyen , Duy-Cat Can , Hoang-Quynh Le","doi":"10.1016/j.jbi.2025.104874","DOIUrl":"10.1016/j.jbi.2025.104874","url":null,"abstract":"<div><h3>Objective:</h3><div>Drug–drug interactions (DDIs) occur when one medication affects the efficacy of another, potentially leading to unforeseen patient outcomes. Existing studies primarily focus on textual data, but overlook a wealth of the drug’s multimodal information. This study aims to enhance DDI extraction by integrating diverse data modalities and evaluating various fusion strategies.</div></div><div><h3>Methods:</h3><div>We introduce a multimodal approach that integrates diverse representations of drug information (scientific text, graphs, formulas, images, and descriptions) to enhance the detection of drug–drug interactions. We explored various fusion techniques to effectively combine these modalities across early, intermediate, and late fusion phases. Additionally, we identify the factors contributing to failed cases, providing insights into the model’s limitations and potential improvements. We have conducted experiments using publicly available DDI datasets, demonstrating significant improvements compared to existing methods.</div></div><div><h3>Results:</h3><div>The proposed model significantly outperformed existing methods in DDI detection. Intermediate fusion strategies, particularly prediction-level concatenation, demonstrated superior accuracy and robustness. 
Detailed analyses identified factors contributing to failed cases, offering insights for future improvements.</div></div><div><h3>Conclusion:</h3><div>The findings highlight the potential of multimodal fusion to enhance predictive accuracy, providing a foundation for safer drug therapies and better-informed clinical decisions.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104874"},"PeriodicalIF":4.0,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144659341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
William Baskett , Benjamin Black , Adnan I. Qureshi , Chi-Ren Shyu
{"title":"Identifying homogenous patient subgroups using transformer based hierarchical clustering of heterogeneous Mixed-Modality medical data","authors":"William Baskett , Benjamin Black , Adnan I. Qureshi , Chi-Ren Shyu","doi":"10.1016/j.jbi.2025.104878","DOIUrl":"10.1016/j.jbi.2025.104878","url":null,"abstract":"<div><h3>Objective</h3><div>Patients are highly heterogeneous, with varying needs and responses to treatment. Identifying clinically homogenous patient subgroups is critical to improve personalized care. Patient records are often heterogeneous, may include multiple modalities which conventionally require separate data processing considerations, and are often incomplete, leading to difficulties in identifying meaningful clusters of patients.</div></div><div><h3>Methods</h3><div>We introduce Med-ROAR, a transformer-based Random Order AutoRegressive (ROAR) embedding model for medical data. Med-ROAR hierarchically clusters data by encoding it into hierarchical discrete embeddings using a modified self-attention operation to facilitate random order mixed modality autoregressive modeling. This allows the model to accept arbitrary mixes of record types without special considerations. We compare our method’s clustering effectiveness to standard agglomerative clustering using 147,469 individuals diagnosed with Autism Spectrum Disorder (ASD). We also evaluate its use on data with mixed modalities and its resilience to missing information using 50,458 clinical records from Intensive Care Unit (ICU) patients which include both tabular and time-series components.</div></div><div><h3>Results</h3><div>We demonstrate that Med-ROAR is more likely to discover more cohesive high-level clusters than distance-based methods like agglomerative clustering. Our exploratory analysis of the autism data identifies clinically meaningful patterns of phenotypes within ASD. We identify homogenous, but atypical, patient subgroups within the ASD population. 
We also demonstrate Med-ROAR’s effectiveness in clustering patients using mixes of both tabular and time series clinical records from ICU patients. We demonstrate that Med-ROAR can predict patient subgroups even using incomplete, preliminary information collected shortly after admission.</div></div><div><h3>Conclusion</h3><div>Med-ROAR is a flexible hierarchical clustering technique which learns to cluster patients based on learned high-level semantic similarities rather than rule-based metrics. It can accept whatever patient data may be available without modification to the underlying model architecture. The data modalities which Med-ROAR can accept are primarily constrained by computational resources, rather than architectural limitations.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104878"},"PeriodicalIF":4.0,"publicationDate":"2025-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144642647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sooyon Kim , Yongtaek Lim , Sungjun Lim , Gyeongdeok Seo , Jihee Kim , Hojun Park , Jaehun Jung , Kyungwoo Song
{"title":"COVID-19 prediction with doubly multi-task Gaussian Process","authors":"Sooyon Kim , Yongtaek Lim , Sungjun Lim , Gyeongdeok Seo , Jihee Kim , Hojun Park , Jaehun Jung , Kyungwoo Song","doi":"10.1016/j.jbi.2025.104872","DOIUrl":"10.1016/j.jbi.2025.104872","url":null,"abstract":"<div><div>This paper addresses a real-world multi-task prediction problem with time-series characteristics by proposing a novel Doubly Multi-Task Gaussian Process (DMTGP) model. Motivated by strong correlations between the number of confirmed cases and deaths, as well as between cases across the different countries, the model incorporates task-wise correlations to predict the number of COVID-19 patients, considering both task-specific (individual) and cross-task (shared) information to enhance overall performance. We constructed a database for three East Asian countries — Japan, South Korea, and Taiwan — and aim to simultaneously predict the number of confirmed cases and deaths in each country. To model the interactions among these countries, we employed a Transformer encoder layer to calculate cross-attention scores. Qualitative analysis of the attention score map demonstrates that our framework effectively captures the dynamic relationships between multiple nations over time. 
Our experimental results show that the DMTGP model outperforms other baseline models in handling doubly multiple tasks.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104872"},"PeriodicalIF":4.0,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144626489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prabodi Senevirathna , Douglas E.V. Pires , Daniel Capurro
{"title":"Uncovering digital overdiagnosis – Quantification and mitigation using clinical trajectories: Heparin-induced thrombocytopenia use case","authors":"Prabodi Senevirathna , Douglas E.V. Pires , Daniel Capurro","doi":"10.1016/j.jbi.2025.104876","DOIUrl":"10.1016/j.jbi.2025.104876","url":null,"abstract":"<div><h3>Objective</h3><div>Overdiagnosis occurs when abnormalities meeting diagnostic criteria would remain asymptomatic if undiagnosed. Cases initially identified through digital diagnostic tools but later recognised as overdiagnosis are referred to as ‘digital overdiagnosis’. Data-driven frameworks to quantify and mitigate overdiagnosis remain limited. This study introduces a framework that integrates clinical trajectories to train a machine learning (ML)-based disease classifier, enabling the quantification and mitigation of digital overdiagnosis, using Heparin-Induced Thrombocytopenia (HIT) as a case study.</div></div><div><h3>Methods</h3><div>A pre-existing HIT classifier identified HIT-positive and HIT-negative cases, with ground truth based on HIT diagnostic criteria. Clinical trajectories for True Positive (TP) and True Negative (TN) patients were clustered using a novel process-models-based approach. Overdiagnosis was detected when TP cases clustered with predominantly TN cases. The classifier was then retrained with an ‘updated label’ integrating both HIT criteria and the concordant trajectory, to reduce overdiagnosis while maintaining accuracy.</div></div><div><h3>Results</h3><div>7.2% of TP cases were identified as overdiagnosed. Retraining with the updated labels successfully reclassified 89.5% of overdiagnosed cases as TN, with only a minimal reduction in performance (MCC decreased by 0.03, positive likelihood ratio decreased by 0.49, and negative likelihood ratio increased by 0.05). 
Clinical outcomes—length of stay, thrombotic events, and mortality—differed significantly between non-overdiagnosed and overdiagnosed cases, and between non-overdiagnosed and TN cases, but not between overdiagnosed and TN cases, confirming that overdiagnosed patients resemble TN patients.</div></div><div><h3>Conclusion</h3><div>Incorporating clinical trajectories into ML-based diagnosis enables the quantification of digital overdiagnosis. This approach could refine ML algorithms by prompting a reassessment of criteria-based disease labels in supervised learning.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104876"},"PeriodicalIF":4.0,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144608513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yiyan Deng , Shen Zhao , Yongming Miao , Junjie Zhu , Jin Li
{"title":"MedKA: A knowledge graph-augmented approach to improve factuality in medical Large Language Models","authors":"Yiyan Deng , Shen Zhao , Yongming Miao , Junjie Zhu , Jin Li","doi":"10.1016/j.jbi.2025.104871","DOIUrl":"10.1016/j.jbi.2025.104871","url":null,"abstract":"<div><div>Large language models (LLMs) have demonstrated remarkable potential in medical applications. However, they still face critical challenges such as hallucinations, knowledge inconsistency, and insufficient integration of domain-specific medical expertise. To address these limitations, we introduce MedKA, a novel knowledge graph-augmented approach for fine-tuning and evaluating medical LLMs. Our approach systematically transforms structured knowledge from a medical knowledge graph into a high-quality QA corpus, cMKGQA, by clustering multiple fields around clinically meaningful scenarios (e.g., diagnosis, treatment planning). This grouping strategy enables comprehensive and use-case-specific data construction and supports one-stage training of the LLM, ensuring better alignment with structured medical knowledge. This transformation process ensures the comprehensive integration of domain-specific knowledge while maintaining factual consistency. To evaluate the factuality of LLM-generated responses, we further propose the Knowledge Graph-based Auxiliary Evaluation Metrics (KG-AEMs)—a novel benchmarking framework that compares LLM outputs with fine-grained, attribute-level ground truth from the knowledge graph. Experimental results demonstrate that MedKA achieves state-of-the-art performance, significantly outperforming existing models, including LLaMA-3.1-8B-Chinese-Chat, HuatuoGPT2-7B, and Apollo2-7B. On the cMKGQA dataset, MedKA achieves 44.63 BLEU-1 and 17.62 BLEU-4 scores, with particularly strong performance in areas such as medication recommendations and diagnostic tests as measured by KG-AEMs. 
Our approach highlights the potential of integrating knowledge graphs into LLM fine-tuning to improve the accuracy and reliability of medical AI systems. It advances factual accuracy in medical dialogue systems and provides a comprehensive framework for evaluating the integration of medical knowledge into LLMs. This work is publicly available on GitHub: <span><span>https://github.com/Yai017/MedKA</span></span>.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"168 ","pages":"Article 104871"},"PeriodicalIF":4.0,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144596438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiajing Xue , Yaqing Xu , Jingmao Li , Shuangge Ma , Kuangnan Fang
{"title":"Joint modeling of mixed outcomes using a rank-based sparse neural network","authors":"Jiajing Xue , Yaqing Xu , Jingmao Li , Shuangge Ma , Kuangnan Fang","doi":"10.1016/j.jbi.2025.104870","DOIUrl":"10.1016/j.jbi.2025.104870","url":null,"abstract":"<div><h3>Objective:</h3><div>In the past few decades, high-throughput profiling has been extensively conducted, leading to significant advancements in cancer research, survival analysis, and other biomedical studies. While many methods have been developed to identify important features and construct predictive models, biomedical research often faces challenges due to insufficient information caused by high dimensionality and small sample sizes, which frequently lead to unsatisfactory identification and prediction accuracy.</div></div><div><h3>Methods:</h3><div>In this paper, we propose a rank-based sparse neural network that efficiently leverages information from mixed outcomes, particularly incorporating survival data. The proposed method accounts for unknown relationships between outcomes and high-dimensional covariates, whereas many traditional methods are built on a parametric framework. A novel loss function is derived to address the gradient imbalance issue and accommodate mixed outcomes. A sparse layer is developed to implement the penalization method, enabling the identification of important variables.</div></div><div><h3>Results:</h3><div>We conducted extensive simulation studies, showing that the proposed method is effective and broadly applicable. 
The analysis of skin cutaneous melanoma (SKCM) demonstrates the competitive performance of our proposed method.</div></div><div><h3>Conclusion:</h3><div>The proposed method effectively models mixed outcomes (including survival data) and selects important features, which is beneficial for biomedical studies like cancer and genomic research.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104870"},"PeriodicalIF":4.0,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144584025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}