Unsupervised discovery of clinical disease signatures using probabilistic independence
Thomas A. Lasko, William W. Stead, John M. Still, Thomas Z. Li, Michael Kammer, Marco Barbero-Mota, Eric V. Strobl, Bennett A. Landman, Fabien Maldonado
Journal of Biomedical Informatics, volume 166, Article 104837. Published 2025-04-23. DOI: 10.1016/j.jbi.2025.104837
https://www.sciencedirect.com/science/article/pii/S1532046425000668
Abstract
Objective
This study uses probabilistic independence to disentangle patient-specific sources of disease and their signatures in Electronic Health Record (EHR) data.
Materials and Methods
We model a disease source as an unobserved root node in the causal graph of observed EHR variables (laboratory test results, medication exposures, billing codes, and demographics), and a signature as the set of downstream effects that a given source has on those observed variables. We used probabilistic independence to infer 2000 sources and their signatures from 9195 variables in 630,000 cross-sectional training instances sampled at random times from 269,099 longitudinal patient records. We evaluated the learned sources by using them to infer and explain the causes of benign vs. malignant pulmonary nodules in 13,252 records, comparing the inferred causes to an external reference list and other medical literature. We compared models trained by three different algorithms and used corresponding models trained directly from the observed variables as baselines.
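As a rough illustration of what "cross-sectional instances sampled at random times" from a longitudinal record could look like, the sketch below draws one snapshot from a toy record. The record layout, the function name `sample_cross_section`, and the last-observation-carried-forward rule are all assumptions made for this example; the abstract does not say how the authors built their instances.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical record layout: (time_in_days, variable_name, value).
Event = Tuple[float, str, float]


def sample_cross_section(
    events: List[Event], rng: random.Random
) -> Tuple[float, Dict[str, float]]:
    """Pick a random time within the record's span and return one snapshot."""
    times = [t for t, _, _ in events]
    t_sample = rng.uniform(min(times), max(times))

    snapshot: Dict[str, float] = {}
    # Assumption: each variable takes its most recent value at or before
    # the sampled time (last observation carried forward).
    for t, name, value in sorted(events):
        if t <= t_sample:
            snapshot[name] = value
    return t_sample, snapshot


# Toy longitudinal record with asynchronous measurements.
record = [
    (0.0, "hemoglobin", 13.5),
    (30.0, "creatinine", 1.1),
    (90.0, "hemoglobin", 11.2),
]
rng = random.Random(0)
t, snap = sample_cross_section(record, rng)
```

Variables never measured before the sampled time are simply absent from the snapshot, which mirrors the incomplete, asynchronous character of EHR data described in the abstract.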
Results
The model recovered 92% of malignant and 30% of benign causes in the reference standard. Of the top 20 inferred causes of malignancy, 14 were not listed in the reference standard, but had supporting evidence in the literature, as did 11 of the top 20 inferred causes of benign nodules. The model decomposed listed malignant causes by an average factor of 5.5 and benign causes by 4.1, with most stratifying by disease course or treatment regimen. Predictive accuracy of causal predictive models trained on source expressions (Random Forest AUC 0.788) was similar to (p = 0.058) their associational baselines (0.738).
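For readers unfamiliar with the AUC figures quoted above, the following minimal sketch computes AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The function name `auc` and the toy scores are illustrative assumptions; this is not the authors' evaluation code and it does not reproduce their numbers.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = malignant, 0 = benign (made-up scores).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
toy_auc = auc(scores, labels)  # 3 of 4 positive/negative pairs are ordered correctly
```

The quadratic pairwise loop is chosen for clarity; a rank-based formulation gives the same value more efficiently on large cohorts.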
Discussion
Most of the unrecovered causes were due to the rarity of the condition or lack of sufficient detail in the input data. Surprisingly, the causal model found many patients with apparently undiagnosed cancer as the source of the malignant nodules. Causal model AUC also suggests that some sources remained undiscovered in this cohort.
Conclusion
These promising results demonstrate the potential of using probabilistic independence to disentangle complex clinical signatures from noisy, asynchronous, and incomplete EHR data that represent the confluence of multiple simultaneous conditions, and to identify patient-specific causes that support precise treatment decisions.
About the journal:
The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.