Richard Wyss, Jie Yang, Sebastian Schneeweiss, Joseph M. Plasek, Li Zhou, Thomas Deramus, Janick G. Weberpals, Kerry Ngan, Theodore N. Tsacogianis, Kueiyu Joshua Lin
Journal of Biomedical Informatics, volume 169, Article 104882. Published 2025-07-19. DOI: 10.1016/j.jbi.2025.104882. https://www.sciencedirect.com/science/article/pii/S153204642500111X
Natural language processing for scalable feature engineering and ultra-high-dimensional confounding adjustment in healthcare database studies
Background
To improve confounding control in healthcare database studies, data-driven algorithms may empirically identify and adjust for large numbers of pre-exposure variables that indirectly capture information on unmeasured confounding factors (‘proxy’ confounders). Current approaches for high-dimensional proxy adjustment do not leverage free-text notes from electronic health records (EHRs). Unsupervised natural language processing (NLP) technology can scale to generate large numbers of structured features from unstructured notes.
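As a rough illustration of how free-text notes can be turned into structured features without labels, the sketch below builds binary term indicators from a handful of notes. The abstract does not say which NLP methods were used, so the bag-of-words approach, the `min_patients` cutoff, and the example notes are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(note):
    """Lowercase a note and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", note.lower())


def build_vocabulary(notes, min_patients=2):
    """Keep terms that appear in the notes of at least `min_patients` patients."""
    doc_freq = Counter()
    for note in notes:
        doc_freq.update(set(tokenize(note)))
    return sorted(term for term, n in doc_freq.items() if n >= min_patients)


def featurize(notes, vocabulary):
    """Return one row of 0/1 indicators per note, one column per vocabulary term."""
    rows = []
    for note in notes:
        present = set(tokenize(note))
        rows.append([1 if term in present else 0 for term in vocabulary])
    return rows


# Hypothetical notes, invented for illustration only (one note per patient).
notes = [
    "Patient reports chest pain and shortness of breath.",
    "No chest pain. Patient is frail and uses a walker.",
    "Frail patient, recent fall, uses a walker at home.",
]
vocabulary = build_vocabulary(notes, min_patients=2)
features = featurize(notes, vocabulary)
```

Each column of `features` is a candidate proxy covariate. Terms such as "frail" or "walker" hint at information that claims codes may not record, which is the kind of signal proxy adjustment tries to pick up.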
Objective
To assess the impact of supplementing claims data analyses with large numbers of NLP-generated features for high-dimensional proxy adjustment.
Methods
We linked Medicare claims with EHR data to generate three cohorts comparing different classes of medications on the 6-month risk of cardiovascular outcomes. We used various NLP methods to generate structured features from free-text EHR notes and used least absolute shrinkage and selection operator (LASSO) regression to fit several propensity score (PS) models that included different covariate sets as candidate predictors. Covariate sets included features generated from claims data only, and claims data plus NLP-generated EHR features.
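The following is a minimal sketch of a LASSO propensity score model, not the authors' implementation. It fits an L1-penalized logistic regression of treatment on candidate covariates by proximal gradient descent. The simulated data, the penalty value, the step size, and all function names are assumptions; in practice the penalty would typically be tuned, for example by cross-validation, and a dedicated library would be used.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, treatment, penalty, step=0.1, n_iter=500):
    """Minimize mean log loss + penalty * sum(|coef|); the intercept is unpenalized."""
    n, p = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, t in zip(X, treatment):
            resid = sigmoid(intercept + sum(c * x for c, x in zip(coefs, row))) - t
            grad0 += resid / n
            for j, x in enumerate(row):
                grad[j] += resid * x / n
        intercept -= step * grad0
        # Gradient step followed by soft-thresholding sets weak predictors to exactly zero.
        coefs = [soft_threshold(c - step * g, step * penalty) for c, g in zip(coefs, grad)]
    return intercept, coefs


def propensity_scores(X, intercept, coefs):
    """Predicted probability of treatment for each patient."""
    return [sigmoid(intercept + sum(c * x for c, x in zip(coefs, row))) for row in X]


# Simulated data (assumption): six binary candidate covariates,
# of which only the first two actually drive treatment assignment.
rng = random.Random(0)
n_patients, n_features = 200, 6
X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_patients)]
treatment = [
    1 if rng.random() < sigmoid(-1.5 + 2.0 * row[0] + 1.0 * row[1]) else 0
    for row in X
]

intercept, coefs = fit_lasso_logistic(X, treatment, penalty=0.02)
ps = propensity_scores(X, intercept, coefs)
selected = [j for j, c in enumerate(coefs) if c != 0.0]  # covariates kept by the LASSO
```

Comparing covariate sets, as described above, would amount to refitting the model with different columns in `X` (claims-only features versus claims plus NLP-generated features) and comparing the resulting propensity scores.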
Results
Including both claims codes and NLP-generated EHR features as candidate predictors improved overall covariate balance, with standardized differences < 0.1 for all variables. The impact on estimated treatment effects was more nuanced: adjustment for NLP-generated features moved effect estimates further in the expected direction in two of the empirical studies but had no impact in the third.
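For readers unfamiliar with the balance metric, the sketch below computes a standardized difference for a single covariate, before and after weighting. The abstract does not state how the propensity scores were applied, so inverse probability of treatment weighting, the toy data, and the propensity score values are assumptions made purely for illustration.

```python
import math


def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_variance(values, weights):
    mean = weighted_mean(values, weights)
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / sum(weights)


def standardized_difference(covariate, treatment, weights=None):
    """Difference in (weighted) means between arms, divided by the pooled SD."""
    if weights is None:
        weights = [1.0] * len(covariate)
    stats = {}
    for arm in (1, 0):
        values = [x for x, t in zip(covariate, treatment) if t == arm]
        arm_weights = [w for w, t in zip(weights, treatment) if t == arm]
        stats[arm] = (weighted_mean(values, arm_weights),
                      weighted_variance(values, arm_weights))
    pooled_sd = math.sqrt((stats[1][1] + stats[0][1]) / 2)
    if pooled_sd == 0:
        return 0.0
    return (stats[1][0] - stats[0][0]) / pooled_sd


# Toy data (assumption): one binary covariate, four treated and four untreated patients.
covariate = [1, 1, 1, 0, 1, 0, 0, 0]
treatment = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical propensity scores that depend only on the covariate.
ps = [0.75 if x == 1 else 0.25 for x in covariate]
weights = [1 / p if t == 1 else 1 / (1 - p) for p, t in zip(ps, treatment)]

smd_before = standardized_difference(covariate, treatment)
smd_after = standardized_difference(covariate, treatment, weights)
balanced = abs(smd_after) < 0.1  # the 0.1 threshold referred to in the Results
```

In this toy example the covariate is strongly imbalanced before weighting and balanced afterwards. A full balance assessment would repeat the calculation for every covariate, including the NLP-generated ones.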
Conclusion
Supplementing administrative claims with large numbers of NLP-generated features for ultra-high-dimensional proxy confounder adjustment improved overall covariate balance and may provide a modest benefit in terms of capturing confounder information.
About the journal:
The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.