Extraction of Social Determinants of Health From Electronic Health Records Using Natural Language Processing.

Impact factor: 2.8 (JCR Q2, Oncology)

JCO Clinical Cancer Informatics Pub Date : 2025-07-01 Epub Date: 2025-07-23 DOI:10.1200/CCI-24-00317

Zhenghua Chen, Patricia Lasserre, Angela Lin, Rasika Rajapakshe

JCO Clinical Cancer Informatics, volume 9, e2400317. Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309507/pdf/

Citations: 0

Abstract

Purpose: Social Determinants of Health (SDoH) have a significant effect on health outcomes and inequalities. SDoH can be extracted from electronic health records (EHRs) to support policy development and research aimed at improving population health. Automated extraction using artificial intelligence (AI) can improve efficiency and cost-effectiveness. This study aimed to automatically extract comprehensive SDoH details from EHRs using a natural language processing (NLP)-based AI pipeline.

Materials and methods: A curated set of 1,000 BC Cancer clinical documents with concentrated SDoH information served as the reference standard for training and evaluating NLP models. Two pipelines were used: an open-source pipeline trained on the annotated medical documents and an industrial pretrained solution used as a benchmark. Three experiments optimized the first pipeline's performance, assessing the effect of including subtype word positions during training. The superior open-source pipeline was then used to extract SDoH information from 13,258 oncology documents.

Results: The open-source pipeline achieved an average F1 score of 0.88 on the validation data set for extracting 13 SDoH factors, surpassing the benchmark by 5%. It excelled in detailed subtype extraction, while the benchmark performed better at identifying SDoH information that was rarely annotated in the BC Cancer data set. Overall, 60,717 SDoH factors and associated details were extracted from BC Cancer EHR oncology documents. The most frequently extracted SDoH factors were tobacco use, employment status, marital status, alcohol consumption, and living status, each occurring between 8,000 and 12,000 times.

Conclusion: This study demonstrates the potential of an NLP pipeline to extract SDoH factors from clinical notes, with strong performance on limited data, although data set-specific adjustments are needed for broader application across institutions.
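For readers unfamiliar with the metric reported in the Results, the following is a minimal sketch of how a per-factor F1 score and its unweighted (macro) average can be computed from span-level match counts. The abstract does not say whether its average is macro- or micro-averaged, so the macro average here is an assumption, and the `Counts` class, function names, and all numbers are hypothetical illustrations rather than data from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Span-level match counts for one SDoH factor (hypothetical values)."""
    tp: int  # predicted spans that match a reference annotation
    fp: int  # predicted spans with no matching reference annotation
    fn: int  # reference annotations the model missed


def f1_score(c: Counts) -> float:
    """F1 = 2PR / (P + R), which simplifies to 2TP / (2TP + FP + FN)."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def macro_f1(per_factor: dict[str, Counts]) -> float:
    """Unweighted mean of the per-factor F1 scores (assumed averaging scheme)."""
    return sum(f1_score(c) for c in per_factor.values()) / len(per_factor)


# Illustrative counts only; these are NOT results from the study.
example = {
    "tobacco_use": Counts(tp=90, fp=10, fn=10),
    "employment_status": Counts(tp=40, fp=10, fn=10),
    "living_status": Counts(tp=3, fp=1, fn=3),
}

for name, c in example.items():
    print(f"{name}: F1 = {f1_score(c):.3f}")
print(f"macro-average F1 = {macro_f1(example):.3f}")
```

Because a macro average weights every factor equally, a rarely annotated factor with a low score pulls the average down as much as a frequent one, which is one reason per-factor scores are worth inspecting alongside the headline figure.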


Source journal

JCO Clinical Cancer Informatics (Oncology)

CiteScore

6.20

Self-citation rate

4.80%

Publication count

190