Natural language processing-driven state machines to extract social factors from unstructured clinical documentation
Katie S Allen, Dan R Hood, Jonathan Cummins, Suranga Kasturi, Eneida A Mendonca, Joshua R Vest
JAMIA Open, vol. 6, no. 2, ooad024 (July 2023). DOI: 10.1093/jamiaopen/ooad024
Cited by: 2
Abstract
Objective: This study sought to create natural language processing algorithms to extract the presence of social factors from clinical text in 3 areas: (1) housing, (2) financial, and (3) unemployment. Finalized models were validated on data from a separate health system to assess generalizability.
Materials and methods: Notes from 2 healthcare systems, representing a variety of note types, were utilized. To train models, the study utilized n-grams to identify keywords and implemented natural language processing (NLP) state machines across all note types. Manual review was conducted to determine performance. A set percentage of notes was sampled, proportional to the prevalence of each social need. Models were optimized over multiple training and evaluation cycles. Performance metrics were calculated using positive predictive value (PPV), negative predictive value, sensitivity, and specificity.
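The keyword-driven state-machine approach described above can be sketched as follows. This is a minimal illustration, not the study's actual implementation: the keyword lists and the negation handling are hypothetical, since the abstract does not publish the n-gram-derived keywords or the state-machine rules.

```python
import re

# Hypothetical keyword lists: the study derived keywords from n-grams,
# but they are not published in the abstract, so these are illustrative only.
KEYWORDS = {
    "housing": ["homeless", "eviction", "housing instability", "shelter"],
    "financial": ["financial strain", "cannot afford", "unable to pay"],
    "unemployment": ["unemployed", "laid off", "lost his job", "lost her job"],
}
NEGATION_CUES = {"no", "denies", "not", "without"}

def detect_factors(note: str) -> dict[str, bool]:
    """Flag each social factor when a keyword phrase appears outside a
    simple negation window (a toy stand-in for the paper's state machines)."""
    # normalize to lowercase word tokens joined by single spaces
    text = " ".join(re.findall(r"[a-z]+", note.lower()))
    found = {factor: False for factor in KEYWORDS}
    for factor, phrases in KEYWORDS.items():
        for phrase in phrases:
            idx = text.find(phrase)
            if idx == -1:
                continue
            # look back a few tokens for a negation cue ("denies homelessness")
            preceding = text[:idx].split()[-3:]
            if not any(tok in NEGATION_CUES for tok in preceding):
                found[factor] = True
                break
    return found
```

For example, `detect_factors("Patient is currently homeless and was laid off last month.")` would flag housing and unemployment but not financial, while a negated mention such as "denies homelessness" would be suppressed. A production rule-based system would carry richer states (negation scope, historical vs. current mention, family vs. patient), which is what makes such systems competitive despite their simplicity.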
Results: PPV for housing rose from 0.71 to 0.95 over 3 training runs. PPV for financial rose from 0.83 to 0.89 over 2 training iterations, while PPV for unemployment rose from 0.78 to 0.88 over 3 iterations. The test data resulted in PPVs of 0.94, 0.97, and 0.95 for housing, financial, and unemployment, respectively. Final specificity scores were 0.95, 0.97, and 0.95 for housing, financial, and unemployment, respectively.
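The four metrics reported above follow the standard confusion-matrix definitions. The sketch below shows the computation; the counts in the usage note are illustrative, as the abstract reports only the resulting scores, not the underlying counts.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard performance metrics from confusion-matrix counts:
    tp/fp/tn/fn = true/false positives and true/false negatives."""
    return {
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "npv": tn / (tn + fn),          # negative predictive value
        "sensitivity": tp / (tp + fn),  # recall / true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }
```

With hypothetical counts of tp=94, fp=6, tn=95, fn=5, this yields a PPV of 0.94 and specificity of about 0.94, in the range the study reports for the test data.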
Discussion: We developed 3 rule-based NLP algorithms, trained across health systems. Although rule-based methods are comparatively unsophisticated, the algorithms demonstrated a high degree of generalizability, maintaining >0.85 across all predictive performance metrics.
Conclusion: The rule-based NLP algorithms demonstrated consistent performance in identifying 3 social factors within clinical text. These methods may form part of a strategy to measure social factors within an institution.