Automatic Extraction of Social Determinants of Health from Medical Notes of Chronic Lower Back Pain Patients

D. Lituiev, Benjamin Lacar, Sang S. Pak, Peter L Abramowitsch, E. D. Marchis, Thomas A. Peterson
Journal: Journal of the American Medical Informatics Association : JAMIA
Publication date: 2022-03-08
DOI: 10.1101/2022.03.04.22271541
URL: https://doi.org/10.1101/2022.03.04.22271541
Citations: 7

Abstract

Background. Adverse social determinants of health (SDoH), or social risk factors, such as food insecurity and housing instability, are known to contribute to poor health outcomes and inequities. Our ability to study these linkages is limited because SDoH information is more frequently documented in free-text clinical notes than in structured data fields. To overcome this challenge, there is a growing push to develop techniques for automated extraction of SDoH. In this study, we explored natural language processing (NLP) and natural language inference (NLI) methods to extract SDoH information from clinical notes of patients with chronic low back pain (cLBP), to enhance future analyses of the associations between SDoH and low back pain outcomes and disparities.

Methods. Clinical notes (n=1,576) for patients with cLBP (n=386) were annotated for seven SDoH domains: housing, food, transportation, finances, insurance coverage, marital and partnership status, and other social support, resulting in 626 notes with at least one annotated entity for 364 patients. We additionally labelled pain scores, depression, and anxiety. We used a two-tier taxonomy with these 10 first-level ontological classes and 68 second-level ontological classes. We developed and validated extraction systems based on both rule-based and machine learning approaches. As a rule-based approach, we iteratively configured the clinical Text Analysis and Knowledge Extraction System (cTAKES). We trained two machine learning models (one based on a convolutional neural network (CNN) and one on the RoBERTa transformer), and a hybrid system combining pattern matching and bag-of-words models. Additionally, we evaluated a RoBERTa-based entailment model as an alternative technique for SDoH detection in clinical texts. For this, we used a model previously trained on general-domain data, without additional training on our dataset.

Results. Four annotators achieved high agreement (average kappa=95%, F1=91.20%). Annotation frequency varied significantly depending on note type. By tuning cTAKES, we achieved a performance of F1=47.11% for first-level classes. For most classes, the RoBERTa-based NER model performed better (first-level F1=84.35%) than the other models on the internal test dataset. The hybrid system on average performed slightly worse than the RoBERTa NER model (first-level F1=80.27%), while matching or outperforming it in terms of recall. Using an out-of-the-box entailment model, we detected many but not all of the challenging wordings missed by other models, reaching an average F1 of 76.04%, while matching or outperforming the tested NER models in several classes. Still, the entailment model may be sensitive to hypothesis wording and may require further fine-tuning.

Conclusion. This study developed a corpus of annotated clinical notes covering a broad spectrum of SDoH classes. This corpus provides a basis for training machine learning models and serves as a benchmark for predictive models for named entity recognition for SDoH and knowledge extraction from clinical texts.
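
The F1 figures reported in the Results can be read against a concrete computation. The following is a minimal sketch, assuming exact span-and-label matching and macro averaging over first-level classes. The abstract does not state the matching criterion or the averaging scheme, so both are assumptions; the `Entity` type, the function names, and the toy annotations are hypothetical illustrations, not the authors' evaluation code.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Entity(NamedTuple):
    """One annotated span (hypothetical layout, not the study's format)."""
    note_id: str
    start: int
    end: int
    label: str  # first-level class, e.g. "housing"


def per_class_prf1(
    gold: Iterable[Entity], pred: Iterable[Entity]
) -> Dict[str, Tuple[float, float, float]]:
    """Precision, recall, and F1 per label under exact span+label match."""
    gold_set, pred_set = set(gold), set(pred)
    tp = Counter(e.label for e in gold_set & pred_set)  # found and correct
    fp = Counter(e.label for e in pred_set - gold_set)  # predicted, not in gold
    fn = Counter(e.label for e in gold_set - pred_set)  # in gold, missed
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        p = tp[label] / n_pred if n_pred else 0.0
        r = tp[label] / n_gold if n_gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f)
    return scores


def macro_f1(scores: Dict[str, Tuple[float, float, float]]) -> float:
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    if not scores:
        return 0.0
    return sum(f for _, _, f in scores.values()) / len(scores)


# Toy annotations, invented for illustration only.
gold = [
    Entity("note1", 10, 18, "housing"),
    Entity("note1", 40, 52, "housing"),
    Entity("note2", 5, 20, "food"),
]
pred = [
    Entity("note1", 10, 18, "housing"),  # exact match
    Entity("note1", 41, 52, "housing"),  # boundary off by one: counts as a miss
    Entity("note2", 5, 20, "food"),      # exact match
]

scores = per_class_prf1(gold, pred)
for label, (p, r, f) in scores.items():
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f:.2f}")
print(f"macro F1={macro_f1(scores):.2f}")
```

Under exact matching, a one-character boundary error costs both a false positive and a false negative, which is one reason a relaxed (overlap-based) criterion would yield higher scores on the same predictions.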