Identification of pancreatic cancer risk factors from clinical notes using natural language processing

IF 2.8 Zone 2 (Medicine) Q2 GASTROENTEROLOGY & HEPATOLOGY

Pancreatology Pub Date : 2024-06-01 DOI:10.1016/j.pan.2024.03.016

Dhruv Sarwal, Liwei Wang, Sonal Gandhi, Elham Sagheb Hossein Pour, Laurens P. Janssens, Adriana M. Delgado, Karen A. Doering, Anup Kumar Mishra, Jason D. Greenwood, Hongfang Liu, Shounak Majumder

Pancreatology, vol. 24, no. 4, pp. 572-578. Journal Article. https://www.sciencedirect.com/science/article/pii/S1424390324000759

Citations: 0

Identification of pancreatic cancer risk factors from clinical notes using natural language processing

Objectives

Screening for pancreatic ductal adenocarcinoma (PDAC) is considered in high-risk individuals (HRIs) with established PDAC risk factors, such as family history and germline mutations in PDAC susceptibility genes. Accurate assessment of risk factor status is provider knowledge-dependent and requires extensive manual chart review by experts. Natural Language Processing (NLP) has shown promise in automated data extraction from the electronic health record (EHR). We aimed to use NLP for automated extraction of PDAC risk factors from unstructured clinical notes in the EHR.

Methods

We first developed rule-based NLP algorithms to extract PDAC risk factors at the document level, using an annotated corpus of 2091 clinical notes. Next, we further improved the NLP algorithms using a cohort of 1138 patients through patient-level training, validation, and testing, with comparison against a pre-specified reference standard. To minimize false-negative results, we prioritized algorithm recall.
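The abstract does not list the actual rules, so the following is only a minimal sketch of what a document-level rule and a recall-oriented patient-level aggregation could look like. The regular expressions, the relative list, and the function names `note_mentions_family_history` and `patient_level_flags` are hypothetical and are not taken from the study.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; not the rules used in the study.
RELATIVE = r"(?:mother|father|sister|brother|aunt|uncle|grandmother|grandfather|cousin)"
FAMILY_HISTORY = re.compile(
    rf"\b{RELATIVE}\b[^.]*\bpancreatic (?:cancer|adenocarcinoma)\b",
    re.IGNORECASE,
)
NEGATION = re.compile(
    r"\b(?:no|denies|negative for)\b[^.]*\bpancreatic",
    re.IGNORECASE,
)


def note_mentions_family_history(note: str) -> bool:
    """Document-level rule: some sentence links a relative to pancreatic cancer."""
    for sentence in note.split("."):
        if FAMILY_HISTORY.search(sentence) and not NEGATION.search(sentence):
            return True
    return False


def patient_level_flags(notes):
    """Patient-level rule: positive if ANY note is positive, which favors recall."""
    flags = defaultdict(bool)
    for patient_id, note in notes:
        flags[patient_id] = flags[patient_id] or note_mentions_family_history(note)
    return dict(flags)


notes = [
    ("p1", "Her mother died of pancreatic cancer at 62."),
    ("p1", "Follow-up visit. No new complaints."),
    ("p2", "No family history of pancreatic cancer. Father had colon cancer."),
]
flags = patient_level_flags(notes)
print(flags)
```

Flagging a patient when any single note matches is one simple way to trade precision for recall, in line with the stated priority of minimizing false negatives.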

Results

In the test set (n = 807), the NLP algorithms achieved a recall of 0.933, precision of 0.790, and F₁-score of 0.856 for family history of PDAC. For germline genetic mutations, the algorithm had a high recall of 0.851, while precision and F₁-score were lower, at 0.350 and 0.496, respectively. Most false positives for germline mutations resulted from erroneous recognition of tissue mutations.
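As a consistency check on the reported figures, the F₁-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall values quoted above; the helper name `f1_score` is our own.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
family_history_f1 = f1_score(precision=0.790, recall=0.933)
germline_f1 = f1_score(precision=0.350, recall=0.851)

print(round(family_history_f1, 3))  # 0.856
print(round(germline_f1, 3))        # 0.496
```

Both values match the abstract. A precision of 0.350 means that roughly two of every three germline-mutation flags were false positives, which is consistent with the reported confusion between tissue and germline mutations.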

Conclusions

Rule-based NLP algorithms applied to unstructured clinical notes are highly sensitive for automated identification of PDAC risk factors. Further validation in a large primary-care patient population is warranted to assess real-world utility in identifying HRIs for pancreatic cancer screening.

Source journal

Pancreatology (Medicine: Gastroenterology & Hepatology)

CiteScore

7.20

Self-citation rate

5.60%

Articles published

194

Review time

44 days

About the journal: Pancreatology is the official journal of the International Association of Pancreatology (IAP), the European Pancreatic Club (EPC) and several national societies and study groups around the world. Dedicated to the understanding and treatment of exocrine as well as endocrine pancreatic disease, this multidisciplinary periodical publishes original basic, translational and clinical pancreatic research from a range of fields including gastroenterology, oncology, surgery, pharmacology, cellular and molecular biology as well as endocrinology, immunology and epidemiology. Readers can expect to gain new insights into pancreatic physiology and into the pathogenesis, diagnosis, therapeutic approaches and prognosis of pancreatic diseases. The journal features original articles, case reports, consensus guidelines and topical, cutting-edge reviews, thus representing a source of valuable, novel information for clinical and basic researchers alike.