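Develop a Natural Language Processing Pipeline to Automate Extraction of Periodontal Disease Information from Electronic Dental Clinical Notes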

Proceedings of the 6th International Conference on Medical and Health Informatics. Pub Date: 2022-05-13. DOI: 10.1145/3545729.3545744

J. Patel, R. Rao, R. Brandon, Vishnu Iyer, J. Albandar, M. Tellez, J. Krois, Huanmei Wu

Citations: 1

Abstract

Introduction: Periodontal disease (PD) is one of the most prevalent dental diseases, affecting 80% of US adults. PD can be prevented if its etiologic and risk factors are identified and controlled early. Electronic dental record (EDR) data give researchers a unique opportunity to develop prediction models that provide personalized disease risk estimates and treatment recommendations. However, 90% of important clinical information is documented only in the free-text fields of the EDR. The objective of this study was to develop natural language processing (NLP) applications that extract PD diagnoses, medical histories (e.g., cardiovascular diseases (CVD), diabetes), and social history (e.g., smoking) in a structured format for comprehensive follow-up periodontal research.

Methods: We developed a five-stage NLP pipeline. In Step 1, we retrieved both structured and unstructured data from the EDR using SQL queries. In Step 2, we developed manual annotation guidelines using both bottom-up and top-down approaches. In Step 3, experts performed the manual annotations, which produced the gold standard data. Part of the gold standard data was used for named entity recognition (NER) development (Step 4) and for additional validation (Step 5).

Results: The SQL queries yielded a cohort of 27,138 unique patients, each with multiple clinical notes. The manual annotation guidelines were based on a review of 347 clinical notes from 100 unique patients, which identified the writing patterns in our EDR system, and on the existing literature. For Step 3, two domain experts manually reviewed 4,000 clinical notes using the eHOST annotation tool. The evaluation showed 94.5% accuracy in extracting patients’ detailed PD diagnoses, CVD, smoking, and diabetes information from the EDR. Statistics from the extracted clinical notes improved our knowledge of periodontitis classification and of systemic diseases such as CVD and diabetes.

Conclusion: Our NLP pipeline increased the utilization of EDR data for periodontal research by drawing on free-text dental clinical notes. Hence, developing novel informatics methods such as NLP is critical for using EDR data optimally and efficiently for research.
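
The abstract does not say how the extraction step (Step 4) was implemented or how the accuracy figure was computed. As a rough illustration only, the idea of turning free-text notes into structured fields and scoring them against expert labels might be sketched in Python as below. Everything in this sketch is an assumption rather than the authors' method: the keyword patterns, the 30-character negation window, the function names `extract` and `accuracy`, and the two example notes, which are invented and not taken from any EDR.

```python
import re
from typing import Dict, List

# Hypothetical keyword patterns, one per target field.
# The paper's annotation guidelines are not reproduced here.
PATTERNS = {
    "pd_diagnosis": re.compile(r"\b(?:periodontitis|gingivitis)\b", re.I),
    "smoking": re.compile(r"\b(?:smok(?:er|es|ing)|tobacco)\b", re.I),
    "diabetes": re.compile(r"\b(?:diabetes|diabetic)\b", re.I),
    "cvd": re.compile(r"\b(?:hypertension|heart disease|cardiovascular)\b", re.I),
}

# Assumed negation rule: a cue word shortly before the term,
# with no sentence boundary ('.' or ';') in between.
NEGATION = re.compile(r"\b(?:no|non|denies|never|without)\b[^.;]{0,25}$", re.I)


def extract(note: str) -> Dict[str, bool]:
    """Return one boolean per field: True if a non-negated mention is found."""
    result = {}
    for label, pattern in PATTERNS.items():
        result[label] = any(
            # Inspect up to 30 characters preceding the matched term.
            not NEGATION.search(note[max(0, m.start() - 30):m.start()])
            for m in pattern.finditer(note)
        )
    return result


def accuracy(predicted: List[Dict[str, bool]], gold: List[Dict[str, bool]]) -> float:
    """Fraction of (note, field) pairs where the prediction equals the gold label."""
    pairs = [(p[k], g[k]) for p, g in zip(predicted, gold) for k in g]
    return sum(p == g for p, g in pairs) / len(pairs)


# Invented example notes and hand-written gold labels, for illustration only.
notes = [
    "Pt presents with generalized chronic periodontitis. Current smoker. Denies diabetes.",
    "Localized gingivitis noted. Non-smoker. Hx of hypertension and type 2 diabetes.",
]
gold = [
    {"pd_diagnosis": True, "smoking": True, "diabetes": False, "cvd": False},
    {"pd_diagnosis": True, "smoking": False, "diabetes": True, "cvd": True},
]

predicted = [extract(note) for note in notes]
print(predicted)
print("accuracy:", accuracy(predicted, gold))
```

A keyword approach like this is easy to audit against annotation guidelines but misses spelling variants and abbreviations; a trained NER model would be the usual alternative when enough gold standard data is available.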
