J. Patel, R. Rao, R. Brandon, Vishnu Iyer, J. Albandar, M. Tellez, J. Krois, Huanmei Wu
Proceedings of the 6th International Conference on Medical and Health Informatics. Published 2022-05-13. DOI: 10.1145/3545729.3545744 (https://doi.org/10.1145/3545729.3545744)
Develop a Natural Language Processing Pipeline to Automate Extraction of Periodontal Disease Information from Electronic Dental Clinical Notes
Introduction: Periodontal disease (PD) is one of the most prevalent dental diseases, affecting 80% of US adults. PD can be prevented if its etiologic and risk factors are identified and controlled early. Electronic dental record (EDR) data offer researchers a unique opportunity to develop prediction models that provide personalized disease risk and treatment recommendations. However, 90% of important clinical information is documented only in the free-text format of the EDR. The objective of this study was to develop natural language processing (NLP) applications to extract PD diagnoses, medical histories (e.g., cardiovascular diseases (CVD), diabetes), and social history (e.g., smoking) in a structured format for comprehensive follow-up periodontal research.
Methods: We developed a five-stage NLP pipeline. First, we retrieved both structured and unstructured data from the EDR using SQL queries. Next, we developed manual annotation guidelines using both bottom-up and top-down approaches. In Step 3, experts performed the manual annotations, which produced the gold standard data. Part of the gold standard data was used for named entity recognition (NER) development (Step 4) and for additional validation (Step 5).
Results: The SQL queries resulted in a cohort of 27,138 unique patients, each with multiple clinical notes. The manual annotation guidelines were based on 100 unique patients with 347 clinical notes, which we used to identify the writing patterns in our EDR system. Existing literature was also studied to develop the guidelines. For Step 3, two domain experts manually reviewed 4,000 clinical notes using the eHOST annotation tool. The evaluation showed 94.5% accuracy in extracting patients’ detailed PD diagnoses, CVD, smoking, and diabetes information from the EDR. Statistics of the extracted clinical notes improved our knowledge of periodontitis classification and other systemic diseases such as CVD and diabetes.
Conclusion: Our NLP pipeline increased the utilization of free-text dental clinical notes in the EDR for periodontal research. Hence, developing novel informatics methods such as NLP is critical for using EDR data optimally and efficiently for research.
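Illustrative sketch (not the authors' implementation): the abstract does not say which NER technique was used in Step 4 or how the accuracy in Step 5 was computed. The Python below swaps in a simple keyword matcher with a crude negation check, and scores it against hand-labeled notes using label-level accuracy. The term lists, the negation cues, the example notes, and the names PATTERNS, NEGATION, extract, and accuracy are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Hypothetical term lists; the paper's annotation guidelines are not reproduced here.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "periodontal_disease": re.compile(r"\b(periodontitis|gingivitis)\b", re.I),
    "diabetes": re.compile(r"\b(diabetes|diabetic)\b", re.I),
    "cvd": re.compile(r"\b(hypertension|heart disease|cardiovascular disease)\b", re.I),
    "smoking": re.compile(r"\b(smoker|smokes|smoking|tobacco)\b", re.I),
}

# Crude negation check (an assumption): a cue word at most three words
# before the matched term, within the same sentence.
NEGATION = re.compile(r"\b(no|denies|denied|without|never)\b(?:\W+\w+){0,3}\W*$", re.I)


def extract(note: str) -> Dict[str, bool]:
    """Return one present/absent flag per label for a free-text note."""
    found = {label: False for label in PATTERNS}
    # Split on sentence-like boundaries so negation cues stay local.
    for sentence in re.split(r"[.;\n]+", note):
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(sentence):
                # Count a mention only if no negation cue precedes it.
                if not NEGATION.search(sentence[: match.start()]):
                    found[label] = True
    return found


def accuracy(predicted: List[Dict[str, bool]], gold: List[Dict[str, bool]]) -> float:
    """Fraction of (note, label) pairs where the prediction matches the gold label."""
    total = 0
    correct = 0
    for pred, ref in zip(predicted, gold):
        for label in PATTERNS:
            total += 1
            correct += int(pred[label] == ref[label])
    return correct / total


# Made-up example notes and hand-assigned gold labels.
notes = [
    "Pt presents with generalized chronic periodontitis. Denies smoking. Hx of diabetes.",
    "No history of diabetes; current smoker with hypertension.",
]
gold = [
    {"periodontal_disease": True, "diabetes": True, "cvd": False, "smoking": False},
    {"periodontal_disease": False, "diabetes": False, "cvd": True, "smoking": True},
]

predictions = [extract(note) for note in notes]
print(accuracy(predictions, gold))
```

A keyword matcher like this will miss misspellings, abbreviations, and negation cues that sit far from the term, which is one reason a manually annotated gold standard is needed to measure how well any extractor performs on real notes.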