Cross-institutional dental electronic health record entity extraction via generative artificial intelligence and synthetic notes.

Impact factor: 3.4 | JCR Q2 (Health Care Sciences & Services)
JAMIA Open | Publication date: 2025-06-28 | eCollection date: 2025-06-01 | DOI: 10.1093/jamiaopen/ooaf061
Yao-Shun Chuang, Chun-Teh Lee, Guo-Hao Lin, Ryan Brandon, Xiaoqian Jiang, Muhammad F Walji, Oluwabunmi Tokede
JAMIA Open, volume 8, issue 3, article ooaf061. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12205731/pdf/
Citations: 0

Abstract

Background: While most health-care providers now use electronic health records (EHRs) to document clinical care, many still treat them as digital versions of paper records. As a result, documentation often remains unstructured, with free-text entries in progress notes. This limits the potential for secondary use and analysis, as machine-learning and data analysis algorithms are more effective with structured data.

Objective: This study aims to use advanced artificial intelligence (AI) and natural language processing (NLP) techniques to improve diagnostic information extraction from clinical notes in a periodontal use case. By automating this process, the study seeks to reduce missing data in dental records and minimize the need for extensive manual annotation, a long-standing barrier to widespread NLP deployment in dental data extraction.

Materials and methods: This research uses large language models (LLMs), specifically Generative Pretrained Transformer 4 (GPT-4), to generate synthetic medical notes for fine-tuning a RoBERTa model. This model was trained to better interpret and process dental language, with particular attention to periodontal diagnoses. Model performance was evaluated by manually reviewing 360 clinical notes randomly selected from each participating site's dataset.
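To make the fine-tuning step concrete, the sketch below shows one common way a synthetic note could be converted into token-level training labels for an entity extraction model. It is an illustration only: the abstract does not describe the labeling scheme, so the BIO tagging format, the example note, the label names (EXTENT, STAGE, GRADE), and the helper functions `tokenize` and `bio_tags` are all assumptions, not the authors' pipeline.

```python
import re

# Hypothetical synthetic note and entity annotations (not from the study's data).
NOTE = "Diagnosis: generalized periodontitis, Stage III, Grade B."
ENTITIES = [
    ("generalized", "EXTENT"),
    ("Stage III", "STAGE"),
    ("Grade B", "GRADE"),
]


def tokenize(text):
    """Split text into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def bio_tags(text, entities):
    """Assign B-/I-/O tags to tokens from (surface string, label) pairs."""
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for surface, label in entities:
        start = text.find(surface)
        if start < 0:
            continue  # annotated string does not occur in the note
        end = start + len(surface)
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the entity if it lies fully inside its span.
            if tok_start >= start and tok_end <= end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


LABELLED = bio_tags(NOTE, ENTITIES)
for token, tag in LABELLED:
    print(f"{token}\t{tag}")
```

Token and tag pairs of this kind are the usual input for fine-tuning an encoder such as RoBERTa on token classification; whether the study framed the task this way is not stated in the abstract.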

Results: Periodontal diagnosis extraction was highly accurate, with sites 1 and 2 achieving weighted average scores of 0.97-0.98. This performance held across all three dimensions of the periodontal diagnosis: stage, grade, and extent.
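As an illustration of how a weighted average score can be computed, the sketch below averages per-class F1 scores weighted by each class's support (its number of true instances). The abstract does not state which underlying metric was averaged, so the choice of F1 is an assumption, and the labels and predictions are invented for the example.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 score for a single class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, label) * count
               for label, count in support.items()) / total


# Invented reference labels and model outputs for four notes.
reference = ["Stage III", "Stage III", "Stage II", "Stage IV"]
predicted = ["Stage III", "Stage II", "Stage II", "Stage IV"]
print(round(weighted_f1(reference, predicted), 2))  # 0.75
```

Weighting by support keeps frequent classes from being under-represented in the summary score, which matters when some stages or grades are much rarer than others.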

Discussion: Synthetic data effectively reduced manual annotation needs while preserving model quality. Generalizability across institutions suggests viability for broader adoption, though future work is needed to improve contextual understanding.

Conclusion: The study highlights the potentially transformative impact of AI and NLP on health-care research. A large share of clinical documentation (40%-80%) is free text. Scaling our method could enhance clinical data reuse.

Source journal: JAMIA Open (Medicine, Health Informatics)
CiteScore: 4.10 | Self-citation rate: 4.80% | Articles published: 102 | Review time: 16 weeks