Automating the extraction of otology symptoms from clinic letters: a methodological study using natural language processing

Nikhil Joshi, Kawsar Noor, Xi Bai, Marina Forbes, Talisa Ross, Liam Barrett, Richard J B Dobson, Anne G M Schilder, Nishchay Mehta, Watjana Lilaonitkul

BMC Medical Informatics and Decision Making, 25(1):353, published 2025-09-29. DOI: 10.1186/s12911-025-03180-8
Abstract
Background: Most healthcare data is held in an unstructured format that must be processed before it can be used for research. This processing is generally done manually, which is time-consuming and scales poorly. Natural language processing (NLP) using machine learning offers a way to automate data extraction. In this paper, we describe the development of a set of NLP models to extract and contextualise otology symptoms from free-text documents.
Methods: A dataset of 1,148 otology clinic letters, written between 2009 and 2011 at a London NHS hospital, was manually annotated and used to train a hybrid dictionary and machine-learning NLP model to identify six key otological symptoms: hearing loss, impairment of balance, otalgia, otorrhoea, tinnitus and vertigo. Subsequently, a set of bidirectional long short-term memory (Bi-LSTM) models was trained to extract contextual information for each symptom, for example the laterality of the affected ear.
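To make the hybrid approach concrete, the sketch below shows a minimal dictionary-based symptom spotter for the six target symptoms and a small Bi-LSTM classifier skeleton for contextual labels such as laterality. It is an illustrative sketch only: the synonym lists, class names, model sizes and the BiLSTMContextualiser class are assumptions for demonstration, not the authors' implementation.

```python
# Illustrative sketch only -- symptom synonyms, label sets and hyperparameters
# are assumptions, not taken from the paper.
import re
import torch
import torch.nn as nn

# --- 1. Dictionary-based symptom spotting --------------------------------
SYMPTOM_LEXICON = {
    "hearing loss": ["hearing loss", "deafness", "hard of hearing"],
    "impairment of balance": ["imbalance", "unsteadiness", "balance problem"],
    "otalgia": ["otalgia", "ear pain", "earache"],
    "otorrhoea": ["otorrhoea", "ear discharge"],
    "tinnitus": ["tinnitus", "ringing in the ear"],
    "vertigo": ["vertigo", "spinning sensation"],
}

def spot_symptoms(letter_text: str) -> dict:
    """Return symptom -> list of character spans where a synonym matches."""
    hits = {}
    for symptom, synonyms in SYMPTOM_LEXICON.items():
        pattern = "|".join(re.escape(s) for s in synonyms)
        spans = [m.span() for m in re.finditer(pattern, letter_text, re.IGNORECASE)]
        if spans:
            hits[symptom] = spans
    return hits

# --- 2. Bi-LSTM contextualiser (e.g. laterality of the affected ear) ------
class BiLSTMContextualiser(nn.Module):
    def __init__(self, vocab_size: int, n_labels: int,
                 emb_dim: int = 100, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden, n_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of integer token indices
        embedded = self.embed(token_ids)
        outputs, _ = self.lstm(embedded)      # (batch, seq_len, 2 * hidden)
        pooled = outputs.mean(dim=1)          # average over the sentence
        return self.classify(pooled)          # logits over context labels

# Example: classify laterality as left / right / bilateral / unspecified
model = BiLSTMContextualiser(vocab_size=5000, n_labels=4)
logits = model(torch.randint(1, 5000, (2, 40)))  # two dummy 40-token sentences
print(logits.shape)                              # torch.Size([2, 4])
```

In practice the spotted symptom mention and its surrounding sentence would be tokenised and fed to the contextualiser, one model per contextual attribute, mirroring the per-symptom contextualisation described above.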
Results: There were 1,197 symptom annotations and 2,861 contextual annotations; 24% of patients presented with hearing loss. The symptom extraction model achieved a macro F1 score of 0.73, and the Bi-LSTM models achieved a mean macro F1 score of 0.69 on the contextualisation tasks.
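For reference, macro F1 is the unweighted mean of the per-class F1 scores, so rarer symptoms count as much as common ones. A minimal example of how such a score could be computed with scikit-learn is shown below; the toy gold and predicted labels are invented for illustration.

```python
from sklearn.metrics import f1_score

# Toy gold and predicted symptom labels (illustrative only)
y_true = ["hearing loss", "tinnitus", "vertigo", "otalgia", "tinnitus"]
y_pred = ["hearing loss", "tinnitus", "otalgia", "otalgia", "vertigo"]

# average="macro" takes the unweighted mean of each class's F1 score
print(f1_score(y_true, y_pred, average="macro"))
```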
Conclusion: NLP models for symptom extraction and contextualisation were successfully created and shown to perform well on real-world clinic data. Further refinement is needed before the models can run without manual review. Downstream applications include deep semantic searching of electronic health records, cohort identification for clinical trials, and facilitation of research into hearing loss phenotypes. Further testing of the models' external validity is also required.
About the journal:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.