A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish
Jocelyn Dunstan, Thomas Vakili, Luis Miranda, Fabián Villena, Claudio Aracena, Tamara Quiroga, Paulina Vera, Sebastián Viteri Valenzuela, Victor Rocco
BMC Medical Informatics and Decision Making, published 2024-07-24. DOI: 10.1186/s12911-024-02609-w (https://doi.org/10.1186/s12911-024-02609-w)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11267746/pdf/
Abstract
Despite their high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models, where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER), trained on referrals from primary care physicians, to identify diseases, body parts, and medications in this work-related text. We analyzed in detail the differences between the models and the gold standard curated by a physician. Moreover, we compared the performance of the NER model on the original narratives, on narratives where personal information has been masked, and on texts where the personal data is replaced by a similar surrogate value (pseudonymization). Within this publication, we share the annotation guidelines and the annotated corpus.
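The difference between masking and pseudonymization mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `replace_spans`, the labels `NAME` and `LOCATION`, the surrogate lists, and the example sentence are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# A PII annotation as (start offset, end offset, label). Labels are hypothetical.
Span = Tuple[int, int, str]

# Hypothetical surrogate values per label; a real system would draw from
# much larger lists to produce realistic, varied replacements.
SURROGATES: Dict[str, List[str]] = {
    "NAME": ["María Soto", "Pedro Rojas"],
    "LOCATION": ["Valdivia", "Temuco"],
}


def replace_spans(text: str, spans: List[Span], mode: str) -> str:
    """Return text with every annotated span masked or swapped for a surrogate."""
    if mode not in ("mask", "pseudonymize"):
        raise ValueError(f"unknown mode: {mode}")
    pieces: List[str] = []
    cursor = 0
    used: Dict[str, int] = {}  # how many surrogates were used per label
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        if mode == "mask":
            pieces.append(f"[{label}]")  # placeholder that hides the value
        else:
            options = SURROGATES[label]
            index = used.get(label, 0)
            pieces.append(options[index % len(options)])  # same-type surrogate
            used[label] = index + 1
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)


def find_span(text: str, value: str, label: str) -> Span:
    """Locate the first occurrence of value and return it as an annotation."""
    start = text.index(value)
    return (start, start + len(value), label)


text = "Juan Pérez sufrió una caída en Santiago."
spans = [
    find_span(text, "Juan Pérez", "NAME"),
    find_span(text, "Santiago", "LOCATION"),
]

masked = replace_spans(text, spans, "mask")
pseudo = replace_spans(text, spans, "pseudonymize")
print(masked)  # [NAME] sufrió una caída en [LOCATION].
print(pseudo)  # María Soto sufrió una caída en Valdivia.
```

In this sketch the masked text loses the surface form of each entity, while the pseudonymized text keeps a plausible value of the same type. That distinction is the reason a downstream NER model might behave differently on the two variants.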
About the journal:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.