An analysis of FRE @ BC8 SympTEMIST track: named entity recognition.

IF 3.4 4区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation Pub Date : 2024-09-16 DOI:10.1093/database/baae101

Ander Martinez, Nuria García-Santa

{"title":"An analysis of FRE @ BC8 SympTEMIST track: named entity recognition.","authors":"Ander Martinez, Nuria García-Santa","doi":"10.1093/database/baae101","DOIUrl":null,"url":null,"abstract":"<p><p>This paper is a more in-depth analysis of the approaches used in our submission (Martínez A, García-Santa N. (2023) FRE @ BC8 SympTEMIST track: Named Entity Recognition Zenodo.) to the 'SympTEMIST' Named Entity Recognition (NER) shared subtask at 'BioCreative 2023'. We participated on the challenge submitting two systems based on a RoBERTa architecture LLM trained on Spanish-language clinical data available at 'HuggingFace' model repository. Before choosing the systems that would be submitted, we tried different combinations of the techniques described here: Conditional Random Fields and Byte-Pair Encoding dropout. In the second system we also included Sub-Subword feature based embeddings (SSW). The test set used in the challenge has now been released (López SL, Sánchez LG, Farré E et al. (2024) SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction. Zenodo), allowing us to analyze more in depth our methods, as well as measuring the impact of introducing data from CARMEN-I (Lima-López S, Farré-Maduell E, Krallinger M. (2023) CARMEN-I: Clinical Entities Annotation Guidelines in Spanish. Zenodo) corpus. Our experiments show the moderate effect of using the Sub-Subword feature based embeddings and the impact of including the symptom NER data from the CARMEN-I dataset. Database URL: https://physionet.org/content/carmen-i/1.0/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11403810/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Database: The Journal of Biological Databases and Curation","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/database/baae101","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

Abstract

This paper is a more in-depth analysis of the approaches used in our submission (Martínez A, García-Santa N. (2023) FRE @ BC8 SympTEMIST track: Named Entity Recognition Zenodo.) to the 'SympTEMIST' Named Entity Recognition (NER) shared subtask at 'BioCreative 2023'. We participated on the challenge submitting two systems based on a RoBERTa architecture LLM trained on Spanish-language clinical data available at 'HuggingFace' model repository. Before choosing the systems that would be submitted, we tried different combinations of the techniques described here: Conditional Random Fields and Byte-Pair Encoding dropout. In the second system we also included Sub-Subword feature based embeddings (SSW). The test set used in the challenge has now been released (López SL, Sánchez LG, Farré E et al. (2024) SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction. Zenodo), allowing us to analyze more in depth our methods, as well as measuring the impact of introducing data from CARMEN-I (Lima-López S, Farré-Maduell E, Krallinger M. (2023) CARMEN-I: Clinical Entities Annotation Guidelines in Spanish. Zenodo) corpus. Our experiments show the moderate effect of using the Sub-Subword feature based embeddings and the impact of including the symptom NER data from the CARMEN-I dataset. Database URL: https://physionet.org/content/carmen-i/1.0/.

查看原文本刊更多论文

FRE @ BC8 SympTEMIST 赛道分析：命名实体识别。

本文是对我们提交的论文（Martínez A, García-Santa N. (2023) FRE @ BC8 SympTEMIST track：命名实体识别 Zenodo.）提交给 "BioCreative 2023 "的 "SympTEMIST "命名实体识别（NER）共享子任务。我们参加了这项挑战，提交了两个基于 RoBERTa 架构 LLM 的系统，该 LLM 在 "HuggingFace "模型库中的西班牙语临床数据上进行了训练。在选择提交的系统之前，我们尝试了本文所述技术的不同组合：条件随机场和字节对编码剔除。在第二个系统中，我们还加入了基于子分词特征的嵌入（SSW）。挑战赛中使用的测试集现已发布（López SL, Sánchez LG, Farré E et al. (2024) SympTEMIST Corpus：用于临床症状、体征和检查结果信息提取的黄金标准注释。Zenodo），让我们能够更深入地分析我们的方法，并衡量引入 CARMEN-I （Lima-López S, Farré-Maduell E, Krallinger M. (2023) CARMEN-I: 西班牙语临床实体注释指南。Zenodo）语料库。我们的实验表明，使用基于 Sub-Subword 特征的嵌入效果适中，而纳入 CARMEN-I 数据集的症状 NER 数据则会产生影响。数据库网址：https://physionet.org/content/carmen-i/1.0/.

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Database: The Journal of Biological Databases and Curation MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

9.00

自引率

3.40%

发文量

100

审稿时长

>12 weeks

期刊介绍： Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.