An analysis of FRE @ BC8 SympTEMIST track: named entity recognition
Authors: Ander Martinez, Nuria García-Santa
DOI: 10.1093/database/baae101
Published: 2024-09-16
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11403810/pdf/
Abstract
This paper presents a more in-depth analysis of the approaches used in our submission (Martínez A, García-Santa N. (2023) FRE @ BC8 SympTEMIST track: Named Entity Recognition. Zenodo) to the 'SympTEMIST' Named Entity Recognition (NER) shared subtask at 'BioCreative 2023'. We participated in the challenge by submitting two systems based on a RoBERTa-architecture LLM, trained on Spanish-language clinical data, that is available in the 'HuggingFace' model repository. Before choosing which systems to submit, we tried different combinations of the techniques described here: Conditional Random Fields (CRF) and Byte-Pair Encoding (BPE) dropout. In the second system we also included Sub-Subword feature-based embeddings (SSW). The test set used in the challenge has now been released (López SL, Sánchez LG, Farré E et al. (2024) SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction. Zenodo), allowing us to analyze our methods in more depth and to measure the impact of introducing data from the CARMEN-I corpus (Lima-López S, Farré-Maduell E, Krallinger M. (2023) CARMEN-I: Clinical Entities Annotation Guidelines in Spanish. Zenodo). Our experiments show a moderate effect from the Sub-Subword feature-based embeddings and show the impact of including the symptom NER data from the CARMEN-I dataset. Database URL: https://physionet.org/content/carmen-i/1.0/
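
For readers unfamiliar with Byte-Pair Encoding dropout, one of the techniques named in the abstract, the following is a minimal sketch of the general idea as it is commonly described: during segmentation, each candidate merge is randomly skipped with some probability, so the same word can be split into different subword sequences across training passes. This is not the authors' implementation. The function name `bpe_segment`, the toy merge table `MERGES`, and the example word are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy merge table, ordered from first-learned to last-learned.
MERGES = [("d", "o"), ("l", "o"), ("do", "lo"), ("dolo", "r")]


def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` with BPE, skipping each candidate merge with probability `dropout`."""
    rng = rng or random.Random()
    # Lower rank means the merge was learned earlier and has higher priority.
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect the merges applicable at this step; each one is
        # independently dropped with probability `dropout`.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge (or every candidate was dropped)
        _, i = min(candidates)  # highest-priority surviving merge, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


if __name__ == "__main__":
    print(bpe_segment("dolor", MERGES))               # no dropout: deterministic
    print(bpe_segment("dolor", MERGES, dropout=0.5))  # segmentation varies per call
```

With `dropout=0.0` the function behaves like ordinary BPE segmentation, and with `dropout=1.0` it falls back to single characters. Intermediate values produce a mix of segmentations, which is the source of the regularization effect usually attributed to the technique.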