We are not ready yet: limitations of state-of-the-art disease named entity recognizers.

IF 1.6 · CAS Tier 3 (Engineering & Technology) · JCR Q3 · Mathematical & Computational Biology
Lisa Kühnel, Juliane Fluck
{"title":"我们还没有准备好:最先进的疾病命名实体识别器的局限性。","authors":"Lisa Kühnel,&nbsp;Juliane Fluck","doi":"10.1186/s13326-022-00280-6","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models are used in a variety of biomedical and clinical applications. For the available data sets, these models show excellent results - partly exceeding the inter-annotator agreements. However, biomedical named entity recognition applied on COVID-19 preprints shows a performance drop compared to the results on test data. The question arises how well trained models are able to predict on completely new data, i.e. to generalize.</p><p><strong>Results: </strong>Based on the example of disease named entity recognition, we investigate the robustness of different machine learning-based methods - thereof transfer learning - and show that current state-of-the-art methods work well for a given training and the corresponding test set but experience a significant lack of generalization when applying to new data.</p><p><strong>Conclusions: </strong>We argue that there is a need for larger annotated data sets for training and testing. Therefore, we foresee the curation of further data sets and, moreover, the investigation of continual learning processes for machine learning-based models.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":" ","pages":"26"},"PeriodicalIF":1.6000,"publicationDate":"2022-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9612606/pdf/","citationCount":"5","resultStr":"{\"title\":\"We are not ready yet: limitations of state-of-the-art disease named entity recognizers.\",\"authors\":\"Lisa Kühnel,&nbsp;Juliane Fluck\",\"doi\":\"10.1186/s13326-022-00280-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models are used in a variety of biomedical and clinical applications. For the available data sets, these models show excellent results - partly exceeding the inter-annotator agreements. However, biomedical named entity recognition applied on COVID-19 preprints shows a performance drop compared to the results on test data. The question arises how well trained models are able to predict on completely new data, i.e. to generalize.</p><p><strong>Results: </strong>Based on the example of disease named entity recognition, we investigate the robustness of different machine learning-based methods - thereof transfer learning - and show that current state-of-the-art methods work well for a given training and the corresponding test set but experience a significant lack of generalization when applying to new data.</p><p><strong>Conclusions: </strong>We argue that there is a need for larger annotated data sets for training and testing. 
Therefore, we foresee the curation of further data sets and, moreover, the investigation of continual learning processes for machine learning-based models.</p>\",\"PeriodicalId\":15055,\"journal\":{\"name\":\"Journal of Biomedical Semantics\",\"volume\":\" \",\"pages\":\"26\"},\"PeriodicalIF\":1.6000,\"publicationDate\":\"2022-10-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9612606/pdf/\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Biomedical Semantics\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://doi.org/10.1186/s13326-022-00280-6\",\"RegionNum\":3,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Biomedical Semantics","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1186/s13326-022-00280-6","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
Citations: 5

Abstract


Background: Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models have been used in a variety of biomedical and clinical applications. On the available data sets, these models show excellent results, in some cases exceeding the inter-annotator agreement. However, biomedical named entity recognition applied to COVID-19 preprints shows a performance drop compared to the results on the original test data. This raises the question of how well trained models are able to predict on completely new data, i.e., to generalize.
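
The abstract refers to BERT-based named entity recognizers as the current standard. As a purely illustrative, minimal sketch (not the authors' setup), the snippet below tags disease mentions with a Hugging Face token-classification pipeline; the model identifier is a hypothetical placeholder for any BERT model fine-tuned on a disease corpus such as NCBI Disease.

```python
# Minimal sketch of BERT-based disease NER via the Hugging Face pipeline API.
# "your-org/bert-disease-ner" is a hypothetical checkpoint name, not one used
# in the paper; substitute any token-classification model fine-tuned for
# disease mentions.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-org/bert-disease-ner",   # hypothetical model identifier
    aggregation_strategy="simple",       # merge word pieces into whole mentions
)

text = "Patients with cystic fibrosis frequently develop chronic pulmonary infections."
for mention in ner(text):
    print(mention["word"], mention["entity_group"], round(mention["score"], 3))
```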

Results: Using disease named entity recognition as an example, we investigate the robustness of different machine learning-based methods, including transfer learning, and show that current state-of-the-art methods work well for a given training set and the corresponding test set but show a significant lack of generalization when applied to new data.
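
The generalization gap described above is usually quantified by comparing entity-level F1 on the corpus' own test split with F1 on an independently annotated corpus. The toy example below is a sketch assuming the seqeval library and invented BIO tag sequences (none of the numbers come from the paper); it shows how such a cross-corpus comparison is scored.

```python
# Toy cross-corpus evaluation with entity-level F1 (seqeval).
# The tag sequences below are invented for illustration only.
from seqeval.metrics import f1_score

# Test split of the corpus the model was trained on (in-domain).
in_domain_gold = [["O", "B-Disease", "I-Disease", "O"]]
in_domain_pred = [["O", "B-Disease", "I-Disease", "O"]]

# Independently annotated corpus the model has never seen (cross-corpus).
cross_corpus_gold = [["B-Disease", "O", "O", "B-Disease", "I-Disease"]]
cross_corpus_pred = [["O",         "O", "O", "B-Disease", "O"]]

print("in-domain F1:   ", f1_score(in_domain_gold, in_domain_pred))        # perfect match -> 1.0
print("cross-corpus F1:", f1_score(cross_corpus_gold, cross_corpus_pred))  # boundary/miss errors -> 0.0
```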

Conclusions: We argue that there is a need for larger annotated data sets for training and testing. We therefore foresee the curation of further data sets and, beyond that, the investigation of continual learning processes for machine learning-based models.

Source journal
Journal of Biomedical Semantics (Mathematical & Computational Biology)
CiteScore: 4.20 · Self-citation rate: 5.30% · Articles per year: 28 · Review time: 30 weeks

Journal description: Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas. Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability. Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources, and tools for investigation, reasoning, prediction, and discoveries in biomedicine.