Enhancing named entity recognition with a novel BERT-BiLSTM-CRF-RC joint training model for biomedical materials database

Mufei Li, Yan Zhuang, Ke Chen, Lin Han, Xiangfeng Li, Yongtao Wei, Xiangdong Zhu, Mingli Yang, Guangfu Yin, Jiangli Lin, Xingdong Zhang
Materials Genome Engineering Advances, vol. 3, no. 1. Published 2025-03-16. DOI: 10.1002/mgea.70001
https://onlinelibrary.wiley.com/doi/10.1002/mgea.70001

Abstract

In this study, we propose a novel joint training model for named entity recognition (NER) that combines BERT, BiLSTM, and CRF with a reading comprehension (RC) mechanism. Conventional BERT-BiLSTM-CRF models often detect entity boundaries inaccurately and split named entities into too many fragments because they lack specialized vocabulary. Our model addresses these issues by integrating an RC mechanism, which refines fragmented results and lets the model identify entity boundaries more precisely without relying on an expert-annotated dictionary. Segmentation errors are further reduced by a segmented combined voting and positive-sample-coverage technique. We applied the model to build a database for mesoporous bioactive glass (MBG), and we also developed a classifier that automatically detects whether a paragraph contains pertinent information. For this study, 200 articles were retrieved using MBG-related keywords, and the data were split into training and test sets at a 9:1 ratio. In total, 492 paragraphs were automatically extracted for training and 50 for testing. The joint training model achieves an NER accuracy of 92.8%, which is 4.3 percentage points higher than the 88.5% accuracy of the conventional BERT-BiLSTM-CRF model.
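
The figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the authors' pipeline: the function names are hypothetical, and it assumes the 492 and 50 extracted paragraphs are the training and test sets referred to by the 9:1 split. It shows that the reported gain of 4.3 is a difference in percentage points, and that the improvement relative to the baseline accuracy is somewhat larger.

```python
# Figures quoted in the abstract (no other data is assumed).
TRAIN_PARAGRAPHS = 492
TEST_PARAGRAPHS = 50
JOINT_ACCURACY = 92.8      # BERT-BiLSTM-CRF-RC, percent
BASELINE_ACCURACY = 88.5   # BERT-BiLSTM-CRF, percent


def split_ratio(train: int, test: int) -> float:
    """Number of training paragraphs per test paragraph."""
    return train / test


def absolute_gain(new: float, old: float) -> float:
    """Accuracy difference in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def relative_gain(new: float, old: float) -> float:
    """Improvement relative to the baseline, in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


if __name__ == "__main__":
    ratio = split_ratio(TRAIN_PARAGRAPHS, TEST_PARAGRAPHS)
    print(f"train:test paragraphs = {ratio:.2f}:1")
    print(f"absolute gain = {absolute_gain(JOINT_ACCURACY, BASELINE_ACCURACY)} points")
    print(f"relative gain = {relative_gain(JOINT_ACCURACY, BASELINE_ACCURACY)} %")
```

Under that assumption the paragraph counts give a ratio of 9.84:1, close to the stated 9:1; the small difference could arise if the split was made at the article level rather than the paragraph level, which the abstract does not specify.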

