Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach.

IF 3.4 | Zone 4 (Biology) | JCR Q1, Mathematical & Computational Biology

Database: The Journal of Biological Databases and Curation | Pub Date: 2024-08-28 | DOI: 10.1093/database/baae079

M Janina Sarol, Gibong Hong, Evan Guerra, Halil Kilicoglu


Citations: 0

Abstract

Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof. For NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement over our shared task submission: NER 93.53 (+3.09), EL 83.87 (+9.73), RE 46.18 (+15.67), and ND 38.86 (+14.9). While the performance of the NER and EL models is reasonably high, the RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.
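The abstract contrasts a sequence labeling model that uses the BIO scheme with a span classification model. As a rough illustration of what the BIO scheme encodes, the sketch below decodes a per-token BIO tag sequence into typed entity spans, which is the form a span classifier would predict directly. This is not the authors' code: the function name `bio_to_spans`, the example sentence, and the simplified labels `Gene` and `Disease` are assumptions made for illustration, and the lenient handling of a stray `I-` tag is one possible design choice among several.

```python
from typing import List, Optional, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into a list of typed entity spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    etype: Optional[str] = None
    for i, tag in enumerate(tags):
        # The open entity continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and etype is not None and tag[2:] == etype
        if not continues:
            # Close the open entity, if any, at the current position.
            if etype is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            # "B-" opens a new entity; a stray "I-" is treated the same way
            # (lenient decoding, an assumption of this sketch).
            if tag.startswith("B-") or tag.startswith("I-"):
                start, etype = i, tag[2:]
    # Close an entity that runs to the end of the sequence.
    if etype is not None:
        spans.append((start, len(tags), etype))
    return spans


# Hypothetical example sentence and tags, not taken from BioRED.
tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."]
tags = ["O", "O", "B-Gene", "O", "B-Disease", "I-Disease", "O"]
entities = [(" ".join(tokens[s:e]), t) for s, e, t in bio_to_spans(tags)]
print(entities)  # [('BRCA1', 'Gene'), ('breast cancer', 'Disease')]
```

In a pipeline like the one described, spans of this kind would be the input to the entity linking step, which maps each mention to a database identifier before relation extraction.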


Source journal

Database: The Journal of Biological Databases and Curation (Mathematical & Computational Biology)

CiteScore: 9.00
Self-citation rate: 3.40%
Articles published: 100
Review time: >12 weeks

About the journal: Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.