RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature
Authors: Katerina Nastou, Farrokh Mehryary, Tomoko Ohta, Jouni Luoma, Sampo Pyysalo, Lars Juhl Jensen
DOI: https://doi.org/10.1093/database/baae095
Publication date: 2024-09-12
Publication type: Journal Article
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11394941/pdf/
Citation count: 0
Abstract
In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.
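To make the terms "typed, directed, and signed" and the reported F1-score more concrete, the following is a minimal sketch of how such relations could be represented and scored. Everything in it is an assumption made for illustration: the `Relation` fields, the entity names, the relation-type and sign labels, and the exact-match scoring are hypothetical and are not taken from the RegulaTome annotation scheme or evaluation protocol, which this abstract does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A hypothetical typed, directed, signed relation (illustrative only)."""
    source: str    # entity the relation starts from (direction: source -> target)
    target: str    # entity the relation points to
    rel_type: str  # hypothetical type label, e.g. "regulation"
    sign: str      # hypothetical sign label: "positive", "negative", "unspecified"


def f1_score(gold: set, predicted: set) -> float:
    """F1 over exact matches: harmonic mean of precision and recall."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy gold annotations and toy system output (invented entity names).
gold = {
    Relation("ProteinA", "ProteinB", "regulation", "positive"),
    Relation("ChemicalX", "ProteinA", "regulation", "negative"),
    Relation("ProteinB", "ProteinC", "regulation", "unspecified"),
    Relation("ProteinC", "ProteinD", "regulation", "positive"),
}
predicted = {
    Relation("ProteinA", "ProteinB", "regulation", "positive"),
    # Same entities and type, but the direction is reversed, so it does not match.
    Relation("ProteinA", "ChemicalX", "regulation", "negative"),
    Relation("ProteinB", "ProteinC", "regulation", "unspecified"),
}

score = f1_score(gold, predicted)
print(f"F1 = {score:.3f}")
```

In this toy setup a prediction counts as correct only if the source, target, type, and sign all agree with a gold relation, which is one reason a task with many relation types is harder to score well on than a single-type task.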