GDReCo: Fine-grained gene-disease relationship extraction corpus

IF 4.9 | Zone 2 (Medicine) | JCR Q1, Computer Science, Interdisciplinary Applications
Hui Yu, Jing Wu, Suyan Bian, Sheng Zhang, Yibin Wu, Ziyan Zhou, Qian Jia, Yuan Ni, Zhengxing Huang, Huiyu Yan, Weidong Wang, Kunlun He, Jinlong Shi
DOI: 10.1016/j.cmpb.2025.108773
Journal: Computer Methods and Programs in Biomedicine, vol. 266, Article 108773
Published: 2025-04-11 (journal article; not open access)
URL: https://www.sciencedirect.com/science/article/pii/S0169260725001907
Citations: 0

Abstract

GDReCo: Fine-grained gene-disease relationship extraction corpus

Background and objective

Understanding gene-disease relationships is crucial for medical research, drug discovery, clinical diagnosis, and other fields. However, there is currently no high-quality, fine-grained corpus available for training Natural Language Processing (NLP) models, which have proven to be effective in knowledge extraction.

Methods

This study introduces a novel ontology framework for gene-disease associations, addressing the absence of a formal descriptive system and training corpus for NLP models.

Results

We developed the Gene Disease Relationship Extraction Corpus (GDReCo), a refined dataset of more than 24,000 cases, comprising over 2,300 manually annotated and over 22,000 model-predicted instances. BERT-based models trained on this data achieved high F1-scores for "event" and "rel" relationships, validating the corpus's effectiveness for Gene-Disease Relationship Extraction (GDRE) tasks.
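
For readers unfamiliar with the metric, the following is a minimal sketch of how a per-label F1-score can be computed for relation labels such as "event" and "rel". It is not the authors' evaluation code: the abstract does not describe the scoring procedure, the function name `per_label_f1` is hypothetical, the "none" label is an assumption, and the toy labels are invented for illustration rather than taken from GDReCo.

```python
from collections import Counter


def per_label_f1(gold, pred):
    """Return {label: (precision, recall, f1)} for two parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true label but was missed
    scores = {}
    for label in set(gold) | set(pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy labels invented for illustration; NOT data from GDReCo.
gold = ["event", "event", "rel", "rel", "none", "event"]
pred = ["event", "rel", "rel", "rel", "none", "event"]
scores = per_label_f1(gold, pred)

for label in sorted(scores):
    p, r, f = scores[label]
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reporting the score per label, rather than a single pooled number, makes it visible when a model handles one relation type well and another poorly.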

Conclusions

GDReCo serves as a valuable resource for biomedical research, though ChatGPT's limitations in fine-grained relation extraction are noted.
Source journal
Computer Methods and Programs in Biomedicine (Engineering & Technology - Engineering: Biomedical)
CiteScore: 12.30
Self-citation rate: 6.60%
Articles published: 601
Review time: 135 days
Journal description: To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine. Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.