Hui Yu, Jing Wu, Suyan Bian, Sheng Zhang, Yibin Wu, Ziyan Zhou, Qian Jia, Yuan Ni, Zhengxing Huang, Huiyu Yan, Weidong Wang, Kunlun He, Jinlong Shi
Computer Methods and Programs in Biomedicine, volume 266, Article 108773. Published 2025-04-11. DOI: 10.1016/j.cmpb.2025.108773. URL: https://www.sciencedirect.com/science/article/pii/S0169260725001907
GDReCo: Fine-grained gene-disease relationship extraction corpus
Background and objective
Understanding gene-disease relationships is crucial for medical research, drug discovery, clinical diagnosis, and other fields. However, there is currently no high-quality, fine-grained corpus available for training Natural Language Processing (NLP) models, which have proven to be effective in knowledge extraction.
Methods
This study introduces a novel ontology framework for gene-disease associations, addressing the absence of a formal descriptive system and training corpus for NLP models.
Results
We developed the Gene Disease Relationship Extraction Corpus (GDReCo), a refined dataset of more than 24,000 cases, comprising 2,300+ manually annotated and 22,000+ model-predicted instances. BERT-based models trained on this data achieved high F1-scores for "event" and "rel" relationships, validating the corpus's effectiveness for Gene-Disease Relationship Extraction (GDRE) tasks.
Conclusions
GDReCo serves as a valuable resource for biomedical research. The study also notes ChatGPT's limitations in fine-grained relation extraction.
About the journal:
To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; to promote the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine.
Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.