RDscan: Extracting RNA-disease relationship from the literature based on pre-training model

IF 4.2. Zone 3 (Biology). Q1 BIOCHEMICAL RESEARCH METHODS

Methods. Pub Date: 2024-05-22. DOI: 10.1016/j.ymeth.2024.05.012

Yang Zhang, Yu Yang, Liping Ren, Lin Ning, Quan Zou, Nanchao Luo, Yinghui Zhang, Ruijun Liu

Methods, volume 228, pages 48-54. Full text: https://www.sciencedirect.com/science/article/pii/S1046202324001312

Citations: 0

Abstract

RDscan: Extracting RNA-disease relationship from the literature based on pre-training model

With rapid advances in molecular biology and genomics, many connections between RNA and diseases have been uncovered, making efficient and accurate extraction of RNA-disease (RD) relationships from the extensive biomedical literature crucial for advancing research in this field. This study introduces RDscan, a novel text mining method based on the pre-training and fine-tuning strategy, which uses pre-trained biomedical large language models (LLMs) to automatically extract RD-related information from a large corpus of literature. First, we constructed a dedicated RD corpus by manual curation of the literature, comprising 2,082 positive and 2,000 negative sentences, together with an independent test dataset of 500 positive and 500 negative sentences, for training and evaluating RDscan. Then, by fine-tuning the Bioformer and BioBERT pre-trained models, RDscan achieved exceptional performance in text classification and named entity recognition (NER) tasks. In 5-fold cross-validation, RDscan significantly outperformed traditional machine learning methods (Support Vector Machine, Logistic Regression and Random Forest). In addition, we developed an accessible webserver that helps users extract RD relationships from text. In summary, RDscan is the first text mining tool specifically designed for RD relationship extraction and is poised to become an invaluable tool for researchers exploring the intricate interactions between RNA and diseases. The RDscan webserver is freely available at https://cellknowledge.com.cn/RDscan/.
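The 5-fold cross-validation mentioned in the abstract can be illustrated with a small sketch of how a corpus of this size might be partitioned. The abstract does not say how the folds were built, so the stratified round-robin split, the function name `stratified_kfold`, and the seed below are assumptions for illustration only, not the authors' procedure. Only the corpus sizes (2,082 positive and 2,000 negative sentences) come from the abstract.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class ratios similar per fold.

    Hypothetical helper for illustration; not taken from the RDscan paper.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class, then deal its indices round-robin across the k folds
    # so every fold receives a near-equal share of that class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the held-out set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Corpus sizes from the abstract: 2,082 positive and 2,000 negative sentences.
labels = [1] * 2082 + [0] * 2000
splits = list(stratified_kfold(labels, k=5))

fold_sizes = [len(test) for _, test in splits]
positives_per_fold = [sum(labels[i] for i in test) for _, test in splits]
print(fold_sizes)          # [817, 817, 816, 816, 816]
print(positives_per_fold)  # [417, 417, 416, 416, 416]
```

In such a setup, each classifier (a fine-tuned model or a traditional baseline) would be trained on four folds and scored on the fifth, and the five scores averaged. The independent 500/500 test set would stay outside this loop.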

Source journal

Methods (Biology: Biochemical Research Methods)

CiteScore: 9.80

Self-citation rate: 2.10%

Articles published: 222

Review time: 11.3 weeks

Journal description: Methods focuses on rapidly developing techniques in the experimental biological and medical sciences. Each topical issue, organized by a guest editor who is an expert in the area covered, consists solely of invited quality articles by specialist authors, many of them reviews. Issues are devoted to specific technical approaches with emphasis on clear detailed descriptions of protocols that allow them to be reproduced easily. The background information provided enables researchers to understand the principles underlying the methods; other helpful sections include comparisons of alternative methods giving the advantages and disadvantages of particular methods, guidance on avoiding potential pitfalls, and suggestions for troubleshooting.