Interpretation knowledge extraction for genetic testing via question-answer model.

IF 3.5; Zone 2 (Biology); JCR Q2 (Biotechnology & Applied Microbiology)

BMC Genomics. Pub Date: 2024-11-09. DOI: 10.1186/s12864-024-10978-9

Wenjun Wang, Huanxin Chen, Hui Wang, Lin Fang, Huan Wang, Yi Ding, Yao Lu, Qingyao Wu

BMC Genomics 25(1):1062. Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549790/pdf/

Citations: 0

Abstract

Background: Sequencing-based genetic testing is widely used in biomedical research, including pathogenic microorganism detection with metagenomic next-generation sequencing (mNGS). The application of sequencing results to clinical diagnosis and treatment relies on various interpretation knowledge bases. Existing knowledge bases are built primarily through manual knowledge extraction, which requires professionals to read extensive literature and extract the relevant knowledge from it; this is time-consuming and costly. Furthermore, manual extraction unavoidably introduces subjective biases. In this study, we aimed to automatically extract knowledge for interpreting mNGS results.

Method: We propose a novel approach to automatically extract pathogenic microorganism knowledge based on a question-answer (QA) model. First, we construct a MicrobeDB dataset, since no pathogenic microorganism QA dataset is available for training the model. The created dataset contains 3,161 samples from 618 published papers covering 224 pathogenic microorganisms. Then, we fine-tune the selected baseline model on MicrobeDB. Finally, we utilize ChatGPT to enhance the diversity of training data, and employ data expansion to increase training data volume.
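The abstract does not describe the MicrobeDB record layout or the augmentation procedure in detail, so the following is only a minimal sketch under stated assumptions: samples are stored in a SQuAD-like extractive layout (context, question, answer span), and "data expansion" is approximated by pairing paraphrased questions with the same context and answer. The names `QASample`, `make_sample`, and `expand` are hypothetical, and the example sentence is illustrative rather than taken from MicrobeDB.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class QASample:
    """One extractive QA example in a SQuAD-like layout (an assumption, not the paper's schema)."""
    context: str
    question: str
    answer: str
    answer_start: int  # character offset of the answer inside the context


def make_sample(context: str, question: str, answer: str) -> QASample:
    """Build a sample, locating the answer span inside the context."""
    start = context.find(answer)
    if start < 0:
        raise ValueError("answer must appear verbatim in the context")
    return QASample(context, question, answer, start)


def expand(sample: QASample, paraphrases: List[str]) -> List[QASample]:
    """Return the original sample plus one copy per distinct paraphrased question."""
    seen = {sample.question}
    out = [sample]
    for question in paraphrases:
        if question not in seen:
            seen.add(question)
            out.append(replace(sample, question=question))
    return out


# Illustrative sample; in the paper's pipeline the paraphrases would come from ChatGPT.
base = make_sample(
    context="Klebsiella pneumoniae is a common cause of hospital-acquired pneumonia.",
    question="Which disease does Klebsiella pneumoniae cause?",
    answer="hospital-acquired pneumonia",
)
augmented = expand(base, [
    "What illness is associated with Klebsiella pneumoniae?",
    "Which disease does Klebsiella pneumoniae cause?",  # duplicate of the original, skipped
])
```

Keeping the context and answer span fixed while varying only the question means every expanded sample stays answerable from the same passage, which is one plausible way to raise question diversity without relabeling.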

Results: Our method achieves an Exact Match (EM) and F1 score of 88.39% and 93.18%, respectively, on the MicrobeDB test set. We also conduct ablation studies on the proposed data augmentation method. In addition, we perform comparative experiments with the ChatPDF tool based on the ChatGPT API to demonstrate the effectiveness of the proposed method.
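The abstract reports EM and F1 but does not state how they are computed. A minimal sketch, assuming the common SQuAD-style definitions (answers are normalized, EM checks string equality, and F1 is the harmonic mean of token-level precision and recall), could look like this. The function names and the normalization rules are assumptions, not the authors' evaluation code.

```python
import re
import string
from collections import Counter
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Average EM and F1 over (prediction, reference) pairs, as percentages."""
    em = sum(exact_match(p, r) for p, r in pairs) / len(pairs)
    f1 = sum(f1_score(p, r) for p, r in pairs) / len(pairs)
    return 100 * em, 100 * f1


# Illustrative pairs, not data from the paper.
em_pct, f1_pct = evaluate([
    ("The Pneumonia.", "pneumonia"),
    ("hospital-acquired pneumonia", "pneumonia"),
])
```

Under these definitions F1 is never lower than EM, because a partially overlapping answer earns partial F1 credit but no EM credit; this is consistent with the reported 93.18% F1 against 88.39% EM.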

Conclusions: Our method is effective and valuable for extracting pathogenic microorganism knowledge.


Source journal

BMC Genomics (Biology: Biotechnology & Applied Microbiology)

CiteScore: 7.40

Self-citation rate: 4.50%

Articles published: 769

Review time: 6.4 months

About the journal: BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics. BMC Genomics is part of the BMC series, which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.