Wenjun Wang, Huanxin Chen, Hui Wang, Lin Fang, Huan Wang, Yi Ding, Yao Lu, Qingyao Wu
Title: Interpretation knowledge extraction for genetic testing via question-answer model
Journal: BMC Genomics, vol. 25, no. 1, article 1062 (published 2024-11-09)
DOI: 10.1186/s12864-024-10978-9 (https://doi.org/10.1186/s12864-024-10978-9)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549790/pdf/
Abstract
Background: Sequencing-based genetic testing is widely used in biomedical research, including the detection of pathogenic microorganisms by metagenomic next-generation sequencing (mNGS). Applying sequencing results to clinical diagnosis and treatment relies on various interpretation knowledge bases. Existing knowledge bases are built primarily through manual extraction, which requires professionals to read extensive literature and extract the relevant knowledge; this is time-consuming and costly, and it unavoidably introduces subjective bias. In this study, we aimed to extract knowledge for interpreting mNGS results automatically.
Method: We propose a novel approach that automatically extracts pathogenic microorganism knowledge using a question-answering (QA) model. First, because no pathogenic microorganism QA dataset is available for training the model, we construct one, MicrobeDB, which contains 3,161 samples from 618 published papers covering 224 pathogenic microorganisms. Next, we fine-tune the selected baseline model on MicrobeDB. Finally, we use ChatGPT to increase the diversity of the training data and apply data expansion to increase its volume.
Results: Our method achieves an Exact Match (EM) score of 88.39% and an F1 score of 93.18% on the MicrobeDB test set. We also conduct ablation studies on the proposed data augmentation method. In addition, we compare our method with ChatPDF, a tool based on the ChatGPT API, to demonstrate the effectiveness of the proposed method.
Conclusions: Our method is effective and valuable for extracting pathogenic microorganism knowledge.
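The abstract reports Exact Match (EM) and F1 but does not say how they are computed. The sketch below assumes the SQuAD-style convention common in extractive QA: answers are lowercased, stripped of punctuation and English articles, and then compared either as whole strings (EM) or as bags of tokens (F1). The function names and the example strings are hypothetical illustrations, not taken from the paper or from MicrobeDB.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Average EM and F1 over (prediction, reference) pairs, as percentages."""
    if not pairs:
        raise ValueError("at least one (prediction, reference) pair is required")
    n = len(pairs)
    em = 100.0 * sum(exact_match(p, r) for p, r in pairs) / n
    f1 = 100.0 * sum(token_f1(p, r) for p, r in pairs) / n
    return {"em": em, "f1": f1}


if __name__ == "__main__":
    # Hypothetical predictions and references, for illustration only.
    examples = [
        ("The Klebsiella pneumoniae", "Klebsiella pneumoniae"),
        ("Klebsiella pneumoniae infection", "Klebsiella pneumoniae"),
    ]
    print(evaluate(examples))
```

Under this convention F1 is never lower than EM, since an exact match also scores a token F1 of 1 while a partially overlapping answer scores 0 on EM but above 0 on F1. That is consistent with the reported F1 (93.18%) exceeding the reported EM (88.39%).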
About the journal:
BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics.
BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.