Automated information extraction model enhancing traditional Chinese medicine RCT evidence extraction (Evi-BERT): algorithm development and validation.
{"title":"Automated information extraction model enhancing traditional Chinese medicine RCT evidence extraction (Evi-BERT): algorithm development and validation.","authors":"Yizhen Li, Zhongzhi Luan, Yixing Liu, Heyuan Liu, Jiaxing Qi, Dongran Han","doi":"10.3389/frai.2024.1454945","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>In the field of evidence-based medicine, randomized controlled trials (RCTs) are of critical importance for writing clinical guidelines and providing guidance to practicing physicians. Currently, RCTs rely heavily on manual extraction, but this method has data breadth limitations and is less efficient.</p><p><strong>Objectives: </strong>To expand the breadth of data and improve the efficiency of obtaining clinical evidence, here, we introduce an automated information extraction model for traditional Chinese medicine (TCM) RCT evidence extraction.</p><p><strong>Methods: </strong>We adopt the Evidence-Bidirectional Encoder Representation from Transformers (Evi-BERT) for automated information extraction, which is combined with rule extraction. Eleven disease types and 48,523 research articles from the China National Knowledge Infrastructure (CNKI), WanFang Data, and VIP databases were selected as the data source for extraction. We then constructed a manually annotated dataset of TCM clinical literature to train the model, including ten evidence elements and 24,244 datapoints. We chose two models, BERT-CRF and BiLSTM-CRF, as the baseline, and compared the training effects with Evi-BERT and Evi-BERT combined with rule expression (RE).</p><p><strong>Results: </strong>We found that Evi-BERT combined with RE achieved the best performance (precision score = 0.926, Recall = 0.952, F1 score = 0.938) and had the best robustness. We totally summarized 113 pieces of rule datasets in the regulation extraction procedure. Our model dramatically expands the amount of data that can be searched and greatly improves efficiency without losing accuracy.</p><p><strong>Conclusion: </strong>Our work provided an intelligent approach to extracting clinical evidence for TCM RCT data. Our model can help physicians reduce the time spent reading journals and rapidly speed up the screening of clinical trial evidence to help generate accurate clinical reference guidelines. Additionally, we hope the structured clinical evidence and structured knowledge extracted from this study will help other researchers build large language models in TCM.</p>","PeriodicalId":33315,"journal":{"name":"Frontiers in Artificial Intelligence","volume":"7 ","pages":"1454945"},"PeriodicalIF":3.0000,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358118/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frai.2024.1454945","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Background: In the field of evidence-based medicine, randomized controlled trials (RCTs) are of critical importance for writing clinical guidelines and providing guidance to practicing physicians. Currently, evidence extraction from RCTs relies heavily on manual work, which limits the breadth of data that can be covered and is inefficient.
Objectives: To expand the breadth of data and improve the efficiency of obtaining clinical evidence, we introduce an automated information extraction model for evidence extraction from traditional Chinese medicine (TCM) RCTs.
Methods: We adopted Evidence-Bidirectional Encoder Representations from Transformers (Evi-BERT), combined with rule-based extraction, for automated information extraction. Eleven disease types and 48,523 research articles from the China National Knowledge Infrastructure (CNKI), WanFang Data, and VIP databases were selected as the data source for extraction. We then constructed a manually annotated dataset of TCM clinical literature to train the model, covering ten evidence elements and 24,244 data points. We chose two models, BERT-CRF and BiLSTM-CRF, as baselines and compared their training performance with Evi-BERT and Evi-BERT combined with rule expressions (RE).
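The abstract does not include implementation details, but the described pipeline (a fine-tuned BERT-style sequence labeler whose output is merged with rule expressions) can be illustrated with a minimal sketch. The checkpoint path, entity labels, and the sample-size rule below are hypothetical placeholders, not the authors' released code or rule set.

```python
# Sketch only: a BERT-style token-classification model plus a regex rule,
# in the spirit of the Evi-BERT + RE setup described above.
import re
from transformers import pipeline

# Hypothetical fine-tuned checkpoint for TCM evidence elements.
ner = pipeline(
    "token-classification",
    model="path/to/evi-bert-checkpoint",   # placeholder, not a real model ID
    aggregation_strategy="simple",
)

# Example rule expression: capture a reported sample size such as "纳入120例"
# ("enrolled 120 cases") in Chinese RCT abstracts.
SAMPLE_SIZE_RULE = re.compile(r"纳入\s*(\d+)\s*例")

def extract_evidence(text: str) -> dict:
    """Merge model-predicted evidence spans with rule-based matches."""
    spans = {ent["entity_group"]: ent["word"] for ent in ner(text)}
    match = SAMPLE_SIZE_RULE.search(text)
    if match:
        # Let rules take precedence for elements with regular surface forms.
        spans["sample_size"] = match.group(1)
    return spans
```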
Results: We found that Evi-BERT combined with RE achieved the best performance (precision = 0.926, recall = 0.952, F1 score = 0.938) and was the most robust. In total, we summarized 113 rules during the rule-based extraction procedure. Our model dramatically expands the amount of data that can be searched and greatly improves efficiency without sacrificing accuracy.
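For readers unfamiliar with the reported metrics, the sketch below shows how span-level precision, recall, and F1 can be computed from predicted versus gold evidence spans. The toy spans are illustrative only and are not taken from the paper's dataset.

```python
# Span-level precision / recall / F1 on sets of (element, value) pairs.
def prf1(pred: set, gold: set) -> tuple:
    tp = len(pred & gold)                       # correctly extracted spans
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("intervention", "针灸"), ("sample_size", "120"), ("outcome", "总有效率")}
pred = {("intervention", "针灸"), ("sample_size", "120")}
print(prf1(pred, gold))  # (1.0, 0.667, 0.8) on this toy example
```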
Conclusion: Our work provides an intelligent approach to extracting clinical evidence from TCM RCT data. Our model can help physicians reduce the time spent reading journals and speed up the screening of clinical trial evidence, supporting the generation of accurate clinical reference guidelines. Additionally, we hope the structured clinical evidence and structured knowledge extracted in this study will help other researchers build large language models for TCM.