Transformer-Based Tool for Automated Fact-Checking: A Pilot Study on Online Health Information.

IF 3.5 Q1 HEALTH CARE SCIENCES & SERVICES
JMIR Infodemiology, Pub Date: 2024-12-24, DOI: 10.2196/56831
Azadeh Bayani, Alexandre Ayotte, Jean Noel Nikiema
Citations: 0

Abstract
Transformer-Based Tool for Automated Fact-Checking: A Pilot Study on Online Health Information.

Background: Many people seek health-related information online. The potential dangers of misinformation have made the importance of reliable information particularly evident, yet discerning true and reliable information from false information has become increasingly challenging.

Objective: In this pilot study, we introduce a novel approach to automating the fact-checking process that uses PubMed resources as a source of truth and natural language processing (NLP) transformer models to enhance the process.

Methods: A total of 538 health-related webpages, covering seven different disease subjects, were manually selected by Factually Health Company. The process included the following steps: (i) the transformer models BERT (Bidirectional Encoder Representations from Transformers), BioBERT, and SciBERT, together with the traditional random forest (RF) and support vector machine (SVM) models, were used to classify the contents of webpages into three thematic categories: semiology, epidemiology, and management; (ii) for each category in a webpage, a PubMed query was automatically produced using a combination of the "WellcomeBertMesh" and "KeyBERT" models; (iii) the top 20 related articles were automatically extracted from PubMed; and finally, (iv) the similarity-checking techniques of cosine similarity and Jaccard distance were applied to compare the content of the extracted literature with that of the webpages.
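
The two similarity measures named in step (iv) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the study compared sentences vectorized by BERT, whereas this sketch assumes plain word-count vectors (for cosine similarity) and word sets (for Jaccard distance) so that it runs with the standard library alone. The function names and the two example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the word-count vectors of a and b.

    Assumption: bag-of-words counts stand in for the BERT sentence
    vectors used in the study.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)


def jaccard_distance(a: str, b: str) -> float:
    """1 minus |intersection| / |union| of the two word sets."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    if not union:
        return 0.0  # two empty sentences are treated as identical
    return 1.0 - len(sa & sb) / len(union)


# Hypothetical sentences, one from a webpage and one from an article.
web_sentence = "Regular exercise lowers blood pressure"
paper_sentence = "Exercise lowers blood pressure in adults"

similarity = cosine_similarity(web_sentence, paper_sentence)
distance = jaccard_distance(web_sentence, paper_sentence)
```

Note that the two measures point in opposite directions: a higher cosine similarity means the sentences are closer, while a higher Jaccard distance means they are further apart, so the scores have to be aligned before they can be compared.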

Results: The BERT model performed well in categorizing webpage content, with F1-score and recall of 93% and 94% for semiology and epidemiology, respectively, and 96% for both recall and F1-score for management. For each of the three categories in a webpage, one PubMed query was generated, and for each query the 20 most relevant open-access articles classified as systematic reviews and meta-analyses were extracted. Fewer than 10% of the extracted articles were irrelevant, and these were removed. For each webpage, an average of 23% of the sentences were found to be very similar to the literature. During the evaluation, cosine similarity outperformed the Jaccard distance measure when comparing sentences from webpages with sentences from academic papers vectorized by BERT. However, false positives were a significant issue among the retrieved sentences: some sentence pairs had a similarity score exceeding 80% but could not be considered similar.
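
For readers less familiar with the reported metrics, the sketch below shows how per-category recall and F1-score are conventionally computed from true and predicted labels. It is an illustration under stated assumptions: the function name and the six toy labels are hypothetical and are not data from the study; only the three category names come from the abstract.

```python
def recall_and_f1(true_labels, predicted_labels, target):
    """Return (recall, F1-score) for one category, treated one-vs-rest."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)

    # Precision: share of predictions for the category that were correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of true members of the category that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1-score: harmonic mean of precision and recall.
    if precision + recall == 0:
        return recall, 0.0
    return recall, 2 * precision * recall / (precision + recall)


# Toy labels for six webpage passages (hypothetical, not study data).
true_labels = ["semiology", "semiology", "epidemiology",
               "management", "management", "epidemiology"]
predicted_labels = ["semiology", "epidemiology", "epidemiology",
                    "management", "management", "epidemiology"]

scores = {
    category: recall_and_f1(true_labels, predicted_labels, category)
    for category in ("semiology", "epidemiology", "management")
}
```

Because F1-score combines precision and recall, a category can reach high recall while its F1-score stays lower if the classifier also assigns that category to passages that belong elsewhere, as happens for "epidemiology" in the toy labels.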

Conclusions: In this pilot study, we proposed an approach to automating the fact-checking of health-related online information. Incorporating content from PubMed or other scientific article databases as trustworthy resources can automate the discovery of similarly credible information in the health domain.

Source journal metrics: CiteScore 4.80; self-citation rate 0.00%; articles published 0