Artificial intelligence strategies based on random forests for detection of AI-generated content in public health

IF 3.9 · CAS Tier 3 (Medicine) · JCR Q1 · Public, Environmental & Occupational Health
Igor V. Pantic, Snezana Mugosa
{"title":"Artificial intelligence strategies based on random forests for detection of AI-generated content in public health","authors":"Igor V. Pantic ,&nbsp;Snezana Mugosa","doi":"10.1016/j.puhe.2025.03.029","DOIUrl":null,"url":null,"abstract":"<div><h3>Objectives</h3><div>To train and test a Random Forest machine learning model with the ability to distinguish AI-generated from human-generated textual content in the domain of public health, and public health policy.</div></div><div><h3>Study design</h3><div>Supervised machine learning study.</div></div><div><h3>Methods</h3><div>A dataset comprising 1000 human-generated and 1000 AI-generated paragraphs was created. Textual features were extracted using TF-IDF vectorization which calculates term frequency (TF) and Inverse document frequency (IDF), and combines the two measures to produce a score for individual terms. The Random Forest model was trained and tested using the Scikit-Learn library and Jupyter Notebook service in the Google Colab cloud-based environment, with Google CPU hardware acceleration.</div></div><div><h3>Results</h3><div>The model achieved a classification accuracy of 81.8 % and an area under the ROC curve of 0.9. For human-generated content, precision, recall, and F1-score were 0.85, 0.78, and 0.81, respectively. For AI-generated content, these metrics were 0.79, 0.86, and 0.82. The MCC value of 0.64 indicated moderate to strong predictive power. The model demonstrated robust sensitivity (recall for AI-generated class) of 0.86 and specificity (recall for human-generated class) of 0.78.</div></div><div><h3>Conclusions</h3><div>The model exhibited acceptable performance, as measured by classification accuracy, area under the receiver operating characteristic curve, and other metrics. This approach can be further improved by incorporating additional supervised machine learning techniques and serves as a foundation for the future development of a sophisticated and innovative AI system. Such a system could play a crucial role in combating misinformation and enhancing public trust across various government platforms, media outlets, and social networks.</div></div>","PeriodicalId":49651,"journal":{"name":"Public Health","volume":"242 ","pages":""},"PeriodicalIF":3.9000,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Public Health","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0033350625001489","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Citations: 0

Abstract

Objectives

To train and test a Random Forest machine learning model able to distinguish AI-generated from human-generated textual content in the domains of public health and public health policy.

Study design

Supervised machine learning study.

Methods

A dataset comprising 1000 human-generated and 1000 AI-generated paragraphs was created. Textual features were extracted using TF-IDF vectorization, which calculates term frequency (TF) and inverse document frequency (IDF) and combines the two measures into a score for each term. The Random Forest model was trained and tested with the Scikit-Learn library in a Jupyter Notebook running in the Google Colab cloud-based environment on Google CPU hardware.
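The abstract does not include code or hyperparameters, so the following is a minimal sketch of the described workflow using scikit-learn's TfidfVectorizer and RandomForestClassifier. The placeholder corpus, the 80/20 split, and the forest settings are illustrative assumptions rather than values reported by the authors.

```python
# Minimal sketch of the pipeline described above: TF-IDF features fed to a
# Random Forest classifier via scikit-learn. The split ratio and forest
# hyperparameters are assumptions; the abstract does not report them.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny placeholder corpus so the sketch executes end to end; the study
# used 1000 human-written and 1000 AI-generated paragraphs instead.
human_paragraphs = [
    "Local vaccination coverage rose after community outreach visits.",
    "The clinic reported fewer influenza cases this winter season.",
] * 25
ai_paragraphs = [
    "Public health policy leverages data-driven synergies across stakeholders.",
    "Evidence-based frameworks optimize population-level health outcomes.",
] * 25

paragraphs = human_paragraphs + ai_paragraphs
labels = [0] * len(human_paragraphs) + [1] * len(ai_paragraphs)  # 0 = human, 1 = AI

X_train, X_test, y_train, y_test = train_test_split(
    paragraphs, labels, test_size=0.2, stratify=labels, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(),  # TF x IDF weight for every term in every paragraph
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)
```

Keeping the vectorizer and classifier in one pipeline ensures the IDF weights are fitted on the training paragraphs only, so no information from the test set leaks into the features.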

Results

The model achieved a classification accuracy of 81.8% and an area under the ROC curve of 0.9. For human-generated content, precision, recall, and F1-score were 0.85, 0.78, and 0.81, respectively; for AI-generated content, these metrics were 0.79, 0.86, and 0.82. The Matthews correlation coefficient (MCC) of 0.64 indicated moderate to strong predictive power. The model demonstrated robust sensitivity (recall for the AI-generated class) of 0.86 and specificity (recall for the human-generated class) of 0.78.
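The abstract reports these figures without code; a minimal sketch of how such metrics can be computed with scikit-learn's metrics module, reusing the model and held-out split from the sketch above, is:

```python
# Evaluation metrics of the kind reported in the Results, via scikit-learn.
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    matthews_corrcoef,
    roc_auc_score,
)

y_pred = model.predict(X_test)               # hard class predictions
y_score = model.predict_proba(X_test)[:, 1]  # probability of the AI-generated class

print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_score))
print("MCC:", matthews_corrcoef(y_test, y_pred))
# Per-class precision, recall and F1; recall of the AI class corresponds to
# the sensitivity quoted above, recall of the human class to the specificity.
print(classification_report(y_test, y_pred, target_names=["human", "AI"]))
```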

Conclusions

The model exhibited acceptable performance, as measured by classification accuracy, area under the receiver operating characteristic curve, and other metrics. This approach can be further improved by incorporating additional supervised machine learning techniques and serves as a foundation for the future development of a sophisticated and innovative AI system. Such a system could play a crucial role in combating misinformation and enhancing public trust across various government platforms, media outlets, and social networks.
Source journal

Public Health (Medicine: Public, Environmental & Occupational Health)
CiteScore: 7.60
Self-citation rate: 0.00%
Articles published: 280
Review time: 37 days

Journal description: Public Health is an international, multidisciplinary peer-reviewed journal. It publishes original papers, reviews and short reports on all aspects of the science, philosophy, and practice of public health.