Artificial intelligence strategies based on random forests for detection of AI-generated content in public health
Authors: Igor V. Pantic, Snezana Mugosa
Journal: Public Health, volume 242 (JCR Q1, Public, Environmental & Occupational Health; impact factor 3.9)
Publication date: 2025-04-07
DOI: 10.1016/j.puhe.2025.03.029
URL: https://www.sciencedirect.com/science/article/pii/S0033350625001489
Objectives
To train and test a Random Forest machine learning model that can distinguish AI-generated from human-generated textual content in the domain of public health and public health policy.
Study design
Supervised machine learning study.
Methods
A dataset comprising 1000 human-generated and 1000 AI-generated paragraphs was created. Textual features were extracted using TF-IDF vectorization, which calculates term frequency (TF) and inverse document frequency (IDF) and combines the two measures to produce a score for each term. The Random Forest model was trained and tested using the Scikit-Learn library and the Jupyter Notebook service in the Google Colab cloud-based environment, with Google CPU hardware acceleration.
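The TF-IDF scoring described above can be illustrated with a minimal, standard-library sketch. The abstract does not state which weighting or tokenizer settings the authors used; the version below assumes raw term counts, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 row normalization, which are the defaults of Scikit-Learn's `TfidfVectorizer`. The whitespace tokenizer, the `tfidf` helper, and the toy paragraphs are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows), where rows[i][j] is the TF-IDF weight of
    vocab[j] in docs[i].

    Assumed weighting (not confirmed by the abstract): raw counts for TF,
    smoothed IDF ln((1 + n) / (1 + df)) + 1, then L2 normalization per row.
    """
    # Simplified tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]  # TF multiplied by IDF
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Hypothetical toy paragraphs, used only to exercise the function.
docs = [
    "vaccination policy reduces disease burden",
    "public health policy guides vaccination",
    "screening programs support public health",
]
vocab, rows = tfidf(docs)
```

In this toy corpus, a term that occurs in only one paragraph (such as "reduces") receives a higher weight than a term shared by two paragraphs (such as "policy"). Each row of `rows` is a feature vector of the kind a Random Forest classifier could be trained on.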
Results
The model achieved a classification accuracy of 81.8% and an area under the ROC curve of 0.9. For human-generated content, precision, recall, and F1-score were 0.85, 0.78, and 0.81, respectively. For AI-generated content, these metrics were 0.79, 0.86, and 0.82. The MCC value of 0.64 indicated moderate to strong predictive power. The model demonstrated a sensitivity (recall for the AI-generated class) of 0.86 and a specificity (recall for the human-generated class) of 0.78.
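As a rough consistency check on the reported metrics, the F1-scores can be recomputed from the stated precision and recall values, and the MCC can be estimated from the stated sensitivity and specificity. The sketch below assumes a test set with equal numbers of human-generated and AI-generated paragraphs; the abstract does not give the test split or the confusion matrix, so the per-class rates used here are an approximation rather than the authors' actual counts.

```python
import math


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient from confusion-matrix entries."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


# Precision and recall as reported in the abstract.
f1_human = f1(0.85, 0.78)
f1_ai = f1(0.79, 0.86)

# ASSUMPTION: balanced test set, so each class contributes one unit of
# weight and the confusion matrix can be expressed as per-class rates.
sensitivity, specificity = 0.86, 0.78   # reported recall for AI / human
tp, fn = sensitivity, 1 - sensitivity   # AI-generated paragraphs
tn, fp = specificity, 1 - specificity   # human-generated paragraphs

mcc_est = mcc(tp, fn, tn, fp)
balanced_acc = (sensitivity + specificity) / 2

print(f"F1 human: {f1_human:.2f}, F1 AI: {f1_ai:.2f}")
print(f"Estimated MCC: {mcc_est:.2f}, balanced accuracy: {balanced_acc:.2f}")
```

Under this assumption the recomputed values round to 0.81 and 0.82 for the F1-scores and 0.64 for the MCC, in agreement with the reported figures. The balanced accuracy of 0.82 is close to the reported accuracy of 81.8%.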
Conclusions
The model exhibited acceptable performance, as measured by classification accuracy, area under the receiver operating characteristic curve, and other metrics. This approach can be further improved by incorporating additional supervised machine learning techniques and serves as a foundation for the future development of a sophisticated and innovative AI system. Such a system could play a crucial role in combating misinformation and enhancing public trust across various government platforms, media outlets, and social networks.
About the journal:
Public Health is an international, multidisciplinary peer-reviewed journal. It publishes original papers, reviews and short reports on all aspects of the science, philosophy, and practice of public health.