AI-Generated Versus Human Text: Introducing a New Dataset for Benchmarking and Analysis

Ali Al Bataineh; Rachel Sickler; Kerry Kurcz; Kristen Pedersen
IEEE Transactions on Artificial Intelligence, vol. 6, no. 8, pp. 2241-2252. Published 2025-02-20. DOI: 10.1109/TAI.2025.3544183
Full text: https://ieeexplore.ieee.org/document/10896944/

Abstract

Artificial intelligence (AI) is increasingly embedded in our everyday lives. Since OpenAI introduced ChatGPT in November 2022, people have been able to ask a bot to generate comprehensive writeups in seconds. This transformative technology also raises ethical, safety, and other general concerns. It is therefore important to harness the power of AI to determine whether a body of text was generated by AI or written by a human. In this article, we create and curate a medium-sized dataset of 10,000 records containing both human and machine-generated text and use it to train a reliable model that accurately distinguishes between the two. First, we use DistilGPT-2 with various inputs to generate machine text. Then, we acquire an equal-sized sample of human-generated text. All the text is cleaned, explored, and visualized using the uniform manifold approximation and projection (UMAP) dimensionality reduction technique. Finally, the text is transformed into vectors using several techniques, including bag of words, term frequency-inverse document frequency (TF-IDF), bidirectional encoder representations from transformers (BERT), and neural network-based embeddings. Machine learning experiments are then performed with traditional models such as logistic regression, random forest, and XGBoost, as well as deep learning models such as long short-term memory (LSTM), convolutional neural network (CNN), and CNN-LSTM. Across all vectorization strategies and machine learning algorithms, we measure accuracy, precision, recall, and F1 scores. We also time each experiment. Each model completes its training within an hour, and we observe scores above 90%. We then apply the Shapley additive explanations (SHAP) package to the machine learning models to explore whether and how the models can be explained, as a further validation of the results. Lastly, we deploy our TF-IDF random forest model to a user-friendly web application using the Streamlit framework, allowing users without coding expertise to interact with the model.
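
As an illustration of two steps named in the abstract, the following minimal Python sketch computes TF-IDF vectors and the four reported metrics using only the standard library. It is not the authors' code: the whitespace tokenizer, the unsmoothed weighting (tf = count / document length, idf = ln(N / df)), and the function names `tfidf_vectors` and `binary_scores` are assumptions made for this example. Library implementations typically apply smoothing and normalization that this sketch omits.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of strings.

    Assumed variant: tf = term count / document length, idf = ln(N / df).
    """
    # Hypothetical tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero on empty documents
        vectors.append(
            [(counts[term] / total) * math.log(n_docs / df[term]) for term in vocab]
        )
    return vocab, vectors


def binary_scores(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy inputs for demonstration only; they are not drawn from the dataset.
    vocab, vectors = tfidf_vectors(["the cat sat", "the dog sat", "a generated text sample"])
    print(vocab)
    print(vectors[0])
    print(binary_scores([1, 0, 1, 1], [1, 0, 0, 1]))
```

In a full pipeline, vectors of this kind would be passed to a classifier such as a random forest, and the scores would be computed on a held-out test split.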