AI-Generated Versus Human Text: Introducing a New Dataset for Benchmarking and Analysis
Ali Al Bataineh; Rachel Sickler; Kerry Kurcz; Kristen Pedersen
IEEE Transactions on Artificial Intelligence, vol. 6, no. 8, pp. 2241-2252
Published: 20 February 2025
DOI: 10.1109/TAI.2025.3544183
https://ieeexplore.ieee.org/document/10896944/
Abstract
Artificial intelligence (AI) is increasingly embedded in our everyday lives. With the introduction of ChatGPT by OpenAI in November 2022, people can now ask a bot to generate comprehensive write-ups in seconds. This transformative technology also raises ethical, safety, and other concerns. It is therefore important to harness AI itself to determine whether a body of text was generated by a machine or written by a human. In this article, we create and curate a medium-sized dataset of 10 000 records containing both human- and machine-generated text and use it to train a reliable model that accurately distinguishes between the two. First, we use DistilGPT-2 with various inputs to generate machine text. Then, we acquire an equal sample of human-written text. All the text is cleaned, explored, and visualized using the uniform manifold approximation and projection (UMAP) dimensionality reduction technique. Next, the text is transformed into vectors using several techniques, including bag of words, term frequency-inverse document frequency (TF-IDF), bidirectional encoder representations from transformers (BERT), and neural network-based embeddings. Machine learning experiments are then performed with traditional models such as logistic regression, random forest, and XGBoost, as well as deep learning models such as long short-term memory (LSTM), convolutional neural network (CNN), and CNN-LSTM. Across all vectorization strategies and learning algorithms, we measure accuracy, precision, recall, and F1 scores, and we record the training time of each experiment. Each model completes its training within an hour, and we observe scores above 90%. We then apply the Shapley additive explanations (SHAP) package to the machine learning models to explore whether and how their predictions can be explained, further validating the results. Lastly, we deploy our TF-IDF random forest model as a user-friendly web application built with the Streamlit framework, allowing users without coding expertise to interact with the model.
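The machine-generated half of such a dataset can be produced with the Hugging Face transformers library's text-generation pipeline around DistilGPT-2. The sketch below is a minimal illustration of that step; the prompts and sampling settings are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch: generating machine text with DistilGPT-2 via the Hugging Face
# transformers pipeline. Prompts and sampling settings are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

prompts = ["The history of aviation began", "Climate change affects"]  # hypothetical prompts
machine_texts = []
for prompt in prompts:
    out = generator(
        prompt,
        max_new_tokens=150,
        do_sample=True,
        top_k=50,
        num_return_sequences=1,
        pad_token_id=generator.tokenizer.eos_token_id,
    )
    machine_texts.append(out[0]["generated_text"])
```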
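The deployed classifier pairs TF-IDF vectorization with a random forest. The following scikit-learn sketch shows that kind of pipeline end to end, from fitting to reporting accuracy, precision, recall, and F1; the file name, column names, and hyperparameters are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of a TF-IDF + random forest text classifier in scikit-learn.
# The CSV file and its "text"/"label" columns are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("ai_vs_human.csv")  # hypothetical dataset of human and machine text

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10_000, stop_words="english")),
    ("rf", RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)),
])
clf.fit(X_train, y_train)

# Accuracy, precision, recall, and F1 for both classes on the held-out split.
print(classification_report(y_test, clf.predict(X_test)))
```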
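SHAP's TreeExplainer is a standard way to attribute a tree-ensemble prediction to individual input features. The sketch below assumes the fitted `clf` pipeline and `X_test` split from the previous sketch; the sample size, densification step, and top-20 cutoff are illustrative choices rather than the paper's procedure.

```python
# Sketch: ranking TF-IDF terms by their mean absolute SHAP contribution
# to the random forest's AI-vs-human decision. Assumes `clf` and `X_test`
# from the preceding sketch.
import numpy as np
import shap

tfidf = clf.named_steps["tfidf"]
rf = clf.named_steps["rf"]

# TreeExplainer works on a dense feature matrix, so densify a small sample.
X_sample = tfidf.transform(X_test[:200]).toarray()
shap_values = np.abs(np.asarray(shap.TreeExplainer(rf).shap_values(X_sample)))

# Depending on the SHAP version, per-class attributions land on a leading or
# trailing axis; average over every axis except the feature axis.
feature_axis = shap_values.shape.index(X_sample.shape[1])
importance = shap_values.mean(
    axis=tuple(i for i in range(shap_values.ndim) if i != feature_axis)
)

# Print the 20 terms that contribute most, on average, to the decision.
terms = tfidf.get_feature_names_out()
for idx in importance.argsort()[::-1][:20]:
    print(f"{terms[idx]:<20} {importance[idx]:.4f}")
```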
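A Streamlit front end for such a model can stay very small. The sketch below assumes the trained pipeline has been serialized to a hypothetical `tfidf_rf_model.pkl`; the widget labels and confidence display are illustrative. Saved as `app.py`, it would be launched with `streamlit run app.py`.

```python
# Sketch of a Streamlit app that classifies pasted text with a serialized
# TF-IDF + random forest pipeline. The pickle file name is hypothetical.
import pickle

import streamlit as st


@st.cache_resource
def load_model():
    with open("tfidf_rf_model.pkl", "rb") as f:  # hypothetical serialized pipeline
        return pickle.load(f)


model = load_model()

st.title("AI-Generated vs. Human Text Classifier")
user_text = st.text_area("Paste a passage of text:")

if st.button("Classify") and user_text.strip():
    prediction = model.predict([user_text])[0]
    confidence = model.predict_proba([user_text]).max()
    st.write(f"Prediction: **{prediction}** (confidence {confidence:.2f})")
```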