AI-Generated Versus Human Text: Introducing a New Dataset for Benchmarking and Analysis
Ali Al Bataineh; Rachel Sickler; Kerry Kurcz; Kristen Pedersen
IEEE Transactions on Artificial Intelligence, vol. 6, no. 8, pp. 2241-2252
Published: 20 February 2025
DOI: 10.1109/TAI.2025.3544183
https://ieeexplore.ieee.org/document/10896944/
Abstract
Artificial intelligence (AI) is increasingly embedded in our everyday lives. With the introduction of ChatGPT by OpenAI in November 2022, people can now ask a bot to generate comprehensive write-ups in seconds. This transformative technology also raises ethical, safety, and other concerns. It is therefore important to harness AI itself to determine whether a body of text was generated by a machine or written by a human. In this article, we create and curate a medium-sized dataset of 10 000 records containing both human- and machine-generated text and use it to train a reliable model that accurately distinguishes between the two. First, we use DistilGPT-2 with various inputs to generate machine text. Then, we acquire an equal sample of human-written text. All the text is cleaned, explored, and visualized using the uniform manifold approximation and projection (UMAP) dimensionality reduction technique. Next, the text is transformed into vectors using several techniques, including bag of words, term frequency-inverse document frequency (TF-IDF), bidirectional encoder representations from transformers (BERT), and neural network-based embeddings. Machine learning experiments are then performed with traditional models such as logistic regression, random forest, and XGBoost, as well as deep learning models such as long short-term memory (LSTM), convolutional neural network (CNN), and CNN-LSTM. Across all vectorization strategies and learning algorithms, we measure accuracy, precision, recall, and F1 scores, and we record the training time of each experiment. Each model completes its training within an hour, and we observe scores above 90%. We then apply the Shapley additive explanations (SHAP) package to the machine learning models to explore whether and how their predictions can be explained, further validating the results. Lastly, we deploy our TF-IDF random forest model as a user-friendly web application built with the Streamlit framework, allowing users without coding expertise to interact with the model.
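The machine-generated half of such a dataset can be produced with the Hugging Face transformers library's text-generation pipeline around DistilGPT-2. The sketch below is a minimal illustration of that step; the prompts and sampling settings are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch: generating machine text with DistilGPT-2 via the Hugging Face
# transformers pipeline. Prompts and sampling settings are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

prompts = ["The history of aviation began", "Climate change affects"]  # hypothetical prompts
machine_texts = []
for prompt in prompts:
    out = generator(
        prompt,
        max_new_tokens=150,
        do_sample=True,
        top_k=50,
        num_return_sequences=1,
        pad_token_id=generator.tokenizer.eos_token_id,
    )
    machine_texts.append(out[0]["generated_text"])
```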
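The deployed classifier pairs TF-IDF vectorization with a random forest. The following scikit-learn sketch shows that kind of pipeline end to end, from fitting to reporting accuracy, precision, recall, and F1; the file name, column names, and hyperparameters are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of a TF-IDF + random forest text classifier in scikit-learn.
# The CSV file and its "text"/"label" columns are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("ai_vs_human.csv")  # hypothetical dataset of human and machine text

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10_000, stop_words="english")),
    ("rf", RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)),
])
clf.fit(X_train, y_train)

# Accuracy, precision, recall, and F1 for both classes on the held-out split.
print(classification_report(y_test, clf.predict(X_test)))
```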
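SHAP's TreeExplainer is a standard way to attribute a tree-ensemble prediction to individual input features. The sketch below assumes the fitted `clf` pipeline and `X_test` split from the previous sketch; the sample size, densification step, and top-20 cutoff are illustrative choices rather than the paper's procedure.

```python
# Sketch: ranking TF-IDF terms by their mean absolute SHAP contribution
# to the random forest's AI-vs-human decision. Assumes `clf` and `X_test`
# from the preceding sketch.
import numpy as np
import shap

tfidf = clf.named_steps["tfidf"]
rf = clf.named_steps["rf"]

# TreeExplainer works on a dense feature matrix, so densify a small sample.
X_sample = tfidf.transform(X_test[:200]).toarray()
shap_values = np.abs(np.asarray(shap.TreeExplainer(rf).shap_values(X_sample)))

# Depending on the SHAP version, per-class attributions land on a leading or
# trailing axis; average over every axis except the feature axis.
feature_axis = shap_values.shape.index(X_sample.shape[1])
importance = shap_values.mean(
    axis=tuple(i for i in range(shap_values.ndim) if i != feature_axis)
)

# Print the 20 terms that contribute most, on average, to the decision.
terms = tfidf.get_feature_names_out()
for idx in importance.argsort()[::-1][:20]:
    print(f"{terms[idx]:<20} {importance[idx]:.4f}")
```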
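A Streamlit front end for such a model can stay very small. The sketch below assumes the trained pipeline has been serialized to a hypothetical `tfidf_rf_model.pkl`; the widget labels and confidence display are illustrative. Saved as `app.py`, it would be launched with `streamlit run app.py`.

```python
# Sketch of a Streamlit app that classifies pasted text with a serialized
# TF-IDF + random forest pipeline. The pickle file name is hypothetical.
import pickle

import streamlit as st


@st.cache_resource
def load_model():
    with open("tfidf_rf_model.pkl", "rb") as f:  # hypothetical serialized pipeline
        return pickle.load(f)


model = load_model()

st.title("AI-Generated vs. Human Text Classifier")
user_text = st.text_area("Paste a passage of text:")

if st.button("Classify") and user_text.strip():
    prediction = model.predict([user_text])[0]
    confidence = model.predict_proba([user_text]).max()
    st.write(f"Prediction: **{prediction}** (confidence {confidence:.2f})")
```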