Decoding AI and Human Authorship: Nuances Revealed Through NLP and Statistical Analysis

arXiv - CS - Digital Libraries. Pub Date: 2024-07-15. DOI: arxiv-2408.00769

Mayowa Akinwande, Oluwaseyi Adeliyi, Toyyibat Yussuph



Abstract

This research explores the nuanced differences between texts produced by AI and those written by humans, aiming to elucidate how the two express language differently. Through comprehensive statistical data analysis, the study investigates various linguistic traits, patterns of creativity, and potential biases inherent in human-written and AI-generated texts. The significance of this research lies in its contribution to understanding AI's creative capabilities and its impact on literature, communication, and societal frameworks. By examining a meticulously curated dataset of 500K essays spanning diverse topics and genres, either generated by LLMs or written by humans, the study uncovers the deeper layers of linguistic expression and provides insights into the cognitive processes underlying both AI and human-driven textual compositions. The analysis revealed that human-authored essays tend to have a higher total word count on average than AI-generated essays but a shorter average word length. While both groups exhibit high levels of fluency, the vocabulary diversity of human-authored content is higher than that of AI-generated content. However, AI-generated essays show a slightly higher level of novelty, suggesting the potential for generating more original content through AI systems. The paper addresses challenges in assessing the language generation capabilities of AI models and emphasizes the importance of datasets that reflect the complexities of human-AI collaborative writing. Through systematic preprocessing and rigorous statistical analysis, this study offers valuable insights into the evolving landscape of AI-generated content and informs future developments in natural language processing (NLP).
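The abstract compares essays on total word count, average word length, and vocabulary diversity, but does not say how each quantity was computed. The sketch below is one plausible way to measure them for a single text, not the authors' method. The function name `text_metrics`, the simple regex tokenizer, and the use of the type-token ratio as a proxy for vocabulary diversity are all assumptions made for illustration.

```python
import re


def text_metrics(text: str) -> dict:
    """Compute simple surface statistics for one essay (illustrative only)."""
    # Assumed tokenizer: lowercase alphabetic words, optionally with one
    # internal apostrophe (e.g. "don't"). The paper's tokenizer is not stated.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    n = len(words)
    if n == 0:
        return {"word_count": 0, "avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        # Total number of word tokens in the text.
        "word_count": n,
        # Mean number of characters per token.
        "avg_word_length": sum(len(w) for w in words) / n,
        # Assumed diversity proxy: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / n,
    }


if __name__ == "__main__":
    print(text_metrics("The cat sat on the mat"))
```

The type-token ratio tends to fall as texts get longer, so comparing essays of very different lengths with it can be misleading; a length-normalized diversity measure may be preferable in practice.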

