Towards a large sized curated and annotated corpus for discriminating between human written and AI generated texts: A case study of text sourced from Wikipedia and ChatGPT

Aakash Singh, Deepawali Sharma, Abhirup Nandy, Vivek Kumar Singh
{"title":"为区分人类撰写的文本和人工智能生成的文本而开发的大型策划和注释语料库:来自维基百科和 ChatGPT 的文本案例研究","authors":"Aakash Singh,&nbsp;Deepawali Sharma,&nbsp;Abhirup Nandy,&nbsp;Vivek Kumar Singh","doi":"10.1016/j.nlp.2023.100050","DOIUrl":null,"url":null,"abstract":"<div><p>The recently launched large language models have the capability to generate text and engage in human-like conversations and question-answering. Owing to their capabilities, these models are now being widely used for a variety of purposes, ranging from question answering to writing scholarly articles. These models are producing such good outputs that it is becoming very difficult to identify what texts are written by human beings and what by these programs. This has also led to different kinds of problems such as out-of-context literature, lack of novelty in articles, and issues of plagiarism and lack of proper attribution and citations to the original texts. Therefore, there is a need for suitable computational resources for developing algorithmic approaches that can identify and discriminate between human and machine generated texts. This work contributes towards this research problem by providing a large sized curated and annotated corpus comprising of 44,162 text articles sourced from Wikipedia and ChatGPT. Some baseline models are also applied on the developed dataset and the results obtained are analyzed and discussed. The curated corpus offers a valuable resource that can be used to advance the research in this important area and thereby contribute to the responsible and ethical integration of AI language models into various fields.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"6 ","pages":"Article 100050"},"PeriodicalIF":0.0000,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S294971912300047X/pdfft?md5=48afd2554f84aa4af2b6e1f9fb5dbc60&pid=1-s2.0-S294971912300047X-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Towards a large sized curated and annotated corpus for discriminating between human written and AI generated texts: A case study of text sourced from Wikipedia and ChatGPT\",\"authors\":\"Aakash Singh,&nbsp;Deepawali Sharma,&nbsp;Abhirup Nandy,&nbsp;Vivek Kumar Singh\",\"doi\":\"10.1016/j.nlp.2023.100050\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>The recently launched large language models have the capability to generate text and engage in human-like conversations and question-answering. Owing to their capabilities, these models are now being widely used for a variety of purposes, ranging from question answering to writing scholarly articles. These models are producing such good outputs that it is becoming very difficult to identify what texts are written by human beings and what by these programs. This has also led to different kinds of problems such as out-of-context literature, lack of novelty in articles, and issues of plagiarism and lack of proper attribution and citations to the original texts. Therefore, there is a need for suitable computational resources for developing algorithmic approaches that can identify and discriminate between human and machine generated texts. This work contributes towards this research problem by providing a large sized curated and annotated corpus comprising of 44,162 text articles sourced from Wikipedia and ChatGPT. 
Some baseline models are also applied on the developed dataset and the results obtained are analyzed and discussed. The curated corpus offers a valuable resource that can be used to advance the research in this important area and thereby contribute to the responsible and ethical integration of AI language models into various fields.</p></div>\",\"PeriodicalId\":100944,\"journal\":{\"name\":\"Natural Language Processing Journal\",\"volume\":\"6 \",\"pages\":\"Article 100050\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S294971912300047X/pdfft?md5=48afd2554f84aa4af2b6e1f9fb5dbc60&pid=1-s2.0-S294971912300047X-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Natural Language Processing Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S294971912300047X\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S294971912300047X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract


The recently launched large language models can generate text and engage in human-like conversations and question-answering. Owing to these capabilities, the models are now being widely used for a variety of purposes, ranging from question answering to writing scholarly articles. Their outputs are of such quality that it is becoming very difficult to identify which texts are written by human beings and which by these programs. This has also led to various problems, such as out-of-context literature, lack of novelty in articles, plagiarism, and missing attribution and citations to the original texts. There is therefore a need for suitable computational resources for developing algorithmic approaches that can identify and discriminate between human- and machine-generated texts. This work contributes to this research problem by providing a large curated and annotated corpus comprising 44,162 text articles sourced from Wikipedia and ChatGPT. Some baseline models are also applied to the developed dataset, and the results obtained are analyzed and discussed. The curated corpus offers a valuable resource that can be used to advance research in this important area and thereby contribute to the responsible and ethical integration of AI language models into various fields.
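As a rough illustration of how a corpus of this kind could be used for the human-versus-AI discrimination task, the sketch below trains a simple TF-IDF plus logistic-regression baseline with scikit-learn. The file name wikipedia_chatgpt_corpus.csv and the "text"/"label" column names are assumptions made for illustration only; the abstract does not specify the paper's actual data format or which baseline models the authors applied.

```python
# Minimal sketch of a binary human-vs-AI text classifier, assuming the corpus
# is available as a CSV with a "text" column and a binary "label" column
# (0 = human-written Wikipedia article, 1 = ChatGPT-generated article).
# These names are hypothetical, not the paper's published format.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("wikipedia_chatgpt_corpus.csv")  # hypothetical file name

# Stratified split so both classes are represented in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Word- and bigram-level TF-IDF features feeding a logistic-regression classifier.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))
```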
