Towards a large sized curated and annotated corpus for discriminating between human written and AI generated texts: A case study of text sourced from Wikipedia and ChatGPT
Authors: Aakash Singh, Deepawali Sharma, Abhirup Nandy, Vivek Kumar Singh
Journal: Natural Language Processing Journal, Volume 6, Article 100050
DOI: 10.1016/j.nlp.2023.100050
Published: 2023-12-28 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S294971912300047X
Citation count: 0
Abstract
Recently launched large language models can generate text and engage in human-like conversation and question answering. Owing to these capabilities, the models are now widely used for a variety of purposes, ranging from question answering to writing scholarly articles. Their outputs are now so fluent that it is becoming very difficult to distinguish texts written by human beings from those produced by these programs. This has led to several problems, such as out-of-context literature, lack of novelty in articles, plagiarism, and missing attribution and citations to the original texts. There is therefore a need for suitable computational resources for developing algorithmic approaches that can identify and discriminate between human-written and machine-generated texts. This work contributes towards that research problem by providing a large, curated, and annotated corpus comprising 44,162 text articles sourced from Wikipedia and ChatGPT. Some baseline models are also applied to the developed dataset, and the results obtained are analyzed and discussed. The curated corpus offers a valuable resource that can advance research in this important area and thereby contribute to the responsible and ethical integration of AI language models into various fields.
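The abstract mentions applying baseline models to the human-vs-AI classification task the corpus supports. The paper does not specify the models here, so the following is only a minimal, hypothetical sketch of such a baseline: a naive Bayes-style classifier over word frequencies, with the "human" (Wikipedia) and "ai" (ChatGPT) labels and all example texts invented for illustration.

```python
# Hypothetical baseline sketch for human-vs-AI text classification.
# Labels ("human"/"ai") and training texts are illustrative assumptions,
# not the models or data described in the paper.
import math
from collections import Counter


def train(labeled_texts):
    """Count word frequencies per class from (text, label) pairs."""
    counts = {"human": Counter(), "ai": Counter()}
    for text, label in labeled_texts:
        counts[label].update(text.lower().split())
    return counts


def classify(counts, text):
    """Pick the class with the highest add-one-smoothed log-likelihood."""
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        vocab = len(c) + 1  # +1 for unseen words
        scores[label] = sum(
            math.log((c[w] + 1) / (total + vocab))
            for w in text.lower().split()
        )
    return max(scores, key=scores.get)
```

In practice a corpus of this size (44,162 articles) would call for stronger features and models, but the same train/classify split applies: fit per-class statistics on annotated articles, then score unseen texts under each class.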