Hoang-Quoc Nguyen-Son, Ngoc-Dung T. Tieu, H. Nguyen, Junichi Yamagishi, Isao Echizen
Identifying computer-generated text using statistical analysis
2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), December 2017
DOI: 10.1109/APSIPA.2017.8282270
Citations: 1
Abstract
Computer-based automatically generated text is used in various applications (e.g., text summarization, machine translation) and has come to play an important role in daily life. However, computer-generated text may produce confusing information due to translation errors and inappropriate wording caused by faulty language processing, which could be a critical issue in presidential elections and product advertisements. Previous methods for detecting computer-generated text typically estimate text fluency, but this may not be useful in the near future due to the development of neural-network-based natural language generation that produces wording close to human-crafted wording. A different approach to detecting computer-generated text is thus needed. We hypothesize that human-crafted wording is more consistent than that of a computer. For instance, Zipf's law states that the most frequent word in human-written text has approximately twice the frequency of the second most frequent word, nearly three times that of the third most frequent word, and so on. We found that this is not true in the case of computer-generated text. We hence propose a method to identify computer-generated text on the basis of statistics. First, the word distribution frequencies are compared with the corresponding Zipfian distributions to extract the frequency features. Next, complex phrase features are extracted because human-generated text contains more complex phrases than computer-generated text. Finally, the higher consistency of the human-generated text is quantified at both the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships, which are integrated into consistency features. The combination of the frequencies, the complex phrases, and the consistency features was evaluated for 100 English books written originally in English and 100 English books translated from Finnish. The results show that our method achieves better performance (accuracy = 98.0%; equal error rate = 2.9%) compared with the most suitable method for books using parsing tree feature extraction. Evaluation using two other languages (French and Dutch) showed similar results. The proposed method thus works consistently in various languages.
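The frequency-feature idea described in the abstract — comparing a text's observed word frequencies against the ideal Zipfian curve f(r) = f(1)/r — can be sketched as a single scalar feature. The snippet below is an illustrative approximation only, not the authors' published feature extraction; the function name and the mean-squared-log-error scoring are assumptions made for demonstration.

```python
from collections import Counter
import math

def zipf_deviation(text: str) -> float:
    """Mean squared log-error between observed word-frequency ranks and
    the ideal Zipfian curve f(r) = f(1) / r.

    A larger value means the text deviates more from Zipf's law, which
    the paper hypothesizes is characteristic of computer-generated text.
    (Illustrative feature only; not the authors' exact formulation.)
    """
    words = text.lower().split()
    # Frequencies sorted from most to least frequent give the rank order.
    counts = sorted(Counter(words).values(), reverse=True)
    if not counts:
        return 0.0
    f1 = counts[0]
    # Compare each rank's observed frequency to the Zipf prediction f1 / rank.
    errors = [
        (math.log(freq) - math.log(f1 / rank)) ** 2
        for rank, freq in enumerate(counts, start=1)
    ]
    return sum(errors) / len(errors)
```

A perfectly Zipfian toy text (frequencies 6, 3, 2 at ranks 1, 2, 3) scores 0.0, while a flat distribution (every word equally frequent) scores above zero; a classifier would use such scores, alongside the complex-phrase and consistency features, as input.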