A corpus of drafts of NLP papers from non-native English speakers

Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval. Pub Date: 2022-12-16. DOI: 10.1145/3582768.3582797

Haotong Wang, Liyan Wang, Lepage Yves

Abstract

We created an English parallel corpus of 3,005 sentence pairs, each containing a well-polished text from ACL Anthology Reference Corpus (ACL-ARC) [1] and corresponding restated drafts collected from 26 non-native writers. The purpose of this paper is to explore the writing features of the drafts from non-native English speakers, so as to benefit research in Academic Writing Aid Systems. We present a feature analysis of the corpus based on handcrafted features. To assess utility, we formulate a draft identification task to automatically recognize drafts from ground truth texts based on hybrid features. We show that the combination of deep semantic features with the optimal handcrafted features improves identification accuracy on the collected data, up to 84.57%.
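
The abstract describes "hybrid features" that combine handcrafted features with deep semantic features for draft identification. The following is only a minimal sketch of that general idea, not the authors' implementation: the three surface features, the function names `handcrafted_features` and `hybrid_features`, the example sentence pair, and the two-dimensional placeholder embeddings are all assumptions made up for illustration. In practice the semantic vector would come from a sentence encoder, and the concatenated vector would be passed to a binary classifier of one's choice.

```python
from typing import List, Sequence


def handcrafted_features(sentence: str) -> List[float]:
    """Illustrative surface features; NOT the feature set used in the paper."""
    tokens = sentence.split()
    if not tokens:
        return [0.0, 0.0, 0.0]
    n_tokens = float(len(tokens))
    # Average token length in characters.
    avg_word_len = sum(len(t) for t in tokens) / n_tokens
    # Type-token ratio: distinct lowercased tokens over all tokens.
    type_token_ratio = len({t.lower() for t in tokens}) / n_tokens
    return [n_tokens, avg_word_len, type_token_ratio]


def hybrid_features(sentence: str, semantic_vector: Sequence[float]) -> List[float]:
    """Concatenate handcrafted features with a precomputed sentence embedding."""
    return handcrafted_features(sentence) + list(semantic_vector)


# Hypothetical sentence pair and placeholder embeddings (made up for illustration).
polished = "We propose a simple method for sentence alignment ."
draft = "We give one easy way for to align the sentences ."
polished_vec = hybrid_features(polished, [0.1, 0.2])
draft_vec = hybrid_features(draft, [0.3, 0.1])
```

Each resulting vector has the handcrafted values first and the embedding values after them, so a classifier trained on such vectors sees both kinds of evidence at once.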
