Lexical Quantile-Based Text Complexity Measure

Recent Advances in Natural Language Processing, Pub Date: 2019-10-22, DOI: 10.26615/978-954-452-056-4_031

M. Eremeev, K. Vorontsov

Citations: 11

Abstract

This paper introduces a new approach to estimating the complexity of a text document. Common readability indices are based on the average length of sentences and words. In contrast to these methods, we propose counting the rare words that occur abnormally often in the document. We use a reference corpus of texts and a quantile approach to determine which words are rare and which frequencies are abnormal. We construct a general text complexity model, which can be adjusted for a specific task, and introduce two special models. The experimental design is based on a set of thematically similar pairs of Wikipedia articles, labeled through crowdsourcing. The experiments demonstrate that the proposed approach is competitive.
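
The idea of counting rare words that occur abnormally often can be illustrated with a minimal Python sketch. The abstract does not give the model's formulas, so everything here is an assumption made for illustration: the function names (`quantile`, `fit_reference`, `complexity`), the nearest-rank quantile, the use of a single global cutoff for "abnormal" counts, and the default quantile levels are hypothetical and are not the authors' method.

```python
from collections import Counter


def quantile(values, q):
    """Nearest-rank empirical q-quantile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def fit_reference(corpus, rare_q=0.5, abnormal_q=0.9):
    """Derive cutoffs from a reference corpus given as a list of token lists.

    Returns (totals, rare_cutoff, abnormal_cutoff):
      totals          - corpus-wide count of every word
      rare_cutoff     - words with a corpus count at or below this are "rare"
      abnormal_cutoff - an in-document count above this is "abnormal"
    Both cutoffs are hypothetical simplifications, not the paper's model.
    """
    totals = Counter(word for doc in corpus for word in doc)
    rare_cutoff = quantile(list(totals.values()), rare_q)
    # Pool the per-document counts of all words to get one global cutoff.
    per_doc_counts = [c for doc in corpus for c in Counter(doc).values()]
    abnormal_cutoff = quantile(per_doc_counts, abnormal_q)
    return totals, rare_cutoff, abnormal_cutoff


def complexity(doc, totals, rare_cutoff, abnormal_cutoff):
    """Count distinct words in doc that are rare and abnormally frequent.

    Words absent from the reference corpus have a corpus count of 0,
    so they are treated as rare.
    """
    counts = Counter(doc)
    return sum(
        1
        for word, count in counts.items()
        if totals[word] <= rare_cutoff and count > abnormal_cutoff
    )
```

To score a document under these assumptions, fit the cutoffs once on a tokenized reference corpus and then call `complexity(tokens, *fit_reference(reference_docs))`. A task-specific variant could, for example, weight each counted word instead of adding 1, which is one way to read the "general model" mentioned in the abstract, though the source does not confirm that design.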

