A Large-Scale Chinese Long-Text Extractive Summarization Corpus

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2021-06-06. DOI: 10.1109/ICASSP39728.2021.9414946

Kai Chen, Guanyu Fu, Qingcai Chen, Baotian Hu

Citations: 3

Abstract

Recently, large-scale datasets have greatly facilitated progress in nearly all domains of Natural Language Processing. However, the lack of a large-scale Chinese corpus remains a critical bottleneck for further research on deep text summarization methods. In this paper, we publish a large-scale Chinese Long-text Extractive Summarization corpus named CLES. CLES contains about 104K pairs, originally collected from Sina Weibo. To verify the quality of the corpus, we also manually annotated relevance scores for 5,000 pairs. Our benchmark models on the proposed corpus include conventional deep-learning-based extractive models and several pre-trained BERT-based algorithms. Their performance is reported and briefly analyzed to facilitate further research on the corpus. We will release this corpus for further research.



