Broad coverage paragraph segmentation across languages and domains

ACM Trans. Speech Lang. Process. Pub Date : 2006-07-01 DOI:10.1145/1149290.1151098

C. Sporleder, Mirella Lapata

Citations: 21

Abstract

This article considers the problem of automatic paragraph segmentation. The task is relevant for speech-to-text applications whose output transcripts do not usually contain punctuation or paragraph indentation and are naturally difficult to read and process. Text-to-text generation applications (e.g., summarization) could also benefit from an automatic paragraph segmentation mechanism which indicates topic shifts and provides visual targets to the reader. We present a paragraph segmentation model which exploits a variety of knowledge sources (including textual cues, syntactic and discourse-related information) and evaluate its performance in different languages and domains. Our experiments demonstrate that the proposed approach significantly outperforms our baselines and in many cases comes to within a few percent of human performance. Finally, we integrate our method with a single-document summarizer and show that it is useful for structuring the output of automatically generated text.

Source journal

ACM Trans. Speech Lang. Process.

Self-citation rate

0.00%

Articles published