An automatic normalized cut topic segmentation approach

2010 IEEE Youth Conference on Information, Computing and Telecommunications. Pub Date: 2010-11-01. DOI: 10.1109/YCICT.2010.5713130

Yuanyuan Jin, Bao-jian Gao, Ziran Zhang

Citations: 0

Abstract

This paper presents an automatic topic segmentation approach for Chinese broadcast news based on subword-level normalized cut (Ncut). Classical Ncut requires the number of segments to be specified in advance, and the approach is designed to remove that limitation. We represent a text as a weighted undirected graph in which the nodes correspond to sentences and the edge weights describe inter-sentence lexical similarity at the Chinese subword level, so that the segmentation task is formalized as a graph-partitioning problem under the Ncut criterion. To overcome the limitation, we propose a method inspired by text dotplotting that evaluates candidate segmentations and selects the optimal number of segments automatically. Finally, we place the whole approach in a machine learning framework and learn the best parameters on the training set. Our method achieves a relative improvement of 3% over non-automatic subword Ncut, which was also the previous best method.
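The graph formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: overlapping character bigrams stand in for Chinese subwords, cosine similarity stands in for the edge weights, and the function names (`char_bigrams`, `similarity_matrix`, `ncut_value`, `best_two_way_split`) and the toy sentences are hypothetical. The score is the standard Ncut sum, cut(A, V \ A) / assoc(A, V) over contiguous segments A. The number of segments is fixed at two here, and the dotplotting-inspired selection of the segment count is not attempted.

```python
import math
from collections import Counter


def char_bigrams(sentence):
    """Count overlapping character bigrams (assumed stand-in for subwords)."""
    chars = [c for c in sentence if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(sentences):
    """Weighted undirected graph: nodes are sentences, weights are similarities."""
    vecs = [char_bigrams(s) for s in sentences]
    n = len(vecs)
    return [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
            for i in range(n)]


def ncut_value(w, boundaries):
    """Ncut of a linear segmentation; each boundary is a segment start index."""
    n = len(w)
    edges = [0] + list(boundaries) + [n]
    total = 0.0
    for start, end in zip(edges, edges[1:]):
        # assoc: weight from the segment to every node; cut: weight leaving it.
        assoc = sum(w[i][j] for i in range(start, end) for j in range(n))
        cut = sum(w[i][j] for i in range(start, end) for j in range(n)
                  if not (start <= j < end))
        if assoc > 0:
            total += cut / assoc
    return total


def best_two_way_split(w):
    """Boundary that minimizes Ncut when the text is cut into two segments."""
    return min(range(1, len(w)), key=lambda b: ncut_value(w, [b]))


# Toy input (invented for illustration): two sentences per topic.
sentences = ["股市今日上涨", "股市今日下跌", "台风明天登陆", "台风明天减弱"]
w = similarity_matrix(sentences)
print(best_two_way_split(w))
```

On this toy input the two topics share no bigrams, so the cut between the second and third sentences has zero cost and is the one selected. Real broadcast news would need a proper subword tokenizer and the learned parameters the abstract mentions.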


