An automatic normalized cut topic segmentation approach

2010 IEEE Youth Conference on Information, Computing and Telecommunications. Pub Date: 2010-11-01. DOI: 10.1109/YCICT.2010.5713130

Yuanyuan Jin, Bao-jian Gao, Ziran Zhang



Abstract

This paper presents an automatic topic segmentation approach for Chinese broadcast news based on subword-level normalized cut (Ncut). Classical Ncut requires the number of segments to be set in advance, and the approach removes this limitation. We model a text as a weighted undirected graph in which nodes correspond to sentences and edge weights describe inter-sentence lexical similarity at the Chinese subword level, so that segmentation is formalized as a graph-partitioning problem under the Ncut criterion. To avoid fixing the number of segments beforehand, we propose a method inspired by text dotplotting that evaluates candidate segmentations and selects the optimal number of segments automatically. Finally, we place the whole approach in a machine learning framework and learn the best parameters on the training set. Our method achieves a relative improvement of 3% over non-automatic subword Ncut, which was also the previous best method.
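The graph formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes overlapping character bigrams as a stand-in for Chinese subwords, cosine similarity as the edge weight, and an exhaustive search over contiguous segmentations for a hand-chosen number of segments `k`. The function names (`char_bigrams`, `similarity_matrix`, `ncut_value`, `best_segmentation`) and the sample sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def char_bigrams(sentence):
    """Overlapping character bigrams, an assumed stand-in for subwords."""
    chars = [c for c in sentence if not c.isspace()]
    return Counter("".join(chars[i:i + 2]) for i in range(len(chars) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences):
    """Symmetric edge-weight matrix of the sentence graph (zero diagonal)."""
    vecs = [char_bigrams(s) for s in sentences]
    n = len(vecs)
    return [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
            for i in range(n)]


def ncut_value(weights, boundaries):
    """Sum over segments of cut(A, V - A) / assoc(A, V).

    `boundaries` holds the start index of every segment after the first.
    """
    n = len(weights)
    edges = [0] + list(boundaries) + [n]
    total = 0.0
    for start, end in zip(edges, edges[1:]):
        inside = range(start, end)
        assoc = sum(weights[i][j] for i in inside for j in range(n))
        cut = sum(weights[i][j] for i in inside for j in range(n)
                  if not start <= j < end)
        # A segment with no edges at all is treated as maximally bad.
        total += cut / assoc if assoc else 1.0
    return total


def best_segmentation(weights, k):
    """Exhaustively pick the k-way contiguous segmentation with lowest Ncut."""
    n = len(weights)
    return min(combinations(range(1, n), k - 1),
               key=lambda b: ncut_value(weights, b))


# Hypothetical toy input: three sentences on stocks, then three on a typhoon.
sentences = [
    "股市今天大幅上涨",
    "股市上涨带动基金",
    "基金投资者看好股市",
    "台风明天登陆沿海",
    "沿海地区防御台风",
    "台风带来暴雨",
]
W = similarity_matrix(sentences)
print(best_segmentation(W, 2))
```

Here `k` is supplied by hand, which is exactly the limitation of classical Ncut that the abstract describes. The dotplotting-inspired selection of the number of segments is not reproduced, because the abstract does not specify it. The exhaustive search is only practical for very short texts.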

