Broadcast News Story Clustering via Term and Sentence Matching

2013 International Conference on Asian Language Processing Pub Date : 2013-08-17 DOI:10.1109/IALP.2013.62

Foong Kuin Yow, T. Tan


Citations: 1

Abstract


Broadcast News Story Clustering via Term and Sentence Matching

In this paper, we propose a rule-based approach that uses term and sentence matching criteria to cluster Malay broadcast news into different stories. The proposed clustering method does not require users to predefine the number of clusters. The three main stages of the clustering are sentence segmentation, indexing, and term and sentence matching clustering. The sentences in the transcription are segmented before indexing. Indexing involves tokenization, stop word removal, stemming, term selection and term representation. A vector space model (VSM) is used to represent the terms and sentences as vectors. The sentences are then grouped into clusters by using term and sentence matching thresholds. The proposed approach achieves significantly better accuracy than the baseline approaches.
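The abstract names the three stages but does not give the actual matching rules or threshold values, so the following is only a minimal sketch of how such a pipeline could look, not the authors' implementation. Everything specific in it is an assumption made for illustration: the tiny stop-word list, the toy suffix-stripping stemmer, the length-based term selection, the raw term-frequency vectors, the greedy single-pass assignment, and the two thresholds `min_shared_terms` and `min_similarity`.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would use a full Malay list.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "itu", "ini", "untuk", "pada"}
# Hypothetical suffixes for a toy stemmer; not a real Malay stemming algorithm.
SUFFIXES = ("kan", "nya", "an", "i")


def segment(transcript: str) -> list[str]:
    """Stage 1: split a transcript into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", transcript) if s.strip()]


def stem(token: str) -> str:
    """Strip the first matching suffix if at least four letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def index(sentence: str) -> Counter:
    """Stage 2: tokenize, drop stop words, select terms, stem, count terms."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # Assumed term selection rule: keep non-stop-word tokens of length >= 3.
    selected = [t for t in tokens if t not in STOP_WORDS and len(t) >= 3]
    return Counter(stem(t) for t in selected)  # term-frequency vector (VSM)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(sentences, min_shared_terms=2, min_similarity=0.2):
    """Stage 3: greedy clustering; the number of clusters is not fixed upfront.

    A sentence joins the most similar existing cluster only if it passes both
    the term-matching and the sentence-matching threshold (both assumed here);
    otherwise it starts a new cluster.
    """
    clusters = []  # each entry: {"vector": Counter, "members": [sentence index]}
    for i, sentence in enumerate(sentences):
        vec = index(sentence)
        best, best_sim = None, 0.0
        for c in clusters:
            shared = len(vec.keys() & c["vector"].keys())
            sim = cosine(vec, c["vector"])
            if shared >= min_shared_terms and sim >= min_similarity and sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"vector": Counter(vec), "members": [i]})
        else:
            best["vector"].update(vec)  # accumulate term counts for the story
            best["members"].append(i)
    return [c["members"] for c in clusters]


transcript = (
    "Banjir melanda Kelantan pagi ini. Mangsa banjir di Kelantan dipindahkan. "
    "Harga minyak naik minggu ini. Kenaikan harga minyak membebankan pengguna."
)
print(cluster(segment(transcript)))  # prints [[0, 1], [2, 3]]
```

With these assumed thresholds, the two flood sentences share two stemmed terms and form one story, while the two fuel-price sentences form another. Because new clusters are opened whenever no existing cluster passes both thresholds, the cluster count emerges from the data rather than being supplied by the user.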
