Domain-Independent Unsupervised Text Segmentation for Data Management

2014 IEEE International Conference on Data Mining Workshop. Pub Date: 2014-12-01. DOI: 10.1109/ICDMW.2014.118

Makoto Sakahara, S. Okada, K. Nitta


Citations: 14

Abstract

In this study, we propose a domain-independent unsupervised text segmentation method that is applicable even to a single unseen document. The proposed method segments text documents by evaluating the similarity between sentences. It is generally difficult to calculate semantic similarity between the words that comprise sentences when domain knowledge is insufficient, and this problem affects segmentation accuracy. To address it, we use word2vec to calculate semantic similarity between words: by training on large domain-independent corpora, word2vec embeds semantic relationships between words in a vector space. Furthermore, we combine semantic and collocation similarities, i.e., the features between words within a document. The proposed method applies this combined similarity to affinity propagation clustering. Similarity between sentences is defined based on the earth mover's distance between the frequencies of the obtained topical clusters. After calculating the similarity between sentences, segmentation boundaries are automatically optimized using dynamic programming. Experimental results on two datasets show that the proposed method clearly outperforms state-of-the-art domain-independent approaches and performs on par with state-of-the-art domain-dependent approaches such as those that use topic modeling.
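
The final step described above, choosing boundaries by dynamic programming over sentence similarities, can be illustrated with a minimal Python sketch. The abstract does not state the objective function, so the scoring rule used here (each within-segment sentence pair contributes its similarity minus a hypothetical `threshold`) is an assumption for illustration, not the authors' formulation. The names `segment`, `sim`, and `threshold` are likewise hypothetical, and the toy matrix stands in for the earth mover's distance based sentence similarity.

```python
from typing import List, Sequence


def segment(sim: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Return the start index of every segment (starts with 0 for non-empty input).

    Assumed objective (not taken from the paper): a segment covering sentences
    i..j-1 scores the sum of (sim[a][b] - threshold) over all pairs a < b inside
    it, so pairs more similar than `threshold` are rewarded for sharing a segment.
    The number of segments is not fixed; it falls out of the optimization.
    """
    n = len(sim)
    # pair[i][j]: sum of (sim[a][b] - threshold) for i <= a < b < j.
    pair = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 2, n + 1):
            # Extend the segment by sentence j-1: add its links to sentences i..j-2.
            added = sum(sim[a][j - 1] - threshold for a in range(i, j - 1))
            pair[i][j] = pair[i][j - 1] + added

    # best[j]: best total score for the first j sentences.
    # back[j]: start index of the last segment in that optimum.
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cand = best[i] + pair[i][j]
            if cand > best[j]:
                best[j], back[j] = cand, i

    # Walk the back-pointers from the end to recover the segment starts.
    starts = []
    j = n
    while j > 0:
        starts.append(back[j])
        j = back[j]
    return starts[::-1]


# Toy input: sentences 0-2 are mutually similar, as are sentences 3-4.
sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
print(segment(sim))  # [0, 3]
```

With this scoring rule, raising `threshold` favors more, shorter segments, and lowering it favors fewer, longer ones. A real system would need to tune or replace this rule.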


