具有最小丢失率的文本重复数据删除

International Conference on Machine Learning and Computing Pub Date : 2019-02-22 DOI:10.1145/3318299.3318369

Youming Ge, Jiefeng Wu, Genan Dai, Yubao Liu

{"title":"具有最小丢失率的文本重复数据删除","authors":"Youming Ge, Jiefeng Wu, Genan Dai, Yubao Liu","doi":"10.1145/3318299.3318369","DOIUrl":null,"url":null,"abstract":"Text deduplication is an important operation for text document analysis applications. Given a set of text documents, we often need to remove the text documents whose similarity values are not less than the specified threshold. However, if the set of similar text documents to be removed is too large, the remaining set of text documents may be not enough for text analysis. In this paper, we consider the problem on how to balance the removed set and the remaining set of text documents. We try to reduce the duplication information as much as possible with the minimum number of text documents to be removed. We propose a greedy algorithm for our problem based on the concept of similarity graph which can represent the similar relationship for a set of text documents. We also consider the incremental algorithm for the dynamic settings. The experimental results based on the real news document datasets show the efficiency of the proposed algorithms.","PeriodicalId":164987,"journal":{"name":"International Conference on Machine Learning and Computing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Text Deduplication with Minimum Loss Ratio\",\"authors\":\"Youming Ge, Jiefeng Wu, Genan Dai, Yubao Liu\",\"doi\":\"10.1145/3318299.3318369\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Text deduplication is an important operation for text document analysis applications. Given a set of text documents, we often need to remove the text documents whose similarity values are not less than the specified threshold. However, if the set of similar text documents to be removed is too large, the remaining set of text documents may be not enough for text analysis. In this paper, we consider the problem on how to balance the removed set and the remaining set of text documents. We try to reduce the duplication information as much as possible with the minimum number of text documents to be removed. We propose a greedy algorithm for our problem based on the concept of similarity graph which can represent the similar relationship for a set of text documents. We also consider the incremental algorithm for the dynamic settings. The experimental results based on the real news document datasets show the efficiency of the proposed algorithms.\",\"PeriodicalId\":164987,\"journal\":{\"name\":\"International Conference on Machine Learning and Computing\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Machine Learning and Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3318299.3318369\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Machine Learning and Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3318299.3318369","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

文本重复删除是文本文档分析应用程序的一项重要操作。给定一组文本文档，我们通常需要删除相似度值不小于指定阈值的文本文档。但是，如果要删除的类似文本文档集太大，则剩余的文本文档集可能不足以进行文本分析。在本文中，我们考虑了如何平衡文本文档的删除集和剩余集的问题。我们尽量减少重复信息，尽量减少要删除的文本文档的数量。我们提出了一种基于相似图概念的贪心算法，该算法可以表示一组文本文档的相似关系。我们还考虑了动态设置的增量算法。基于真实新闻文档数据集的实验结果表明了所提算法的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Text Deduplication with Minimum Loss Ratio

Text deduplication is an important operation for text document analysis applications. Given a set of text documents, we often need to remove the text documents whose similarity values are not less than the specified threshold. However, if the set of similar text documents to be removed is too large, the remaining set of text documents may be not enough for text analysis. In this paper, we consider the problem on how to balance the removed set and the remaining set of text documents. We try to reduce the duplication information as much as possible with the minimum number of text documents to be removed. We propose a greedy algorithm for our problem based on the concept of similarity graph which can represent the similar relationship for a set of text documents. We also consider the incremental algorithm for the dynamic settings. The experimental results based on the real news document datasets show the efficiency of the proposed algorithms.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Conference on Machine Learning and Computing

自引率

0.00%

发文量