Problems of detecting fuzzy duplicates

Proceedings of the 5th International Conference on Engineering and MIS. Pub Date: 2019-06-06. DOI: 10.1145/3330431.3330455

S. Brimzhanova, S. K. Atanov, Moldamurat Khuralay, D. Kalmanova, T. Tabys


Citation count: 3

Abstract

This article discusses the problem of detecting fuzzy duplicates. Recently, much attention has been paid to reducing the computational complexity of detection algorithms by means of various heuristics. With such approximate approaches, however, a decrease in the detection rate of duplicates is observed. An important factor affecting the accuracy and completeness of duplicate detection in comparison problems is the selection of the substantive part. Another key quality requirement for fuzzy-duplicate detection algorithms is their resistance to "small" changes in the data and their ability to process them. Among the first studies in the field of finding fuzzy duplicates are the works of U. Manber and N. Heintze. In these works, sequences of adjacent letters are used to construct the sample. The dactogram includes all text substrings of a fixed length. Completeness, accuracy, and F-measure were chosen as the main indicators of algorithm quality. The algorithms were to be compared by these parameters; their mutual correlation was also to be determined, along with the joint coverage of the initial set of fuzzy-duplicate pairs by different combinations of algorithms.
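The ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "dactogram" is a fingerprint made of all fixed-length character substrings of a normalized text, that two fingerprints are compared by Jaccard overlap (a swapped-in similarity measure the abstract does not name), and that "completeness" and "accuracy" correspond to recall and precision. The function names and the default length `n = 5` are hypothetical.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Return all substrings of fixed length n (assumed form of the fingerprint)."""
    # Assumption: lowercase and collapse whitespace so that "small" changes
    # in formatting do not alter the fingerprint.
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of common elements between two fingerprints, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def precision_recall_f1(predicted: set, actual: set) -> tuple[float, float, float]:
    """Score a set of predicted duplicate pairs against the true pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    first = "Problems of detecting fuzzy duplicates"
    second = "Problems of detecting  fuzzy duplicate"
    print(jaccard(char_ngrams(first), char_ngrams(second)))
    print(precision_recall_f1({(1, 2), (3, 4)}, {(1, 2), (5, 6)}))
```

A pair of texts would be reported as fuzzy duplicates when the overlap exceeds a chosen threshold. The threshold and the substring length control the trade-off between resistance to small changes and the number of false matches.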

