Problems of detecting fuzzy duplicates

S. Brimzhanova, S. K. Atanov, Moldamurat Khuralay, D. Kalmanova, T. Tabys
Proceedings of the 5th International Conference on Engineering and MIS, published 2019-06-06
DOI: 10.1145/3330431.3330455 (https://doi.org/10.1145/3330431.3330455)
Citations: 3

Abstract

This article discusses the problem of detecting fuzzy duplicates. Recently, much attention has been paid to reducing the computational complexity of detection algorithms through the choice of various heuristics. Such approximate approaches, however, lower the detection rate of duplicates. An important factor affecting the accuracy and completeness of duplicate identification in comparison problems is the selection of the substantive part of the data. Another key quality requirement for fuzzy duplicate detection algorithms is resistance to "small" changes in the data and the ability to process them. Among the first studies in the field of finding fuzzy duplicates are the works of U. Manber and N. Heintze. In these works, sequences of adjacent letters are used to construct the sample: the "dactogram" (fingerprint) includes all text substrings of a fixed length. Completeness (recall), accuracy (precision) and F-measure were chosen as the main indicators of algorithm quality. The intent was to compare the algorithms by these measures, and also to determine their mutual correlation and the joint coverage of the initial set of fuzzy duplicate pairs by different combinations of algorithms.
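
The fixed-length substring fingerprint and the three quality indicators named in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' implementation, so everything here is an assumption made for illustration: the function names, the substring length k = 5, the whitespace and case normalization, and the use of Jaccard similarity to compare two fingerprints. Completeness and accuracy are read as recall and precision.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def shingles(text: str, k: int = 5) -> Set[str]:
    """Return the set of all length-k character substrings of text.

    Normalization (lowercasing, collapsing whitespace) is an assumption,
    added so that trivial formatting changes do not alter the fingerprint.
    """
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        # Text shorter than k: treat the whole text as a single substring.
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two fingerprints (assumed comparison measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def precision_recall_f1(predicted: Set[Pair], actual: Set[Pair]) -> Tuple[float, float, float]:
    """Score detected duplicate pairs against a reference set of true pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    doc_a = "Problems of detecting fuzzy duplicates"
    doc_b = "Problems of detecting  fuzzy duplicate"  # a "small" change
    print(jaccard(shingles(doc_a), shingles(doc_b)))
```

Two documents would be flagged as fuzzy duplicates when their similarity exceeds a chosen threshold; the threshold, like k, trades completeness against accuracy and would have to be tuned on labeled pairs.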