Problems of detecting fuzzy duplicates
S. Brimzhanova, S. K. Atanov, Moldamurat Khuralay, D. Kalmanova, T. Tabys
Proceedings of the 5th International Conference on Engineering and MIS, 2019-06-06
DOI: 10.1145/3330431.3330455 (https://doi.org/10.1145/3330431.3330455)
Citations: 3
Abstract
This article discusses the problem of detecting fuzzy duplicates. Recently, much attention has been paid to reducing the computational complexity of detection algorithms through various heuristics. Such approximate approaches, however, lower the detection rate of duplicates. An important factor affecting the accuracy and completeness of duplicate detection in comparison problems is the selection of the substantive part of the text. Another key quality requirement for fuzzy-duplicate detection algorithms is resistance to "small" changes in the data and the ability to process them. Among the first studies on finding fuzzy duplicates are the works of U. Manber and N. Heintze, in which sequences of adjacent letters are used to construct the sample. The fingerprint (dactylogram) includes all text substrings of a fixed length. Completeness (recall), accuracy (precision), and F-measure were chosen as the main indicators of algorithm quality. The aim was to compare the algorithms on these measures and to determine their mutual correlation, as well as how well different combinations of algorithms jointly cover the initial set of fuzzy-duplicate pairs.
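The fixed-length-substring fingerprint and the quality measures mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the substring length `k = 5`, the whitespace and case normalization, and the use of Jaccard similarity to compare two fingerprints are all assumptions made for illustration.

```python
from typing import Set, Tuple


def fingerprint(text: str, k: int = 5) -> Set[str]:
    """Return the set of all length-k substrings of the normalized text.

    Normalization (lowercasing, collapsing whitespace) is an assumption;
    it makes the fingerprint less sensitive to "small" formatting changes.
    """
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        # Text shorter than k: keep it as a single substring, if non-empty.
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two fingerprints (assumed comparison measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def precision_recall_f1(predicted: Set, actual: Set) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure of predicted duplicate pairs
    against a reference set of true duplicate pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

In this sketch, two documents would be flagged as fuzzy duplicates when `jaccard(fingerprint(a), fingerprint(b))` exceeds a chosen threshold (the threshold is a tuning parameter that the abstract does not specify), and the flagged pairs would then be scored with `precision_recall_f1` against a hand-labelled set of duplicate pairs.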