Similar Duplicate Record Detection of Big Data Based on Entropy Grouping Clustering
Ping-wei Zhang
2022 5th International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE)
Published: 2022-04-01
DOI: 10.1109/AEMCSE55572.2022.00131 (https://doi.org/10.1109/AEMCSE55572.2022.00131)
Citations: 0
Abstract
Current methods cannot effectively detect similar duplicate records in massive data sets. To address this, an Entropy Grouping Clustering (EGC) algorithm based on property entropy is proposed. The basic idea is to construct an entropy metric based on the similarity between objects, which is used to evaluate the importance of each property and to obtain a key property subset. The data set is then split into smaller data sets according to the key properties, and similar duplicate records are identified within each smaller data set using the Sorted-Neighborhood Method. Theoretical analysis and experiments show that the method achieves higher identification accuracy and detection efficiency, and that it can effectively solve the problem of identifying similar duplicate records in big data sets.
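
The three stages named in the abstract (score properties by entropy, split the data by a key property, run the Sorted-Neighborhood Method inside each split) can be illustrated with a minimal Python sketch. It is not the paper's EGC implementation, and several parts are assumptions: plain Shannon entropy of each property's value distribution stands in for the similarity-based entropy metric, the lowest-entropy property is taken as the grouping key while the remaining properties form the sort key, and the function names, the toy records, the window size, and the similarity threshold are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def shannon_entropy(values):
    """Shannon entropy (in bits) of one property's value distribution."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def rank_properties(records, exclude=("id",)):
    """Property names sorted by entropy, highest first (stand-in importance score)."""
    props = [p for p in records[0] if p not in exclude]
    return sorted(props, key=lambda p: shannon_entropy([r[p] for r in records]), reverse=True)


def sorted_neighborhood(group, key_fn, window=3, threshold=0.85):
    """Sort one group, then compare each record only with the next window-1 records."""
    ordered = sorted(group, key=key_fn)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if SequenceMatcher(None, key_fn(rec), key_fn(other)).ratio() >= threshold:
                pairs.append(tuple(sorted((rec["id"], other["id"]))))
    return pairs


def detect_duplicates(records, window=3, threshold=0.85):
    """Group by a key property, then run the sliding-window comparison per group."""
    ranked = rank_properties(records)
    group_prop = ranked[-1]   # assumption: the lowest-entropy property splits the data
    sort_props = ranked[:-1]  # assumption: the remaining properties form the sort key

    def key_fn(record):
        return " ".join(str(record[p]) for p in sort_props)

    groups = defaultdict(list)
    for record in records:
        groups[record[group_prop]].append(record)

    pairs = []
    for group in groups.values():
        pairs.extend(sorted_neighborhood(group, key_fn, window, threshold))
    return sorted(pairs)


# Hypothetical toy data: two near-duplicate pairs with misspelled names.
records = [
    {"id": 1, "name": "John Smith", "city": "Boston"},
    {"id": 2, "name": "Jon Smith", "city": "Boston"},
    {"id": 3, "name": "Mary Jones", "city": "Boston"},
    {"id": 4, "name": "Peter Brown", "city": "Denver"},
    {"id": 5, "name": "Alice Wong", "city": "Denver"},
    {"id": 6, "name": "Alice Wnog", "city": "Denver"},
]
pairs = detect_duplicates(records)
print(pairs)
```

Splitting first means records are only compared inside their own group, and the window limits each record to a few neighbors in sorted order, so the number of comparisons grows roughly linearly with group size rather than quadratically. The cost is that duplicates whose grouping value differs are never compared, which is why the choice of key property matters.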