Hamid Naceur Benkhaled, Djamel Berrabah, F. Boufarès
2020 5th International Conference on Cloud Computing and Artificial Intelligence: Technologies and Applications (CloudTech), published 2020-11-24. DOI: 10.1109/CloudTech49835.2020.9365866 (https://doi.org/10.1109/CloudTech49835.2020.9365866)
Block Sizes Control For an Efficient Real Time Record Linkage
Record Linkage (RL) is the process of detecting duplicates in one or several datasets. The most important phase of the RL process is blocking, which reduces the quadratic complexity of RL by dividing the data into several blocks and matching records only within each block. Several blocking techniques have been proposed in the literature, but most of them have no mechanism for controlling the sizes of the generated blocks, which is an important requirement in real-time RL and privacy-preserving RL. In this paper, we propose a mechanism to control the block sizes generated by K-Modes based Record Linkage. Experiments on three real-world datasets show satisfactory results: most of the duplicate records were detected while the specified block sizes were respected.
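To make the role of blocking and of a block-size limit concrete, the following is a minimal Python sketch. It is not the K-Modes based mechanism proposed in the paper, whose details the abstract does not give. It assumes a simple key-based blocking (first letter of the record) and a naive rule that splits any oversized block into sorted chunks. The names `pair_count`, `build_blocks`, `candidate_pairs`, `key_fn`, and `max_size`, as well as the sample records, are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n):
    """Number of unordered record pairs among n records (quadratic in n)."""
    return n * (n - 1) // 2


def build_blocks(records, key_fn, max_size):
    """Group records by a blocking key, then split blocks larger than max_size.

    Assumption: oversized blocks are cut into consecutive chunks of the
    sorted records. This is an illustrative heuristic, not the paper's method.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")

    grouped = defaultdict(list)
    for rec in records:
        grouped[key_fn(rec)].append(rec)

    blocks = []
    for key in sorted(grouped):
        members = sorted(grouped[key])
        # Enforce the size limit: no emitted block exceeds max_size records.
        for start in range(0, len(members), max_size):
            blocks.append(members[start:start + max_size])
    return blocks


def candidate_pairs(blocks):
    """Yield the record pairs that would be compared, block by block."""
    for block in blocks:
        yield from combinations(block, 2)


# Hypothetical sample data: eight name strings with a few near-duplicates.
records = [
    "smith john", "smith jon", "smyth john", "smith jane",
    "sanders amy", "brown bob", "browne bob", "baker al",
]

blocks = build_blocks(records, key_fn=lambda r: r[0], max_size=3)
compared = sum(1 for _ in candidate_pairs(blocks))

print("pairs without blocking:", pair_count(len(records)))
print("block sizes:", [len(b) for b in blocks])
print("pairs with size-limited blocking:", compared)
```

In this sketch the size limit bounds the matching work per block, but the naive split also places "smith john" and "smith jon" in different blocks, so that pair is never compared. A size-control mechanism has to manage this trade-off between bounded block sizes and the number of duplicates that remain detectable.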