Scalable Matching and Clustering of Entities with FAMER
A. Saeedi, Markus Nentwig, E. Peukert, E. Rahm
Complex Syst. Informatics Model. Q., published 2018-10-31
DOI: https://doi.org/10.7250/csimq.2018-16.04
Citations: 34
Abstract
Entity resolution identifies semantically equivalent entities, e.g., those describing the same product or customer. It is especially challenging for Big Data applications, where large volumes of data from many sources have to be matched and integrated. We therefore introduce FAMER (FAst Multi-source Entity Resolution system), a scalable entity resolution framework that is based on Apache Flink for distributed execution and that can holistically match entities from multiple sources. For the latter purpose, FAMER includes multiple clustering schemes that group matching entities from different sources into clusters. In addition to previously known clustering schemes, FAMER includes new approaches tailored to multi-source entity resolution. We perform a detailed comparative evaluation of eight clustering schemes on real-life and synthetically generated datasets. The evaluation considers both match quality and scalability for different numbers of machines and data sizes.
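
To make the match-then-cluster idea concrete, the following is a minimal single-machine sketch in Python. It is not FAMER's implementation (FAMER runs on Apache Flink), and the abstract does not name the eight evaluated schemes; connected components is used here only as a generic, simple way to turn pairwise match links into clusters. The function names `similarity`, `match_pairs`, and `cluster_connected_components`, the string similarity measure, and the threshold of 0.8 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_pairs(entities, threshold=0.8):
    """Link entities from *different* sources whose names are similar enough.

    entities: list of (entity_id, source, name) tuples.
    Returns a list of (id1, id2) match links.
    """
    links = []
    for (id1, src1, name1), (id2, src2, name2) in combinations(entities, 2):
        if src1 != src2 and similarity(name1, name2) >= threshold:
            links.append((id1, id2))
    return links


def cluster_connected_components(entities, links):
    """Group entities into clusters: one cluster per connected component
    of the match graph, computed with a simple union-find."""
    parent = {eid: eid for eid, _, _ in entities}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for eid in parent:
        clusters.setdefault(find(eid), set()).add(eid)
    return list(clusters.values())


# Hypothetical toy data: three sources (A, B, C) describing cameras.
entities = [
    ("a1", "A", "Canon EOS 80D"),
    ("b1", "B", "canon eos 80d"),
    ("c1", "C", "Canon EOS-80D"),
    ("a2", "A", "Nikon D3500"),
    ("b2", "B", "Nikon D-3500"),
    ("c2", "C", "Sony A7 III"),
]
links = match_pairs(entities)
clusters = cluster_connected_components(entities, links)
```

The all-pairs comparison in `match_pairs` is quadratic in the number of entities, which is one reason a distributed framework becomes attractive as data volumes and the number of sources grow. Connected components also merges everything that is transitively linked, so a single wrong link can join two unrelated clusters.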