Title: Generalized classification rules for entity identification
Authors: Umesh S. Bhoskar, Arati Manjaramkar
Venue: 2016 5th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO)
Publication date: 2016-09-01
DOI: 10.1109/ICRITO.2016.7784951
URL: https://doi.org/10.1109/ICRITO.2016.7784951
Citations: 0
Abstract
Entity resolution (ER), which identifies the records that belong to the same entity, is one of the essential tasks in data integration. It is also known by other names, such as duplicate detection and pattern matching. Nowadays, activities such as information integration, information retrieval, crowdsourcing, and pay-as-you-go involve users in ER tasks, for example deciding whether two entity descriptions refer to the same entity. Previous work on ER relies on clustering and comparison approaches that rest on certain assumptions, and the quality of ER degrades when those assumptions do not hold. In our approach, we present a new set of entity rules in which each rule enumerates all possibilities for identifying the correct entity of a record. Additionally, we propose an extended approach (GenR) that generates rules efficiently and effectively by using a specialized form of a term-based entropy measure. We evaluated the proposed approach experimentally on a data set with a large number of records and on data sets with different data characteristics. The empirical results are promising and demonstrate a performance improvement from using a term-based quality measure.
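
The abstract does not define its term-based entropy measure, so the following is only an illustrative sketch under stated assumptions, not the GenR method. It assumes the measure resembles Shannon entropy computed, for each term, over the entity labels of the records that contain it: a term with entropy 0 occurs in records of a single entity and is therefore a plausible candidate for a rule. The function names `term_entropy` and `discriminative_terms`, the whitespace tokenization, and the toy records are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_entropy(records):
    """Return {term: entropy in bits} over the entity labels of the
    records containing each term.

    `records` is a list of (entity_label, text) pairs. This is an
    assumed stand-in for the paper's measure, which the abstract
    does not specify.
    """
    # For every term, count how many records of each entity contain it.
    term_entities = defaultdict(Counter)
    for entity, text in records:
        for term in set(text.lower().split()):
            term_entities[term][entity] += 1

    entropies = {}
    for term, counts in term_entities.items():
        total = sum(counts.values())
        # Shannon entropy: sum of p * log2(1 / p) over the entity labels.
        entropies[term] = sum(
            (c / total) * math.log2(total / c) for c in counts.values()
        )
    return entropies


def discriminative_terms(records, max_entropy=0.0):
    """Return the sorted terms whose entropy does not exceed `max_entropy`."""
    entropies = term_entropy(records)
    return sorted(t for t, h in entropies.items() if h <= max_entropy)


# Hypothetical toy data: two entities, two records each.
records = [
    ("e1", "john smith boston"),
    ("e1", "j smith boston"),
    ("e2", "john doe boston"),
    ("e2", "jon doe boston"),
]
print(discriminative_terms(records))
```

In this toy data, "boston" and "john" appear under both entities and get a nonzero entropy, while terms such as "smith" and "doe" appear under one entity only. A practical measure would likely also account for how many records support a term, which this sketch ignores.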