Active Learning with Density-Initialized Decision Tree for Record Matching
Chenxiao Dou, Daniel W. Sun, Guoqiang Li, R. Wong
Proceedings of the 29th International Conference on Scientific and Statistical Database Management, 27 June 2017
https://doi.org/10.1145/3085504.3085518
Record matching, the task of identifying records that refer to the same entity across different data sources, is a fundamental problem in data management and data integration. Recent literature has shown active learning to be effective for record matching. A key step in active learning is building a suitable initial classifier, with which the learning algorithm can quickly locate informative examples for training accurate models. However, labelling examples for model training is usually expensive, and a weak initial classifier can increase the labelling cost significantly. In this paper, we propose an unsupervised algorithm for determining the initial classifier, so that classifier initialization incurs no labelling cost. Building on this algorithm, we present an active sampling method for selecting informative examples. Experiments show that our approach achieves competitive learning performance at a much lower labelling cost than other active learning approaches.
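The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general workflow it describes, not the authors' method. It assumes each candidate record pair has already been reduced to a single similarity score, stands in for the density-based initialization with a simple "widest gap" heuristic, and uses a one-split threshold (a depth-1 decision tree) refined by uncertainty sampling. The names `initial_threshold`, `active_learn`, and `oracle` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def initial_threshold(scores: List[float]) -> float:
    """Label-free initialization (assumed heuristic, not the paper's):
    split at the midpoint of the widest gap between sorted scores.
    Requires at least two scores."""
    ordered = sorted(scores)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, i = max(gaps)
    return (ordered[i] + ordered[i + 1]) / 2


def active_learn(
    scores: List[float],
    oracle: Callable[[int], bool],
    budget: int,
) -> Tuple[float, int]:
    """Refine a one-split classifier with uncertainty sampling.

    `oracle(i)` plays the human labeller: True if pair i is a match.
    Returns the final threshold and the number of labels requested.
    """
    threshold = initial_threshold(scores)
    labelled: Dict[int, bool] = {}
    for _ in range(budget):
        unlabelled = [i for i in range(len(scores)) if i not in labelled]
        if not unlabelled:
            break
        # Most informative example: the pair closest to the decision boundary.
        query = min(unlabelled, key=lambda i: abs(scores[i] - threshold))
        labelled[query] = oracle(query)
        matches = [scores[i] for i, y in labelled.items() if y]
        non_matches = [scores[i] for i, y in labelled.items() if not y]
        # Re-fit the split only once both classes have been observed.
        if matches and non_matches:
            threshold = (min(matches) + max(non_matches)) / 2
    return threshold, len(labelled)
```

Calling `active_learn(scores, oracle, budget)` with a small list of similarity scores and a labelling function shows the intended trade-off: the initial split costs no labels, and each subsequent label is spent on the pair the current classifier is least sure about.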