Active Learning with Density-Initialized Decision Tree for Record Matching

Proceedings of the 29th International Conference on Scientific and Statistical Database Management
Pub Date: 2017-06-27
DOI: 10.1145/3085504.3085518

Chenxiao Dou, Daniel W. Sun, Guoqiang Li, R. Wong


Citations: 4

Abstract


Active Learning with Density-Initialized Decision Tree for Record Matching

One of the fundamental problems in data management and data integration is record matching, which refers to identifying records that relate to the same entities across different data sources. In recent literature, active learning has been shown to be effective for record matching. One of the key steps of active learning is to build a proper initial classifier, with which active learning algorithms can quickly locate informative examples for training accurate models. However, labelling examples for model training is usually expensive. Even worse, if a weak initial classifier is used, the labelling cost can increase significantly. In this paper, we propose an unsupervised algorithm to determine the initial classifier, so that classifier initialization incurs no labelling cost. Building on the proposed algorithm, we then present an active sampling method for selecting informative examples. The experiments show that our approach achieves competitive learning performance at a much lower labelling cost than other active learning approaches.
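The abstract does not spell out the density-based initialization or the sampling criterion, so the following is only a generic sketch of the workflow it outlines, not the authors' algorithm. It assumes each record pair is summarized by a single similarity score in [0, 1], seeds the classifier without any human labels by treating the most and least similar pairs as pseudo-matches and pseudo-non-matches, uses a depth-1 decision tree (a stump), and then queries the pair closest to the current threshold (plain uncertainty sampling). All function names and parameters (`fit_stump`, `seed_labels`, `active_learn`, `k`, `budget`) are hypothetical.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 decision tree: predict 'match' iff x >= threshold."""
    pts = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(pts, pts[1:])] or [pts[0]]
    # Keep the candidate with the fewest training errors (first one on ties).
    return min(candidates,
               key=lambda t: sum((x >= t) != y for x, y in zip(xs, ys)))


def seed_labels(pool, k):
    """Unsupervised seed (an assumption, not the paper's density method):
    the k least similar pairs are pseudo-non-matches, the k most similar
    pairs are pseudo-matches. These pseudo-labels may be wrong in general."""
    order = sorted(range(len(pool)), key=lambda i: pool[i])
    seeds = {i: False for i in order[:k]}
    seeds.update({i: True for i in order[-k:]})
    return seeds


def active_learn(pool, oracle, k=5, budget=10):
    """Pool-based active learning with uncertainty sampling.

    pool   : list of similarity scores, one per candidate record pair
    oracle : function index -> bool, standing in for the human labeller
    Returns the learned threshold and the number of oracle queries used.
    """
    labelled = seed_labels(pool, k)  # costs no oracle calls
    queries = 0
    while queries < budget:
        unlabelled = [i for i in range(len(pool)) if i not in labelled]
        if not unlabelled:
            break
        idx = list(labelled)
        t = fit_stump([pool[i] for i in idx], [labelled[i] for i in idx])
        # Query the pair the current stump is least sure about.
        q = min(unlabelled, key=lambda i: abs(pool[i] - t))
        labelled[q] = oracle(q)
        queries += 1
    idx = list(labelled)
    return fit_stump([pool[i] for i in idx], [labelled[i] for i in idx]), queries


# Toy data: 100 pairs with similarity i/100; pairs with i >= 60 truly match.
pool = [i / 100 for i in range(100)]
truth = [i >= 60 for i in range(100)]
threshold, n_queries = active_learn(pool, lambda i: truth[i])
print(threshold, n_queries)
```

On this toy pool the loop behaves like a binary search for the decision boundary, which illustrates why a reasonable initial classifier matters: the better the starting threshold, the fewer oracle queries are spent before the boundary is located.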

