Active Learning with Density-Initialized Decision Tree for Record Matching

Chenxiao Dou, Daniel W. Sun, Guoqiang Li, R. Wong
Proceedings of the 29th International Conference on Scientific and Statistical Database Management, 2017-06-27. DOI: 10.1145/3085504.3085518 (https://doi.org/10.1145/3085504.3085518)
Citations: 4

Abstract

One of the fundamental problems in data management and data integration is record matching, which refers to identifying records that relate to the same entity across different data sources. In recent literature, active learning has been shown to be effective for record matching. One of the key steps of active learning is to build a proper initial classifier, with which active learning algorithms can quickly locate informative examples for training accurate models. However, labelling examples for model training is usually expensive, and a weak initial classifier can increase the labelling cost significantly. In this paper, we propose an unsupervised algorithm to determine the initial classifier, so that classifier initialization incurs no labelling cost. Building on the proposed algorithm, we present an active sampling method for selecting informative examples. The experiments show that our approach achieves competitive learning performance at a much lower labelling cost than other active learning approaches.
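The abstract does not spell out the density-based initialization or the sampling rule, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes each record pair is reduced to a single token-level Jaccard score, uses a depth-1 decision tree (a threshold) as the classifier, seeds it with pseudo-labels taken from the extremes of the score distribution as a stand-in for the unsupervised initialization, and then queries the pair closest to the current threshold. All names (`jaccard`, `fit_stump`, `seed_without_labels`, `active_learn`) and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def fit_stump(scores, labels):
    """Depth-1 decision tree: the midpoint threshold with the fewest errors."""
    values = sorted(set(scores))
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])] or [values[0]]
    def errors(t):
        return sum((s >= t) != y for s, y in zip(scores, labels))
    return min(candidates, key=errors)

def seed_without_labels(scores, k):
    """Assumed stand-in for unsupervised initialization: pseudo-label the
    k lowest-scoring pairs as non-matches and the k highest as matches."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    seeds = {i: False for i in order[:k]}
    seeds.update({i: True for i in order[-k:]})
    return seeds

def active_learn(scores, oracle, k=1, budget=2):
    """Seed a stump without labels, then spend `budget` oracle queries on
    the unlabelled pairs nearest the current threshold."""
    labelled = seed_without_labels(scores, k)
    queried = []
    for _ in range(budget):
        idx = list(labelled)
        t = fit_stump([scores[i] for i in idx], [labelled[i] for i in idx])
        pool = [i for i in range(len(scores)) if i not in labelled]
        if not pool:
            break
        pick = min(pool, key=lambda i: abs(scores[i] - t))
        labelled[pick] = oracle(pick)  # the only step that costs a human label
        queried.append(pick)
    idx = list(labelled)
    return fit_stump([scores[i] for i in idx], [labelled[i] for i in idx]), queried

# Toy candidate pairs with ground truth (used only by the simulated oracle).
pairs = [
    ("apple iphone 7 32gb", "apple iphone 7 32gb black", True),
    ("dell xps 13 laptop", "dell xps 13", True),
    ("canon eos 80d camera", "canon eos 80d body", True),
    ("hp laserjet pro printer", "hp deskjet printer", False),
    ("samsung galaxy s8", "samsung gear s3", False),
    ("lenovo thinkpad x1", "asus zenbook 14", False),
]
scores = [jaccard(a, b) for a, b, _ in pairs]
truth = [is_match for _, _, is_match in pairs]

threshold, queried = active_learn(scores, oracle=lambda i: truth[i])
predictions = [s >= threshold for s in scores]
```

In this toy run only two pairs are sent to the oracle, both near the decision boundary, which is the kind of saving in labelling cost the abstract refers to. A realistic setup would use several similarity features per pair and a full decision tree rather than a single threshold.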