Duplicate Record Detection for Database Cleansing

M. Rehman, Vatcharapon Esichaikul
Published in: 2009 Second International Conference on Machine Vision
Publication date: 2009-12-28
DOI: 10.1109/ICMV.2009.43 (https://doi.org/10.1109/ICMV.2009.43)
Citations: 26

Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. Data collected from various sources may have quality problems, and these problems become prominent when databases are integrated, because the integrated database inherits the data quality problems present in its sources. The data in integrated systems must be cleaned to support proper decision making, and data cleansing is one of the most crucial steps. This research focuses on one of the major issues of data cleansing, duplicate record detection, which arises when data is collected from various sources. The study compares the standard duplicate elimination algorithm (SDE), the sorted neighborhood algorithm (SNA), the duplicate elimination sorted neighborhood algorithm (DE-SNA), and the adaptive duplicate detection algorithm (ADD). A prototype is also developed, which shows that the adaptive duplicate detection algorithm is the optimal solution for the problem of duplicate record detection. For approximate matching of data records, two string matching algorithms (a recursive algorithm with word base and a recursive algorithm with character base) have been implemented, and it is concluded that the results are much better with the recursive algorithm with word base.
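
To make the sorted neighborhood idea concrete, the following is a minimal Python sketch of the general technique as it is commonly described: build a sorting key per record, sort, then compare only records that fall inside a small sliding window. It is not the prototype from the paper. The function names, the `window` and `threshold` values, the sample records, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; in particular, the matcher is a character-level stand-in, not the paper's recursive word-base or character-base algorithm.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in matcher, not the paper's."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Return index pairs (i, j), i < j, of records judged to be duplicates.

    records   -- list of strings (one string per record, for simplicity)
    key       -- function that builds the sorting key for a record
    window    -- number of consecutive sorted records compared with each other
    threshold -- minimum similarity to report a pair (illustrative value)
    """
    # Steps 1 and 2: build the keys and sort the record indices by key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    # Step 3: slide a fixed-size window over the sorted order and compare
    # each record only with the window - 1 records that follow it.
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data, invented for this example.
records = [
    "John Smith, 12 Main Street",
    "Maria Garcia, 5 Oak Avenue",
    "Jon Smith, 12 Main Street",
    "Maria Garcia, 5 Oak Ave",
]
duplicates = sorted_neighborhood(records, key=lambda r: r.lower())
print(duplicates)  # [(0, 2), (1, 3)]
```

The window size is the main trade-off in this sketch: with n records and window w, at most n(w - 1) pairs are compared instead of n(n - 1)/2, but duplicates whose keys sort far apart are missed.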