利用关系进行对象整合

Zhaoqi Chen, D. Kalashnikov, S. Mehrotra
{"title":"利用关系进行对象整合","authors":"Zhaoqi Chen, D. Kalashnikov, S. Mehrotra","doi":"10.1145/1077501.1077512","DOIUrl":null,"url":null,"abstract":"Data mining practitioners frequently have to spend significant portion of their project time on data preprocessing before they can apply their algorithms on real-world datasets. Such a preprocessing is required because many real-world datasets are not perfect, but rather they contain missing, erroneous, duplicate data and other data cleaning problems. It is a well established fact that, in general, if such problems with data are not corrected, applying data mining algorithm can lead to wrong results. The latter is known as the \"garbage in, garbage out\" principle. Given the significance of the problem, numerous data cleaning techniques have been designed in the past to address the aforementioned problems with data.In this paper, we address one of the data cleaning challenges, called object consolidation. This important challenge arises because objects in datasets are frequently represented via descriptions (a set of instantiated attributes), which alone might not always uniquely identify the object. The goal of object consolidation is to correctly consolidate (i.e., to group/determine) all the representations of the same object, for each object in the dataset. In contrast to traditional domain-independent data cleaning techniques, our approach analyzes not only object features, but also additional semantic information: inter-objects relationships, for the purpose of object consolidation. The approach views datasets as attributed relational graphs (ARGs) of object representations (nodes), connected via relationships (edges). The approach then applies graph partitioning techniques to accurately cluster object representations. Our empirical study over real datasets shows that analyzing relationships significantly improves the quality of the result.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"68","resultStr":"{\"title\":\"Exploiting relationships for object consolidation\",\"authors\":\"Zhaoqi Chen, D. Kalashnikov, S. Mehrotra\",\"doi\":\"10.1145/1077501.1077512\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data mining practitioners frequently have to spend significant portion of their project time on data preprocessing before they can apply their algorithms on real-world datasets. Such a preprocessing is required because many real-world datasets are not perfect, but rather they contain missing, erroneous, duplicate data and other data cleaning problems. It is a well established fact that, in general, if such problems with data are not corrected, applying data mining algorithm can lead to wrong results. The latter is known as the \\\"garbage in, garbage out\\\" principle. Given the significance of the problem, numerous data cleaning techniques have been designed in the past to address the aforementioned problems with data.In this paper, we address one of the data cleaning challenges, called object consolidation. This important challenge arises because objects in datasets are frequently represented via descriptions (a set of instantiated attributes), which alone might not always uniquely identify the object. The goal of object consolidation is to correctly consolidate (i.e., to group/determine) all the representations of the same object, for each object in the dataset. In contrast to traditional domain-independent data cleaning techniques, our approach analyzes not only object features, but also additional semantic information: inter-objects relationships, for the purpose of object consolidation. The approach views datasets as attributed relational graphs (ARGs) of object representations (nodes), connected via relationships (edges). The approach then applies graph partitioning techniques to accurately cluster object representations. Our empirical study over real datasets shows that analyzing relationships significantly improves the quality of the result.\",\"PeriodicalId\":306187,\"journal\":{\"name\":\"Information Quality in Information Systems\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-06-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"68\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Quality in Information Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1077501.1077512\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Quality in Information Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1077501.1077512","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 68

摘要

在将算法应用于实际数据集之前,数据挖掘从业者经常需要花费大量的项目时间进行数据预处理。这样的预处理是必需的,因为许多现实世界的数据集并不完美,而是包含丢失、错误、重复数据和其他数据清理问题。一个公认的事实是,通常情况下,如果这些数据问题没有得到纠正,那么应用数据挖掘算法可能会导致错误的结果。后者被称为“垃圾输入,垃圾输出”原则。鉴于这个问题的重要性,过去已经设计了许多数据清理技术来解决上述数据问题。在本文中,我们将讨论数据清理中的一个挑战,即对象整合。之所以会出现这个重要的挑战,是因为数据集中的对象经常通过描述(一组实例化的属性)来表示,仅凭描述可能并不总是唯一地标识对象。对象整合的目标是正确整合(即,对数据集中的每个对象进行分组/确定)相同对象的所有表示。与传统的领域独立数据清理技术相比,我们的方法不仅分析对象特征,还分析额外的语义信息:对象间关系,以实现对象整合。该方法将数据集视为对象表示(节点)的属性关系图(arg),通过关系(边)连接。然后,该方法应用图划分技术来准确地聚类对象表示。我们对真实数据集的实证研究表明,分析关系显著提高了结果的质量。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Exploiting relationships for object consolidation
Data mining practitioners frequently have to spend significant portion of their project time on data preprocessing before they can apply their algorithms on real-world datasets. Such a preprocessing is required because many real-world datasets are not perfect, but rather they contain missing, erroneous, duplicate data and other data cleaning problems. It is a well established fact that, in general, if such problems with data are not corrected, applying data mining algorithm can lead to wrong results. The latter is known as the "garbage in, garbage out" principle. Given the significance of the problem, numerous data cleaning techniques have been designed in the past to address the aforementioned problems with data.In this paper, we address one of the data cleaning challenges, called object consolidation. This important challenge arises because objects in datasets are frequently represented via descriptions (a set of instantiated attributes), which alone might not always uniquely identify the object. The goal of object consolidation is to correctly consolidate (i.e., to group/determine) all the representations of the same object, for each object in the dataset. In contrast to traditional domain-independent data cleaning techniques, our approach analyzes not only object features, but also additional semantic information: inter-objects relationships, for the purpose of object consolidation. The approach views datasets as attributed relational graphs (ARGs) of object representations (nodes), connected via relationships (edges). The approach then applies graph partitioning techniques to accurately cluster object representations. Our empirical study over real datasets shows that analyzing relationships significantly improves the quality of the result.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信