Exploiting relationships for object consolidation

Information Quality in Information Systems. Pub Date: 2005-06-17. DOI: 10.1145/1077501.1077512

Zhaoqi Chen, D. Kalashnikov, S. Mehrotra


Citations: 68

Abstract



Data mining practitioners frequently have to spend a significant portion of their project time on data preprocessing before they can apply their algorithms to real-world datasets. Such preprocessing is required because many real-world datasets are not perfect: they contain missing, erroneous, and duplicate data, along with other data cleaning problems. It is a well-established fact that, in general, if such problems are not corrected, applying data mining algorithms can lead to wrong results. This is known as the "garbage in, garbage out" principle. Given the significance of the problem, numerous data cleaning techniques have been designed in the past to address it.

In this paper, we address one of the data cleaning challenges, called object consolidation. This important challenge arises because objects in datasets are frequently represented via descriptions (a set of instantiated attributes), which alone might not always uniquely identify the object. The goal of object consolidation is to correctly consolidate (i.e., to group/determine) all the representations of the same object, for each object in the dataset. In contrast to traditional domain-independent data cleaning techniques, our approach analyzes not only object features but also additional semantic information, namely inter-object relationships, for the purpose of object consolidation. The approach views datasets as attributed relational graphs (ARGs) of object representations (nodes) connected via relationships (edges). The approach then applies graph partitioning techniques to accurately cluster object representations. Our empirical study over real datasets shows that analyzing relationships significantly improves the quality of the result.
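The idea of combining feature similarity with relationship evidence can be illustrated with a small Python sketch. This is not the authors' algorithm: the abstract does not specify the similarity measure or the partitioning technique, so the sketch assumes a single textual attribute compared with `difflib.SequenceMatcher`, treats "shares at least one related entity" as the relationship signal, and uses connected components (via union-find) as a simple stand-in for graph partitioning. The function names, the threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Feature-based similarity of two descriptions, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def consolidate(names, relationships, feature_threshold=0.6):
    """Group representation ids that plausibly denote the same object.

    names:         dict of representation id -> description (assumed: one attribute)
    relationships: iterable of (representation id, related entity) edges
    Returns a sorted list of sorted id lists, one list per inferred object.
    """
    # Relationship edges: which related entities each representation touches.
    neighbors = defaultdict(set)
    for rid, entity in relationships:
        neighbors[rid].add(entity)

    # Union-find over representations; each starts in its own cluster.
    parent = {rid: rid for rid in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        similar = name_similarity(names[a], names[b]) >= feature_threshold
        related = bool(neighbors[a] & neighbors[b])
        # Assumption: merge only when features AND relationships agree.
        if similar and related:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for rid in names:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical data: three author representations with similar names.
names = {"r1": "J. Smith", "r2": "John Smith", "r3": "Jane Smith"}
# Each representation is linked to a co-author it appears with.
relationships = [("r1", "A. Lee"), ("r2", "A. Lee"), ("r3", "B. Kim")]

print(consolidate(names, relationships))  # [['r1', 'r2'], ['r3']]
```

In this toy data all three names are similar enough to pass the feature threshold, so features alone would merge them into one object. The shared co-author is what separates `r1` and `r2` from `r3`, which is the kind of gain from relationships that the abstract describes.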

