Information Quality in Information Systems: Latest Publications

ETL queues for active data warehousing

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077509

Alexandros Karakasidis, Panos Vassiliadis, E. Pitoura

Citations: 109

Data cleaning using belief propagation

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077518

F. Chu, Yizhou Wang, D. S. Parker, C. Zaniolo

Citations: 20

Methods and analyses for determining quality

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077505

W. Winkler

Abstract: In a possibly ideal world, records in a database would be complete and would contain fields having values that correspond to an underlying reality. An individual's name, address and date-of-birth would be present without typographical error. An income field might be a reasonably close approximation of a "true income" and would not be missing. A list of customers would be complete, unduplicated and current. In this ideal world, a database could be used for several purposes and would be considered to have high quality. A set of databases might be linked using name, address, and other weakly identifying information. In this paper, we describe situations where properly chosen metrics may indicate that data quality is not sufficiently high for monitoring processes, for modeling, and for data mining. Some of the metrics are supplementary to those in the quality literature or have rarely been used. Additionally, we describe generalized methods and software tools that allow a skilled individual to perform massive clean-up of files in some situations. The clean-up, while possibly sub-optimal in recreating "truth", can replace exceptionally large amounts of clerical review and allow many uses of the "cleaned" files.

Citations: 4

Approximate matching of textual domain attributes for information source integration

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077516

A. Koeller, Vinay Keelara

Citations: 3

Blocking-aware private record linkage

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077513

A. Al-Lawati, Dongwon Lee, P. McDaniel

Citations: 96

Data quality inference

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077519

R. K. Pon, A. F. Cardenas

Citations: 13

Making quality count in biological data sources

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077508

Alexandra Martínez, J. Hammer

Citations: 24

Clustering mixed numerical and low quality categorical data: significance metrics on a yeast example

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077517

Bill Andreopoulos, Aijun An, Xiaogang Wang

Citations: 7

Handling data quality in entity resolution

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077503

H. Garcia-Molina

Abstract: Entity resolution (ER) is a problem that arises in many information integration scenarios: We have two or more sources containing records on the same set of real-world entities (e.g., customers). However, there are no unique identifiers that tell us what records from one source correspond to those in the other sources. Furthermore, the records representing the same entity may have differing information, e.g., one record may have the address misspelled, another record may be missing some fields. An ER algorithm attempts to identify the matching records from multiple sources (i.e., those corresponding to the same real-world entity), and merges the matching records as best it can. In many ER applications the input data has data quality or uncertainty values associated with it. Furthermore, the ER process itself introduces additional uncertainties, e.g., we may only be 90% confident that two given records actually correspond to the same real-world entity. In this talk Hector Garcia-Molina will discuss the challenges in representing quality/uncertainty/confidences in a way that is useful for the ER process. He will also present some preliminary ideas on how to perform ER with uncertain data. (This work is joint with Omar Benjelloun, David Menestrina, Qi Su, and Jennifer Widom.)

Citations: 1

Exploiting relationships for object consolidation

Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077512

Zhaoqi Chen, D. Kalashnikov, S. Mehrotra

Abstract: Data mining practitioners frequently have to spend a significant portion of their project time on data preprocessing before they can apply their algorithms to real-world datasets. Such preprocessing is required because many real-world datasets are not perfect, but rather contain missing, erroneous, and duplicate data and other data cleaning problems. It is a well established fact that, in general, if such problems with data are not corrected, applying data mining algorithms can lead to wrong results. The latter is known as the "garbage in, garbage out" principle. Given the significance of the problem, numerous data cleaning techniques have been designed in the past to address the aforementioned problems with data. In this paper, we address one of the data cleaning challenges, called object consolidation. This important challenge arises because objects in datasets are frequently represented via descriptions (a set of instantiated attributes), which alone might not always uniquely identify the object. The goal of object consolidation is to correctly consolidate (i.e., to group/determine) all the representations of the same object, for each object in the dataset. In contrast to traditional domain-independent data cleaning techniques, our approach analyzes not only object features, but also additional semantic information: inter-object relationships, for the purpose of object consolidation. The approach views datasets as attributed relational graphs (ARGs) of object representations (nodes), connected via relationships (edges). The approach then applies graph partitioning techniques to accurately cluster object representations. Our empirical study over real datasets shows that analyzing relationships significantly improves the quality of the result.

Citations: 68