Information Quality in Information Systems: Latest Papers

ETL queues for active data warehousing
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077509
Alexandros Karakasidis, Panos Vassiliadis, E. Pitoura
Abstract: Traditionally, the refreshment of data warehouses has been performed in an off-line fashion. Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data. In this paper, we propose a framework for the implementation of active data warehousing, with the following goals: (a) minimal changes in the software configuration of the source, (b) minimal overhead for the source due to the active nature of data propagation, and (c) the possibility of smoothly regulating the overall configuration of the environment in a principled way. In our framework, we have implemented ETL activities over queue networks and employ queue theory for the prediction of the performance and the tuning of the operation of the overall refreshment process. Due to the performance overheads incurred, we explore different architectural choices for this task and discuss the issues that arise for each of them.
Citations: 109
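The abstract above says queue theory is used to predict the performance of the refreshment process. As a rough illustration only, the sketch below computes the standard steady-state metrics of a single M/M/1 queue. The choice of an M/M/1 model, the function name `mm1_metrics`, and the example rates are assumptions made here; the abstract does not state which queueing model the authors use.

```python
def mm1_metrics(arrival_rate: float, service_rate: float) -> dict:
    """Steady-state metrics of an M/M/1 queue (an assumed model, not the paper's)."""
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        # With arrivals at least as fast as service, the queue grows without bound.
        raise ValueError("unstable queue: arrival_rate must be below service_rate")

    utilization = arrival_rate / service_rate              # rho = lambda / mu
    mean_in_system = utilization / (1.0 - utilization)     # L = rho / (1 - rho)
    mean_response_time = 1.0 / (service_rate - arrival_rate)  # W = 1 / (mu - lambda)
    return {
        "utilization": utilization,
        "mean_in_system": mean_in_system,
        "mean_response_time": mean_response_time,
    }


# Hypothetical example: 8 source updates per second arrive at an ETL
# activity that can process 10 updates per second.
metrics = mm1_metrics(arrival_rate=8.0, service_rate=10.0)
```

Under these assumed rates the activity is busy about 80% of the time and an update spends about half a second in the system, which is the kind of estimate one could use when tuning such a pipeline.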
Data cleaning using belief propagation
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077518
F. Chu, Yizhou Wang, D. S. Parker, C. Zaniolo
Abstract: Effective data cleaning is critical in many applications where the quality of data is poor due to missing values or inaccurate values. Fortunately, a wide spectrum of applications exhibit strong dependencies between data samples, and such dependencies can be used very effectively for cleaning the data. For example, the readings of nearby sensors are generally correlated, and proteins interact with each other when performing crucial functions. We propose a data cleaning approach, based on modeling data dependencies with Markov networks. Belief propagation is used to efficiently compute the marginal or maximum posterior probabilities, so as to infer missing values or to correct errors. To illustrate the benefits and generality of the technique, we discuss its use in several applications and report on the data quality and improvements so obtained.
Citations: 20
Methods and analyses for determining quality
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077505
W. Winkler
Abstract: In a possibly ideal world, records in a database would be complete and would contain fields having values that correspond to an underlying reality. An individual's name, address and date-of-birth would be present without typographical error. An income field might be a reasonably close approximation of a "true income" and would not be missing. A list of customers would be complete, unduplicated and current. In this ideal world, a database could be used for several purposes and would be considered to have high quality. A set of databases might be linked using name, address, and other weakly identifying information. In this paper, we describe situations where properly chosen metrics may indicate that data quality is not sufficiently high for monitoring processes, for modeling, and for data mining. Some of the metrics are supplementary to those in the quality literature or have rarely been used. Additionally, we describe generalized methods and software tools that allow a skilled individual to perform massive clean-up of files in some situations. The clean-up, while possibly sub-optimal in recreating "truth", can replace exceptionally large amounts of clerical review and allow many uses of the "cleaned" files.
Citations: 4
Approximate matching of textual domain attributes for information source integration
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077516
A. Koeller, Vinay Keelara
Abstract: A key problem in the integration of information sources is the identification of related attributes or objects across independent sources. Inferring such meta-information from source data (rather than a-priori available meta-data, such as attribute names) is sometimes possible. For example, existing algorithms attempt to integrate information sources by finding patterns such as Inclusion Dependencies (INDs) across them. However, INDs are based on exact set inclusion and are thus very strict patterns that rarely hold across independent real-world databases. We propose two error-tolerant measures, termed Similarity Score and Distribution Score, that help identify related attributes across two independent databases, based on similarities in their data. Those measures specifically address the problem of identifying semantic relationships between textual attributes of databases that have few or no equal values. We also present implementations of those measures and some experimental results.
Citations: 3
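The abstract above notes that Inclusion Dependencies rely on exact set inclusion and therefore rarely hold across independent databases. The sketch below illustrates only that strictness: it contrasts an exact inclusion test with a simple fraction of included values. The function names and sample values are hypothetical, and the fraction is not the paper's Similarity Score or Distribution Score, whose definitions the abstract does not give.

```python
def holds_exact_ind(dependent_values, referenced_values) -> bool:
    """True only if every dependent value also appears in the referenced column."""
    return set(dependent_values) <= set(referenced_values)


def inclusion_fraction(dependent_values, referenced_values) -> float:
    """Share of distinct dependent values found in the referenced column.

    A naive tolerance measure used here for illustration only.
    """
    dependent = set(dependent_values)
    if not dependent:
        return 1.0  # an empty column is trivially included
    return len(dependent & set(referenced_values)) / len(dependent)


# Hypothetical columns from two independent sources.
city_a = ["Athens", "Berlin", "Cairo", "Delhi"]
city_b = ["Athens", "Berlin", "Cairo", "Lima"]

exact = holds_exact_ind(city_a, city_b)      # one stray value breaks the IND
partial = inclusion_fraction(city_a, city_b)
```

A single non-matching value makes the exact test fail even though three of the four values agree. Note that this naive fraction still depends on equal values, so it would not help in the case the abstract highlights, where textual attributes share few or no equal values.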
Blocking-aware private record linkage
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077513
A. Al-Lawati, Dongwon Lee, P. Mcdaniel
Abstract: In this paper, the problem of quickly matching records (i.e., record linkage problem) from two autonomous sources without revealing privacy to the other parties is considered. In particular, our focus is to devise a secure blocking scheme to improve the performance of record linkage significantly while being secure. Although there have been works on private record linkage, none has considered adopting the blocking framework. Therefore, our proposed blocking-aware private record linkage can perform large-scale record linkage without revealing privacy. Preliminary experimental results showing the potential of the proposal are reported.
Citations: 96
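The abstract above relies on blocking to make record linkage fast. As background, the sketch below shows plain, non-private blocking: records are grouped by a cheap key and only records in the same block are compared. The blocking key (first letter of a `name` field), the record layout, and the function names are assumptions for illustration; the paper's secure blocking scheme is not described in the abstract and is not reproduced here.

```python
from collections import defaultdict


def block_key(record: dict) -> str:
    """Hypothetical blocking key: lower-cased first letter of the name."""
    return record["name"][:1].lower()


def candidate_pairs(source_a, source_b, key=block_key):
    """Return only the cross-source pairs whose records share a blocking key."""
    blocks_b = defaultdict(list)
    for record_b in source_b:
        blocks_b[key(record_b)].append(record_b)

    pairs = []
    for record_a in source_a:
        # .get avoids creating empty blocks for keys that exist only in source_a.
        for record_b in blocks_b.get(key(record_a), []):
            pairs.append((record_a, record_b))
    return pairs


# Hypothetical records from two sources.
source_a = [{"name": "Alice"}, {"name": "Bob"}]
source_b = [{"name": "alicia"}, {"name": "Carol"}, {"name": "Bobby"}]

pairs = candidate_pairs(source_a, source_b)
```

Here blocking leaves 2 candidate pairs instead of the 6 produced by comparing every record in one source with every record in the other. This plain version exposes records and keys in the clear, so it offers no privacy protection by itself.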
Data quality inference
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077519
R. K. Pon, A. F. Cardenas
Abstract: In the field of sensor networks, data integration and collaboration, and intelligence gathering efforts, information on the quality of data sources is important but is often not available. We describe a technique to rank data sources by observing and comparing their behavior (i.e., the data produced by data sources). Intuitively, our measure characterizes data sources that agree with accurate or high-quality data sources as likely accurate. Furthermore, our measure includes a temporal component that takes into account a data source's past accuracy in evaluating its current accuracy. Initial experimental results based on simulation data to support our hypothesis demonstrate high precision and recall on identifying the most accurate data sources.
Citations: 13
Making quality count in biological data sources
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077508
Alexandra Martínez, J. Hammer
Abstract: We propose an extension to the semistructured data model that captures and integrates information about the quality of the stored data. Specifically, we describe the main challenges involved in measuring and representing data quality, and how we addressed them. These challenges include extending an existing data model to include quality metadata, identifying useful quality measures, and devising a way to compute and update the value of the quality measures as data is queried and updated. Although our approach can be generalized to various other domains, it is currently aimed at describing the quality of biological data sources. We illustrate the benefits of our model using several examples from biological databases.
Citations: 24
Clustering mixed numerical and low quality categorical data: significance metrics on a yeast example
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077517
Bill Andreopoulos, Aijun An, Xiaogang Wang
Abstract: We present the M-BILCOM algorithm for clustering mixed numerical and categorical data sets, in which the categorical attribute values (CAs) are not certain to be correct and have associated confidence values (CVs) from 0.0 to 1.0 to represent their certainty of correctness. M-BILCOM performs bi-level clustering of mixed data sets resembling a Bayesian process. We have applied M-BILCOM to yeast data sets in which the CAs were perturbed randomly and CVs were assigned indicating the confidence of correctness of the CAs. On such mixed data sets M-BILCOM outperforms other clustering algorithms, such as AutoClass. We have applied M-BILCOM to real numerical data sets from gene expression studies on yeast, incorporating CAs representing Gene Ontology annotations on the genes and CVs representing Gene Ontology Evidence Codes on the CAs. We apply novel significance metrics to the CAs in resulting clusters, to extract the most significant CAs based on their frequencies and their CVs in the cluster. For genomic data sets, we use the most significant CAs in a cluster to predict gene function.
Citations: 7
Handling data quality in entity resolution
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077503
H. Garcia-Molina
Abstract: Entity resolution (ER) is a problem that arises in many information integration scenarios: We have two or more sources containing records on the same set of real-world entities (e.g., customers). However, there are no unique identifiers that tell us what records from one source correspond to those in the other sources. Furthermore, the records representing the same entity may have differing information, e.g., one record may have the address misspelled, another record may be missing some fields. An ER algorithm attempts to identify the matching records from multiple sources (i.e., those corresponding to the same real-world entity), and merges the matching records as best it can. In many ER applications the input data has data quality or uncertainty values associated with it. Furthermore, the ER process itself introduces additional uncertainties, e.g., we may only be 90% confident that two given records actually correspond to the same real-world entity. In this talk Hector Garcia-Molina will discuss the challenges in representing quality/uncertainty/confidences in a way that is useful for the ER process. He will also present some preliminary ideas on how to perform ER with uncertain data. (This work is joint with Omar Benjelloun, David Menestrina, Qi Su, and Jennifer Widom).
Citations: 1
Exploiting relationships for object consolidation
Information Quality in Information Systems Pub Date : 2005-06-17 DOI: 10.1145/1077501.1077512
Zhaoqi Chen, D. Kalashnikov, S. Mehrotra
Abstract: Data mining practitioners frequently have to spend a significant portion of their project time on data preprocessing before they can apply their algorithms on real-world datasets. Such preprocessing is required because many real-world datasets are not perfect, but rather they contain missing, erroneous, duplicate data and other data cleaning problems. It is a well established fact that, in general, if such problems with data are not corrected, applying data mining algorithms can lead to wrong results. The latter is known as the "garbage in, garbage out" principle. Given the significance of the problem, numerous data cleaning techniques have been designed in the past to address the aforementioned problems with data. In this paper, we address one of the data cleaning challenges, called object consolidation. This important challenge arises because objects in datasets are frequently represented via descriptions (a set of instantiated attributes), which alone might not always uniquely identify the object. The goal of object consolidation is to correctly consolidate (i.e., to group/determine) all the representations of the same object, for each object in the dataset. In contrast to traditional domain-independent data cleaning techniques, our approach analyzes not only object features, but also additional semantic information: inter-object relationships, for the purpose of object consolidation. The approach views datasets as attributed relational graphs (ARGs) of object representations (nodes), connected via relationships (edges). The approach then applies graph partitioning techniques to accurately cluster object representations. Our empirical study over real datasets shows that analyzing relationships significantly improves the quality of the result.
Citations: 68