{"title":"Provider issues in quality-constrained data provisioning","authors":"P. Missier, Suzanne M. Embury","doi":"10.1145/1077501.1077507","DOIUrl":"https://doi.org/10.1145/1077501.1077507","url":null,"abstract":"Formal frameworks exist that allow service providers and users to negotiate the quality of a service. While these agreements usually include non-functional service properties, the quality of the information offered by a provider is neglected. Yet, in important application scenarios, notably in those based on the Service-Oriented computing paradigm, the outcome of complex workflows is directly affected by the quality of the data involved. In this paper, we propose a model for formal data quality agreements between data providers and data consumers, and analyze its feasibility by showing how a provider may take data quality constraints into account as part of its data provisioning process. Our analysis of the technical issues involved suggests that this is a complex problem in general, although satisfactory algorithmic and architectural solutions can be found under certain assumptions. To support this claim, we describe an algorithm for dealing with constraints on the completeness of a query result with respect to a reference data source, and outline an initial provider architecture for managing more general data quality constraints.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122870674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

{"title":"An event based framework for improving information quality that integrates baseline models, causal models and formal reference models","authors":"Joseph Bugajski, R. Grossman, E. Sumner, Zhao Tang","doi":"10.1145/1077501.1077510","DOIUrl":"https://doi.org/10.1145/1077501.1077510","url":null,"abstract":"We introduce a framework for improving information quality in complex distributed systems that integrates: 1) Analytic models that describe baseline values for attributes and combinations of attributes and components that detect statistically significant changes from baselines. These models determine whether a significant change has occurred, and if so, when. 2) Casual models that help determine why a statistically significant change has occurred and what its impact is. These models focus on the reasons for a change. 3) Formal business and technical reference models so that data and information quality problems are less likely to occur in the future. In this note, we focus on the first two types of models and describe how this framework applies to data quality problems associated with electronic payments transactions and highway traffic patterns.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124361096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective and scalable solutions for mixed and split citation problems in digital libraries","authors":"Dongwon Lee, Byung-Won On, Jaewoo Kang, Sanghyun Park","doi":"10.1145/1077501.1077514","DOIUrl":"https://doi.org/10.1145/1077501.1077514","url":null,"abstract":"In this paper, we consider two important problems that commonly occur in bibliographic digital libraries, which seriously degrade their data qualities: Mixed Citation (MC) problem (i.e., citations of different scholars with their names being homonyms are mixed together) and Split Citation (SC) problem (i.e., citations of the same author appear under different name variants). In particular, we investigate an effective yet scalable solution since citations in such digital libraries tend to be large-scale. After formally defining the problems and accompanying challenges, we present an effective solution that is based on the state-of-the-art sampling-based approximate join algorithm. Our claim is verified through preliminary experimental results.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131839646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A generalized cost optimal decision model for record matching","authors":"Vassilios S. Verykios, G. Moustakides","doi":"10.1145/1012453.1012457","DOIUrl":"https://doi.org/10.1145/1012453.1012457","url":null,"abstract":"Record (or entity) matching or linkage is the process of identifying records in one or more data sources, that refer to the same real world entity or object. In record linkage, the ultimate goal of a decision model is to provide the decision maker with a tool for making decisions upon the actual matching status of a pair of records (i.e., documents, events, persons, cases, etc.). Existing models of record linkage rely on decision rules that minimize the probability of subjecting a case to clerical review, conditional on the probabilities of erroneous matches and erroneous non-matches. In practice though, (a) the value of an erroneous match is, in many applications, quite different from the value of an erroneous non-match, and (b) the cost and the probability of a misclassification, which is associated with the clerical review, is ignored in this way. In this paper, we present a decision model which is optimal, based on the cost of the record linkage operation, and general enough to accommodate multi-class or multi-decision case studies. We also present an example along with the results from applying the proposed model to large comparison spaces.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"240 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122539892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for analysis of data freshness","authors":"M. Bouzeghoub, Verónika Peralta","doi":"10.1145/1012453.1012464","DOIUrl":"https://doi.org/10.1145/1012453.1012464","url":null,"abstract":"Data freshness has been identified as one of the most important data quality attributes in information systems. This importance increases particularly in the context of distributed systems, composed of a large set of autonomous data sources, where integrating data having different freshness may lead to semantic problems. There are various definitions of data freshness in the literature, depending on the applications where they are used, as well as different metrics to measure them. This paper presents an analysis of these definitions and metrics and proposes a taxonomy based upon the nature of the data, the type of application and the synchronization policies underlying the multi-source information system. We analyze, in terms of the taxonomy, the way freshness is defined and used in several types of systems and we present some open research problems in the field of data freshness evaluation.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132263186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting duplicate objects in XML documents","authors":"Melanie Herschel, Felix Naumann","doi":"10.1145/1012453.1012456","DOIUrl":"https://doi.org/10.1145/1012453.1012456","url":null,"abstract":"The problem of detecting duplicate entities that describe the same real-world object (and purging them) is an important data cleansing task, necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms to detect duplicates in nested XML documents are required.In this paper, we present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution adopts a top-down traversal of the XML tree structure to identify duplicate elements on each level. Pairs of duplicate elements are detected using a thresholded similarity function, and are then clustered by computing the transitive closure. To minimize the number of pairwise element comparisons, an appropriate filter function is used. The similarity measure involves string similarity for pairs of strings, which is measured using their edit distance. To increase efficiency, we avoid the computation of edit distance for pairs of strings using three filtering methods subsequently. First experiments show that our approach detects XML duplicates accurately and efficiently.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115063588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utility-based resolution of data inconsistencies","authors":"Amihai Motro, P. Anokhin, A. Acar","doi":"10.1145/1012453.1012460","DOIUrl":"https://doi.org/10.1145/1012453.1012460","url":null,"abstract":"A virtual database system is software that provides unified access to multiple information sources. If the sources are overlapping in their contents and independently maintained, then the likelihood of inconsistent answers is high. Solutions are often based on ranking (which sorts the different answers according to recurrence) and on fusion (which synthesizes a new value from the different alternatives according to a specific formula). In this paper we argue that both methods are flawed, and we offer alternative solutions that are based on knowledge about the performance of the source data; including features such as recentness, availability, accuracy and cost. These features are combined in a flexible utility function that expresses the overall value of a data item to the user. Utility allows us to (1) define meaningful ranking on the inconsistent set of answers, and offer the topranked answer as a preferred answer; (2) determine whether a fusion value is indeed better than the initial values, by calculating its utility and comparing it to the utilities of the initial values; and (3) discover the best fusion: the fusion formula that optimizes the utility. The advantages of such performance-based and utility-driven ranking and fusion are considerable.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"38 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131809355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Execution of data mappers","authors":"Paulo Carreira, H. Galhardas","doi":"10.1145/1012453.1012455","DOIUrl":"https://doi.org/10.1145/1012453.1012455","url":null,"abstract":"Data mappers are essential operators for implementing data transformations supporting schema mapping and integration scenarios such as legacy data migration, ETL processes for data warehousing, data cleaning activities, and business integration initiatives. Despite their widespread use, no formalization of this important operation has been proposed so far. In this paper we propose the data mapper operator as an extension to the relational algebra. We supply a set of algebraic rewriting rules for optimizing queries that combine standard relational operators with data mappers. Finally, we propose algorithms for their efficient physical execution.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117075214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining for patterns in contradictory data","authors":"Heiko Müller, U. Leser, J. Freytag","doi":"10.1145/1012453.1012463","DOIUrl":"https://doi.org/10.1145/1012453.1012463","url":null,"abstract":"Information integration is often faced with the problem that different data sources represent the same set of the real-world objects, but give conflicting values for specific properties of these objects. Within this paper we present a model of such conflicts and describe an algorithm for efficiently detecting patterns of conflicts in a pair of overlapping data sources. The contradiction patterns we can find are a special kind of association rules, describing regularities in conflicts occurring together with certain attribute values, paris of attribute values, or with other conflicts. Therefore, we adapt existing association rule mining algorithms for mining contradiction patterns. Such patterns are an important tool for human experts that try to find and resolve problems in data quality using domain knowledge. We present the results of applying our method on a real world data set from the life science domain and show how it helps to generate clean data for integrated data warehouses.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127777671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tackling inconsistencies in data integration through source preferences","authors":"Giuseppe De Giacomo, D. Lembo, M. Lenzerini, R. Rosati","doi":"10.1145/1012453.1012459","DOIUrl":"https://doi.org/10.1145/1012453.1012459","url":null,"abstract":"Dealing with inconsistencies is one the main challenges in data integration systems, where data stored in the local sources may violate integrity constraints specified at the global level. Recently, declarative approaches have been proposed to deal with such a problem. Existing declarative proposals do not take into account preference assertions specified between sources when trying to solve inconsistency. On the other hand, the designer of an integration system may often include in the specification preference rules indicating the quality of data sources. In this paper, we consider Local-As-View integration systems, and propose a method that allows one to assign formal semantics to a data integration system whose declarative specification includes information on source preferences. To the best of our knowledge, our approach is the first one to consider in a declarative way information on source quality for dealing with inconsistent data in Local-As-View integration systems.","PeriodicalId":306187,"journal":{"name":"Information Quality in Information Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125920253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}