{"title":"Challenges of Open Data Quality","authors":"D. Corsar, P. Edwards","doi":"10.1145/3110291","DOIUrl":"https://doi.org/10.1145/3110291","url":null,"abstract":"The research described here was supported by the award made by the RCUK Digital Economy programme to the dot.rural Digital Economy Hub, award reference: EP/G066051/1; and by the Innovate UK award reference: 102615.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"22 1","pages":"1 - 4"},"PeriodicalIF":0.0,"publicationDate":"2017-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90520458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenge Paper","authors":"P. Arbuckle, E. Kahn, Adam Kriesberg","doi":"10.1145/3106236","DOIUrl":"https://doi.org/10.1145/3106236","url":null,"abstract":"Life Cycle Assessment is a modeling approach to assess the environmental aspects and potential environmental impacts (e.g., use of resources and the environmental consequences of releases) throughout a product’s life cycle from raw material acquisition through production, use, end-oflife treatment, recycling and final disposal (i.e., cradle-to-grave) (ISO 14040). It has been employed in recent years by industry and governments to address growing interest about the true costs of resource use, environmental impact, and other externalities of economic activity. Inherently multidisciplinary, LCA draws and synthesizes information from the social and physical sciences. This breadth within LCA models (often referred to as “data” by the community of practitioners) can make collecting and synthesizing information the most expensive component of an analysis and drives the need for model reuse. However, the LCA community is faced with a major challenge in its capacity to produce sufficient documentation and metadata to determine representation of these models and to reuse them correctly, an issue broadly affecting researchers across disciplines. Tenopir et al. (2011, 2015) found in each of two surveys of scientific data management and sharing practices that researchers do not feel equipped to generate metadata to facilitate reuse of their data. Furthermore, some researchers reported limited knowledge of available standards to describe data. The challenge in capacity in the LCA community is driven by two factors: the nascent state of standardization in LCA modeling and the strong focus on research and results for funded LCA work. Standardization serves to create a foundational set of rules and guidelines to support","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"15 1","pages":"1 - 4"},"PeriodicalIF":0.0,"publicationDate":"2017-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78750286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Exploratory Case Study to Understand Primary Care Users and Their Data Quality Tradeoffs","authors":"J. St-Maurice, C. Burns","doi":"10.1145/3058750","DOIUrl":"https://doi.org/10.1145/3058750","url":null,"abstract":"Primary care data is an important part of the evolving healthcare ecosystem. Generally, users in primary care are expected to provide excellent patient care and record high-quality data. In practice, users must balance sets of priorities regarding care and data. The goal of this study was to understand data quality tradeoffs between timeliness, validity, completeness, and use among primary care users. As a case study, data quality measures and metrics are developed through a focus group session with managers. After calculating and extracting measurements of data quality from six years of historic data, each measure was modeled with logit binomial regression to show correlations, characterize tradeoffs, and investigate data quality interactions. Measures and correlations for completeness, use, and timeliness were calculated for 196,967 patient encounters. Based on the analysis, there was a positive relationship between validity and completeness, and a negative relationship between timeliness and use. Use of data and reductions in entry delay were positively associated with completeness and validity. Our results suggest that if users are not provided with sufficient time to record data as part of their regular workflow, they will prioritize spending available time with patients. As a measurement of a primary care system's effectiveness, the negative correlation between use and timeliness points to a self-reinforcing relationship that provides users with little external value. In the future, additional data can be generated from comparable organizations to test several new hypotheses about primary care users.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"34 1","pages":"1 - 24"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79830650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Challenge of Quality in Social Computation","authors":"Milan Markovic, P. Edwards","doi":"10.1145/3041762","DOIUrl":"https://doi.org/10.1145/3041762","url":null,"abstract":"Interactive web technologies now enable a host of so-called social computations, which can address challenges that are beyond the capabilities of machines alone. Notable examples of such social computation systems include Galaxy Zoo,1 BeeWatch,2 and Ushahidi,3 operating in fields as diverse as classification of newly discovered galaxies, monitoring of bee populations, and disaster management. A system for earthquake prediction using social media [Sakaki et al. 2010] illustrates how such computations can also emerge on social networking platforms. Social computations can be modeled as a complex collection of structured activities (i.e. workflows) that represent a blend of human and machine tasks, with associated objectives and reward mechanisms. In our previous work [Markovic et al. 2013; Markovic 2016] we argued that recording provenance of social computation workflows would enhance decision-making support for all associated stakeholders; these include initiators, participants, and beneficiaries of such computations. In the next section, we will briefly introduce the key characteristics of complex social computation systems before discussing why quality assessments in such a context are challenging.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"20 1","pages":"1 - 3"},"PeriodicalIF":0.0,"publicationDate":"2017-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87712900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Data Repurposing Challenge","authors":"Philip Woodall","doi":"10.1145/3022698","DOIUrl":"https://doi.org/10.1145/3022698","url":null,"abstract":"When data is collected for the first time, the data collector has in mind the data quality requirements that must be satisfied before it can be used successfully—that is, the data collector ensures “fitness for use”—the commonly agreed upon definition of data quality [Wang and Strong 1996]. However, data that is repurposed [Woodall and Wainman 2015], as opposed to reused, must be managed with multiple different fitness for use requirements in mind, which complicates any data quality enhancements [Ballou and Pazer 1985]. While other work has considered context in relation to data quality requirements, including the need to meet multiple fitness for use requirements [Watts et al. 2009; Bertossi et al. 2011], in the current fast-paced environment of data repurposing for analytics and business intelligence, there are new challenges for dealing with multiple fitness for use requirements in the context of:","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"97 1","pages":"1 - 4"},"PeriodicalIF":0.0,"publicationDate":"2017-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81661955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experience","authors":"Leena Al-Hussaini","doi":"10.1145/3092700","DOIUrl":"https://doi.org/10.1145/3092700","url":null,"abstract":"Hunspell is a morphological spell checker and automatic corrector for Macintosh 10.6 and later versions. Aspell is a general spell checker and automatic corrector for the GNU operating system. In this experience article, we present a benchmarking study of the performance of Hunspell and Aspell. Ginger is a general grammatical spell checker that is used as a baseline to compare the performance of Hunspell and Aspell. A benchmark dataset was carefully selected to be a mixture of different error types at different word length levels. Further, the benchmarking data are from very bad spellers and will challenge any spell checker. The extensive study described in this work will characterize the respective softwares and benchmarking data from multiple perspectives and will consider many error statistics. Overall, Hunspell can correct 415/469 words and Aspell can correct 414/469 words. The baseline Ginger can correct 279/469 words. We recommend this dataset as the preferred benchmark dataset for evaluating newly developed “isolated word” spell checkers.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"106 1","pages":"1 - 10"},"PeriodicalIF":0.0,"publicationDate":"2017-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81227339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dependable Data Repairing with Fixing Rules","authors":"Jiannan Wang, N. Tang","doi":"10.1145/3041761","DOIUrl":"https://doi.org/10.1145/3041761","url":null,"abstract":"One of the main challenges that data-cleaning systems face is to automatically identify and repair data errors in a dependable manner. Though data dependencies (also known as integrity constraints) have been widely studied to capture errors in data, automated and dependable data repairing on these errors has remained a notoriously difficult problem. In this work, we introduce an automated approach for dependably repairing data errors, based on a novel class of fixing rules. A fixing rule contains an evidence pattern, a set of negative patterns, and a fact value. The heart of fixing rules is deterministic: given a tuple, the evidence pattern and the negative patterns of a fixing rule are combined to precisely capture which attribute is wrong, and the fact indicates how to correct this error. We study several fundamental problems associated with fixing rules and establish their complexity. We develop efficient algorithms to check whether a set of fixing rules are consistent and discuss approaches to resolve inconsistent fixing rules. We also devise efficient algorithms for repairing data errors using fixing rules. Moreover, we discuss approaches on how to generate a large number of fixing rules from examples or available knowledge bases. We experimentally demonstrate that our techniques outperform other automated algorithms in terms of the accuracy of repairing data errors, using both real-life and synthetic data.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"27 1","pages":"1 - 34"},"PeriodicalIF":0.0,"publicationDate":"2017-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74181624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QDflows","authors":"Sabrina Abdellaoui, Fahima Nader, R. Chalal","doi":"10.1145/3064173","DOIUrl":"https://doi.org/10.1145/3064173","url":null,"abstract":"In the big data era, data integration is becoming increasingly important. It is usually handled by data flows processes that extract, transform, and clean data from several sources, and populate the data integration system (DIS). Designing data flows is facing several challenges. In this article, we deal with data quality issues such as (1) specifying a set of quality rules, (2) enforcing them on the data flow pipeline to detect violations, and (3) producing accurate repairs for the detected violations. We propose QDflows, a system for designing quality-aware data flows that considers the following as input: (1) a high-quality knowledge base (KB) as the global schema of integration, (2) a set of data sources and a set of validated users’ requirements, (3) a set of defined mappings between data sources and the KB, and (4) a set of quality rules specified by users. QDflows uses an ontology to design the DIS schema. It offers the ability to define the DIS ontology as a module of the knowledge base, based on validated users’ requirements. The DIS ontology model is then extended with multiple types of quality rules specified by users. QDflows extracts and transforms data from sources to populate the DIS. It detects violations of quality rules enforced on the data flows, constructs repair patterns, searches for horizontal and vertical matches in the knowledge base, and performs an automatic repair when possible or generates possible repairs. It interactively involves users to validate the repair process before loading the clean data into the DIS. Using real-life and synthetic datasets, the DBpedia and Yago knowledge bases, we experimentally evaluate the generality, effectiveness, and efficiency of QDflows. We also showcase an interactive tool implementing our system.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"10 1","pages":"1 - 39"},"PeriodicalIF":0.0,"publicationDate":"2017-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80759605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ontological Multidimensional Data Models and Contextual Data Quality","authors":"L. Bertossi, Mostafa Milani","doi":"10.1145/3148239","DOIUrl":"https://doi.org/10.1145/3148239","url":null,"abstract":"Data quality assessment and data cleaning are context-dependent activities. Motivated by this observation, we propose the Ontological Multidimensional Data Model (OMD model), which can be used to model and represent contexts as logic-based ontologies. The data under assessment are mapped into the context for additional analysis, processing, and quality data extraction. The resulting contexts allow for the representation of dimensions, and multidimensional data quality assessment becomes possible. At the core of a multidimensional context, we include a generalized multidimensional data model and a Datalog± ontology with provably good properties in terms of query answering. These main components are used to represent dimension hierarchies, dimensional constraints, and dimensional rules and define predicates for quality data specification. Query answering relies on and triggers navigation through dimension hierarchies and becomes the basic tool for the extraction of quality data. The OMD model is interesting per se beyond applications to data quality. It allows for a logic-based and computationally tractable representation of multidimensional data, extending previous multidimensional data models with additional expressive power and functionalities.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"32 1","pages":"1 - 36"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81010519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Probabilistically Integrated System for Crowd-Assisted Text Labeling and Extraction","authors":"S. Goldberg, D. Wang, Christan Earl Grant","doi":"10.1145/3012003","DOIUrl":"https://doi.org/10.1145/3012003","url":null,"abstract":"The amount of text data has been growing exponentially in recent years, giving rise to automatic information extraction methods that store text annotations in a database. The current state-of-the-art structured prediction methods, however, are likely to contain errors and it is important to be able to manage the overall uncertainty of the database. On the other hand, the advent of crowdsourcing has enabled humans to aid machine algorithms at scale. In this article, we introduce pi-CASTLE, a system that optimizes and integrates human and machine computing as applied to a complex structured prediction problem involving Conditional Random Fields (CRFs). We propose strategies grounded in information theory to select a token subset, formulate questions for the crowd to label, and integrate these labelings back into the database using a method of constrained inference. On both a text segmentation task over academic citations and a named entity recognition task over tweets we show an order of magnitude improvement in accuracy gain over baseline methods.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"39 1","pages":"1 - 23"},"PeriodicalIF":0.0,"publicationDate":"2017-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79859169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}